### Genetic Algorithms (GA)

• Genetic Algorithm: based on an analogy to biological evolution
• An initial population is created consisting of randomly generated rules
• Each rule is represented by a string of bits
• E.g., the rule if A1 and ¬A2 then C2 can be encoded as the bit string 100, where the two leftmost bits represent attributes A1 and A2, respectively, and the rightmost bit represents the class
• If an attribute has k > 2 values, k bits can be used
• Based on the notion of survival of the fittest, a new population is formed to consist of the fittest rules and their offspring
• The fitness of a rule is represented by its classification accuracy on a set of training examples
• Offspring are generated by crossover and mutation
• The process continues until a population P evolves in which every rule satisfies a prespecified fitness threshold
• Slow but easily parallelizable
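
A minimal sketch, in Python, of how such a GA over bit-string rules might look: fitness is classification accuracy on training examples, and offspring come from single-point crossover and bit-flip mutation. The encoding details, parameter values, and helper names here are illustrative, not from the text.

```python
import random

def fitness(rule, examples):
    """Fraction of training examples the rule covers and classifies correctly."""
    correct = 0
    for attrs, label in examples:
        covers = all(r == a for r, a in zip(rule[:-1], attrs))
        correct += covers and rule[-1] == label
    return correct / len(examples)

def crossover(p1, p2):
    """Single-point crossover: splice two parent bit strings at a random cut."""
    cut = random.randrange(1, len(p1))
    return p1[:cut] + p2[cut:]

def mutate(rule, rate=0.05):
    """Flip each bit independently with a small probability."""
    return [b ^ 1 if random.random() < rate else b for b in rule]

def evolve(examples, n_bits, pop_size=20, generations=50):
    # initial population of randomly generated rules (bit strings)
    pop = [[random.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda r: fitness(r, examples), reverse=True)
        survivors = pop[: pop_size // 2]                      # fittest rules survive
        offspring = [mutate(crossover(random.choice(survivors), random.choice(survivors)))
                     for _ in range(pop_size - len(survivors))]
        pop = survivors + offspring
    return max(pop, key=lambda r: fitness(r, examples))

# e.g. examples = [([1, 0], 0), ([1, 1], 1)], n_bits = 3 (two attribute bits + one class bit)
```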

### Rough Set Approach

• Rough sets are used to approximately or “roughly” define equivalence classes
• A rough set for a given class C is approximated by two sets: a lower approximation (certain to be in C) and an upper approximation (cannot be described as not belonging to C)
• Finding the minimal subsets (reducts) of attributes for feature reduction is NP-hard, but a discernibility matrix (which stores the differences between attribute values for each pair of data tuples) can be used to reduce the computational effort
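
A small sketch of how the lower and upper approximations could be computed once tuples are grouped into equivalence classes by their (discretized) condition attributes; the toy data and function name are illustrative.

```python
from collections import defaultdict

def approximations(tuples, target_class):
    """tuples: list of (attribute_tuple, class_label) pairs."""
    blocks = defaultdict(list)                     # equivalence classes by attribute values
    for attrs, label in tuples:
        blocks[attrs].append(label)
    lower, upper = set(), set()
    for attrs, labels in blocks.items():
        if all(l == target_class for l in labels):
            lower.add(attrs)                       # certainly in C
        if any(l == target_class for l in labels):
            upper.add(attrs)                       # cannot be ruled out of C
    return lower, upper

data = [(("young", "high"), "yes"), (("young", "high"), "no"),
        (("old", "low"), "yes"), (("old", "high"), "no")]
low, up = approximations(data, "yes")
# low == {("old", "low")}; up == {("young", "high"), ("old", "low")}
```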

### Fuzzy Set Approaches

• Fuzzy logic uses truth values between 0.0 and 1.0 to represent the degree of membership (such as in a fuzzy membership graph)
• Attribute values are converted to fuzzy values. Ex.:
• Income, x, is assigned a fuzzy membership value to each of the discrete categories {low, medium, high}, e.g. $49K belongs to “medium income” with fuzzy value 0.15 but belongs to “high income” with fuzzy value 0.96
• Fuzzy membership values do not have to sum to 1.
• Each applicable rule contributes a vote for membership in the categories
• Typically, the truth values for each predicted category are summed, and these sums are combined
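
A short sketch of fuzzy membership and rule voting. The trapezoidal membership breakpoints are placeholders chosen for illustration and do not reproduce the exact $49K values quoted above.

```python
def trapezoid(x, a, b, c, d):
    """Fuzzy membership: 0 outside [a, d], 1 on [b, c], linear in between."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    return (x - a) / (b - a) if x < b else (d - x) / (d - c)

income = 49  # in $K
membership = {
    "low":    trapezoid(income, 0, 0, 20, 40),
    "medium": trapezoid(income, 20, 40, 60, 80),
    "high":   trapezoid(income, 60, 80, 200, 200),
}
# Each applicable rule votes for a category with strength equal to the membership
# value of its antecedent; votes for the same category are summed and combined.
rule_votes = {"creditworthy": membership["medium"] + membership["high"],
              "risky": membership["low"]}
prediction = max(rule_votes, key=rule_votes.get)
```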

### Multiclass Classification

• Classification involving more than two classes
• Method 1. One-vs.-all (OVA): Learn a classifier one at a time
• Given m classes, train m classifiers: one for each class
• Classifier j: treat tuples in class j as positive & all others as negative
• To classify a tuple X, the set of classifiers vote as an ensemble
• Method 2. All-vs.-all (AVA): Learn a classifier for each pair of classes
• Given m classes, construct m(m-1)/2 binary classifiers
• A classifier is trained using tuples of the two classes
• To classify a tuple X, each classifier votes. X is assigned to the class with maximal vote
• Comparison
• All-vs.-all tends to be superior to one-vs.-all
• Problem: each binary classifier is sensitive to errors, and misclassified binary votes can change the final vote count
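
Both strategies are available as ready-made wrappers in scikit-learn, which the following sketch assumes is installed; any binary base classifier can be plugged in.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier

X, y = load_iris(return_X_y=True)          # 3 classes

# One-vs.-all: trains m binary classifiers, one per class
ova = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

# All-vs.-all (one-vs.-one): trains m(m-1)/2 pairwise classifiers
ava = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

print(ova.predict(X[:1]), ava.predict(X[:1]))
```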

### Error-Correcting Codes for Multiclass Classification

• Originally designed to correct errors during data transmission in communication tasks by exploiting data redundancy
• Example
• A 7-bit codeword is assigned to each of the classes 1-4
• Given an unknown tuple X, the 7 trained classifiers output the codeword 0001010
• Hamming distance: # of different bits between two codewords
• H(X, C1) = 5, obtained by counting the bits that differ between X’s output codeword and the codeword assigned to C1
• H(X, C2) = 3, H(X, C3) = 3, H(X, C4) = 1; thus C4 is chosen as the label for X
• Error-correcting codes can correct up to (h − 1)/2 one-bit errors, where h is the minimum Hamming distance between any two codewords
• If we use only 1 bit per class, the scheme is equivalent to the one-vs.-all approach, and such codes are insufficient to self-correct
• When selecting error-correcting codes, there should be good row-wise and col.-wise separation between the codewords
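
A short decoding sketch. The 4-class, 7-bit codebook below is reconstructed for illustration so that the Hamming distances match those quoted in the example (5, 3, 3, 1).

```python
codebook = {
    "C1": "1111111",
    "C2": "0000111",
    "C3": "0011001",
    "C4": "0101010",
}

def hamming(a, b):
    """Number of bit positions in which the two codewords differ."""
    return sum(x != y for x, y in zip(a, b))

output = "0001010"                               # concatenated outputs of the 7 classifiers
distances = {c: hamming(output, w) for c, w in codebook.items()}
prediction = min(distances, key=distances.get)   # C4: smallest Hamming distance
```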

### Semi-Supervised Classification

• Semi-supervised: Uses labeled and unlabeled data to build a classifier
• Self-training:
• Build a classifier using the labeled data
• Use it to label the unlabeled data, and those with the most confident label prediction are added to the set of labeled data
• Repeat the above process
• Co-training: Use two or more classifiers to teach each other
• Each learner uses a mutually independent set of features of each tuple to train a good classifier, yielding two classifiers, say f1 and f2
• Then f1 and f2 are used to predict the class label for unlabeled data X
• Teach each other: The tuple having the most confident prediction from f1 is added to the set of labeled data for f2, & vice versa
• Other methods, e.g., joint probability distribution of features and labels
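
A minimal self-training loop, assuming a probabilistic base classifier (scikit-learn’s LogisticRegression here); the confidence threshold and round limit are illustrative parameters.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(X_l, y_l, X_u, threshold=0.95, max_rounds=10):
    X_l, y_l, X_u = X_l.copy(), y_l.copy(), X_u.copy()
    for _ in range(max_rounds):
        clf = LogisticRegression(max_iter=1000).fit(X_l, y_l)   # build classifier on labeled data
        if len(X_u) == 0:
            break
        pred = clf.predict(X_u)
        conf = clf.predict_proba(X_u).max(axis=1)    # confidence of each predicted label
        keep = conf >= threshold                     # most confident predictions only
        if not keep.any():
            break
        X_l = np.vstack([X_l, X_u[keep]])            # promote them to the labeled set
        y_l = np.concatenate([y_l, pred[keep]])
        X_u = X_u[~keep]
    return clf
```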

### Active Learning

• Class labels are expensive to obtain
• Active learner: query human (oracle) for labels
• Pool-based approach: Uses a pool of unlabeled data
• L: a small subset of D is labeled, U: a pool of unlabeled data in D
• Use a query function to carefully select one or more tuples from U and request labels from an oracle (a human annotator)
• The newly labeled samples are added to L, which is then used to (re)learn the model
• Goal: Achieve high accuracy using as few labeled data as possible
• Evaluated using learning curves: Accuracy as a function of the number of instances queried (# of tuples to be queried should be small)
• Research issue: How to choose the data tuples to be queried?
• Uncertainty sampling: choose the least certain ones
• Reduce version space, the subset of hypotheses consistent w. the training data
• Reduce expected entropy over U: Find the greatest reduction in the total number of incorrect predictions
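
A pool-based sketch with uncertainty sampling: repeatedly query the pool tuple the current model is least certain about. The oracle is a stand-in function for the human annotator, and the number of queries is an illustrative parameter.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def active_learn(X_L, y_L, X_U, oracle, n_queries=20):
    for _ in range(n_queries):
        clf = LogisticRegression(max_iter=1000).fit(X_L, y_L)
        if len(X_U) == 0:
            break
        proba = clf.predict_proba(X_U)
        idx = int(np.argmin(proba.max(axis=1)))      # least certain pool tuple
        X_L = np.vstack([X_L, X_U[idx:idx + 1]])     # move it into the labeled set L
        y_L = np.append(y_L, oracle(X_U[idx]))       # ask the oracle for its label
        X_U = np.delete(X_U, idx, axis=0)
    return clf
```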

### Transfer Learning: Conceptual Framework

• Transfer learning: Extract knowledge from one or more source tasks and apply that knowledge to a target task
• Rather than building a new classifier from scratch for each task, transfer learning builds the new classifier by reusing knowledge learned from the source tasks

### Transfer Learning: Methods and Applications

• Applications: Especially useful when data is outdated or distribution changes, e.g., Web document classification, e-mail spam filtering
• Instance-based transfer learning: Reweight some of the data from source tasks and use it to learn the target task
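
A rough sketch of the instance-based idea: source tuples that disagree with a model fit on the small target set are down-weighted before training on the combined data. The fixed down-weighting factor is a simplification; iterative methods such as TrAdaBoost adapt the weights over multiple rounds.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def instance_transfer(X_src, y_src, X_tgt, y_tgt):
    # probe model trained only on the (small) target task data
    probe = LogisticRegression(max_iter=1000).fit(X_tgt, y_tgt)
    agree = probe.predict(X_src) == y_src
    w_src = np.where(agree, 1.0, 0.2)             # down-weight "outdated" source tuples
    X = np.vstack([X_src, X_tgt])
    y = np.concatenate([y_src, y_tgt])
    w = np.concatenate([w_src, np.full(len(y_tgt), 1.0)])
    return LogisticRegression(max_iter=1000).fit(X, y, sample_weight=w)
```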