Genetic Algorithms (GA)
 Genetic Algorithm: based on an analogy to biological evolution
 An initial population is created consisting of randomly generated rules
 Each rule is represented by a string of bits
 E.g., the rule "IF A1 AND ¬A2 THEN C2" can be encoded as the bit string 100, where the two leftmost bits represent attributes A1 and A2 and the rightmost bit represents the class
 If an attribute has k > 2 values, k bits can be used
 Based on the notion of survival of the fittest, a new population is formed to consist of the fittest rules and their offspring
 The fitness of a rule is represented by its classification accuracy on a set of training examples
 Offspring are generated by crossover and mutation
 The process continues until a population P evolves in which each rule satisfies a prespecified fitness threshold
 Slow but easily parallelizable
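The loop above can be sketched in a few lines of Python. The rule encoding follows the 3-bit example (two attribute bits plus a class bit); the training tuples, population size, and mutation rate are illustrative assumptions, not part of the original slides.

```python
import random

# Toy training tuples (A1, A2, class); hypothetical values for illustration.
DATA = [(1, 0, 0), (1, 0, 0), (1, 0, 0), (0, 0, 1), (0, 1, 1), (1, 1, 1)]

def fitness(rule):
    """Classification accuracy of a 3-bit rule [a1, a2, cls] on DATA.
    The rule reads: IF A1 == a1 AND A2 == a2 THEN class = cls."""
    a1, a2, cls = rule
    covered = [y for x1, x2, y in DATA if x1 == a1 and x2 == a2]
    return covered.count(cls) / len(covered) if covered else 0.0

def crossover(p, q):
    """Single-point crossover of two parent bit strings."""
    point = random.randint(1, len(p) - 1)
    return p[:point] + q[point:]

def mutate(rule, rate=0.1):
    """Flip each bit independently with probability `rate`."""
    return [b ^ 1 if random.random() < rate else b for b in rule]

def evolve(pop_size=20, generations=30, threshold=1.0):
    # Initial population of randomly generated rules.
    pop = [[random.randint(0, 1) for _ in range(3)] for _ in range(pop_size)]
    for _ in range(generations):
        if all(fitness(r) >= threshold for r in pop):
            break                                  # prespecified threshold met
        pop.sort(key=fitness, reverse=True)
        survivors = pop[:pop_size // 2]            # survival of the fittest
        offspring = [mutate(crossover(*random.sample(survivors, 2)))
                     for _ in range(pop_size - len(survivors))]
        pop = survivors + offspring
    return max(pop, key=fitness)

best = evolve()
```

On this toy data the rule [1, 0, 0] ("IF A1 AND ¬A2 THEN class 0") has accuracy 1.0, so the population usually converges toward it.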
Rough Set Approach
 Rough sets are used to approximately or “roughly” define equivalence classes
 A rough set for a given class C is approximated by two sets: a lower approximation (certain to be in C) and an upper approximation (cannot be described as not belonging to C)
 Finding the minimal subsets (reducts) of attributes for feature reduction is NP-hard, but a discernibility matrix (which stores the differences between attribute values for each pair of data tuples) can be used to reduce the computational cost
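As a sketch of the two approximations (with made-up tuples and attribute values), the equivalence classes induced by identical attribute values can be computed directly:

```python
from collections import defaultdict

# Hypothetical tuples: (attribute-value tuple, class label).
tuples = [(("a", 1), "C"), (("a", 1), "C"), (("b", 2), "C"),
          (("b", 2), "D"), (("c", 3), "D")]

def approximations(data, target="C"):
    """Lower/upper approximation of `target` under the indiscernibility
    relation: tuples with identical attribute values are equivalent."""
    blocks = defaultdict(list)
    for attrs, label in data:
        blocks[attrs].append(label)
    lower = {a for a, ls in blocks.items() if all(l == target for l in ls)}
    upper = {a for a, ls in blocks.items() if any(l == target for l in ls)}
    return lower, upper   # lower: certainly in C; upper: possibly in C

lower, upper = approximations(tuples)
# ("a", 1) is certainly in C; ("b", 2) cannot be ruled out of C;
# ("c", 3) is certainly not in C.
```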
Fuzzy Set Approaches
 Fuzzy logic uses truth values between 0.0 and 1.0 to represent the degree of membership (such as in a fuzzy membership graph)
 Attribute values are converted to fuzzy values. Ex.:
 Income, x, is assigned a fuzzy membership value to each of the discrete categories {low, medium, high}, e.g. $49K belongs to “medium income” with fuzzy value 0.15 but belongs to “high income” with fuzzy value 0.96
 Fuzzy membership values do not have to sum to 1.
 Each applicable rule contributes a vote for membership in the categories
 Typically, the truth values for each predicted category are summed, and these sums are combined
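A minimal sketch of fuzzy membership for income, using trapezoidal membership functions; the category cutoffs below are made-up values, not the ones behind the $49K example above:

```python
def trapezoid(x, a, b, c, d):
    """Trapezoidal membership: rises on [a, b], flat on [b, c], falls on [c, d]."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    return (x - a) / (b - a) if x < b else (d - x) / (d - c)

# Hypothetical cutoffs (in $K) for the {low, medium, high} income categories.
membership = {
    "low":    lambda x: trapezoid(x, -1, 0, 20, 40),
    "medium": lambda x: trapezoid(x, 20, 35, 45, 55),
    "high":   lambda x: trapezoid(x, 40, 60, 200, 201),
}

# $49K is somewhat "medium" AND somewhat "high" at the same time;
# note the degrees need not sum to 1.
degrees = {cat: f(49) for cat, f in membership.items()}
```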
Multiclass Classification
 Classification involving more than two classes (i.e., > 2 classes)
 Method 1. One-vs.-all (OVA): Learn one classifier at a time
 Given m classes, train m classifiers: one for each class
 Classifier j: treat tuples in class j as positive & all others as negative
 To classify a tuple X, the set of classifiers vote as an ensemble
 Method 2. All-vs.-all (AVA): Learn a classifier for each pair of classes
 Given m classes, construct m(m − 1)/2 binary classifiers
 A classifier is trained using tuples of the two classes
 To classify a tuple X, each classifier votes. X is assigned to the class with maximal vote
 Comparison
 All-vs.-all tends to be superior to one-vs.-all
 Problem: Binary classifier is sensitive to errors, and errors affect vote count
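The AVA voting scheme can be sketched as follows; the 1-D data and the midpoint-threshold base learner are stand-in assumptions for any binary classifier:

```python
from itertools import combinations

# Toy 1-D training data per class (hypothetical values).
train = {"c1": [1.0, 1.2], "c2": [5.0, 5.2], "c3": [9.0, 9.1]}

def train_pairwise(a, b):
    """Binary classifier for the class pair (a, b): a midpoint threshold
    between the two class means (a stand-in for any binary learner)."""
    mean = lambda xs: sum(xs) / len(xs)
    thr = (mean(train[a]) + mean(train[b])) / 2
    lo, hi = (a, b) if mean(train[a]) < mean(train[b]) else (b, a)
    return lambda x: lo if x < thr else hi

# All-vs.-all: m(m-1)/2 classifiers, one per pair of classes.
pairwise = [train_pairwise(a, b) for a, b in combinations(train, 2)]

def classify(x):
    """Each pairwise classifier votes; X gets the class with maximal vote."""
    votes = {}
    for clf in pairwise:
        votes[clf(x)] = votes.get(clf(x), 0) + 1
    return max(votes, key=votes.get)
```

With m = 3 classes this builds 3 classifiers, and e.g. `classify(5.1)` collects two votes for `"c2"` against one for `"c3"`.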
Error-Correcting Codes for Multiclass Classification
 Originally designed to correct errors during data transmission in communication tasks by exploiting data redundancy
 Example
 A 7-bit codeword is associated with each of classes 1–4
 Given an unknown tuple X, the 7 trained classifiers output: 0001010
 Hamming distance: # of different bits between two codewords
 H(X, C1) = 5, by counting the # of differing bits between [1111111] and [0001010]
 H(X, C2) = 3, H(X, C3) = 3, H(X, C4) = 1; thus C4 is chosen as the label for X
 Error-correcting codes can correct up to ⌊(h − 1)/2⌋ 1-bit errors, where h is the minimum Hamming distance between any two codewords
 If we use 1 bit per class, it is equivalent to the one-vs.-all approach; such codes are insufficient to self-correct
 When selecting error-correcting codes, there should be good row-wise and column-wise separation between the codewords
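Nearest-codeword decoding for the example above can be sketched directly; the codeword table is an assumed assignment chosen to be consistent with the distances H(X, C1..C4) = 5, 3, 3, 1:

```python
# One 7-bit codeword per class (assumed table, consistent with the example).
codewords = {
    "C1": "1111111",
    "C2": "0000111",
    "C3": "0011001",
    "C4": "0101010",
}

def hamming(u, v):
    """Number of differing bits between two codewords."""
    return sum(a != b for a, b in zip(u, v))

def decode(output):
    """Assign the class whose codeword is nearest in Hamming distance."""
    return min(codewords, key=lambda c: hamming(codewords[c], output))

# The 7 trained classifiers output 0001010 for the unknown tuple X:
dists = {c: hamming(w, "0001010") for c, w in codewords.items()}
label = decode("0001010")   # C4, the nearest codeword (distance 1)
```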
Semi-Supervised Classification
 Semi-supervised: Uses labeled and unlabeled data to build a classifier
 Self-training:
 Build a classifier using the labeled data
 Use it to label the unlabeled data, and those with the most confident label prediction are added to the set of labeled data
 Repeat the above process
 Adv: easy to understand; disadv: may reinforce errors
 Co-training: Use two or more classifiers to teach each other
 Each learner uses a mutually independent set of features of each tuple to train a good classifier, yielding, say, classifiers f1 and f2
 Then f1 and f2 are used to predict the class label for unlabeled data X
 Teach each other: The tuple having the most confident prediction from f1 is added to the set of labeled data for f2, & vice versa
 Other methods, e.g., joint probability distribution of features and labels
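The self-training loop above can be sketched with a toy nearest-class-mean base classifier on 1-D data; the data values and the distance-based confidence measure are illustrative assumptions:

```python
# Self-training sketch: L = labeled set, U = unlabeled pool (toy values).
labeled = [(1.0, "A"), (9.0, "B")]
unlabeled = [1.5, 2.0, 8.0, 8.5, 5.2]

def predict(x, data):
    """Nearest class mean; confidence = negated distance (higher = surer)."""
    groups = {}
    for v, y in data:
        groups.setdefault(y, []).append(v)
    means = {y: sum(vs) / len(vs) for y, vs in groups.items()}
    label = min(means, key=lambda y: abs(x - means[y]))
    return label, -abs(x - means[label])

while unlabeled:
    # 1. Build a classifier from the labeled data; 2. label the pool;
    # 3. move the single most confident prediction into the labeled set.
    scored = [(predict(x, labeled), x) for x in unlabeled]
    (label, conf), x = max(scored, key=lambda t: t[0][1])
    labeled.append((x, label))
    unlabeled.remove(x)
```

Points near an existing cluster are absorbed first, which is also why a bad early label can be reinforced on later iterations.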
Active Learning
 Class labels are expensive to obtain
 Active learner: query human (oracle) for labels
 Poolbased approach: Uses a pool of unlabeled data
 L: a small subset of D is labeled, U: a pool of unlabeled data in D
 Use a query function to carefully select one or more tuples from U and request labels from an oracle (a human annotator)
 The newly labeled samples are added to L, and the model is relearned
 Goal: Achieve high accuracy using as few labeled data as possible
 Evaluated using learning curves: Accuracy as a function of the number of instances queried (# of tuples to be queried should be small)
 Research issue: How to choose the data tuples to be queried?
 Uncertainty sampling: choose the least certain ones
 Reduce version space, the subset of hypotheses consistent w. the training data
 Reduce expected entropy over U: Find the greatest reduction in the total number of incorrect predictions
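Uncertainty sampling can be sketched in a few lines; the pool posteriors below are hypothetical outputs of the current model:

```python
def uncertainty(p):
    """Uncertainty of a binary posterior p = P(class = 1 | x):
    1 at p = 0.5 (least certain), 0 at p = 0 or 1 (fully certain)."""
    return 1 - abs(p - 0.5) * 2

# Hypothetical posteriors produced by the current model over the pool U.
pool = {"x1": 0.95, "x2": 0.52, "x3": 0.10, "x4": 0.70}

# Query the oracle for the least certain tuple.
query = max(pool, key=lambda x: uncertainty(pool[x]))
```

Here `"x2"` (posterior 0.52, closest to 0.5) is the tuple sent to the oracle.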
Transfer Learning: Conceptual Framework
 Transfer learning: Extract knowledge from one or more source tasks and apply the knowledge to a target task
 Traditional learning: Build a new classifier for each new task
 Transfer learning: Build new classifier by applying existing knowledge learned from source tasks
Transfer Learning: Methods and Applications
 Applications: Especially useful when data is outdated or distribution changes, e.g., Web document classification, email spam filtering
 Instancebased transfer learning: Reweight some of the data from source tasks and use it to learn the target task
 TrAdaBoost (Transfer AdaBoost)
 Assume source and target data are each described by the same set of attributes (features) and class labels, but with different distributions
 Require only labeling a small amount of target data
 Use source data in training: when a source tuple is misclassified, reduce its weight so that it has less effect on the subsequent classifier
 Research issues
 Negative transfer: When it performs worse than no transfer at all
 Heterogeneous transfer learning: Transfer knowledge from different feature spaces or multiple source domains
 Largescale transfer learning
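The TrAdaBoost-style down-weighting of misclassified source tuples can be sketched as a single update step; the weight values and the `errors` flags (which source tuples the current round's classifier got wrong) are illustrative assumptions:

```python
import math

def reweight_source(weights, errors, n_rounds):
    """Multiply each misclassified SOURCE tuple's weight by beta < 1 so it
    has less effect on subsequent classifiers (simplified TrAdaBoost-style
    update; target-tuple reweighting is omitted in this sketch)."""
    n = len(weights)
    beta = 1 / (1 + math.sqrt(2 * math.log(n) / n_rounds))
    return [w * (beta if wrong else 1.0) for w, wrong in zip(weights, errors)]

w = reweight_source([1.0, 1.0, 1.0], errors=[True, False, True], n_rounds=10)
# Misclassified source tuples (indices 0 and 2) now weigh less than tuple 1.
```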