Genetic Algorithms (GA)

  • Genetic Algorithm: based on an analogy to biological evolution
  • An initial population is created consisting of randomly generated rules
    • Each rule is represented by a string of bits
    • E.g., the rule “IF A1 AND ¬A2 THEN C2” can be encoded as the bit string “100”, where the first two bits represent attributes A1 and A2 and the last bit the class
    • If an attribute has k > 2 values, k bits can be used
  • Based on the notion of survival of the fittest, a new population is formed to consist of the fittest rules and their offspring
  • The fitness of a rule is represented by its classification accuracy on a set of training examples
  • Offspring are generated by crossover and mutation
  • The process continues until a population P evolves in which each rule satisfies a prespecified fitness threshold
  • Slow but easily parallelizable
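The loop above can be sketched in a few lines. This is a minimal illustration, not a production GA: the toy dataset, the 3-bit rule encoding (one bit each for A1, A2, and the class), and the selection/crossover parameters are all assumptions made for the example.

```python
import random

random.seed(42)

# Toy training set: each tuple is (A1, A2, class).
# A rule is a 3-bit tuple (a1, a2, c): IF A1=a1 AND A2=a2 THEN class=c
# (a simplified encoding; richer GA encodings also allow "don't care" bits).
DATA = [(1, 0, 1), (1, 1, 0), (0, 0, 0), (1, 0, 1), (0, 1, 0)]

def fitness(rule):
    """Classification accuracy of the rule on the tuples its antecedent covers."""
    a1, a2, c = rule
    covered = [t for t in DATA if t[0] == a1 and t[1] == a2]
    if not covered:
        return 0.0
    return sum(1 for t in covered if t[2] == c) / len(covered)

def crossover(p, q):
    """Single-point crossover of two bit-string rules."""
    point = random.randint(1, 2)
    return p[:point] + q[point:]

def mutate(rule, rate=0.1):
    """Flip each bit with a small probability."""
    return tuple(b ^ 1 if random.random() < rate else b for b in rule)

# Survival of the fittest: keep the best rules, fill the rest with offspring.
pop = [tuple(random.randint(0, 1) for _ in range(3)) for _ in range(8)]
for _ in range(20):
    pop.sort(key=fitness, reverse=True)
    survivors = pop[:4]
    offspring = [mutate(crossover(*random.sample(survivors, 2))) for _ in range(4)]
    pop = survivors + offspring

best = max(pop, key=fitness)
print(best, fitness(best))
```

Because each rule's fitness can be evaluated independently, the fitness loop is the part that parallelizes trivially.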

Rough Set Approach

  • Rough sets are used to approximately or “roughly” define equivalence classes
  • A rough set for a given class C is approximated by two sets: a lower approximation (certain to be in C) and an upper approximation (cannot be described as not belonging to C)
  • Finding the minimal subsets (reducts) of attributes for feature reduction is NP-hard, but a discernibility matrix (which stores the differences between attribute values for each pair of data tuples) can be used to reduce the computational cost
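The two approximations are easy to compute once tuples are grouped into equivalence classes (blocks of tuples that are indiscernible on the chosen attributes). The table below is a made-up four-tuple example:

```python
# Toy table: tuples described by attributes (A, B) with a class label.
# Tuples with identical attribute values form one equivalence class.
rows = [
    ({"A": 1, "B": 0}, "C"),
    ({"A": 1, "B": 0}, "not-C"),   # indiscernible from row 0, different label
    ({"A": 0, "B": 1}, "C"),
    ({"A": 0, "B": 0}, "not-C"),
]

# Group tuple indices by their attribute-value signature.
blocks = {}
for i, (attrs, _) in enumerate(rows):
    blocks.setdefault(tuple(sorted(attrs.items())), []).append(i)

target = {i for i, (_, label) in enumerate(rows) if label == "C"}

# Lower approximation: blocks entirely inside C (certain members of C).
lower = {i for b in blocks.values() if set(b) <= target for i in b}
# Upper approximation: blocks overlapping C (cannot be ruled out of C).
upper = {i for b in blocks.values() if set(b) & target for i in b}

print(lower)  # {2}
print(upper)  # {0, 1, 2}
```

Rows 0 and 1 are indiscernible but disagree on the label, so they fall in the upper approximation only; the gap between the two sets is the boundary region where membership in C is uncertain.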

Fuzzy Set Approaches

  • Fuzzy logic uses truth values between 0.0 and 1.0 to represent the degree of membership (such as in a fuzzy membership graph)
  • Attribute values are converted to fuzzy values. Ex.:
    • Income, x, is assigned a fuzzy membership value to each of the discrete categories {low, medium, high}, e.g. $49K belongs to “medium income” with fuzzy value 0.15 but belongs to “high income” with fuzzy value 0.96
    • Fuzzy membership values do not have to sum to 1.
  • Each applicable rule contributes a vote for membership in the categories
  • Typically, the truth values for each predicted category are summed, and these sums are combined; the fuzzy output is then translated (defuzzified) into a crisp value returned by the system
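A small sketch of fuzzification and vote combination. The trapezoidal membership functions and the breakpoints for low/medium/high income are invented for illustration (so they will not reproduce the 0.15/0.96 values from the membership graph above), and the rule categories “approve”/“reject” are hypothetical:

```python
def trapezoid(x, a, b, c, d):
    """Trapezoidal membership: rises a->b, flat b->c, falls c->d."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    if x < b:
        return (x - a) / (b - a)
    return (d - x) / (d - c)

# Hypothetical membership functions for income categories (in $K).
MEMBERSHIP = {
    "low":    lambda x: trapezoid(x, -1, 0, 20, 40),
    "medium": lambda x: trapezoid(x, 20, 40, 60, 80),
    "high":   lambda x: trapezoid(x, 60, 80, 200, 201),
}

def fuzzify(income):
    """Degree of membership in each category; the values need not sum to 1."""
    return {cat: f(income) for cat, f in MEMBERSHIP.items()}

def combine(votes):
    """Each applicable rule votes with its truth value; sum votes per category."""
    totals = {}
    for category, truth in votes:
        totals[category] = totals.get(category, 0.0) + truth
    return max(totals, key=totals.get)

m = fuzzify(49)
print(m)  # {'low': 0.0, 'medium': 1.0, 'high': 0.0}
print(combine([("approve", m["medium"]), ("approve", m["high"]),
               ("reject", m["low"])]))  # 'approve'
```

Taking the arg-max of the summed votes is one simple defuzzification choice; weighted-average (centroid) defuzzification is another common option.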

Multiclass Classification

  • Classification involving more than two classes (i.e., m > 2 classes)
  • Method 1. One-vs.-all (OVA): Learn a classifier one at a time
    • Given m classes, train m classifiers: one for each class
    • Classifier j: treat tuples in class j as positive & all others as negative
    • To classify a tuple X, the set of classifiers vote as an ensemble
  • Method 2. All-vs.-all (AVA): Learn a classifier for each pair of classes
    • Given m classes, construct m(m-1)/2 binary classifiers
    • A classifier is trained using tuples of the two classes
    • To classify a tuple X, each classifier votes. X is assigned to the class with maximal vote
  • Comparison
    • All-vs.-all tends to be superior to one-vs.-all
    • Problem: binary classifiers are sensitive to errors, and errors in individual classifiers affect the vote count
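The two voting schemes can be sketched as follows. The nearest-centroid “classifiers” here are stand-ins for whatever trained binary models are actually used; the centroid values are made up for the demo:

```python
from itertools import combinations

def ova_predict(x, scorers):
    """One-vs.-all: pick the class whose classifier is most confident."""
    return max(scorers, key=lambda j: scorers[j](x))

def ava_predict(x, classes, pair_clf):
    """All-vs.-all: each pairwise classifier votes; the class with the
    maximal vote count wins."""
    votes = {c: 0 for c in classes}
    for i, j in combinations(classes, 2):
        votes[pair_clf[(i, j)](x)] += 1
    return max(votes, key=votes.get)

# Hypothetical 1-D problem: three classes with known centroids.
centroids = {"a": 0.0, "b": 5.0, "c": 10.0}

# OVA: m classifiers, each scoring "how much does x look like class j?"
scorers = {c: (lambda x, m=m: -abs(x - m)) for c, m in centroids.items()}

# AVA: m(m-1)/2 classifiers, each deciding between one pair of classes.
pair_clf = {(i, j): (lambda x, i=i, j=j:
                     i if abs(x - centroids[i]) < abs(x - centroids[j]) else j)
            for i, j in combinations(centroids, 2)}

print(ova_predict(4.2, scorers))                       # 'b'
print(ava_predict(4.2, list(centroids), pair_clf))     # 'b'
```

Note the cost trade-off: OVA trains m classifiers on all the data, while AVA trains m(m−1)/2 classifiers, but each on only two classes' worth of tuples.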

Error-Correcting Codes for Multiclass Classification

  • Originally designed to correct errors during data transmission in communication tasks by exploiting data redundancy
  • Example
    • A 7-bit codeword associated with classes 1-4  
    • Given an unknown tuple X, the 7 trained classifiers output 0001010
    • Hamming distance: # of different bits between two codewords
    • H( X , C1) = 5, by counting the # of differing bits between [1111111] & [0001010]
    • H( X , C2) = 3, H( X , C3) = 3, H( X , C4) = 1, thus C4 as the label for X
  • Error-correcting codes can correct up to ⌊(h−1)/2⌋ 1-bit errors, where h is the minimum Hamming distance between any two codewords
  • If we use 1 bit per class, it is equivalent to the one-vs.-all approach; the codes are insufficient to self-correct
  • When selecting error-correcting codes, there should be good row-wise and col.-wise separation between the codewords
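Decoding is just nearest-codeword lookup under Hamming distance. C1's codeword [1111111] is given above; the other three codewords below are one assignment that reproduces the stated distances (3, 3, and 1), so treat them as a reconstruction of the example rather than a unique answer:

```python
def hamming(a, b):
    """# of positions where two codewords differ."""
    return sum(x != y for x, y in zip(a, b))

# Codewords consistent with the distances in the example above.
CODEWORDS = {"C1": "1111111", "C2": "0000111",
             "C3": "0011001", "C4": "0101010"}

def ecoc_decode(output):
    """Assign the class whose codeword is nearest in Hamming distance."""
    return min(CODEWORDS, key=lambda c: hamming(output, CODEWORDS[c]))

# The 7 classifiers' outputs for the unknown tuple X:
print({c: hamming("0001010", w) for c, w in CODEWORDS.items()})
# {'C1': 5, 'C2': 3, 'C3': 3, 'C4': 1}
print(ecoc_decode("0001010"))  # 'C4'
```

With a minimum pairwise distance of h = 4 between these codewords, ⌊(h−1)/2⌋ = 1, so the single flipped bit in 0001010 is still decoded to the correct class.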

Semi-Supervised Classification

  • Semi-supervised: Uses labeled and unlabeled data to build a classifier
  • Self-training:
    • Build a classifier using the labeled data
    • Use it to label the unlabeled data, and those with the most confident label prediction are added to the set of labeled data
    • Repeat the above process
    • Adv: easy to understand; disadv: may reinforce errors
  • Co-training: Use two or more classifiers to teach each other
    • Each learner is trained on a mutually independent set of features of each tuple, producing two classifiers, say f1 and f2
    • Then f1 and f2 are used to predict the class label for unlabeled data X
    • Teach each other: The tuple having the most confident prediction from f1 is added to the set of labeled data for f2, & vice versa
  • Other methods, e.g., joint probability distribution of features and labels
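The self-training loop above can be sketched generically. The `fit`/`predict_proba` pair below is a deliberately toy nearest-class-mean model invented for the demo; any classifier that reports a confidence would slot in the same way:

```python
def self_train(labeled, unlabeled, fit, predict_proba, threshold=0.9, rounds=5):
    """Self-training: fit on labeled data, then repeatedly promote the
    unlabeled tuples whose predicted label is most confident."""
    labeled, unlabeled = list(labeled), list(unlabeled)
    model = fit(labeled)
    for _ in range(rounds):
        confident, rest = [], []
        for x in unlabeled:
            label, conf = predict_proba(model, x)
            (confident if conf >= threshold else rest).append((x, label))
        if not confident:          # nothing left to promote
            break
        labeled += confident       # errors here get reinforced (the disadv.)
        unlabeled = [x for x, _ in rest]
        model = fit(labeled)
    return model, labeled

# Hypothetical 1-D model: class means; confidence decays with distance.
def fit(labeled):
    means = {}
    for x, y in labeled:
        means.setdefault(y, []).append(x)
    return {y: sum(v) / len(v) for y, v in means.items()}

def predict_proba(means, x):
    y = min(means, key=lambda c: abs(x - means[c]))
    return y, 1.0 / (1.0 + abs(x - means[y]))

model, labeled = self_train([(0.0, "neg"), (10.0, "pos")], [0.5, 9.5, 5.2],
                            fit, predict_proba, threshold=0.5)
print(sorted(labeled))
```

The tuple at 5.2 never clears the confidence threshold, so it stays unlabeled; that conservatism is what keeps self-training from immediately compounding its own mistakes.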

Active Learning

  • Class labels are expensive to obtain
  • Active learner: query human (oracle) for labels
  • Pool-based approach: Uses a pool of unlabeled data
    • L: a small subset of D is labeled, U: a pool of unlabeled data in D
    • Use a query function to carefully select one or more tuples from U and request labels from an oracle (a human annotator)
    • The newly labeled samples are added to L, which is then used to learn (update) the model
    • Goal: Achieve high accuracy using as few labeled data as possible
  • Evaluated using learning curves: Accuracy as a function of the number of instances queried (# of tuples to be queried should be small)
  • Research issue: How to choose the data tuples to be queried?
    • Uncertainty sampling: choose the least certain ones
    • Reduce version space, the subset of hypotheses consistent w. the training data
    • Reduce expected entropy over U: Find the greatest reduction in the total number of incorrect predictions
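For the binary case, uncertainty sampling reduces to querying the pool tuples whose predicted class probability is closest to 0.5. The logistic model below is a made-up stand-in for whatever classifier is trained on L:

```python
import math

def uncertainty_sample(model_proba, pool, k=1):
    """Pool-based uncertainty sampling (binary case): pick the k tuples
    whose predicted probability is closest to 0.5, i.e. least certain."""
    return sorted(pool, key=lambda x: abs(model_proba(x) - 0.5))[:k]

# Hypothetical model: P(positive | x) from a logistic curve around x = 5.
proba = lambda x: 1.0 / (1.0 + math.exp(-(x - 5.0)))

U = [0.0, 2.0, 4.9, 5.1, 9.0]
print(uncertainty_sample(proba, U, k=2))  # the points nearest the boundary
```

The selected tuples (here 4.9 and 5.1, straddling the decision boundary) are the ones sent to the oracle; confidently classified points like 0.0 or 9.0 would add little to the learning curve.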

Transfer Learning: Conceptual Framework

  • Transfer learning: Extract knowledge from one or more source tasks and apply the knowledge to a target task
  • Traditional learning: Build a new classifier for each new task
  • Transfer learning: Build new classifier by applying existing knowledge learned from source tasks

Transfer Learning: Methods and Applications

  • Applications: Especially useful when data is outdated or distribution changes, e.g., Web document classification, e-mail spam filtering
  • Instance-based transfer learning: Reweight some of the data from source tasks and use it to learn the target task
  • TrAdaBoost (Transfer AdaBoost)
    • Assume the source and target data are described by the same set of attributes (features) & class labels, but have different distributions
    • Require only labeling a small amount of target data
    • Use source data in training: when a source tuple is misclassified, reduce the weight of such tuples so that they have less effect on the subsequent classifier
  • Research issues
    • Negative transfer: When it performs worse than no transfer at all
    • Heterogeneous transfer learning: Transfer knowledge from different feature space or multiple source domains
    • Large-scale transfer learning
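The instance-reweighting idea behind TrAdaBoost can be sketched as a single round of weight updates. This is a simplified illustration, not the exact TrAdaBoost update (which derives its β factors from the weighted error each round); the β values below are chosen by hand:

```python
def reweight(weights, misclassified, is_source, beta_src=0.5, beta_tgt=2.0):
    """One round of TrAdaBoost-style instance reweighting (a sketch):
    misclassified source tuples are down-weighted (they look unlike the
    target distribution), while misclassified target tuples are up-weighted
    as in ordinary AdaBoost. Returns normalized weights."""
    new = []
    for w, wrong, src in zip(weights, misclassified, is_source):
        if src:
            new.append(w * (beta_src if wrong else 1.0))  # shrink bad source tuples
        else:
            new.append(w * (beta_tgt if wrong else 1.0))  # boost hard target tuples
    total = sum(new)
    return [w / total for w in new]

w = reweight(
    weights=[0.25, 0.25, 0.25, 0.25],
    misclassified=[True, False, True, False],
    is_source=[True, True, False, False],
)
print(w)  # misclassified source tuple shrinks; misclassified target tuple grows
```

After a few rounds, source tuples that keep being misclassified fade out of the training set, which is exactly how TrAdaBoost limits the damage from off-distribution source data.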