### Agenda

• Bayesian Belief Networks
• Classification by Backpropagation
• Support Vector Machines
• Classification by Using Frequent Patterns
• Lazy Learners (or Learning from Your Neighbors)
• Other Classification Methods
• Summary

### Bayesian Belief Networks

• Bayesian belief network (also known as Bayesian network, probabilistic network): allows class conditional independencies to be defined between subsets of variables
• Two components: (1) A directed acyclic graph (called a structure) and (2) a set of conditional probability tables (CPTs)
• A (directed acyclic) graphical model of causal influence relationships
• Represents dependency among the variables
• Gives a specification of joint probability distribution
• Nodes: random variables
• E.g., X and Y are the parents of Z, and Y is the parent of P
• There is no direct dependency between Z and P
• Has no loops/cycles
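
As a small illustration of how the graph and CPTs jointly specify the distribution, the sketch below multiplies each node's conditional probability given its parents, P(x1, …, xn) = Π P(xi | Parents(xi)); the network mirrors the parent structure above (X, Y → Z; Y → P) and the CPT numbers are made up.

```python
# Minimal sketch: the joint probability of a Bayesian network factorizes as
# P(x1, ..., xn) = prod_i P(xi | Parents(xi)).  All CPT numbers are illustrative.
cpt = {
    "X": {(): {True: 0.3, False: 0.7}},                 # no parents
    "Y": {(): {True: 0.6, False: 0.4}},                 # no parents
    "Z": {                                              # parents: (X, Y)
        (True, True): {True: 0.9, False: 0.1},
        (True, False): {True: 0.5, False: 0.5},
        (False, True): {True: 0.4, False: 0.6},
        (False, False): {True: 0.1, False: 0.9},
    },
    "P": {                                              # parent: (Y,)
        (True,): {True: 0.7, False: 0.3},
        (False,): {True: 0.2, False: 0.8},
    },
}
parents = {"X": (), "Y": (), "Z": ("X", "Y"), "P": ("Y",)}

def joint(assignment):
    """P(assignment) = product over nodes of P(node | its parents)."""
    prob = 1.0
    for node, table in cpt.items():
        parent_vals = tuple(assignment[p] for p in parents[node])
        prob *= table[parent_vals][assignment[node]]
    return prob

print(joint({"X": True, "Y": True, "Z": True, "P": False}))  # 0.3 * 0.6 * 0.9 * 0.3
```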

### How Are Bayesian Networks Constructed?

• Subjective construction: Identification of (direct) causal structure
• People are quite good at identifying direct causes from a given set of variables & whether the set contains all relevant direct causes
• Markovian assumption: Each variable becomes independent of its non-effects once its direct causes are known
• E.g., S ← F → A ← T: the path S → A is blocked once we know F → A
• HMM (Hidden Markov Model): often used to model dynamic systems whose states are not observable, yet their outputs are
• Synthesis from other specifications
• E.g., from a formal system design: block diagrams & info flow
• Learning from data
• E.g., from medical records or student admission record
• Learn the parameters given the structure, or learn both the structure and the parameters
• Maximum likelihood principle: favors Bayesian networks that maximize the probability of observing the given data set

### Training Bayesian Networks: Several Scenarios

• Scenario 1: Network structure given and all variables observable: compute only the CPT entries (see the counting sketch below)
• Scenario 2: Network structure known, some variables hidden: gradient descent (greedy hill-climbing) method, i.e., search for a solution along the steepest descent of a criterion function
• Weights are initialized to random probability values
• At each iteration, it moves towards what appears to be the best solution at the moment, without backtracking
• Weights are updated at each iteration & converge to local optimum
• Scenario 3: Network structure unknown, all variables observable: search through the model space to reconstruct network topology
• Scenario 4: Unknown structure, all hidden variables: No good algorithms known for this purpose
• D. Heckerman. A Tutorial on Learning with Bayesian Networks. In Learning in Graphical Models, M. Jordan, ed. MIT Press, 1999.
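
For Scenario 1 above (structure known, all variables observable), the CPT entries can be estimated by simple counting of relative frequencies, in line with the maximum likelihood principle; a minimal sketch with made-up variable names and data:

```python
from collections import Counter, defaultdict

# Scenario 1 sketch: estimate a CPT P(child | parent) from fully observed
# tuples by counting (maximum likelihood).  Variables and data are illustrative.
data = [
    {"Smoker": True,  "LungCancer": True},
    {"Smoker": True,  "LungCancer": False},
    {"Smoker": True,  "LungCancer": True},
    {"Smoker": False, "LungCancer": False},
    {"Smoker": False, "LungCancer": False},
]

counts = defaultdict(Counter)
for row in data:
    counts[row["Smoker"]][row["LungCancer"]] += 1

cpt = {parent_val: {child_val: n / sum(c.values()) for child_val, n in c.items()}
       for parent_val, c in counts.items()}
print(cpt)   # e.g. P(LungCancer=True | Smoker=True) = 2/3
```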

### Classification by Backpropagation

• Backpropagation: A neural network learning algorithm
• Started by psychologists and neurobiologists to develop and test computational analogues of neurons
• A neural network: A set of connected input/output units where each connection has a weight associated with it
• During the learning phase, the network learns by adjusting the weights so as to be able to predict the correct class label of the input tuples
• Also referred to as connectionist learning due to the connections between units

### Neuron: A Hidden/Output Layer Unit

• An n-dimensional input vector x is mapped into variable y by means of a scalar product and a nonlinear function mapping
• The inputs to the unit are the outputs from the previous layer. They are multiplied by their corresponding weights to form a weighted sum, which is added to the bias associated with the unit; a nonlinear activation function is then applied to the result.
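
A minimal sketch of a single unit's computation, using a logistic (sigmoid) activation as an illustrative choice of nonlinear function:

```python
import math

def unit_output(inputs, weights, bias):
    """One hidden/output unit: weighted sum of inputs plus bias,
    passed through a nonlinear activation (sigmoid here)."""
    net = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-net))   # logistic activation

print(unit_output([0.2, 0.7, 0.1], [0.4, -0.3, 0.9], bias=0.05))
```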

### How A Multi-Layer Neural Network Works

• The inputs to the network correspond to the attributes measured for each training tuple
• Inputs are fed simultaneously into the units making up the input layer
• They are then weighted and fed simultaneously to a hidden layer
• The number of hidden layers is arbitrary, although usually only one
• The weighted outputs of the last hidden layer are input to units making up the output layer, which emits the network's prediction
• The network is feed-forward: None of the weights cycles back to an input unit or to an output unit of a previous layer
• From a statistical point of view, networks perform nonlinear regression: Given enough hidden units and enough training samples, they can closely approximate any function

### Defining a Network Topology

• Decide the network topology: Specify # of units in the input layer, # of hidden layers (if > 1), # of units in each hidden layer, and # of units in the output layer
• Normalize the input values for each attribute measured in the training tuples to [0.0, 1.0]
• For discrete-valued attributes, one input unit per domain value, each initialized to 0
• Output: for classification with more than two classes, one output unit per class is used
• If a trained network's accuracy is unacceptable, repeat the training process with a different network topology or a different set of initial weights

### Backpropagation

• Iteratively process a set of training tuples & compare the network's prediction with the actual known target value
• For each training tuple, the weights are modified to minimize the mean squared error between the network's prediction and the actual target value
• Modifications are made in the “backwards” direction: from the output layer, through each hidden layer, down to the first hidden layer, hence “backpropagation”
• Steps
• Initialize weights to small random numbers, associated with biases
• Propagate the inputs forward (by applying activation function)
• Backpropagate the error (by updating weights and biases)
• Terminating condition (when error is very small, etc.)
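
A compact sketch of these steps for a tiny 2-2-1 sigmoid network; the XOR training set, learning rate, and epoch count are illustrative only:

```python
import math, random

random.seed(1)
sigmoid = lambda v: 1.0 / (1.0 + math.exp(-v))

# Steps above: init small random weights, propagate inputs forward,
# backpropagate the error, repeat until the error is small.
data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]
W1 = [[random.uniform(-0.5, 0.5) for _ in range(2)] for _ in range(2)]  # input -> hidden
b1 = [0.0, 0.0]
W2 = [random.uniform(-0.5, 0.5) for _ in range(2)]                      # hidden -> output
b2 = 0.0
lr = 0.5

def forward(x):
    h = [sigmoid(W1[j][0] * x[0] + W1[j][1] * x[1] + b1[j]) for j in range(2)]
    o = sigmoid(W2[0] * h[0] + W2[1] * h[1] + b2)
    return h, o

for epoch in range(10000):
    for x, target in data:
        h, o = forward(x)
        err_o = o * (1 - o) * (target - o)            # output-unit error
        err_h = [h[j] * (1 - h[j]) * err_o * W2[j]    # hidden-unit errors
                 for j in range(2)]
        for j in range(2):                            # update weights & biases
            W2[j] += lr * err_o * h[j]
            b1[j] += lr * err_h[j]
            for i in range(2):
                W1[j][i] += lr * err_h[j] * x[i]
        b2 += lr * err_o

print([round(forward(x)[1], 2) for x, _ in data])     # predictions after training
```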

### Efficiency and Interpretability

• Efficiency of backpropagation: Each epoch (one iteration through the training set) takes O(|D| × w) time, with |D| tuples and w weights, but the # of epochs can be exponential in n, the number of inputs, in the worst case
• For easier comprehension: Rule extraction by network pruning
• Simplify the network structure by removing weighted links that have the least effect on the trained network
• Then perform link, unit, or activation value clustering
• The set of input and activation values are studied to derive rules describing the relationship between the input and hidden unit layers
• Sensitivity analysis: assess the impact that a given input variable has on a network output. The knowledge gained from this analysis can be represented in rules

### Neural Network as a Classifier

• Weakness
• Long training time
• Require a number of parameters typically best determined empirically, e.g., the network topology or “structure.”
• Poor interpretability: Difficult to interpret the symbolic meaning behind the learned weights and of “hidden units” in the network
• Strength
• High tolerance to noisy data
• Ability to classify untrained patterns
• Well-suited for continuous-valued inputs and outputs
• Successful on an array of real-world data, e.g., hand-written letters
• Algorithms are inherently parallel
• Techniques have recently been developed for the extraction of rules from trained neural networks

### Classification: A Mathematical Mapping

• Classification: predicts categorical class labels
• E.g., Personal homepage classification
• xi = (x1, x2, x3, …), yi = +1 or –1
• x1 : # of occurrences of the word “homepage”
• x2 : # of occurrences of the word “welcome”
• Mathematically, x ∈ X = R^n, y ∈ Y = {+1, –1},
• We want to derive a function f: X → Y
• Linear Classification
• Binary Classification problem
• Data above the separating line belongs to class ‘x’
• Data below the separating line belongs to class ‘o’
• Examples: SVM, Perceptron, Probabilistic Classifiers

### Discriminative Classifiers

• Prediction accuracy is generally high
• As compared to Bayesian methods – in general
• Robust, works when training examples contain errors
• Fast evaluation of the learned target function
• Bayesian networks are normally slow
• Criticism
• Long training time
• Difficult to understand the learned function (weights)
• Bayesian networks can be used easily for pattern discovery
• Not easy to incorporate domain knowledge
• Easy in the form of priors on the data or distributions

### SVM—Support Vector Machines

• A relatively new classification method for both linear and nonlinear data
• It uses a nonlinear mapping to transform the original training data into a higher dimension
• With the new dimension, it searches for the linear optimal separating hyperplane (i.e., “decision boundary”)
• With an appropriate nonlinear mapping to a sufficiently high dimension, data from two classes can always be separated by a hyperplane
• SVM finds this hyperplane using support vectors (“essential” training tuples) and margins (defined by the support vectors)

### SVM—History and Applications

• Vapnik and colleagues (1992)—groundwork from Vapnik & Chervonenkis’ statistical learning theory in 1960s
• Features: training can be slow but accuracy is high owing to their ability to model complex nonlinear decision boundaries (margin maximization)
• Used for: classification and numeric prediction
• Applications:
• handwritten digit recognition, object recognition, speaker identification, benchmarking time-series prediction tests

### SVM—Linearly Separable

• A separating hyperplane can be written as
• W·X + b = 0
• where W={w1, w2, …, wn} is a weight vector and b a scalar (bias)
• For 2-D data it can be written as
• w0 + w1 x1 + w2 x2 = 0
• The hyperplanes defining the sides of the margin:
• H1: w0 + w1 x1 + w2 x2 ≥ 1 for yi = +1, and
• H2: w0 + w1 x1 + w2 x2 ≤ –1 for yi = –1
• Any training tuples that fall on hyperplanes H1 or H2 (i.e., the sides defining the margin) are support vectors
• This becomes a constrained (convex) quadratic optimization problem: Quadratic objective function and linear constraints → Quadratic Programming (QP) → Lagrangian multipliers
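
A hedged sketch, assuming scikit-learn is available: fit a linear SVM on toy 2-D data and read off the hyperplane W·X + b = 0 and the support vectors that define the margin.

```python
from sklearn.svm import SVC

# Toy linearly separable 2-D data (illustrative only)
X = [[1, 1], [2, 2], [2, 0],        # class +1
     [-1, -1], [-2, -2], [-2, 0]]   # class -1
y = [+1, +1, +1, -1, -1, -1]

clf = SVC(kernel="linear", C=1e3)   # large C approximates a hard margin
clf.fit(X, y)

print("w =", clf.coef_[0], "b =", clf.intercept_[0])   # hyperplane W·X + b = 0
print("support vectors:", clf.support_vectors_)        # the "essential" tuples
```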

### Why Is SVM Effective on High Dimensional Data?

• The complexity of trained classifier is characterized by the # of support vectors rather than the dimensionality of the data
• The support vectors are the essential or critical training examples —they lie closest to the decision boundary (MMH)
• If all other training examples are removed and the training is repeated, the same separating hyperplane would be found
• The number of support vectors found can be used to compute an (upper) bound on the expected error rate of the SVM classifier, which is independent of the data dimensionality
• Thus, an SVM with a small number of support vectors can have good generalization, even when the dimensionality of the data is high

### SVM—Linearly Inseparable

• Transform the original input data into a higher dimensional space
• Search for a linear separating hyperplane in the new space

### SVM: Different Kernel functions

• Instead of computing the dot product on the transformed data, it is mathematically equivalent to apply a kernel function K(Xi, Xj) to the original data, i.e., K(Xi, Xj) = Φ(Xi)·Φ(Xj)
• Typical kernel functions (sketched below):
• Polynomial kernel of degree h: K(Xi, Xj) = (Xi·Xj + 1)^h
• Gaussian radial basis function kernel: K(Xi, Xj) = exp(−‖Xi − Xj‖² / 2σ²)
• Sigmoid kernel: K(Xi, Xj) = tanh(κ Xi·Xj − δ)
• SVM can also be used for classifying multiple (> 2) classes and for regression analysis (with additional parameters)
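
A small sketch of the three kernels listed above, applied directly to the original tuples (parameter values are illustrative):

```python
import numpy as np

# Kernels computed on the original data, avoiding an explicit mapping Φ.
def polynomial_kernel(xi, xj, degree=3):
    return (np.dot(xi, xj) + 1.0) ** degree

def rbf_kernel(xi, xj, sigma=1.0):
    return np.exp(-np.sum((np.asarray(xi) - np.asarray(xj)) ** 2) / (2 * sigma ** 2))

def sigmoid_kernel(xi, xj, kappa=1.0, delta=-1.0):
    return np.tanh(kappa * np.dot(xi, xj) + delta)

xi, xj = [1.0, 2.0], [0.5, -1.0]
print(polynomial_kernel(xi, xj), rbf_kernel(xi, xj), sigmoid_kernel(xi, xj))
```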

### Scaling SVM by Hierarchical Micro-Clustering

• SVM is not scalable to the number of data objects in terms of training time and memory usage
• H. Yu, J. Yang, and J. Han, “Classifying Large Data Sets Using SVM with Hierarchical Clusters”, KDD'03
• CB-SVM (Clustering-Based SVM)
• Given limited amount of system resources (e.g., memory), maximize the SVM performance in terms of accuracy and the training speed
• Use micro-clustering to effectively reduce the number of points to be considered
• When deriving support vectors, de-cluster the micro-clusters near the “candidate vectors” to ensure high classification accuracy

### CF-Tree: Hierarchical Micro-cluster

• Read the data set once, construct a statistical summary of the data (i.e., hierarchical clusters) given a limited amount of memory
• Micro-clustering: Hierarchical indexing structure
• provide finer samples closer to the boundary and coarser samples farther from the boundary

### Selective Declustering: Ensure High Accuracy

• CF tree is a suitable base structure for selective declustering
• De-cluster only the cluster Ei such that
• Di – Ri < Ds, where Di is the distance from the boundary to the center point of Ei and Ri is the radius of Ei
• De-cluster only those clusters whose subclusters could possibly be support clusters of the boundary
• “Support cluster”: The cluster whose centroid is a support vector

### CB-SVM Algorithm: Outline

• Construct two CF-trees from positive and negative data sets independently
• Need one scan of the data set
• Train an SVM from the centroids of the root entries
• De-cluster the entries near the boundary into the next level
• The children entries de-clustered from the parent entries are accumulated into the training set with the non-declustered parent entries
• Train an SVM again from the centroids of the entries in the training set
• Repeat until nothing is accumulated

### Accuracy and Scalability on Synthetic Dataset

• Experiments on large synthetic data sets show better accuracy than random sampling approaches and far better scalability than the original SVM algorithm

### SVM vs. Neural Network

• SVM
• Deterministic algorithm
• Nice generalization properties
• Hard to learn – learned in batch mode using quadratic programming techniques
• Using kernels can learn very complex functions
• Neural Network
• Nondeterministic algorithm
• Generalizes well but doesn’t have strong mathematical foundation
• Can easily be learned in incremental fashion
• To learn complex functions—use multilayer perceptron (nontrivial)

### SVM—Related Links

• SVM Website: http://www.kernel-machines.org/
• Representative implementations
• LIBSVM: an efficient implementation of SVM, with multi-class classification, nu-SVM, one-class SVM, and interfaces for Java, Python, etc.
• SVM-light: simpler, but performance is not better than LIBSVM; supports only binary classification and only C
• SVM-torch: another recent implementation also written in C

### Associative Classification

• Associative classification: Major steps
• Mine data to find strong associations between frequent patterns (conjunctions of attribute-value pairs) and class labels
• Association rules are generated in the form of
• p1 ∧ p2 ∧ … ∧ pl → “Aclass = C” (confidence, support)
• Organize the rules to form a rule-based classifier
• Why effective?
• It explores highly confident associations among multiple attributes and may overcome some constraints introduced by decision-tree induction, which considers only one attribute at a time
• Associative classification has been found to be often more accurate than some traditional classification methods, such as C4.5
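
A minimal sketch of the two steps above: class-association rules (attribute-value conjunction → class label) are ordered by confidence and then support, and a tuple is classified by the first matching rule; the rules themselves are made up.

```python
# Each rule: (condition set, class label, confidence, support) — illustrative values.
rules = [
    ({"age": "youth", "credit": "fair"}, "no",  0.90, 0.12),
    ({"income": "high"},                 "yes", 0.85, 0.20),
    ({"student": "yes"},                 "yes", 0.80, 0.25),
]
rules.sort(key=lambda r: (r[2], r[3]), reverse=True)   # decreasing precedence

def classify(tuple_x, default="yes"):
    for cond, label, _, _ in rules:
        if all(tuple_x.get(a) == v for a, v in cond.items()):
            return label        # first (highest-precedence) matching rule wins
    return default              # fall back to a default class

print(classify({"age": "youth", "student": "yes", "credit": "fair"}))   # -> 'no'
```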

### Typical Associative Classification Methods

• CBA (Classification Based on Associations: Liu, Hsu & Ma, KDD’98)
• Mine possible association rules in the form of
• Cond-set (a set of attribute-value pairs) → class label
• Build classifier: Organize rules according to decreasing precedence based on confidence and then support
• CMAR (Classification based on Multiple Association Rules: Li, Han, Pei, ICDM’01)
• Classification: Statistical analysis on multiple rules
• CPAR (Classification based on Predictive Association Rules: Yin & Han, SDM’03)
• Generation of predictive rules (FOIL-like analysis), but covered tuples are allowed to remain with reduced weight
• Prediction using best k rules
• High efficiency, accuracy similar to CMAR

### Frequent Pattern-Based Classification

• H. Cheng, X. Yan, J. Han, and C.-W. Hsu, “Discriminative Frequent Pattern Analysis for Effective Classification”, ICDE'07
• Accuracy issue
• Increase the discriminative power
• Increase the expressive power of the feature space
• Scalability issue
• It is computationally infeasible to generate all feature combinations and filter them with an information gain threshold
• Efficient method (DDPMine: FP-tree pruning): H. Cheng, X. Yan, J. Han, and P. S. Yu, "Direct Discriminative Pattern Mining for Effective Classification", ICDE'08

### Frequent Pattern vs. Single Feature

• The discriminative power of some frequent patterns is higher than that of single features.

### Feature Selection

• Given a set of frequent patterns, both non-discriminative and redundant patterns exist, which can cause overfitting
• We want to single out the discriminative patterns and remove redundant ones
• The notion of Maximal Marginal Relevance (MMR) is borrowed
• A document has high marginal relevance if it is both relevant to the query and contains minimal marginal similarity to previously selected documents

### Lazy vs. Eager Learning

• Lazy vs. eager learning
• Lazy learning (e.g., instance-based learning): Simply stores training data (or only minor processing) and waits until it is given a test tuple
• Eager learning (the above discussed methods): Given a set of training tuples, constructs a classification model before receiving new (e.g., test) data to classify
• Lazy: less time in training but more time in predicting
• Accuracy
• Lazy method effectively uses a richer hypothesis space since it uses many local linear functions to form an implicit global approximation to the target function
• Eager: must commit to a single hypothesis that covers the entire instance space

### Lazy Learner: Instance-Based Methods

• Instance-based learning:
• Store training examples and delay the processing (“lazy evaluation”) until a new instance must be classified
• Typical approaches
• k-nearest neighbor approach
• Instances represented as points in a Euclidean space.
• Locally weighted regression
• Constructs local approximation
• Case-based reasoning
• Uses symbolic representations and knowledge-based inference

### The k-Nearest Neighbor Algorithm

• All instances correspond to points in the n-D space
• The nearest neighbors are defined in terms of Euclidean distance, dist(X1, X2)
• Target function could be discrete- or real- valued
• For discrete-valued, k-NN returns the most common value among the k training examples nearest to xq
• Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples
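
A minimal k-NN sketch: Euclidean distance plus a majority vote among the k nearest training tuples (the tiny data set is illustrative):

```python
import math
from collections import Counter

# Training tuples: (feature vector, class label) — illustrative data.
train = [([1.0, 1.0], "o"), ([1.2, 0.8], "o"), ([3.0, 3.2], "x"), ([3.1, 2.9], "x")]

def knn_classify(xq, k=3):
    dist = lambda a, b: math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    nearest = sorted(train, key=lambda t: dist(xq, t[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

print(knn_classify([1.1, 1.0]))   # -> 'o'
```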

### Discussion on the k-NN Algorithm

• k-NN for real-valued prediction for a given unknown tuple
• Returns the mean values of the k nearest neighbors
• Distance-weighted nearest neighbor algorithm
• Weight the contribution of each of the k neighbors according to their distance to the query xq

$w=\frac{1}{d(x_{q}, x_{i})^2}$

• Give greater weight to closer neighbors
• Robust to noisy data by averaging k-nearest neighbors
• Curse of dimensionality: distance between neighbors could be dominated by irrelevant attributes
• To overcome it, axes stretch or elimination of the least relevant attributes
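
A distance-weighted variant of the sketch above, where each neighbor votes with weight w = 1/d(xq, xi)²:

```python
# Distance-weighted k-NN: closer neighbors get larger votes; `train` is a list
# of (feature vector, label) pairs as in the previous sketch.
def weighted_knn_classify(xq, train, k=3):
    dist = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5
    nearest = sorted(train, key=lambda t: dist(xq, t[0]))[:k]
    votes = {}
    for xi, label in nearest:
        votes[label] = votes.get(label, 0.0) + 1.0 / (dist(xq, xi) ** 2 + 1e-12)
    return max(votes, key=votes.get)   # class with the largest weighted vote
```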

### Case-Based Reasoning (CBR)

• CBR: Uses a database of problem solutions to solve new problems
• Store symbolic description (tuples or cases)—not points in a Euclidean space
• Applications: Customer-service (product-related diagnosis), legal ruling
• Methodology
• Instances represented by rich symbolic descriptions (e.g., function graphs)
• Search for similar cases, multiple retrieved cases may be combined
• Tight coupling between case retrieval, knowledge-based reasoning, and problem solving
• Challenges
• Find a good similarity metric
• Indexing based on syntactic similarity measures and, when this fails, backtracking and adapting to additional cases

### Genetic Algorithms (GA)

• Genetic Algorithm: based on an analogy to biological evolution
• An initial population is created consisting of randomly generated rules
• Each rule is represented by a string of bits
• E.g., the rule IF A1 AND NOT A2 THEN C2 can be encoded as the bit string “100”
• If an attribute has k > 2 values, k bits can be used
• Based on the notion of survival of the fittest, a new population is formed to consist of the fittest rules and their offspring
• The fitness of a rule is represented by its classification accuracy on a set of training examples
• Offspring are generated by crossover and mutation
• The process continues until a population P evolves in which each rule satisfies a prespecified fitness threshold
• Slow but easily parallelizable
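
A hedged GA sketch using the bit-string rule encoding above: fitness is classification accuracy on a tiny made-up training set, and new generations are formed from the fittest rules by crossover and mutation.

```python
import random
random.seed(0)

# (A1, A2) -> class; illustrative training tuples.
train = [((1, 0), 1), ((1, 1), 0), ((0, 0), 0), ((0, 1), 0)]

def fitness(rule):
    # rule = (required A1 value, required A2 value, predicted class bit),
    # e.g. (1, 0, 0) encodes "IF A1 AND NOT A2 THEN class 0" as the string "100".
    correct = 0
    for (a1, a2), cls in train:
        fires = (a1, a2) == (rule[0], rule[1])
        predicted = rule[2] if fires else 1 - rule[2]   # simplistic default
        correct += (predicted == cls)
    return correct / len(train)

def crossover(r1, r2):
    cut = random.randrange(1, 3)
    return r1[:cut] + r2[cut:]

def mutate(rule, p=0.1):
    return tuple(b ^ 1 if random.random() < p else b for b in rule)

population = [tuple(random.randint(0, 1) for _ in range(3)) for _ in range(6)]
for generation in range(20):
    population.sort(key=fitness, reverse=True)
    parents = population[:3]                                    # survival of the fittest
    offspring = [mutate(crossover(*random.sample(parents, 2))) for _ in range(3)]
    population = parents + offspring

best = max(population, key=fitness)
print(best, fitness(best))
```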

### Rough Set Approach

• Rough sets are used to approximately or “roughly” define equivalent classes
• A rough set for a given class C is approximated by two sets: a lower approximation (certain to be in C) and an upper approximation (cannot be described as not belonging to C)
• Finding the minimal subsets (reducts) of attributes for feature reduction is NP-hard but a discernibility matrix (which stores the differences between attribute values for each pair of data tuples) is used to reduce the computation intensity

### Fuzzy Set Approaches

• Fuzzy logic uses truth values between 0.0 and 1.0 to represent the degree of membership (such as in a fuzzy membership graph)
• Attribute values are converted to fuzzy values. Ex.:
• Income, x, is assigned a fuzzy membership value to each of the discrete categories {low, medium, high}, e.g., $49K belongs to “medium income” with fuzzy value 0.15 but belongs to “high income” with fuzzy value 0.96
• Fuzzy membership values do not have to sum to 1.
• Each applicable rule contributes a vote for membership in the categories
• Typically, the truth values for each predicted category are summed, and these sums are combined
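
An illustrative membership-function sketch; the breakpoints are made up and only shaped so that an income of $49K gets a small "medium" degree and a large "high" degree, as in the example above:

```python
# Fuzzy membership of income (in $K) in {low, medium, high}; degrees need not sum to 1.
def membership(income_k):
    def ramp(x, a, b):                        # 0 below a, 1 above b, linear in between
        return max(0.0, min(1.0, (x - a) / (b - a)))
    return {
        "low":    1.0 - ramp(income_k, 20, 40),
        "medium": min(ramp(income_k, 25, 45), 1.0 - ramp(income_k, 45, 50)),
        "high":   ramp(income_k, 40, 50),
    }

print(membership(49))   # small "medium" degree, large "high" degree
```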

### Multiclass Classification

• Classification involving more than two classes (i.e., > 2 Classes)
• Method 1. One-vs.-all (OVA): Learn a classifier one at a time (see the sketch after this list)
• Given m classes, train m classifiers: one for each class
• Classifier j: treat tuples in class j as positive & all others as negative
• To classify a tuple X, the set of classifiers vote as an ensemble
• Method 2. All-vs.-all (AVA): Learn a classifier for each pair of classes
• Given m classes, construct m(m-1)/2 binary classifiers
• A classifier is trained using tuples of the two classes
• To classify a tuple X, each classifier votes. X is assigned to the class with maximal vote
• Comparison
• All-vs.-all tends to be superior to one-vs.-all
• Problem: Binary classifier is sensitive to errors, and errors affect vote count
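
A minimal sketch of Method 1 (one-vs.-all); `train_binary` stands for any binary learner, and the centroid-based scorer used in the demo is purely illustrative.

```python
# One-vs.-all: train one binary scorer per class (class j vs. the rest), then
# classify a tuple by the classifier that scores it highest.
def one_vs_all_fit(X, y, classes, train_binary):
    return {j: train_binary(X, [+1 if yi == j else -1 for yi in y]) for j in classes}

def one_vs_all_predict(classifiers, x):
    return max(classifiers, key=lambda j: classifiers[j](x))

def centroid_scorer(X, y01):
    # Toy binary "learner": score = negative squared distance to the positive centroid.
    pos = [xi for xi, yi in zip(X, y01) if yi == +1]
    cen = [sum(col) / len(pos) for col in zip(*pos)]
    return lambda x: -sum((a - b) ** 2 for a, b in zip(x, cen))

X = [[0, 0], [0, 1], [5, 5], [5, 6], [10, 0], [11, 0]]
y = ["a", "a", "b", "b", "c", "c"]
clfs = one_vs_all_fit(X, y, {"a", "b", "c"}, centroid_scorer)
print(one_vs_all_predict(clfs, [5, 5.5]))   # -> 'b'
```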

### Error-Correcting Codes for Multiclass Classification

• Originally designed to correct errors during data transmission for communication tasks by exploring data redundancy
• Example
• A 7-bit codeword associated with classes 1-4
• Given an unknown tuple X, the 7 trained classifiers output: 0001010
• Hamming distance: # of different bits between two codewords
• H( X , C1) = 5, by checking # of bits between [1111111] & [0001010]
• H( X , C2) = 3, H( X , C3) = 3, H( X , C4) = 1, thus C4 as the label for X
• Error-correcting codes can correct up to ⌊(h−1)/2⌋ 1-bit errors, where h is the minimum Hamming distance between any two codewords
• If we use 1 bit per class, it is equivalent to the one-vs.-all approach, and the codes are insufficient to self-correct
• When selecting error-correcting codes, there should be good row-wise and col.-wise separation between the codewords
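
A sketch reproducing the decoding in the example above; the codewords for C2–C4 are chosen to be consistent with the quoted Hamming distances (the full codeword table is not shown in this section).

```python
# Assign X to the class whose codeword is closest (in Hamming distance)
# to the bits emitted by the 7 trained binary classifiers.
codewords = {"C1": "1111111", "C2": "0000111", "C3": "0011001", "C4": "0101010"}

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

output = "0001010"                                   # classifiers' output for X
print({c: hamming(output, w) for c, w in codewords.items()})        # 5, 3, 3, 1
print(min(codewords, key=lambda c: hamming(output, codewords[c])))  # -> C4
```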

### Semi-Supervised Classification

• Semi-supervised: Uses labeled and unlabeled data to build a classifier
• Self-training (sketched after this list):
• Build a classifier using the labeled data
• Use it to label the unlabeled data, and those with the most confident label prediction are added to the set of labeled data
• Repeat the above process
• Co-training: Use two or more classifiers to teach each other
• Each learner uses a mutually independent set of features of each tuple to train a good classifier, giving, say, f1 and f2
• Then f1 and f2 are used to predict the class label for unlabeled data X
• Teach each other: The tuple having the most confident prediction from f1 is added to the set of labeled data for f2, & vice versa
• Other methods, e.g., joint probability distribution of features and labels
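
A hedged self-training sketch; `fit`/`predict_proba`/`classes_` follow the scikit-learn convention, and the confidence threshold and round limit are illustrative.

```python
# Self-training loop: fit on labeled data, label the unlabeled pool, promote
# the most confident predictions into the labeled set, and repeat.
def self_train(model, X_labeled, y_labeled, X_unlabeled, threshold=0.95, rounds=10):
    X_l, y_l, X_u = list(X_labeled), list(y_labeled), list(X_unlabeled)
    for _ in range(rounds):
        model.fit(X_l, y_l)                          # 1. build classifier on labeled data
        if not X_u:
            break
        proba = model.predict_proba(X_u)             # 2. label the unlabeled data
        confident = [i for i, p in enumerate(proba) if max(p) >= threshold]
        if not confident:
            break                                    # nothing confident enough to add
        for i in confident:                          # 3. promote confident predictions
            X_l.append(X_u[i])
            y_l.append(model.classes_[list(proba[i]).index(max(proba[i]))])
        X_u = [x for i, x in enumerate(X_u) if i not in confident]
    return model
```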

### Active Learning

• Class labels are expensive to obtain
• Active learner: query human (oracle) for labels
• Pool-based approach: Uses a pool of unlabeled data
• L: a small subset of D is labeled, U: a pool of unlabeled data in D
• Use a query function to carefully select one or more tuples from U and request labels from an oracle (a human annotator)
• The newly labeled samples are added to L, and the model is relearned
• Goal: Achieve high accuracy using as few labeled data as possible
• Evaluated using learning curves: Accuracy as a function of the number of instances queried (# of tuples to be queried should be small)
• Research issue: How to choose the data tuples to be queried?
• Uncertainty sampling: choose the least certain ones
• Reduce the version space, the subset of hypotheses consistent with the training data
• Reduce expected entropy over U: Find the greatest reduction in the total number of incorrect predictions
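
A hedged sketch of pool-based uncertainty sampling; `model` is assumed to follow the scikit-learn `predict_proba` convention and `oracle` stands in for the human annotator.

```python
# Pool-based active learning: repeatedly query the tuple in U about which the
# current model is least certain, add its oracle label to L, and retrain.
def uncertainty_sampling(model, L_X, L_y, U, oracle, budget=20):
    for _ in range(budget):
        model.fit(L_X, L_y)
        proba = model.predict_proba(U)
        # least-confident tuple: smallest maximum class probability
        i = min(range(len(U)), key=lambda k: max(proba[k]))
        L_X.append(U[i])
        L_y.append(oracle(U[i]))     # ask the oracle (human annotator) for the label
        U.pop(i)
    return model
```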

### Transfer Learning: Conceptual Framework

• Transfer learning: Extract knowledge from one or more source tasks and apply the knowledge to a target task
• Transfer learning: Build new classifier by applying existing knowledge learned from source tasks

### Transfer Learning: Methods and Applications

• Applications: Especially useful when data is outdated or distribution changes, e.g., Web document classification, e-mail spam filtering
• Instance-based transfer learning: Reweight some of the data from source tasks and use it to learn the target task
• Assume source and target data are each described by the same set of attributes (features) & class labels, but with different distributions
• Require only labeling a small amount of target data
• Use source data in training: When a source tuple is misclassified, reduce the weight of such tuples so that they have less effect on the subsequently learned classifier
• Research issues
• Negative transfer: When it performs worse than no transfer at all
• Heterogeneous transfer learning: Transfer knowledge from different feature space or multiple source domains
• Large-scale transfer learning

### What Is Prediction?

• (Numerical) prediction is similar to classification
• construct a model
• use model to predict continuous or ordered value for a given input
• Prediction is different from classification
• Classification predicts categorical class labels
• Prediction models continuous-valued functions
• Major method for prediction: regression
• model the relationship between one or more independent or predictor variables and a dependent or response variable
• Regression analysis
• Linear and multiple regression
• Non-linear regression
• Other regression methods: generalized linear model, Poisson regression, log-linear models, regression trees

### Linear Regression

• Linear regression: involves a response variable y and a single predictor variable x
$y=w_{0}+w_{1}x$
where w0 (y-intercept) and w1 (slope) are regression coefficients
• Method of least squares: estimates the best-fitting straight line

$w_{1}=\frac{\sum_{i=1}^{|D|}(x_{i}-\bar{x})(y_{i}-\bar{y})}{\sum_{i=1}^{|D|}(x_{i}-\bar{x})^2}$

$w_{0}=\bar{y}-w_{1}\bar{x}$

• Multiple linear regression: involves more than one predictor variable
• Training data is of the form (X1, y1), (X2, y2),…, (X|D|, y|D|)
• Ex. For 2-D data, we may have:
$y=w_{0}+w_{1}x_{1}+w_{2}x_{2}$
• Solvable by an extension of the least-squares method or by using software such as SAS or S-Plus
• Many nonlinear functions can be transformed into the above
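
A least-squares sketch for simple linear regression y = w0 + w1·x, implementing the two estimator formulas above on a small made-up data set:

```python
# Method of least squares for y = w0 + w1*x; data are illustrative.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 4.1, 5.9, 8.2, 9.9]

x_bar, y_bar = sum(xs) / len(xs), sum(ys) / len(ys)
w1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
     sum((x - x_bar) ** 2 for x in xs)
w0 = y_bar - w1 * x_bar
print(w0, w1)   # slope ≈ 2, intercept ≈ 0 for this near-linear data
```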

### Nonlinear Regression

• Some nonlinear models can be modeled by a polynomial function
• A polynomial regression model can be transformed into linear regression model. For example,

$y=w_{0}+w_{1}x+w_{2}x^{2}+w_{3}x^{3}$

• convertible to linear with new variables: $x_{2} = x^{2}, x_{3}= x^{3}$
$y=w_{0}+w_{1}x+w_{2}x_{2}+w_{3}x_{3}$
• Other functions, such as power function, can also be transformed to linear model
• Some models are intractably nonlinear (e.g., a sum of exponential terms)
• It may still be possible to obtain least-squares estimates through extensive calculation on more complex formulae
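
A sketch of the variable substitution above: the cubic model in x becomes a multiple linear regression in (x, x², x³), solved here with NumPy's least-squares routine on synthetic data:

```python
import numpy as np

x = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0])
y = 1.0 + 2.0 * x - 0.5 * x**2 + 0.1 * x**3              # synthetic cubic response

X = np.column_stack([np.ones_like(x), x, x**2, x**3])     # new variables x2=x^2, x3=x^3
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coeffs)   # ≈ [1.0, 2.0, -0.5, 0.1]
```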

### Other Regression-Based Models

• Generalized linear model:
• Foundation on which linear regression can be applied to modeling categorical response variables
• Variance of y is a function of the mean value of y, not a constant
• Logistic regression: models the prob. of some event occurring as a linear function of a set of predictor variables
• Poisson regression: models the data that exhibit a Poisson distribution
• Log-linear models: (for categorical data)
• Approximate discrete multidimensional prob. distributions
• Also useful for data compression and smoothing
• Regression trees and model trees
• Trees to predict continuous values rather than class labels

### Regression Trees and Model Trees

• Regression tree: proposed in CART system (Breiman et al. 1984)
• CART: Classification And Regression Trees
• Each leaf stores a continuous-valued prediction
• It is the average value of the predicted attribute for the training tuples that reach the leaf
• Model tree: proposed by Quinlan (1992)
• Each leaf holds a regression model—a multivariate linear equation for the predicted attribute
• A more general case than regression tree
• Regression and model trees tend to be more accurate than linear regression when the data are not represented well by a simple linear model

### Predictive Modeling in Multidimensional Databases

• Predictive modeling: Predict data values or construct generalized linear models based on the database data
• One can only predict value ranges or category distributions
• Method outline:
• Minimal generalization
• Attribute relevance analysis
• Generalized linear model construction
• Prediction
• Determine the major factors which influence the prediction
• Data relevance analysis: uncertainty measurement, entropy analysis, expert judgement, etc.
• Multi-level prediction: drill-down and roll-up analysis

### SVM—Introductory Literature

• Statistical Learning Theory by Vapnik: difficult to understand, and contains many errors
• C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery, 2(2), 1998
• Easier than Vapnik's book, but still not introductory level; the examples are not very intuitive
• An Introduction to Support Vector Machines by N. Cristianini and J. Shawe-Taylor
• Not introductory level either, but its explanation of Mercer's theorem is better than the above
• Neural Networks and Learning Machines by Haykin
• Contains a nice introductory chapter on SVMs

### A Closer Look at CMAR

• CMAR (Classification based on Multiple Association Rules: Li, Han, Pei, ICDM’01)
• Efficiency: Uses an enhanced FP-tree that maintains the distribution of class labels among tuples satisfying each frequent itemset
• Rule pruning whenever a rule is inserted into the tree
• Given two rules, R1 and R2, if the antecedent of R1 is more general than that of R2 and conf(R1) ≥ conf(R2), then prune R2
• Prunes rules for which the rule antecedent and class are not positively correlated, based on a χ2 test of statistical significance
• Classification based on generated/pruned rules
• If only one rule satisfies tuple X, assign the class label of the rule
• If a rule set S satisfies X, CMAR
• divides S into groups according to class labels
• uses a weighted χ2 measure to find the strongest group of rules, based on the statistical correlation of rules within a group
• assigns X the class label of the strongest group

### Summary

• Effective and advanced classification methods
• Bayesian belief network (probabilistic networks)
• Backpropagation (Neural networks)
• Support Vector Machine (SVM)
• Pattern-based classification
• Other classification methods: lazy learners (KNN, case-based reasoning), genetic algorithms, rough set and fuzzy set approaches
• Multiclass classification
• Semi-supervised classification
• Active learning
• Transfer learning

### References

• C. M. Bishop, Neural Networks for Pattern Recognition. Oxford University Press, 1995
• C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery, 2(2): 121-168, 1998
• H. Cheng, X. Yan, J. Han, and C.-W. Hsu, Discriminative Frequent Pattern Analysis for Effective Classification, ICDE'07
• H. Cheng, X. Yan, J. Han, and P. S. Yu, Direct Discriminative Pattern Mining for Effective Classification, ICDE'08
• N. Cristianini and J. Shawe-Taylor, Introduction to Support Vector Machines and Other Kernel-Based Learning Methods, Cambridge University Press, 2000
• A. J. Dobson. An Introduction to Generalized Linear Models. Chapman & Hall, 1990
• G. Dong and J. Li. Efficient mining of emerging patterns: Discovering trends and differences. KDD'99
• R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification, 2ed. John Wiley, 2001
• T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag, 2001
• S. Haykin, Neural Networks and Learning Machines, Prentice Hall, 2008

### References (con't)

• D. Heckerman, D. Geiger, and D. M. Chickering. Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 1995.
• V. Kecman, Learning and Soft Computing: Support Vector Machines, Neural Networks, and Fuzzy Logic, MIT Press, 2001
• W. Li, J. Han, and J. Pei, CMAR: Accurate and Efficient Classification Based on Multiple Class-Association Rules, ICDM'01
• T.-S. Lim, W.-Y. Loh, and Y.-S. Shih. A comparison of prediction accuracy, complexity, and training time of thirty-three old and new classification algorithms. Machine Learning, 2000
• B. Liu, W. Hsu, and Y. Ma. Integrating classification and association rule mining, p. 80-86, KDD’98.
• T. M. Mitchell. Machine Learning. McGraw Hill, 1997.
• D.E. Rumelhart, and J.L. McClelland, editors, Parallel Distributed Processing, MIT Press, 1986.
• P. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Addison Wesley, 2005.
• S. M. Weiss and N. Indurkhya. Predictive Data Mining. Morgan Kaufmann, 1997.
• I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques, 2ed. Morgan Kaufmann, 2005.
• X. Yin and J. Han. CPAR: Classification based on predictive association rules. SDM'03
• H. Yu, J. Yang, and J. Han. Classifying large data sets using SVM with hierarchical clusters. KDD'03.