Pandu Nayak and Prabhakar Raghavan
Lecture 11: Text Classification;
Vector space classification
[Borrows slides from Ray Mooney]
Classify based on the prior weight of the class and conditional parameters for what each word says:
  c_NB = argmax_{c ∈ C} [ log P̂(c) + Σ_{i ∈ positions} log P̂(x_i | c) ]
Training is done by counting and dividing:
  P̂(c) = N_c / N        P̂(x_k | c) = (T_{c,x_k} + 1) / Σ_{x ∈ V} (T_{c,x} + 1)
Don’t forget to smooth (the +1 terms above are add-one smoothing)
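A minimal sketch of this recipe in Python (function names are illustrative; add-one smoothing as above):

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (tokens, label). Returns priors, smoothed conditionals, vocabulary."""
    class_counts = Counter(label for _, label in docs)
    term_counts = defaultdict(Counter)               # per-class term counts T_{c,x}
    vocab = set()
    for tokens, label in docs:
        term_counts[label].update(tokens)
        vocab.update(tokens)
    priors = {c: class_counts[c] / len(docs) for c in class_counts}   # N_c / N
    cond = {}
    for c in class_counts:
        total = sum(term_counts[c].values())
        # add-one (Laplace) smoothing over the vocabulary
        cond[c] = {w: (term_counts[c][w] + 1) / (total + len(vocab)) for w in vocab}
    return priors, cond, vocab

def classify_nb(tokens, priors, cond, vocab):
    """Return argmax_c [ log P(c) + sum_i log P(x_i | c) ]."""
    scores = {}
    for c in priors:
        score = math.log(priors[c])
        score += sum(math.log(cond[c][w]) for w in tokens if w in vocab)
        scores[c] = score
    return max(scores, key=scores.get)
```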
Today:
Vector space methods for Text Classification
Vector space classification using centroids (Rocchio)
K Nearest Neighbors
Decision boundaries, linear and nonlinear classifiers
Dealing with more than 2 classes
Later in the course
More text classification
Support Vector Machines
Text-specific issues in classification
Each document is a vector, one component for each term (= word).
Normally normalize vectors to unit length.
High-dimensional vector space:
Terms are axes
10,000+ dimensions, or even 100,000+
Docs are vectors in this space
How can we do classification in this space?
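As a small illustration of the representation (using raw term frequencies as a stand-in for tf-idf weights, which is an assumption made here for brevity):

```python
import math
from collections import Counter

def to_unit_vector(tokens):
    """Sparse bag-of-words vector, L2-normalized to unit length."""
    counts = Counter(tokens)
    norm = math.sqrt(sum(v * v for v in counts.values()))
    return {term: v / norm for term, v in counts.items()}

doc = to_unit_vector("interest rates rise as the discount rate falls".split())
# for unit-length vectors, cosine similarity reduces to a plain dot product
```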
As before, the training set is a set of documents, each labeled with its class (e.g., topic)
In vector space classification, this set corresponds to a labeled set of points (or, equivalently, vectors) in the vector space
Premise 1: Documents in the same class form a contiguous region of space
Premise 2: Documents from different classes don’t overlap (much)
We define surfaces to delineate classes in the space
Relevance feedback methods can be adapted for text categorization
As noted before, relevance feedback can be viewed as 2-class classification
Relevant vs. non-relevant documents
Use standard tf-idf weighted vectors to represent text documents
For training documents in each category, compute a prototype vector by summing the vectors of the training documents in the category.
Prototype = centroid of members of class
Assign test documents to the category with the closest prototype vector based on cosine similarity.
The centroid of class c is μ(c) = (1 / |D_c|) Σ_{d ∈ D_c} v(d), where D_c is the set of all documents that belong to class c and v(d) is the vector space representation of d.
Note that centroid will in general not be a unit vector even when the inputs are unit vectors.
Forms a simple generalization of the examples in each class (a prototype).
Prototype vector does not need to be averaged or otherwise normalized for length since cosine similarity is insensitive to vector length.
Classification is based on similarity to class prototypes.
Does not guarantee classifications are consistent with the given training data.
Why not?
Rocchio forms a simple representation for each class: the centroid/prototype
Classification is based on similarity to / distance from the prototype/centroid
It does not guarantee that classifications are consistent with the given training data
It is little used outside text classification
It has been used quite effectively for text classification
But in general worse than Naïve Bayes
Again, cheap to train and test documents
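A minimal sketch of Rocchio training and classification over sparse dict vectors (helper names are illustrative, not from the slides):

```python
import math
from collections import defaultdict

def centroid(vectors):
    """Component-wise mean of a list of sparse vectors."""
    acc = defaultdict(float)
    for v in vectors:
        for term, w in v.items():
            acc[term] += w
    return {term: s / len(vectors) for term, s in acc.items()}

def cosine(u, v):
    # centroids are generally not unit length, so divide by both norms
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def train_rocchio(labeled_vectors):
    """labeled_vectors: list of (vector, label). One centroid/prototype per class."""
    by_class = defaultdict(list)
    for v, label in labeled_vectors:
        by_class[label].append(v)
    return {c: centroid(vs) for c, vs in by_class.items()}

def classify_rocchio(v, prototypes):
    """Assign to the class whose prototype is most cosine-similar."""
    return max(prototypes, key=lambda c: cosine(v, prototypes[c]))
```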
Learning is just storing the representations of the training examples in D.
Testing instance x (under 1NN):
Compute similarity between x and all examples in D.
Assign x the category of the most similar example in D.
Does not explicitly compute a generalization or category prototypes.
Also called:
Case-based learning
Memory-based learning
Lazy learning
Rationale of kNN: contiguity hypothesis
Cover and Hart (1967)
Asymptotically, the error rate of 1-nearest-neighbor classification is less than twice the Bayes rate [the error rate of a classifier that knows the model that generated the data]
In particular, asymptotic error rate is 0 if Bayes rate is 0.
Assume: query point coincides with a training point.
Both query point and training point contribute error → 2 times Bayes rate
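A sketch of the two-class version of this argument: let η(x) = P(c₁ | x), so the Bayes error at x is p*(x) = min(η(x), 1 − η(x)). Asymptotically the nearest neighbor of x is drawn from the same class distribution as x itself, giving

```latex
P(\text{1NN error} \mid x) \;\longrightarrow\;
  \eta(x)\bigl(1-\eta(x)\bigr) + \bigl(1-\eta(x)\bigr)\eta(x)
  \;=\; 2\,\eta(x)\bigl(1-\eta(x)\bigr) \;\le\; 2\,p^{*}(x),
```

and averaging over x bounds the asymptotic 1NN error rate by twice the Bayes rate.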
Using only the closest example (1NN) to determine the class is subject to errors due to:
A single atypical example.
Noise (i.e., an error) in the category label of a single training example.
More robust alternative is to find the k most-similar examples and return the majority category of these k examples.
Value of k is typically odd to avoid ties; 3 and 5 are most common.
kNN = k Nearest Neighbor
To classify a document d into class c:
Define the k-neighborhood N as the k nearest neighbors of d
Count number of documents i in N that belong to c
Estimate P(c|d) as i/k
Choose as class argmax_c P(c|d) [ = majority class]
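A minimal sketch of this kNN rule, assuming unit-length sparse vectors so that cosine similarity is just a dot product (names are illustrative):

```python
from collections import Counter

def dot(u, v):
    """Dot product of sparse dict vectors; equals cosine similarity for unit vectors."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def classify_knn(x, training, k=3):
    """training: list of (unit_vector, label); x: unit-length sparse vector."""
    neighbors = sorted(training, key=lambda dv: dot(x, dv[0]), reverse=True)[:k]
    votes = Counter(label for _, label in neighbors)
    # P(c|d) is estimated as votes[c] / k; the argmax is the majority class
    return votes.most_common(1)[0][0]
```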
Nearest neighbor method depends on a similarity (or distance) metric.
Simplest for continuous m-dimensional instance space is Euclidean distance.
Simplest for m-dimensional binary instance space is Hamming distance (number of feature values that differ).
For text, cosine similarity of tf-idf weighted vectors is typically most effective.
Nearest Neighbor tends to handle polymorphic categories better than Rocchio/NB.
Naively, finding nearest neighbors requires a linear search through D documents in collection
But determining k nearest neighbors is the same as determining the k best retrievals using the test document as a query to a database of training documents.
Use standard vector space inverted index methods to find the k nearest neighbors.
Testing time: O(B|V_t|), where V_t is the set of terms in the test document and B is the average number of training documents in which a test-document word appears.
Typically B << D
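A minimal sketch of this retrieval-style shortcut, assuming unit-length sparse vectors; the postings layout is illustrative:

```python
import heapq
from collections import Counter, defaultdict

def build_index(training):
    """training: list of (unit_vector, label). Postings map term -> [(doc_id, weight)]."""
    index = defaultdict(list)
    for doc_id, (vec, _) in enumerate(training):
        for term, w in vec.items():
            index[term].append((doc_id, w))
    return index

def knn_via_index(query, training, index, k=3):
    scores = defaultdict(float)
    # only walk postings for terms in the query: ~B training docs per query term
    for term, qw in query.items():
        for doc_id, w in index.get(term, []):
            scores[doc_id] += qw * w            # dot product = cosine for unit vectors
    top = heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])
    votes = Counter(training[doc_id][1] for doc_id, _ in top)
    return votes.most_common(1)[0][0]
```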
No feature selection necessary
Scales well with large number of classes
Don’t need to train n classifiers for n classes
Classes can influence each other
Small changes to one class can have ripple effect
Scores can be hard to convert to probabilities
No training necessary
Actually: perhaps not true. (Data editing, etc.)
May be expensive at test time
In most cases it’s more accurate than NB or Rocchio
Bias/Variance tradeoff
Variance ≈ Capacity
kNN has high variance and low bias.
Infinite memory
NB has low variance and high bias.
Decision surface has to be linear (hyperplane – see later)
Consider asking a botanist: Is an object a tree?
Too much capacity/variance, low bias
Botanist who memorizes
Will always say “no” to new object (e.g., different # of leaves)
Not enough capacity/variance, high bias
Lazy botanist
Says “yes” if the object is green
You want the middle ground
(Example due to C. Burges)
Consider 2 class problems
Deciding between two classes, perhaps, government and non-government
One-versus-rest classification
How do we define (and find) the separating surface?
How do we decide which region a test doc is in?
A strong high-bias assumption is linear separability:
in 2 dimensions, can separate classes by a line
Can find a separating hyperplane by linear programming
(or can iteratively fit a solution via the perceptron):
ax + by > c for points in one class (the red points), ax + by < c for the other
In general, there are lots of possible solutions for a, b, c.
Some methods find a separating hyperplane, but not the optimal one [according to some criterion of expected goodness]
E.g., perceptron
Most methods find an optimal separating hyperplane
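A minimal sketch of the perceptron update mentioned above, which finds some separating hyperplane for two linearly separable classes (dense NumPy vectors; parameter choices are illustrative):

```python
import numpy as np

def perceptron(X, y, epochs=100, lr=1.0):
    """X: (n, m) feature matrix; y: labels in {-1, +1}. Returns weights w and bias b."""
    n, m = X.shape
    w, b = np.zeros(m), 0.0
    for _ in range(epochs):
        errors = 0
        for xi, yi in zip(X, y):
            if yi * (xi @ w + b) <= 0:      # misclassified: nudge hyperplane toward xi
                w += lr * yi * xi
                b += lr * yi
                errors += 1
        if errors == 0:                     # no mistakes in a full pass: data separated
            break
    return w, b
```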

Class: “interest” (as in interest rate)
Example features of a linear classifier
 w_i     t_i                 w_i     t_i
 0.70    prime              −0.71    dlrs
 0.67    rate               −0.35    world
 0.63    interest           −0.33    sees
 0.60    rates              −0.25    year
 0.46    discount           −0.24    group
 0.43    bundesbank         −0.24    dlr
To classify, find dot product of feature vector and weights
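For instance, scoring a test document against the weights above is a sparse dot product; a small sketch (the zero threshold is an assumption for illustration):

```python
weights = {"prime": 0.70, "rate": 0.67, "interest": 0.63, "rates": 0.60,
           "discount": 0.46, "bundesbank": 0.43,
           "dlrs": -0.71, "world": -0.35, "sees": -0.33, "year": -0.25,
           "group": -0.24, "dlr": -0.24}

def score(doc_vec):
    """doc_vec: term -> weight (e.g., tf-idf) for the test document."""
    return sum(weights.get(t, 0.0) * x for t, x in doc_vec.items())

# label as "interest" when the score exceeds the (illustrative) threshold of 0
is_interest = score({"prime": 1.0, "rate": 2.0, "discount": 1.0}) > 0
```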
Many common text classifiers are linear classifiers
Naïve Bayes
Perceptron
Rocchio
Logistic regression
Support vector machines (with linear kernel)
Linear regression with threshold
Despite this similarity, noticeable performance differences
For separable problems, there are infinitely many separating hyperplanes. Which one do you choose?
What to do for non-separable problems?
Different training methods pick different hyperplanes
Classifiers more powerful than linear often don’t perform better on text problems. Why?
Line or hyperplane defined by: Σ_{i=1}^{M} w_i x_i = θ
For Rocchio, set: w = μ(c_1) − μ(c_2) and θ = ½ (|μ(c_1)|² − |μ(c_2)|²)
[Aside for ML/stats people: Rocchio classification is a simplification of the classic Fisher Linear Discriminant where you don’t model the variance (or assume it is spherical).]
Two-class Naive Bayes. We compute:
  log [P(C|d) / P(¬C|d)] = log [P(C) / P(¬C)] + Σ_{w ∈ d} log [P(w|C) / P(w|¬C)]
Decide class C if the odds are greater than 1, i.e., if the log odds are greater than 0.
So the decision boundary is the hyperplane:
  α + Σ_w β_w n_w = 0, where α = log [P(C) / P(¬C)], β_w = log [P(w|C) / P(w|¬C)], and n_w = number of occurrences of w in d
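A minimal sketch of reading that hyperplane off fitted two-class Naïve Bayes parameters (the priors/cond dicts follow the shape of the earlier NB sketch; names are illustrative):

```python
import math
from collections import Counter

def nb_hyperplane(priors, cond, c_pos, c_neg):
    """Return (alpha, beta) so the decision rule is: alpha + sum_w beta[w] * n_w(d) > 0."""
    alpha = math.log(priors[c_pos] / priors[c_neg])
    beta = {w: math.log(cond[c_pos][w] / cond[c_neg][w]) for w in cond[c_pos]}
    return alpha, beta

def nb_linear_decision(tokens, alpha, beta):
    counts = Counter(t for t in tokens if t in beta)   # n_w(d): occurrences of w in d
    return alpha + sum(beta[w] * n for w, n in counts.items()) > 0
```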
A nonlinear problem: a linear classifier like Naïve Bayes does badly on this task
kNN will do very well (assuming enough training data)

Any-of or multivalue classification
Classes are independent of each other.
A document can belong to 0, 1, or >1 classes.
Decompose into n binary problems
Quite common for documents
One-of, multinomial, or polytomous classification
Classes are mutually exclusive.
Each document belongs to exactly one class
E.g., digit recognition is polytomous classification
Digits are mutually exclusive
Build a separator between each class and its complementary set (docs from all other classes).
Given test doc, evaluate it for membership in each class.
Apply decision criterion of classifiers independently
Done
Though maybe you could do better by considering dependencies between categories
Build a separator between each class and its complementary set (docs from all other classes).
Given test doc, evaluate it for membership in each class.
Assign document to the class with: maximum score / maximum confidence / maximum probability
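A minimal sketch of that final step, assuming one linear scorer (weights and bias) per class as in the earlier sketches:

```python
def one_of_classify(doc_vec, classifiers):
    """classifiers: class -> (weights dict, bias). Assign to the class with maximum score."""
    def linear_score(w, b):
        return b + sum(w.get(t, 0.0) * x for t, x in doc_vec.items())
    return max(classifiers, key=lambda c: linear_score(*classifiers[c]))
```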

Representations of text are usually very high dimensional (one feature for each word)
High-bias algorithms that prevent overfitting in high-dimensional space should generally work best*
For most text categorization tasks, there are many relevant features and many irrelevant ones
Methods that combine evidence from many or all features (e.g. naive Bayes, kNN) often tend to work better than ones that try to isolate just a few relevant features*
* Although the results are a bit more mixed than often thought
Is there a learning method that is optimal for all text classification problems?
No, because there is a tradeoff between bias and variance.
Factors to take into account:
How much training data is available?
How simple/complex is the problem? (linear vs. nonlinear decision boundary)
How noisy is the data?
How stable is the problem over time?
For an unstable problem, it’s better to use a simple and robust classifier.
IIR 14
Fabrizio Sebastiani. Machine Learning in Automated Text Categorization. ACM Computing Surveys, 34(1):1–47, 2002.
Yiming Yang & Xin Liu. A re-examination of text categorization methods. Proceedings of SIGIR, 1999.
Trevor Hastie, Robert Tibshirani and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer-Verlag, New York.
Open Calais: Automatic Semantic Tagging
Free (but they can keep your data), provided by Thomson Reuters
Weka: A data mining software package that includes an implementation of many ML algorithms