Learning is just storing the representations of the training examples in D.
Testing instance x (under 1NN):
Compute similarity between x and all examples in D.
Assign x the category of the most similar example in D.
Does not explicitly compute a generalization or category prototypes.
Also called:
Case-based learning
Memory-based learning
Lazy learning
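The 1NN test procedure above can be sketched in a few lines. This is an illustrative sketch, not the course's reference code; `D`, `classify_1nn`, and the Euclidean metric are assumptions for the example (for text, cosine similarity would be used instead, as discussed below).

```python
from math import sqrt

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def classify_1nn(x, D):
    """1NN: no learning beyond storing D; at test time, scan all stored
    examples and return the label of the most similar one.
    D is a list of (vector, label) pairs (names are illustrative)."""
    nearest_vec, nearest_label = min(D, key=lambda ex: euclidean(x, ex[0]))
    return nearest_label

# Toy training set: two classes in 2-D feature space.
D = [((0.0, 0.0), "neg"), ((1.0, 1.0), "pos"), ((0.9, 1.2), "pos")]
print(classify_1nn((1.0, 0.9), D))  # closest stored example is (1.0, 1.0) -> "pos"
```

Note that no generalization is computed at training time: all the work happens in the test-time scan over D.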
Rationale of kNN: contiguity hypothesis
Cover and Hart (1967)
Asymptotically, the error rate of 1-nearest-neighbor classification is less than twice the Bayes rate (the error rate of a classifier that knows the model that generated the data).
In particular, asymptotic error rate is 0 if Bayes rate is 0.
Proof intuition: asymptotically, assume the query point coincides with a training point.
Both the query point's true label and the training point's label can independently be "wrong" → at most 2 times the Bayes rate
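Stated more precisely, one common form of the Cover and Hart (1967) bound, for $M$ classes with Bayes error rate $E^{*}$, is:

```latex
E^{*} \;\le\; E_{1\mathrm{NN}} \;\le\; E^{*}\!\left(2 - \frac{M}{M-1}\,E^{*}\right) \;\le\; 2E^{*}
```

In particular, setting $E^{*} = 0$ gives $E_{1\mathrm{NN}} = 0$, consistent with the zero-Bayes-rate case above.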
Using only the closest example (1NN) to determine the class is subject to errors due to:
A single atypical example.
Noise (i.e., an error) in the category label of a single training example.
A more robust alternative is to find the k most similar examples and return the majority category among these k examples.
Value of k is typically odd to avoid ties; 3 and 5 are most common.
kNN = k Nearest Neighbor
To classify a document d into class c:
Define k-neighborhood N as k nearest neighbors of d
Count number of documents i in N that belong to c
Estimate P(c|d) as i/k
Choose as class argmax_c P(c|d) [ = majority class]
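The four-step procedure above can be sketched as follows. The function name, the toy data, and the Euclidean metric are illustrative assumptions; the i/k estimate of P(c|d) and the argmax decision follow the steps above.

```python
from collections import Counter
from math import sqrt

def euclidean(a, b):
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def classify_knn(d, D, k=3):
    """kNN: find the k-neighborhood N of d, count the i documents in N
    belonging to each class, estimate P(c|d) = i/k, and return the
    argmax class with its estimate. D is a list of (vector, label) pairs."""
    neighborhood = sorted(D, key=lambda ex: euclidean(d, ex[0]))[:k]
    counts = Counter(label for _, label in neighborhood)
    best_class, i = counts.most_common(1)[0]
    return best_class, i / k

# Toy training set; an odd k avoids ties in the two-class case.
D = [((0.0, 0.0), "spam"), ((0.1, 0.2), "spam"),
     ((1.0, 1.0), "ham"), ((0.9, 1.1), "ham"), ((1.2, 0.8), "ham")]
print(classify_knn((1.0, 0.9), D, k=3))  # all 3 nearest are "ham" -> ("ham", 1.0)
```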
Nearest neighbor method depends on a similarity (or distance) metric.
Simplest for continuous m-dimensional instance space is Euclidean distance.
Simplest for m-dimensional binary instance space is Hamming distance (number of feature values that differ).
For text, cosine similarity of tf.idf weighted vectors is typically most effective.
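A minimal sketch of the text case: tf.idf weighting with cosine similarity over sparse vectors. The function names, the dict-based sparse representation, and the particular idf formula (log N/df, with no smoothing) are assumptions for illustration.

```python
from collections import Counter
from math import log, sqrt

def tfidf_vector(doc_tokens, df, N):
    """Sparse tf.idf vector for one document, as a term -> weight dict.
    df maps each term to its document frequency; N is the collection size."""
    tf = Counter(doc_tokens)
    return {t: tf[t] * log(N / df[t]) for t in tf if t in df}

def cosine(u, v):
    """Cosine similarity of two sparse vectors given as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = sqrt(sum(w * w for w in u.values()))
    nv = sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Toy collection statistics: N = 2 documents.
v = tfidf_vector(["apple", "apple", "pie"], df={"apple": 1, "pie": 2}, N=2)
print(cosine(v, v))  # a vector compared with itself scores ~1.0
```

A term that occurs in every document ("pie" here) gets idf = log(1) = 0, so it contributes nothing to the similarity, which is the intended effect of idf weighting.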
Nearest Neighbor tends to handle polymorphic categories better than Rocchio/NB.
Naively, finding nearest neighbors requires a linear search through all |D| documents in the collection
But determining k nearest neighbors is the same as determining the k best retrievals using the test document as a query to a database of training documents.
Use standard vector space inverted index methods to find the k nearest neighbors.
Testing time: O(B|Vt|), where Vt is the set of distinct terms in the test document and B is the average number of training documents in which a test-document word appears.
Typically B << |D|
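The inverted-index trick above can be sketched as follows: score only the training documents that share at least one term with the test document (the B << |D| saving), then keep the top k by accumulated dot product. Function names and the dict-of-lists index layout are illustrative; with length-normalized tf.idf vectors, the dot product equals cosine similarity.

```python
from collections import defaultdict

def build_index(docs):
    """Inverted index: term -> list of (doc_id, weight) postings.
    docs maps doc_id -> sparse weight vector (a dict)."""
    index = defaultdict(list)
    for doc_id, vec in docs.items():
        for term, w in vec.items():
            index[term].append((doc_id, w))
    return index

def k_nearest(query_vec, index, k):
    """Treat the test document as a query: accumulate scores only over
    postings lists of its terms, then return the top-k (doc_id, score)."""
    scores = defaultdict(float)
    for term, qw in query_vec.items():
        for doc_id, dw in index.get(term, []):
            scores[doc_id] += qw * dw
    return sorted(scores.items(), key=lambda kv: -kv[1])[:k]

docs = {"d1": {"apple": 1.0}, "d2": {"banana": 1.0},
        "d3": {"apple": 0.5, "banana": 0.5}}
idx = build_index(docs)
print(k_nearest({"apple": 1.0}, idx, k=2))  # d2 is never touched or scored
```

Note that "d2" shares no terms with the query, so it is never scored at all; only B postings per query term are visited.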
No feature selection necessary
Scales well with large number of classes
Don’t need to train n classifiers for n classes
Classes can influence each other
Small changes to one class can have ripple effect
Scores can be hard to convert to probabilities
No training necessary
Actually, not quite true: data editing, inverted-index construction, etc. are training-time work.
May be expensive at test time
In most cases it’s more accurate than NB or Rocchio
Bias/Variance tradeoff
Variance ≈ Capacity
kNN has high variance and low bias.
Infinite memory
NB has low variance and high bias.
Decision surface has to be linear (hyperplane – see later)
Consider asking a botanist: Is an object a tree?
Too much capacity/variance, low bias
Botanist who memorizes
Will say “no” to any new object that differs at all from what it has seen (e.g., a different # of leaves)
Not enough capacity/variance, high bias
Lazy botanist
Says “yes” if the object is green
You want the middle ground
(Example due to C. Burges)