$\mathrm{df}_t$ is the document frequency of t: the number of documents that contain t
$\mathrm{df}_t$ is an inverse measure of the informativeness of t
$\mathrm{df}_t \le N$, where N is the number of documents in the collection
We define the idf (inverse document frequency) of t by
$\mathrm{idf}_t = \log_{10}(N/\mathrm{df}_t)$
We use $\log(N/\mathrm{df}_t)$ instead of $N/\mathrm{df}_t$ to “dampen” the effect of idf.
It will turn out that the base of the log is immaterial.
There is one idf value for each term t in a collection.
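The idf definition can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `idf`, the collection size of one million documents, and the sample document frequencies are assumptions chosen for the example.

```python
import math

def idf(N, df_t):
    """Inverse document frequency: log10(N / df_t)."""
    if df_t <= 0 or df_t > N:
        raise ValueError("expected 0 < df_t <= N")
    return math.log10(N / df_t)

# Hypothetical collection size and document frequencies (not from the source).
N = 1_000_000
for df_t in (1, 100, 10_000, 1_000_000):
    # Rarer terms (small df_t) get a larger idf; a term in every document gets 0.
    print(df_t, idf(N, df_t))
```

The log keeps the values in a narrow range: df values spanning six orders of magnitude map to idf values between 0 and 6.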
idf has no effect on the ranking for one-term queries: every document's score is scaled by the same idf value
idf affects the ranking of documents for queries with at least two terms
For the query “capricious person”, idf weighting makes occurrences of capricious count for much more in the final document ranking than occurrences of person.
The collection frequency of t is the number of occurrences of t in the collection, counting multiple occurrences.
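The difference between collection frequency and document frequency is easy to see on a toy collection. The sketch below assumes whitespace tokenization, and the three documents and the helper name `frequencies` are made up for illustration.

```python
from collections import Counter

# Hypothetical toy collection (not from the source).
docs = [
    "the capricious person met another person",
    "a person is a person",
    "capricious weather",
]

def frequencies(docs):
    cf = Counter()  # collection frequency: total occurrences across all documents
    df = Counter()  # document frequency: number of documents containing the term
    for doc in docs:
        tokens = doc.split()
        cf.update(tokens)       # counts every occurrence
        df.update(set(tokens))  # counts each term at most once per document
    return cf, df

cf, df = frequencies(docs)
print("person:", cf["person"], df["person"])
print("capricious:", cf["capricious"], df["capricious"])
```

Here person and capricious appear in the same number of documents, but person has the higher collection frequency because it repeats within documents.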
The tf-idf weight of a term is the product of its tf weight and its idf weight.
$W_{t,d} = \log(1 + \mathrm{tf}_{t,d}) \times \log_{10}(N/\mathrm{df}_t)$
Best known weighting scheme in information retrieval
Note: the “-” in tf-idf is a hyphen, not a minus sign!
Alternative names: tf.idf, tf × idf
Increases with the number of occurrences within a document
Increases with the rarity of the term in the collection
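The two properties above can be checked directly with a small sketch of the weight. The function name `tf_idf_weight` is hypothetical, and base 10 is assumed for the tf logarithm, since the formula leaves that base unspecified.

```python
import math

def tf_idf_weight(tf_td, df_t, N):
    """tf-idf weight: log10(1 + tf_td) * log10(N / df_t).

    Assumption: base 10 for the tf logarithm as well as the idf logarithm.
    """
    if df_t == 0:
        # Term does not occur in the collection, so it contributes nothing.
        return 0.0
    return math.log10(1 + tf_td) * math.log10(N / df_t)

N = 1_000_000  # hypothetical collection size

# More occurrences within the document give a larger weight.
print(tf_idf_weight(1, 1_000, N), tf_idf_weight(10, 1_000, N))

# A rarer term (smaller df_t) gives a larger weight.
print(tf_idf_weight(3, 100_000, N), tf_idf_weight(3, 100, N))
```

A term that does not occur in the document (tf of 0) gets weight 0, because log10(1 + 0) = 0.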
$\mathrm{Score}(q,d) = \sum_{t \in q \cap d} \mathrm{tf.idf}_{t,d}$
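As a rough sketch of this scoring rule, the function below sums the tf-idf weights of the terms shared by query and document. The name `score`, the whitespace tokenization, the base-10 tf logarithm, and the document frequencies for capricious and person are all assumptions made for illustration.

```python
import math
from collections import Counter

def score(query, doc, df, N):
    """Sum of tf-idf weights over the terms that occur in both query and document."""
    tf = Counter(doc.split())
    total = 0.0
    for t in set(query.split()):
        if tf[t] > 0 and df.get(t, 0) > 0:
            total += math.log10(1 + tf[t]) * math.log10(N / df[t])
    return total

# Hypothetical statistics: "capricious" is rare, "person" is common.
N = 1_000_000
df = {"capricious": 10, "person": 100_000}

d1 = "a capricious person"
d2 = "person person person"
print(score("capricious person", d1, df, N))
print(score("capricious person", d2, df, N))
```

With these numbers, a single occurrence of capricious outweighs three occurrences of person, so d1 ranks above d2.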
There are many variants
How “tf” is computed (with/without logs)
Whether the terms in the query are also weighted
Each document is now represented by a real-valued vector of tf-idf weights $\in \mathbb{R}^{|V|}$
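A minimal sketch of this representation builds one vector per document, with one component per vocabulary term. The helper name `tf_idf_vectors`, the whitespace tokenization, the base-10 tf logarithm, and the three toy documents are assumptions for illustration only.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Return the sorted vocabulary and one tf-idf vector of length |V| per document."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({t for tokens in tokenized for t in tokens})
    N = len(docs)
    # Document frequency: each term counted once per document it occurs in.
    df = Counter(t for tokens in tokenized for t in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append(
            [math.log10(1 + tf[t]) * math.log10(N / df[t]) for t in vocab]
        )
    return vocab, vectors

# Hypothetical toy collection.
docs = ["capricious person", "person person", "calm person"]
vocab, vectors = tf_idf_vectors(docs)
print(vocab)
for v in vectors:
    print(v)
```

In this toy collection, person occurs in every document, so its component is 0 in every vector. In a realistic collection the vectors are very sparse, because most vocabulary terms are absent from any given document.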