df_t is the document frequency of t: the number of documents that contain t
df_t is an inverse measure of the informativeness of t: the more documents a term occurs in, the less it discriminates between them
df_t ≤ N
We define the idf (inverse document frequency) of t by
idf_t = log_{10}(N/df_t)
We use log(N/df_t) instead of N/df_t to “dampen” the effect of idf.
It will turn out that the base of the log is immaterial.
There is one idf value for each term t in a collection.
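As a minimal sketch, the idf formula above can be computed directly; the collection size N = 1,000,000 used here is a hypothetical example value:

```python
import math

def idf(df_t: int, N: int) -> float:
    """Inverse document frequency: log10(N / df_t)."""
    return math.log10(N / df_t)

# Hypothetical collection of N = 1,000,000 documents.
# A term occurring in 1,000 of them gets idf = log10(1000) = 3.
print(idf(1_000, 1_000_000))      # → 3.0
# A term occurring in every document is maximally uninformative:
print(idf(1_000_000, 1_000_000))  # → 0.0
```

Note how the log damps the raw ratio: a term that is 100× rarer gains only 2 extra units of idf.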
[Table of example idf values (for sample terms such as “iPhone”) omitted]
idf has no effect on the ranking for one-term queries
idf affects the ranking of documents for queries with at least two terms
For the query capricious person, idf weighting makes occurrences of capricious count for much more in the final document ranking than occurrences of person.
The collection frequency of t is the number of occurrences of t in the collection, counting multiple occurrences.
Example: [table contrasting the collection frequency and document frequency of sample words omitted]
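The distinction between collection frequency and document frequency can be sketched on a toy corpus (the three tiny documents below are illustrative, not from the original example):

```python
# Toy corpus: each document is a list of tokens.
docs = [
    ["try", "try", "try"],
    ["try", "insurance"],
    ["insurance"],
]

def collection_frequency(term, docs):
    # Total occurrences across the whole collection, counting repeats.
    return sum(doc.count(term) for doc in docs)

def document_frequency(term, docs):
    # Number of documents that contain the term at least once.
    return sum(1 for doc in docs if term in doc)

print(collection_frequency("try", docs))  # → 4
print(document_frequency("try", docs))    # → 2
```

Here "try" has twice the collection frequency of "insurance" (4 vs. 2) but the same document frequency (2), so idf treats the two terms identically.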
The tf-idf weight of a term is the product of its tf weight and its idf weight.
w_{t,d} = log(1 + tf_{t,d}) × log_{10}(N/df_{t})
Best known weighting scheme in information retrieval
Note: the “-” in tf-idf is a hyphen, not a minus sign!
Alternative names: tf.idf, tf × idf
Increases with the number of occurrences within a document
Increases with the rarity of the term in the collection
Score(q, d) = ∑_{t ∈ q∩d} tf-idf_{t,d}
There are many variants
How “tf” is computed (with/without logs)
Whether the terms in the query are also weighted
…
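The scoring rule above can be sketched as follows; the query terms, term frequencies, and document frequencies are hypothetical numbers chosen for illustration:

```python
import math

def tfidf(tf, df, N):
    """One variant: w = log10(1 + tf) * log10(N / df)."""
    if tf == 0:
        return 0.0
    return math.log10(1 + tf) * math.log10(N / df)

def score(query_terms, doc_tf, df, N):
    """Sum tf-idf weights over terms present in both query and document."""
    return sum(tfidf(doc_tf.get(t, 0), df[t], N)
               for t in query_terms if t in doc_tf)

# Hypothetical statistics for the query "capricious person":
N = 1_000_000
df = {"capricious": 10, "person": 100_000}   # rare vs. common term
doc_tf = {"capricious": 2, "person": 5}      # occurrences in one document

print(score(["capricious", "person"], doc_tf, df, N))
```

Even though "person" occurs more often in the document, the rare term "capricious" contributes far more to the score (idf 5 vs. idf 1), matching the earlier observation about multi-term queries.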
Each document is now represented by a real-valued vector of tf-idf weights ∈ R^{|V|}
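A minimal sketch of building these vectors over a tiny tokenized corpus (single-letter terms are placeholders):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Represent each tokenized document as a |V|-dimensional tf-idf vector."""
    N = len(docs)
    vocab = sorted({t for d in docs for t in d})
    df = {t: sum(1 for d in docs if t in d) for t in vocab}
    vectors = []
    for d in docs:
        tf = Counter(d)  # missing terms give tf = 0, hence weight 0
        vectors.append([math.log10(1 + tf[t]) * math.log10(N / df[t])
                        for t in vocab])
    return vocab, vectors

vocab, vecs = tfidf_vectors([["a", "b"], ["a", "c"]])
print(vocab)  # → ['a', 'b', 'c']
```

Note that "a" occurs in every document, so its idf (and hence its weight in every vector) is 0: terms shared by all documents carry no discriminating power.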