### idf weight

• dft is the document frequency of t: the number of documents that contain t

• dft is an inverse measure of the informativeness of t

• dft N

• We define the idf (inverse document frequency) of t by

• We use log (N/dft) instead of N/dft to “dampen” the effect of idf.

• Will turn out the base of the log is immaterial.

### idf example, suppose N = 1 million

There is one idf value for each term t in a collection.

idft = log10 = ( N/dft)

 term dft idft calpurnia 1 animal 100 sunday 1,000 fly 10,000 under 100,000 the 1,000,000

### Effect of idf on ranking

• Does idf have an effect on ranking for one-term queries, like

• iPhone

• idf has no effect on ranking one term queries

• idf affects the ranking of documents for queries with at least two terms

• For the query capricious person, idf weighting makes occurrences of capricious count for much more in the final document ranking than occurrences of person.

### Collection vs. Document frequency

• The collection frequency of t is the number of occurrences of t in the collection, counting multiple occurrences.

• Example:

 Word Collection frequency Document frequency insurance 10440 3997 try 10422 8760
• Which word is a better search term (and should get a higher weight)?

### tf-idf weighting

• The tf-idf weight of a term is the product of its tf weight and its idf weight.

Wt,d = log(1+ tft,d) x log10(N/dft)

• Best known weighting scheme in information retrieval

• Note: the “-” in tf-idf is a hyphen, not a minus sign!

• Alternative names: tf.idf, tf x idf

• Increases with the number of occurrences within a document

• Increases with the rarity of the term in the collection

### Score for a document given a query

Score(q,d) = ∑t∈q∩d tf.idft,d

• There are many variants

• How “tf” is computed (with/without logs)

• Whether the terms in the query are also weighted

• …

### Binary → count → weight matrix

• Each document is now represented by a real-valued vector of tf-idf weights ∈ R|V|

Creator: Tgbyrdmc

Contributors:
-