We need a way of assigning a score to a query/document pair
Let’s start with a one-term query
If the query term does not occur in the document: score should be 0
The more frequent the query term in the document, the higher the score should be
We will look at a number of alternatives for this.
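One of the simplest alternatives that satisfies both requirements is the raw count of the query term in the document. The sketch below is only an illustration: the function name `tf_score` and the lowercase, whitespace-based tokenization are assumptions, not something the lecture prescribes.

```python
from collections import Counter


def tf_score(term: str, document: str) -> int:
    """Score a one-term query by raw term frequency.

    Assumption: tokens are lowercased and split on whitespace.
    """
    counts = Counter(document.lower().split())
    # Counter returns 0 for a missing key, so an absent term scores 0.
    return counts[term.lower()]


print(tf_score("march", "caesar died in march"))   # 1
print(tf_score("ides", "caesar died in march"))    # 0
print(tf_score("march", "march on march or not"))  # 2
```

An absent term scores 0 and the score rises with each additional occurrence, which are the two properties asked for above.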
Recall from Lecture 3: the Jaccard coefficient is a commonly used measure of the overlap of two sets A and B
jaccard(A,B) = |A ∩ B| / |A ∪ B|
jaccard(A,A) = 1
jaccard(A,B) = 0 if A ∩ B = ∅
A and B don’t have to be the same size.
Always assigns a number between 0 and 1.
What is the query-document match score that the Jaccard coefficient computes for each of the two documents below?
Query: ides of march
Document 1: caesar died in march
Document 2: the long march
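The exercise can be checked with a short script. This is a minimal sketch that assumes each text is treated as a set of whitespace-separated terms; the function name `jaccard` mirrors the notation above, and raising an error for two empty sets is a choice made here, since the coefficient is undefined in that case.

```python
from fractions import Fraction


def jaccard(a: set, b: set) -> Fraction:
    """Jaccard coefficient |A ∩ B| / |A ∪ B| as an exact fraction."""
    union = a | b
    if not union:
        # Assumption: treat the both-empty case as an error (division by zero).
        raise ValueError("Jaccard is undefined when both sets are empty")
    return Fraction(len(a & b), len(union))


query = set("ides of march".split())
doc1 = set("caesar died in march".split())
doc2 = set("the long march".split())

print(jaccard(query, doc1))  # 1/6
print(jaccard(query, doc2))  # 1/5
```

Both documents share only the term "march" with the query. Document 2 scores higher solely because it is shorter, so its union with the query contains five terms rather than six.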
Jaccard doesn’t consider term frequency (how many times a term occurs in a document).
Rare terms in a collection are more informative than frequent terms. Jaccard doesn’t consider this information.
We need a more sophisticated way of normalizing for length
Later in this lecture, we’ll use
. . . instead of |A ∩ B|/|A ∪ B| (Jaccard) for length normalization.
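The term-frequency limitation noted above is easy to see in code. In this sketch, the helper `jaccard_text` and the second document are made up for illustration: the second document repeats the query term three times, yet receives the same score as the original.

```python
def jaccard_text(query: str, document: str) -> float:
    """Jaccard coefficient over the sets of whitespace-separated terms."""
    q, d = set(query.split()), set(document.split())
    union = q | d
    # Assumption: return 0.0 when both texts are empty.
    return len(q & d) / len(union) if union else 0.0


once = jaccard_text("ides of march", "the long march")
# Hypothetical document: same vocabulary, but "march" occurs three times.
thrice = jaccard_text("ides of march", "the long march march march")

print(once == thrice)  # True
```

Converting a document to a set discards the counts, so repeated occurrences of a query term cannot raise the score.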