### Query-document matching scores

• We need a way of assigning a score to a query/document pair

• If the query term does not occur in the document: score should be 0

• The more frequently the query term occurs in the document, the higher the score should be

• We will look at a number of alternatives for this.

### Take 1: Jaccard coefficient

• Recall from Lecture 3: A commonly used measure of overlap of two sets A and B

• jaccard(A,B) = |A ∩ B| / |A ∪ B|

• jaccard(A,A) = 1

• jaccard(A,B) = 0 if A ∩ B = ∅

• A and B don’t have to be the same size.

• Always assigns a number between 0 and 1.
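The definition and its properties can be checked with a minimal Python sketch. The function name `jaccard` follows the slide; the example sets and the choice to return 0.0 when both sets are empty (where the ratio is undefined) are assumptions made for illustration.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two sets."""
    union = a | b
    if not union:
        # Both sets are empty, so the ratio is undefined.
        # Returning 0.0 here is an assumption, not part of the definition.
        return 0.0
    return len(a & b) / len(union)


# Hypothetical example sets of different sizes.
A = {"caesar", "brutus", "calpurnia"}
B = {"brutus", "antony"}
C = {"cleopatra"}

same = jaccard(A, A)      # identical sets
overlap = jaccard(A, B)   # one shared element out of four distinct elements
disjoint = jaccard(A, C)  # empty intersection

print(same, overlap, disjoint)
```

Because the intersection can never be larger than the union, the result always lies between 0 and 1.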

### Jaccard coefficient: Scoring example

• What is the query-document match score that the Jaccard coefficient computes for each of the two documents below?

• Query: ides of march

• Document 1: caesar died in march

• Document 2: the long march
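One way to work the example is to treat the query and each document as a set of whitespace-separated terms. The sketch below assumes exactly that tokenization (lowercasing, no stemming, no stop-word removal); the helper name `jaccard_score` is hypothetical. Exact fractions are used so the scores are easy to read.

```python
from fractions import Fraction


def jaccard_score(query: str, document: str) -> Fraction:
    """Jaccard coefficient of the term sets of a query and a document."""
    # Assumed tokenization: lowercase, split on whitespace.
    q = set(query.lower().split())
    d = set(document.lower().split())
    union = q | d
    if not union:
        return Fraction(0)  # assumption: score empty input as 0
    return Fraction(len(q & d), len(union))


query = "ides of march"
doc1 = "caesar died in march"
doc2 = "the long march"

score1 = jaccard_score(query, doc1)  # shares {march}; 6 distinct terms in the union
score2 = jaccard_score(query, doc2)  # shares {march}; 5 distinct terms in the union

print(score1, score2)
```

Under these assumptions Document 1 scores 1/6 and Document 2 scores 1/5: both share only the term "march" with the query, and Document 2 ranks higher simply because it is shorter.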

### Issues with Jaccard for scoring

• It doesn’t consider term frequency (how many times a term occurs in a document)

• Rare terms in a collection are more informative than frequent terms. Jaccard doesn’t consider this information

• We need a more sophisticated way of normalizing for length

• Later in this lecture, we’ll use . . . instead of |A ∩ B|/|A ∪ B| (Jaccard) for length normalization.
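The first two issues can be seen in a small sketch. The documents below are made up for illustration, and the whitespace tokenization and the helper name `jaccard_score` are assumptions.

```python
def jaccard_score(query: str, document: str) -> float:
    """Jaccard coefficient of the term sets of a query and a document."""
    # Assumed tokenization: lowercase, split on whitespace.
    q = set(query.lower().split())
    d = set(document.lower().split())
    union = q | d
    return len(q & d) / len(union) if union else 0.0


query = "ides of march"

# Term frequency is ignored: repeating "march" does not change the term set.
once = jaccard_score(query, "the long march")
often = jaccard_score(query, "the long march march march")

# Term rarity is ignored: matching the common term "of" counts exactly
# as much as matching the rare term "ides".
common_match = jaccard_score(query, "king of spain")
rare_match = jaccard_score(query, "beware the ides")

print(once, often, common_match, rare_match)
```

All four scores come out equal, even though a document that mentions "march" repeatedly, or one that matches the rare term "ides", is arguably a better match for the query.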

Creator: Tgbyrdmc
