Query-document matching scores

  • We need a way of assigning a score to a query/document pair

  • Let’s start with a one-term query 

  • If the query term does not occur in the document: score should be 0

  • The more frequent the query term in the document, the higher the score should be

  • We will look at a number of alternatives for this.

Take 1: Jaccard coefficient

  • Recall from Lecture 3: A commonly used measure of overlap of two sets A and B

  • jaccard(A,B) = |A ∩ B| / |A ∪ B|

  • jaccard(A,A) = 1

  • jaccard(A,B) = 0 if A ∩ B = ∅

  • A and B don’t have to be the same size.

  • Always assigns a number between 0 and 1.
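
The definition above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function name `jaccard` follows the slide notation, and returning 0.0 when both sets are empty is a convention assumed here, since the ratio is undefined in that case.

```python
def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two collections treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Both sets are empty, so the ratio is undefined.
        # Returning 0.0 is a convention assumed for this sketch.
        return 0.0
    return len(a & b) / len(union)


# Properties listed on the slide:
same = jaccard({"x", "y"}, {"x", "y"})        # identical sets
disjoint = jaccard({"x"}, {"y"})              # empty intersection
unequal_sizes = jaccard({"x", "y", "z"}, {"z"})  # sets of different size
```

Identical non-empty sets give 1, disjoint sets give 0, and every other pair falls strictly between the two.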

Jaccard coefficient: Scoring example

  • What is the query-document match score that the Jaccard coefficient computes for each of the two documents below?

  • Query: ides of march

  • Document 1: caesar died in march

  • Document 2: the long march
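
One way to check the exercise is to treat the query and each document as a set of terms. The sketch below assumes lowercase text split on whitespace, which is a simplification chosen here rather than a tokenizer prescribed by the lecture; `jaccard_score` is a hypothetical helper name.

```python
def jaccard_score(query, document):
    """Jaccard coefficient between the term sets of a query and a document."""
    q = set(query.split())      # assumption: whitespace tokenization
    d = set(document.split())
    union = q | d
    # Guard against two empty strings, where the ratio is undefined.
    return len(q & d) / len(union) if union else 0.0


query = "ides of march"
doc1 = "caesar died in march"
doc2 = "the long march"

# Document 1: shares {march}; union has 6 distinct terms, so the score is 1/6.
score1 = jaccard_score(query, doc1)
# Document 2: shares {march}; union has 5 distinct terms, so the score is 1/5.
score2 = jaccard_score(query, doc2)
```

Under these assumptions Document 2 scores higher (1/5 versus 1/6) only because it is shorter, not because it matches the query any better.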

Issues with Jaccard for scoring

  • It doesn’t consider term frequency (how many times a term occurs in a document)

  • Rare terms in a collection are more informative than frequent terms. Jaccard doesn’t take this into account.

  • We need a more sophisticated way of normalizing for length

  • Later in this lecture, we’ll use 

  • . . . instead of |A ∩ B|/|A ∪ B| (Jaccard) for length normalization.
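
The first two issues can be seen in a small experiment. The sketch below reuses the same whitespace-tokenizing helper as an assumption (`jaccard_score` is a hypothetical name, and the one-word documents are made up for illustration).

```python
def jaccard_score(query, document):
    """Jaccard coefficient between the term sets of a query and a document."""
    q = set(query.split())      # assumption: whitespace tokenization
    d = set(document.split())
    union = q | d
    return len(q & d) / len(union) if union else 0.0


query = "ides of march"

# Term frequency is ignored: repeating a term does not change the set.
once = jaccard_score(query, "march")
thrice = jaccard_score(query, "march march march")

# Term rarity is ignored: matching the rare term "ides" counts exactly
# as much as matching the very common term "of".
rare_match = jaccard_score(query, "ides")
common_match = jaccard_score(query, "of")
```

All four scores come out equal (1/3), so Jaccard alone cannot distinguish a document that mentions a term repeatedly from one that mentions it once, nor a match on an informative term from a match on a common one.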