### Query-document matching scores

We need a way of assigning a score to a query/document pair.

**Let’s start with a one-term query**

If the query term does not occur in the document: score should be 0

**The more frequent the query term in the document, the higher the score (should be)**

We will look at a number of alternatives for this.

### Take 1: Jaccard coefficient

Recall from Lecture 3: A commonly used measure of overlap of two sets *A* and *B*

**jaccard**(*A*, *B*) = |*A* ∩ *B*| / |*A* ∪ *B*|

**jaccard**(*A*, *A*) = 1

**jaccard**(*A*, *B*) = 0 if *A* ∩ *B* = ∅

*A* and *B* don’t have to be the same size.

Always assigns a number between 0 and 1.
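The definition and its two boundary properties can be checked with a minimal Python sketch. The function name `jaccard` and the toy sets are illustrative, and returning 0.0 when both sets are empty is an assumed convention (the formula itself is 0/0 in that case).

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two sets."""
    union = a | b
    if not union:
        # Both sets are empty: the ratio is undefined (0/0).
        # Returning 0.0 here is an assumption, not part of the definition.
        return 0.0
    return len(a & b) / len(union)


a = {"x", "y", "z"}
b = {"y", "z", "w", "v"}   # shares 2 elements with a; 5 distinct elements overall
c = {"q"}                  # disjoint from a

print(jaccard(a, a))  # identical sets
print(jaccard(a, c))  # disjoint sets
print(jaccard(a, b))  # sets of different sizes
```

Because the intersection is never larger than the union, the result always falls in the range 0 to 1.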

### Jaccard coefficient: Scoring example

What is the query-document match score that the Jaccard coefficient computes for each of the two documents below?

Query: *ides of march*

Document 1: *caesar died in march*

Document 2: *the long march*
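One way to work through this example is the sketch below. It assumes each text is lowercased and split on whitespace into a set of terms; the helper names `tokens` and `jaccard` are illustrative.

```python
def jaccard(a: set, b: set) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def tokens(text: str) -> set:
    # Assumed tokenization: lowercase, split on whitespace, keep distinct terms.
    return set(text.lower().split())


query = tokens("ides of march")
doc1 = tokens("caesar died in march")
doc2 = tokens("the long march")

score1 = jaccard(query, doc1)  # 1 shared term ("march"), 6 distinct terms
score2 = jaccard(query, doc2)  # 1 shared term ("march"), 5 distinct terms

print(f"Document 1: {score1:.3f}")  # Document 1: 0.167
print(f"Document 2: {score2:.3f}")  # Document 2: 0.200
```

Both documents match the query only on *march*, yet Document 2 scores higher (1/5 versus 1/6) simply because it is shorter.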

### Issues with Jaccard for scoring

It doesn’t consider *term frequency* (how many times a term occurs in a document).

Rare terms in a collection are more informative than frequent terms. Jaccard doesn’t consider this information.
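Both limitations can be seen in a small sketch. The helper `jaccard_score` and the toy documents are made up for illustration, tokenization is assumed to be lowercase whitespace splitting, and the claim that *ides* is rarer than *of* is an assumption about a typical English collection.

```python
def jaccard_score(query: str, doc: str) -> float:
    # Assumed tokenization: lowercase and split on whitespace.
    q, d = set(query.lower().split()), set(doc.lower().split())
    union = q | d
    return len(q & d) / len(union) if union else 0.0


query = "ides of march"

# Term frequency is ignored: repeating "march" leaves the term set unchanged.
once = jaccard_score(query, "the long march")
thrice = jaccard_score(query, "the long march march march")
print(once == thrice)  # True

# Term rarity is ignored: matching the (assumed) rare term "ides" is worth
# exactly as much as matching the very common term "of".
rare_match = jaccard_score(query, "beware the ides")
common_match = jaccard_score(query, "out of time")
print(rare_match == common_match)  # True
```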

We need a more sophisticated way of normalizing for length.

Later in this lecture, we’ll use . . . instead of |*A* ∩ *B*| / |*A* ∪ *B*| (Jaccard) for length normalization.