### Computing cosine scores

### Efficient cosine ranking

Find the *K* docs in the collection “nearest” to the query ⇒ the *K* largest query-doc cosines.

Efficient ranking:

Computing a single cosine efficiently.

Choosing the *K* largest cosine values efficiently.

Can we do this without computing all *N* cosines?
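Computing a single cosine efficiently is usually done term-at-a-time over an inverted index: accumulate partial dot products per document, then divide by the document length. Below is a minimal sketch; the postings layout, the toy weights, and all names (`cosine_scores`, `postings`, `doc_lengths`) are illustrative assumptions, not a prescribed implementation. Query-side normalization is omitted because it is constant across documents and does not affect the ranking.

```python
import math
from collections import defaultdict

def cosine_scores(query_terms, postings, doc_lengths):
    """Term-at-a-time scoring: accumulate dot-product contributions
    per document, then normalize by document length.  Query
    normalization is skipped (it does not change the ranking)."""
    scores = defaultdict(float)
    for term in query_terms:
        for doc_id, weight in postings.get(term, []):
            scores[doc_id] += weight
    return {d: s / doc_lengths[d] for d, s in scores.items()}

# Toy index: each term maps to (doc_id, tf-idf weight) pairs.
postings = {
    "cosine":  [(1, 0.5), (2, 0.8)],
    "ranking": [(1, 0.7)],
}
doc_lengths = {1: math.sqrt(0.5**2 + 0.7**2), 2: 0.8}

print(cosine_scores(["cosine", "ranking"], postings, doc_lengths))
```

Note that only documents with a nonzero cosine ever appear in the accumulator, which is what makes the later selection step (finding the *K* best of *J* candidates) cheap.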

### Efficient cosine ranking

What we’re doing in effect: solving the *K*-nearest-neighbor problem for a query vector

In general, we do not know how to do this efficiently for high-dimensional spaces

But it is solvable for short queries, and standard indexes support this well

### Special case – unweighted queries

No weighting on query terms

Assume each query term occurs only once

Then for ranking, don’t need to normalize query vector

Slight simplification of algorithm from Lecture 6
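Under these assumptions the simplification is easy to see. With every query term occurring once and unit query weights, the score reduces to a length-normalized sum of document term weights (a sketch; $w_{t,d}$ denotes document $d$'s weight for term $t$):

$$\mathrm{score}(q, d) \;=\; \frac{1}{|\vec{d}\,|} \sum_{t \in q} w_{t,d}$$

The query's own norm is the same for every document, so dropping it leaves the ranking unchanged.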

### Computing the K largest cosines: selection vs. sorting

Typically we want to retrieve the top *K* docs (in the cosine ranking for the query), not to totally order all docs in the collection.

Can we pick off the docs with the *K* highest cosines?

Let *J* = number of docs with nonzero cosines.

We seek the *K* best of these *J*.

### Use heap for selecting top K

Binary tree in which each node’s value > the values of its children (a max-heap)

Takes 2*J* operations to construct; then each of the *K* “winners” is read off in 2 log *J* steps.

For *J* = 1M, *K* = 100, this is about 10% of the cost of sorting.
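The selection-beats-sorting idea can be sketched with Python's standard `heapq` module. (`heapq.nlargest` maintains a size-*K* heap over the stream rather than building one heap of all *J* scores and popping *K* winners as described above, but both schemes avoid sorting all *J* values.) The toy scores below are illustrative assumptions.

```python
import heapq

# Nonzero cosine scores for J = 5 documents: doc_id -> score.
scores = {1: 0.9, 2: 0.3, 3: 0.75, 4: 0.1, 5: 0.6}

K = 3
# Select the K highest-scoring docs without totally ordering
# all J candidates.
top_k = heapq.nlargest(K, scores.items(), key=lambda kv: kv[1])

print(top_k)  # (doc_id, score) pairs in descending score order
```

For *K* much smaller than *J*, this does far less work than `sorted(scores.items(), ...)`, which is exactly the point of the slide's cost comparison.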

### Bottlenecks

Primary computational bottleneck in scoring: cosine computation

Can we avoid all this computation?

Yes, but may sometimes get it wrong

a doc *not* in the top *K* may creep into the list of *K* output docs

Is this such a bad thing?

### Cosine similarity is only a proxy

User has a task and a query formulation

Cosine matches docs to query

Thus cosine is in any case only a proxy for user happiness

If we get a list of *K* docs “close” to the top *K* by the cosine measure, that should be OK.