### Efficient cosine ranking

• Find the K docs in the collection “nearest” to the query ⇒ K largest query-doc cosines.

• Efficient ranking:

• Computing a single cosine efficiently.

• Choosing the K largest cosine values efficiently.

• Can we do this without computing all N cosines?
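The baseline we want to beat can be sketched as a single sparse cosine computation. This is a minimal illustration, not from the lecture: the vectors are `{term: weight}` dicts, and only terms shared by both vectors contribute to the dot product.

```python
import math

def cosine(query_weights, doc_weights):
    """Cosine of two sparse vectors given as {term: weight} dicts.

    Iterating over the shorter dict keeps the dot product cheap:
    only terms present in both vectors contribute.
    """
    if len(query_weights) > len(doc_weights):
        query_weights, doc_weights = doc_weights, query_weights
    dot = sum(w * doc_weights.get(t, 0.0) for t, w in query_weights.items())
    q_norm = math.sqrt(sum(w * w for w in query_weights.values()))
    d_norm = math.sqrt(sum(w * w for w in doc_weights.values()))
    if q_norm == 0.0 or d_norm == 0.0:
        return 0.0
    return dot / (q_norm * d_norm)
```

Ranking N docs this way means N such computations plus a selection of the K largest — the cost the following slides attack.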

### Efficient cosine ranking

• What we’re doing in effect: solving the K-nearest neighbor problem for a query vector

• In general, we do not know how to do this efficiently for high-dimensional spaces

• But it is solvable for short queries, and standard indexes support this well

### Special case – unweighted queries

• No weighting on query terms

• Assume each query term occurs only once

• Then for ranking, don’t need to normalize query vector

• Slight simplification of algorithm from Lecture 6
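The simplification can be sketched as term-at-a-time scoring over a toy inverted index (the index contents and weights below are illustrative, not from the lecture). With unit query weights, each query term just adds its document weights into the accumulators, and the query vector is never normalized — its length is a constant common to all scores.

```python
from collections import defaultdict

# Toy inverted index: term -> postings list of (doc_id, tf-idf weight).
# Terms and weights are made up for the demo.
index = {
    "car":       [(1, 0.5), (2, 0.3)],
    "insurance": [(2, 0.7), (3, 0.4)],
}
doc_lengths = {1: 0.5, 2: 0.76, 3: 0.4}  # precomputed document vector norms

def score(query_terms):
    """Term-at-a-time scoring with unit query weights: each query term
    adds its doc weights to the accumulators; the query vector is a
    common constant factor, so it is never normalized."""
    scores = defaultdict(float)
    for term in query_terms:
        for doc_id, weight in index.get(term, []):
            scores[doc_id] += weight  # query-term weight is 1
    # Divide by document length only; skipping query normalization
    # does not change the ranking.
    return {d: s / doc_lengths[d] for d, s in scores.items()}
```

On this toy index, `score(["car", "insurance"])` ranks doc 2 first, since it matches both query terms.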

### Computing the K largest cosines: selection vs. sorting

• Typically we want to retrieve the top K docs (in the cosine ranking for the query)

• not to totally order all docs in the collection

• Can we pick off docs with K highest cosines?

• Let J = number of docs with nonzero cosines

• We seek the K best of these J

### Use heap for selecting top K

• A binary max-heap: a binary tree in which each node’s value ≥ the values of its children

• Takes 2J operations to construct; each of the K “winners” is then read off in 2 log J steps.

• For J=1M, K=100, this is about 10% of the cost of sorting.
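The heap-based selection above can be sketched with Python’s `heapq` (a min-heap, so scores are negated to pop the largest first). The function name and input format are illustrative.

```python
import heapq

def top_k_by_heap(scores, k):
    """Select the K largest of J (doc_id, score) pairs via a heap:
    heapify all J entries (O(J)), then pop the top K (O(K log J)) --
    cheaper than fully sorting all J scores when K << J."""
    # heapq is a min-heap; negate scores to get max-heap behavior.
    heap = [(-s, doc_id) for doc_id, s in scores]
    heapq.heapify(heap)
    return [(doc_id, -neg) for neg, doc_id in
            (heapq.heappop(heap) for _ in range(min(k, len(heap))))]
```

For example, `top_k_by_heap([(1, 0.2), (2, 0.9), (3, 0.5)], 2)` returns docs 2 and 3 with their scores, highest first.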

### Bottlenecks

• Primary computational bottleneck in scoring: cosine computation

• Can we avoid all this computation?

• Yes, but we may sometimes get it wrong

• a doc not in the top K may creep into the list of K output docs

• Is this such a bad thing?

### Cosine similarity is only a proxy

• User has a task and a query formulation

• Cosine matches docs to query

• Thus cosine similarity is itself only a proxy for user happiness

• If we get a list of K docs “close” to the top K by the cosine measure, that should be OK