Computing cosine scores

Efficient cosine ranking

  • Find the K docs in the collection “nearest” to the query ⇒ K largest query-doc cosines.

  • Efficient ranking:

    • Computing a single cosine efficiently.

    • Choosing the K largest cosine values efficiently.

      • Can we do this without computing all N cosines?

Efficient cosine ranking

  • What we’re doing in effect: solving the K-nearest neighbor problem for a query vector

  • In general, we do not know how to do this efficiently for high-dimensional spaces

  • But it is solvable for short queries, and standard indexes support this well

Special case – unweighted queries

  • No weighting on query terms

    • Assume each query term occurs only once

  • Then for ranking, don’t need to normalize query vector

    • Slight simplification of algorithm from Lecture 6

Computing the K largest cosines: selection vs. sorting

  • Typically we want to retrieve the top K docs (in the cosine ranking for the query)

    • not to totally order all docs in the collection

  • Can we pick off docs with K highest cosines?

  • Let J = number of docs with nonzero cosines

    • We seek the K best of these J

Use heap for selecting top K

  • Binary tree in which each node’s value > the values of children

  • Takes 2J operations to construct, then each of K “winners” read off in 2log J steps.

  • For J=1M, K=100, this is about 10% of the cost of sorting.

Bottlenecks

  • Primary computational bottleneck in scoring: cosine computation

  • Can we avoid all this computation?

  • Yes, but may sometimes get it wrong

    • a doc not in the top K may creep into the list of K output docs

    • Is this such a bad thing?

Cosine similarity is only a proxy

  • User has a task and a query formulation

  • Cosine matches docs to query

  • Thus cosine is anyway a proxy for user happiness

  • If we get a list of K docs “close” to the top K by cosine measure, should be ok