Ranked retrieval

  • Thus far, our queries have all been Boolean.

    • Documents either match or don’t.

  • Good for expert users with precise understanding of their needs and the collection.

    • Also good for applications: Applications can easily consume 1000s of results.

  • Not good for the majority of users.

    • Most users are incapable of writing Boolean queries (or they are capable, but think it’s too much work).

    • Most users don’t want to wade through 1000s of results.

      • This is particularly true of web search.

Problem with Boolean search:

  • feast or famine

  • Boolean queries often result in either too few (=0) or too many (1000s) results.

  • Query 1: “standard user dlink 650” → 200,000 hits

  • Query 2: “standard user dlink 650 no card found” → 0 hits

  • It takes a lot of skill to come up with a query that produces a manageable number of hits.

    • AND gives too few; OR gives too many
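
The feast-or-famine effect can be reproduced on a toy collection. The sketch below is only an illustration: the four documents, the whitespace tokenizer, and the helper names `build_index`, `boolean_and`, and `boolean_or` are assumptions made for this example, not part of the original slides.

```python
# Toy collection; the document texts are invented for illustration.
docs = {
    1: "standard user guide for dlink 650 router",
    2: "dlink 650 card installation",
    3: "no card found error on standard laptop",
    4: "user forum for network cards",
}

def build_index(docs):
    """Map each term to the set of document IDs that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for term in set(text.split()):
            index.setdefault(term, set()).add(doc_id)
    return index

def boolean_and(index, terms):
    """Documents containing every query term."""
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()

def boolean_or(index, terms):
    """Documents containing at least one query term."""
    return set().union(*(index.get(t, set()) for t in terms))

index = build_index(docs)
query = "standard user dlink 650 no card found".split()

and_hits = boolean_and(index, query)  # no document has all seven terms
or_hits = boolean_or(index, query)    # every document has at least one term

print("AND:", sorted(and_hits))
print("OR: ", sorted(or_hits))
```

On this collection the conjunctive query matches nothing, while the disjunctive query matches everything, which mirrors the two extremes described above.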

Ranked retrieval models

  • Rather than returning a set of documents that satisfy a query expression, a ranked retrieval system returns an ordering over the (top) documents in the collection for a query

  • Free text queries: Rather than a query language of operators and expressions, the user’s query is just one or more words in a human language

  • In principle, there are two separate choices here, but in practice, ranked retrieval has normally been associated with free text queries and vice versa

Feast or famine: not a problem in ranked retrieval

  • When a system produces a ranked result set, large result sets are not an issue

    • Indeed, the size of the result set is not an issue

    • We just show the top k (≈ 10) results

    • We don’t overwhelm the user

    • Premise: the ranking algorithm works
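
Showing only the top k results can be sketched with a heap-based selection, so that the full result set never needs to be sorted. The score values and document IDs below are invented for illustration, and `top_k` is a hypothetical helper name.

```python
import heapq

def top_k(scores, k=10):
    """Return the k highest-scoring (doc_id, score) pairs, best first."""
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])

# Hypothetical scores produced by some ranking function.
scores = {"d1": 0.12, "d2": 0.87, "d3": 0.45, "d4": 0.91, "d5": 0.30}

best = top_k(scores, k=3)
for doc_id, score in best:
    print(doc_id, score)
```

However many documents receive a score, the user sees only k of them, which is why the size of the result set stops being a concern as long as the ranking itself is good.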

Scoring as the basis of ranked retrieval

  • We wish to return in order the documents most likely to be useful to the searcher

  • How can we rank-order the documents in the collection with respect to a query?

  • Assign a score – say in [0, 1] – to each document

  • This score measures how well document and query “match”.
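
One simple way to obtain a score in [0, 1] is the fraction of query terms that appear in the document. This particular measure is an assumption chosen for illustration, not the scoring function the slides prescribe; the names `overlap_score` and `rank` and the sample documents are likewise hypothetical.

```python
def overlap_score(query, document):
    """Fraction of distinct query terms found in the document, in [0, 1]."""
    q_terms = set(query.lower().split())
    d_terms = set(document.lower().split())
    if not q_terms:
        return 0.0
    return len(q_terms & d_terms) / len(q_terms)

def rank(query, docs):
    """Score every document and return (doc_id, score) pairs, best first."""
    scored = [(doc_id, overlap_score(query, text)) for doc_id, text in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Invented sample documents.
docs = {
    "a": "dlink 650 card installation",
    "b": "standard user guide for dlink 650",
    "c": "network card drivers",
    "d": "cooking recipes",
}

ranking = rank("dlink 650 card", docs)
for doc_id, score in ranking:
    print(doc_id, round(score, 2))
```

A document containing all query terms scores 1, one containing none scores 0, and partial matches fall in between, so documents can be ordered even when no document satisfies a strict AND.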