Thus far, our queries have all been Boolean.
Documents either match or don’t.
Good for expert users with a precise understanding of their needs and of the collection.
Also good for applications, which can easily consume 1000s of results.
Not good for the majority of users.
Most users are incapable of writing Boolean queries (or they are capable, but think it’s too much work).
Most users don’t want to wade through 1000s of results.
This is particularly true of web search.
Problem with Boolean search:
feast or famine
Boolean queries often result in either too few (=0) or too many (1000s) results.
Query 1: “standard user dlink 650” → 200,000 hits
Query 2: “standard user dlink 650 no card found” → 0 hits
It takes a lot of skill to come up with a query that produces a manageable number of hits.
AND gives too few; OR gives too many
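The feast-or-famine effect can be reproduced on a toy collection. The sketch below is illustrative only: the four documents, their IDs, and the helper names build_index, boolean_and and boolean_or are made up for this example, and tokenization is plain whitespace splitting.

```python
# Toy collection; document IDs and texts are invented for illustration.
docs = {
    1: "standard user guide for the dlink 650 card",
    2: "dlink 650 driver reports no card found",
    3: "standard user account setup",
    4: "card found in the parking lot",
}


def build_index(docs):
    """Map each term to the set of document IDs that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for term in text.split():
            index.setdefault(term, set()).add(doc_id)
    return index


def boolean_and(index, terms):
    """Documents containing every query term."""
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()


def boolean_or(index, terms):
    """Documents containing at least one query term."""
    postings = [index.get(t, set()) for t in terms]
    return set.union(*postings) if postings else set()


index = build_index(docs)
query = "standard user dlink 650 no card found".split()

and_hits = boolean_and(index, query)  # famine: no document has all seven terms
or_hits = boolean_or(index, query)    # feast: every document has at least one term

print(len(and_hits), len(or_hits))
```

On this collection the conjunctive query matches nothing while the disjunctive query matches every document, which is the same pattern as Query 1 and Query 2 at a much smaller scale.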
Ranked retrieval models
In ranked retrieval, the system returns an ordering over the (top) documents in the collection for a query, rather than a set of documents satisfying a query expression
Free text queries: Rather than a query language of operators and expressions, the user’s query is just one or more words in a human language
In principle, there are two separate choices here, but in practice, ranked retrieval has normally been associated with free text queries and vice versa
Feast or famine: not a problem in ranked retrieval
When a system produces a ranked result set, large result sets are not an issue
Indeed, the size of the result set is not an issue
We just show the top k (≈ 10) results
We don’t overwhelm the user
Premise: the ranking algorithm works
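Why result-set size stops mattering can be sketched in a few lines. The example assumes that some scoring step has already produced a score per document. The scores here are synthetic placeholders, and top_k is a hypothetical helper name, not part of any particular system.

```python
import heapq


def top_k(scores, k=10):
    """Return the k highest-scoring (doc_id, score) pairs, best first."""
    return heapq.nlargest(k, scores.items(), key=lambda pair: pair[1])


# Synthetic scores for 200,000 matching documents (placeholder values only).
scores = {doc_id: 1.0 / (doc_id + 1) for doc_id in range(200_000)}

results = top_k(scores, k=10)
for doc_id, score in results:
    print(doc_id, round(score, 3))
```

However many documents match, the user sees only k of them. Whether those k are the right ones depends entirely on the quality of the scores.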
Scoring as the basis of ranked retrieval
We wish to return in order the documents most likely to be useful to the searcher
How can we rank-order the documents in the collection with respect to a query?
Assign a score – say in [0, 1] – to each document
This score measures how well document and query “match”.
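As one possible illustration of such a score, the sketch below uses the fraction of distinct query terms that appear in a document, which always falls in [0, 1]. This measure is an assumption chosen for simplicity, not the scoring function the text goes on to adopt. The documents and the names overlap_score and rank are hypothetical.

```python
def overlap_score(query, document):
    """Fraction of distinct query terms that occur in the document (in [0, 1])."""
    query_terms = set(query.lower().split())
    doc_terms = set(document.lower().split())
    if not query_terms:
        return 0.0
    return len(query_terms & doc_terms) / len(query_terms)


def rank(query, docs):
    """Return (doc_id, score) pairs sorted by descending score."""
    scored = [(doc_id, overlap_score(query, text)) for doc_id, text in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Invented documents for illustration.
docs = {
    "d1": "dlink 650 driver reports no card found",
    "d2": "standard user guide for the dlink 650",
    "d3": "card found in the parking lot",
}

ranking = rank("standard user dlink 650 no card found", docs)
for doc_id, score in ranking:
    print(doc_id, round(score, 2))
```

Unlike the Boolean AND, which returns nothing for this query, every document receives a graded score, so partial matches are still returned and ordered by how much of the query they cover.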