### CS276

Information Retrieval and Web Search

Pandu Nayak and Prabhakar Raghavan

Lecture 7: Scoring and results assembly

### Lecture 6 – I introduced a bug

• In my anxiety to avoid taking the log of zero, I rewrote w_{t,d} = 1 + log₁₀ tf_{t,d} (for tf_{t,d} > 0; 0 otherwise) as w_{t,d} = log₁₀(1 + tf_{t,d})

• In fact this was unnecessary, since the zero case is treated specially in the definition; net, the FIRST version is right.

### Recap: tf-idf weighting

• The tf-idf weight of a term is the product of its tf weight and its idf weight.

w_{t,d} = (1 + log₁₀ tf_{t,d}) × log₁₀(N/df_t)

• Best known weighting scheme in information retrieval

• Increases with the number of occurrences within a document

• Increases with the rarity of the term in the collection
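A minimal sketch of this weighting in Python (the function name and toy numbers are illustrative, not from the lecture):

```python
import math

def tf_idf_weight(tf: int, df: int, N: int) -> float:
    """w_{t,d} = (1 + log10 tf_{t,d}) * log10(N / df_t); 0 if the term is absent."""
    if tf == 0:
        return 0.0  # the zero case is handled specially, per the Lecture 6 fix
    return (1.0 + math.log10(tf)) * math.log10(N / df)

# Toy numbers: a term occurring 3 times in a doc, in 100 of 1,000,000 docs.
print(tf_idf_weight(tf=3, df=100, N=1_000_000))  # (1 + log10 3) * 4 ≈ 5.91
```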

### Recap: Queries as vectors

• Key idea 1: Do the same for queries: represent them as vectors in the space

• Key idea 2: Rank documents according to their proximity to the query in this space

• proximity = similarity of vectors

### Recap: cosine(query,document)

• cos(q,d) is the cosine similarity of q and d … or, equivalently, the cosine of the angle between q and d.
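As a sketch, cosine similarity over sparse term→weight vectors (the dict representation is an assumption for illustration):

```python
import math

def cosine(q: dict[str, float], d: dict[str, float]) -> float:
    """cos(q, d) = q·d / (|q| |d|) over sparse term -> weight vectors."""
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0
```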

### This lecture

• Speeding up vector space ranking

• Putting together a complete search system

• Will require learning about a number of miscellaneous topics and heuristics

### Efficient cosine ranking

• Find the K docs in the collection “nearest” to the query ⇒ K largest query-doc cosines.

• Efficient ranking:

• Computing a single cosine efficiently.

• Choosing the K largest cosine values efficiently.

• Can we do this without computing all N cosines?

### Efficient cosine ranking

• What we’re doing in effect: solving the K-nearest neighbor problem for a query vector

• In general, we do not know how to do this efficiently for high-dimensional spaces

• But it is solvable for short queries, and standard indexes support this well

### Special case – unweighted queries

• No weighting on query terms

• Assume each query term occurs only once

• Then for ranking, don’t need to normalize query vector

• Slight simplification of algorithm from Lecture 6

### Computing the K largest cosines: selection vs. sorting

• Typically we want to retrieve the top K docs (in the cosine ranking for the query)

• not to totally order all docs in the collection

• Can we pick off docs with K highest cosines?

• Let J = number of docs with nonzero cosines

• We seek the K best of these J

### Use heap for selecting top K

• Binary tree in which each node’s value > the values of its children (a max-heap)

• Takes 2J operations to construct, then each of K “winners” read off in 2log J steps.

• For J=1M, K=100, this is about 10% of the cost of sorting.
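In code one would typically reach for a library heap; a sketch with Python’s heapq (which keeps a size-K min-heap internally — a mirror image of the max-heap described, with the same selection-beats-sorting payoff):

```python
import heapq

def select_top_k(scores: dict[int, float], k: int) -> list[tuple[float, int]]:
    """Pick the K highest-scoring docs without totally ordering all J of them."""
    return heapq.nlargest(k, ((score, doc) for doc, score in scores.items()))

# Usage: select_top_k({1: 0.9, 2: 0.4, 3: 0.7, 4: 0.95}, k=2)
# -> [(0.95, 4), (0.9, 1)]
```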

### Bottlenecks

• Primary computational bottleneck in scoring: cosine computation

• Can we avoid all this computation?

• Yes, but may sometimes get it wrong

• a doc not in the top K may creep into the list of K output docs

• Is this such a bad thing?

### Cosine similarity is only a proxy

• User has a task and a query formulation

• Cosine matches docs to query

• Thus cosine is in any case only a proxy for user happiness

• If we get a list of K docs “close” to the top K by cosine measure, should be ok

### Generic approach

• Find a set A of contenders, with K < |A| << N

• A does not necessarily contain the top K, but has many docs from among the top K

• Return the top K docs in A

• Think of A as pruning non-contenders

• The same approach is also used for other (non-cosine) scoring functions

• Will look at several schemes following this approach

### Index elimination

• The basic cosine computation algorithm only considers docs containing at least one query term

• Take this further:

• Only consider high-idf query terms

• Only consider docs containing many query terms

### High-idf query terms only

• For a query such as catcher in the rye

• Only accumulate scores from catcher and rye

• Intuition: in and the contribute little to the scores and so don’t alter rank-ordering much

• Benefit:

• Postings of low-idf terms have many docs ⇒ these (many) docs get eliminated from set A of contenders

### Docs containing many query terms

• Any doc with at least one query term is a candidate for the top K output list

• For multi-term queries, only compute scores for docs containing several of the query terms

• Say, at least 3 out of 4

• Imposes a “soft conjunction” on queries seen on web search engines (early Google)

• Easy to implement in postings traversal
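A sketch of this soft conjunction in a term-at-a-time postings traversal (the postings dict shape is an assumption):

```python
from collections import Counter

def candidate_docs(postings: dict[str, list[int]],
                   query_terms: list[str], min_match: int) -> set[int]:
    """Keep only docs containing at least min_match of the query terms."""
    counts: Counter = Counter()
    for t in query_terms:
        counts.update(set(postings.get(t, [])))
    return {doc for doc, c in counts.items() if c >= min_match}

# min_match=3 on a 4-term query gives the "3 of 4" case on the next slide.
```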

### 3 of 4 query terms

• In the example postings (figure omitted), scores are only computed for docs 8, 16 and 32 – the docs containing at least 3 of the 4 query terms.

### Champion lists

• Precompute for each dictionary term t, the r docs of highest weight in t’s postings

• Call this the champion list for t

• (aka fancy list or top docs for t)

• Note that r has to be chosen at index build time

• Thus, it’s possible that r < K

• At query time, only compute scores for docs in the champion list of some query term

• Pick the K top-scoring docs from amongst these
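A sketch of both halves — index-time construction and query-time candidate selection (the weights dict maps term → {docID: weight} and is an assumption for illustration):

```python
import heapq

def build_champion_lists(weights: dict[str, dict[int, float]],
                         r: int) -> dict[str, list[int]]:
    """At index build time: for each term, keep the r docs of highest weight."""
    return {t: [doc for _, doc in
                heapq.nlargest(r, ((w, d) for d, w in docs.items()))]
            for t, docs in weights.items()}

def champion_candidates(champions: dict[str, list[int]],
                        query_terms: list[str]) -> set[int]:
    """At query time: score only docs on some query term's champion list."""
    return {d for t in query_terms for d in champions.get(t, [])}
```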

### Exercises

• How do Champion Lists relate to Index Elimination? Can they be used together?

• How can Champion Lists be implemented in an inverted index?

• Note that the champion list has nothing to do with small docIDs

### Static quality scores

• We want top-ranking documents to be both relevant and authoritative

• Relevance is being modeled by cosine scores

• Authority is typically a query-independent property of a document

• Examples of authority signals

• Wikipedia among websites

• Articles in certain newspapers

• A paper with many citations

• Many bit.ly links, Diggs or del.icio.us bookmarks

• (PageRank)

### Modeling authority

• Assign a query-independent quality score in [0,1] to each document d

• Denote this by g(d)

• Thus, a quantity like the number of citations is scaled into [0,1]

• Exercise: suggest a formula for this.

### Net score

• Consider a simple total score combining cosine relevance and authority

• net-score(q,d) = g(d) + cosine(q,d)

• Can use some other linear combination

• Indeed, any function of the two “signals” of user happiness – more later

• Now we seek the top K docs by net score

### Top K by net score – fast methods

• First idea: Order all postings by g(d)

• Key: this is a common ordering for all postings

• Thus, can concurrently traverse query terms’ postings for

• Postings intersection

• Cosine score computation

• Exercise: write pseudocode for cosine score computation if postings are ordered by g(d)

### Why order postings by g(d)?

• Under g(d)-ordering, top-scoring docs likely to appear early in postings traversal

• In time-bound applications (say, we have to return whatever search results we can in 50 ms), this allows us to stop postings traversal early

• Short of computing scores for all docs in postings
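A hedged sketch of the exercise above: term-at-a-time scoring where every postings list shares the g(d) order, with a per-list budget standing in for the wall-clock cutoff (all data shapes here are assumptions):

```python
import heapq

def top_k_net_score(postings: dict[str, list[tuple[int, float]]],
                    g: dict[int, float], query_terms: list[str],
                    k: int, budget: int) -> list[tuple[float, int]]:
    """Postings hold (doc, tf-idf) pairs sorted by decreasing g(d) -- a common
    order for all terms, so high-quality docs are reached first."""
    acc: dict[int, float] = {}
    for t in query_terms:
        for doc, w in postings.get(t, [])[:budget]:  # stop early down each list
            acc[doc] = acc.get(doc, 0.0) + w         # cosine contribution
    return heapq.nlargest(k, ((g[doc] + s, doc) for doc, s in acc.items()))
```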

### Champion lists in g(d)-ordering

• Can combine champion lists with g(d)-ordering

• Maintain for each term a champion list of the r docs with highest g(d) + tf-idf_{t,d}

• Seek top-K results from only the docs in these champion lists

### High and low lists

• For each term, we maintain two postings lists called high and low

• Think of high as the champion list

• When traversing postings on a query, traverse high lists first

• If we get more than K docs, select the top K and stop

• Else proceed to get docs from the low lists

• Can be used even for simple cosine scores, without global quality g(d)

• A means for segmenting index into two tiers

### Impact-ordered postings

• We only want to compute scores for docs for which wf_{t,d} is high enough

• We sort each postings list by wf_{t,d}

• Now: not all postings in a common order!

• How do we compute scores in order to pick off top K?

• Two ideas follow

### 1. Early termination

• When traversing t’s postings, stop early after either

• a fixed number r of docs

• wf_{t,d} drops below some threshold

• Take the union of the resulting sets of docs

• One from the postings of each query term

• Compute only the scores for docs in this union
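A sketch of this early termination over impact-ordered postings (shapes and parameters are assumptions):

```python
def early_termination_union(postings: dict[str, list[tuple[int, float]]],
                            query_terms: list[str],
                            r: int, threshold: float) -> set[int]:
    """Each list is sorted by decreasing wf_{t,d}: stop after r docs or once
    the weight drops below the threshold, then union the surviving docs."""
    union: set[int] = set()
    for t in query_terms:
        for i, (doc, wf) in enumerate(postings.get(t, [])):
            if i >= r or wf < threshold:
                break
            union.add(doc)
    return union  # full scores are computed only for docs in this union
```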

### 2. idf-ordered terms

• When considering the postings of query terms

• Look at them in order of decreasing idf

• High idf terms likely to contribute most to score

• As we update score contribution from each query term

• Stop if doc scores relatively unchanged

• Can apply to cosine or some other net scores
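A sketch of idf-ordered accumulation; the aggregate-delta stopping test below is one plausible reading of “scores relatively unchanged”, not the lecture’s definition:

```python
def idf_ordered_scores(postings: dict[str, list[tuple[int, float]]],
                       idf: dict[str, float], query_terms: list[str],
                       epsilon: float) -> dict[int, float]:
    """Process query terms in decreasing-idf order; once a term's postings
    move the accumulated scores by less than epsilon, stop."""
    scores: dict[int, float] = {}
    for t in sorted(query_terms, key=lambda term: -idf[term]):
        delta = 0.0
        for doc, w in postings.get(t, []):
            scores[doc] = scores.get(doc, 0.0) + w
            delta += w
        if delta < epsilon:  # the low-idf tail barely affects the ranking
            break
    return scores
```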

### Cluster pruning: preprocessing

• Pick √N docs at random: call these leaders

• For every other doc, pre-compute nearest leader

• Docs attached to a leader: its followers;

• Likely: each leader has ~ √N followers.

### Cluster pruning: query processing

• Process a query as follows:

• Given query Q, find its nearest leader L.

• Seek K nearest docs from among L’s followers.

• Fast
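A sketch of both phases, reusing the cosine function sketched earlier; the random leader choice and dict-of-vectors shape are assumptions:

```python
import random

def cluster_preprocess(docs, cosine):
    """Pick sqrt(N) docs at random as leaders; attach every doc to its nearest
    leader (a leader is its own nearest, so it joins its own follower list)."""
    leaders = random.sample(list(docs), int(len(docs) ** 0.5))
    followers = {L: [] for L in leaders}
    for d in docs:
        nearest = max(leaders, key=lambda L: cosine(docs[d], docs[L]))
        followers[nearest].append(d)
    return followers

def cluster_query(q, docs, followers, cosine, k):
    """Find the query's nearest leader, then the K nearest of its followers."""
    L = max(followers, key=lambda lead: cosine(q, docs[lead]))
    return sorted(followers[L], key=lambda d: cosine(q, docs[d]), reverse=True)[:k]
```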

### General variants

• Have each follower attached to b₁=3 (say) nearest leaders.

• From query, find b₂=4 (say) nearest leaders and their followers.

• Can recurse on leader/follower construction.

### Exercises

• To find the nearest leader in step 1, how many cosine computations do we do?

• Why did we have √N in the first place?

• What is the effect of the constants b₁, b₂ on the previous slide?

• Devise an example where this is likely to fail – i.e., we miss one of the K nearest docs.

• Likely under random sampling.

### Parametric and zone indexes

• Thus far, a doc has been a sequence of terms

• In fact documents have multiple parts, some with special semantics:
• Author
• Title
• Date of publication
• Language
• Format
• etc.

### Fields

• We sometimes wish to search by these metadata

• E.g., find docs authored by William Shakespeare in the year 1601, containing alas poor Yorick

• Year = 1601 is an example of a field

• Also, author last name = shakespeare, etc.

• Field or parametric index: postings for each field value

• Sometimes build range trees (e.g., for dates)

• Field query typically treated as conjunction

• (doc must be authored by shakespeare)

### Zone

• A zone is a region of the doc that can contain an arbitrary amount of text, e.g.,

• Title

• Abstract

• References …

• Build inverted indexes on zones as well to permit querying

• E.g., “find docs with merchant in the title zone and matching the query gentle rain”

### Example zone indexes

Encode zones in dictionary vs. postings.
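A minimal illustration of the trade-off, with hypothetical docIDs:

```python
# (a) Zones encoded in the dictionary: one term per zone.
zones_in_dictionary = {
    "william.author": [2, 5],
    "william.title": [2, 4, 8],
}

# (b) Zones encoded in the postings: one term, zone-tagged postings.
#     Fewer dictionary entries, at the cost of fatter postings.
zones_in_postings = {
    "william": [(2, "author"), (2, "title"), (4, "title"),
                (5, "author"), (8, "title")],
}
```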

### Tiered indexes

• Break postings up into a hierarchy of lists

• Most important

• Least important

• Can be done by g(d) or another measure

• Inverted index thus broken up into tiers of decreasing importance

• At query time use top tier unless it fails to yield K docs

• If so drop to lower tiers
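A sketch of tier-by-tier query processing (tiers ordered from most to least important; the score callable is an assumption):

```python
def tiered_search(tiers: list[dict[str, list[int]]], query_terms: list[str],
                  score, k: int) -> list[int]:
    """Use the top tier unless it fails to yield K docs; if so, drop lower."""
    seen: set[int] = set()
    for tier in tiers:  # decreasing importance
        for t in query_terms:
            seen.update(tier.get(t, []))
        if len(seen) >= k:
            break
    return sorted(seen, key=score, reverse=True)[:k]
```

With two tiers this is exactly the high/low lists scheme above.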

### Query term proximity

• Free text queries: just a set of terms typed into the query box – common on the web

• Users prefer docs in which query terms occur within close proximity of each other

• Let w be the smallest window in a doc containing all query terms, e.g.,

• For the query “strained mercy” the smallest window in the doc “The quality of mercy is not strained” is 4 (words)

• Would like scoring function to take this into account – how?
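One way to compute w, sketched here over each query term’s sorted positions within a single doc (the positional-postings shape is an assumption):

```python
def smallest_window(positions: dict[str, list[int]]) -> float:
    """Smallest window (in words) covering one occurrence of every query term."""
    if any(not p for p in positions.values()):
        return float("inf")  # some query term is absent from the doc
    heads = {t: 0 for t in positions}  # a cursor into each term's position list
    best = float("inf")
    while True:
        current = {t: positions[t][i] for t, i in heads.items()}
        lo = min(current, key=current.get)
        best = min(best, max(current.values()) - current[lo] + 1)
        heads[lo] += 1  # advance the leftmost occurrence and retry
        if heads[lo] >= len(positions[lo]):
            return best

# For "strained mercy" in "The quality of mercy is not strained"
# (mercy at position 4, strained at 7): smallest_window(...) == 4.
```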

### Query parsers

• Free text query from user may in fact spawn one or more queries to the indexes, e.g., query rising interest rates

• Run the query as a phrase query

• If <K docs contain the phrase rising interest rates, run the two phrase queries rising interest and interest rates

• If we still have <K docs, run the vector space query rising interest rates

• Rank matching docs by vector space scoring

• This sequence is issued by a query parser
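A sketch of that cascade; phrase_search and vector_search stand in for the real index calls and are assumptions:

```python
def parse_and_run(query: str, phrase_search, vector_search, k: int) -> list[int]:
    """Full phrase first, then overlapping bigram phrases, then vector space."""
    terms = query.split()
    results = list(phrase_search(terms))          # e.g. "rising interest rates"
    if len(results) < k:
        for i in range(len(terms) - 1):           # "rising interest", "interest rates"
            results += phrase_search(terms[i:i + 2])
    if len(results) < k:
        results += vector_search(terms)           # vector space fallback
    # dedupe, keeping the first (highest-priority) occurrence of each doc
    seen: set[int] = set()
    return [d for d in results if not (d in seen or seen.add(d))][:k]
```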

### Aggregate scores

• We’ve seen that score functions can combine cosine, static quality, proximity, etc.

• How do we know the best combination?

• Some applications – expert-tuned

• Increasingly common: machine-learned

• See May 19th lecture

### Resources

• IIR Chapter 7, Section 6.1