Introduction to Information Retrieval


Information Retrieval and Web Search

Pandu Nayak and Prabhakar Raghavan

Lecture 1: Boolean retrieval

Information Retrieval

  • Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers)

new slide

Unstructured (text) vs. structured (database) data in 1996


Unstructured (text) vs. structured (database) data in 2009

Unstructured data in 1680

  • Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?
  • One could grep all of Shakespeare’s plays for Brutus and Caesar, then strip out lines containing Calpurnia?
  • Why is that not the answer?
    • Slow (for large corpora)
    • NOT Calpurnia is non-trivial
    • Other operations (e.g., find the word Romans near countrymen) not feasible
    • Ranked retrieval (best documents to return)
      • Later lectures

Term-document incidence

 Brutus AND Caesar BUT NOT                     1 if play contains word

   Calpurnia                                                                 0 otherwise  


Incidence vectors

  • So we have a 0/1 vector for each term.

  • To answer query: take the vectors for Brutus, Caesar and Calpurnia (complemented)  → bitwise AND.

  • 110100 AND 110111 AND 101111 = 100100.

Incidence vectors

  • So we have a 0/1 vector for each term.

  • To answer query: take the vectors for Brutus, Caesar and Calpurnia (complemented)  → bitwise AND.

  • 110100 AND 110111 AND 101111 = 100100.

Basic assumptions of Information Retrieval

  • Collection: Fixed set of documents
  • Goal: Retrieve documents with information that is relevant to the user’s information need and helps the user complete a task.

The classic search model

How good are the retrieved docs?

  • Precision : Fraction of retrieved docs that are relevant to user’s information need

  • Recall : Fraction of relevant docs in collection that are retrieved

  • More precise definitions and measurements to follow in later lectures

Bigger collections

  • Consider N = 1 million documents, each with about 1000 words.

  • Avg 6 bytes/word including spaces/punctuation

    • 6GB of data in the documents.

  • Say there are M = 500K distinct terms among these.

Can’t build the matrix

  • 500K x 1M matrix has half-a-trillion 0’s and 1’s.

  • But it has no more than one billion 1’s.                               Why? 

    • matrix is extremely sparse.

  • What’s a better representation?

    • We only record the 1 positions.

Inverted index

  • For each term t, we must store a list of all documents that contain t.
      • Identify each by a docID, a document serial number
  • Can we use fixed-size arrays for this?
Brutus124113145 173 174 
Caesar 124 5 616 57 32
Calpurnia 23154 101  


                     What happens if the word Caesar is added to document 14? 

Inverted index

  • We need variable-size postings lists

    • On disk, a continuous run of postings is normal and best

    • In memory, can use linked lists or variable length arrays

      • Some tradeoffs in size/ease of insertion                         Posting(below)

Brutus124113145 173 174 
Caesar 124 5 616 57 32
Calpurnia 23154 101 



Brutus , Caeser and Calpurnia are dictionaries.

Numbers are postings.

Sorted by docID (more later on why).

Inverted index construction

Indexer steps: Token sequence

Indexer steps: Sort

Indexer steps: Dictionary & Postings

Where do we pay in storage?

The index we just built

  • How do we process a query?

    • Later - what kinds of queries can we process?                ← Today’s focus 

Query processing: AND

  • Consider processing the query: Brutus AND Caesar

    • Locate Brutus in the Dictionary;

      • Retrieve its postings.

    • Locate Caesar in the Dictionary;

      • Retrieve its postings.

    • Merge” the two postings:

The merge

  • Walk through the two postings simultaneously, in time linear in the total number of postings entries

Intersecting two postings lists (a "merge algorithm")

Query optimization

  • What is the best order for query processing?

  • Consider a query that is an AND of n terms.

  • For each of the n terms, get its postings, then AND them together.

Brutus248163264 128 
Caesar 123 5 816 2134
Calpurnia 1316  


                           Query: Brutus AND Calpurnia AND Caesar

Query optimization example

  • Process in order of increasing freq:

    • start with smallest set, then keep cutting further.


      (This is why we kept document freq. in dictionary)

Brutus248163264 128 
Caesar 123 5 816 2134
Calpurnia 1316  


              Execute the query as (Calpurnia AND Brutus) AND Caesar.

Boolean queries: Exact match

  • The Boolean retrieval model is being able to ask a query that is a Boolean expression:
    • Boolean Queries use AND, OR and NOT to join query terms

      • Views each document as a set of words

      • Is precise: document matches condition or not.

    • Perhaps the simplest model to build an IR system on

  • Primary commercial retrieval tool for 3 decades.

  • Many search systems you still use are Boolean:

    • Email, library catalog, Mac OS X Spotlight

Boolean queries: More General Merges

  • Exercise: Adapt the merge for the queries:

  • Brutus AND NOT Caesar

  • Brutus OR NOT Caesar

    Can we still run through the merge in time O(x+y)?

    What can we achieve?

Boolean queries: More General Merges

  • Exercise: Adapt the merge for the queries:

  • Brutus AND NOT Caesar

  • Brutus OR NOT Caesar

    Can we still run through the merge in time O(x+y)?

    What can we achieve?

Example: WestLaw

  • Largest commercial (paying subscribers) legal search service (started 1975; ranking added 1992)

  • Tens of terabytes of data; 700,000 users

  • Majority of users still use boolean queries

  • Example query:

    • What is the statute of limitations in cases involving the federal tort claims act?


      • /3 = within 3 words, /S = in same sentence

Example: WestLaw

  • Another example query:

    • Requirements for disabled people to be able to access a workplace

    • disabl! /p access! /s work-site work-place (employment /3 place)

  • Note that SPACE is disjunction, not conjunction!

  • Long, precise queries; proximity operators; incrementally developed; not like web search

  • Many professional searchers still like Boolean search

    • You know exactly what you are getting

  • But that doesn’t mean it actually works better….

new slide

More general optimization

  • e.g., (madding OR crowd) AND (ignoble OR strife)

  • Get doc. freq.’s for all terms.

  • Estimate the size of each OR by the sum of its doc. freq.’s (conservative).

  • Process in increasing order of OR sizes.


  • Recommend a query processing order for
  • (tangerine OR trees) AND

  • (marmalade OR skies) AND

  • (kaleidoscope OR eyes)

    Term                            Freq

    eyes                              213312

    kaleidoscope                 87009

    marmalade                   107913

    skies                             271658

    tangerine                       46653

    trees                             316812

Query processing exercises

  • Exercise: If the query is friends AND romans AND (NOT countrymen), how could we use the freq of countrymen?

  • Exercise: Extend the merge to an arbitrary Boolean query. Can we always guarantee execution in time linear in the total postings size?

  • Hint: Begin with the case of a Boolean formula query where each term appears only once in the query.


  • Write down five search features you think it could do better

What’s ahead in IR?

  • Beyond term search

  • What about phrases?

    • Stanford University

  • Proximity: Find Gates NEAR Microsoft.

    • Need index to capture position information in docs.

  • Zones in documents: Find documents with
    (author = Ullman) AND (text contains automata).

Evidence accumulation

  • 1 vs. 0 occurrence of a search term

    • 2 vs. 1 occurrence

    • 3 vs. 2 occurrences, etc.

    • Usually more seems better

  • Need term frequency information in docs

Ranking search results

  • Boolean queries give inclusion or exclusion of docs.

  • Often we want to rank/group results

    • Need to measure proximity from query to each doc.

    • Need to decide whether docs presented to user are singletons, or a group of docs covering various aspects of the query.

IR vs. databases: Structured vs unstructured data

  • Structured data tends to refer to information in “tables”





  • Typically allows numerical range and exact match

  • (for text) queries, e.g.,

  • Salary < 60000 AND Manager = Smith.

Unstructured data

  • Typically refers to free-form text

  • Allows

    • Keyword queries including operators

    • More sophisticated “concept” queries, e.g.,

      • find all web pages dealing with drug abuse

  • Classic model for searching text documents

Semi-structured data

  • In fact almost no data is “unstructured”

  • E.g., this slide has distinctly identified zones such as the Title and Bullets

  • Facilitates “semi-structured” search such as

    • Title contains data AND Bullets contain search

      … to say nothing of linguistic structure

More sophisticated semi-structured search

  • Title is about Object Oriented Programming AND Author something like stro*rup

  • where * is the wild-card operator

  • Issues:

    • how do you process “about”?

    • how do you rank results?

  • The focus of XML search (IIR chapter 10)

Clustering, classification and ranking

  • Clustering: Given a set of docs, group them into clusters based on their contents.

  • Classification: Given a set of topics, plus a new doc D, decide which topic(s) D belongs to.

  • Ranking: Can we learn how to best order a set of documents, e.g., a set of search results

The web and its challenges

  • Unusual and diverse documents

  • Unusual and diverse users, queries, information needs

  • Beyond terms, exploit ideas from social networks

    • link analysis, clickstreams ...

  • How do search engines work?
    And how can we make them better?

More sophisticated information retrieval

  • Cross-language information retrieval

  • Question answering

  • Summarization

  • Text mining

Resources for today’s lecture

  • Introduction to Information Retrieval, chapter 1

  • Shakespeare:

    • Try the neat browse by keyword sequence feature!

  • Managing Gigabytes, chapter 3.2

  • Modern Information Retrieval, chapter 8.2


    Any questions?

Introduction to Information Retrieval

CS276: Information Retrieval and Web Search 

Pandu Nayak and Prabhakar Raghavan

Lecture 2: The term vocabulary and postings lists

Recap of the previous lecture

  • Basic inverted indexes:

    • Structure: Dictionary and Postings

    • Key step in construction: Sortingli>

  • Boolean query processing

    • Intersection by linear time “merging”

    • Simple optimizations

  • Overview of course topics

Plan for this lecture

Elaborate basic indexing

  • Preprocessing to form the term vocabulary

    • Documents

    • Tokenization

    • What terms do we put in the index?

  • Postings

    • Faster merges: skip lists

    • Positional postings and phrase queries

Recall the basic indexing pipeline

new slide

Parsing a document

  • What format is it in?
    • pdf/word/excel/html?

  • What language is it in?

  • What character set is in use?

     Each of these is a classification problem, which we will study later in the course.

     But these tasks are often done heuristically …

Complications: Format/language

  • Documents being indexed can include docs from many different languages

    • A single index may have to contain terms of several languages.

  • Sometimes a document or its components can contain multiple languages/formats

    • French email with a German pdf attachment.

  • What is a unit document?

    • A file?

    • An email? (Perhaps one of many in an mbox.)

    • An email with 5 attachments?

    • A group of files (PPT or LaTeX as HTML pages)

Token and Terms



  • Input: “Friends, Romans, Countrymen

  • Output: Tokens

    • Friends

    • Romans

    • Countrymen

  • A token is a sequence of characters in a document

  • Each such token is now a candidate for an index entry, after further processing

    • Described below

  • But what are valid tokens to emit?


  • Issues in tokenization:

    • Finland’s capital → 

    • Finland? Finlands? Finland’s?

    • Hewlett-Packard → Hewlett and Packard as two tokens?

      • state-of-the-art: break up hyphenated sequence.

      • co-education

      • lowercase, lower-case, lower case ?

      • It can be effective to get the user to put in possible hyphens

    • San Francisco: one token or two?

      • How do you decide it is one token?


  • 3/12/91                Mar. 12, 1991                     12/3/91

  • 55 B.C.

  • B-52

  • My PGP key is 324a3df234cb23e

  • (800) 234-2333

    • Often have embedded spaces

    • Older IR systems may not index numbers

      • But often very useful: think about things like looking up error codes/stacktraces on the web

      • (One answer is using n-grams: Lecture 3)

    • Will often index “meta-data” separately

      • Creation date, format, etc.

Tokenization: language issues

  • French

    • L'ensemble    one token or two?

      • L ? L’ ? Le ?

      • Want l’ensemble to match with un ensemble

        • Until at least 2003, it didn’t on Google

        • Internationalization!
  • German noun compounds are not segmented

    • Lebensversicherungsgesellschaftsangestellter

    • ‘life insurance company employee’

    • German retrieval systems benefit greatly from a compound splitter module

        • Can give a 15% performance boost for German

Tokenization: language issues

  • Chinese and Japanese have no spaces between words:

    • 莎拉波娃现在居住在美国东南部的佛罗里达。
    • Not always guaranteed a unique tokenization
  • Further complicated in Japanese, with multiple alphabets intermingled

  • Dates/amounts in multiple formats

      End-user can express query entirely in hiragana!

Tokenization: language issues

  • Arabic (or Hebrew) is basically written right to left, but with certain items like numbers written left to right

  • Words are separated, but letter forms within a word form complex ligatures

  •                                        ← →    ← →                                  ← start

  • ‘Algeria achieved its independence in 1962 after 132 years of French occupation.’

  • With Unicode, the surface presentation is complex, but the stored form is straightforward

Stop words

  • With a stop list, you exclude from the dictionary entirely the commonest words. Intuition:

    • They have little semantic content: the, a, and, to, be

    • There are a lot of them: ~30% of postings for top 30 words

  • But the trend is away from doing this:

    • Good compression techniques (lecture 5) means the space for including stopwords in a system is very small

    • Good query optimization techniques (lecture 7) mean you pay little at query time for including stop words.

    • You need them for:

      • Phrase queries: “King of Denmark”

      • Various song titles, etc.: “Let it be”, “To be or not to be”

      • “Relational” queries: “flights to London”

Normalization to terms

  • We need to “normalize” words in indexed text as well as query words into the same form

    • We want to match U.S.A. and USA

  • Result is terms: a term is a (normalized) word type, which is an entry in our IR system dictionary

  • We most commonly implicitly define equivalence classes of terms by, e.g.,

    • deleting periods to form a term

      • U.S.A., USA ⌊ USA

    • deleting hyphens to form a term

      • anti-discriminatory, antidiscriminatory ⌊ antidiscriminatory

Normalization: other languages

  • Accents: e.g., French résumé vs. resume.

  • Umlauts: e.g., German: Tuebingen vs. Tübingen

    • Should be equivalent

  • Most important criterion:

    • How are your users like to write their queries for these words?

  • Even in languages that standardly have accents, users often may not type them

    • Often best to normalize to a de-accented term

      • Tuebingen, Tübingen, Tubingen    Tubingen

Normalization: other languages

  • Normalization of things like date forms

    • 7月30日 vs. 7/30
    • Japanese use of kana vs. Chinese characters
  • Tokenization and normalization may depend on the language and so is intertwined with language detection

  • Crucial: Need to “normalize” indexed text as well as query terms into the same form

Case folding

  • Reduce all letters to lower case

    • exception: upper case in mid-sentence?

      • e.g., General Motors
      • Fed vs. fed
      • SAIL vs. sail
    • Often best to lower case everything, since users will use lowercase regardless of ‘correct’ capitalization
  • Google example:

    • Query C.A.T.

    • #1 result was for “cat” (well, Lolcats) not Caterpillar Inc.

Normalization to terms

  • An alternative to equivalence classing is to do asymmetric expansion

  • An example of where this may be useful

    • Enter: window Search: window, windows

    • Enter: windows Search: Windows, windows, window

    • Enter: Windows Search: Windows

  • Potentially more powerful, but less efficient


  • Reduce inflectional/variant forms to base form

  • E.g.,

    • am, are, is  → be

    • car, cars, car's, cars'    car

  • the boy's cars are different colorsthe boy car be different color

  • Lemmatization implies doing “proper” reduction to dictionary headword form


  • Reduce terms to their “roots” before indexing

  • “Stemming” suggest crude affix chopping

    • language dependent

    • e.g., automate(s), automatic, automation all reduced to automat.

Porter’s algorithm

  • Commonest algorithm for stemming English

    • Results suggest it’s at least as good as other stemming options

  • Conventions + 5 phases of reductions

    • phases applied sequentially

    • each phase consists of a set of commands

    • sample convention: Of the rules in a compound command, select the one that applies to the longest suffix.

Typical rules in Porter

  • sses → ss

  • ies i

  • ational    ate

  • tional    tion

  • Rules sensitive to the measure of words

  • (m>1) EMENT

      • replacementreplac

      • cement cement

Other stemmers

  • Other stemmers exist, e.g., Lovins stemmer


    • Single-pass, longest suffix removal (about 250 rules)

  • Full morphological analysis – at most modest benefits for retrieval

  • Do stemming and other normalizations help?

    • English: very mixed results. Helps recall but harms precision

      • operative (dentistry) ⇒ oper

      • operational (research) ⇒ oper

      • operating (systems) ⇒ oper

    • Definitely useful for Spanish, German, Finnish, …

      • 30% performance gains for Finnish!

new slide

Thesauri and soundex

  • Do we handle synonyms and homonyms?

    • E.g., by hand-constructed equivalence classes

      • car = automobile      color = colour

    • We can rewrite to form equivalence-class terms

      • When the document contains automobile, index it under car-automobile (and vice-versa)

    • Or we can expand a query

      • When the query contains automobile, look under car as well

  • What about spelling mistakes?

    • One approach is soundex, which forms equivalence classes of words based on phonetic heuristics

  • More in lectures 3 and 9


  • Many of the above features embody transformations that are

    • Language-specific and

    • Often, application-specific

  • These are “plug-in” addenda to the indexing process

  • Both open source and commercial plug-ins are available for handling these

Dictionary entries – first cut

Faster postıngs merges:

Faster postıngs merges:
Skıp poınters/Skıp lısts

Recall basic merge

  • Walk through the two postings simultaneously, in time linear in the total number of postings entries

Can we do better? Yes (if index isn’t changing too fast).

Augment postings with skip pointers (at indexing time)

  • Why?

  • To skip postings that will not figure in the search results.

  • How?

  • Where do we place skip pointers?

Query processing with skip pointers

Suppose we’ve stepped through the lists until we process 8 on each list. We match it and advance.

We then have 41 and 11 on the lower. 11 is smaller.

But the skip successor of 11 on the lower list is 31, so we can skip ahead past the intervening postings.

Where do we place skips?

  • Tradeoff:

    • More skips →shorter skip spans ⇒more likely to skip. But lots of comparisons to skip pointers.

    • Fewer skips  few pointer comparison, but then long skip spans    few successful skips.

Placing skips

  • Simple heuristic: for postings of length L, use √L evenly-spaced skip pointers.

  • This ignores the distribution of query terms.

  • Easy if the index is relatively static; harder if L keeps changing because of updates.

  • This definitely used to help; with modern hardware it may not (Bahle et al. 2002) unless you’re memory-based

    • The I/O cost of loading a bigger postings list can outweigh the gains from quicker in memory merging!

Phrase queries and positiona...

Phrase querıes and posıtıonal ındexes

Phrase queries

  • Want to be able to answer queries such as “stanford university” – as a phrase

  • Thus the sentence “I went to university at Stanford” is not a match.

    • The concept of phrase queries has proven easily understood by users; one of the few “advanced search” ideas that works

    • Many more queries are implicit phrase queries

  • For this, it no longer suffices to store only

  • <term : docs> entries

A first attempt: Biword indexes

  • Index every consecutive pair of terms in the text as a phrase

  • For example the text “Friends, Romans, Countrymen” would generate the biwords

    • friends romans

    • romans countrymen

  • Each of these biwords is now a dictionary term

  • Two-word phrase query-processing is now immediate.

Longer phrase queries

  • Longer phrases are processed as we did with wild-cards:

  • stanford university palo alto can be broken into the Boolean query on biwords:

  • stanford university AND university palo AND palo alto

  • Without the docs, we cannot verify that the docs matching the above Boolean query do contain the phrase.

                                                                                    ↑   Can have false positives!

Extended biwords

  • Parse the indexed text and perform part-of-speech-tagging (POST).

  • Bucket the terms into (say) Nouns (N) and articles/prepositions (X).

  • Call any string of terms of the form NX*N an extended biword.

    • Each such extended biword is now made a term in the dictionary.

  • Example: catcher in the rye

                   N           X   X    N

  • Query processing: parse it into N’s and X’s

    • Segment query into enhanced biwords

    • Look up in index: catcher rye

Issues for biword indexes

  • False positives, as noted before

  • Index blowup due to bigger dictionary

    • Infeasible for more than biwords, big even for them

  • Biword indexes are not the standard solution (for all biwords) but can be part of a compound strategy

Solution 2: Positional indexes

  • In the postings, store for each term the position(s) in which tokens of it appear:

    <term, number of docs containing term;

     doc1: position1, position2 … ;

     doc2: position1, position2 … ;


Positional index example

  • For phrase queries, we use a merge algorithm recursively at the document level

  • But we now need to deal with more than just equality

Processing a phrase query

  • Extract inverted index entries for each distinct term: to, be, or, not.

  • Merge their doc:position lists to enumerate all positions with “to be or not to be”.

    • to:

      • 2:1,17,74,222,551; 4:8,16,190,429,433; 7:13,23,191; ...

    • be:

      • 1:17,19; 4:17,191,291,430,434; 5:14,19,101; ...

  • Same general method for proximity searches

Proximity queries


    • Again, here, /k means “within k words of”.

  • Clearly, positional indexes can be used for such queries; biword indexes cannot.

  • Exercise: Adapt the linear merge of postings to handle proximity queries. Can you make it work for any value of k?

    • This is a little tricky to do correctly and efficiently

    • See Figure 2.12 of IIR

    • There’s likely to be a problem on it!

Positional index size

  • You can compress position values/offsets: we’ll talk about that in lecture 5

  • Nevertheless, a positional index expands postings storage substantially

  • Nevertheless, a positional index is now standardly used because of the power and usefulness of phrase and proximity queries … whether used explicitly or implicitly in a ranking retrieval system.

Positional index size

  • Need an entry for each occurrence, not just once per document

  • Index size depends on average document size                    ←  Why?

    • Average web page has <1000 _tmplitem="94" terms

    • SEC filings, books, even some epic poems … easily 100,000 terms

  • Consider a term with frequency 0.1%

Rules of thumb

  • A positional index is 2–4 as large as a non-positional index

  • Positional index size 35–50% of volume of original text

  • Caveat: all of this holds for “English-like” languages

Combination schemes

  • These two approaches can be profitably combined

    • For particular phrases (“Michael Jackson”, “Britney Spears”) it is inefficient to keep on merging positional postings lists

      • Even more so for phrases like “The Who”

  • Williams et al. (2004) evaluate a more sophisticated mixed indexing scheme

    • A typical web query mixture was executed in ¼ of the time of using just a positional index

    • It required 26% more space than having a positional index alone

Resources for today’s lecture

  • IIR 2

  • MG 3.6, 4.3; MIR 7.2

  • Skip Lists theory: Pugh (1990)

    • Multilevel skip lists give same O(log n) efficiency as trees

  • D. Bahle, H. Williams, and J. Zobel. Efficient phrase querying with an auxiliary index. SIGIR 2002, pp. 215-221.

Introduction to Information Retrieval

CS276: Information Retrieval and Web Search 

Pandu Nayak and Prabhakar Raghavan

Lecture 3: Dictionaries and tolerant retrieval

Recap of the previous lecture

  • The type/token distinction

    • Terms are normalized types put in the dictionary

  • Tokenization problems:

    • Hyphens, apostrophes, compounds, CJK

  • Term equivalence classing:

    • Numbers, case folding, stemming, lemmatization

  • Skip pointers

    • Encoding a tree-like structure in a postings list

  • Biword indexes for phrases

  • Positional indexes for phrases/proximity queries

This lecture

  • Dictionary data structures

  • “Tolerant” retrieval

    • Wild-card queries

    • Spelling correction

    • Soundex

Dictionary data structures for inverted indexes

  • The dictionary data structure stores the term vocabulary, document frequency, pointers to each postings list … in what data structure?

A naïve dictionary

  • An array of struct:

                                     char[20]      int                           Postings *

                                     20 bytes      4/8 bytes                 4/8 bytes

  • How do we store a dictionary in memory efficiently?

  • How do we quickly look up elements at query time?

Dictionary data structures

  • Two main choices:

    • Hashtables

    • Trees

  • Some IR systems use hashtables, some trees


  • Each vocabulary term is hashed to an integer

    • (We assume you’ve seen hashtables before)

  • Pros:

    • Lookup is faster than for a tree: O(1)

  • Cons:

    • No easy way to find minor variants:

      • judgment/judgement

    • No prefix search [tolerant retrieval]

    • If vocabulary keeps growing, need to occasionally do the expensive operation of rehashing everything


    • Definition: Every internal nodel has a number of children in the interval [a,b] where a, b are appropriate natural numbers, e.g., [2,4].


  • Simplest: binary tree

  • More usual: B-trees

  • Trees require a standard ordering of characters and hence strings … but we typically have one

  • Pros:

    • Solves the prefix problem (terms starting with hyp)

  • Cons:

    • Slower: O(log M) [and this requires balanced tree]

    • Rebalancing binary trees is expensive

      • But B-trees mitigate the rebalancing problem

Wild-card queries 

Wild-card queries: *

  • mon*: find all docs containing any word beginning with “mon”.

  • Easy with binary tree (or B-tree) lexicon: retrieve all words in range:

    mon w < moo

  • *mon: find words ending in “mon”: harder

    • Maintain an additional B-tree for terms backwards.

    • Can retrieve all words in range: nom w < non.

  • Exercise: from this, how can we enumerate all terms

  • meeting the wild-card query pro*cent ?

Query processing

  • At this point, we have an enumeration of all terms in the dictionary that match the wild-card query.

  • We still have to look up the postings for each enumerated term.

  • E.g., consider the query:

  • se*ate AND fil*er

  • This may result in the execution of many Boolean AND queries.

B-trees handle *’s at the end of a query term

  • How can we handle *’s in the middle of query term?

    • co*tion

  • We could look up co* AND *tion in a B-tree and intersect the two term sets

    • Expensive

  • The solution: transform wild-card queries so that the *’s occur at the end

  • This gives rise to the Permuterm Index.

Permuterm index

  • For term hello, index under:

    • hello$, ello$h, llo$he, lo$hel, o$hell

    • where $ is a special symbol.

  • Queries:

    • X lookup on X$                    X* lookup on $X*

    • *X lookup on X$*              *X* lookup on X*

    • X*Y lookup on Y$X*           X*Y*Z ??? Exercise!


                                         Query = hel*o

                                             X=hel, Y=o

                                         Lookup o$hel*

Permuterm query processing

  • Rotate query wild-card to the right

  • Now use B-tree lookup as before.

  • Permuterm problem: quadruples lexicon size


                            Empirical observation for English.

Bigram (k-gram) indexes

  • Enumerate all k-grams (sequence of k chars) occurring in any term

  • e.g., from text “April is the cruelest month” we get the 2-grams (bigrams)

    • $ is a special word boundary symbol

  • Maintain a second inverted index from bigrams to dictionary terms that match each bigram.

  • $a,ap,pr,ri,il,l$,$i,is,s$,$t,th,he,e$,$c,cr,ru,

    ue,el,le,es,st,t$, $m,mo,on,nt,h$

Bigram index example

  • The k-gram index finds terms based on a query consisting of k-grams (here k=2).

Processing wild-cards

  • Query mon* can now be run as

    • $m AND mo AND on


  • Gets terms that match AND version of our wildcard query.

  • But we’d enumerate moon.

  • Must post-filter these terms against query.

  • Surviving enumerated terms are then looked up in the term-document inverted index.

  • Fast, space efficient (compared to permuterm).

Processing wild-card queries

  • As before, we must execute a Boolean query for each enumerated, filtered term.

  • Wild-cards can result in expensive query execution (very large disjunctions…)

    • pyth* AND prog*

  • If you encourage “laziness” people will respond!

Which web search engines allow wildcard queries? 

Spelling Correction 

Spell correction

  • Two principal uses

    • Correcting document(s) being indexed

    • Correcting user queries to retrieve “right” answers

  • Two main flavors:

    • Isolated word

      • Check each word on its own for misspelling

      • Will not catch typos resulting in correctly spelled words

      • e.g., from →form

    • Context-sensitive

      • Look at surrounding words,

      • e.g., I flew form Heathrow to Narita.

Document correction

  • Especially needed for OCR’ed documents

    • Correction algorithms are tuned for this: rn/m

    • Can use domain-specific knowledge

      • E.g., OCR can confuse O and D more often than it would confuse O and I (adjacent on the QWERTY keyboard, so more likely interchanged in typing).

  • But also: web pages and even printed material have typos

  • Goal: the dictionary contains fewer misspellings

  • But often we don’t change the documents and instead fix the query-document mapping

Query mis-spellings

  • Our principal focus here

    • E.g., the query Alanis Morisett

  • We can either

    • Retrieve documents indexed by the correct spelling, OR

    • Return several suggested alternative queries with the correct spelling

      • Did you mean … 

Isolated word correction

  • Fundamental premise – there is a lexicon from which the correct spellings come

  • Two basic choices for this

    • A standard lexicon such as

      • Webster’s English Dictionary

      • An “industry-specific” lexicon – hand-maintained

    • The lexicon of the indexed corpus

      • E.g., all words on the web

      • All names, acronyms etc.

      • (Including the mis-spellings)

Isolated word correction

  • Given a lexicon and a character sequence Q, return the words in the lexicon closest to Q

  • What’s “closest”?

  • We’ll study several alternatives

    • Edit distance (Levenshtein distance)

    • Weighted edit distance

    • n-gram overlap

Edit distance

  • Given two strings S1 and S2, the minimum number of operations to convert one to the other

  • Operations are typically character-level

    • Insert, Delete, Replace, (Transposition)

  • E.g., the edit distance from dof to dog is 1

    • From cat to act is 2 (Just 1 with transpose.)

    • from cat to dog is 3.

  • Generally found by dynamic programming.

Weighted edit distance

  • As above, but the weight of an operation depends on the character(s) involved

    • Meant to capture OCR or keyboard errors
      Example: m more likely to be mis-typed as n than as q

    • Therefore, replacing m by n is a smaller edit distance than by q

    • This may be formulated as a probability model

  • Requires weight matrix as input

  • Modify dynamic programming to handle weights

Using edit distances

  • Given query, first enumerate all character sequences within a preset (weighted) edit distance (e.g., 2)

  • Intersect this set with list of “correct” words

  • Show terms you found to user as suggestions

  • Alternatively,

    • We can look up all possible corrections in our inverted index and return all docs … slow

    • We can run with a single most likely correction

  • The alternatives disempower the user, but save a round of interaction with the user

Edit distance to all dictionary terms?

  • Given a (mis-spelled) query – do we compute its edit distance to every dictionary term?

    • Expensive and slow

    • Alternative?

  • How do we cut the set of candidate dictionary terms?

  • One possibility is to use n-gram overlap for this

  • This can also be used by itself for spelling correction.

n-gram overlap

  • Enumerate all the n-grams in the query string as well as in the lexicon

  • Use the n-gram index (recall wild-card search) to retrieve all lexicon terms matching any of the query n-grams

  • Threshold by number of matching n-grams

    • Variants – weight by keyboard layout, etc.

Example with trigrams

  • Suppose the text is november

    • Trigrams are nov, ove, vem, emb, mbe, ber.

  • The query is december

    • Trigrams are  dec, ece, cem, emb, mbe, ber.

  • So 3 trigrams overlap (of 6 in each term)

  • How can we turn this into a normalized measure of overlap?

One option – Jaccard coefficient

  • A commonly-used measure of overlap

  • Let X and Y be two sets; then the J.C. is

  • Equals 1 when X and Y have the same elements and zero when they are disjoint

  • X and Y don’t have to be of the same size

  • Always assigns a number between 0 and 1

    • Now threshold to decide if you have a match

    • E.g., if J.C. > 0.8, declare a match 

Matching trigrams

  • Consider the query lord – we wish to identify words matching 2 of its 3 bigrams (lo, or, rd)

    lo ⇒ alone → lore → sloth

    or border → lore → morbid

    rd ⇒ ardent → border → card


    Standard postings “merge” will enumerate … 

    Adapt this to using Jaccard (or another) measure.

Context-sensitive spell correction

  • Text: I flew from Heathrow to Narita.

  • Consider the phrase query “flew form Heathrow”

  • We’d like to respond

    Did you mean “flew from Heathrow”?

    because no docs matched the query phrase.

Context-sensitive correction

  • Need surrounding context to catch this.

  • First idea: retrieve dictionary terms close (in weighted edit distance) to each query term

  • Now try all possible resulting phrases with one word “fixed” at a time

    • flew from heathrow

    • fled form heathrow

    • flea form heathrow

  • Hit-based spelling correction: Suggest the alternative that has lots of hits.


  • Suppose that for “flew form Heathrow” we have 7 alternatives for flew, 19 for form and 3 for heathrow.

  • How many “corrected” phrases will we enumerate in this scheme?

Another approach

  • Break phrase query into a conjunction of biwords (Lecture 2).

  • Look for biwords that need only one term corrected.

  • Enumerate only phrases containing “common” biwords.

General issues in spell correction

  • We enumerate multiple alternatives for “Did you mean?”

  • Need to figure out which to present to the user

    • The alternative hitting most docs

    • Query log analysis

  • More generally, rank alternatives probabilistically

    • argmaxcorr P(corr | query)

    • From Bayes rule, this is equivalent to
      argmaxcorr P(query | corr) * P(corr)

                                ↑                 ↑

                    Noisy channel     Language model



  • Class of heuristics to expand a query into phonetic equivalents

    • Language specific – mainly for names

    • E.g., chebyshev → tchebycheff

  • Invented for the U.S. census … in 1918

Soundex – typical algorithm

  • Turn every token to be indexed into a 4-character reduced form

  • Do the same with query terms

  • Build and search an index on the reduced forms

    • (when the query calls for a soundex match)

Soundex – typical algorithm

  1. Retain the first letter of the word.

  2. Change all occurrences of the following letters to '0' (zero):
      'A', E', 'I', 'O', 'U', 'H', 'W', 'Y'.

  3. Change letters to digits as follows:

    • B, F, P, V →1

    • C, G, J, K, Q, S, X, Z 2

    • D,T 3

    • 4

    • M, N 5

    • 6

Soundex continued

  1. Remove all pairs of consecutive digits.

  2. Remove all zeros from the resulting string.

  3. Pad the resulting string with trailing zeros and return the first four positions, which will be of the form .

    • E.g., Herman becomes H655.


      Will hermann generate the same code?

What queries can we process?

  • We have

    • Positional inverted index with skip pointers

    • Wild-card index

    • Spell-correction

    • Soundex

  • Queries such as

  • (SPELL(moriset) /3 toron*to) OR SOUNDEX(chaikofski)


  • Soundex is the classic algorithm, provided by most databases (Oracle, Microsoft, …)

  • How useful is soundex?

  • Not very – for information retrieval

  • Okay for “high recall” tasks (e.g., Interpol), though biased to names of certain nationalities

  • Zobel and Dart (1996) show that other algorithms for phonetic matching perform much better in the context of IR


  • Draw yourself a diagram showing the various indexes in a search engine incorporating all the functionality we have talked about

  • Identify some of the key design choices in the index pipeline:

    • Does stemming happen before the Soundex index?

    • What about n-grams?

  • Given a query, how would you parse and dispatch sub-queries to the various indexes?


  • IIR 3, MG 4.2

  • Efficient spell retrieval:

    • K. Kukich. Techniques for automatically correcting words in text. ACM Computing Surveys 24(4), Dec 1992.

    • J. Zobel and P. Dart.  Finding approximate matches in large lexicons.  Software - practice and experience 25(3), March 1995.

    • Mikael Tillenius: Efficient Generation and Ranking of Spelling Error Corrections. Master’s thesis at Sweden’s Royal Institute of Technology.

  • Nice, easy reading on spell correction:

    • Peter Norvig: How to write a spelling corrector


Introduction to Information Retrieval

CS276: Information Retrieval and Web Search

Pandu Nayak and Prabhakar Raghavan

Lecture 4: Index Construction


  • Last lecture:

    • Dictionary data structures

    • Tolerant retrieval

      • Wildcards

      • Spell correction

      • Soundex

  • This time:

    • Index construction

new slide

Index construction

  • How do we construct an index?

  • What strategies can we use with limited main memory?

Hardware basics

  • Many design decisions in information retrieval are based on the characteristics of hardware

  • We begin by reviewing hardware basics

Hardware basics

  • Access to data in memory is much faster than access to data on disk.

  • Disk seeks: No data is transferred from disk while the disk head is being positioned.

  • Therefore: Transferring one large chunk of data from disk to memory is faster than transferring many small chunks.

  • Disk I/O is block-based: Reading and writing of entire blocks (as opposed to smaller chunks).

  • Block sizes: 8KB to 256 KB.

Hardware basics

  • Servers used in IR systems now typically have several GB of main memory, sometimes tens of GB.

  • Available disk space is several (2–3) orders of magnitude larger.

  • Fault tolerance is very expensive: It’s much cheaper to use many regular machines rather than one fault tolerant machine.

Hardware assumptions for this lecture

  • symbol                   statistic                               value

  •     s                   average seek time 5                    ms = 5 x 10−3 s

  •     b                   transfer time per byte 0.02         μs = 2 x 10−8 s

  •                          processor’s clock rate                109 s−1

  •      p                  low-level operation 0.01             μs = 10−8 s

                            (e.g., compare & swap a word)

  •                          size of main memory                    several GB

  •                          size of disk space                          1 TB or more

RCV1: Our collection for this lecture

  • Shakespeare’s collected works definitely aren’t large enough for demonstrating many of the points in this course.

  • The collection we’ll use isn’t really large enough either, but it’s publicly available and is at least a more plausible example.

  • As an example for applying scalable index construction algorithms, we will use the Reuters RCV1 collection.

  • This is one year of Reuters newswire (part of 1995 and 1996)

A Reuters RCV1 document

Reuters RCV1 statistics

  • symbol                        statistic                       value

  •     N                           documents                      800,000

  •      L                        avg. # tokens per doc            200

  •      M                       terms (= word types)         400,000

  •                                 avg. # bytes per token            6

                                    (incl. spaces/punct.)

  •                                 avg. # bytes per token           4.5

                                    (without spaces/punct.)

  •                                 avg. # bytes per term             7.5

  •                                non-positional postings    100,000,000

    4.5 bytes per word token vs. 7.5 bytes per word type: why?

new slide

Recall IIR 1 index construction

Documents are parsed to extract words and these are saved with the Document ID.

Key step

  • After all documents have been parsed, the inverted file is sorted by terms.                 ↑
We focus on this sort step.We have 100M items to sort.

Scaling index construction

  • In-memory index construction does not scale

    • Can’t stuff entire collection into memory, sort, then write back

  • How can we construct an index for very large collections?

  • Taking into account the hardware constraints we just learned about . . .

  • Memory, disk, speed, etc.

Sort-based index construction

  • As we build the index, we parse docs one at a time.

    • While building the index, we cannot easily exploit compression tricks (you can, but much more complex)

  • The final postings for any term are incomplete until the end.

  • At 12 bytes per non-positional postings entry (term, doc, freq), demands a lot of space for large collections.

  • T = 100,000,000 in the case of RCV1

    • So … we can do this in memory in 2009, but typical collections are much larger. E.g., the New York Times provides an index of >150 years of newswire

  • Thus: We need to store intermediate results on disk.

Sort using disk as “memory”?

  • Can we use the same index construction algorithm for larger collections, but by using disk instead of memory?

  • No: Sorting T = 100,000,000 records on disk is too slow – too many disk seeks.

  • We need an external sorting algorithm.


  • Parse and build postings entries one doc at a time

  • Now sort postings entries by term (then by doc within each term)

  • Doing this with random disk seeks would be too slow

    – must sort T=100M records

If every comparison took 2 disk seeks, and N items could be sorted with N log2N comparisons, how long would this take?

BSBI: Blocked sort-based Indexing (Sorting with fewer disk seeks)

  • 12-byte (4+4+4) records (term, doc, freq).

  • These are generated as we parse docs.

  • Must now sort 100M such 12-byte records by term.

  • Define a Block ~ 10M such records

    • Can easily fit a couple into memory.

    • Will have 10 such blocks to start with.

  • Basic idea of algorithm:

    • Accumulate postings for each block, sort, write to disk.

    • Then merge the blocks into one long sorted order.

Sorting 10 blocks of 10M records

  • First, read each block and sort within:

    • Quicksort takes 2N ln N expected steps

    • In our case 2 x (10M ln 10M) steps

  • Exercise: estimate total time to read each block from disk and and quicksort it.

  • 10 times this estimate – gives us 10 sorted runs of 10M records each.

  • Done straightforwardly, need 2 copies of data on disk

    • But can optimize this

new slide

How to merge the sorted runs?

  • Can do binary merges, with a merge tree of log210 = 4 layers.

  • During each layer, read into memory runs in blocks of 10M, merge, write back.

How to merge the sorted runs?

  • But it is more efficient to do a multi-way merge, where you are reading from all blocks simultaneously

  • Providing you read decent-sized chunks of each block into memory and then write out a decent-sized output chunk, then you’re not killed by disk seeks.

Remaining problem with sort-based algorithm

  • Our assumption was: we can keep the dictionary in memory.

  • We need the dictionary (which grows dynamically) in order to implement a term to termID mapping.

  • Actually, we could work with term,docID postings instead of termID,docID postings . . .

  • . . . but then intermediate files become very large. (We would end up with a scalable, but very slow index construction method.)

new slide

SPIMI: Single-pass in-memory indexing

  • Key idea 1: Generate separate dictionaries for each block – no need to maintain term-termID mapping across blocks.

  • Key idea 2: Don’t sort. Accumulate postings in postings lists as they occur.

  • With these two ideas we can generate a complete inverted index for each block.

  • These separate indexes can then be merged into one big index.


  • Merging of blocks is analogous to BSBI. 

SPIMI: Compression

  • Compression makes SPIMI even more efficient.

    • Compression of terms

    • Compression of postings

  • See next lecture

Distributed indexing

  • For web-scale indexing (don’t try this at home!):

    • must use a distributed computing cluster

  • Individual machines are fault-prone

    • Can unpredictably slow down or fail

  • How do we exploit such a pool of machines?

Web search engine data centers

  • Web search data centers (Google, Bing, Baidu) mainly contain commodity machines.

  • Data centers are distributed around the world.

  • Estimate: Google ~1 million servers, 3 million processors/cores (Gartner 2007)

Massive data centers

  • If in a non-fault-tolerant system with 1000 nodes, each node has 99.9% uptime, what is the uptime of the system?

  • Answer: 63%

  • Exercise: Calculate the number of servers failing per minute for an installation of 1 million servers.

Distributed indexing

  • Maintain a master machine directing the indexing job – considered “safe”.

  • Break up indexing into sets of (parallel) tasks.

  • Master machine assigns each task to an idle machine from a pool.

Parallel tasks

  • We will use two sets of parallel tasks

    • Parsers

    • Inverters

  • Break the input document collection into splits

  • Each split is a subset of documents (corresponding to blocks in BSBI/SPIMI)


  • Master assigns a split to an idle parser machine

  • Parser reads a document at a time and emits (term, doc) pairs

  • Parser writes pairs into j partitions

  • Each partition is for a range of terms’ first letters

    • (e.g., a-f, g-p, q-z) – here j = 3.

  • Now to complete the index inversion


  • An inverter collects all (term,doc) pairs (= postings) for one term-partition.

  • Sorts and writes to postings lists

Data flow


  • The index construction algorithm we just described is an instance of MapReduce.

  • MapReduce (Dean and Ghemawat 2004) is a robust and conceptually simple framework for distributed computing …

  • … without having to write code for the distribution part.

  • They describe the Google indexing system (ca. 2002) as consisting of a number of phases, each implemented in MapReduce.


  • Index construction was just one phase.

  • Another phase: transforming a term-partitioned index into a document-partitioned index.

    • Term-partitioned: one machine handles a subrange of terms

    • Document-partitioned: one machine handles a subrange of documents

  • As we’ll discuss in the web part of the course, most search engines use a document-partitioned index … better load balancing, etc.

Schema for index construction in MapReduce

  • Schema of map and reduce functions

  • map: input → list(k, v) reduce: (k,list(v)) → output

  • Instantiation of the schema for index construction

  • map: collection → list(termID, docID)

  • reduce: (, , …) → (postings list1, postings list2, …)

Example for index construction

  • Map:

  • d1 : C came, C c’ed.

  • d2 : C died. →

  • (C,d1>, <came,d1>, <C,d1>, <c'ed, d1>, >C, d2>, <died, d2>

  • Reduce:

  • (<C,(d1,d2,d1)>, <died,(d2)>, <came, (d1)>, <c'ed,(d1)>) 

    → <C,(d1:2,d2:1)>, <died,(d2:1)>,  <came, (d1:1)>, <c'ed,(d1:1)>) 

new slide

Dynamic indexing

  • Up to now, we have assumed that collections are static.

  • They rarely are:

    • Documents come in over time and need to be inserted.

    • Documents are deleted and modified.

  • This means that the dictionary and postings lists have to be modified:

    • Postings updates for terms already in dictionary

    • New terms added to dictionary

Simplest approach

  • Maintain “big” main index

  • New docs go into “small” auxiliary index

  • Search across both, merge results

  • Deletions

    • Invalidation bit-vector for deleted docs

    • Filter docs output on a search result by this invalidation bit-vector

  • Periodically, re-index into one main index

Issues with main and auxiliary indexes

  • Problem of frequent merges – you touch stuff a lot

  • Poor performance during merge

  • Actually:

    • Merging of the auxiliary index into the main index is efficient if we keep a separate file for each postings list.

    • Merge is the same as a simple append.

    • But then we would need a lot of files – inefficient for OS.

  • Assumption for the rest of the lecture: The index is one big file.

  • In reality: Use a scheme somewhere in between (e.g., split very large postings lists, collect postings lists of length 1 in one file etc.)

Logarithmic merge

  • Maintain a series of indexes, each twice as large as the previous one

    • At any time, some of these powers of 2 are instantiated

  • Keep smallest (Z0) in memory

  • Larger ones (I0, I1, …) on disk

  • If Z0 gets too big (> n), write to disk as I0

  • or merge with I0 (if I0 already exists) as Z1

  • Either write merge Z1 to disk as I1 (if no I1)

  • Or merge with I1 to form Z2

Logarithmic merge

  • Auxiliary and main index: index construction time is O(T2) as each posting is touched in each merge.

  • Logarithmic merge: Each posting is merged O(log T) times, so complexity is O(T log T)

  • So logarithmic merge is much more efficient for index construction

  • But query processing now requires the merging of O(log T) indexes

    • Whereas it is O(1) if you just have a main and auxiliary index

Further issues with multiple indexes

  • Collection-wide statistics are hard to maintain

  • E.g., when we spoke of spell-correction: which of several corrected alternatives do we present to the user?

    • We said, pick the one with the most hits

  • How do we maintain the top ones with multiple indexes and invalidation bit vectors?

    • One possibility: ignore everything but the main index for such ordering

  • Will see more such statistics used in results ranking

Dynamic indexing at search engines

  • All the large search engines now do dynamic indexing

  • Their indices have frequent incremental changes

    • News items, blogs, new topical web pages

      • Sarah Palin, …

  • But (sometimes/typically) they also periodically reconstruct the index from scratch

    • Query processing is then switched to the new index, and the old index is deleted

Other sorts of indexes

  • Positional indexes

    • Same sort of sorting problem … just larger                ← WHY?

  • Building character n-gram indexes:

    • As text is parsed, enumerate n-grams.

    • For each n-gram, need pointers to all dictionary terms containing it – the “postings”.

    • Note that the same “postings entry” will arise repeatedly in parsing the docs – need efficient hashing to keep track of this.

      • E.g., that the trigram uou occurs in the term deciduous will be discovered on each text occurrence of deciduous

      • Only need to process each term once

new slide

new slide

Resources for today’s lecture

  • Chapter 4 of IIR

  • MG Chapter 5

  • Original publication on MapReduce: Dean and Ghemawat (2004)

  • Original publication on SPIMI: Heinz and Zobel (2003)

CS276: Information Retrieval and Web Search

Pandu Nayak and Prabhakar Raghavan

Lecture 5: Index Compression

Course work

  • Problem set 1 due Thursday

  • Programming exercise 1 will be handed out today

Last lecture – index construction

  • Sort-based indexing 
    • Naïve in-memory inversion 
    • Blocked Sort-Based Indexing
      • Merge sort is effective for disk-based sorting (avoid seeks!)
  • Single-Pass In-Memory Indexing
    • No global dictionary
      • Generate separate dictionary for each block
    • Don’t sort postings
      • Accumulate postings in postings lists as they occur
  • Distributed indexing using MapReduce
  • Dynamic indexing: Multiple indices, logarithmic merge


  • Collection statistics in more detail (with RCV1)

    • How big will the dictionary and postings be?

  • Dictionary compression

  • Postings compression

Why compression (in general)?

  • Use less disk space
    • Saves a little money
  • Keep more stuff in memory
    • Increases speed
  • Increase speed of data transfer from disk to memory
    • [read compressed data | decompress] is faster than [read uncompressed data]
    • Premise: Decompression algorithms are fast
      • True of the decompression algorithms we use

Why compression for inverted indexes?

  • Dictionary

    • Make it small enough to keep in main memory

    • Make it so small that you can keep some postings lists in main memory too

  • Postings file(s)

    • Reduce disk space needed

    • Decrease time needed to read postings lists from disk

    • Large search engines keep a significant part of the postings in memory.

      • Compression lets you keep more in memory

  • We will devise various IR-specific compression schemes

Recall Reuters RCV1

    symbol                             statistic                                     value
  •     N                               documents                                800,000

  •      L                     avg. # tokens per doc                             200

  •     M                     terms (= word types)                       ~400,000

  •                             avg. # bytes per token                             6

                                   (incl. spaces/punct.)

  •                             avg. # bytes per token                           4.5

                               (without spaces/punct.)

  •                             avg. # bytes per term                             7.5

  •                             non-positional postings                  100,000,000

Index parameters vs. what we index (details IIR Table 5.1, p.80)

Exercise: give intuitions for all the ‘0’ entries. Why do some zero entries correspond to big deltas in other columns?

Lossless vs. lossy compression

  • Lossless compression: All information is preserved.

    • What we mostly do in IR.

  • Lossy compression: Discard some information

  • Several of the preprocessing steps can be viewed as lossy compression: case folding, stop words, stemming, number elimination.

  • Chap/Lecture 7: Prune postings entries that are unlikely to turn up in the top k list for any query.

    • Almost no loss quality for top k list.

Vocabulary vs. collection size

  • How big is the term vocabulary?

    • That is, how many distinct words are there?

  • Can we assume an upper bound?

    • Not really: At least 7020 = 1037 different words of length 20

  • In practice, the vocabulary will keep growing with the collection size

    • Especially with Unicode :)

Vocabulary vs. collection size

  • Heaps’ law: M = kTb

  • M is the size of the vocabulary, T is the number of tokens in the collection

  • Typical values: 30 ≤ k ≤ 100 and b ≈ 0.5

  • In a log-log plot of vocabulary size M vs. T, Heaps’ law predicts a line with slope about ½

    • It is the simplest possible relationship between the two in log-log space

    • An empirical finding (“empirical law”)

Heaps’ Law

For RCV1, the dashed line

log10M = 0.49 log10T + 1.64 is the best least squares fit.

Thus, M = 101.64T0.49 so k = 101.64 ≈ 44 and b = 0.49.

Good empirical fit for Reuters RCV1 !

For first 1,000,020 tokens,

law predicts 38,323 terms;

actually, 38,365 terms.


  • What is the effect of including spelling errors, vs. automatically correcting spelling errors on Heaps’ law?

  • Compute the vocabulary size M for this scenario:

    • Looking at a collection of web pages, you find that there are 3000 different terms in the first 10,000 tokens and 30,000 different terms in the first 1,000,000 tokens.

    • Assume a search engine indexes a total of 20,000,000,000 (2 × 1010) pages, containing 200 tokens on average

    • What is the size of the vocabulary of the indexed collection as predicted by Heaps’ law?

Zipf’s law

  • Heaps’ law gives the vocabulary size in collections.

  • We also study the relative frequencies of terms.

  • In natural language, there are a few very frequent terms and very many very rare terms.

  • Zipf’s law: The ith most frequent term has frequency proportional to 1/i .

  • cfi ∝ 1/i = K/i where K is a normalizing constant

  • cfi is collection frequency: the number of occurrences of the term ti in the collection.

Zipf consequences

  • If the most frequent term (the) occurs cf1 times

    • then the second most frequent term (of) occurs cf1/2 times

    • the third most frequent term (and) occurs cf1/3 times …

  • Equivalent: cfi = K/i where K is a normalizing factor, so

    • log cfi = log K - log i

    • Linear relationship between log cfi and log i

  • Another power law relationship

Zipf’s law for Reuters RCV1


  • Now, we will consider compressing the space for the dictionary and postings

    • Basic Boolean index only

    • No study of positional indexes, etc.

    • We will consider compression schemes


Why compress the dictionary?

  • Search begins with the dictionary

  • We want to keep it in memory

  • Memory footprint competition with other applications

  • Embedded/mobile devices may have very little memory

  • Even if the dictionary isn’t in memory, we want it to be small for a fast search startup time

  • So, compressing the dictionary is important

Dictionary storage - first cut

  • Array of fixed-width entries
    • ~400,000 terms; 28 bytes/term = 11.2 MB.

                     ↓                        20 bytes                4 bytes each


Dictionary search structure

Fixed-width terms are wasteful

  • Most of the bytes in the Term column are wasted – we allot 20 bytes for 1 letter terms.

    • And we still can’t handle supercalifragilisticexpialidocious or hydrochlorofluorocarbons.

  • Written English averages ~4.5 characters/word.

    • Exercise: Why is/isn’t this the number to use for estimating the dictionary size?

  • Ave. dictionary word in English: ~8 characters

    • How do we use ~8 characters per dictionary term?

  • Short words dominate token counts but not type average.

Compressing the term list: Dictionary-as-a-String

  • Store dictionary as a (long) string of characters:

    • Pointer to next word shows end of current word

    • Hope to save up to 60% of dictionary space.

Space for dictionary as a string

  • 4 bytes per term for Freq. 

  • 4 bytes per term for pointer to Postings.                         → Now avg. 11 bytes/term,

  • 3 bytes per term pointer                                                               ...Not 20

  • Avg. 8 bytes per term in term string

  • 400K terms x 19 ⇒7.6 MB (against 11.2MB for fixed width)


  • Store pointers to every kth term string.

    • Example below: k=4.

  • Need to store term lengths (1 extra byte)


  • Example for block size k = 4

  • Where we used 3 bytes/pointer without blocking

    • 3 x 4 = 12 bytes,

  • now we use 3 + 4 = 7 bytes.

  • Shaved another ~0.5MB. This reduces the size of the dictionary from 7.6 MB to 7.1 MB.

  • We can save more with larger k.

  • Why not go with larger k?


  • Estimate the space usage (and savings compared to 7.6 MB) with blocking, for block sizes of k = 4, 8 and 16.

Dictionary search without blocking

  • Assuming each dictionary term equally likely in query

    (not really so in practice!), average number of comparisons

    =(1+2∙2+4∙3+4)/8 ~2.6 

  • Exercise: what if the frequencies of query terms

    were non-uniform but known,

    how would you structure the dictionary search tree?

Dictionary search with blocking

  • Büinary search down to 4-term block;

    • Then linear search through terms in block.

  • Blocks of 4 (binary tree), avg. = (1+2∙2+2∙3+2∙4+5)/8 = 3 compares


  • Estimate the impact on search performance (and slowdown compared to k=1) with blocking, for block sizes of k = 4, 8 and 16. 

Front coding

  • Front-coding:

    • Sorted words commonly have long common prefix – store differences only

    • (for last k-1 in a block of k)

    • 8automata8automate9automatic10automation

                        → 8automat*a1♦e2ic3ion


               Encodes automat          Extra length beyond automat.

  • Begins to resemble general string compression.

RCV1 dictionary compression summary

  • Technique

  • Size in MB

  • Fixed width

  • 11.2

  • Dictionary-as-String with pointers to every term

  • 7.6

  • Also, blocking k = 4

  • 7.1

  • Also, Blocking + front coding

  • 5.9


Postings compression

  • The postings file is much larger than the dictionary, factor of at least 10.

  • Key desideratum: store each posting compactly.

  • A posting for our purposes is a docID.

  • For Reuters (800,000 documents), we would use 32 bits per docID when using 4-byte integers.

  • Alternatively, we can use log2 800,000 ≈ 20 bits per docID.

  • Our goal: use far fewer than 20 bits per docID.

Postings: two conflicting forces

  • A term like arachnocentric occurs in maybe one doc out of a million – we would like to store this posting using log2 1M ~ 20 bits.

  • A term like the occurs in virtually every doc, so 20 bits/posting is too expensive.

    • Prefer 0/1 bitmap vector in this case

Postings file entry

  • We store the list of docs containing a term in increasing order of docID.

    • computer: 33,47,154,159,202 …

  • Consequence: it suffices to store gaps.

    • 33,14,107,5,43 …

  • Hope: most gaps can be encoded/stored with far fewer than 20 bits.

Three postings entries

Variable length encoding

  • Aim:

    • For arachnocentric, we will use ~20 bits/gap entry.

    • For the, we will use ~1 bit/gap entry.

  • If the average gap for a term is G, we want to use ~log2G bits/gap entry.

  • Key challenge: encode every integer (gap) with about as few bits as needed for that integer.

  • This requires a variable length encoding

  • Variable length codes achieve this by using short codes for small numbers

Variable Byte (VB) codes

  • For a gap value G, we want to use close to the fewest bytes needed to hold log2 G bits

  • Begin with one byte to store G and dedicate 1 bit in it to be a continuation bit c

  • If G ≤127, binary-encode it in the 7 available bits and set c =1

  • Else encode G’s lower-order 7 bits and then use additional bytes to encode the higher order bits using the same algorithm

  • At the end set the continuation bit of the last byte to 1 (c =1) – and for the other bytes c = 0.


Other variable unit codes

  • Instead of bytes, we can also use a different “unit of alignment”: 32 bits (words), 16 bits, 4 bits (nibbles).

  • Variable byte alignment wastes space if you have many small gaps – nibbles do better in such cases.

  • Variable byte codes:

    • Used by many commercial/research systems

    • Good low-tech blend of variable-length coding and sensitivity to computer memory alignment matches (vs. bit-level codes, which we look at next).

  • There is also recent work on word-aligned codes that pack a variable number of gaps into one word

Unary code

  • Represent n as n 1s with a final 0.

  • Unary code for 3 is 1110.

  • Unary code for 40 is

  • 11111111111111111111111111111111111111110 .

  • Unary code for 80 is:

  • 111111111111111111111111111111111111111111111111111111111111111111111111111111110

  • This doesn’t look promising, but….

Gamma codes

  • We can compress better with bit-level codes

    • The Gamma code is the best known of these.

  • Represent a gap G as a pair length and offset

  • offset is G in binary, with the leading bit cut off

    • For example 13 → 1101 → 101

  • length is the length of offset

    • For 13 (offset 101), this is 3.

  • We encode length with unary code: 1110.

  • Gamma code of 13 is the concatenation of length and offset: 1110101

Gamma code examples

Gamma code properties

  • G is encoded using 2   log G⌋ + 1 bits

    • Length of offset is   log G   bits

    • Length of length is ⌊log G  + 1 bits

  • All gamma codes have an odd number of bits

  • Almost within a factor of 2 of best possible, log2 G

  • Gamma code is uniquely prefix-decodable, like VB

  • Gamma code can be used for any distribution

  • Gamma code is parameter-free

Gamma seldom used in practice

  • Machines have word boundaries – 8, 16, 32, 64 bits

    • Operations that cross word boundaries are slower

  • Compressing and manipulating at the granularity of bits can be slow

  • Variable byte encoding is aligned and thus potentially more efficient

  • Regardless of efficiency, variable byte is conceptually simpler at little additional space cost

RCV1 compression

Index compression summary

  • We can now create an index for highly efficient Boolean retrieval that is very space efficient
  • Only 4% of the total size of the collection

  • Only 10-15% of the total size of the text in the collection

  • However, we’ve ignored positional information

  • Hence, space savings are less for indexes used in practice

    • But techniques substantially the same.

Resources for today’s lecture

  • IIR 5

  • MG 3.3, 3.4.

  • F. Scholer, H.E. Williams and J. Zobel. 2002. Compression of Inverted Indexes For Fast Query Evaluation. Proc. ACM-SIGIR 2002.

    • Variable byte codes

  • V. N. Anh and A. Moffat. 2005. Inverted Index Compression Using Word-Aligned Binary Codes. Information Retrieval 8: 151–166.

    • Word aligned codes

CS276: Information Retrieval and Web Search

Pandu Nayak and Prabhakar Raghavan

Lecture 6: Scoring, Term Weighting and the Vector Space Model

Recap of lecture 5

  • Collection and vocabulary statistics: Heaps’ and Zipf’s laws
  • Dictionary compression for Boolean indexes
    • Dictionary string, blocks, front coding
  • Postings compression: Gap encoding, prefix-unique codes
    • Variable-Byte and Gamma codes
  • collection (text, xml markup etc)
  • 3,600.0       MB
  • collection (text)
  • 960.0
  • Term-doc incidence matrix

  • 40,000.0

  • postings, uncompressed (32-bit words)
  • 400.0
  • postings, uncompressed (20 bits)
  • 250.0

  • postings, variable byte encoded
  • 116.0

  • postings, g-encoded
  • 101.0

This lecture; IIR Sections 6.2-6.4.3

  • Ranked retrieval

  • Scoring documents

  • Term frequency

  • Collection statistics

  • Weighting schemes

  • Vector space scoring

Ranked retrieval

  • Thus far, our queries have all been Boolean.

    • Documents either match or don’t.

  • Good for expert users with precise understanding of their needs and the collection.

    • Also good for applications: Applications can easily consume 1000s of results.

  • Not good for the majority of users.

    • Most users incapable of writing Boolean queries (or they are, but they think it’s too much work).

    • Most users don’t want to wade through 1000s of results.

      • This is particularly true of web search.

Problem with Boolean search:

  • feast or famine

  • Boolean queries often result in either too few (=0) or too many (1000s) results.

  • Query 1: “standard user dlink 650” → 200,000 hits

  • Query 2: “standard user dlink 650 no card found”: 0 hits

  • It takes a lot of skill to come up with a query that produces a manageable number of hits.

    • AND gives too few; OR gives too many

Ranked retrieval models

  • Rather than a set of documents satisfying a query expression, in ranked retrieval, the system returns an ordering over the (top) documents in the collection for a query

  • Free text queries: Rather than a query language of operators and expressions, the user’s query is just one or more words in a human language

  • In principle, there are two separate choices here, but in practice, ranked retrieval has normally been associated with free text queries and vice versa

Feast or famine: not a problem in ranked retrieval

  • When a system produces a ranked result set, large result sets are not an issue

    • Indeed, the size of the result set is not an issue

    • We just show the top k ( ≈ 10) results

    • We don’t overwhelm the user

    • Premise: the ranking algorithm works

Scoring as the basis of ranked retrieval

  • We wish to return in order the documents most likely to be useful to the searcher

  • How can we rank-order the documents in the collection with respect to a query?

  • Assign a score – say in [0, 1] – to each document

  • This score measures how well document and query “match”.

Query-document matching scores

  • We need a way of assigning a score to a query/document pair

  • Let’s start with a one-term query 

  • If the query term does not occur in the document: score should be 0

  • The more frequent the query term in the document, the higher the score (should be)

  • We will look at a number of alternatives for this.

Take 1: Jaccard coefficient

  • Recall from Lecture 3: A commonly used measure of overlap of two sets A and B

  • jaccard(A,B) = |A B| / |A B|

  • jaccard(A,A) = 1

  • jaccard(A,B) = 0 if A ∩ B =

  • A and B don’t have to be the same size.

  • Always assigns a number between 0 and 1.

Jaccard coefficient: Scoring example

  • What is the query-document match score that the Jaccard coefficient computes for each of the two documents below?

  • Query: ides of march

  • Document 1: caesar died in march

  • Document 2: the long march

Issues with Jaccard for scoring

  • It doesn’t consider term frequency (how many times a term occurs in a document)

  • Rare terms in a collection are more informative than frequent terms. Jaccard doesn’t consider this information

  • We need a more sophisticated way of normalizing for length

  • Later in this lecture, we’ll use 

  • . . . instead of |A ∩ B|/|A ∪ B| (Jaccard) for length normalization.

Recall (Lecture 1): Binary term-document incidence matrix

  • Each document is represented by a binary vector ∈ {0,1}|V|

Term-document count matrices

  • Consider the number of occurrences of a term in a document:

    • Each document is a count vector in ℕv: a column below 

Bag of words model

  • Vector representation doesn’t consider the ordering of words in a document

  • John is quicker than Mary and Mary is quicker than John have the same vectors

  • This is called the bag of words model.

  • In a sense, this is a step back: The positional index was able to distinguish these two documents.

  • We will look at “recovering” positional information later in this course.

  • For now: bag of words model

Term frequency tf

  • The term frequency tft,d of term t in document d is defined as the number of times that t occurs in d.

  • We want to use tf when computing query-document match scores. But how?

  • Raw term frequency is not what we want:

    • A document with 10 occurrences of the term is more relevant than a document with 1 occurrence of the term.

    • But not 10 times more relevant.

  • Relevance does not increase proportionally with term frequency.

  • NB: frequency = count in IR

Log-frequency weighting

  • The log frequency weight of term t in d is

  • 0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc.

  • Score for a document-query pair: sum over terms t in both q and d:

  • The score is 0 if none of the query terms is present in the document.
    score : 

Document frequency

  • Rare terms are more informative than frequent terms

    • Recall stop words

  • Consider a term in the query that is rare in the collection (e.g., arachnocentric)

  • A document containing this term is very likely to be relevant to the query arachnocentric

  • → We want a high weight for rare terms like arachnocentric.

Document frequency, continued

  • Frequent terms are less informative than rare terms

  • Consider a query term that is frequent in the collection (e.g., high, increase, line)

  • A document containing such a term is more likely to be relevant than a document that doesn’t

  • But it’s not a sure indicator of relevance.

  • → For frequent terms, we want high positive weights for words like high, increase, and line

  • But lower weights than for rare terms.

  • We will use document frequency (df) to capture this.

idf weight

  • dft is the document frequency of t: the number of documents that contain t

    • dft is an inverse measure of the informativeness of t

    • dft N

  • We define the idf (inverse document frequency) of t by

    • We use log (N/dft) instead of N/dft to “dampen” the effect of idf.

  • Will turn out the base of the log is immaterial.

idf example, suppose N = 1 million

There is one idf value for each term t in a collection.

idft = log10 = ( N/dft)

  • term
  • dft
  • idft
  • calpurnia
  • 1
  • animal
  • 100
  • sunday
  • 1,000
  • fly
  • 10,000
  • under
  • 100,000
  • the
  • 1,000,000

Effect of idf on ranking

  • Does idf have an effect on ranking for one-term queries, like

    • iPhone

  • idf has no effect on ranking one term queries

    • idf affects the ranking of documents for queries with at least two terms

    • For the query capricious person, idf weighting makes occurrences of capricious count for much more in the final document ranking than occurrences of person.

Collection vs. Document frequency

  • The collection frequency of t is the number of occurrences of t in the collection, counting multiple occurrences.

  • Example:

  • Word

  • Collection frequency

  • Document frequency

  • insurance

  • 10440

  • 3997

  • try

  • 10422

  • 8760

  • Which word is a better search term (and should get a higher weight)?

tf-idf weighting

  • The tf-idf weight of a term is the product of its tf weight and its idf weight.

    Wt,d = log(1+ tft,d) x log10(N/dft)

  • Best known weighting scheme in information retrieval

    • Note: the “-” in tf-idf is a hyphen, not a minus sign!

    • Alternative names: tf.idf, tf x idf

  • Increases with the number of occurrences within a document

  • Increases with the rarity of the term in the collection

Score for a document given a query

Score(q,d) = ∑t∈q∩d tf.idft,d

  • There are many variants

    • How “tf” is computed (with/without logs)

    • Whether the terms in the query are also weighted

    • … 

Binary → count → weight matrix

  • Each document is now represented by a real-valued vector of tf-idf weights ∈ R|V|

Documents as vectors

  • So we have a |V|-dimensional vector space

  • Terms are axes of the space

  • Documents are points or vectors in this space

  • Very high-dimensional: tens of millions of dimensions when you apply this to a web search engine

  • These are very sparse vectors - most entries are zero.

Queries as vectors

  • Key idea 1: Do the same for queries: represent them as vectors in the space

  • Key idea 2: Rank documents according to their proximity to the query in this space

  • proximity = similarity of vectors

  • proximity ≈ inverse of distance

  • Recall: We do this because we want to get away from the you’re-either-in-or-out Boolean model.

  • Instead: rank more relevant documents higher than less relevant documents

Formalizing vector space proximity

  • First cut: distance between two points

    • ( = distance between the end points of the two vectors)

  • Euclidean distance?

  • Euclidean distance is a bad idea . . .

  • . . . because Euclidean distance is large for vectors of different lengths.

Why distance is a bad idea

  • The Euclidean distance between and d2 is large even though the distribution of t terms in the query q and the distribution of terms in the document d2 are very similar.

Use angle instead of distance

  • Thought experiment: take a document d and append it to itself. Call this document d′.

  • “Semantically” d and d′ have the same content

  • The Euclidean distance between the two documents can be quite large

  • The angle between the two documents is 0, corresponding to maximal similarity.

  • Key idea: Rank documents according to angle with query.

From angles to cosines

  • The following two notions are equivalent.

    • Rank documents in decreasing order of the angle between query and document

    • Rank documents in increasing order of cosine(query,document)

  • Cosine is a monotonically decreasing function for the interval [0o, 180o]

From angles to cosines

  • But how – and why – should we be computing cosines?

Length normalization

  • A vector can be (length-) normalized by dividing each of its components by its length – for this we use the L2 norm:

  • Dividing a vector by its L2 norm makes it a unit (length) vector (on surface of unit hypersphere)

  • Effect on the two documents d and d′ (d appended to itself) from earlier slide: they have identical vectors after length-normalization.

    • Long and short documents now have comparable weights


  • qi is the tf-idf weight of term i in the query

  • di is the tf-idf weight of term i in the document

  • cos(q,d) is the cosine similarity of q and d … or,

  • equivalently, the cosine of the angle between q and d.

Cosine for length-normalized vectors

  • For length-normalized vectors, cosine similarity is simply the dot product (or scalar product):

  • for q, d length-normalized.

Cosine similarity illustrated

Cosine similarity amongst 3 documents

  • How similar are the novels 
  • SaS: Sense and Sensibility
  •  PaP: Pride and  Prejudice, and  
  • WH: Wuthering Heights?
  • Term frequencies (counts)
  • Note: To simplify this example, we don’t do idf weighting.
  • term
  • SaS
  • PaP
  • WH
  • affection
  • 115
  • 58
  • 20
  • jealous
  • 10
  • 7
  • 11
  • gossip
  • 2
  • <0
  • 6
  • wuthering
  • 0
  • 0
  • 38

3 documents example contd.

  • cos(SaS,PaP) ≈

    0.789 × 0.832 + 0.515 × 0.555 + 0.335 × 0.0 + 0.0 × 0.0

    ≈ 0.94

    cos(SaS,WH) ≈ 0.79

    cos(PaP,WH) ≈ 0.69

  • Why do we have cos(SaS,PaP) > cos(SaS,WH)?

Computing cosine scores

tf-idf weighting has many variants

  • Columns headed ‘n’ are acronyms for weight schemes.

  • Why is the base of the log in idf immaterial?

Weighting may differ in queries vs documents

  • Many search engines allow for different weightings for queries vs. documents

  • SMART Notation: denotes the combination in use in an engine, with the notation ddd.qqq, using the acronyms from the previous table

  • A very standard weighting scheme is: lnc.ltc

  • Document: logarithmic tf (l as first character), no idf and cosine normalization

  • Query: logarithmic tf (l in leftmost column), idf (t in second column), no normalization …

  • A bad idea?

tf-idf example: lnc.ltc

  • Document: car insurance auto insurance

    Query: best car insurance

  • Exercise: what is N, the number of docs?

  • Score = 0+0+0.27+0.53 = 0.8

  • Doc length =

Summary – vector space ranking

  • Represent the query as a weighted tf-idf vector

  • Represent each document as a weighted tf-idf vector

  • Compute the cosine similarity score for the query vector and each document vector

  • Rank documents with respect to the query by score

  • Return the top K (e.g., K = 10) to the user

Resources for today’s lecture

  • IIR 6.2 – 6.4.3


    • Term weighting and cosine similarity tutorial for SEO folk!


Information Retrieval and Web Search

Pandu Nayak and Prabhakar Raghavan

Lecture 7: Scoring and results assembly

Lecture 6 – I introduced a bug

  • In my anxiety to avoid taking the log of zero, I rewrote

  • In fact this was unnecessary, since the zero case is treated

  • specially above; net the FIRST version above is right.

Recap: tf-idf weighting

  • The tf-idf weight of a term is the product of its tf weight and its idf weight.

    wt,d= (1+log10 tft,d) x log10(N/dft)

  • Best known weighting scheme in information retrieval

  • Increases with the number of occurrences within a document

  • Increases with the rarity of the term in the collection

Recap: Queries as vectors

  • Key idea 1: Do the same for queries: represent them as vectors in the space

  • Key idea 2: Rank documents according to their proximity to the query in this space

  • proximity = similarity of vectors

Recap: cosine(query,document)

  • cos(q,d) is the cosine similarity of q and d … or,

  • equivalently, the cosine of the angle between q and d.

This lecture

  • Speeding up vector space ranking

  • Putting together a complete search system

    • Will require learning about a number of miscellaneous topics and heuristics

Computing cosine scores

Efficient cosine ranking

  • Find the K docs in the collection “nearest” to the query ⇒ K largest query-doc cosines.

  • Efficient ranking:

    • Computing a single cosine efficiently.

    • Choosing the K largest cosine values efficiently.

      • Can we do this without computing all N cosines?

Efficient cosine ranking

  • What we’re doing in effect: solving the K-nearest neighbor problem for a query vector

  • In general, we do not know how to do this efficiently for high-dimensional spaces

  • But it is solvable for short queries, and standard indexes support this well

Special case – unweighted queries

  • No weighting on query terms

    • Assume each query term occurs only once

  • Then for ranking, don’t need to normalize query vector

    • Slight simplification of algorithm from Lecture 6

Computing the K largest cosines: selection vs. sorting

  • Typically we want to retrieve the top K docs (in the cosine ranking for the query)

    • not to totally order all docs in the collection

  • Can we pick off docs with K highest cosines?

  • Let J = number of docs with nonzero cosines

    • We seek the K best of these J

Use heap for selecting top K

  • Binary tree in which each node’s value > the values of children

  • Takes 2J operations to construct, then each of K “winners” read off in 2log J steps.

  • For J=1M, K=100, this is about 10% of the cost of sorting.


  • Primary computational bottleneck in scoring: cosine computation

  • Can we avoid all this computation?

  • Yes, but may sometimes get it wrong

    • a doc not in the top K may creep into the list of K output docs

    • Is this such a bad thing?

Cosine similarity is only a proxy

  • User has a task and a query formulation

  • Cosine matches docs to query

  • Thus cosine is anyway a proxy for user happiness

  • If we get a list of K docs “close” to the top K by cosine measure, should be ok

Generic approach

  • Find a set A of contenders, with K < |A| << N

    • A does not necessarily contain the top K, but has many docs from among the top K

    • Return the top K docs in A

  • Think of A as pruning non-contenders

  • The same approach is also used for other (non-cosine) scoring functions

  • Will look at several schemes following this approach

Index elimination

  • Basic algorithm cosine computation algorithm only considers docs containing at least one query term

  • Take this further:

    • Only consider high-idf query terms

    • Only consider docs containing many query terms

High-idf query terms only

  • For a query such as catcher in the rye

  • Only accumulate scores from catcher and rye

  • Intuition: in and the contribute little to the scores and so don’t alter rank-ordering much

  • Benefit:

    • Postings of low-idf terms have many docs  these (many) docs get eliminated from set A of contenders

Docs containing many query terms

  • Any doc with at least one query term is a candidate for the top K output list

  • For multi-term queries, only compute scores for docs containing several of the query terms

    • Say, at least 3 out of 4

    • Imposes a “soft conjunction” on queries seen on web search engines (early Google)

  • Easy to implement in postings traversal

3 of 4 query terms

  • Scores only computed for docs 8, 16 and 32.

Champion lists

  • Precompute for each dictionary term t, the r docs of highest weight in t’s postings

    • Call this the champion list for t

    • (aka fancy list or top docs for t)

  • Note that r has to be chosen at index build time

    • Thus, it’s possible that r < K

  • At query time, only compute scores for docs in the champion list of some query term

    • Pick the K top-scoring docs from amongst these


  • How do Champion Lists relate to Index Elimination? Can they be used together?

  • How can Champion Lists be implemented in an inverted index?

    • Note that the champion list has nothing to do with small docIDs


  • Static quality scores

  • We want top-ranking documents to be both relevant and authoritative

  • Relevance is being modeled by cosine scores

  • Authority is typically a query-independent property of a document

  • Examples of authority signals

    • Wikipedia among websites

    • Articles in certain newspapers

    • A paper with many citations

    • Many bitly’s, diggs or marks

    • (Pagerank)

Modeling authority

  • Assign to each document a query-independent quality score in [0,1] to each document d

    • Denote this by g(d)

  • Thus, a quantity like the number of citations is scaled into [0,1]

    • Exercise: suggest a formula for this.

Net score

  • Consider a simple total score combining cosine relevance and authority

  • net-score(q,d) = g(d) + cosine(q,d)

    • Can use some other linear combination

    • Indeed, any function of the two “signals” of user happiness – more later

  • Now we seek the top K docs by net score

Top K by net score – fast methods

  • First idea: Order all postings by g(d)

  • Key: this is a common ordering for all postings

  • Thus, can concurrently traverse query terms’ postings for

    • Postings intersection

    • Cosine score computation

  • Exercise: write pseudocode for cosine score computation if postings are ordered by g(d)

Why order postings by g(d)?

  • Under g(d)-ordering, top-scoring docs likely to appear early in postings traversal

  • In time-bound applications (say, we have to return whatever search results we can in 50 ms), this allows us to stop postings traversal early

    • Short of computing scores for all docs in postings

Champion lists in g(d)-ordering

  • Can combine champion lists with g(d)-ordering

  • Maintain for each term a champion list of the r docs with highest g(d) + tf-idftd

  • Seek top-K results from only the docs in these champion lists 

High and low lists

  • For each term, we maintain two postings lists called high and low

    • Think of high as the champion list

  • When traversing postings on a query, only traverse high lists first

    • If we get more than K docs, select the top K and stop

    • Else proceed to get docs from the low lists

  • Can be used even for simple cosine scores, without global quality g(d)

  • A means for segmenting index into two tiers

Impact-ordered postings

  • We only want to compute scores for docs for which wft,d is high enough

  • We sort each postings list by wft,d

  • Now: not all postings in a common order!

  • How do we compute scores in order to pick off top K?

    • Two ideas follow

1. Early termination

  • When traversing t’s postings, stop early after either

    • a fixed number of r docs

    • wft,d drops below some threshold

  • Take the union of the resulting sets of docs

    • One from the postings of each query term

  • Compute only the scores for docs in this union

2. idf-ordered terms

  • When considering the postings of query terms

  • Look at them in order of decreasing idf

    • High idf terms likely to contribute most to score

  • As we update score contribution from each query term

    • Stop if doc scores relatively unchanged

  • Can apply to cosine or some other net scores

Cluster pruning: preprocessing

  • Pick √N docs at random: call these leaders

  • For every other doc, pre-compute nearest leader

    • Docs attached to a leader: its followers;

    • Likely: each leader has ~ N followers.

Cluster pruning: query processing

  • Process a query as follows:

    • Given query Q, find its nearest leader L.

    • Seek K nearest docs from among L’s followers.


Why use random sampling

  • Fast

  • Leaders reflect data distribution

General variants

  • Have each follower attached to b1=3 (say) nearest leaders.

  • From query, find b2=4 (say) nearest leaders and their followers.

  • Can recurse on leader/follower construction.


  • To find the nearest leader in step 1, how many cosine computations do we do?

    • Why did we have √N in the first place?

  • What is the effect of the constants b1, b2 on the previous slide?

  • Devise an example where this is likely to fail – i.e., we miss one of the K nearest docs.

    • Likely under random sampling.

Parametric and zone indexes

  • Thus far, a doc has been a sequence of terms

  • In fact documents have multiple parts, some with special semantics:
    • Author
    • Title
    • Date of publication
    • Language
    • Format
    • etc.
  • These constitute the metadata about a document


  • We sometimes wish to search by these metadata

    • E.g., find docs authored by William Shakespeare in the year 1601, containing alas poor Yorick

  • Year = 1601 is an example of a field

  • Also, author last name = shakespeare, etc.

  • Field or parametric index: postings for each field value

    • Sometimes build range trees (e.g., for dates)

  • Field query typically treated as conjunction

    • (doc must be authored by shakespeare)


  • A zone is a region of the doc that can contain an arbitrary amount of text, e.g.,

    • Title

    • Abstract

    • References …

  • Build inverted indexes on zones as well to permit querying

  • E.g., “find docs with merchant in the title zone and matching the query gentle rain”

Example zone indexes

Encode zones in dictionary vs. postings.


Tiered indexes

  • Break postings up into a hierarchy of lists

    • Most important

    • Least important

  • Can be done by g(d) or another measure

  • Inverted index thus broken up into tiers of decreasing importance

  • At query time use top tier unless it fails to yield K docs

    • If so drop to lower tiers

Example tiered index

Query term proximity

  • Free text queries: just a set of terms typed into the query box – common on the web

  • Users prefer docs in which query terms occur within close proximity of each other

  • Let w be the smallest window in a doc containing all query terms, e.g.,

  • For the query strained mercy the smallest window in the doc The quality of mercy is not strained is 4 (words)

  • Would like scoring function to take this into account – how?

Query parsers

  • Free text query from user may in fact spawn one or more queries to the indexes, e.g., query rising interest rates

    • Run the query as a phrase query

    • If <K docs contain the phrase rising interest rates, run the two phrase queries rising interest and interest rates

    • If we still have <K docs, run the vector space query rising interest rates

    • Rank matching docs by vector space scoring

  • This sequence is issued by a query parser

Aggregate scores

  • We’ve seen that score functions can combine cosine, static quality, proximity, etc.

  • How do we know the best combination?

  • Some applications – expert-tuned

  • Increasingly common: machine-learned

    • See May 19th lecture

Putting it all together


  • IIR 7, 6.1


Information Retrieval and Web Search

Pandu Nayak and Prabhakar Raghavan

Lecture 8: Evaluation

This lecture

  • How do we know if our results are any good?

    • Evaluating a search engine

      • Benchmarks

      • Precision and recall

  • Results summaries:

    • Making our good results usable to a user

Evaluating Search Engines

Measures for a search engine

  • How fast does it index

    • Number of documents/hour

    • (Average document size)

  • How fast does it search

    • Latency as a function of index size

  • Expressiveness of query language

    • Ability to express complex information needs

    • Speed on complex queries

  • Uncluttered UI

  • Is it free?

Measures for a search engine

  • All of the preceding criteria are measurable: we can quantify speed/size

    • we can make expressiveness precise

  • The key measure: user happiness

    • What is this?

    • Speed of response/size of index are factors

    • But blindingly fast, useless answers won’t make a user happy

  • Need a way of quantifying user happiness

Measuring user happiness

  • Issue: who is the user we are trying to make happy?

    • Depends on the setting

  • Web engine:

    • User finds what s/he wants and returns to the engine

      • Can measure rate of return users

    • User completes task – search as a means, not end

    • See Russell

  • eCommerce site: user finds what s/he wants and buys

    • Is it the end-user, or the eCommerce site, whose happiness we measure?

    • Measure time to purchase, or fraction of searchers who become buyers?

Measuring user happiness

  • Enterprise (company/govt/academic): Care about “user productivity”

    • How much time do my users save when looking for information?

    • Many other criteria having to do with breadth of access, secure access, etc.

Happiness: elusive to measure

  • Most common proxy: relevance of search results

  • But how do you measure relevance?

  • We will detail a methodology here, then examine its issues

  • Relevance measurement requires 3 elements:

  1. A benchmark document collection

  2. A benchmark suite of queries

  3. A usually binary assessment of either Relevant or Nonrelevant for each query and each document

        • Some work on more-than-binary, but not the standard

Evaluating an IR system

  • Note: the information need is translated into a query

  • Relevance is assessed relative to the information need not the query

  • E.g., Information need: I'm looking for information on whether drinking red wine is more effective at reducing your risk of heart attacks than white wine.

  • Query: wine red white heart attack effective

  • Evaluate whether the doc addresses the information need, not whether it has these words

Standard relevance benchmarks

  • TREC - National Institute of Standards and Technology (NIST) has run a large IR test bed for many years

  • Reuters and other benchmark doc collections used

  • “Retrieval tasks” specified

    • sometimes as queries

  • Human experts mark, for each query and for each doc, Relevant or Nonrelevant

    • or at least for subset of docs that some system returned for that query

Unranked retrieval evaluation:Precision and Recall

  • Precision: fraction of retrieved docs that are relevant = P(relevant|retrieved)

  • Recall: fraction of relevant docs that are retrieved

  • = P(retrieved|relevant)

  • Relevant

  • Nonrelevant

  • Retrieved

  • tp

  • fp

  • Not Retrieved

  • fn

  • tn

  • Precision P = tp/(tp + fp)
  • Recall R = tp/(tp + fn)

Should we instead use the accuracy measure for evaluation?

  • Given a query, an engine classifies each doc as “Relevant” or “Nonrelevant”

  • The accuracy of an engine: the fraction of these classifications that are correct

    • (tp + tn) / ( tp + fp + fn + tn)

  • Accuracy is a commonly used evaluation measure in machine learning classification work

  • Why is this not a very useful evaluation measure in IR?

Why not just use accuracy?

  • How to build a 99.9999% accurate search engine on a low budget….

  • People doing information retrieval want to find something and have a certain tolerance for junk.


  • You can get high recall (but low precision) by retrieving all docs for all queries!

  • Recall is a non-decreasing function of the number of docs retrieved

  • In a good system, precision decreases as either the number of docs retrieved or recall increases

    • This is not a theorem, but a result with strong empirical confirmation

Difficulties in using precision/recall

  • Should average over large document collection/query ensembles

  • Need human relevance assessments

    • People aren’t reliable assessors

  • Assessments have to be binary

    • Nuanced assessments?

  • Heavily skewed by collection/authorship

    • Results may not translate from one domain to another

A combined measure: F

  • Combined measure that assesses precision/recall tradeoff is F measure (weighted harmonic mean):

  • People usually use balanced F1 measure

    • i.e., with β = 1 or a= ½

  • Harmonic mean is a conservative average

    • See CJ van Rijsbergen, Information Retrieval

F1 and other averages

Evaluating ranked results

  • Evaluation of ranked results:

    • The system can return any number of results

    • By taking various numbers of the top returned documents (levels of recall), the evaluator can produce a precision-recall curve

A precision-recall curve

Averaging over queries

  • A precision-recall graph for one query isn’t a very sensible thing to look at

  • You need to average performance over a whole bunch of queries.

  • But there’s a technical issue:

    • Precision-recall calculations place some points on the graph

    • How do you determine a value (interpolate) between the points?

Interpolated precision

  • Idea: If locally precision increases with increasing recall, then you should get to count that…

  • So you take the max of precisions to right of value


  • Graphs are good, but people want summary measures!

    • Precision at fixed retrieval level

      • Precision-at-k: Precision of top k results

      • Perhaps appropriate for most of web search: all people want are good matches on the first one or two results pages

      • But: averages badly and has an arbitrary parameter of k

    • 11-point interpolated average precision

      • The standard measure in the early TREC competitions: you take the precision at 11 levels of recall varying from 0 to 1 by tenths of the documents, using interpolation (the value for 0 is always interpolated!), and average them

      • Evaluates performance at all recall levels

Typical (good) 11 point precisions

  • SabIR/Cornell 8A1 11pt precision from TREC 8 (1999)

Yet more evaluation measures…

  • Mean average precision (MAP)

    • Average of the precision value obtained for the top k documents, each time a relevant doc is retrieved

    • Avoids interpolation, use of fixed recall levels

    • MAP for query collection is arithmetic ave.

      • Macro-averaging: each query counts equally

  • R-precision

    • If we have a known (though perhaps incomplete) set of relevant documents of size Rel, then calculate precision of the top Rel docs returned

    • Perfect system could score 1.0.


  • For a test collection, it is usual that a system does crummily on some information needs (e.g., MAP = 0.1) and excellently on others (e.g., MAP = 0.7)

  • Indeed, it is usually the case that the variance in performance of the same system across queries is much greater than the variance of different systems on the same query.

  • That is, there are easy information needs and hard ones!

Creating Test Collections
for IR Evaluation

Test Collections

From document collections to test collections

  • Still need
    • Test queries
    • Relevance assessments

  • Test queries

    • Must be germane to docs available

    • Best designed by domain experts

    • Random query terms generally not a good idea

  • Relevance assessments

    • Human judges, time-consuming

    • Are human panels perfect?

Kappa measure for inter-judge (dis)agreement

  • Kappa measure

    • Agreement measure among judges

    • Designed for categorical judgments

    • Corrects for chance agreement

  • Kappa = [ P(A) – P(E) ] / [ 1 – P(E) ]

  • P(A) – proportion of time judges agree

  • P(E) – what agreement would be by chance

  • Kappa = 0 for chance agreement, 1 for total agreement.

Kappa Measure: Example

  • P(A)? P(E)?

  • Number of docs

  • Judge 1

  • Judge 2

  • 300

  • Relevant

  • Relevant

  • 70

  • Nonrelevant

  • Nonrelevant

  • 20

  • Relevant

  • Nonrelevant

  • 10

  • Nonrelevant

  • Relevant

Kappa Example

  • P(A) = 370/400 = 0.925

  • P(nonrelevant) = (10+20+70+70)/800 = 0.2125

  • P(relevant) = (10+20+300+300)/800 = 0.7878

  • P(E) = 0.2125^2 + 0.7878^2 = 0.665

  • Kappa = (0.925 – 0.665)/(1-0.665) = 0.776

  • Kappa > 0.8 = good agreement

  • 0.67 < Kappa < 0.8 -> “tentative conclusions” (Carletta ’96)

  • Depends on purpose of study

  • For >2 judges: average pairwise kappas 


  • TREC Ad Hoc task from first 8 TRECs is standard IR task

    • 50 detailed information needs a year

    • Human evaluation of pooled results returned

    • More recently other related things: Web track, HARD

  • A TREC query (TREC 5)

    • Number: 225

    • Description:

    • What is the main function of the Federal Emergency Management Agency (FEMA) and the funding level provided to meet emergencies? Also, what resources are available to FEMA such as people, equipment, facilities?

Standard relevance benchmarks: Others

  • GOV2

    • Another TREC/NIST collection

    • 25 million web pages

    • Largest collection that is easily available

    • But still 3 orders of magnitude smaller than what Google/Yahoo/MSN index


    • East Asian language and cross-language information retrieval

  • Cross Language Evaluation Forum (CLEF)

    • This evaluation series has concentrated on European languages and cross-language information retrieval.

  • Many others

Impact of Inter-judge Agreement

  • Impact on absolute performance measure can be significant (0.32 vs 0.39)

  • Little impact on ranking of different systems or relative performance

  • Suppose we want to know if algorithm A is better than algorithm B

  • A standard information retrieval experiment will give us a reliable answer to this question.

Critique of pure relevance

  • Relevance vs Marginal Relevance

    • A document can be redundant even if it is highly relevant

    • Duplicates

    • The same information from different sources

    • Marginal relevance is a better measure of utility for the user.

  • Using facts/entities as evaluation units more directly measures true relevance.

  • But harder to create evaluation set

  • See Carbonell reference

Can we avoid human judgment?

  • No

  • Makes experimental work hard

    • Especially on a large scale

  • In some very specific settings, can use proxies

    • E.g.: for approximate vector space retrieval, we can compare the cosine distance closeness of the closest docs to those found by an approximate retrieval algorithm

  • But once we have test collections, we can reuse them (so long as we don’t overtrain too badly)

Evaluation at large search engines

  • Search engines have test collections of queries and hand-ranked results

  • Recall is difficult to measure on the web

  • Search engines often use precision at top k, e.g., k = 10

  • . . . or measures that reward you more for getting rank 1 right than for getting rank 10 right.

    • NDCG (Normalized Cumulative Discounted Gain)

  • Search engines also use non-relevance-based measures.

    • Clickthrough on first result

      • Not very reliable if you look at a single clickthrough … but pretty reliable in the aggregate.

    • Studies of user behavior in the lab

    • A/B testing

A/B testing

  • Purpose: Test a single innovation

  • Prerequisite: You have a large search engine up and running.

  • Have most users use old system

  • Divert a small proportion of traffic (e.g., 1%) to the new system that includes the innovation

  • Evaluate with an “automatic” measure like clickthrough on first result

  • Now we can directly see if the innovation does improve user happiness.

  • Probably the evaluation methodology that large search engines trust most

  • In principle less powerful than doing a multivariate regression analysis, but easier to understand

Results presentation 

Result Summaries

  • Having ranked the documents matching a query, we wish to present a results list

  • Most commonly, a list of the document titles plus a short summary, aka “10 blue links”


  • The title is often automatically extracted from document metadata. What about the summaries?

    • This description is crucial.

    • User can identify good/relevant hits based on description.

  • Two basic kinds:

    • Static

    • Dynamic

  • A static summary of a document is always the same, regardless of the query that hit the doc

  • A dynamic summary is a query-dependent attempt to explain why the document was retrieved for the query at hand

Static summaries

  • In typical systems, the static summary is a subset of the document

  • Simplest heuristic: the first 50 (or so – this can be varied) words of the document

    • Summary cached at indexing time

  • More sophisticated: extract from each document a set of “key” sentences

    • Simple NLP heuristics to score each sentence

    • Summary is made up of top-scoring sentences.

  • Most sophisticated: NLP used to synthesize a summary

    • Seldom used in IR; cf. text summarization work

Dynamic summaries

  • Present one or more “windows” within the document that contain several of the query terms

    • “KWIC” snippets: Keyword in Context presentation

Techniques for dynamic summaries

  • Find small windows in doc that contain query terms

    • Requires fast window lookup in a document cache

  • Score each window wrt query

    • Use various features such as window width, position in document, etc.

    • Combine features through a scoring function – methodology to be covered Nov 12th

  • Challenges in evaluation: judging summaries

    • Easier to do pairwise comparisons rather than binary relevance assessments


  • For a navigational query such as united airlines user’s need likely satisfied on

  • Quicklinks provide navigational cues on that home page

Alternative results presentations?

Resources for this lecture

  • IIR 8

  • MIR Chapter 3

  • MG 4.5

  • Carbonell and Goldstein 1998. The use of MMR, diversity-based reranking for reordering documents and producing summaries. SIGIR 21.


Information Retrieval and Web Search

Pandu Nayak and Prabhakar Raghavan

Lecture 9: Query expansion 


  • Midterm in class on Thursday 28th

  • Material from first 8 lectures

  • Open book, open notes

  • You can use (and should bring!) a basic calculator

  • You cannot use any wired or wireless communication. Use of such communication will be regarded as an Honor Code violation.

  • You can preload the pdf of the book on to your laptop which you can use disconnected in the room.

Recap of the last lecture

  • Evaluating a search engine

    • Benchmarks

    • Precision and recall

  • Results summaries

Recap: Unranked retrieval evaluation:Precision and Recall

  • Precision: fraction of retrieved docs that are relevant = P(relevant|retrieved)

  • Recall: fraction of relevant docs that are retrieved = P(retrieved|relevant)

  • Precision P = tp/(tp + fp)

  • Recall R = tp/(tp + fn)

  • Relevant

  • Nonrelevant

  • Retrieved

  • tp

  • fp

  • Not Retrieved

  • fn

  • tn

Recap: A combined measure: F

  • Combined measure that assesses precision/recall tradeoff is F measure (weighted harmonic mean):

  • People usually use balanced F1 measure

    • i.e., with β = 1 or a= ½

  • Harmonic mean is a conservative average

    • See CJ van Rijsbergen, Information Retrieval

This lecture

  • Improving results

    • For high recall. E.g., searching for aircraft doesn’t match with plane; nor thermodynamic with heat

  • Options for improving results…

    • Global methods

      • Query expansion

        • Thesauri

        • Automatic thesaurus generation

    • Local methods

      • Relevance feedback

      • Pseudo relevance feedback

Relevance Feedback

  • Relevance feedback: user feedback on relevance of docs in initial set of results

    • User issues a (short, simple) query

    • The user marks some results as relevant or non-relevant.

    • The system computes a better representation of the information need based on feedback.

    • Relevance feedback can go through one or more iterations.

  • Idea: it may be difficult to formulate a good query when you don’t know the collection well, so iterate

Relevance feedback

  • We will use ad hoc retrieval to refer to regular retrieval without relevance feedback.

  • We now look at four examples of relevance feedback that highlight different aspects.

Similar pages

Relevance Feedback: Example

Results for Initial Query

Relevance Feedback

Results after Relevance Feedback

Ad hoc results for query canine           source: Fernando Diaz

Ad hoc results for query canine source: Fernando Diaz

User feedback: Select what is relevant                                    source: Fernando Diaz

Results after relevance feedback           source: Fernando Diaz

Initial query/results

  • Initial query: New space satellite applications

    • +     1. 0.539, 08/13/91, NASA Hasn’t Scrapped Imaging Spectrometer

    • +      2. 0.533, 07/09/91, NASA Scratches Environment Gear From Satellite Plan

    • 3. 0.528, 04/04/90, Science Panel Backs NASA Satellite Plan, But Urges Launches of Smaller Probes

    • 4. 0.526, 09/09/91, A NASA Satellite Project Accomplishes Incredible Feat: Staying Within Budget

    • 5. 0.525, 07/24/90, Scientist Who Exposed Global Warming Proposes Satellites for Climate Research

    • 6. 0.524, 08/22/90, Report Provides Support for the Critics Of Using Big Satellites to Study Climate

    • 7. 0.516, 04/13/87, Arianespace Receives Satellite Launch Pact From Telesat Canada

    • +      8. 0.509, 12/02/87, Telecommunications Tale of Two Companies

  • User then marks relevant documents with “+”.

Expanded query after relevance feedback

  • 2.074 new                                            15.106 space

  • 30.816 satellite                                     5.660 application

  • 5.991 nasa                                             5.196 eos

  • 4.196 launch                                          3.972 aster

  • 3.516 instrument                                   3.446 arianespace

  • 3.004 bundespost                                  2.806 ss

  • 2.790 rocket                                          2.053 scientist

  • 2.003 broadcast                                     1.172 earth

  • 0.836 oil                                                 0.646 measure

Results for expanded query

  • 2       1. 0.513, 07/09/91, NASA Scratches Environment Gear From Satellite Plan

  • 1        2. 0.500, 08/13/91, NASA Hasn’t Scrapped Imaging Spectrometer

  • 3. 0.493, 08/07/89, When the Pentagon Launches a Secret Satellite, Space Sleuths Do Some Spy Work of Their Own

  • 4. 0.493, 07/31/89, NASA Uses ‘Warm’ Superconductors For Fast Circuit

  • 8         5. 0.492, 12/02/87, Telecommunications Tale of Two Companies

  • 6. 0.491, 07/09/91, Soviets May Adapt Parts of SS-20 Missile For Commercial Use

  • 7. 0.490, 07/12/88, Gaping Gap: Pentagon Lags in Race To Match the Soviets In Rocket Launchers

  • 8. 0.490, 06/14/90, Rescue of Satellite By Space Agency To Cost $90 Million 

Key concept: Centroid

  • The centroid is the center of mass of a set of points

  • Recall that we represent documents as points in a high-dimensional space

  • Definition: Centroid

  • where C is a set of documents.

Rocchio Algorithm

  • The Rocchio algorithm uses the vector space model to pick a relevance feedback query

  • Rocchio seeks the query qop t that maximizes

  • Tries to separate docs marked relevant and non-relevant

  • Problem: we don’t know the truly relevant docs

The Theoretically Best Query

x non-relevant documents

o relevant documents

Rocchio 1971 Algorithm (SMART)

  • Used in practice:

  • Dr = set of known relevant doc vectors

  • Dnr = set of known irrelevant doc vectors

    • Different from Cr and Cnr           !

  • qm = modified query vector; q0 = original query vector; α,β,γ: weights (hand-chosen or set empirically)

  • New query moves toward relevant documents and away from irrelevant documents

Subtleties to note

  • Tradeoff α vs. β/γ : If we have a lot of judged documents, we want a higher β/γ.

  • Some weights in query vector can go negative

    • Negative term weights are ignored (set to 0)

Relevance feedback on initial query

x known non-relevant documents

o known relevant documents

Relevance Feedback in vector spaces

  • We can modify the query based on relevance feedback and apply standard vector space model.

  • Use only the docs that were marked.

  • Relevance feedback can improve recall and precision

  • Relevance feedback is most useful for increasing recall in situations where recall is important

    • Users can be expected to review results and to take time to iterate

Positive vs Negative Feedback

  • Positive feedback is more valuable than negative feedback (so, set γ< β; e.g. γ= 0.25, β = 0.75).

  • Many systems only allow positive feedback (γ=0).


Aside: Vector Space can be Counterintuitive

q1 query “cholera”


x other documents

High-dimensional Vector Spaces

  • The queries “cholera” and “john snow” are far from each other in vector space.

  • How can the document “John Snow and Cholera” be close to both of them?

  • Our intuitions for 2- and 3-dimensional space don't work in >10,000 dimensions.

  • 3 dimensions: If a document is close to many queries, then some of these queries must be close to each other.

  • Doesn't hold for a high-dimensional space.

Relevance Feedback: Assumptions

  • A1: User has sufficient knowledge for initial query.

  • A2: Relevance prototypes are “well-behaved”.

    • Term distribution in relevant documents will be similar

    • Term distribution in non-relevant documents will be different from those in relevant documents

      • Either: All relevant documents are tightly clustered around a single prototype.

      • Or: There are different prototypes, but they have significant vocabulary overlap.

      • Similarities between relevant and irrelevant documents are small

Violation of A1

  • User does not have sufficient initial knowledge.

  • Examples:

    • Misspellings (Brittany Speers).

    • Cross-language information retrieval (hígado).

    • Mismatch of searcher’s vocabulary vs. collection vocabulary

      • Cosmonaut/astronaut

Violation of A2

  • There are several relevance prototypes.

  • Examples:

    • Burma/Myanmar

    • Contradictory government policies

    • Pop stars that worked at Burger King

  • Often: instances of a general concept

  • Good editorial content can address problem

    • Report on contradictory government policies

Relevance Feedback: Problems

  • Long queries are inefficient for typical IR engine.                           ⇐  Why?

    • Long response times for user.

    • High cost for retrieval system.

    • Partial solution:

      • Only reweight certain prominent terms

        • Perhaps top 20 by term frequency

  • Users are often reluctant to provide explicit feedback

  • It’s often harder to understand why a particular document was retrieved after applying relevance feedback

Evaluation of relevance feedback strategies

  • Use q0 and compute precision and recall graph

  • Use qm and compute precision recall graph

    • Assess on all documents in the collection

        • Spectacular improvements, but … it’s cheating!

        • Partly due to known relevant documents ranked higher

        • Must evaluate with respect to documents not seen by user

    • Use documents in residual collection (set of documents minus those assessed relevant)

        • Measures usually then lower than for original query

        • But a more realistic evaluation

        • Relative performance can be validly compared

  • Empirically, one round of relevance feedback is often very useful. Two rounds is sometimes marginally useful.

Evaluation of relevance feedback

  • Second method – assess only the docs not rated by the user in the first round

    • Could make relevance feedback look worse than it really is

    • Can still assess relative performance of algorithms

  • Most satisfactory – use two collections each with their own relevance assessments

    • q0 and user feedback from first collection

    • qm run on second collection and measured

Evaluation: Caveat

  • True evaluation of usefulness must compare to other methods taking the same amount of time.

  • Alternative to relevance feedback: User revises and resubmits query.

  • Users may prefer revision/resubmission to having to judge relevance of documents.

  • There is no clear evidence that relevance feedback is the “best use” of the user’s time.

Relevance Feedback on the Web

  • Some search engines offer a similar/related pages feature (this is a trivial form of relevance feedback)

    • Google (link-based)                                          ←  α/β/γ ??

    • Altavista

    • Stanford WebBase

  • But some don’t because it’s hard to explain to average user:

    • Alltheweb

    • bing

    • Yahoo

  • Excite initially had true relevance feedback, but abandoned it due to lack of use.

Excite Relevance Feedback

  • Spink et al. 2000

  • Only about 4% of query sessions from a user used relevance feedback option

    • Expressed as “More like this” link next to each result

  • But about 70% of users only looked at first page of results and didn’t pursue things further

    • So 4% is about 1/8 of people extending search

  • Relevance feedback improved results about 2/3 of the time

Pseudo relevance feedback

  • Pseudo-relevance feedback automates the “manual” part of true relevance feedback.

  • Pseudo-relevance algorithm:

    • Retrieve a ranked list of hits for the user’s query

    • Assume that the top k documents are relevant.

    • Do relevance feedback (e.g., Rocchio)

  • Works very well on average

  • But can go horribly wrong for some queries.

  • Several iterations can cause query drift.

  • Why?

Query Expansion

  • In relevance feedback, users give additional input (relevant/non-relevant) on documents, which is used to reweight terms in the documents

  • In query expansion, users give additional input (good/bad search term) on words or phrase

Query assist

  • Would you expect such a feature to increase the query volume at a search engine?

How do we augment the user query?

  • Manual thesaurus

    • E.g. MedLine: physician, syn: doc, doctor, MD, medico

    • Can be query rather than just synonyms

  • Global Analysis: (static; of all documents in collection)

    • Automatically derived thesaurus

      • (co-occurrence statistics)

    • Refinements based on query log mining

      • Common on the web

  • Local Analysis: (dynamic)

    • Analysis of documents in result set

Thesaurus-based query expansion

  • For each term, t, in a query, expand the query with synonyms and related words of t from the thesaurus

    • feline → feline cat

  • May weight added terms less than original query terms.

  • Generally increases recall

  • Widely used in many science/engineering fields

  • May significantly decrease precision, particularly with ambiguous terms.

    • “interest rate”  “interest rate fascinate evaluate”

  • There is a high cost of manually producing a thesaurus

    • And for updating it for scientific changes

Automatic Thesaurus Generation

  • Attempt to generate a thesaurus automatically by analyzing the collection of documents

  • Fundamental notion: similarity between two words

  • Definition 1: Two words are similar if they co-occur with similar words.

  • Definition 2: Two words are similar if they occur in a given grammatical relation with the same words.

  • You can harvest, peel, eat, prepare, etc. apples and pears, so apples and pears must be similar.

  • Co-occurrence based is more robust, grammatical relations are more accurate.  



Co-occurrence Thesaurus

  • Simplest way to compute one is based on term-term similarities in C = AAT where A is term-document matrix.

  • wi,j = (normalized) weight for (ti ,dj)

  • For each ti, pick terms with high values in C

Example of manual thesaurus

Automatic Thesaurus Generation

Automatic Thesaurus Generation

  • Discussion

  • Quality of associations is usually a problem.

  • Term ambiguity may introduce irrelevant statistically correlated terms.

    • “Apple computer”  “Apple red fruit computer”

  • Problems:

    • False positives: Words deemed similar that are not

    • False negatives: Words deemed dissimilar that are similar

  • Since terms are highly correlated anyway, expansion may not retrieve many additional documents.

Indirect relevance feedback

  • On the web, DirectHit introduced a form of indirect relevance feedback.

  • DirectHit ranked documents higher that users look at more often.

    • Clicked on links are assumed likely to be relevant

      • Assuming the displayed summaries are good, etc.

  • Globally: Not necessarily user or query specific.

    • This is the general area of clickstream mining

  • Today – handled as part of machine-learned ranking


  • IIR Ch 9

  • MG Ch. 4.7

  • MIR Ch. 5.2 – 5.4


Text Retrieval and Mining
Winter 2005

Lecture 12

What is XML?

  • eXtensible Markup Language

  • A framework for defining markup languages

  • No fixed collection of markup tags

  • Each XML language targeted for application

  • All XML languages share features

  • Enables building of generic tools

Basic Structure

  • An XML document is an ordered, labeled tree

  • character data leaf nodes contain the actual data (text strings)

  • element nodes, are each labeled with

    • a name (often called the element type), and

    • a set of attributes, each consisting of a name and a value,

    • can have child nodes

XML Example

XML Example

<chapter id="cmds">

   <chaptitle>FileCab</chaptitle<para>This chapter describes the commands that manage the <tm>FileCab</tm>inet application.</para> </chapter>


  • Elements are denoted by markup tags

  • thetext

  • Element start tag: foo

  • Attribute: attr1

  • The character data: thetext

  • Matching element end tag:


  • HTML is a markup language for a specific purpose (display in browsers)

  • XML is a framework for defining markup languages

  • HTML can be formalized as an XML language (XHTML)

  • XML defines logical structure only

  • HTML: same intention, but has evolved into a presentation language

XML: Design Goals

  • Separate syntax from semantics to provide a common framework for structuring information

  • Allow tailor-made markup for any imaginable application domain

  • Support internationalization (Unicode) and platform independence

  • Be the future of (semi)structured information (do some of the work now done by databases)

Why Use XML?

  • Represent semi-structured data (data that are structured, but don’t fit relational model)

  • XML is more flexible than DBs

  • XML is more structured than simple IR

  • You get a massive infrastructure for free

Applications of XML


  • CML – chemical markup language

  • WML – wireless markup language

  • ThML – theological markup language

    • Having a Humble Opinion of Self

      EVERY man naturally desires knowledge

      Aristotle, Metaphysics, i. 1.

      ; but what good is knowledge without fear of God? Indeed a humble rustic who serves God is better than a proud intellectual who neglects his soul to study the course of the stars.

      Augustine, Confessions V. 4.

XML Schemas

  • Schema = syntax definition of XML language

  • Schema language = formal language for expressing XML schemas

  • Examples

    • Document Type Definition

    • XML Schema (W3C)

  • Relevance for XML IR

    • Our job is much easier if we have a (one) schema

XML Tutorial

  • (Anders Møller and Michael Schwartzbach)

  • Previous (and some following) slides are based on their tutorial

XML Indexing and Search

Native XML Database

  • Uses XML document as logical unit

  • Should support

    • Elements

    • Attributes

    • PCDATA (parsed character data)

    • Document order

  • Contrast with

    • DB modified for XML

    • Generic IR system modified for XML

XML Indexing and Search

  • Most native XML databases have taken a DB approach

    • Exact match

    • Evaluate path expressions

    • No IR type relevance ranking

  • Only a few that focus on relevance ranking

Data vs. Text-centric XML

  • Data-centric XML: used for messaging between enterprise applications

    • Mainly a recasting of relational data

  • Content-centric XML: used for annotating content

    • Rich in text

    • Demands good integration of text retrieval functionality

    • E.g., find me the ISBN #s of Books with at least three Chapters discussing cocoa production, ranked by Price

IR XML Challenge 1: Term Statistics

  • There is no document unit in XML

  • How do we compute tf and idf?

  • Global tf/idf over all text context is useless

  • Indexing granularity

IR XML Challenge 2: Fragments

  • IR systems don’t store content (only index)

  • Need to go to document for retrieving/displaying fragment

    • E.g., give me the Abstracts of Papers on existentialism

    • Where do you retrieve the Abstract from?

  • Easier in DB framework

IR XML Challenges 3: Schemas

  • Ideally:

    • There is one schema

    • User understands schema

  • In practice: rare

    • Many schemas

    • Schemas not known in advance

    • Schemas change

    • Users don’t understand schemas

  • Need to identify similar elements in different schemas

    • Example: employee

IR XML Challenges 4: UI

  • Help user find relevant nodes in schema

    • Author, editor, contributor, “from:”/sender

  • What is the query language you expose to the user?

    • Specific XML query language? No.

    • Forms? Parametric search?

    • A textbox?

  • In general: design layer between XML and user

IR XML Challenges 5: using a DB

  • Why you don’t want to use a DB

    • Spelling correction

    • Mid-word wildcards

    • Contains vs “is about”

    • DB has no notion of ordering

    • Relevance ranking

Querying XML

  • Today:

    • XQuery

    • XIRQL

  • Lecture 15

    • Vector space approaches


  • SQL for XML

  • Usage scenarios

    • Human-readable documents

    • Data-oriented documents

    • Mixed documents (e.g., patient records)

  • Relies on

    • XPath

    • XML Schema datatypes

  • Turing complete

  • XQuery is still a working draft.


  • The principal forms of XQuery expressions are:

    • path expressions

    • element constructors

    • FLWR ("flower") expressions

    • list expressions

    • conditional expressions

    • quantified expressions

    • datatype expressions

  • Evaluated with respect to a context


  • FOR $p IN document("bib.xml")//publisher LET $b := document("bib.xml”)//book[publisher = $p] WHERE count($b) > 100 RETURN $p

  • FOR generates an ordered list of bindings of publisher names to $p

  • LET associates to each binding a further binding of the list of book elements with that publisher to $b

  • at this stage, we have an ordered list of tuples of bindings: ($p,$b)

  • WHERE filters that list to retain only the desired tuples

  • RETURN constructs for each tuple a resulting value

Queries Supported by XQuery

  • Location/position (“chapter no.3”)

  • Simple attribute/value

    • /play/title contains “hamlet”

  • Path queries

    • title contains “hamlet”

    • /play//title contains “hamlet”

  • Complex graphs

    • Employees with two managers

  • Subsumes: hyperlinks

  • What about relevance ranking?

How XQuery makes ranking difficult

  • All documents in set A must be ranked above all documents in set B.

  • Fragments must be ordered in depth-first, left-to-right order.

XQuery: Order By Clause

  • for $d in document("depts.xml")//deptno

  • let $e := document("emps.xml")//emp[deptno = $d]

  • where count($e) >= 10

  • order by avg($e/salary) descending

  • return { $d, {count($e)}, {avg($e/salary)} }

XQuery Order By Clause

  • Order by clause only allows ordering by “overt” criterion

    • Say by an attribute value

  • Relevance ranking

    • Is often proprietary

    • Can’t be expressed easily as function of set to be ranked

    • Is better abstracted out of query formulation (cf. www)


  • University of Dortmund

    • Goal: open source XML search engine

  • Motivation

    • “Returnable” fragments are special

      • E.g., don’t return a some text fragment

    • Structured Document Retrieval Principle

    • Empower users who don’t know the schema

      • Enable search for any person no matter how schema encodes the data

      • Don’t worry about attribute/element

Atomic Units

  • Specified in schema

  • Only atomic units can be returned as result of search (unless unit specified)

  • Tf.idf weighting is applied to atomic units

  • Probabilistic combination of “evidence” from atomic units

XIRQL Indexing

Structured Document Retrieval Principle

  • A system should always retrieve the most specific part of a document answering a query.

  • Example query: xql

  • Document:

    • 0.3 XQL

    • 0.5 example

    • 0.8 XQL 0.7 syntax

    • Return section, not chapter

Augmentation weights

  • Ensure that Structured Document Retrieval Principle is respected.

  • Assume different query conditions are disjoint events -> independence.

  • P(chapter,XQL)=P(XQL|chapter)+P(section|chapter)*P(XQL|section) – P(XQL|chapter)*P(section|chapter)*P(XQL|section) = 0.3+0.6*0.8-0.3*0.6*0.8 = 0.636

  • Section ranked ahead of chapter


  • Example: person_name

  • Assign all elements and attributes with person semantics to this datatype

  • Allow user to search for “person” without specifying path

XIRQL: Summary

  • Relevance ranking

  • Fragment/context selection

  • Datatypes (person_name)

  • Semantic relativism

    • Attribute/element

Data structures for XML retrieval

A very basic introduction.

Data structures for XML retrieval

  • What are the primitives we need?

  • Inverted index: give me all elements matching text query Q

    • We know how to do this – treat each element as a document

  • Give me all elements (immediately) below any instance of the Book element

  • Combination of the above

Parent/child links

  • Number each element

  • Maintain a list of parent-child relationships

    • E.g., Chapter:21  Book:8

    • Enables immediate parent

  • But what about “the word Hamlet under a Scene element under a Play element?

General positional indexes

  • View the XML document as a text document

  • Build a positional index for each element

    • Mark the beginning and end for each element, e.g.,

  • Play →  Doc:1(27) →  Doc:1(2033) ---→  

  • /Play Doc:1(1122) →  Doc:1(5790) ---

  • Verse →  Doc:1(431)  → Doc:4(33)  ---

  • /Verse → Doc:1(867)  →  Doc:4(92) ---

  • Term:droppeth →  Doc:1(720)

Positional containment


                                                           droppeth under Verse under Play.

Summary of data structures

  • Path containment etc. can essentially be solved by positional inverted indexes

  • Retrieval consists of “merging” postings

  • All the compression tricks etc. from 276A are still applicable

  • Complications arise from insertion/deletion of elements, text within elements

    • Beyond the scope of this course


  • - XML resources at W3C

  • Norbert Fuhr and Kai Grossjohann. XIRQL: A query language for information retrieval in XML documents. In Proceedings of the 24th International ACM SIGIR Conference, New Orleans, Louisiana, September 2001.

  • ORDPATHs: Insert-Friendly XML Node Labels.



Text Retrieval and Mining

Lecture 10

Recap of the last lecture

  • Improving search results

    • Especially for high recall. E.g., searching for aircraft so it matches with plane; thermodynamic with heat

  • Options for improving results…

    • Global methods

      • Query expansion

        • Thesauri

        • Automatic thesaurus generation

      • Global indirect relevance feedback

    • Local methods

      • Relevance feedback

      • Pseudo relevance feedback

Probabilistic relevance feedback

  • Rather than reweighting in a vector space…

  • If user has told us some relevant and some irrelevant documents, then we can proceed to build a probabilistic classifier, such as a Naive Bayes model:

    • P(tk|R) = |Drk| / |Dr|

    • P(tk|NR) = |Dnrk| / |Dnr|

      • tk is a term; Dr is the set of known relevant documents; Drk is the subset that contain tk; Dnr is the set of known irrelevant documents; Dnrk is the subset that contain tk.

Why probabilities in IR?

  • In traditional IR systems, matching between each document and query is attempted in a semantically imprecise space of index terms.

  • Probabilities provide a principled foundation for uncertain reasoning. Can we use probabilities to quantify our uncertainties?

Probabilistic IR topics

  • Classical probabilistic retrieval model

    • Probability ranking principle, etc.

  • (Naïve) Bayesian Text Categorization

  • Bayesian networks for text retrieval

  • Language model approach to IR

    • An important emphasis in recent work

  • Probabilistic methods are one of the oldest but also one of the currently hottest topics in IR.

    • Traditionally: neat ideas, but they’ve never won on performance. It may be different now.

The document ranking problem

  • We have a collection of documents

  • User issues a query

  • A list of documents needs to be returned

  • Ranking method is core of an IR system:

    • In what order do we present documents to the user?

    • We want the “best” document to be first, second best second, etc….

  • Idea: Rank by probability of relevance of the document w.r.t. information need

    • P(relevant|documenti, query)

Recall a few probability basics

  • For events a and b:
  • Bayes’ Rule
  • Odds:

Probability Ranking Principle

  • Let x be a document in the collection.

  • Let R represent relevance of a document w.r.t. given (fixed) query and let NR represent non-relevance.       ⇐  R={0,1} vs. NR/R

  • Need to find p(R|x) - probability that a document is relevant.
  • p(x|R), p(x|NR) - probability that if a relevant (non-relevant) document is retrieved, it is x.

Probability Ranking Principle (PRP)

  • Simple case: no selection costs or other utility concerns that would differentially weight errors

  • Bayes’ Optimal Decision Rule

    • x is relevant iff p(R|x) > p(NR|x)

  • PRP in action: Rank all documents by p(R|x)

  • Theorem:

    • Using the PRP is optimal, in that it minimizes the loss (Bayes risk) under 1/0 loss

    • Provable if all probabilities correct, etc. [e.g., Ripley 1996]

Probability Ranking Principle

  • More complex case: retrieval costs.

    • Let d be a document

    • C - cost of retrieval of relevant document

    • C’ - cost of retrieval of non-relevant document

  • Probability Ranking Principle: if

  • for all d’ not yet retrieved, then d is the next document to be retrieved

  • We won’t further consider loss/utility from now on

Probability Ranking Principle

  • How do we compute all those probabilities?

    • Do not know exact probabilities, have to use estimates

    • Binary Independence Retrieval (BIR) – which we discuss later today – is the simplest model

  • Questionable assumptions

    • “Relevance” of each document is independent of relevance of other documents.

      • Really, it’s bad to keep on returning duplicates

    • Boolean model of relevance

    • That one has a single step information need

      • Seeing a range of results might let user refine query

Probabilistic Retrieval Strategy

  • Estimate how terms contribute to relevance

    • How do things like tf, df, and length influence your judgments about document relevance?

      • One answer is the Okapi formulae (S. Robertson)

  • Combine to find document relevance probability

  • Order documents by decreasing probability

The Probability Ranking Principle

  • “If a reference retrieval system's response to each request is a ranking of the documents in the collection in order of decreasing probability of relevance to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data have been made available to the system for this purpose, the overall effectiveness of the system to its user will be the best that is obtainable on the basis of those data.”

      • [1960s/1970s] S. Robertson, W.S. Cooper, M.E. Maron; van Rijsbergen (1979:113); Manning & Schütze (1999:538)

Probabilistic Ranking

  • Basic concept:

  • "For a given query, if we know some documents that are relevant, terms that occur in those documents should be given greater weighting in searching for other relevant documents.

  • By making assumptions about the distribution of terms and applying Bayes Theorem, it is possible to derive weights theoretically."

  • Van Rijsbergen

Binary Independence Model

  • Traditionally used in conjunction with PRP

  • “Binary” = Boolean: documents are represented as binary incidence vectors of terms (cf. lecture 1):

  • xi=1  iff term i is present in document x.
  • “Independence”: terms occur in documents independently

  • Different documents can be modeled as same vector

  • Bernoulli Naive Bayes model (cf. text categorization!)

Binary Independence Model

  • Queries: binary term incidence vectors

  • Given query q,

    • for each document d need to compute p(R|q,d).

    • replace with computing p(R|q,x) where x is binary term incidence vector representing d Interested only in ranking

  • Will use odds and Bayes’ Rule:

Binary Independence Model

  • Using Independence Assumption:

  • So :

Binary Independence Model

  • Since xi is either 0 or 1:

  • Let Pi = P(xi = 1 | R,q);  rt = p (xi =1 | NR,q)

  • Assume, for all terms not occurring in the query (qi=0) pi = ri

  • Then...                                                                   (This can be changed (e.g.,

                                                                                        in relevance feedback)

Binary Independence Model

Binary Independence Model

  • Retrieval Status Value:

Binary Independence Model

  • All boils down to computing RSV.

  • So, how do we compute ci’s from our data ?

Binary Independence Model

  • Estimating RSV coefficients.

  • For each term i look at this table of document counts:

Estimation – key challenge

  • If non-relevant documents are approximated by the whole collection, then ri (prob. of occurrence in non-relevant documents for query) is n/N and

    • log (1– ri)/ri = log (N– n)/n ≈ log N/n = IDF!

  • pi (probability of occurrence in relevant documents) can be estimated in various ways:

    • from relevant documents if know some

      • Relevance weighting can be used in feedback loop

    • constant (Croft and Harper combination match) – then just get idf weighting of terms

    • proportional to prob. of occurrence in collection

      • more accurately, to log of this (Greiff, SIGIR 1998) 

Iteratively estimating pi

  1. Assume that pi constant over all xi in query

      • pi = 0.5 (even odds) for any given doc

  2. Determine guess of relevant document set:

      • V is fixed size set of highest ranked documents on this model (note: now a bit like tf.idf!)

  3. We need to improve our guesses for pi and ri, so

      • Use distribution of xi in docs in V. Let Vi be set of documents containing xi

        • pi = |Vi| / |V|

      • Assume if not retrieved then not relevant

        • ri = (ni – |Vi|) / (N – |V|)

  4. Go to 2. until converges then return ranking

Probabilistic Relevance Feedback

  1. Guess a preliminary probabilistic description of R and use it to retrieve a first set of documents V, as above.
  2. Interact with the user to refine the description: learn some definite members of R and NR

  3. Reestimate pi and ri on the basis of these
      • Or can combine new information with original guess (use Bayesian prior):
  4. Repeat, thus generating a succession of approximations to R


  • Getting reasonable approximations of probabilities is possible.

  • Requires restrictive assumptions:

    • term independence

    • terms not in query don’t affect the outcome

    • boolean representation of documents/queries/relevance

    • document relevance values are independent

  • Some of these assumptions can be removed

  • Problem: either require partial relevance information or only can derive somewhat inferior term weights

Removing term independence

  • In general, index terms aren’t independent

  • Dependencies can be complex

  • van Rijsbergen (1979) proposed model of simple tree dependencies

    • Exactly Friedman and Goldszmidt’s Tree Augmented Naive Bayes (AAAI 13, 1996)

  • Each term dependent on one other

  • In 1970s, estimation problems held back success of this model

Food for thought

  • Think through the differences between standard tf.idf and the probabilistic retrieval model in the first iteration

  • Think through the differences between vector space (pseudo) relevance feedback and probabilistic (pseudo) relevance feedback

Good and Bad News

  • Standard Vector Space Model

    • Empirical for the most part; success measured by results

    • Few properties provable

  • Probabilistic Model Advantages

    • Based on a firm theoretical foundation

    • Theoretically justified optimal ranking scheme

  • Disadvantages

    • Making the initial guess to get V

    • Binary word-in-doc weights (not using term frequencies)

    • Independence of terms (can be alleviated)

    • Amount of computation

    • Has never worked convincingly better in practice

Bayesian Networks for Text Retrieval (Turtle and Croft 1990)

  • Standard probabilistic model assumes you can’t estimate P(R|D,Q)

    • Instead assume independence and use P(D|R)

  • But maybe you can with a Bayesian network*

  • What is a Bayesian network?

    • A directed acyclic graph

    • Nodes

      • Events or Variables

        • Assume values.

        • For our purposes, all Boolean

    • Links

      • model direct dependencies between nodes

Bayesian Networks

  •  a,b,c - propositions (events). Bayesian networks model causal relations between events

  • Inference in Bayesian Nets:

    • Given probability distributions for roots and conditional probabilities can compute apriori probability of any instance

    • Fixing assumptions (e.g., was observed) will cause recomputation of probabilities

  • For more information see:

  • R.G. Cowell, A.P. Dawid, S.L. Lauritzen, and D.J. Spiegelhalter. 1999. Probabilistic Networks and Expert Systems. Springer Verlag.

  • J. Pearl. 1988. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan-Kaufman.

Toy Example

Independence Assumptions

  • Independence assumption:  P(t|g, f)=P(t|g)

  • Joint probability 

    P(f d n g t) =P(f) P(d) P(n|f) P(g|f d) P(t|g)

Chained inference

  • Evidence - a node takes on some value

  • Inference

    • Compute belief (probabilities) of other nodes

      • conditioned on the known evidence

    • Two kinds of inference: Diagnostic and Predictive

  • Computational complexity

    • General network: NP-hard

      • Tree-like networks are easily tractable

      • Much other work on efficient exact and approximate Bayesian network inference

        • Clever dynamic programming

        • Approximate inference (“loopy belief propagation”)

Model for Text Retrieval

  • Goal

    • Given a user’s information need (evidence), find probability a doc satisfies need

  • Retrieval model

    • Model docs in a document network

    • Model information need in a query network

Bayesian Nets for IR: Idea

Bayesian Nets for IR

  • Construct Document Network (once !)

  • For each query

    • Construct best Query Network

    • Attach it to Document Network

    • Find subset of di’s which maximizes the probability value of node I (best subset).

    • Retrieve these di’s as the answer to query.

Bayesian nets for text retrieval

Example: “reason trouble –two”

Link matrices and probabilities

  • Prior doc probability P(d) = 1/n

  • P(r|d)

    • within-document term frequency

    • tf x idf - based

  • P(c|r)

    • 1-to-1

    • thesaurus

  • P(q|c): canonical forms of query operators

    • Always use things like AND and NOT – never store a full CPT*

  • *conditional probability table


  • Prior probs don’t have to be 1/n.

  • “User information need” doesn’t have to be a query - can be words typed, in docs read, any combination …

  • Phrases, inter-document links

  • Link matrices can be modified over time.

    • User feedback.

    • The promise of “personalization”

Computational details

  • Document network built at indexing time

  • Query network built/scored at query time

  • Representation:

    • Link matrices from docs to any single term are like the postings entry for that term

    • Canonical link matrices are efficient to store and compute

  • Attach evidence only at roots of network

    • Can do single pass from roots to leaves

Bayes Nets in IR

  • Flexible ways of combining term weights, which can generalize previous approaches

    • Boolean model

    • Binary independence model

    • Probabilistic models with weaker assumptions

  • Efficient large-scale implementation

    • InQuery text retrieval system from U Mass

      • Turtle and Croft (1990) [Commercial version defunct?]

  • Need approximations to avoid intractable inference

  • Need to estimate all the probabilities by some means (whether more or less ad hoc)

  • Much new Bayes net technology yet to be applied?


  • S. E. Robertson and K. Spärck Jones. 1976. Relevance Weighting of Search Terms. Journal of the American Society for Information Sciences 27(3): 129–146.

  • C. J. van Rijsbergen. 1979. Information Retrieval. 2nd ed. London: Butterworths, chapter 6. [Most details of math]

  • N. Fuhr. 1992. Probabilistic Models in Information Retrieval. The Computer Journal, 35(3),243–255. [Easiest read, with BNs]

  • F. Crestani, M. Lalmas, C. J. van Rijsbergen, and I. Campbell. 1998. Is This Document Relevant? ... Probably: A Survey of Probabilistic Models in Information Retrieval. ACM Computing Surveys 30(4): 528–552.


  • [Adds very little material that isn’t in van Rijsbergen or Fuhr ]


  • H.R. Turtle and W.B. Croft. 1990. Inference Networks for Document Retrieval. Proc. ACM SIGIR: 1-24.

  • E. Charniak. Bayesian nets without tears. AI Magazine 12(4): 50-63 (1991).

  • D. Heckerman. 1995. A Tutorial on Learning with Bayesian Networks. Microsoft Technical Report MSR-TR-95-06


  • N. Fuhr. 2000. Probabilistic Datalog: Implementing Logical Information Retrieval for Advanced Applications. Journal of the American Society for Information Science 51(2): 95–110.

  • R. K. Belew. 2001. Finding Out About: A Cognitive Perspective on Search Engine Technology and the WWW. Cambridge UP 2001.

  • MIR 2.5.4, 2.8


Text Retrieval and Mining

Lecture 12

[Borrows slides from Viktor Lavrenko and Chengxiang Zhai]


  • Probabilistic models:
    Naïve Bayes Text Classification

    • Introduction to Text Classification

    • Probabilistic Language Models

    • Naïve Bayes text categorization


  • The Language Model Approach to IR

    • Basic query generation model

    • Alternative models

Standard Probabilistic IR

IR based on Language Model (LM)

  • A common search heuristic is to use words that you expect to find in matching documents as your query – why, I saw Sergey Brin advocating that strategy on late night TV one night in my hotel room, so it must be good!
  • The LM approach directly exploits that idea!

Formal Language (Model)

  • Traditional generative model: generates strings

    • Finite state machines or regular grammars, etc.

  • Example:

 I wish
 I wish I wish
 I wish I wish I wish
 I wish I wish I wish I wish
 *wish I wish

Stochastic Language Models

  • Models probability of generating strings in the language (commonly all strings over alphabet ∑)  

Stochastic Language Models

  • Model probability of generating any string

Stochastic Language Models

  • A statistical model for generating text

    • Probability distribution over strings in a given language

Unigram and higher-order models

  • Unigram Language Models (Easy , effective !)
  • Bigram (generally, n-gram) Language Models
  • Other Language Models
  • Grammar-based models (PCFGs), etc.
  • Probably not the first thing to try in IR
  • Using Language Models in IR

    • Treat each document as the basis for a model (e.g., unigram sufficient statistics)

    • Rank document d based on P(d | q)

    • P(d | q) = P(q | d) x P(d) / P(q)

      • P(q) is the same for all documents, so ignore

      • P(d) [the prior] is often treated as the same for all d

        • But we could use criteria like authority, length, genre

      • P(q | d) is the probability of q given d’s model

    • Very general formal approach

    The fundamental problem of LMs

    • Usually we don’t know the model M

      • But have a sample of text representative of that model

    • Estimate a language model from a sample

    • Then compute the observation probability

    Language Models for IR

    • Language Modeling Approaches

      • Attempt to model query generation process

      • Documents are ranked by the probability that a query would be observed as a random sample from the respective document model

        • Multinomial approach

    Retrieval based on probabilistic LM

    • Treat the generation of queries as a random process.

    • Approach

      • Infer a language model for each document.

      • Estimate the probability of generating the query according to each of these models.

      • Rank the documents according to these probabilities.

      • Usually a unigram estimate of words is used

        • Some work on bigrams, paralleling van Rijsbergen

    Retrieval based on probabilistic LM

    • Intuition

      • Users …

        • Have a reasonable idea of terms that are likely to occur in documents of interest.

        • They will choose query terms that distinguish these documents from others in the collection.

    • Collection statistics …

      • Are integral parts of the language model.

      • Are not used heuristically as in many other approaches.

        • In theory. In practice, there’s usually some wiggle room for empirically set parameters

    Query generation probability (1)

    • Ranking formula

    • The probability of producing the query given the language model of document d using MLE is:

    Insufficient data

    • Zero probability   p(t | Md) = 0

      • May not wish to assign a probability of zero to a document that is missing one or more of the query terms [gives conjunction semantics]

    • General approach

      • A non-occurring term is possible, but no more likely than would be expected by chance in the collection.

      • If :

    Insufficient data

    • Zero probabilities spell disaster

      • We need to smooth probabilities

        • Discount nonzero probabilities

        • Give some probability mass to unseen things

    • There’s a wide space of approaches to smoothing probability distributions to deal with this problem, such as adding 1, ½ or  to counts, Dirichlet priors, discounting, and interpolation

      • [See FSNLP ch. 6 or CS224N if you want more]

    • A simple idea that works well in practice is to use a mixture between the document multinomial and the collection multinomial distribution

    Mixture model

    • P(w|d) = lPmle(w|Md) + (1 – l)Pmle(w|Mc)

    • Mixes the probability from the document with the general collection frequency of the word.

    • Correctly setting lis very important

    • A high value of lambda makes the search “conjunctive-like” – suitable for short queries

    • A low value is more suitable for long queries

    • Can tune l to optimize performance

      • Perhaps make it dependent on document size (cf. Dirichlet prior or Witten-Bell smoothing)

    Basic mixture model summary

    • General formulation of the LM for IR

      • The user has a document in mind, and generates the query from this document.

      • The equation represents the probability that the document that the user had in mind was in fact this one.

    • general language model

    • individual-document model


    • Document collection (2 documents)

      • d1: Xerox reports a profit but revenue is down

      • d2: Lucent narrows quarter loss but revenue decreases further

    • Model: MLE unigram from documents; l = ½

    • Query: revenue down

      • P(Q|d1) = [(1/8 + 2/16)/2] x [(1/8 + 1/16)/2]

      • = 1/8 x 3/32 = 3/256

      • P(Q|d2) = [(1/8 + 2/16)/2] x [(0 + 1/16)/2]

      • = 1/8 x 1/32 = 1/256

    • Ranking: d1 > d2

    Ponte and Croft Experiments

    • Data

      • TREC topics 202-250 on TREC disks 2 and 3

        • Natural language queries consisting of one sentence each

      • TREC topics 51-100 on TREC disk 3 using the concept fields

        • Lists of good terms

    Topic: Satellite Launch Contracts</p></LI></UL><UL _tmplitem="15" ><LI _tmplitem="15" ><p _tmplitem="15" ><desc _tmplitem="15" >Description:</p></LI></UL><UL _tmplitem="15" ><LI _tmplitem="15" ><p _tmplitem="15" >… </desc></p></LI></UL><UL _tmplitem="15" ><LI _tmplitem="15" ><p _tmplitem="15" ><con _tmplitem="15" >Concept(s):</p></LI></UL><OL _tmplitem="15" ><LI _tmplitem="15" ><p _tmplitem="15" >Contract, agreement</p><LI _tmplitem="15" ><p _tmplitem="15" >Launch vehicle, rocket, payload, satellite</p><LI _tmplitem="15" ><p _tmplitem="15" >Launch services, … </con></p></OL></div> </div> </div></div> <div _tmplitem="15" class="slide-metadata"> <hr _tmplitem="15" > <small _tmplitem="15" >Speaker Notes:</small> <div _tmplitem="15" class="slide-note" id="slide_note_tree-273-slide-4242-22"> <p _tmplitem="15" >« Click to add note »</p> </div> <small _tmplitem="15" >Created by <a _tmplitem="15" href="user/29">Tgbyrdmc</a>.</small> <br _tmplitem="15" / ><br _tmplitem="15" /> </div> </div> <div _tmplitem="15" class="slide " id="tree-273-slide-4243-23-view"> <!-- <span _tmplitem="15" onclick="filterLatex(" id="4243"> Click me! </span> --> <div _tmplitem="15" class="slide-content"><div _tmplitem="15" class="slide-scaler"> <div _tmplitem="15" class="slide-header"> <h2 _tmplitem="15" > <div _tmplitem="15" class="slide-title" id="slide_title_tree-273-slide-4243-23"> Precision/recall results 202-250 </div> </h2> </div> <div _tmplitem="15" class="slide-body" id="slide_body_tree-273-slide-4243-23" onclick="enableWYSISWYG(this)"> <div _tmplitem="15" ><UL _tmplitem="15" ><LI _tmplitem="15" ><p _tmplitem="15" ></p></LI></UL></div><div _tmplitem="15" ><img _tmplitem="15" data-src=""/></div> </div> </div></div> <div _tmplitem="15" class="slide-metadata"> <hr _tmplitem="15" > <small _tmplitem="15" >Speaker Notes:</small> <div _tmplitem="15" class="slide-note" id="slide_note_tree-273-slide-4243-23"> <p _tmplitem="15" >« Click to add note »</p> </div> <small _tmplitem="15" >Created by <a _tmplitem="15" href="user/29">Tgbyrdmc</a>.</small> <br _tmplitem="15" / ><br _tmplitem="15" /> </div> </div> <div _tmplitem="15" class="slide " id="tree-273-slide-4244-24-view"> <!-- <span _tmplitem="15" onclick="filterLatex(" id="4244"> Click me! </span> --> <div _tmplitem="15" class="slide-content"><div _tmplitem="15" class="slide-scaler"> <div _tmplitem="15" class="slide-header"> <h2 _tmplitem="15" > <div _tmplitem="15" class="slide-title" id="slide_title_tree-273-slide-4244-24"> Precision/recall results 51-100 </div> </h2> </div> <div _tmplitem="15" class="slide-body" id="slide_body_tree-273-slide-4244-24" onclick="enableWYSISWYG(this)"> <div _tmplitem="15" ><UL _tmplitem="15" ><LI _tmplitem="15" ><p _tmplitem="15" ></p></LI></UL></div><div _tmplitem="15" ><img _tmplitem="15" src=""/></div> </div> </div></div> <div _tmplitem="15" class="slide-metadata"> <hr _tmplitem="15" > <small _tmplitem="15" >Speaker Notes:</small> <div _tmplitem="15" class="slide-note" id="slide_note_tree-273-slide-4244-24"> <p _tmplitem="15" >« Click to add note »</p> </div> <small _tmplitem="15" >Created by <a _tmplitem="15" href="user/29">Tgbyrdmc</a>.</small> <br _tmplitem="15" / ><br _tmplitem="15" /> </div> </div> <div _tmplitem="15" class="slide " id="tree-273-slide-4245-25-view"> <!-- <span _tmplitem="15" onclick="filterLatex(" id="4245"> Click me! </span> --> <div _tmplitem="15" class="slide-content"><div _tmplitem="15" class="slide-scaler"> <div _tmplitem="15" class="slide-header"> <h2 _tmplitem="15" > <div _tmplitem="15" class="slide-title" id="slide_title_tree-273-slide-4245-25"> The main difference is whether “Relevance” figures explicitly in the model or not </div> </h2> </div> <div _tmplitem="15" class="slide-body" id="slide_body_tree-273-slide-4245-25" onclick="enableWYSISWYG(this)"> <div _tmplitem="15" ><UL _tmplitem="15" ><LI _tmplitem="15" ><p _tmplitem="15" ></p></LI></UL><UL _tmplitem="15" ><UL _tmplitem="15" ><LI _tmplitem="15" ><p _tmplitem="15" >LM approach attempts to do away with modeling relevance</p></LI></UL></UL><UL _tmplitem="15" ><LI _tmplitem="15" ><p _tmplitem="15" >LM approach asssumes that documents and expressions of information problems are of the same type</p></LI></UL><UL _tmplitem="15" ><LI _tmplitem="15" ><p _tmplitem="15" >Computationally tractable, intuitively appealing</p></LI></UL></div><div _tmplitem="15" ><UL _tmplitem="15" ><LI _tmplitem="15" ><p _tmplitem="15" >LM vs. Prob. Model for IR</p></LI></UL></div> </div> </div></div> <div _tmplitem="15" class="slide-metadata"> <hr _tmplitem="15" > <small _tmplitem="15" >Speaker Notes:</small> <div _tmplitem="15" class="slide-note" id="slide_note_tree-273-slide-4245-25"> <p _tmplitem="15" >« Click to add note »</p> </div> <small _tmplitem="15" >Created by <a _tmplitem="15" href="user/29">Tgbyrdmc</a>.</small> <br _tmplitem="15" / ><br _tmplitem="15" /> </div> </div> <div _tmplitem="15" class="slide " id="tree-273-slide-4246-26-view"> <!-- <span _tmplitem="15" onclick="filterLatex(" id="4246"> Click me! </span> --> <div _tmplitem="15" class="slide-content"><div _tmplitem="15" class="slide-scaler"> <div _tmplitem="15" class="slide-header"> <h2 _tmplitem="15" > <div _tmplitem="15" class="slide-title" id="slide_title_tree-273-slide-4246-26"> Problems of basic LM approach </div> </h2> </div> <div _tmplitem="15" class="slide-body" id="slide_body_tree-273-slide-4246-26" onclick="enableWYSISWYG(this)"> <div _tmplitem="15" ><UL _tmplitem="15" ><LI _tmplitem="15" ><p _tmplitem="15" ></p></LI></UL><UL _tmplitem="15" ><UL _tmplitem="15" ><LI _tmplitem="15" ><p _tmplitem="15" >Assumption of equivalence between document and information problem representation is unrealistic</p></LI></UL></UL><UL _tmplitem="15" ><UL _tmplitem="15" ><LI _tmplitem="15" ><p _tmplitem="15" >Very simple models of language</p></LI></UL></UL><UL _tmplitem="15" ><UL _tmplitem="15" ><LI _tmplitem="15" ><p _tmplitem="15" >Relevance feedback is difficult to integrate, as are user preferences, and other general issues of relevance</p></LI></UL></UL><UL _tmplitem="15" ><UL _tmplitem="15" ><LI _tmplitem="15" ><p _tmplitem="15" >Can’t easily accommodate phrases, passages, Boolean operators</p></LI></UL></UL><UL _tmplitem="15" ><LI _tmplitem="15" ><p _tmplitem="15" >Current extensions focus on putting relevance back into the model, etc.</p></LI></UL></div><div _tmplitem="15" ><UL _tmplitem="15" ><LI _tmplitem="15" ><p _tmplitem="15" >LM vs. Prob. Model for IR</p></LI></UL></div> </div> </div></div> <div _tmplitem="15" class="slide-metadata"> <hr _tmplitem="15" > <small _tmplitem="15" >Speaker Notes:</small> <div _tmplitem="15" class="slide-note" id="slide_note_tree-273-slide-4246-26"> <p _tmplitem="15" >« Click to add note »</p> </div> <small _tmplitem="15" >Created by <a _tmplitem="15" href="user/29">Tgbyrdmc</a>.</small> <br _tmplitem="15" / ><br _tmplitem="15" /> </div> </div> <div _tmplitem="15" class="slide " id="tree-273-slide-4247-27-view"> <!-- <span _tmplitem="15" onclick="filterLatex(" id="4247"> Click me! </span> --> <div _tmplitem="15" class="slide-content"><div _tmplitem="15" class="slide-scaler"> <div _tmplitem="15" class="slide-header"> <h2 _tmplitem="15" > <div _tmplitem="15" class="slide-title" id="slide_title_tree-273-slide-4247-27"> Extension: 3-level model </div> </h2> </div> <div _tmplitem="15" class="slide-body" id="slide_body_tree-273-slide-4247-27" onclick="enableWYSISWYG(this)"> <div _tmplitem="15" ><UL _tmplitem="15" ><LI _tmplitem="15" ><p _tmplitem="15" ></p></LI></UL></div><div _tmplitem="15" ><UL _tmplitem="15" ><LI _tmplitem="15" ><p _tmplitem="15" >3-level model</p></LI></UL><OL _tmplitem="15" ><LI _tmplitem="15" ><p _tmplitem="15" >Whole collection model ( )</p><LI _tmplitem="15" ><p _tmplitem="15" >Specific-topic model; relevant-documents model ( )</p><LI _tmplitem="15" ><p _tmplitem="15" >Individual-document model ( )</p><UL _tmplitem="15" ><LI _tmplitem="15" ><p _tmplitem="15" >Relevance hypothesis</p></LI></UL><UL _tmplitem="15" ><UL _tmplitem="15" ><LI _tmplitem="15" ><p _tmplitem="15" >A request(query; topic) is generated from a specific-topic model { , }.</p></LI></UL></UL><UL _tmplitem="15" ><UL _tmplitem="15" ><LI _tmplitem="15" ><p _tmplitem="15" >Iff a document is relevant to the topic, the same model will apply to the document.</p></LI></UL></UL><UL _tmplitem="15" ><UL _tmplitem="15" ><UL _tmplitem="15" ><LI _tmplitem="15" ><p _tmplitem="15" >It will replace part of the individual-document model in explaining the document.</p></LI></UL></UL></UL><UL _tmplitem="15" ><UL _tmplitem="15" ><LI _tmplitem="15" ><p _tmplitem="15" >The probability of relevance of a document</p></LI></UL></UL><UL _tmplitem="15" ><UL _tmplitem="15" ><UL _tmplitem="15" ><LI _tmplitem="15" ><p _tmplitem="15" >The probability that this model explains part of the document</p></LI></UL></UL></UL><UL _tmplitem="15" ><UL _tmplitem="15" ><UL _tmplitem="15" ><LI _tmplitem="15" ><p _tmplitem="15" >The probability that the { , , } combination is better than the { , } combination</p></LI></UL></UL></UL></div><div _tmplitem="15" ><img _tmplitem="15" src=""/></div><div _tmplitem="15" ><img _tmplitem="15" src=""/></div><div _tmplitem="15" ><img _tmplitem="15" src=""/></div><div _tmplitem="15" ><img _tmplitem="15" src=""/></div><div _tmplitem="15" ><img _tmplitem="15" src=""/></div><div _tmplitem="15" ><img _tmplitem="15" src=""/></div><div _tmplitem="15" ><img _tmplitem="15" src=""/></div><div _tmplitem="15" ><img _tmplitem="15" src=""/></div><div _tmplitem="15" ><img _tmplitem="15" src=""/></div><div _tmplitem="15" ><img _tmplitem="15" src=""/></div> </div> </div></div> <div _tmplitem="15" class="slide-metadata"> <hr _tmplitem="15" > <small _tmplitem="15" >Speaker Notes:</small> <div _tmplitem="15" class="slide-note" id="slide_note_tree-273-slide-4247-27"> <p _tmplitem="15" >« Click to add note »</p> </div> <small _tmplitem="15" >Created by <a _tmplitem="15" href="user/29">Tgbyrdmc</a>.</small> <br _tmplitem="15" / ><br _tmplitem="15" /> </div> </div>

    Precision/recall results 202-250

    Precision/recall results 51-100

    The main difference is whether “Relevance” figures explicitly in the model or not

    • LM approach attempts to do away with modeling relevance

    • LM approach asssumes that documents and expressions of information problems are of the same type

    • Computationally tractable, intuitively appealing

    • LM vs. Prob. Model for IR

    Problems of basic LM approach

    • Assumption of equivalence between document and information problem representation is unrealistic

      • Very simple models of language

      • Relevance feedback is difficult to integrate, as are user preferences, and other general issues of relevance

      • Can’t easily accommodate phrases, passages, Boolean operators

    • Current extensions focus on putting relevance back into the model, etc.

    • LM vs. Prob. Model for IR

    Extension: 3-level model

    • 3-level model

    1. Whole collection model (Mc)

    2. Specific-topic model; relevant-documents model (Mt)

    3. Individual-document model (Md)

      • Relevance hypothesis

        • A request(query; topic) is generated from a specific-topic model { Mc, Mt}.

        • Iff a document is relevant to the topic, the same model will apply to the document.

          • It will replace part of the individual-document model in explaining the document.

        • The probability of relevance of a document

          • The probability that this model explains part of the document

          • The probability that the {Mc ,Mt ,Md } combination is better than the {Mc ,Md } combination

    3-level model

    Alternative Models of Text Generation

    Retrieval Using Language Models

    • Retrieval: Query likelihood (1), Document likelihood (2), Model comparison (3)

    Query Likelihood

    • P(Q|Dm)

    • Major issue is estimating document model

      • i.e. smoothing techniques instead of tf.idf weights

    • Good retrieval results

      • e.g. UMass, BBN, Twente, CMU

    • Problems dealing with relevance feedback, query expansion, structured queries

    Document Likelihood

    • Rank by likelihood ratio P(D|R)/P(D|NR)

      • treat as a generation problem

      • P(w|R) is estimated by P(w|Qm)

      • Qm is the query or relevance model

      • P(w|NR) is estimated by collection probabilities P(w)

    • Issue is estimation of query model

      • Treat query as generated by mixture of topic and background

      • Estimate relevance model from related documents (query expansion)

      • Relevance feedback is easily incorporated

    • Good retrieval results

      • e.g. UMass at SIGIR 01

      • inconsistent with heterogeneous document collections

    Model Comparison

    • Estimate query and document models and compare

    • Suitable measure is KL divergence D(Qm||Dm)

      • equivalent to query-likelihood approach if simple empirical distribution used for query model

    • More general risk minimization framework has been proposed

      • Zhai and Lafferty 2001

    • Better results than query-likelihood or document-likelihood approaches

    Two-stage smoothing:Another Reason for Smoothing

    • p( “algorithms”|d1) = p(“algorithm”|d2)

    • p( “data”|d1) < p(“data”|d2)

    • p( “mining”|d1) < p(“mining”|d2)

    • But p(q|d1)>p(q|d2)!

    • We should make p(“the”) and p(“for”) less different for all docs. 

    Two-stage Smoothing

    How can one do relevance feedback if using language modeling approach?

    • Introduce a query model & treat feedback as query model updating

      • Retrieval function:

        • Query-likelihood => KL-Divergence

      • Feedback:

        • Expansion-based => Model-based

    Expansion-based vs. Model-based

    Feedback as Model Interpolation

    Translation model (Berger and Lafferty)

    • Basic LMs do not address issues of synonymy.

      • Or any deviation in expression of information need from language of documents

    • A translation model lets you generate query words not in document via “translation” to synonyms etc.

      • Or to do cross-language IR, or multimedia IR

      • Need to learn a translation model (using a dictionary or via statistical machine translation)

    Language models: pro & con

    • Novel way of looking at the problem of text retrieval based on probabilistic language modeling

        • Conceptually simple and explanatory

        • Formal mathematical model

        • Natural use of collection statistics, not heuristics (almost…)

      • LMs provide effective retrieval and can be improved to the extent that the following conditions can be met

        • Our language models are accurate representations of the data.

        • Users have some sense of term distribution.*

          • *Or we get more sophisticated with translation model

    Comparison With Vector Space

    • There’s some relation to traditional tf.idf models:

      • (unscaled) term frequency is directly in model

      • the probabilities do length normalization of term frequencies

      • the effect of doing a mixture with overall collection frequencies is a little like idf: terms rare in the general collection but common in some documents will have a greater influence on the ranking

    Comparison With Vector Space

    • Similar in some ways

      • Term weights based on frequency

      • Terms often used as if they were independent

      • Inverse document/collection frequency used

      • Some form of length normalization useful

    • Different in others

      • Based on probability rather than similarity

        • Intuitions are probabilistic rather than geometric

      • Details of use of document length and term, document, and collection frequency differ


    • J.M. Ponte and W.B. Croft. 1998. A language modelling approach to information retrieval. In SIGIR 21.
    • D. Hiemstra. 1998. A linguistically motivated probabilistic model of information retrieval. ECDL 2, pp. 569–584.

    • A. Berger and J. Lafferty. 1999. Information retrieval as statistical translation. SIGIR 22, pp. 222–229.

    • D.R.H. Miller, T. Leek, and R.M. Schwartz. 1999. A hidden Markov model information retrieval system. SIGIR 22, pp. 214–221.

    • [Several relevant newer papers at SIGIR 2325, 2000–2002.]

    • Workshop on Language Modeling and Information Retrieval, CMU 2001. .

    • The Lemur Toolkit for Language Modeling and Information Retrieval. . CMU/Umass LM and IR system in C(++), currently actively developed.

    Introduction to Information Retrieval

    Introduction to Information Retrieval

    CS276: Information Retrieval and Web Search

    Lecture 10: Text Classification;

    The Naive Bayes algorithm

    Standing queries

    • The path from IR to text classification:

      • You have an information need to monitor, say:

        • Unrest in the Niger delta region

      • You want to rerun an appropriate query periodically to find new news items on this topic

      • You will be sent new documents that are found

        • I.e., it’s not ranking but classification (relevant vs. not relevant)

    • Such queries are called standing queries

      • Long used by “information professionals”

      • A modern mass instantiation is Google Alerts

    • Standing queries are (hand-written) text classifiers

    Introduction to Information Retrieval

    Spam filteringAnother text classification task

    • From: ""<>

    • Subject: real estate is the only way... gem oalvgkay

    • Anyone can buy real estate with no money down

    • Stop paying rent TODAY !

    • There is no need to spend hundreds or even thousands for similar courses

    • I am 22 years old and I have already purchased 6 properties using the

    • methods outlined in this truly INCREDIBLE ebook.

    • Change your life NOW !

    • =================================================

    • Click Below to order:


    • =================================================

    Text classification

    • Today:

      • Introduction to Text Classification

        • Also widely known as “text categorization”

        • Same thing

      • Naïve Bayes text classification

        • Including a little on Probabilistic Language Models


    • Given:

      • A description of an instance, d ∈ X

        • X is the instance language or instance space.

          • Issue: how to represent text documents.

          • Usually some type of high-dimensional space – bag of words

      • A fixed set of classes:

      • C = {c1, c2,…, cJ}

    • Determine:

      • The category of d: γ(d) ∈ C, where γ(d) is a classification function whose domain is X and whose range is C.

        • We want to know how to build classification functions (“classifiers”).

    Machine Learning:Supervised Classification

    • Given:

      • A description of an instance, d ∈ X

        • X is the instance language or instance space.

      • A fixed set of classes:

      • C = {c1, c2,…, cJ}

      • A training set D of labeled documents with each labeled document ⟨d,c⟩ ∈ X×C

    • Determine:

      • A learning method or algorithm which will enable us to learn a classifier γ:X→C

      • For a test document d, we assign it the class γ(d) ∈ C

    Document Classification

    • (Note: in real life there is often a hierarchy, not present in the above problem statement; and also, you get papers on ML approaches to Garb. Coll.)

    More Text Classification ExamplesMany search engine functionalities use classification

    • Assigning labels to documents or web-pages:

    • Labels are most often topics such as Yahoo-categories

      • "finance," "sports," "news>world>asia>business"

    • Labels may be genres

      • "editorials" "movie-reviews" "news”

    • Labels may be opinion on a person/product

      • “like”, “hate”, “neutral”

    • Labels may be domain-specific

      • "interesting-to-me" : "not-interesting-to-me”

      • “contains adult language” : “doesn’t”

      • language identification: English, French, Chinese, …

      • search vertical: about Linux versus not

      • “link spam” : “not link spam"

    Classification Methods (1)

    • Manual classification

      • Used by the original Yahoo! Directory

      • Looksmart,, ODP, PubMed

      • Very accurate when job is done by experts

      • Consistent when the problem size and team is small

      • Difficult and expensive to scale

        • Means we need automatic classification methods for big problems

    Classification Methods (2)

    • Hand-coded rule-based classifiers

      • One technique used by CS dept’s spam filter, Reuters, CIA, etc.

      • It’s what Google Alerts is doing

        • Widely deployed in government and enterprise

      • Companies provide “IDE” for writing such rules

      • E.g., assign category if document contains a given boolean combination of words

      • Commercial systems have complex query languages (everything in IR query languages +score accumulators)

      • Accuracy is often very high if a rule has been carefully refined over time by a subject expert

      • Building and maintaining these rules is expensive

    A Verity topic A complex classification rule

    • Note:

      • maintenance issues (author, etc.)

      • Hand-weighting of terms

      • [Verity was bought by Autonomy.]

    Classification Methods (3)

    • Supervised learning of a document-label assignment function

      • Many systems partly or wholly rely on machine learning (Autonomy, Microsoft, Enkata, Yahoo!, …)

        • k-Nearest Neighbors (simple, powerful)

        • Naive Bayes (simple, common method)

        • Support-vector machines (new, generally more powerful)

        • … plus many other methods

      • No free lunch: requires hand-classified training data

      • But data can be built up (and refined) by amateurs

    • Many commercial systems use a mixture of methods

    Relevance feedback

    • In relevance feedback, the user marks a few documents as relevant/nonrelevant

    • The choices can be viewed as classes or categories

    • The IR system then uses these judgments to build a better model of the information need

    • So, relevance feedback can be viewed as a form of text classification (deciding between several classes)

    Probabilistic relevance feedback

    • Rather than reweighting in a vector space…

    • If user has told us some relevant and some nonrelevant documents, then we can proceed to build a probabilistic classifier

        • such as the Naive Bayes model we will look at today:

      • P(tk|R) = |Drk| / |Dr|

      • P(tk|NR) = |Dnrk| / |Dnr|

        • tk is a term; Dr is the set of known relevant documents; Drk is the subset that contain tk; Dnr is the set of known nonrelevant documents; Dnrk is the subset that contain tk.

    Recall a few probability basics

    • For events a and b:

    • Bayes’ Rule

    • Odds:

    Bayesian Methods

    • Learning and classification methods based on probability theory

    • Bayes theorem plays a critical role

    • Builds a generative model that approximates how data is produced

    • Has prior probability of each category given no information about an item.

    • Model produces a posterior probability

      • Distribution over the possible categories given an item

    • Naïve Bayes methods use a bag of words as the item description

    The bag of words representation

    The bag of words representation

    Bayes’ Rule for text classification

    For a document d and a class c

    Naive Bayes Classifiers

    • Task: Classify a new instance d based on a tuple of attribute values into one of the classes cj ∈ C


    • MAP is “maximum a posteriori” = most likely class

    Naïve Bayes Classifier: Naïve Bayes Assumption

    • P(cj)

      • Can be estimated from the frequency of classes in the training examples.

    • P(x1,x2,…,xn|cj)

      • O(|X|n•|C|) parameters

      • Could only be estimated if a very, very large number of training examples was available.

    • Naïve Bayes Conditional Independence Assumption:

    • Assume that the probability of observing the conjunction of attributes is equal to the product of the individual probabilities P(xi|cj).

    The Multivariate Bernoulli NB Classifier [Like Prob IR BIM; not language model; less used]

    • Conditional Independence Assumption: features detect term presence and are independent of each other given the class:

    • This model is appropriate for binary variables
      • Multivariate Bernoulli model

    Learning the Model

    • First attempt: maximum likelihood estimates

      • Simply use the frequencies in the data

    Problem with Maximum Likelihood

    • What if we have seen no training documents with the word muscle-ache and classified in the topic Flu?

    • Zero probabilities cannot be conditioned away, no matter the other evidence!

    Smoothing to Avoid Overfitting

    • Somewhat more subtle version

    Stochastic Language Models

    • Model probability of generating any string

    Stochastic Language Models

    • Model probability of generating strings (each word in turn) in a language (commonly all strings over alphabet ∑). E.g., a unigram model

    Unigram and higher-order models

  • Unigram Language Models (Easy , effective !)
  • Bigram (generally, n-gram) Language Models
  • Other Language Models
  • Grammar-based models (PCFGs), etc.
  • Probably not the first thing to try in IR
  • Naïve Bayes via a class conditional language model = multinomial NB

    • The probability of the words is done as a class-specific unigram language model

    Using Multinomial Naive Bayes Classifiers to Classify Text: Basic method

    • Attributes are text positions, values are words.

    • Still too many possibilities

    • Assume that classification is independent of the positions of the words

    • Use same parameters for each position

    • Result is bag of words model (over tokens not types)

    Naive Bayes and Language Modeling

    • Naïve Bayes classifiers can use any sort of feature

      • URL, email address, dictionaries, network features

    • But if, as in the previous slides

      • We use only word features

      • we use all of the words in the text (not a subset)

    • Then

      • Naïve Bayes is basically the same as language modeling

    Multinomial Naive Bayes: Learning

    • From training corpus, extract Vocabulary
    • Calculate required P(cj) and P(xk | cj) terms
      • For each cj in C do
        • docsj ← subset of documents for which the target class is cj
        • Textj ← single document containing all docsj
          • for each word xk in Vocabulary
          • nk ← number of occurrences of xk in Textj

    Naive Bayes: Classifying

    • positions ← all word positions in current document

                      which contain tokens found in Vocabulary

    • Return cNB, where

    Naive Bayes: Time Complexity

    • Training Time: O(|D|Lave + |C||V|)) where Lave is the average length of a document in D.

      • Assumes all counts are pre-computed in O(|D|Lave) time during one pass through all of the data.                          ← Why ?

      • Generally just O(|D|Lave) since usually |C||V| < |D|Lave

    • Test Time: O(|C| Lt) where Lt is the average length of a test document.

    • Very efficient overall, linearly proportional to the time needed to just read in all the data.

    Underflow Prevention: using logs

    • Multiplying lots of probabilities, which are between 0 and 1 by definition, can result in floating-point underflow.

    • Since log(xy) = log(x) + log(y), it is better to perform all computations by summing logs of probabilities rather than multiplying probabilities.

    • Class with highest final un-normalized log probability score is still the most probable.

    • Note that model is now just max of sum of weights…







    Chinese Beijing Chinese



    Chinese Chinese Shanghai



    Chinese Macao



    Tokyo Japan Chinese




    Chinese Chinese Chinese Tokyo Japan


    Two Naive Bayes Models

    • Model 1: Multivariate Bernoulli

      • One feature Xw for each word in dictionary

        • for loop iterates over dictionary

      • Xw = true in document d if w appears in d

      • Naive Bayes assumption:

        • Given the document’s topic, appearance of one word in the document tells us nothing about chances that another word appears

    • This is the model used in the binary independence model in classic probabilistic relevance feedback on hand-classified data

    Two Models

    • Model 2: Multinomial = Class conditional unigram

      • One feature Xi for each word pos in document

        • feature’s values are all words in dictionary

      • Value of Xi is the word in position i

      • Naïve Bayes assumption:

        • Given the document’s topic, word in one position in the document tells us nothing about words in other positions

      • Second assumption:

        • Word appearance does not depend on position

          P(Xi = w | c) = P(Xj = w | c

        • Just have one multinomial feature predicting all words

    • for all positions i,j, word w, and class c

    Parameter estimation

    Multivariate Bernoulli model:

    fraction of documents of topic cj in which word w appears

    Multinomial model:

    fraction of times in which word w appears among all words in documents of topic cj

    • Can create a mega-document for topic j by concatenating all documents in this topic
    • Use frequency of w in mega-document

    Which to use for classification?

    • Multinomial vs Multivariate Bernoulli?

    • Multinomial model is almost always more effective in text applications!

        • See results figures later

    • There has been exploration of multinomial naïve bayes variants which often work better in practice

      • Binarized multinomial Naïve Bayes, etc.

      • Topic of PA4

    Feature Selection: Why?

    • Text collections have a large number of features

      • 10,000 – 1,000,000 unique words … and more

    • May make using a particular classifier feasible

      • Some classifiers can’t deal with 1,000,000 features

    • Reduces training time

      • Training time for some methods is quadratic or worse in the number of features

    • Makes runtime models smaller and faster

    • Can improve generalization (performance)

      • Eliminates noise features

      • Avoids overfitting

    Feature Selection: How?

    • Two ideas:

      • Hypothesis testing statistics:

        • Are we confident that the value of one categorical variable is associated with the value of another

        • Chi-square test (χ2)

      • Information theory:

        • How much information does the value of one categorical variable give you about the value of another

        • Mutual information

    • They’re similar, but χ2 measures confidence in association, (based on available statistics), while MI measures extent of association (assuming perfect knowledge of probabilities)

    |2 statistic (CHI)

    • |2 is interested in (fo – fe)2/fe summed over all table entries: is the observed number what you’d expect given the marginals? 

    • The null hypothesis is rejected with confidence .999,

    • since 12.9 > 10.83 (the value for .999 confidence).

    |2 statistic (CHI)

    • There is a simpler formula for 2x2 χ2:

    • N = A + B + C + D

    • Value for complete independence of term and category?

    Feature selection via Mutual Information

    • In training set, choose k words which best discriminate (give most info on) the categories.

    • The Mutual Information between a word, class is:

      • For each word w and each category c

    Feature selection via MI (contd.)

    • For each category we build a list of k most discriminating terms.

    • For example (on 20 Newsgroups):

      • sci.electronics: circuit, voltage, amp, ground, copy, battery, electronics, cooling, …

      • car, cars, engine, ford, dealer, mustang, oil, collision, autos, tires, toyota, …

    • Greedy: does not account for correlations between terms

    • Why?

    Feature Selection

    • Mutual Information

      • Clear information-theoretic interpretation

      • May select rare uninformative terms

    • Chi-square

      • Statistical foundation

      • May select very slightly informative frequent terms that are not very useful for classification

    Feature Selection: Frequency

    • The simplest feature selection method:

      • Just use the commonest terms

      • No particular foundation

      • But it make sense why this works

        • They’re the words that can be well-estimated and are most often available as evidence

      • In practice, this is often 90% as good as better methods

    Feature selection for NB

    • In general feature selection is necessary for multivariate Bernoulli NB.

    • Otherwise you suffer from noise, multi-counting

    • “Feature selection” really means something different for multinomial NB. It means dictionary truncation

      • The multinomial NB model only has 1 feature

    • This “feature selection” normally isn’t needed for multinomial NB, but may help a fraction with quantities that are badly estimated

    Evaluating Categorization

    • Evaluation must be done on test data that are independent of the training data (usually a disjoint set of instances).

      • Sometimes use cross-validation (averaging results over multiple training and test splits of the overall data)

    • It’s easy to get good performance on a test set that was available to the learner during training (e.g., just memorize the test set).

    • Measures: precision, recall, F1, classification accuracy

    • Classification accuracy: c/n where n is the total number of test instances and c is the number of test instances correctly classified by the system.

      • Adequate if one class per document

      • Otherwise F measure for each class

    Naive Bayes vs. other methods

    WebKB Experiment (1998)

    • Classify webpages from CS departments into:

      • student, faculty, course,project

    • Train on ~5,000 hand-labeled web pages

      • Cornell, Washington, U.Texas, Wisconsin

    • Crawl and classify a new site (CMU)

    • Results:

    NB Model Comparison: WebKB


    • Naïve Bayes has found a home in spam filtering

      • Paul Graham’s A Plan for Spam

        • A Naive Bayes-like classifier with weird parameter estimation

      • Widely used in spam filters

      • But many features beyond words:

        • black hole lists, etc.

        • particular hand-crafted text patterns

    Naïve Bayes in Spam Filtering

    • SpamAssassin Features:

      • Basic (Naïve) Bayes spam probability

      • Mentions: Generic Viagra

      • Mentions millions of (dollar) ((dollar) NN,NNN,NNN.NN)

      • Phrase: impress ... girl

      • Phrase: 'Prestigious Non-Accredited Universities’

      • From: starts with many numbers

      • Subject is all capitals

      • HTML has a low ratio of text to image area

      • Relay in RBL,

      • RCVD line looks faked


    Naïve Bayes on spam email

    Violation of NB Assumptions

    • The independence assumptions do not really hold of documents written in natural language.

      • Conditional independence

      • Positional independence

    • Examples?

    Example: Sensors

    Naïve Bayes Posterior Probabilities

    • Classification results of naïve Bayes (the class with maximum posterior probability) are usually fairly accurate.

    • However, due to the inadequacy of the conditional independence assumption, the actual posterior-probability numerical estimates are not.

      • Output probabilities are commonly very close to 0 or 1.

    • Correct estimation ⇒ accurate prediction, but correct probability estimation is NOT necessary for accurate prediction (just need right ordering of probabilities)

    Naive Bayes is Not So Naiventitled

    • Very Fast Learning and Testing (basically just count the data)

    • Low Storage requirements

    • Very good in domains with many equally important features

    • More robust to irrelevant features than many learning methods

      • Irrelevant Features cancel each other without affecting results

    • More robust to concept drift (changing class definition over time)

    • Naive Bayes won 1st and 2nd place in KDD-CUP 97 competition out of 16 systems

      • Goal: Financial services industry direct mail response prediction: Predict if the recipient of mail will actually respond to the advertisement – 750,000 records.

    • A good dependable baseline for text classification (but not the best)!

    Resources for today’s lecture

    • IIR 13

    • Fabrizio Sebastiani. Machine Learning in Automated Text Categorization. ACM Computing Surveys, 34(1):1-47, 2002.

    • Yiming Yang & Xin Liu, A re-examination of text categorization methods. Proceedings of SIGIR, 1999.

    • Andrew McCallum and Kamal Nigam. A Comparison of Event Models for Naive Bayes Text Classification. In AAAI/ICML-98 Workshop on Learning for Text Categorization, pp. 41-48.

    • Tom Mitchell, Machine Learning. McGraw-Hill, 1997.

      • Clear simple explanation of Naïve Bayes

    • Open Calais: Automatic Semantic Tagging

      • Free (but they can keep your data), provided by Thompson/Reuters (ex-ClearForest)

    • Weka: A data mining software package that includes an implementation of Naive Bayes

    • Reuters-21578 – the most famous text classification evaluation set

      • Still widely used by lazy people (but now it’s too small for realistic experiments – you should use Reuters RCV1)

    CS276: Information Retrieval and Web Search

    Pandu Nayak and Prabhakar Raghavan

    Lecture 11: Text Classification;

    Vector space classification

    [Borrows slides from Ray Mooney]

    Recap: Naïve Bayes classifiers

    • Classify based on prior weight of class and conditional parameter for what each word says:

    • Training is done by counting and dividing:

    • Don’t forget to smooth

    The rest of text classification

    • Today:

      • Vector space methods for Text Classification

        • Vector space classification using centroids (Rocchio)

        • K Nearest Neighbors

        • Decision boundaries, linear and nonlinear classifiers

        • Dealing with more than 2 classes

    • Later in the course

      • More text classification

        • Support Vector Machines

        • Text-specific issues in classification

    Recall: Vector Space Representation

    • Each document is a vector, one component for each term (= word).

    • Normally normalize vectors to unit length.

    • High-dimensional vector space:

      • Terms are axes

      • 10,000+ dimensions, or even 100,000+

      • Docs are vectors in this space

    • How can we do classification in this space?

    Classification Using Vector Spaces

    • As before, the training set is a set of documents, each labeled with its class (e.g., topic)

    • In vector space classification, this set corresponds to a labeled set of points (or, equivalently, vectors) in the vector space

    • Premise 1: Documents in the same class form a contiguous region of space

    • Premise 2: Documents from different classes don’t overlap (much)

    • We define surfaces to delineate classes in the space

    Documents in a Vector Space

    Test Document of what class?

    Test Document = Government

    Aside: 2D/3D graphs can be misleading

    Using Rocchio for text classification

    • Relevance feedback methods can be adapted for text categorization

      • As noted before, relevance feedback can be viewed as 2-class classification

        • Relevant vs. nonrelevant documents

    • Use standard tf-idf weighted vectors to represent text documents

    • For training documents in each category, compute a prototype vector by summing the vectors of the training documents in the category.

      • Prototype = centroid of members of class

    • Assign test documents to the category with the closest prototype vector based on cosine similarity.

    Illustration of Rocchio Text Categorization

    Definition of centroid

    • Where Dc is the set of all documents that belong to class c and v(d) is the vector space representation of d.

    • Note that centroid will in general not be a unit vector even when the inputs are unit vectors.

    Rocchio Properties

    • Forms a simple generalization of the examples in each class (a prototype).

    • Prototype vector does not need to be averaged or otherwise normalized for length since cosine similarity is insensitive to vector length.

    • Classification is based on similarity to class prototypes.

    • Does not guarantee classifications are consistent with the given training data. 

                                                                                                                          Why not?

    Rocchio Anomaly

    • Prototype models have problems with polymorphic (disjunctive) categories.

    Rocchio classification

    • Rocchio forms a simple representation for each class: the centroid/prototype

    • Classification is based on similarity to / distance from the prototype/centroid

    • It does not guarantee that classifications are consistent with the given training data

    • It is little used outside text classification

      • It has been used quite effectively for text classification

      • But in general worse than Naïve Bayes

    • Again, cheap to train and test documents

    kNN decision boundaries

    Example: k=6 (6NN)

    Nearest-Neighbor Learning Algorithm

    • Learning is just storing the representations of the training examples in D.

    • Testing instance x (under 1NN):

      • Compute similarity between x and all examples in D.

      • Assign x the category of the most similar example in D.

    • Does not explicitly compute a generalization or category prototypes.

    • Also called:

      • Case-based learning

      • Memory-based learning

      • Lazy learning

    • Rationale of kNN: contiguity hypothesis

    kNN Is Close to Optimal

    • Cover and Hart (1967)

    • Asymptotically, the error rate of 1-nearest-neighbor classification is less than twice the Bayes rate [error rate of classifier knowing model that generated data]

    • In particular, asymptotic error rate is 0 if Bayes rate is 0.

    • Assume: query point coincides with a training point.

    • Both query point and training point contribute error → 2 times Bayes rate

    k Nearest Neighbor

    • Using only the closest example (1NN) to determine the class is subject to errors due to:

      • A single atypical example.

      • Noise (i.e., an error) in the category label of a single training example.

    • More robust alternative is to find the k most-similar examples and return the majority category of these k examples.

    • Value of k is typically odd to avoid ties; 3 and 5 are most common.

    k Nearest Neighbor Classification

    • kNN = k Nearest Neighbor

    • To classify a document d into class c:

    • Define k-neighborhood N as k nearest neighbors of d

    • Count number of documents i in N that belong to c

    • Estimate P(c|d) as i/k

    • Choose as class argmaxc P(c|d) [ = majority class]

    Similarity Metrics

    • Nearest neighbor method depends on a similarity (or distance) metric.

    • Simplest for continuous m-dimensional instance space is Euclidean distance.

    • Simplest for m-dimensional binary instance space is Hamming distance (number of feature values that differ).

    • For text, cosine similarity of tf.idf weighted vectors is typically most effective.

    Illustration of 3 Nearest Neighbor for Text Vector Space

    3 Nearest Neighbor vs. Rocchio

    • Nearest Neighbor tends to handle polymorphic categories better than Rocchio/NB.

    Nearest Neighbor with Inverted Index

    • Naively, finding nearest neighbors requires a linear search through |D| documents in collection

    • But determining k nearest neighbors is the same as determining the k best retrievals using the test document as a query to a database of training documents.

    • Use standard vector space inverted index methods to find the k nearest neighbors.

    • Testing Time: O(B|Vt|) where B is the average number of training documents in which a test-document word appears.

      • Typically B << |D|

    kNN: Discussion

    • No feature selection necessary

    • Scales well with large number of classes

      • Don’t need to train n classifiers for n classes

    • Classes can influence each other

      • Small changes to one class can have ripple effect

    • Scores can be hard to convert to probabilities

    • No training necessary

      • Actually: perhaps not true. (Data editing, etc.)

    • May be expensive at test time

    • In most cases it’s more accurate than NB or Rocchio

    kNN vs. Naive Bayes

    • Bias/Variance tradeoff

      • Variance ≈ Capacity

    • kNN has high variance and low bias.

      • Infinite memory

    • NB has low variance and high bias.

      • Decision surface has to be linear (hyperplane – see later)

    • Consider asking a botanist: Is an object a tree?

      • Too much capacity/variance, low bias

        • Botanist who memorizes

        • Will always say “no” to new object (e.g., different # of leaves)

      • Not enough capacity/variance, high bias

        • Lazy botanist

        • Says “yes” if the object is green

      • You want the middle ground

                                                                                    (Example due to C. Burges)

    Bias vs. variance: Choosing the correct model capacity

    Linear classifiers and binary and multiclass classification

    • Consider 2 class problems

      • Deciding between two classes, perhaps, government and non-government

        • One-versus-rest classification

    • How do we define (and find) the separating surface?

    • How do we decide which region a test doc is in?

    Separation by Hyperplanes

    • A strong high-bias assumption is linear separability:

      • in 2 dimensions, can separate classes by a line

      • in higher dimensions, need hyperplanes
    • Can find separating hyperplane by linear programming

      • (or can iteratively fit solution via perceptron):

      • separator can be expressed as ax + by = c

    Linear programming / Perceptron

    • Find a,b,c, such that
    • ax + by > c for red points

    • ax + by < c for blue points.

    Which Hyperplane?

    • In general, lots of possible solutions for a,b,c.

    Which Hyperplane?

    • Lots of possible solutions for a,b,c.

    • Some methods find a separating hyperplane, but not the optimal one [according to some criterion of expected goodness]

      • E.g., perceptron

    • Most methods find an optimal separating hyperplane

    • Which points should influence optimality?

      • All points

        • Linear/logistic regression

        • Naïve Bayes

      • Only “difficult points” close to decision boundary

        • Support vector machines

    Linear classifier: Example

    • Class: “interest” (as in interest rate)

    • Example features of a linear classifier

                          wi     ti                                              wi     ti

                       0.70 prime                                        −0.71 dlrs

                       0.67 rate                                           −0.35 world

                       0.63 interest                                     −0.33 sees

                       0.60 rates                                          −0.25 year

                       0.46 discount                                    −0.24 group

                       0.43 bundesbank                               −0.24 dlr

    • To classify, find dot product of feature vector and weights

    Linear Classifiers

    • Many common text classifiers are linear classifiers

      • Naïve Bayes

      • Perceptron

      • Rocchio

      • Logistic regression

      • Support vector machines (with linear kernel)

      • Linear regression with threshold

    • Despite this similarity, noticeable performance differences

      • For separable problems, there is an infinite number of separating hyperplanes. Which one do you choose?

      • What to do for non-separable problems?

      • Different training methods pick different hyperplanes

    • Classifiers more powerful than linear often don’t perform better on text problems. Why?

    Rocchio is a linear classifier

    Two-class Rocchio as a linear classifier

    • Line or hyperplane defined by:

    • For Rocchio, set:

    • [Aside for ML/stats people: Rocchio classification is a simplification of the classic Fisher Linear Discriminant where you don’t model the variance (or assume it is spherical).]

    Naive Bayes is a linear classifier

    • Two-class Naive Bayes. We compute:

    • Decide class C if the odds is greater than 1, i.e., if the log odds is greater than 0.

    • So decision boundary is hyperplane:

    A nonlinear problem

    • A linear classifier like Naïve Bayes does badly on this task

    • kNN will do very well (assuming enough training data)

    High Dimensional Data

    • Pictures like the one at right are absolutely misleading!

    • Documents are zero along almost all axes

    • Most document pairs are very far apart (i.e., not strictly orthogonal, but only share very common words and a few scattered others)

    • In classification terms: often document sets are separable, for most any classification

    • This is part of why linear classifiers are quite successful in this domain

    More Than Two Classes

    • Any-of or multivalue classification

      • Classes are independent of each other.

      • A document can belong to 0, 1, or >1 classes.

      • Decompose into n binary problems

      • Quite common for documents

    • One-of or multinomial or polytomous classification

      • Classes are mutually exclusive.

      • Each document belongs to exactly one class

      • E.g., digit recognition is polytomous classification

        • Digits are mutually exclusive

    Set of Binary Classifiers: Any of

    • Build a separator between each class and its complementary set (docs from all other classes).

    • Given test doc, evaluate it for membership in each class.

    • Apply decision criterion of classifiers independently

    • Done

      • Though maybe you could do better by considering dependencies between categories

    Set of Binary Classifiers: One of

    • Build a separator between each class and its complementary set (docs from all other classes).

    • Given test doc, evaluate it for membership in each class.

    • Assign document to class with:

    • maximum score
    • maximum confidence

    • maximum probability

    Summary: Representation ofText Categorization Attributes

    • Representations of text are usually very high dimensional (one feature for each word)

    • High-bias algorithms that prevent overfitting in high-dimensional space should generally work best*

    • For most text categorization tasks, there are many relevant features and many irrelevant ones

    • Methods that combine evidence from many or all features (e.g. naive Bayes, kNN) often tend to work better than ones that try to isolate just a few relevant features*

                                                      *Although the results are a bit more mixed than often thought

    Which classifier do I use for a given text classification problem?

    • Is there a learning method that is optimal for all text classification problems?

    • No, because there is a tradeoff between bias and variance.

    • Factors to take into account:

      • How much training data is available?

      • How simple/complex is the problem? (linear vs. nonlinear decision boundary)

      • How noisy is the data?

      • How stable is the problem over time?

        • For an unstable problem, it’s better to use a simple and robust classifier.

    Resources for today’s lecture

    • IIR 14

    • Fabrizio Sebastiani. Machine Learning in Automated Text Categorization. ACM Computing Surveys, 34(1):1-47, 2002.

    • Yiming Yang & Xin Liu, A re-examination of text categorization methods. Proceedings of SIGIR, 1999.

    • Trevor Hastie, Robert Tibshirani and Jerome Friedman, Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer-Verlag, New York.

    • Open Calais: Automatic Semantic Tagging

      • Free (but they can keep your data), provided by Thompson/Reuters

    • Weka: A data mining software package that includes an implementation of many ML algorithms