Introduction to Information Retrieval


Information Retrieval and Web Search

Pandu Nayak and Prabhakar Raghavan

Lecture 1: Boolean retrieval

Information Retrieval

  • Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers)


Unstructured (text) vs. structured (database) data in 1996


Unstructured (text) vs. structured (database) data in 2009

Unstructured data in 1680

  • Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?
  • One could grep all of Shakespeare’s plays for Brutus and Caesar, then strip out lines containing Calpurnia?
  • Why is that not the answer?
    • Slow (for large corpora)
    • NOT Calpurnia is non-trivial
    • Other operations (e.g., find the word Romans near countrymen) not feasible
    • Ranked retrieval (best documents to return)
      • Later lectures

Term-document incidence

  • Query: Brutus AND Caesar BUT NOT Calpurnia

  • Matrix entry for (word, play) is 1 if the play contains the word, 0 otherwise.


Incidence vectors

  • So we have a 0/1 vector for each term.

  • To answer query: take the vectors for Brutus, Caesar and Calpurnia (complemented)  → bitwise AND.

  • 110100 AND 110111 AND 101111 = 100100.
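A minimal sketch of this bitwise AND, using Python integers as the 0/1 incidence vectors (the six-bit values are the ones on the slide; the column order is whatever fixed play order the matrix uses):

```python
# 0/1 incidence vectors over six plays, one bit per play.
brutus    = 0b110100
caesar    = 0b110111
calpurnia = 0b010000          # un-complemented vector for Calpurnia

MASK = 0b111111               # keep the complement to six bits
answer = brutus & caesar & (~calpurnia & MASK)
print(format(answer, "06b"))  # 100100
```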

Basic assumptions of Information Retrieval

  • Collection: Fixed set of documents
  • Goal: Retrieve documents with information that is relevant to the user’s information need and helps the user complete a task.

The classic search model

How good are the retrieved docs?

  • Precision : Fraction of retrieved docs that are relevant to user’s information need

  • Recall : Fraction of relevant docs in collection that are retrieved

  • More precise definitions and measurements to follow in later lectures

Bigger collections

  • Consider N = 1 million documents, each with about 1000 words.

  • Avg 6 bytes/word including spaces/punctuation

    • 6GB of data in the documents.

  • Say there are M = 500K distinct terms among these.

Can’t build the matrix

  • 500K x 1M matrix has half-a-trillion 0’s and 1’s.

  • But it has no more than one billion 1’s.                               Why? 

    • matrix is extremely sparse.

  • What’s a better representation?

    • We only record the 1 positions.

Inverted index

  • For each term t, we must store a list of all documents that contain t.
      • Identify each by a docID, a document serial number
  • Can we use fixed-size arrays for this?
Brutus    →  1   2   4   11   31   45   173   174
Caesar    →  1   2   4   5    6    16   57    132
Calpurnia →  2   31  54  101


                     What happens if the word Caesar is added to document 14? 

Inverted index

  • We need variable-size postings lists

    • On disk, a continuous run of postings is normal and best

    • In memory, can use linked lists or variable length arrays

      • Some tradeoffs in size/ease of insertion

Brutus    →  1   2   4   11   31   45   173   174
Caesar    →  1   2   4   5    6    16   57    132
Calpurnia →  2   31  54  101



Brutus, Caesar and Calpurnia are dictionary terms.

Numbers are postings.

Sorted by docID (more later on why).
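A toy in-memory version of this structure, assuming simple whitespace tokenization (the dict keys play the role of the dictionary, the sorted docID lists are the postings):

```python
from collections import defaultdict

def build_index(docs):
    """docs: {docID: text}. Returns term -> sorted list of docIDs (postings)."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    # Postings are kept sorted by docID, as on the slide.
    return {term: sorted(ids) for term, ids in index.items()}

docs = {1: "caesar was ambitious", 2: "brutus killed caesar", 4: "brutus was honourable"}
index = build_index(docs)
print(index["brutus"])  # [2, 4]
```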

Inverted index construction

Indexer steps: Token sequence

Indexer steps: Sort

Indexer steps: Dictionary & Postings

Where do we pay in storage?

The index we just built

  • How do we process a query?

    • Later - what kinds of queries can we process?                ← Today’s focus 

Query processing: AND

  • Consider processing the query: Brutus AND Caesar

    • Locate Brutus in the Dictionary;

      • Retrieve its postings.

    • Locate Caesar in the Dictionary;

      • Retrieve its postings.

    • “Merge” the two postings lists:

The merge

  • Walk through the two postings simultaneously, in time linear in the total number of postings entries

Intersecting two postings lists (a "merge algorithm")
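The merge can be sketched as follows (a generic linear intersection of two sorted postings lists, not necessarily the exact pseudocode of the figure):

```python
def intersect(p1, p2):
    """Intersect two sorted postings lists in O(len(p1) + len(p2)) time."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])   # docID in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                 # advance the pointer with the smaller docID
        else:
            j += 1
    return answer

# Brutus AND Caesar, with the postings from the slide:
print(intersect([1, 2, 4, 11, 31, 45, 173, 174], [1, 2, 4, 5, 6, 16, 57, 132]))
# [1, 2, 4]
```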

Query optimization

  • What is the best order for query processing?

  • Consider a query that is an AND of n terms.

  • For each of the n terms, get its postings, then AND them together.

Brutus    →  2   4   8   16   32   64   128
Caesar    →  1   2   3   5   8   16   21   34
Calpurnia →  13  16


                           Query: Brutus AND Calpurnia AND Caesar

Query optimization example

  • Process in order of increasing freq:

    • start with smallest set, then keep cutting further.


      (This is why we kept document freq. in dictionary)

Brutus    →  2   4   8   16   32   64   128
Caesar    →  1   2   3   5   8   16   21   34
Calpurnia →  13  16


              Execute the query as (Calpurnia AND Brutus) AND Caesar.
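The order-by-frequency idea can be sketched as follows (plain set intersection stands in for the linear merge; the postings are the ones from the slide):

```python
def optimized_and(postings_by_term, terms):
    """AND the terms' postings, processing the smallest list first."""
    lists = sorted((postings_by_term[t] for t in terms), key=len)
    result = set(lists[0])
    for p in lists[1:]:
        if not result:
            break              # can stop early once the intermediate result is empty
        result &= set(p)
    return sorted(result)

idx = {"Brutus": [2, 4, 8, 16, 32, 64, 128],
       "Caesar": [1, 2, 3, 5, 8, 16, 21, 34],
       "Calpurnia": [13, 16]}
print(optimized_and(idx, ["Brutus", "Calpurnia", "Caesar"]))  # [16]
```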

Boolean queries: Exact match

  • The Boolean retrieval model lets you ask a query that is a Boolean expression:
    • Boolean Queries use AND, OR and NOT to join query terms

      • Views each document as a set of words

      • Is precise: document matches condition or not.

    • Perhaps the simplest model to build an IR system on

  • Primary commercial retrieval tool for 3 decades.

  • Many search systems you still use are Boolean:

    • Email, library catalog, Mac OS X Spotlight

Boolean queries: More General Merges

  • Exercise: Adapt the merge for the queries:

  • Brutus AND NOT Caesar

  • Brutus OR NOT Caesar

    Can we still run through the merge in time O(x+y)?

    What can we achieve?
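One possible adaptation for the first query, Brutus AND NOT Caesar, still a single linear pass over both sorted lists (a sketch; the OR NOT case is left as in the exercise):

```python
def and_not(p1, p2):
    """p1 AND NOT p2 over sorted postings lists, in O(len(p1) + len(p2))."""
    answer, i, j = [], 0, 0
    while i < len(p1):
        if j == len(p2) or p1[i] < p2[j]:
            answer.append(p1[i])   # docID in p1 and (provably) not in p2
            i += 1
        elif p1[i] == p2[j]:
            i += 1                 # docID excluded by the NOT
            j += 1
        else:
            j += 1
    return answer

print(and_not([1, 2, 4, 11], [2, 3, 4]))  # [1, 11]
```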

Example: WestLaw

  • Largest commercial (paying subscribers) legal search service (started 1975; ranking added 1992)

  • Tens of terabytes of data; 700,000 users

  • Majority of users still use boolean queries

  • Example query:

    • What is the statute of limitations in cases involving the federal tort claims act?

    • LIMIT! /3 STATUTE ACTION /S FEDERAL /2 TORT /3 CLAIM

      • /3 = within 3 words, /S = in same sentence

Example: WestLaw

  • Another example query:

    • Requirements for disabled people to be able to access a workplace

    • disabl! /p access! /s work-site work-place (employment /3 place)

  • Note that SPACE is disjunction, not conjunction!

  • Long, precise queries; proximity operators; incrementally developed; not like web search

  • Many professional searchers still like Boolean search

    • You know exactly what you are getting

  • But that doesn’t mean it actually works better….


More general optimization

  • e.g., (madding OR crowd) AND (ignoble OR strife)

  • Get doc. freq.’s for all terms.

  • Estimate the size of each OR by the sum of its doc. freq.’s (conservative).

  • Process in increasing order of OR sizes.


  • Recommend a query processing order for
  • (tangerine OR trees) AND

  • (marmalade OR skies) AND

  • (kaleidoscope OR eyes)

    Term            Freq
    eyes            213312
    kaleidoscope     87009
    marmalade       107913
    skies           271658
    tangerine        46653
    trees           316812

Query processing exercises

  • Exercise: If the query is friends AND romans AND (NOT countrymen), how could we use the freq of countrymen?

  • Exercise: Extend the merge to an arbitrary Boolean query. Can we always guarantee execution in time linear in the total postings size?

  • Hint: Begin with the case of a Boolean formula query where each term appears only once in the query.


  • Write down five search features you think it could do better

What’s ahead in IR?

  • Beyond term search

  • What about phrases?

    • Stanford University

  • Proximity: Find Gates NEAR Microsoft.

    • Need index to capture position information in docs.

  • Zones in documents: Find documents with
    (author = Ullman) AND (text contains automata).

Evidence accumulation

  • 1 vs. 0 occurrence of a search term

    • 2 vs. 1 occurrence

    • 3 vs. 2 occurrences, etc.

    • Usually more seems better

  • Need term frequency information in docs

Ranking search results

  • Boolean queries give inclusion or exclusion of docs.

  • Often we want to rank/group results

    • Need to measure proximity from query to each doc.

    • Need to decide whether docs presented to user are singletons, or a group of docs covering various aspects of the query.

IR vs. databases: Structured vs unstructured data

  • Structured data tends to refer to information in “tables”





  • Typically allows numerical range and exact match

  • (for text) queries, e.g.,

  • Salary < 60000 AND Manager = Smith.

Unstructured data

  • Typically refers to free-form text

  • Allows

    • Keyword queries including operators

    • More sophisticated “concept” queries, e.g.,

      • find all web pages dealing with drug abuse

  • Classic model for searching text documents

Semi-structured data

  • In fact almost no data is “unstructured”

  • E.g., this slide has distinctly identified zones such as the Title and Bullets

  • Facilitates “semi-structured” search such as

    • Title contains data AND Bullets contain search

      … to say nothing of linguistic structure

More sophisticated semi-structured search

  • Title is about Object Oriented Programming AND Author something like stro*rup

  • where * is the wild-card operator

  • Issues:

    • how do you process “about”?

    • how do you rank results?

  • The focus of XML search (IIR chapter 10)

Clustering, classification and ranking

  • Clustering: Given a set of docs, group them into clusters based on their contents.

  • Classification: Given a set of topics, plus a new doc D, decide which topic(s) D belongs to.

  • Ranking: Can we learn how to best order a set of documents, e.g., a set of search results

The web and its challenges

  • Unusual and diverse documents

  • Unusual and diverse users, queries, information needs

  • Beyond terms, exploit ideas from social networks

    • link analysis, clickstreams ...

  • How do search engines work?
    And how can we make them better?

More sophisticated information retrieval

  • Cross-language information retrieval

  • Question answering

  • Summarization

  • Text mining

Resources for today’s lecture

  • Introduction to Information Retrieval, chapter 1

  • Shakespeare:

    • Try the neat browse by keyword sequence feature!

  • Managing Gigabytes, chapter 3.2

  • Modern Information Retrieval, chapter 8.2


    Any questions?

Introduction to Information Retrieval

CS276: Information Retrieval and Web Search 

Pandu Nayak and Prabhakar Raghavan

Lecture 2: The term vocabulary and postings lists

Recap of the previous lecture

  • Basic inverted indexes:

    • Structure: Dictionary and Postings

    • Key step in construction: Sorting

  • Boolean query processing

    • Intersection by linear time “merging”

    • Simple optimizations

  • Overview of course topics

Plan for this lecture

Elaborate basic indexing

  • Preprocessing to form the term vocabulary

    • Documents

    • Tokenization

    • What terms do we put in the index?

  • Postings

    • Faster merges: skip lists

    • Positional postings and phrase queries

Recall the basic indexing pipeline


Parsing a document

  • What format is it in?
    • pdf/word/excel/html?

  • What language is it in?

  • What character set is in use?

     Each of these is a classification problem, which we will study later in the course.

     But these tasks are often done heuristically …

Complications: Format/language

  • Documents being indexed can include docs from many different languages

    • A single index may have to contain terms of several languages.

  • Sometimes a document or its components can contain multiple languages/formats

    • French email with a German pdf attachment.

  • What is a unit document?

    • A file?

    • An email? (Perhaps one of many in an mbox.)

    • An email with 5 attachments?

    • A group of files (PPT or LaTeX as HTML pages)

Tokens and terms



  • Input: “Friends, Romans, Countrymen”

  • Output: Tokens

    • Friends

    • Romans

    • Countrymen

  • A token is a sequence of characters in a document

  • Each such token is now a candidate for an index entry, after further processing

    • Described below

  • But what are valid tokens to emit?


  • Issues in tokenization:

    • Finland’s capital  →  Finland? Finlands? Finland’s?

    • Hewlett-Packard → Hewlett and Packard as two tokens?

      • state-of-the-art: break up hyphenated sequence.

      • co-education

      • lowercase, lower-case, lower case ?

      • It can be effective to get the user to put in possible hyphens

    • San Francisco: one token or two?

      • How do you decide it is one token?


  • 3/12/91                Mar. 12, 1991                     12/3/91

  • 55 B.C.

  • B-52

  • My PGP key is 324a3df234cb23e

  • (800) 234-2333

    • Often have embedded spaces

    • Older IR systems may not index numbers

      • But often very useful: think about things like looking up error codes/stacktraces on the web

      • (One answer is using n-grams: Lecture 3)

    • Will often index “meta-data” separately

      • Creation date, format, etc.

Tokenization: language issues

  • French

    • L'ensemble    one token or two?

      • L ? L’ ? Le ?

      • Want l’ensemble to match with un ensemble

        • Until at least 2003, it didn’t on Google

        • Internationalization!
  • German noun compounds are not segmented

    • Lebensversicherungsgesellschaftsangestellter

    • ‘life insurance company employee’

    • German retrieval systems benefit greatly from a compound splitter module

        • Can give a 15% performance boost for German

Tokenization: language issues

  • Chinese and Japanese have no spaces between words:

    • 莎拉波娃现在居住在美国东南部的佛罗里达。
    • Not always guaranteed a unique tokenization
  • Further complicated in Japanese, with multiple alphabets intermingled

  • Dates/amounts in multiple formats

      End-user can express query entirely in hiragana!

Tokenization: language issues

  • Arabic (or Hebrew) is basically written right to left, but with certain items like numbers written left to right

  • Words are separated, but letter forms within a word form complex ligatures

  • [Arabic example sentence omitted: the words read right to left, but the numerals 132 and 1962 read left to right]

  • ‘Algeria achieved its independence in 1962 after 132 years of French occupation.’

  • With Unicode, the surface presentation is complex, but the stored form is straightforward

Stop words

  • With a stop list, you exclude from the dictionary entirely the commonest words. Intuition:

    • They have little semantic content: the, a, and, to, be

    • There are a lot of them: ~30% of postings for top 30 words

  • But the trend is away from doing this:

    • Good compression techniques (lecture 5) means the space for including stopwords in a system is very small

    • Good query optimization techniques (lecture 7) mean you pay little at query time for including stop words.

    • You need them for:

      • Phrase queries: “King of Denmark”

      • Various song titles, etc.: “Let it be”, “To be or not to be”

      • “Relational” queries: “flights to London”

Normalization to terms

  • We need to “normalize” words in indexed text as well as query words into the same form

    • We want to match U.S.A. and USA

  • Result is terms: a term is a (normalized) word type, which is an entry in our IR system dictionary

  • We most commonly implicitly define equivalence classes of terms by, e.g.,

    • deleting periods to form a term

      • U.S.A., USA → USA

    • deleting hyphens to form a term

      • anti-discriminatory, antidiscriminatory → antidiscriminatory

Normalization: other languages

  • Accents: e.g., French résumé vs. resume.

  • Umlauts: e.g., German: Tuebingen vs. Tübingen

    • Should be equivalent

  • Most important criterion:

    • How are your users likely to write their queries for these words?

  • Even in languages that standardly have accents, users often may not type them

    • Often best to normalize to a de-accented term

      • Tuebingen, Tübingen, Tubingen  →  Tubingen
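De-accenting can be sketched with Unicode decomposition (a heuristic that works for Latin-script accents; note it does not fold the typed variant "Tuebingen", which needs a language-specific rule):

```python
import unicodedata

def deaccent(term):
    """Strip combining accent marks: Tübingen -> Tubingen."""
    # NFKD splits accented letters into base letter + combining mark(s).
    decomposed = unicodedata.normalize("NFKD", term)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(deaccent("Tübingen"))  # Tubingen
print(deaccent("résumé"))    # resume
```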

Normalization: other languages

  • Normalization of things like date forms

    • 7月30日 vs. 7/30
    • Japanese use of kana vs. Chinese characters
  • Tokenization and normalization may depend on the language and so is intertwined with language detection

  • Crucial: Need to “normalize” indexed text as well as query terms into the same form

Case folding

  • Reduce all letters to lower case

    • exception: upper case in mid-sentence?

      • e.g., General Motors
      • Fed vs. fed
      • SAIL vs. sail
    • Often best to lower case everything, since users will use lowercase regardless of ‘correct’ capitalization
  • Google example:

    • Query C.A.T.

    • #1 result was for “cat” (well, Lolcats) not Caterpillar Inc.

Normalization to terms

  • An alternative to equivalence classing is to do asymmetric expansion

  • An example of where this may be useful

    • Enter: window Search: window, windows

    • Enter: windows Search: Windows, windows, window

    • Enter: Windows Search: Windows

  • Potentially more powerful, but less efficient


Lemmatization

  • Reduce inflectional/variant forms to base form

  • E.g.,

    • am, are, is  → be

    • car, cars, car's, cars'  →  car

  • the boy's cars are different colors → the boy car be different color

  • Lemmatization implies doing “proper” reduction to dictionary headword form


Stemming

  • Reduce terms to their “roots” before indexing

  • “Stemming” suggests crude affix chopping

    • language dependent

    • e.g., automate(s), automatic, automation all reduced to automat.

Porter’s algorithm

  • Commonest algorithm for stemming English

    • Results suggest it’s at least as good as other stemming options

  • Conventions + 5 phases of reductions

    • phases applied sequentially

    • each phase consists of a set of commands

    • sample convention: Of the rules in a compound command, select the one that applies to the longest suffix.

Typical rules in Porter

  • sses → ss

  • ies → i

  • ational → ate

  • tional → tion

  • Rules sensitive to the measure of words

  • (m>1) EMENT → (empty)

      • replacement → replac

      • cement → cement
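A toy subset of these rules, using the longest-suffix convention from the previous slide (this is only a sketch, not the full Porter algorithm: real rules also check the measure m of the stem):

```python
# (suffix, replacement) pairs from the slide; measure conditions omitted.
RULES = [("sses", "ss"), ("ies", "i"), ("ational", "ate"), ("tional", "tion")]

def apply_step(word, rules=RULES):
    """Apply the single matching rule with the longest suffix."""
    best = None
    for suffix, repl in rules:
        if word.endswith(suffix) and (best is None or len(suffix) > len(best[0])):
            best = (suffix, repl)
    if best:
        suffix, repl = best
        return word[: -len(suffix)] + repl
    return word

print(apply_step("caresses"))    # caress
print(apply_step("ponies"))      # poni
print(apply_step("relational"))  # relate  (ational beats tional, the longer suffix wins)
```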

Other stemmers

  • Other stemmers exist, e.g., Lovins stemmer


    • Single-pass, longest suffix removal (about 250 rules)

  • Full morphological analysis – at most modest benefits for retrieval

  • Do stemming and other normalizations help?

    • English: very mixed results. Helps recall but harms precision

      • operative (dentistry) ⇒ oper

      • operational (research) ⇒ oper

      • operating (systems) ⇒ oper

    • Definitely useful for Spanish, German, Finnish, …

      • 30% performance gains for Finnish!


Thesauri and soundex

  • Do we handle synonyms and homonyms?

    • E.g., by hand-constructed equivalence classes

      • car = automobile      color = colour

    • We can rewrite to form equivalence-class terms

      • When the document contains automobile, index it under car-automobile (and vice-versa)

    • Or we can expand a query

      • When the query contains automobile, look under car as well

  • What about spelling mistakes?

    • One approach is soundex, which forms equivalence classes of words based on phonetic heuristics

  • More in lectures 3 and 9


  • Many of the above features embody transformations that are

    • Language-specific and

    • Often, application-specific

  • These are “plug-in” addenda to the indexing process

  • Both open source and commercial plug-ins are available for handling these

Dictionary entries – first cut

Faster postings merges:
Skip pointers/Skip lists

Recall basic merge

  • Walk through the two postings simultaneously, in time linear in the total number of postings entries

Can we do better? Yes (if index isn’t changing too fast).

Augment postings with skip pointers (at indexing time)

  • Why?

  • To skip postings that will not figure in the search results.

  • How?

  • Where do we place skip pointers?

Query processing with skip pointers

Suppose we’ve stepped through the lists until we process 8 on each list. We match it and advance.

We then have 41 and 11 on the lower. 11 is smaller.

But the skip successor of 11 on the lower list is 31, so we can skip ahead past the intervening postings.

Where do we place skips?

  • Tradeoff:

    • More skips → shorter skip spans ⇒ more likely to skip. But lots of comparisons to skip pointers.

    • Fewer skips → fewer pointer comparisons, but then long skip spans ⇒ few successful skips.

Placing skips

  • Simple heuristic: for postings of length L, use √L evenly-spaced skip pointers.

  • This ignores the distribution of query terms.

  • Easy if the index is relatively static; harder if L keeps changing because of updates.

  • This definitely used to help; with modern hardware it may not (Bahle et al. 2002) unless you’re memory-based

    • The I/O cost of loading a bigger postings list can outweigh the gains from quicker in memory merging!
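The skip-augmented merge can be sketched as follows, with √L evenly spaced skips as in the heuristic above (the example lists are the ones from the slide's figure; this is one possible in-memory representation, with skips stored as a position → target map):

```python
import math

def add_skips(postings):
    """Map position i -> skip target, ~sqrt(L) evenly spaced pointers."""
    step = max(1, math.isqrt(len(postings)))
    return {i: i + step for i in range(0, len(postings) - step, step)}

def intersect_with_skips(p1, p2):
    s1, s2 = add_skips(p1), add_skips(p2)
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            if i in s1 and p1[s1[i]] <= p2[j]:
                i = s1[i]          # take the skip: jump past postings that cannot match
            else:
                i += 1
        else:
            if j in s2 and p2[s2[j]] <= p1[i]:
                j = s2[j]
            else:
                j += 1
    return answer

print(intersect_with_skips([2, 4, 8, 41, 48, 64, 128], [1, 2, 3, 8, 11, 17, 21, 31]))
# [2, 8]
```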

Phrase queries and positional indexes

Phrase queries

  • Want to be able to answer queries such as “stanford university” – as a phrase

  • Thus the sentence “I went to university at Stanford” is not a match.

    • The concept of phrase queries has proven easily understood by users; one of the few “advanced search” ideas that works

    • Many more queries are implicit phrase queries

  • For this, it no longer suffices to store only

  • <term : docs> entries

A first attempt: Biword indexes

  • Index every consecutive pair of terms in the text as a phrase

  • For example the text “Friends, Romans, Countrymen” would generate the biwords

    • friends romans

    • romans countrymen

  • Each of these biwords is now a dictionary term

  • Two-word phrase query-processing is now immediate.
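Biword generation can be sketched as (assuming simple lowercasing and punctuation stripping):

```python
def biwords(text):
    """Every consecutive pair of terms becomes a single dictionary term."""
    tokens = text.lower().replace(",", "").split()
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

print(biwords("Friends, Romans, Countrymen"))
# ['friends romans', 'romans countrymen']
```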

Longer phrase queries

  • Longer phrases are processed as we did with wild-cards:

  • stanford university palo alto can be broken into the Boolean query on biwords:

  • stanford university AND university palo AND palo alto

  • Without the docs, we cannot verify that the docs matching the above Boolean query do contain the phrase.

                                                                                    ↑   Can have false positives!

Extended biwords

  • Parse the indexed text and perform part-of-speech-tagging (POST).

  • Bucket the terms into (say) Nouns (N) and articles/prepositions (X).

  • Call any string of terms of the form NX*N an extended biword.

    • Each such extended biword is now made a term in the dictionary.

  • Example: catcher in the rye

                   N           X   X    N

  • Query processing: parse it into N’s and X’s

    • Segment query into enhanced biwords

    • Look up in index: catcher rye

Issues for biword indexes

  • False positives, as noted before

  • Index blowup due to bigger dictionary

    • Infeasible for more than biwords, big even for them

  • Biword indexes are not the standard solution (for all biwords) but can be part of a compound strategy

Solution 2: Positional indexes

  • In the postings, store for each term the position(s) in which tokens of it appear:

    <term, number of docs containing term;

     doc1: position1, position2 … ;

     doc2: position1, position2 … ;

     …>


Positional index example

  • For phrase queries, we use a merge algorithm recursively at the document level

  • But we now need to deal with more than just equality

Processing a phrase query

  • Extract inverted index entries for each distinct term: to, be, or, not.

  • Merge their doc:position lists to enumerate all positions with “to be or not to be”.

    • to:

      • 2:1,17,74,222,551; 4:8,16,190,429,433; 7:13,23,191; ...

    • be:

      • 1:17,19; 4:17,191,291,430,434; 5:14,19,101; ...

  • Same general method for proximity searches
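A simplified phrase-matching sketch over a positional index (the term → {docID: positions} shape mirrors the entries above; intersecting position sets stands in for the recursive positional merge):

```python
def phrase_docs(pos_index, phrase):
    """pos_index: term -> {docID: sorted positions}.
    Return docs where the phrase terms occur at consecutive positions."""
    terms = phrase.lower().split()
    docs = set(pos_index[terms[0]])
    for t in terms[1:]:
        docs &= set(pos_index[t])          # docs containing every term
    answer = []
    for d in sorted(docs):
        starts = set(pos_index[terms[0]][d])
        for k, t in enumerate(terms[1:], start=1):
            # term k of the phrase must sit k positions after a start
            starts &= {p - k for p in pos_index[t][d]}
        if starts:
            answer.append(d)
    return answer

pos_index = {
    "to": {2: [1, 17, 74, 222, 551], 4: [8, 16, 190, 429, 433], 7: [13, 23, 191]},
    "be": {1: [17, 19], 4: [17, 191, 291, 430, 434], 5: [14, 19, 101]},
}
print(phrase_docs(pos_index, "to be"))  # [4]
```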

Proximity queries


    • Again, here, /k means “within k words of”.

  • Clearly, positional indexes can be used for such queries; biword indexes cannot.

  • Exercise: Adapt the linear merge of postings to handle proximity queries. Can you make it work for any value of k?

    • This is a little tricky to do correctly and efficiently

    • See Figure 2.12 of IIR

    • There’s likely to be a problem on it!

Positional index size

  • You can compress position values/offsets: we’ll talk about that in lecture 5

  • Nevertheless, a positional index expands postings storage substantially

  • Nevertheless, a positional index is now standardly used because of the power and usefulness of phrase and proximity queries … whether used explicitly or implicitly in a ranking retrieval system.

Positional index size

  • Need an entry for each occurrence, not just once per document

  • Index size depends on average document size                    ←  Why?

    • Average web page has <1000 terms

    • SEC filings, books, even some epic poems … easily 100,000 terms

  • Consider a term with frequency 0.1%

Rules of thumb

  • A positional index is 2–4 times as large as a non-positional index

  • Positional index size 35–50% of volume of original text

  • Caveat: all of this holds for “English-like” languages

Combination schemes

  • These two approaches can be profitably combined

    • For particular phrases (“Michael Jackson”, “Britney Spears”) it is inefficient to keep on merging positional postings lists

      • Even more so for phrases like “The Who”

  • Williams et al. (2004) evaluate a more sophisticated mixed indexing scheme

    • A typical web query mixture was executed in ¼ of the time of using just a positional index

    • It required 26% more space than having a positional index alone

Resources for today’s lecture

  • IIR 2

  • MG 3.6, 4.3; MIR 7.2

  • Skip Lists theory: Pugh (1990)

    • Multilevel skip lists give same O(log n) efficiency as trees

  • D. Bahle, H. Williams, and J. Zobel. Efficient phrase querying with an auxiliary index. SIGIR 2002, pp. 215-221.

Introduction to Information Retrieval

CS276: Information Retrieval and Web Search 

Pandu Nayak and Prabhakar Raghavan

Lecture 3: Dictionaries and tolerant retrieval

Recap of the previous lecture

  • The type/token distinction

    • Terms are normalized types put in the dictionary

  • Tokenization problems:

    • Hyphens, apostrophes, compounds, CJK

  • Term equivalence classing:

    • Numbers, case folding, stemming, lemmatization

  • Skip pointers

    • Encoding a tree-like structure in a postings list

  • Biword indexes for phrases

  • Positional indexes for phrases/proximity queries

This lecture

  • Dictionary data structures

  • “Tolerant” retrieval

    • Wild-card queries

    • Spelling correction

    • Soundex

Dictionary data structures for inverted indexes

  • The dictionary data structure stores the term vocabulary, document frequency, pointers to each postings list … in what data structure?

A naïve dictionary

  • An array of struct:

                                      term          doc. freq.     pointer to postings

                                      char[20]      int            Postings *

                                      20 bytes      4/8 bytes      4/8 bytes

  • How do we store a dictionary in memory efficiently?

  • How do we quickly look up elements at query time?

Dictionary data structures

  • Two main choices:

    • Hashtables

    • Trees

  • Some IR systems use hashtables, some trees


Hashtables

  • Each vocabulary term is hashed to an integer

    • (We assume you’ve seen hashtables before)

  • Pros:

    • Lookup is faster than for a tree: O(1)

  • Cons:

    • No easy way to find minor variants:

      • judgment/judgement

    • No prefix search [tolerant retrieval]

    • If vocabulary keeps growing, need to occasionally do the expensive operation of rehashing everything


    • Definition: Every internal node has a number of children in the interval [a,b] where a, b are appropriate natural numbers, e.g., [2,4].


  • Simplest: binary tree

  • More usual: B-trees

  • Trees require a standard ordering of characters and hence strings … but we typically have one

  • Pros:

    • Solves the prefix problem (terms starting with hyp)

  • Cons:

    • Slower: O(log M) [and this requires balanced tree]

    • Rebalancing binary trees is expensive

      • But B-trees mitigate the rebalancing problem

Wild-card queries 

Wild-card queries: *

  • mon*: find all docs containing any word beginning with “mon”.

  • Easy with binary tree (or B-tree) lexicon: retrieve all words in range:

    mon ≤ w < moo

  • *mon: find words ending in “mon”: harder

    • Maintain an additional B-tree for terms backwards.

    • Can retrieve all words in range: nom ≤ w < non.

  • Exercise: from this, how can we enumerate all terms meeting the wild-card query pro*cent?

Query processing

  • At this point, we have an enumeration of all terms in the dictionary that match the wild-card query.

  • We still have to look up the postings for each enumerated term.

  • E.g., consider the query:

  • se*ate AND fil*er

  • This may result in the execution of many Boolean AND queries.

B-trees handle *’s at the end of a query term

  • How can we handle *’s in the middle of query term?

    • co*tion

  • We could look up co* AND *tion in a B-tree and intersect the two term sets

    • Expensive

  • The solution: transform wild-card queries so that the *’s occur at the end

  • This gives rise to the Permuterm Index.

Permuterm index

  • For term hello, index under:

    • hello$, ello$h, llo$he, lo$hel, o$hell

    • where $ is a special symbol.

  • Queries:

    • X lookup on X$                    X* lookup on $X*

    • *X lookup on X$*              *X* lookup on X*

    • X*Y lookup on Y$X*           X*Y*Z ??? Exercise!


                                         Query = hel*o

                                             X=hel, Y=o

                                         Lookup o$hel*

Permuterm query processing

  • Rotate query wild-card to the right

  • Now use B-tree lookup as before.

  • Permuterm problem: quadruples lexicon size


                            Empirical observation for English.
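The rotations and the query transformation can be sketched as follows (permuterm_keys produces all six rotations of hello$; the slide lists five of them):

```python
def permuterm_keys(term):
    """All rotations of term+'$'; each is indexed pointing back to term."""
    s = term + "$"
    return [s[i:] + s[:i] for i in range(len(s))]

def rotate_query(q):
    """Rotate a single-wildcard query so the * ends up at the end."""
    s = q + "$"
    star = s.index("*")
    return s[star + 1:] + s[:star] + "*"

print(permuterm_keys("hello"))
# ['hello$', 'ello$h', 'llo$he', 'lo$hel', 'o$hell', '$hello']
print(rotate_query("hel*o"))  # o$hel*
print(rotate_query("mon*"))   # $mon*
```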

Bigram (k-gram) indexes

  • Enumerate all k-grams (sequence of k chars) occurring in any term

  • e.g., from text “April is the cruelest month” we get the 2-grams (bigrams)

    • $ is a special word boundary symbol

  • Maintain a second inverted index from bigrams to dictionary terms that match each bigram.

  • $a,ap,pr,ri,il,l$,$i,is,s$,$t,th,he,e$,$c,cr,ru,

    ue,el,le,es,st,t$, $m,mo,on,nt,h$

Bigram index example

  • The k-gram index finds terms based on a query consisting of k-grams (here k=2).
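Generating the k-grams of a single term, with the $ boundary symbol, can be sketched as:

```python
def kgrams(term, k=2):
    """Character k-grams of a term, with $ marking word boundaries."""
    s = "$" + term + "$"
    return [s[i:i + k] for i in range(len(s) - k + 1)]

print(kgrams("april"))  # ['$a', 'ap', 'pr', 'ri', 'il', 'l$']
```

The second inverted index then maps each k-gram to the dictionary terms containing it.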

Processing wild-cards

  • Query mon* can now be run as

    • $m AND mo AND on


  • Gets terms that match AND version of our wildcard query.

  • But we’d enumerate moon.

  • Must post-filter these terms against query.

  • Surviving enumerated terms are then looked up in the term-document inverted index.

  • Fast, space efficient (compared to permuterm).

Processing wild-card queries

  • As before, we must execute a Boolean query for each enumerated, filtered term.

  • Wild-cards can result in expensive query execution (very large disjunctions…)

    • pyth* AND prog*

  • If you encourage “laziness” people will respond!

Which web search engines allow wildcard queries? 

Spelling Correction 

Spell correction

  • Two principal uses

    • Correcting document(s) being indexed

    • Correcting user queries to retrieve “right” answers

  • Two main flavors:

    • Isolated word

      • Check each word on its own for misspelling

      • Will not catch typos resulting in correctly spelled words

      • e.g., from → form

    • Context-sensitive

      • Look at surrounding words,

      • e.g., I flew form Heathrow to Narita.

Document correction

  • Especially needed for OCR’ed documents

    • Correction algorithms are tuned for this: rn/m

    • Can use domain-specific knowledge

      • E.g., OCR can confuse O and D more often than it would confuse O and I (adjacent on the QWERTY keyboard, so more likely interchanged in typing).

  • But also: web pages and even printed material have typos

  • Goal: the dictionary contains fewer misspellings

  • But often we don’t change the documents and instead fix the query-document mapping

Query mis-spellings

  • Our principal focus here

    • E.g., the query Alanis Morisett

  • We can either

    • Retrieve documents indexed by the correct spelling, OR

    • Return several suggested alternative queries with the correct spelling

      • Did you mean … 

Isolated word correction

  • Fundamental premise – there is a lexicon from which the correct spellings come

  • Two basic choices for this

    • A standard lexicon such as

      • Webster’s English Dictionary

      • An “industry-specific” lexicon – hand-maintained

    • The lexicon of the indexed corpus

      • E.g., all words on the web

      • All names, acronyms etc.

      • (Including the mis-spellings)

Isolated word correction

  • Given a lexicon and a character sequence Q, return the words in the lexicon closest to Q

  • What’s “closest”?

  • We’ll study several alternatives

    • Edit distance (Levenshtein distance)

    • Weighted edit distance

    • n-gram overlap

Edit distance

  • Given two strings S1 and S2, the minimum number of operations to convert one to the other

  • Operations are typically character-level

    • Insert, Delete, Replace, (Transposition)

  • E.g., the edit distance from dof to dog is 1

    • From cat to act is 2 (Just 1 with transpose.)

    • from cat to dog is 3.

  • Generally found by dynamic programming.
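The standard dynamic program (Levenshtein distance with insert/delete/replace only, no transposition) can be written as:

```python
def edit_distance(s1, s2):
    """Levenshtein distance via dynamic programming:
    d[i][j] = min edits turning s1[:i] into s2[:j]."""
    m, n = len(s1), len(s2)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i              # delete all of s1[:i]
    for j in range(n + 1):
        d[0][j] = j              # insert all of s2[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # delete
                          d[i][j - 1] + 1,        # insert
                          d[i - 1][j - 1] + cost) # replace / match
    return d[m][n]

# edit_distance("dof", "dog") -> 1; ("cat", "act") -> 2; ("cat", "dog") -> 3
```

A weighted variant replaces the constant costs with lookups in a weight matrix.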

Weighted edit distance

  • As above, but the weight of an operation depends on the character(s) involved

    • Meant to capture OCR or keyboard errors
      Example: m more likely to be mis-typed as n than as q

    • Therefore, replacing m by n is a smaller edit distance than by q

    • This may be formulated as a probability model

  • Requires weight matrix as input

  • Modify dynamic programming to handle weights

Using edit distances

  • Given query, first enumerate all character sequences within a preset (weighted) edit distance (e.g., 2)

  • Intersect this set with list of “correct” words

  • Show terms you found to user as suggestions

  • Alternatively,

    • We can look up all possible corrections in our inverted index and return all docs … slow

    • We can run with a single most likely correction

  • The alternatives disempower the user, but save a round of interaction with the user

Edit distance to all dictionary terms?

  • Given a (mis-spelled) query – do we compute its edit distance to every dictionary term?

    • Expensive and slow

    • Alternative?

  • How do we cut the set of candidate dictionary terms?

  • One possibility is to use n-gram overlap for this

  • This can also be used by itself for spelling correction.

n-gram overlap

  • Enumerate all the n-grams in the query string as well as in the lexicon

  • Use the n-gram index (recall wild-card search) to retrieve all lexicon terms matching any of the query n-grams

  • Threshold by number of matching n-grams

    • Variants – weight by keyboard layout, etc.

Example with trigrams

  • Suppose the text is november

    • Trigrams are nov, ove, vem, emb, mbe, ber.

  • The query is december

    • Trigrams are  dec, ece, cem, emb, mbe, ber.

  • So 3 trigrams overlap (of 6 in each term)

  • How can we turn this into a normalized measure of overlap?

One option – Jaccard coefficient

  • A commonly-used measure of overlap

  • Let X and Y be two sets; then the J.C. is |X ∩ Y| / |X ∪ Y|

  • Equals 1 when X and Y have the same elements and zero when they are disjoint

  • X and Y don’t have to be of the same size

  • Always assigns a number between 0 and 1

    • Now threshold to decide if you have a match

    • E.g., if J.C. > 0.8, declare a match 
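A sketch of the measure applied to the trigram example above (november vs. december overlap in 3 of 9 distinct trigrams, well below a 0.8 threshold):

```python
def ngrams(term, n=3):
    """The set of character n-grams of a term (no boundary padding here)."""
    return {term[i:i + n] for i in range(len(term) - n + 1)}

def jaccard(x, y):
    """Jaccard coefficient |X ∩ Y| / |X ∪ Y|: 1 for equal sets, 0 if disjoint."""
    return len(x & y) / len(x | y)

# jaccard(ngrams("november"), ngrams("december")) = 3/9 ≈ 0.33 -> no match at 0.8
```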

Matching bigrams

  • Consider the query lord – we wish to identify words matching 2 of its 3 bigrams (lo, or, rd)

    lo ⇒ alone → lore → sloth

    or ⇒ border → lore → morbid

    rd ⇒ ardent → border → card


    Standard postings “merge” will enumerate … 

    Adapt this to using Jaccard (or another) measure.

Context-sensitive spell correction

  • Text: I flew from Heathrow to Narita.

  • Consider the phrase query “flew form Heathrow”

  • We’d like to respond

    Did you mean “flew from Heathrow”?

    because no docs matched the query phrase.

Context-sensitive correction

  • Need surrounding context to catch this.

  • First idea: retrieve dictionary terms close (in weighted edit distance) to each query term

  • Now try all possible resulting phrases with one word “fixed” at a time

    • flew from heathrow

    • fled form heathrow

    • flea form heathrow

  • Hit-based spelling correction: Suggest the alternative that has lots of hits.


  • Suppose that for “flew form Heathrow” we have 7 alternatives for flew, 19 for form and 3 for heathrow.

  • How many “corrected” phrases will we enumerate in this scheme?

Another approach

  • Break phrase query into a conjunction of biwords (Lecture 2).

  • Look for biwords that need only one term corrected.

  • Enumerate only phrases containing “common” biwords.

General issues in spell correction

  • We enumerate multiple alternatives for “Did you mean?”

  • Need to figure out which to present to the user

    • The alternative hitting most docs

    • Query log analysis

  • More generally, rank alternatives probabilistically

    • argmax_corr P(corr | query)

    • From Bayes’ rule, this is equivalent to
      argmax_corr P(query | corr) · P(corr)

                                ↑                 ↑

                    Noisy channel     Language model



Soundex

  • Class of heuristics to expand a query into phonetic equivalents

    • Language specific – mainly for names

    • E.g., chebyshev → tchebycheff

  • Invented for the U.S. census … in 1918

Soundex – typical algorithm

  • Turn every token to be indexed into a 4-character reduced form

  • Do the same with query terms

  • Build and search an index on the reduced forms

    • (when the query calls for a soundex match)

Soundex – typical algorithm

  1. Retain the first letter of the word.

  2. Change all occurrences of the following letters to '0' (zero):
      'A', 'E', 'I', 'O', 'U', 'H', 'W', 'Y'.

  3. Change letters to digits as follows:

    • B, F, P, V → 1

    • C, G, J, K, Q, S, X, Z → 2

    • D, T → 3

    • L → 4

    • M, N → 5

    • R → 6

Soundex continued

  1. Repeatedly remove one out of each pair of consecutive identical digits.

  2. Remove all zeros from the resulting string.

  3. Pad the resulting string with trailing zeros and return the first four positions, which will be of the form ⟨uppercase letter⟩⟨digit⟩⟨digit⟩⟨digit⟩.

    • E.g., Herman becomes H655.


      Will hermann generate the same code?
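The four steps above can be sketched directly (a minimal implementation of the classic algorithm; running it shows that hermann does generate the same code):

```python
SOUNDEX_CODE = {**dict.fromkeys("AEIOUHWY", "0"),
                **dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
                **dict.fromkeys("DT", "3"), "L": "4",
                **dict.fromkeys("MN", "5"), "R": "6"}

def soundex(word):
    word = word.upper()
    digits = [SOUNDEX_CODE.get(c, "0") for c in word]
    # Keep one out of each pair of consecutive identical digits...
    deduped = [d for i, d in enumerate(digits) if i == 0 or d != digits[i - 1]]
    # ...then drop zeros, retain the original first letter, pad to 4 characters.
    tail = [d for d in deduped[1:] if d != "0"]
    return (word[0] + "".join(tail) + "000")[:4]

# soundex("Herman") -> 'H655'; soundex("hermann") -> 'H655'
```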

What queries can we process?

  • We have

    • Positional inverted index with skip pointers

    • Wild-card index

    • Spell-correction

    • Soundex

  • Queries such as

  • (SPELL(moriset) /3 toron*to) OR SOUNDEX(chaikofski)


  • Soundex is the classic algorithm, provided by most databases (Oracle, Microsoft, …)

  • How useful is soundex?

  • Not very – for information retrieval

  • Okay for “high recall” tasks (e.g., Interpol), though biased to names of certain nationalities

  • Zobel and Dart (1996) show that other algorithms for phonetic matching perform much better in the context of IR


  • Draw yourself a diagram showing the various indexes in a search engine incorporating all the functionality we have talked about

  • Identify some of the key design choices in the index pipeline:

    • Does stemming happen before the Soundex index?

    • What about n-grams?

  • Given a query, how would you parse and dispatch sub-queries to the various indexes?


Resources

  • IIR 3, MG 4.2

  • Efficient spell retrieval:

    • K. Kukich. Techniques for automatically correcting words in text. ACM Computing Surveys 24(4), Dec 1992.

    • J. Zobel and P. Dart. Finding approximate matches in large lexicons. Software: Practice and Experience 25(3), March 1995.

    • Mikael Tillenius: Efficient Generation and Ranking of Spelling Error Corrections. Master’s thesis at Sweden’s Royal Institute of Technology.

  • Nice, easy reading on spell correction:

    • Peter Norvig: How to write a spelling corrector


Introduction to Information Retrieval

CS276: Information Retrieval and Web Search

Pandu Nayak and Prabhakar Raghavan

Lecture 4: Index Construction


  • Last lecture:

    • Dictionary data structures

    • Tolerant retrieval

      • Wildcards

      • Spell correction

      • Soundex

  • This time:

    • Index construction


Index construction

  • How do we construct an index?

  • What strategies can we use with limited main memory?

Hardware basics

  • Many design decisions in information retrieval are based on the characteristics of hardware

  • We begin by reviewing hardware basics

Hardware basics

  • Access to data in memory is much faster than access to data on disk.

  • Disk seeks: No data is transferred from disk while the disk head is being positioned.

  • Therefore: Transferring one large chunk of data from disk to memory is faster than transferring many small chunks.

  • Disk I/O is block-based: Reading and writing of entire blocks (as opposed to smaller chunks).

  • Block sizes: 8KB to 256 KB.

Hardware basics

  • Servers used in IR systems now typically have several GB of main memory, sometimes tens of GB.

  • Available disk space is several (2–3) orders of magnitude larger.

  • Fault tolerance is very expensive: It’s much cheaper to use many regular machines rather than one fault tolerant machine.

Hardware assumptions for this lecture

  • symbol      statistic                              value

  •     s       average seek time                      5 ms = 5 × 10⁻³ s

  •     b       transfer time per byte                 0.02 μs = 2 × 10⁻⁸ s

  •             processor’s clock rate                 10⁹ s⁻¹

  •     p       low-level operation                    0.01 μs = 10⁻⁸ s
                (e.g., compare & swap a word)

  •             size of main memory                    several GB

  •             size of disk space                     1 TB or more

RCV1: Our collection for this lecture

  • Shakespeare’s collected works definitely aren’t large enough for demonstrating many of the points in this course.

  • The collection we’ll use isn’t really large enough either, but it’s publicly available and is at least a more plausible example.

  • As an example for applying scalable index construction algorithms, we will use the Reuters RCV1 collection.

  • This is one year of Reuters newswire (August 1996 through August 1997)

A Reuters RCV1 document

Reuters RCV1 statistics

  • symbol      statistic                        value

  •     N       documents                        800,000

  •     L       avg. # tokens per doc            200

  •     M       terms (= word types)             400,000

  •             avg. # bytes per token           6
                (incl. spaces/punct.)

  •             avg. # bytes per token           4.5
                (without spaces/punct.)

  •             avg. # bytes per term            7.5

  •             non-positional postings          100,000,000

    4.5 bytes per word token vs. 7.5 bytes per word type: why?


Recall IIR 1 index construction

Documents are parsed to extract words and these are saved with the Document ID.

Key step

  • After all documents have been parsed, the inverted file is sorted by terms.

    We focus on this sort step. We have 100M items to sort.

Scaling index construction

  • In-memory index construction does not scale

    • Can’t stuff entire collection into memory, sort, then write back

  • How can we construct an index for very large collections?

  • Taking into account the hardware constraints we just learned about . . .

  • Memory, disk, speed, etc.

Sort-based index construction

  • As we build the index, we parse docs one at a time.

    • While building the index, we cannot easily exploit compression tricks (you can, but much more complex)

  • The final postings for any term are incomplete until the end.

  • At 12 bytes per non-positional postings entry (term, doc, freq), this demands a lot of space for large collections.

  • T = 100,000,000 in the case of RCV1

    • So … we can do this in memory in 2009, but typical collections are much larger. E.g., the New York Times provides an index of >150 years of newswire

  • Thus: We need to store intermediate results on disk.

Sort using disk as “memory”?

  • Can we use the same index construction algorithm for larger collections, but by using disk instead of memory?

  • No: Sorting T = 100,000,000 records on disk is too slow – too many disk seeks.

  • We need an external sorting algorithm.


  • Parse and build postings entries one doc at a time

  • Now sort postings entries by term (then by doc within each term)

  • Doing this with random disk seeks would be too slow

    – must sort T=100M records

If every comparison took 2 disk seeks, and N items could be sorted with N log₂N comparisons, how long would this take?

BSBI: Blocked sort-based Indexing (Sorting with fewer disk seeks)

  • 12-byte (4+4+4) records (term, doc, freq).

  • These are generated as we parse docs.

  • Must now sort 100M such 12-byte records by term.

  • Define a Block ~ 10M such records

    • Can easily fit a couple into memory.

    • Will have 10 such blocks to start with.

  • Basic idea of algorithm:

    • Accumulate postings for each block, sort, write to disk.

    • Then merge the blocks into one long sorted order.
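The accumulate-sort-spill-merge loop can be sketched as follows (a toy-scale illustration; `spill` writes one sorted run to a temp file, and `heapq.merge` plays the role of the multi-way merge of runs):

```python
import heapq, os, pickle, tempfile
from itertools import groupby

def spill(postings):
    """Write one sorted run of (term, docID) pairs to disk; return its path."""
    f = tempfile.NamedTemporaryFile(delete=False)
    pickle.dump(postings, f)
    f.close()
    return f.name

def bsbi(docs, block_size):
    runs, block = [], []
    for doc_id, text in docs:                  # parse docs one at a time
        for term in text.split():
            block.append((term, doc_id))
            if len(block) >= block_size:       # block full: sort and spill
                runs.append(spill(sorted(block)))
                block = []
    if block:
        runs.append(spill(sorted(block)))
    streams = []                               # read runs back, multi-way merge
    for r in runs:
        with open(r, "rb") as f:
            streams.append(pickle.load(f))
        os.remove(r)
    merged = heapq.merge(*streams)
    return {term: sorted({d for _, d in grp})
            for term, grp in groupby(merged, key=lambda p: p[0])}
```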

Sorting 10 blocks of 10M records

  • First, read each block and sort within:

    • Quicksort takes 2N ln N expected steps

    • In our case 2 x (10M ln 10M) steps

  • Exercise: estimate total time to read each block from disk and quicksort it.

  • 10 times this estimate – gives us 10 sorted runs of 10M records each.

  • Done straightforwardly, need 2 copies of data on disk

    • But can optimize this


How to merge the sorted runs?

  • Can do binary merges, with a merge tree of ⌈log₂10⌉ = 4 layers.

  • During each layer, read into memory runs in blocks of 10M, merge, write back.

How to merge the sorted runs?

  • But it is more efficient to do a multi-way merge, where you are reading from all blocks simultaneously

  • Providing you read decent-sized chunks of each block into memory and then write out a decent-sized output chunk, then you’re not killed by disk seeks.

Remaining problem with sort-based algorithm

  • Our assumption was: we can keep the dictionary in memory.

  • We need the dictionary (which grows dynamically) in order to implement a term to termID mapping.

  • Actually, we could work with term,docID postings instead of termID,docID postings . . .

  • . . . but then intermediate files become very large. (We would end up with a scalable, but very slow index construction method.)


SPIMI: Single-pass in-memory indexing

  • Key idea 1: Generate separate dictionaries for each block – no need to maintain term-termID mapping across blocks.

  • Key idea 2: Don’t sort. Accumulate postings in postings lists as they occur.

  • With these two ideas we can generate a complete inverted index for each block.

  • These separate indexes can then be merged into one big index.


  • Merging of blocks is analogous to BSBI. 
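A sketch of the per-block inversion (postings accumulate unsorted as they occur; only the dictionary is sorted once, when the block is written out):

```python
from collections import defaultdict

def spimi_invert(token_stream, max_postings):
    """Consume (term, docID) pairs until the block is full; no global
    term -> termID mapping and no sorting of postings."""
    index = defaultdict(list)
    n = 0
    for term, doc_id in token_stream:
        plist = index[term]
        if not plist or plist[-1] != doc_id:   # docIDs arrive in order
            plist.append(doc_id)
            n += 1
        if n >= max_postings:
            break           # caller writes this block out and starts the next
    return dict(sorted(index.items()))         # sort terms for later merging
```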

SPIMI: Compression

  • Compression makes SPIMI even more efficient.

    • Compression of terms

    • Compression of postings

  • See next lecture

Distributed indexing

  • For web-scale indexing (don’t try this at home!):

    • must use a distributed computing cluster

  • Individual machines are fault-prone

    • Can unpredictably slow down or fail

  • How do we exploit such a pool of machines?

Web search engine data centers

  • Web search data centers (Google, Bing, Baidu) mainly contain commodity machines.

  • Data centers are distributed around the world.

  • Estimate: Google ~1 million servers, 3 million processors/cores (Gartner 2007)

Massive data centers

  • If in a non-fault-tolerant system with 1000 nodes, each node has 99.9% uptime, what is the uptime of the system?

  • Answer: 0.999^1000 ≈ 37% (i.e., at least one node is down 63% of the time)

  • Exercise: Calculate the number of servers failing per minute for an installation of 1 million servers.

Distributed indexing

  • Maintain a master machine directing the indexing job – considered “safe”.

  • Break up indexing into sets of (parallel) tasks.

  • Master machine assigns each task to an idle machine from a pool.

Parallel tasks

  • We will use two sets of parallel tasks

    • Parsers

    • Inverters

  • Break the input document collection into splits

  • Each split is a subset of documents (corresponding to blocks in BSBI/SPIMI)


  • Master assigns a split to an idle parser machine

  • Parser reads a document at a time and emits (term, doc) pairs

  • Parser writes pairs into j partitions

  • Each partition is for a range of terms’ first letters

    • (e.g., a-f, g-p, q-z) – here j = 3.

  • Now to complete the index inversion


  • An inverter collects all (term,doc) pairs (= postings) for one term-partition.

  • Sorts and writes to postings lists

Data flow


  • The index construction algorithm we just described is an instance of MapReduce.

  • MapReduce (Dean and Ghemawat 2004) is a robust and conceptually simple framework for distributed computing …

  • … without having to write code for the distribution part.

  • They describe the Google indexing system (ca. 2002) as consisting of a number of phases, each implemented in MapReduce.


  • Index construction was just one phase.

  • Another phase: transforming a term-partitioned index into a document-partitioned index.

    • Term-partitioned: one machine handles a subrange of terms

    • Document-partitioned: one machine handles a subrange of documents

  • As we’ll discuss in the web part of the course, most search engines use a document-partitioned index … better load balancing, etc.

Schema for index construction in MapReduce

  • Schema of map and reduce functions

  • map: input → list(k, v)

  • reduce: (k, list(v)) → output

  • Instantiation of the schema for index construction

  • map: collection → list(termID, docID)

  • reduce: (<termID1, list(docID)>, <termID2, list(docID)>, …) → (postings list1, postings list2, …)

Example for index construction

  • Map:

  • d1 : C came, C c’ed.

  • d2 : C died. →

  • <C,d1>, <came,d1>, <C,d1>, <c'ed,d1>, <C,d2>, <died,d2>

  • Reduce:

  • (<C,(d1,d2,d1)>, <died,(d2)>, <came, (d1)>, <c'ed,(d1)>) 

    → (<C,(d1:2,d2:1)>, <died,(d2:1)>, <came,(d1:1)>, <c'ed,(d1:1)>)
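A toy rendering of the two phases for this example (single-process Python standing in for the framework):

```python
from itertools import groupby

def map_phase(doc_id, text):
    """map: document -> list of (term, docID) pairs, one per token occurrence."""
    return [(term, doc_id) for term in text.split()]

def reduce_phase(pairs):
    """reduce: grouped (term, list(docID)) -> postings lists with term counts."""
    index = {}
    for term, grp in groupby(sorted(pairs), key=lambda p: p[0]):
        docs = [d for _, d in grp]
        index[term] = [(d, docs.count(d)) for d in sorted(set(docs))]
    return index

# map d1: "C came C c'ed" and d2: "C died", then reduce:
#   C -> [(d1, 2), (d2, 1)], came -> [(d1, 1)], ...
```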


Dynamic indexing

  • Up to now, we have assumed that collections are static.

  • They rarely are:

    • Documents come in over time and need to be inserted.

    • Documents are deleted and modified.

  • This means that the dictionary and postings lists have to be modified:

    • Postings updates for terms already in dictionary

    • New terms added to dictionary

Simplest approach

  • Maintain “big” main index

  • New docs go into “small” auxiliary index

  • Search across both, merge results

  • Deletions

    • Invalidation bit-vector for deleted docs

    • Filter docs output on a search result by this invalidation bit-vector

  • Periodically, re-index into one main index

Issues with main and auxiliary indexes

  • Problem of frequent merges – you touch stuff a lot

  • Poor performance during merge

  • Actually:

    • Merging of the auxiliary index into the main index is efficient if we keep a separate file for each postings list.

    • Merge is the same as a simple append.

    • But then we would need a lot of files – inefficient for OS.

  • Assumption for the rest of the lecture: The index is one big file.

  • In reality: Use a scheme somewhere in between (e.g., split very large postings lists, collect postings lists of length 1 in one file etc.)

Logarithmic merge

  • Maintain a series of indexes, each twice as large as the previous one

    • At any time, some of these powers of 2 are instantiated

  • Keep smallest (Z0) in memory

  • Larger ones (I0, I1, …) on disk

  • If Z0 gets too big (> n), write to disk as I0

  • or merge with I0 (if I0 already exists) as Z1

  • Either write Z1 to disk as I1 (if no I1)

  • Or merge with I1 to form Z2
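The cascade behaves like carry propagation in binary counting; a sketch with lists standing in for on-disk indexes (`indexes[i]` holds I_i or None):

```python
def logarithmic_insert(indexes, z0, n):
    """If the in-memory index Z0 exceeds n postings, spill it and cascade:
    merge with I0 if present, then I1, ... until an empty level is found."""
    if len(z0) <= n:
        return indexes, z0          # still fits in memory
    carry = sorted(z0)
    i = 0
    while True:
        if i == len(indexes):
            indexes.append(None)    # grow a new, larger level
        if indexes[i] is None:
            indexes[i] = carry      # write as I_i
            break
        carry = sorted(indexes[i] + carry)   # merge I_i in -> Z_{i+1}
        indexes[i] = None
        i += 1
    return indexes, []              # fresh empty Z0
```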

Logarithmic merge

  • Auxiliary and main index: index construction time is O(T²) as each posting is touched in each merge.

  • Logarithmic merge: Each posting is merged O(log T) times, so complexity is O(T log T)

  • So logarithmic merge is much more efficient for index construction

  • But query processing now requires the merging of O(log T) indexes

    • Whereas it is O(1) if you just have a main and auxiliary index

Further issues with multiple indexes

  • Collection-wide statistics are hard to maintain

  • E.g., when we spoke of spell-correction: which of several corrected alternatives do we present to the user?

    • We said, pick the one with the most hits

  • How do we maintain the top ones with multiple indexes and invalidation bit vectors?

    • One possibility: ignore everything but the main index for such ordering

  • Will see more such statistics used in results ranking

Dynamic indexing at search engines

  • All the large search engines now do dynamic indexing

  • Their indices have frequent incremental changes

    • News items, blogs, new topical web pages

      • Sarah Palin, …

  • But (sometimes/typically) they also periodically reconstruct the index from scratch

    • Query processing is then switched to the new index, and the old index is deleted

Other sorts of indexes

  • Positional indexes

    • Same sort of sorting problem … just larger                ← WHY?

  • Building character n-gram indexes:

    • As text is parsed, enumerate n-grams.

    • For each n-gram, need pointers to all dictionary terms containing it – the “postings”.

    • Note that the same “postings entry” will arise repeatedly in parsing the docs – need efficient hashing to keep track of this.

      • E.g., that the trigram uou occurs in the term deciduous will be discovered on each text occurrence of deciduous

      • Only need to process each term once


Resources for today’s lecture

  • Chapter 4 of IIR

  • MG Chapter 5

  • Original publication on MapReduce: Dean and Ghemawat (2004)

  • Original publication on SPIMI: Heinz and Zobel (2003)

CS276: Information Retrieval and Web Search

Pandu Nayak and Prabhakar Raghavan

Lecture 5: Index Compression

Course work

  • Problem set 1 due Thursday

  • Programming exercise 1 will be handed out today

Last lecture – index construction

  • Sort-based indexing 
    • Naïve in-memory inversion 
    • Blocked Sort-Based Indexing
      • Merge sort is effective for disk-based sorting (avoid seeks!)
  • Single-Pass In-Memory Indexing
    • No global dictionary
      • Generate separate dictionary for each block
    • Don’t sort postings
      • Accumulate postings in postings lists as they occur
  • Distributed indexing using MapReduce
  • Dynamic indexing: Multiple indices, logarithmic merge


  • Collection statistics in more detail (with RCV1)

    • How big will the dictionary and postings be?

  • Dictionary compression

  • Postings compression

Why compression (in general)?

  • Use less disk space
    • Saves a little money
  • Keep more stuff in memory
    • Increases speed
  • Increase speed of data transfer from disk to memory
    • [read compressed data | decompress] is faster than [read uncompressed data]
    • Premise: Decompression algorithms are fast
      • True of the decompression algorithms we use

Why compression for inverted indexes?

  • Dictionary

    • Make it small enough to keep in main memory

    • Make it so small that you can keep some postings lists in main memory too

  • Postings file(s)

    • Reduce disk space needed

    • Decrease time needed to read postings lists from disk

    • Large search engines keep a significant part of the postings in memory.

      • Compression lets you keep more in memory

  • We will devise various IR-specific compression schemes

Recall Reuters RCV1

    symbol                             statistic                                     value
  •     N                               documents                                800,000

  •      L                     avg. # tokens per doc                             200

  •     M                     terms (= word types)                       ~400,000

  •                             avg. # bytes per token                             6

                                   (incl. spaces/punct.)

  •                             avg. # bytes per token                           4.5

                               (without spaces/punct.)

  •                             avg. # bytes per term                             7.5

  •                             non-positional postings                  100,000,000

Index parameters vs. what we index (details IIR Table 5.1, p.80)

Exercise: give intuitions for all the ‘0’ entries. Why do some zero entries correspond to big deltas in other columns?

Lossless vs. lossy compression

  • Lossless compression: All information is preserved.

    • What we mostly do in IR.

  • Lossy compression: Discard some information

  • Several of the preprocessing steps can be viewed as lossy compression: case folding, stop words, stemming, number elimination.

  • Chap/Lecture 7: Prune postings entries that are unlikely to turn up in the top k list for any query.

    • Almost no loss quality for top k list.

Vocabulary vs. collection size

  • How big is the term vocabulary?

    • That is, how many distinct words are there?

  • Can we assume an upper bound?

    • Not really: At least 70²⁰ ≈ 10³⁷ different words of length 20

  • In practice, the vocabulary will keep growing with the collection size

    • Especially with Unicode :)

Vocabulary vs. collection size

  • Heaps’ law: M = kTᵇ

  • M is the size of the vocabulary, T is the number of tokens in the collection

  • Typical values: 30 ≤ k ≤ 100 and b ≈ 0.5

  • In a log-log plot of vocabulary size M vs. T, Heaps’ law predicts a line with slope about ½

    • It is the simplest possible relationship between the two in log-log space

    • An empirical finding (“empirical law”)

Heaps’ Law

For RCV1, the dashed line

log₁₀M = 0.49 log₁₀T + 1.64 is the best least squares fit.

Thus, M = 10^1.64 · T^0.49, so k = 10^1.64 ≈ 44 and b = 0.49.

Good empirical fit for Reuters RCV1 !

For first 1,000,020 tokens,

law predicts 38,323 terms;

actually, 38,365 terms.
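Plugging the fitted constants into Heaps’ law reproduces the prediction:

```python
def heaps_m(T, k=10 ** 1.64, b=0.49):
    """Heaps' law M = k·T^b with the RCV1 least-squares fit (k ≈ 44, b = 0.49)."""
    return k * T ** b

# heaps_m(1_000_020) is roughly 38,000 terms, in line with the slide's 38,323
# (small differences come from rounding k and b).
```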


  • What is the effect of including spelling errors, vs. automatically correcting spelling errors on Heaps’ law?

  • Compute the vocabulary size M for this scenario:

    • Looking at a collection of web pages, you find that there are 3000 different terms in the first 10,000 tokens and 30,000 different terms in the first 1,000,000 tokens.

    • Assume a search engine indexes a total of 20,000,000,000 (2 × 1010) pages, containing 200 tokens on average

    • What is the size of the vocabulary of the indexed collection as predicted by Heaps’ law?
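One way to work the exercise (fit k and b from the two measurements, then extrapolate — just the arithmetic, not an official solution):

```python
import math

T1, M1 = 10_000, 3_000        # terms seen in the first 10,000 tokens
T2, M2 = 1_000_000, 30_000    # terms seen in the first 1,000,000 tokens

b = math.log(M2 / M1) / math.log(T2 / T1)   # 10 = 100^b, so b = 0.5
k = M1 / T1 ** b                            # 3000 / 100 = 30

T = 20_000_000_000 * 200                    # 2×10^10 pages × 200 tokens/page
M = k * T ** b                              # 30·√(4×10^12) = 60,000,000 terms
```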

Zipf’s law

  • Heaps’ law gives the vocabulary size in collections.

  • We also study the relative frequencies of terms.

  • In natural language, there are a few very frequent terms and very many very rare terms.

  • Zipf’s law: The ith most frequent term has frequency proportional to 1/i .

  • cfi ∝ 1/i, that is, cfi = K/i where K is a normalizing constant

  • cfi is collection frequency: the number of occurrences of the term ti in the collection.

Zipf consequences

  • If the most frequent term (the) occurs cf1 times

    • then the second most frequent term (of) occurs cf1/2 times

    • the third most frequent term (and) occurs cf1/3 times …

  • Equivalent: cfi = K/i where K is a normalizing factor, so

    • log cfi = log K - log i

    • Linear relationship between log cfi and log i

  • Another power law relationship

Zipf’s law for Reuters RCV1


  • Now, we will consider compressing the space for the dictionary and postings

    • Basic Boolean index only

    • No study of positional indexes, etc.

    • We will consider compression schemes


Why compress the dictionary?

  • Search begins with the dictionary

  • We want to keep it in memory

  • Memory footprint competition with other applications

  • Embedded/mobile devices may have very little memory

  • Even if the dictionary isn’t in memory, we want it to be small for a fast search startup time

  • So, compressing the dictionary is important

Dictionary storage - first cut

  • Array of fixed-width entries
    • ~400,000 terms; 28 bytes/term = 11.2 MB.

                     ↓                        20 bytes                4 bytes each


Dictionary search structure

Fixed-width terms are wasteful

  • Most of the bytes in the Term column are wasted – we allot 20 bytes for 1 letter terms.

    • And we still can’t handle supercalifragilisticexpialidocious or hydrochlorofluorocarbons.

  • Written English averages ~4.5 characters/word.

    • Exercise: Why is/isn’t this the number to use for estimating the dictionary size?

  • Ave. dictionary word in English: ~8 characters

    • How do we use ~8 characters per dictionary term?

  • Short words dominate token counts but not type average.

Compressing the term list: Dictionary-as-a-String

  • Store dictionary as a (long) string of characters:

    • Pointer to next word shows end of current word

    • Hope to save up to 60% of dictionary space.

Space for dictionary as a string

  • 4 bytes per term for Freq. 

  • 4 bytes per term for pointer to Postings.

  • 3 bytes per term pointer into the string

  • Avg. 8 bytes per term in term string → now avg. 11 bytes/term for the term itself, not 20

  • 400K terms × 19 bytes ⇒ 7.6 MB (against 11.2 MB for fixed width)


  • Store pointers to every kth term string.

    • Example below: k=4.

  • Need to store term lengths (1 extra byte)


  • Example for block size k = 4

  • Where we used 3 bytes/pointer without blocking

    • 3 x 4 = 12 bytes,

  • now we use 3 + 4 = 7 bytes.

  • Shaved another ~0.5MB. This reduces the size of the dictionary from 7.6 MB to 7.1 MB.

  • We can save more with larger k.

  • Why not go with larger k?


  • Estimate the space usage (and savings compared to 7.6 MB) with blocking, for block sizes of k = 4, 8 and 16.

Dictionary search without blocking

  • Assuming each dictionary term equally likely in query

    (not really so in practice!), average number of comparisons

    =(1+2∙2+4∙3+4)/8 ~2.6 

  • Exercise: what if the frequencies of query terms

    were non-uniform but known,

    how would you structure the dictionary search tree?

Dictionary search with blocking

  • Binary search down to 4-term block;

    • Then linear search through terms in block.

  • Blocks of 4 (binary tree), avg. = (1+2∙2+2∙3+2∙4+5)/8 = 3 compares


  • Estimate the impact on search performance (and slowdown compared to k=1) with blocking, for block sizes of k = 4, 8 and 16. 

Front coding

  • Front-coding:

    • Sorted words commonly have long common prefix – store differences only

    • (for last k-1 in a block of k)

    • 8automata8automate9automatic10automation

                        → 8automat*a1♦e2ic3ion


               Encodes automat          Extra length beyond automat.

  • Begins to resemble general string compression.
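A sketch of one common front-coding variant (each term after the first is stored as its common-prefix length with the previous term, plus the extra characters):

```python
def front_code_block(terms):
    """Front-code one block of sorted terms: the first term in full,
    then (shared-prefix length, suffix) for each subsequent term."""
    first = terms[0]
    coded = [(len(first), first)]
    for prev, term in zip(terms, terms[1:]):
        p = 0
        while p < min(len(prev), len(term)) and prev[p] == term[p]:
            p += 1
        coded.append((p, term[p:]))    # (prefix length, extra characters)
    return coded

# front_code_block(["automata", "automate", "automatic", "automation"])
#   -> [(8, 'automata'), (7, 'e'), (7, 'ic'), (8, 'on')]
```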

RCV1 dictionary compression summary

  Technique                                           Size in MB

  Fixed width                                         11.2
  Dictionary-as-String with pointers to every term     7.6
  Also, blocking k = 4                                 7.1
  Also, blocking + front coding                        5.9


Postings compression

  • The postings file is much larger than the dictionary, factor of at least 10.

  • Key desideratum: store each posting compactly.

  • A posting for our purposes is a docID.

  • For Reuters (800,000 documents), we would use 32 bits per docID when using 4-byte integers.

  • Alternatively, we can use log2 800,000 ≈ 20 bits per docID.

  • Our goal: use far fewer than 20 bits per docID.

Postings: two conflicting forces

  • A term like arachnocentric occurs in maybe one doc out of a million – we would like to store this posting using log2 1M ~ 20 bits.

  • A term like the occurs in virtually every doc, so 20 bits/posting is too expensive.

    • Prefer 0/1 bitmap vector in this case

Postings file entry

  • We store the list of docs containing a term in increasing order of docID.

    • computer: 33,47,154,159,202 …

  • Consequence: it suffices to store gaps.

    • 33,14,107,5,43 …

  • Hope: most gaps can be encoded/stored with far fewer than 20 bits.
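The gap transformation above is a two-line computation. A minimal sketch (function names are mine; the docID list is the slide's computer example):

```python
def to_gaps(doc_ids):
    """Sorted docIDs -> first docID followed by gaps to each successor."""
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def from_gaps(gaps):
    """Invert to_gaps by cumulative summation."""
    out, total = [], 0
    for g in gaps:
        total += g
        out.append(total)
    return out

print(to_gaps([33, 47, 154, 159, 202]))   # [33, 14, 107, 5, 43]
```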

Three postings entries

Variable length encoding

  • Aim:

    • For arachnocentric, we will use ~20 bits/gap entry.

    • For the, we will use ~1 bit/gap entry.

  • If the average gap for a term is G, we want to use ~log2G bits/gap entry.

  • Key challenge: encode every integer (gap) with about as few bits as needed for that integer.

  • This requires a variable length encoding

  • Variable length codes achieve this by using short codes for small numbers

Variable Byte (VB) codes

  • For a gap value G, we want to use close to the fewest bytes needed to hold log2 G bits

  • Begin with one byte to store G and dedicate 1 bit in it to be a continuation bit c

  • If G ≤127, binary-encode it in the 7 available bits and set c =1

  • Else encode G’s lower-order 7 bits and then use additional bytes to encode the higher order bits using the same algorithm

  • At the end set the continuation bit of the last byte to 1 (c =1) – and for the other bytes c = 0.
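The algorithm above can be sketched directly, following the convention that the continuation bit (the high bit, value 128) is set only on the last byte of each number. This mirrors the textbook's VB pseudocode; the function names are mine:

```python
def vb_encode(g: int) -> bytes:
    """Variable Byte encode one gap: 7 payload bits per byte,
    high bit set on the final byte only."""
    out = []
    while True:
        out.insert(0, g % 128)
        if g < 128:
            break
        g //= 128
    out[-1] += 128            # mark the final byte (c = 1)
    return bytes(out)

def vb_decode(data: bytes) -> list[int]:
    """Decode a concatenation of VB-encoded gaps."""
    numbers, n = [], 0
    for byte in data:
        if byte < 128:        # c = 0: more bytes of this number follow
            n = n * 128 + byte
        else:                 # c = 1: last byte of this number
            numbers.append(n * 128 + (byte - 128))
            n = 0
    return numbers

encoded = b"".join(vb_encode(g) for g in [824, 5, 214577])
print(vb_decode(encoded))     # [824, 5, 214577]
```

Because the high bit marks number boundaries, the concatenated stream needs no separators and decoding is a single byte-by-byte scan.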


Other variable unit codes

  • Instead of bytes, we can also use a different “unit of alignment”: 32 bits (words), 16 bits, 4 bits (nibbles).

  • Variable byte alignment wastes space if you have many small gaps – nibbles do better in such cases.

  • Variable byte codes:

    • Used by many commercial/research systems

    • Good low-tech blend of variable-length coding, matched to computer memory alignment (vs. bit-level codes, which we look at next).

  • There is also recent work on word-aligned codes that pack a variable number of gaps into one word

Unary code

  • Represent n as n 1s with a final 0.

  • Unary code for 3 is 1110.

  • Unary code for 40 is

  • 11111111111111111111111111111111111111110 .

  • Unary code for 80 is:

  • 111111111111111111111111111111111111111111111111111111111111111111111111111111110

  • This doesn’t look promising, but….

Gamma codes

  • We can compress better with bit-level codes

    • The Gamma code is the best known of these.

  • Represent a gap G as a pair length and offset

  • offset is G in binary, with the leading bit cut off

    • For example 13 → 1101 → 101

  • length is the length of offset

    • For 13 (offset 101), this is 3.

  • We encode length with unary code: 1110.

  • Gamma code of 13 is the concatenation of length and offset: 1110101
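The length/offset construction above fits in a few lines. A minimal sketch for gaps G ≥ 1 (function names are mine):

```python
def gamma_encode(g: int) -> str:
    """Gamma code: unary length, then offset (binary minus leading 1)."""
    offset = bin(g)[3:]                  # drop '0b' and the leading 1 bit
    length = "1" * len(offset) + "0"     # unary code for len(offset)
    return length + offset

def gamma_decode(code: str) -> int:
    """Invert gamma_encode for a single codeword."""
    n = code.index("0")                  # unary part: n ones then a 0
    return int("1" + code[n + 1:], 2)    # restore the leading 1 bit

print(gamma_encode(13))   # 1110101
print(gamma_encode(1))    # 0  (empty offset, unary length 0)
```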

Gamma code examples

Gamma code properties

  • G is encoded using 2⌊log2 G⌋ + 1 bits

    • Length of offset is ⌊log2 G⌋ bits

    • Length of length is ⌊log2 G⌋ + 1 bits

  • All gamma codes have an odd number of bits

  • Almost within a factor of 2 of best possible, log2 G

  • Gamma code is uniquely prefix-decodable, like VB

  • Gamma code can be used for any distribution

  • Gamma code is parameter-free

Gamma seldom used in practice

  • Machines have word boundaries – 8, 16, 32, 64 bits

    • Operations that cross word boundaries are slower

  • Compressing and manipulating at the granularity of bits can be slow

  • Variable byte encoding is aligned and thus potentially more efficient

  • Regardless of efficiency, variable byte is conceptually simpler at little additional space cost

RCV1 compression

Index compression summary

  • We can now create an index for highly efficient Boolean retrieval that is very space efficient
  • Only 4% of the total size of the collection

  • Only 10-15% of the total size of the text in the collection

  • However, we’ve ignored positional information

  • Hence, space savings are less for indexes used in practice

    • But techniques substantially the same.

Resources for today’s lecture

  • IIR 5

  • MG 3.3, 3.4.

  • F. Scholer, H.E. Williams and J. Zobel. 2002. Compression of Inverted Indexes For Fast Query Evaluation. Proc. ACM-SIGIR 2002.

    • Variable byte codes

  • V. N. Anh and A. Moffat. 2005. Inverted Index Compression Using Word-Aligned Binary Codes. Information Retrieval 8: 151–166.

    • Word aligned codes

CS276: Information Retrieval and Web Search

Pandu Nayak and Prabhakar Raghavan

Lecture 6: Scoring, Term Weighting and the Vector Space Model

Recap of lecture 5

  • Collection and vocabulary statistics: Heaps’ and Zipf’s laws
  • Dictionary compression for Boolean indexes
    • Dictionary string, blocks, front coding
  • Postings compression: Gap encoding, prefix-unique codes
    • Variable-Byte and Gamma codes
  Data structure                             Size in MB

  collection (text, xml markup etc)           3,600.0
  collection (text)                             960.0
  Term-doc incidence matrix                  40,000.0
  postings, uncompressed (32-bit words)         400.0
  postings, uncompressed (20 bits)              250.0
  postings, variable byte encoded               116.0
  postings, γ-encoded                           101.0

This lecture; IIR Sections 6.2-6.4.3

  • Ranked retrieval

  • Scoring documents

  • Term frequency

  • Collection statistics

  • Weighting schemes

  • Vector space scoring

Ranked retrieval

  • Thus far, our queries have all been Boolean.

    • Documents either match or don’t.

  • Good for expert users with precise understanding of their needs and the collection.

    • Also good for applications: Applications can easily consume 1000s of results.

  • Not good for the majority of users.

    • Most users incapable of writing Boolean queries (or they are, but they think it’s too much work).

    • Most users don’t want to wade through 1000s of results.

      • This is particularly true of web search.

Problem with Boolean search:

  • feast or famine

  • Boolean queries often result in either too few (=0) or too many (1000s) results.

  • Query 1: “standard user dlink 650” → 200,000 hits

  • Query 2: “standard user dlink 650 no card found”: 0 hits

  • It takes a lot of skill to come up with a query that produces a manageable number of hits.

    • AND gives too few; OR gives too many

Ranked retrieval models

  • Rather than a set of documents satisfying a query expression, in ranked retrieval, the system returns an ordering over the (top) documents in the collection for a query

  • Free text queries: Rather than a query language of operators and expressions, the user’s query is just one or more words in a human language

  • In principle, there are two separate choices here, but in practice, ranked retrieval has normally been associated with free text queries and vice versa

Feast or famine: not a problem in ranked retrieval

  • When a system produces a ranked result set, large result sets are not an issue

    • Indeed, the size of the result set is not an issue

    • We just show the top k ( ≈ 10) results

    • We don’t overwhelm the user

    • Premise: the ranking algorithm works

Scoring as the basis of ranked retrieval

  • We wish to return in order the documents most likely to be useful to the searcher

  • How can we rank-order the documents in the collection with respect to a query?

  • Assign a score – say in [0, 1] – to each document

  • This score measures how well document and query “match”.

Query-document matching scores

  • We need a way of assigning a score to a query/document pair

  • Let’s start with a one-term query 

  • If the query term does not occur in the document: score should be 0

  • The more frequent the query term in the document, the higher the score (should be)

  • We will look at a number of alternatives for this.

Take 1: Jaccard coefficient

  • Recall from Lecture 3: A commonly used measure of overlap of two sets A and B

  • jaccard(A,B) = |A ∩ B| / |A ∪ B|

  • jaccard(A,A) = 1

  • jaccard(A,B) = 0 if A ∩ B = ∅

  • A and B don’t have to be the same size.

  • Always assigns a number between 0 and 1.

Jaccard coefficient: Scoring example

  • What is the query-document match score that the Jaccard coefficient computes for each of the two documents below?

  • Query: ides of march

  • Document 1: caesar died in march

  • Document 2: the long march
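The match scores can be computed directly, treating each text as a set of word types (a minimal sketch; function and variable names are mine):

```python
def jaccard(a: set, b: set) -> float:
    """Overlap of two sets: |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b)

query = set("ides of march".split())
doc1 = set("caesar died in march".split())
doc2 = set("the long march".split())

print(jaccard(query, doc1))   # 1/6 ≈ 0.167 (only "march" is shared)
print(jaccard(query, doc2))   # 1/5 = 0.2
```

Note that Document 2 scores higher simply because it is shorter, which previews the length-normalization issue discussed next.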

Issues with Jaccard for scoring

  • It doesn’t consider term frequency (how many times a term occurs in a document)

  • Rare terms in a collection are more informative than frequent terms. Jaccard doesn’t consider this information

  • We need a more sophisticated way of normalizing for length

  • Later in this lecture, we’ll use |A ∩ B| / √(|A| × |B|)

  • . . . instead of |A ∩ B|/|A ∪ B| (Jaccard) for length normalization.

Recall (Lecture 1): Binary term-document incidence matrix

  • Each document is represented by a binary vector ∈ {0,1}|V|

Term-document count matrices

  • Consider the number of occurrences of a term in a document:

    • Each document is a count vector in ℕ|V|: a column below

Bag of words model

  • Vector representation doesn’t consider the ordering of words in a document

  • John is quicker than Mary and Mary is quicker than John have the same vectors

  • This is called the bag of words model.

  • In a sense, this is a step back: The positional index was able to distinguish these two documents.

  • We will look at “recovering” positional information later in this course.

  • For now: bag of words model

Term frequency tf

  • The term frequency tft,d of term t in document d is defined as the number of times that t occurs in d.

  • We want to use tf when computing query-document match scores. But how?

  • Raw term frequency is not what we want:

    • A document with 10 occurrences of the term is more relevant than a document with 1 occurrence of the term.

    • But not 10 times more relevant.

  • Relevance does not increase proportionally with term frequency.

  • NB: frequency = count in IR

Log-frequency weighting

  • The log frequency weight of term t in d is:

    wt,d = 1 + log10 tft,d  if tft,d > 0;  wt,d = 0 otherwise

  • 0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc.

  • Score for a document-query pair: sum over terms t in both q and d:

    score(q,d) = ∑t∈q∩d (1 + log10 tft,d)

  • The score is 0 if none of the query terms is present in the document.
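The log-frequency scheme can be sketched in a few lines (function names are mine; the scoring function sums weights over terms in both query and document):

```python
import math

def log_tf(tf: int) -> float:
    """Log-frequency weight: 1 + log10(tf) for tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

def score(query_terms, doc_tf):
    """Sum log-frequency weights over terms appearing in both q and d."""
    return sum(log_tf(doc_tf.get(t, 0)) for t in query_terms)

for tf in (0, 1, 2, 10, 1000):
    print(tf, round(log_tf(tf), 1))   # 0→0, 1→1, 2→1.3, 10→2, 1000→4
```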

Document frequency

  • Rare terms are more informative than frequent terms

    • Recall stop words

  • Consider a term in the query that is rare in the collection (e.g., arachnocentric)

  • A document containing this term is very likely to be relevant to the query arachnocentric

  • → We want a high weight for rare terms like arachnocentric.

Document frequency, continued

  • Frequent terms are less informative than rare terms

  • Consider a query term that is frequent in the collection (e.g., high, increase, line)

  • A document containing such a term is more likely to be relevant than a document that doesn’t

  • But it’s not a sure indicator of relevance.

  • → For frequent terms, we want high positive weights for words like high, increase, and line

  • But lower weights than for rare terms.

  • We will use document frequency (df) to capture this.

idf weight

  • dft is the document frequency of t: the number of documents that contain t

    • dft is an inverse measure of the informativeness of t

    • dft ≤ N

  • We define the idf (inverse document frequency) of t by

    idft = log10 (N/dft)

    • We use log (N/dft) instead of N/dft to “dampen” the effect of idf.

  • Will turn out the base of the log is immaterial.

idf example, suppose N = 1 million

There is one idf value for each term t in a collection.

idft = log10 (N/dft)

  term              dft     idft

  calpurnia           1      6
  animal            100      4
  sunday          1,000      3
  fly            10,000      2
  under         100,000      1
  the         1,000,000      0
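The idf column follows directly from the definition with N = 1 million. A minimal sketch (function name is mine):

```python
import math

N = 1_000_000

def idf(df: int) -> float:
    """Inverse document frequency: idf_t = log10(N / df_t)."""
    return math.log10(N / df)

for term, df in [("calpurnia", 1), ("animal", 100), ("sunday", 1_000),
                 ("fly", 10_000), ("under", 100_000), ("the", 1_000_000)]:
    print(f"{term:10} df={df:>9,}  idf={idf(df):.0f}")
```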

Effect of idf on ranking

  • Does idf have an effect on ranking for one-term queries, like

    • iPhone

  • idf has no effect on ranking one term queries

    • idf affects the ranking of documents for queries with at least two terms

    • For the query capricious person, idf weighting makes occurrences of capricious count for much more in the final document ranking than occurrences of person.

Collection vs. Document frequency

  • The collection frequency of t is the number of occurrences of t in the collection, counting multiple occurrences.

  • Example:

  Word         Collection frequency     Document frequency

  insurance    10440                    3997
  try          10422                    8760

  • Which word is a better search term (and should get a higher weight)?

tf-idf weighting

  • The tf-idf weight of a term is the product of its tf weight and its idf weight.

    Wt,d = log(1+ tft,d) x log10(N/dft)

  • Best known weighting scheme in information retrieval

    • Note: the “-” in tf-idf is a hyphen, not a minus sign!

    • Alternative names: tf.idf, tf x idf

  • Increases with the number of occurrences within a document

  • Increases with the rarity of the term in the collection

Score for a document given a query

Score(q,d) = ∑t∈q∩d tf.idft,d

  • There are many variants

    • How “tf” is computed (with/without logs)

    • Whether the terms in the query are also weighted

    • … 
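The score above can be sketched on a tiny hypothetical collection, using the (1 + log10 tf) × log10(N/df) variant of tf-idf (the documents and names here are illustrative, not from the slides):

```python
import math

docs = {
    1: "new home sales top forecasts".split(),
    2: "home sales rise in july".split(),
    3: "increase in home sales in july".split(),
}
N = len(docs)

# Document frequency: number of docs containing each term.
df = {}
for words in docs.values():
    for t in set(words):
        df[t] = df.get(t, 0) + 1

def tf_idf(t, words):
    """Weight of term t in a document, 0 if absent."""
    tf = words.count(t)
    return (1 + math.log10(tf)) * math.log10(N / df[t]) if tf else 0.0

def score(query, doc_id):
    """Score(q,d) = sum of tf-idf weights over query terms in d."""
    return sum(tf_idf(t, docs[doc_id]) for t in query.split())

for d in docs:
    print(d, round(score("sales in july", d), 3))
```

Note that "sales" occurs in every document here, so its idf is 0 and it contributes nothing, illustrating the df-based damping of common terms.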

Binary → count → weight matrix

  • Each document is now represented by a real-valued vector of tf-idf weights ∈ R|V|

Documents as vectors

  • So we have a |V|-dimensional vector space

  • Terms are axes of the space

  • Documents are points or vectors in this space

  • Very high-dimensional: tens of millions of dimensions when you apply this to a web search engine

  • These are very sparse vectors - most entries are zero.

Queries as vectors

  • Key idea 1: Do the same for queries: represent them as vectors in the space

  • Key idea 2: Rank documents according to their proximity to the query in this space

  • proximity = similarity of vectors

  • proximity ≈ inverse of distance

  • Recall: We do this because we want to get away from the you’re-either-in-or-out Boolean model.

  • Instead: rank more relevant documents higher than less relevant documents

Formalizing vector space proximity

  • First cut: distance between two points

    • ( = distance between the end points of the two vectors)

  • Euclidean distance?

  • Euclidean distance is a bad idea . . .

  • . . . because Euclidean distance is large for vectors of different lengths.

Why distance is a bad idea

  • The Euclidean distance between q and d2 is large even though the distribution of terms in the query q and the distribution of terms in the document d2 are very similar.

Use angle instead of distance

  • Thought experiment: take a document d and append it to itself. Call this document d′.

  • “Semantically” d and d′ have the same content

  • The Euclidean distance between the two documents can be quite large

  • The angle between the two documents is 0, corresponding to maximal similarity.

  • Key idea: Rank documents according to angle with query.

From angles to cosines

  • The following two notions are equivalent.

    • Rank documents in decreasing order of the angle between query and document

    • Rank documents in increasing order of cosine(query,document)

  • Cosine is a monotonically decreasing function for the interval [0°, 180°]

From angles to cosines

  • But how – and why – should we be computing cosines?

Length normalization

  • A vector can be (length-) normalized by dividing each of its components by its length – for this we use the L2 norm:

    ‖x‖2 = √(∑i xi²)

  • Dividing a vector by its L2 norm makes it a unit (length) vector (on surface of unit hypersphere)

  • Effect on the two documents d and d′ (d appended to itself) from earlier slide: they have identical vectors after length-normalization.

    • Long and short documents now have comparable weights


  cos(q,d) = (q · d)/(|q||d|) = ∑i qi di / (√∑i qi² √∑i di²)

  • qi is the tf-idf weight of term i in the query

  • di is the tf-idf weight of term i in the document

  • cos(q,d) is the cosine similarity of q and d … or, equivalently, the cosine of the angle between q and d.

Cosine for length-normalized vectors

  • For length-normalized vectors, cosine similarity is simply the dot product (or scalar product):

  • for q, d length-normalized.

Cosine similarity illustrated

Cosine similarity amongst 3 documents

  • How similar are the novels SaS: Sense and Sensibility, PaP: Pride and Prejudice, and WH: Wuthering Heights?

  • Term frequencies (counts):

    term         SaS    PaP    WH
    affection    115     58    20
    jealous       10      7    11
    gossip         2      0     6
    wuthering      0      0    38

  • Note: To simplify this example, we don’t do idf weighting.

3 documents example contd.

  • cos(SaS,PaP) ≈

    0.789 × 0.832 + 0.515 × 0.555 + 0.335 × 0.0 + 0.0 × 0.0

    ≈ 0.94

    cos(SaS,WH) ≈ 0.79

    cos(PaP,WH) ≈ 0.69

  • Why do we have cos(SaS,PaP) > cos(SaS,WH)?
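The example can be reproduced end to end: log-frequency weighting, L2 length normalization, then dot products of the unit vectors (no idf, as in the slide; variable names are mine):

```python
import math

counts = {
    "SaS": {"affection": 115, "jealous": 10, "gossip": 2, "wuthering": 0},
    "PaP": {"affection": 58,  "jealous": 7,  "gossip": 0, "wuthering": 0},
    "WH":  {"affection": 20,  "jealous": 11, "gossip": 6, "wuthering": 38},
}

def normalize(tfs):
    """Log-frequency weights, scaled to a unit-length vector."""
    w = {t: 1 + math.log10(tf) if tf else 0.0 for t, tf in tfs.items()}
    norm = math.sqrt(sum(x * x for x in w.values()))
    return {t: x / norm for t, x in w.items()}

vecs = {doc: normalize(tfs) for doc, tfs in counts.items()}

def cos(a, b):
    """Dot product of two length-normalized vectors."""
    return sum(vecs[a][t] * vecs[b][t] for t in vecs[a])

print(round(cos("SaS", "PaP"), 2))   # 0.94
print(round(cos("SaS", "WH"), 2))    # 0.79
print(round(cos("PaP", "WH"), 2))    # 0.69
```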

Computing cosine scores

tf-idf weighting has many variants

  • Columns headed ‘n’ are acronyms for weight schemes.

  • Why is the base of the log in idf immaterial?

Weighting may differ in queries vs documents

  • Many search engines allow for different weightings for queries vs. documents

  • SMART Notation: denotes the combination in use in an engine, with the notation ddd.qqq, using the acronyms from the previous table

  • A very standard weighting scheme is: lnc.ltc

  • Document: logarithmic tf (l as first character), no idf and cosine normalization

  • Query: logarithmic tf (l in leftmost column), idf (t in second column), no normalization …

  • A bad idea?

tf-idf example: lnc.ltc

  • Document: car insurance auto insurance

    Query: best car insurance

  • Exercise: what is N, the number of docs?

  • Score = 0+0+0.27+0.53 = 0.8

  • Doc length =

Summary – vector space ranking

  • Represent the query as a weighted tf-idf vector

  • Represent each document as a weighted tf-idf vector

  • Compute the cosine similarity score for the query vector and each document vector

  • Rank documents with respect to the query by score

  • Return the top K (e.g., K = 10) to the user

Resources for today’s lecture

  • IIR 6.2 – 6.4.3


    • Term weighting and cosine similarity tutorial for SEO folk!


Information Retrieval and Web Search

Pandu Nayak and Prabhakar Raghavan

Lecture 7: Scoring and results assembly

Lecture 6 – I introduced a bug

  • In my anxiety to avoid taking the log of zero, I rewrote

    wt,d = 1 + log10 tft,d  (if tft,d > 0, else 0)    as    wt,d = log10(1 + tft,d)

  • In fact this was unnecessary, since the zero case is treated specially above; net: the FIRST version above is right.

Recap: tf-idf weighting

  • The tf-idf weight of a term is the product of its tf weight and its idf weight.

    wt,d= (1+log10 tft,d) x log10(N/dft)

  • Best known weighting scheme in information retrieval

  • Increases with the number of occurrences within a document

  • Increases with the rarity of the term in the collection

Recap: Queries as vectors

  • Key idea 1: Do the same for queries: represent them as vectors in the space

  • Key idea 2: Rank documents according to their proximity to the query in this space

  • proximity = similarity of vectors

Recap: cosine(query,document)

  • cos(q,d) is the cosine similarity of q and d … or,

  • equivalently, the cosine of the angle between q and d.

This lecture

  • Speeding up vector space ranking

  • Putting together a complete search system

    • Will require learning about a number of miscellaneous topics and heuristics

Computing cosine scores

Efficient cosine ranking

  • Find the K docs in the collection “nearest” to the query ⇒ K largest query-doc cosines.

  • Efficient ranking:

    • Computing a single cosine efficiently.

    • Choosing the K largest cosine values efficiently.

      • Can we do this without computing all N cosines?

Efficient cosine ranking

  • What we’re doing in effect: solving the K-nearest neighbor problem for a query vector

  • In general, we do not know how to do this efficiently for high-dimensional spaces

  • But it is solvable for short queries, and standard indexes support this well

Special case – unweighted queries

  • No weighting on query terms

    • Assume each query term occurs only once

  • Then for ranking, don’t need to normalize query vector

    • Slight simplification of algorithm from Lecture 6

Computing the K largest cosines: selection vs. sorting

  • Typically we want to retrieve the top K docs (in the cosine ranking for the query)

    • not to totally order all docs in the collection

  • Can we pick off docs with K highest cosines?

  • Let J = number of docs with nonzero cosines

    • We seek the K best of these J

Use heap for selecting top K

  • Binary tree in which each node’s value > the values of children

  • Takes 2J operations to construct, then each of K “winners” read off in 2log J steps.

  • For J=1M, K=100, this is about 10% of the cost of sorting.
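The slide builds a max-heap over all J nonzero cosines and pops K winners; in Python, `heapq.nlargest` is the idiomatic equivalent, maintaining a size-K heap while scanning the J scores. A minimal sketch with random scores standing in for cosines:

```python
import heapq
import random

random.seed(0)
scores = {doc_id: random.random() for doc_id in range(1000)}   # J = 1000

K = 100
# Top K (docID, score) pairs by score, without sorting all J entries.
top_k = heapq.nlargest(K, scores.items(), key=lambda kv: kv[1])

print(len(top_k))                          # 100
print(top_k[0][1] >= top_k[-1][1])         # True: descending by score
```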


  • Primary computational bottleneck in scoring: cosine computation

  • Can we avoid all this computation?

  • Yes, but may sometimes get it wrong

    • a doc not in the top K may creep into the list of K output docs

    • Is this such a bad thing?

Cosine similarity is only a proxy

  • User has a task and a query formulation

  • Cosine matches docs to query

  • Thus cosine is anyway a proxy for user happiness

  • If we get a list of K docs “close” to the top K by cosine measure, should be ok

Generic approach

  • Find a set A of contenders, with K < |A| << N

    • A does not necessarily contain the top K, but has many docs from among the top K

    • Return the top K docs in A

  • Think of A as pruning non-contenders

  • The same approach is also used for other (non-cosine) scoring functions

  • Will look at several schemes following this approach

Index elimination

  • The basic cosine computation algorithm only considers docs containing at least one query term

  • Take this further:

    • Only consider high-idf query terms

    • Only consider docs containing many query terms

High-idf query terms only

  • For a query such as catcher in the rye

  • Only accumulate scores from catcher and rye

  • Intuition: in and the contribute little to the scores and so don’t alter rank-ordering much

  • Benefit:

    • Postings of low-idf terms have many docs ⇒ these (many) docs get eliminated from set A of contenders

Docs containing many query terms

  • Any doc with at least one query term is a candidate for the top K output list

  • For multi-term queries, only compute scores for docs containing several of the query terms

    • Say, at least 3 out of 4

    • Imposes a “soft conjunction” on queries seen on web search engines (early Google)

  • Easy to implement in postings traversal

3 of 4 query terms

  • Scores only computed for docs 8, 16 and 32.

Champion lists

  • Precompute for each dictionary term t, the r docs of highest weight in t’s postings

    • Call this the champion list for t

    • (aka fancy list or top docs for t)

  • Note that r has to be chosen at index build time

    • Thus, it’s possible that r < K

  • At query time, only compute scores for docs in the champion list of some query term

    • Pick the K top-scoring docs from amongst these


  • How do Champion Lists relate to Index Elimination? Can they be used together?

  • How can Champion Lists be implemented in an inverted index?

    • Note that the champion list has nothing to do with small docIDs


  • Static quality scores

  • We want top-ranking documents to be both relevant and authoritative

  • Relevance is being modeled by cosine scores

  • Authority is typically a query-independent property of a document

  • Examples of authority signals

    • Wikipedia among websites

    • Articles in certain newspapers

    • A paper with many citations

    • Many bitly’s, diggs or marks

    • (Pagerank)

Modeling authority

  • Assign a query-independent quality score in [0,1] to each document d

    • Denote this by g(d)

  • Thus, a quantity like the number of citations is scaled into [0,1]

    • Exercise: suggest a formula for this.

Net score

  • Consider a simple total score combining cosine relevance and authority

  • net-score(q,d) = g(d) + cosine(q,d)

    • Can use some other linear combination

    • Indeed, any function of the two “signals” of user happiness – more later

  • Now we seek the top K docs by net score

Top K by net score – fast methods

  • First idea: Order all postings by g(d)

  • Key: this is a common ordering for all postings

  • Thus, can concurrently traverse query terms’ postings for

    • Postings intersection

    • Cosine score computation

  • Exercise: write pseudocode for cosine score computation if postings are ordered by g(d)

Why order postings by g(d)?

  • Under g(d)-ordering, top-scoring docs likely to appear early in postings traversal

  • In time-bound applications (say, we have to return whatever search results we can in 50 ms), this allows us to stop postings traversal early

    • Short of computing scores for all docs in postings

Champion lists in g(d)-ordering

  • Can combine champion lists with g(d)-ordering

  • Maintain for each term a champion list of the r docs with highest g(d) + tf-idftd

  • Seek top-K results from only the docs in these champion lists 

High and low lists

  • For each term, we maintain two postings lists called high and low

    • Think of high as the champion list

  • When traversing postings on a query, only traverse high lists first

    • If we get more than K docs, select the top K and stop

    • Else proceed to get docs from the low lists

  • Can be used even for simple cosine scores, without global quality g(d)

  • A means for segmenting index into two tiers

Impact-ordered postings

  • We only want to compute scores for docs for which wft,d is high enough

  • We sort each postings list by wft,d

  • Now: not all postings in a common order!

  • How do we compute scores in order to pick off top K?

    • Two ideas follow

1. Early termination

  • When traversing t’s postings, stop early after either

    • a fixed number of r docs

    • wft,d drops below some threshold

  • Take the union of the resulting sets of docs

    • One from the postings of each query term

  • Compute only the scores for docs in this union

2. idf-ordered terms

  • When considering the postings of query terms

  • Look at them in order of decreasing idf

    • High idf terms likely to contribute most to score

  • As we update score contribution from each query term

    • Stop if doc scores relatively unchanged

  • Can apply to cosine or some other net scores

Cluster pruning: preprocessing

  • Pick √N docs at random: call these leaders

  • For every other doc, pre-compute nearest leader

    • Docs attached to a leader: its followers;

    • Likely: each leader has ~ √N followers.

Cluster pruning: query processing

  • Process a query as follows:

    • Given query Q, find its nearest leader L.

    • Seek K nearest docs from among L’s followers.


Why use random sampling

  • Fast

  • Leaders reflect data distribution

General variants

  • Have each follower attached to b1=3 (say) nearest leaders.

  • From query, find b2=4 (say) nearest leaders and their followers.

  • Can recurse on leader/follower construction.


  • To find the nearest leader in step 1, how many cosine computations do we do?

    • Why did we have √N in the first place?

  • What is the effect of the constants b1, b2 on the previous slide?

  • Devise an example where this is likely to fail – i.e., we miss one of the K nearest docs.

    • Likely under random sampling.

Parametric and zone indexes

  • Thus far, a doc has been a sequence of terms

  • In fact documents have multiple parts, some with special semantics:
    • Author
    • Title
    • Date of publication
    • Language
    • Format
    • etc.
  • These constitute the metadata about a document


  • We sometimes wish to search by these metadata

    • E.g., find docs authored by William Shakespeare in the year 1601, containing alas poor Yorick

  • Year = 1601 is an example of a field

  • Also, author last name = shakespeare, etc.

  • Field or parametric index: postings for each field value

    • Sometimes build range trees (e.g., for dates)

  • Field query typically treated as conjunction

    • (doc must be authored by shakespeare)


  • A zone is a region of the doc that can contain an arbitrary amount of text, e.g.,

    • Title

    • Abstract

    • References …

  • Build inverted indexes on zones as well to permit querying

  • E.g., “find docs with merchant in the title zone and matching the query gentle rain”

Example zone indexes

Encode zones in dictionary vs. postings.


Tiered indexes

  • Break postings up into a hierarchy of lists

    • Most important

    • Least important

  • Can be done by g(d) or another measure

  • Inverted index thus broken up into tiers of decreasing importance

  • At query time use top tier unless it fails to yield K docs

    • If so drop to lower tiers

Example tiered index

Query term proximity

  • Free text queries: just a set of terms typed into the query box – common on the web

  • Users prefer docs in which query terms occur within close proximity of each other

  • Let w be the smallest window in a doc containing all query terms, e.g.,

  • For the query strained mercy the smallest window in the doc The quality of mercy is not strained is 4 (words)

  • Would like scoring function to take this into account – how?
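Computing the smallest window is itself straightforward; a brute-force sketch over occurrence positions (fine for short queries; function names are mine):

```python
def smallest_window(doc_words, query_terms):
    """Width of the smallest window of doc_words containing all
    query_terms, or None if some term is absent."""
    positions = [i for i, w in enumerate(doc_words) if w in query_terms]
    best = float("inf")
    for i in positions:
        for j in positions:
            if j >= i and query_terms <= set(doc_words[i:j + 1]):
                best = min(best, j - i + 1)
                break   # widening j further cannot improve this i
    return best if best != float("inf") else None

doc = "the quality of mercy is not strained".split()
print(smallest_window(doc, {"strained", "mercy"}))   # 4
```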

Query parsers

  • Free text query from user may in fact spawn one or more queries to the indexes, e.g., query rising interest rates

    • Run the query as a phrase query

    • If <K docs contain the phrase rising interest rates, run the two phrase queries rising interest and interest rates

    • If we still have <K docs, run the vector space query rising interest rates

    • Rank matching docs by vector space scoring

  • This sequence is issued by a query parser

Aggregate scores

  • We’ve seen that score functions can combine cosine, static quality, proximity, etc.

  • How do we know the best combination?

  • Some applications – expert-tuned

  • Increasingly common: machine-learned

    • See May 19th lecture

Putting it all together


  • IIR 7, 6.1


Information Retrieval and Web Search

Pandu Nayak and Prabhakar Raghavan

Lecture 8: Evaluation

This lecture

  • How do we know if our results are any good?

    • Evaluating a search engine

      • Benchmarks

      • Precision and recall

  • Results summaries:

    • Making our good results usable to a user

Evaluating Search Engines

Measures for a search engine

  • How fast does it index

    • Number of documents/hour

    • (Average document size)

  • How fast does it search

    • Latency as a function of index size

  • Expressiveness of query language

    • Ability to express complex information needs

    • Speed on complex queries

  • Uncluttered UI

  • Is it free?

Measures for a search engine

  • All of the preceding criteria are measurable: we can quantify speed/size

    • we can make expressiveness precise

  • The key measure: user happiness

    • What is this?

    • Speed of response/size of index are factors

    • But blindingly fast, useless answers won’t make a user happy

  • Need a way of quantifying user happiness

Measuring user happiness

  • Issue: who is the user we are trying to make happy?

    • Depends on the setting

  • Web engine:

    • User finds what s/he wants and returns to the engine

      • Can measure rate of return users

    • User completes task – search as a means, not end

    • See Russell

  • eCommerce site: user finds what s/he wants and buys

    • Is it the end-user, or the eCommerce site, whose happiness we measure?

    • Measure time to purchase, or fraction of searchers who become buyers?

Measuring user happiness

  • Enterprise (company/govt/academic): Care about “user productivity”

    • How much time do my users save when looking for information?

    • Many other criteria having to do with breadth of access, secure access, etc.

Happiness: elusive to measure

  • Most common proxy: relevance of search results

  • But how do you measure relevance?

  • We will detail a methodology here, then examine its issues

  • Relevance measurement requires 3 elements:

  1. A benchmark document collection

  2. A benchmark suite of queries

  3. A usually binary assessment of either Relevant or Nonrelevant for each query and each document

        • Some work on more-than-binary, but not the standard

Evaluating an IR system

  • Note: the information need is translated into a query

  • Relevance is assessed relative to the information need not the query

  • E.g., Information need: I'm looking for information on whether drinking red wine is more effective at reducing your risk of heart attacks than white wine.

  • Query: wine red white heart attack effective

  • Evaluate whether the doc addresses the information need, not whether it has these words

Standard relevance benchmarks

  • TREC - National Institute of Standards and Technology (NIST) has run a large IR test bed for many years

  • Reuters and other benchmark doc collections used

  • “Retrieval tasks” specified

    • sometimes as queries

  • Human experts mark, for each query and for each doc, Relevant or Nonrelevant

    • or at least for subset of docs that some system returned for that query

Unranked retrieval evaluation: Precision and Recall

  • Precision: fraction of retrieved docs that are relevant = P(relevant|retrieved)

  • Recall: fraction of relevant docs that are retrieved = P(retrieved|relevant)

                      Relevant      Nonrelevant
    Retrieved         tp            fp
    Not Retrieved     fn            tn

  • Precision P = tp/(tp + fp)
  • Recall R = tp/(tp + fn)

Should we instead use the accuracy measure for evaluation?

  • Given a query, an engine classifies each doc as “Relevant” or “Nonrelevant”

  • The accuracy of an engine: the fraction of these classifications that are correct

    • (tp + tn) / ( tp + fp + fn + tn)

  • Accuracy is a commonly used evaluation measure in machine learning classification work

  • Why is this not a very useful evaluation measure in IR?

Why not just use accuracy?

  • How to build a 99.9999% accurate search engine on a low budget….

  • People doing information retrieval want to find something and have a certain tolerance for junk.


  • You can get high recall (but low precision) by retrieving all docs for all queries!

  • Recall is a non-decreasing function of the number of docs retrieved

  • In a good system, precision decreases as either the number of docs retrieved or recall increases

    • This is not a theorem, but a result with strong empirical confirmation

Difficulties in using precision/recall

  • Should average over large document collection/query ensembles

  • Need human relevance assessments

    • People aren’t reliable assessors

  • Assessments have to be binary

    • Nuanced assessments?

  • Heavily skewed by collection/authorship

    • Results may not translate from one domain to another

A combined measure: F

  • Combined measure that assesses precision/recall tradeoff is F measure (weighted harmonic mean):

    F = 1 / (α/P + (1 − α)/R) = (β² + 1)PR / (β²P + R),  where β² = (1 − α)/α

  • People usually use balanced F1 measure

    • i.e., with β = 1 or α = ½

  • Harmonic mean is a conservative average

    • See CJ van Rijsbergen, Information Retrieval
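The weighted harmonic mean is easy to compute directly; a small sketch (using the standard F_β formulation, with β = 1 giving F1):

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall:
    F = (beta^2 + 1) * P * R / (beta^2 * P + R)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (b2 + 1) * precision * recall / (b2 * precision + recall)

print(f_measure(0.5, 0.5))   # -> 0.5
print(f_measure(0.1, 0.9))   # -> 0.18, well below the arithmetic mean of 0.5
```

The second call shows why the harmonic mean is "conservative": one poor component drags the score down sharply.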

F1 and other averages

Evaluating ranked results

  • Evaluation of ranked results:

    • The system can return any number of results

    • By taking various numbers of the top returned documents (levels of recall), the evaluator can produce a precision-recall curve

A precision-recall curve

Averaging over queries

  • A precision-recall graph for one query isn’t a very sensible thing to look at

  • You need to average performance over a whole bunch of queries.

  • But there’s a technical issue:

    • Precision-recall calculations place some points on the graph

    • How do you determine a value (interpolate) between the points?

Interpolated precision

  • Idea: If locally precision increases with increasing recall, then you should get to count that…

  • So you take the max of precisions to right of value
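Taking the max of precisions at or to the right of a recall level can be sketched as (the sample curve values are made up for illustration):

```python
def interpolated_precision(points, r):
    """points: list of (recall, precision) pairs for one query.
    Interpolated precision at recall level r is the maximum precision
    achieved at any recall level >= r."""
    candidates = [p for (rec, p) in points if rec >= r]
    return max(candidates) if candidates else 0.0

curve = [(0.2, 1.0), (0.4, 0.67), (0.6, 0.5), (0.8, 0.45), (1.0, 0.3)]
print(interpolated_precision(curve, 0.5))  # -> 0.5
print(interpolated_precision(curve, 0.0))  # -> 1.0 (the value for 0 is interpolated)
```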


  • Graphs are good, but people want summary measures!

    • Precision at fixed retrieval level

      • Precision-at-k: Precision of top k results

      • Perhaps appropriate for most of web search: all people want are good matches on the first one or two results pages

      • But: averages badly and has an arbitrary parameter of k

    • 11-point interpolated average precision

      • The standard measure in the early TREC competitions: you take the precision at 11 levels of recall varying from 0 to 1 by tenths, using interpolation (the value for 0 is always interpolated!), and average them

      • Evaluates performance at all recall levels

Typical (good) 11 point precisions

  • SabIR/Cornell 8A1 11pt precision from TREC 8 (1999)

Yet more evaluation measures…

  • Mean average precision (MAP)

    • Average of the precision value obtained for the top k documents, each time a relevant doc is retrieved

    • Avoids interpolation, use of fixed recall levels

    • MAP for query collection is arithmetic ave.

      • Macro-averaging: each query counts equally

  • R-precision

    • If we have a known (though perhaps incomplete) set of relevant documents of size Rel, then calculate precision of the top Rel docs returned

    • Perfect system could score 1.0.
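MAP as described above — precision at each rank where a relevant doc appears, averaged per query, then macro-averaged across queries — can be sketched as:

```python
def average_precision(ranked, relevant):
    """ranked: doc IDs in rank order; relevant: set of relevant doc IDs.
    AP = sum of precision@k at each rank k holding a relevant doc,
    divided by the total number of relevant docs."""
    hits, precisions = 0, []
    for k, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """runs: list of (ranked, relevant) pairs, one per query.
    Macro-average: each query counts equally."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

print(average_precision(list("abcde"), {"a", "c"}))  # (1/1 + 2/3) / 2 ≈ 0.833
```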


  • For a test collection, it is usual that a system does crummily on some information needs (e.g., MAP = 0.1) and excellently on others (e.g., MAP = 0.7)

  • Indeed, it is usually the case that the variance in performance of the same system across queries is much greater than the variance of different systems on the same query.

  • That is, there are easy information needs and hard ones!

Creating Test Collections
for IR Evaluation

Test Collections

From document collections to test collections

  • Still need
    • Test queries
    • Relevance assessments

  • Test queries

    • Must be germane to docs available

    • Best designed by domain experts

    • Random query terms generally not a good idea

  • Relevance assessments

    • Human judges, time-consuming

    • Are human panels perfect?

Kappa measure for inter-judge (dis)agreement

  • Kappa measure

    • Agreement measure among judges

    • Designed for categorical judgments

    • Corrects for chance agreement

  • Kappa = [ P(A) – P(E) ] / [ 1 – P(E) ]

  • P(A) – proportion of time judges agree

  • P(E) – what agreement would be by chance

  • Kappa = 0 for chance agreement, 1 for total agreement.
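For two judges making binary Relevant/Nonrelevant calls, kappa with pooled marginals can be sketched as (the argument names are illustrative):

```python
def kappa(n11, n00, n10, n01):
    """Kappa for two judges over binary judgments.
    n11: both Relevant, n00: both Nonrelevant,
    n10: judge 1 Relevant / judge 2 Nonrelevant, n01: the reverse."""
    n = n11 + n00 + n10 + n01
    p_a = (n11 + n00) / n                    # P(A): observed agreement
    p_rel = (2 * n11 + n10 + n01) / (2 * n)  # pooled marginal P(relevant)
    p_e = p_rel ** 2 + (1 - p_rel) ** 2      # P(E): agreement by chance
    return (p_a - p_e) / (1 - p_e)

print(round(kappa(300, 70, 20, 10), 3))  # -> 0.776, matching the worked example
```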

Kappa Measure: Example

  • P(A)? P(E)?

    Number of docs    Judge 1        Judge 2
    300               Relevant       Relevant
    70                Nonrelevant    Nonrelevant
    20                Relevant       Nonrelevant
    10                Nonrelevant    Relevant

Kappa Example

  • P(A) = 370/400 = 0.925

  • P(nonrelevant) = (10+20+70+70)/800 = 0.2125

  • P(relevant) = (10+20+300+300)/800 = 0.7875

  • P(E) = 0.2125^2 + 0.7878^2 = 0.665

  • Kappa = (0.925 – 0.665)/(1-0.665) = 0.776

  • Kappa > 0.8 = good agreement

  • 0.67 < Kappa < 0.8 -> “tentative conclusions” (Carletta ’96)

  • Depends on purpose of study

  • For >2 judges: average pairwise kappas 


  • TREC Ad Hoc task from first 8 TRECs is standard IR task

    • 50 detailed information needs a year

    • Human evaluation of pooled results returned

    • More recently other related things: Web track, HARD

  • A TREC query (TREC 5)

    • Number: 225

    • Description:

    • What is the main function of the Federal Emergency Management Agency (FEMA) and the funding level provided to meet emergencies? Also, what resources are available to FEMA such as people, equipment, facilities?

Standard relevance benchmarks: Others

  • GOV2

    • Another TREC/NIST collection

    • 25 million web pages

    • Largest collection that is easily available

    • But still 3 orders of magnitude smaller than what Google/Yahoo/MSN index


  • NTCIR

    • East Asian language and cross-language information retrieval

  • Cross Language Evaluation Forum (CLEF)

    • This evaluation series has concentrated on European languages and cross-language information retrieval.

  • Many others

Impact of Inter-judge Agreement

  • Impact on absolute performance measure can be significant (0.32 vs 0.39)

  • Little impact on ranking of different systems or relative performance

  • Suppose we want to know if algorithm A is better than algorithm B

  • A standard information retrieval experiment will give us a reliable answer to this question.

Critique of pure relevance

  • Relevance vs Marginal Relevance

    • A document can be redundant even if it is highly relevant

    • Duplicates

    • The same information from different sources

    • Marginal relevance is a better measure of utility for the user.

  • Using facts/entities as evaluation units more directly measures true relevance.

  • But harder to create evaluation set

  • See Carbonell reference

Can we avoid human judgment?

  • No

  • Makes experimental work hard

    • Especially on a large scale

  • In some very specific settings, can use proxies

    • E.g.: for approximate vector space retrieval, we can compare the cosine distance closeness of the closest docs to those found by an approximate retrieval algorithm

  • But once we have test collections, we can reuse them (so long as we don’t overtrain too badly)

Evaluation at large search engines

  • Search engines have test collections of queries and hand-ranked results

  • Recall is difficult to measure on the web

  • Search engines often use precision at top k, e.g., k = 10

  • . . . or measures that reward you more for getting rank 1 right than for getting rank 10 right.

    • NDCG (Normalized Discounted Cumulative Gain)

  • Search engines also use non-relevance-based measures.

    • Clickthrough on first result

      • Not very reliable if you look at a single clickthrough … but pretty reliable in the aggregate.

    • Studies of user behavior in the lab

    • A/B testing
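NDCG, mentioned above, is a measure that rewards getting rank 1 right more than rank 10. A minimal sketch of one common formulation (variants differ in the exact gain and discount functions; the example relevance grades are made up):

```python
import math

def ndcg_at_k(gains, ideal_gains, k):
    """gains: graded relevance of the returned results in rank order;
    ideal_gains: the same judgments arranged in the best possible order.
    DCG discounts the gain at rank i (1-based) by 1/log2(i + 1)."""
    def dcg(g):
        return sum(rel / math.log2(i + 2) for i, rel in enumerate(g[:k]))
    ideal = dcg(sorted(ideal_gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0

# A run that puts a grade-2 doc above a grade-3 doc loses a little NDCG.
print(ndcg_at_k([3, 2, 3, 0, 1], [3, 3, 2, 1, 0], 5))
```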

A/B testing

  • Purpose: Test a single innovation

  • Prerequisite: You have a large search engine up and running.

  • Have most users use old system

  • Divert a small proportion of traffic (e.g., 1%) to the new system that includes the innovation

  • Evaluate with an “automatic” measure like clickthrough on first result

  • Now we can directly see if the innovation does improve user happiness.

  • Probably the evaluation methodology that large search engines trust most

  • In principle less powerful than doing a multivariate regression analysis, but easier to understand

Results presentation 

Result Summaries

  • Having ranked the documents matching a query, we wish to present a results list

  • Most commonly, a list of the document titles plus a short summary, aka “10 blue links”


  • The title is often automatically extracted from document metadata. What about the summaries?

    • This description is crucial.

    • User can identify good/relevant hits based on description.

  • Two basic kinds:

    • Static

    • Dynamic

  • A static summary of a document is always the same, regardless of the query that hit the doc

  • A dynamic summary is a query-dependent attempt to explain why the document was retrieved for the query at hand

Static summaries

  • In typical systems, the static summary is a subset of the document

  • Simplest heuristic: the first 50 (or so – this can be varied) words of the document

    • Summary cached at indexing time

  • More sophisticated: extract from each document a set of “key” sentences

    • Simple NLP heuristics to score each sentence

    • Summary is made up of top-scoring sentences.

  • Most sophisticated: NLP used to synthesize a summary

    • Seldom used in IR; cf. text summarization work

Dynamic summaries

  • Present one or more “windows” within the document that contain several of the query terms

    • “KWIC” snippets: Keyword in Context presentation

Techniques for dynamic summaries

  • Find small windows in doc that contain query terms

    • Requires fast window lookup in a document cache

  • Score each window wrt query

    • Use various features such as window width, position in document, etc.

    • Combine features through a scoring function – methodology to be covered Nov 12th

  • Challenges in evaluation: judging summaries

    • Easier to do pairwise comparisons rather than binary relevance assessments


  • For a navigational query such as united airlines, the user's need is likely satisfied on the United Airlines home page

  • Quicklinks provide navigational cues on that home page

Alternative results presentations?

Resources for this lecture

  • IIR 8

  • MIR Chapter 3

  • MG 4.5

  • Carbonell and Goldstein 1998. The use of MMR, diversity-based reranking for reordering documents and producing summaries. SIGIR 21.


Information Retrieval and Web Search

Pandu Nayak and Prabhakar Raghavan

Lecture 9: Query expansion 


  • Midterm in class on Thursday 28th

  • Material from first 8 lectures

  • Open book, open notes

  • You can use (and should bring!) a basic calculator

  • You cannot use any wired or wireless communication. Use of such communication will be regarded as an Honor Code violation.

  • You can preload the pdf of the book on to your laptop which you can use disconnected in the room.

Recap of the last lecture

  • Evaluating a search engine

    • Benchmarks

    • Precision and recall

  • Results summaries

Recap: Unranked retrieval evaluation: Precision and Recall

  • Precision: fraction of retrieved docs that are relevant = P(relevant|retrieved)

  • Recall: fraction of relevant docs that are retrieved = P(retrieved|relevant)

  • Precision P = tp/(tp + fp)

  • Recall R = tp/(tp + fn)

                      Relevant      Nonrelevant
    Retrieved         tp            fp
    Not Retrieved     fn            tn

Recap: A combined measure: F

  • Combined measure that assesses precision/recall tradeoff is F measure (weighted harmonic mean):

    F = 1 / (α/P + (1 − α)/R) = (β² + 1)PR / (β²P + R),  where β² = (1 − α)/α

  • People usually use balanced F1 measure

    • i.e., with β = 1 or α = ½

  • Harmonic mean is a conservative average

    • See CJ van Rijsbergen, Information Retrieval

This lecture

  • Improving results

    • For high recall. E.g., searching for aircraft doesn’t match with plane; nor thermodynamic with heat

  • Options for improving results…

    • Global methods

      • Query expansion

        • Thesauri

        • Automatic thesaurus generation

    • Local methods

      • Relevance feedback

      • Pseudo relevance feedback

Relevance Feedback

  • Relevance feedback: user feedback on relevance of docs in initial set of results

    • User issues a (short, simple) query

    • The user marks some results as relevant or non-relevant.

    • The system computes a better representation of the information need based on feedback.

    • Relevance feedback can go through one or more iterations.

  • Idea: it may be difficult to formulate a good query when you don’t know the collection well, so iterate

Relevance feedback

  • We will use ad hoc retrieval to refer to regular retrieval without relevance feedback.

  • We now look at four examples of relevance feedback that highlight different aspects.

Similar pages

Relevance Feedback: Example

Results for Initial Query

Relevance Feedback

Results after Relevance Feedback

Ad hoc results for query canine           source: Fernando Diaz

Ad hoc results for query canine source: Fernando Diaz

User feedback: Select what is relevant                                    source: Fernando Diaz

Results after relevance feedback           source: Fernando Diaz

Initial query/results

  • Initial query: New space satellite applications

    • +     1. 0.539, 08/13/91, NASA Hasn’t Scrapped Imaging Spectrometer

    • +      2. 0.533, 07/09/91, NASA Scratches Environment Gear From Satellite Plan

    • 3. 0.528, 04/04/90, Science Panel Backs NASA Satellite Plan, But Urges Launches of Smaller Probes

    • 4. 0.526, 09/09/91, A NASA Satellite Project Accomplishes Incredible Feat: Staying Within Budget

    • 5. 0.525, 07/24/90, Scientist Who Exposed Global Warming Proposes Satellites for Climate Research

    • 6. 0.524, 08/22/90, Report Provides Support for the Critics Of Using Big Satellites to Study Climate

    • 7. 0.516, 04/13/87, Arianespace Receives Satellite Launch Pact From Telesat Canada

    • +      8. 0.509, 12/02/87, Telecommunications Tale of Two Companies

  • User then marks relevant documents with “+”.

Expanded query after relevance feedback

  • 2.074 new                                            15.106 space

  • 30.816 satellite                                     5.660 application

  • 5.991 nasa                                             5.196 eos

  • 4.196 launch                                          3.972 aster

  • 3.516 instrument                                   3.446 arianespace

  • 3.004 bundespost                                  2.806 ss

  • 2.790 rocket                                          2.053 scientist

  • 2.003 broadcast                                     1.172 earth

  • 0.836 oil                                                 0.646 measure

Results for expanded query

  • 2       1. 0.513, 07/09/91, NASA Scratches Environment Gear From Satellite Plan

  • 1        2. 0.500, 08/13/91, NASA Hasn’t Scrapped Imaging Spectrometer

  • 3. 0.493, 08/07/89, When the Pentagon Launches a Secret Satellite, Space Sleuths Do Some Spy Work of Their Own

  • 4. 0.493, 07/31/89, NASA Uses ‘Warm’ Superconductors For Fast Circuit

  • 8         5. 0.492, 12/02/87, Telecommunications Tale of Two Companies

  • 6. 0.491, 07/09/91, Soviets May Adapt Parts of SS-20 Missile For Commercial Use

  • 7. 0.490, 07/12/88, Gaping Gap: Pentagon Lags in Race To Match the Soviets In Rocket Launchers

  • 8. 0.490, 06/14/90, Rescue of Satellite By Space Agency To Cost $90 Million 

Key concept: Centroid

  • The centroid is the center of mass of a set of points

  • Recall that we represent documents as points in a high-dimensional space

  • Definition: Centroid

  • where C is a set of documents.

Rocchio Algorithm

  • The Rocchio algorithm uses the vector space model to pick a relevance feedback query

  • Rocchio seeks the query q_opt that maximizes sim(q, C_r) − sim(q, C_nr): the similarity to the set of relevant docs minus the similarity to the set of non-relevant docs

  • Tries to separate docs marked relevant and non-relevant

  • Problem: we don’t know the truly relevant docs

The Theoretically Best Query

x non-relevant documents

o relevant documents

Rocchio 1971 Algorithm (SMART)

  • Used in practice:

    q_m = α q_0 + β (1/|D_r|) Σ_{d_j ∈ D_r} d_j − γ (1/|D_nr|) Σ_{d_j ∈ D_nr} d_j

  • Dr = set of known relevant doc vectors

  • Dnr = set of known irrelevant doc vectors

    • Different from C_r and C_nr!

  • qm = modified query vector; q0 = original query vector; α,β,γ: weights (hand-chosen or set empirically)

  • New query moves toward relevant documents and away from irrelevant documents
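The Rocchio 1971 update can be sketched on plain dense vectors (a real system would use sparse term-weight vectors; the default weights here are just the common positive-heavy choice):

```python
def rocchio(q0, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.25):
    """q_m = alpha*q0 + beta*centroid(rel) - gamma*centroid(nonrel),
    on dense term-weight vectors. Negative weights are clipped to 0."""
    dims = len(q0)
    def centroid(docs):
        if not docs:
            return [0.0] * dims
        return [sum(d[i] for d in docs) / len(docs) for i in range(dims)]
    cr, cnr = centroid(rel_docs), centroid(nonrel_docs)
    qm = [alpha * q0[i] + beta * cr[i] - gamma * cnr[i] for i in range(dims)]
    return [max(0.0, w) for w in qm]          # ignore negative term weights

print(rocchio([1.0, 0.0], [[0.0, 1.0]], [[1.0, 0.0]]))  # -> [0.75, 0.75]
```

The toy call shows the query moving toward the relevant doc's term and away from the non-relevant doc's term.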

Subtleties to note

  • Tradeoff α vs. β/γ : If we have a lot of judged documents, we want a higher β/γ.

  • Some weights in query vector can go negative

    • Negative term weights are ignored (set to 0)

Relevance feedback on initial query

x known non-relevant documents

o known relevant documents

Relevance Feedback in vector spaces

  • We can modify the query based on relevance feedback and apply standard vector space model.

  • Use only the docs that were marked.

  • Relevance feedback can improve recall and precision

  • Relevance feedback is most useful for increasing recall in situations where recall is important

    • Users can be expected to review results and to take time to iterate

Positive vs Negative Feedback

  • Positive feedback is more valuable than negative feedback (so, set γ< β; e.g. γ= 0.25, β = 0.75).

  • Many systems only allow positive feedback (γ=0).


Aside: Vector Space can be Counterintuitive

[Figure: queries q1 “cholera” and q2 “john snow” in vector space, with other documents marked x]

High-dimensional Vector Spaces

  • The queries “cholera” and “john snow” are far from each other in vector space.

  • How can the document “John Snow and Cholera” be close to both of them?

  • Our intuitions for 2- and 3-dimensional space don't work in >10,000 dimensions.

  • 3 dimensions: If a document is close to many queries, then some of these queries must be close to each other.

  • Doesn't hold for a high-dimensional space.

Relevance Feedback: Assumptions

  • A1: User has sufficient knowledge for initial query.

  • A2: Relevance prototypes are “well-behaved”.

    • Term distribution in relevant documents will be similar

    • Term distribution in non-relevant documents will be different from those in relevant documents

      • Either: All relevant documents are tightly clustered around a single prototype.

      • Or: There are different prototypes, but they have significant vocabulary overlap.

      • Similarities between relevant and irrelevant documents are small

Violation of A1

  • User does not have sufficient initial knowledge.

  • Examples:

    • Misspellings (Brittany Speers).

    • Cross-language information retrieval (hígado).

    • Mismatch of searcher’s vocabulary vs. collection vocabulary

      • Cosmonaut/astronaut

Violation of A2

  • There are several relevance prototypes.

  • Examples:

    • Burma/Myanmar

    • Contradictory government policies

    • Pop stars that worked at Burger King

  • Often: instances of a general concept

  • Good editorial content can address problem

    • Report on contradictory government policies

Relevance Feedback: Problems

  • Long queries are inefficient for a typical IR engine.                           ⇐  Why?

    • Long response times for user.

    • High cost for retrieval system.

    • Partial solution:

      • Only reweight certain prominent terms

        • Perhaps top 20 by term frequency

  • Users are often reluctant to provide explicit feedback

  • It’s often harder to understand why a particular document was retrieved after applying relevance feedback

Evaluation of relevance feedback strategies

  • Use q0 and compute precision and recall graph

  • Use qm and compute precision recall graph

    • Assess on all documents in the collection

        • Spectacular improvements, but … it’s cheating!

        • Partly due to known relevant documents ranked higher

        • Must evaluate with respect to documents not seen by user

    • Use documents in residual collection (set of documents minus those assessed relevant)

        • Measures usually then lower than for original query

        • But a more realistic evaluation

        • Relative performance can be validly compared

  • Empirically, one round of relevance feedback is often very useful. Two rounds is sometimes marginally useful.

Evaluation of relevance feedback

  • Second method – assess only the docs not rated by the user in the first round

    • Could make relevance feedback look worse than it really is

    • Can still assess relative performance of algorithms

  • Most satisfactory – use two collections each with their own relevance assessments

    • q0 and user feedback from first collection

    • qm run on second collection and measured

Evaluation: Caveat

  • True evaluation of usefulness must compare to other methods taking the same amount of time.

  • Alternative to relevance feedback: User revises and resubmits query.

  • Users may prefer revision/resubmission to having to judge relevance of documents.

  • There is no clear evidence that relevance feedback is the “best use” of the user’s time.

Relevance Feedback on the Web

  • Some search engines offer a similar/related pages feature (this is a trivial form of relevance feedback)

    • Google (link-based)                                          ←  α/β/γ ??

    • Altavista

    • Stanford WebBase

  • But some don’t because it’s hard to explain to average user:

    • Alltheweb

    • bing

    • Yahoo

  • Excite initially had true relevance feedback, but abandoned it due to lack of use.

Excite Relevance Feedback

  • Spink et al. 2000

  • Only about 4% of query sessions from a user used relevance feedback option

    • Expressed as “More like this” link next to each result

  • But about 70% of users only looked at first page of results and didn’t pursue things further

    • So 4% is about 1/8 of people extending search

  • Relevance feedback improved results about 2/3 of the time

Pseudo relevance feedback

  • Pseudo-relevance feedback automates the “manual” part of true relevance feedback.

  • Pseudo-relevance algorithm:

    • Retrieve a ranked list of hits for the user’s query

    • Assume that the top k documents are relevant.

    • Do relevance feedback (e.g., Rocchio)

  • Works very well on average

  • But can go horribly wrong for some queries.

  • Several iterations can cause query drift.

  • Why?
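The pseudo-relevance loop can be sketched as a single positive-only Rocchio round (hypothetical interfaces: `search(q)` stands in for whatever the engine exposes and returns doc vectors in ranked order):

```python
def pseudo_relevance_feedback(q0, search, k=10, alpha=1.0, beta=0.75):
    """One round of pseudo relevance feedback: assume the top-k hits are
    relevant and move the query toward their centroid (gamma = 0)."""
    top_k = search(q0)[:k]
    if not top_k:
        return list(q0)
    dims = len(q0)
    centroid = [sum(d[i] for d in top_k) / len(top_k) for i in range(dims)]
    return [alpha * q0[i] + beta * centroid[i] for i in range(dims)]
```

Iterating this loop is exactly where query drift comes from: each round's query is built from the previous round's (possibly wrong) top hits.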

Query Expansion

  • In relevance feedback, users give additional input (relevant/non-relevant) on documents, which is used to reweight terms in the documents

  • In query expansion, users give additional input (good/bad search term) on words or phrases

Query assist

  • Would you expect such a feature to increase the query volume at a search engine?

How do we augment the user query?

  • Manual thesaurus

    • E.g. MedLine: physician, syn: doc, doctor, MD, medico

    • Can be query rather than just synonyms

  • Global Analysis: (static; of all documents in collection)

    • Automatically derived thesaurus

      • (co-occurrence statistics)

    • Refinements based on query log mining

      • Common on the web

  • Local Analysis: (dynamic)

    • Analysis of documents in result set

Thesaurus-based query expansion

  • For each term, t, in a query, expand the query with synonyms and related words of t from the thesaurus

    • feline → feline cat

  • May weight added terms less than original query terms.

  • Generally increases recall

  • Widely used in many science/engineering fields

  • May significantly decrease precision, particularly with ambiguous terms.

    • “interest rate” → “interest rate fascinate evaluate”

  • There is a high cost of manually producing a thesaurus

    • And for updating it for scientific changes

Automatic Thesaurus Generation

  • Attempt to generate a thesaurus automatically by analyzing the collection of documents

  • Fundamental notion: similarity between two words

  • Definition 1: Two words are similar if they co-occur with similar words.

  • Definition 2: Two words are similar if they occur in a given grammatical relation with the same words.

  • You can harvest, peel, eat, prepare, etc. apples and pears, so apples and pears must be similar.

  • Co-occurrence based is more robust, grammatical relations are more accurate.  



Co-occurrence Thesaurus

  • Simplest way to compute one is based on term-term similarities in C = AAᵀ, where A is the term-document matrix.

  • wi,j = (normalized) weight for (ti ,dj)

  • For each ti, pick terms with high values in C
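The C = AAᵀ computation can be sketched without any matrix library (the tiny matrix A is made up; in practice A would hold normalized tf-idf weights and be sparse):

```python
def cooccurrence_similarities(A):
    """A: term-document matrix as a list of rows (one row of weights per term).
    Returns C = A·Aᵀ; entry C[i][j] is the co-occurrence similarity of
    terms i and j (dot product of their document-weight rows)."""
    n = len(A)
    return [[sum(a * b for a, b in zip(A[i], A[j])) for j in range(n)]
            for i in range(n)]

# Terms 0 and 1 occur in the same documents, so C[0][1] is high; term 2 doesn't.
A = [[1, 1, 0],
     [1, 1, 0],
     [0, 0, 1]]
C = cooccurrence_similarities(A)
print(C[0][1], C[0][2])  # 2 0
```

For each term t_i one would then pick the terms j with the highest off-diagonal values C[i][j].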

Example of manual thesaurus

Automatic Thesaurus Generation

Automatic Thesaurus Generation

  • Discussion

  • Quality of associations is usually a problem.

  • Term ambiguity may introduce irrelevant statistically correlated terms.

    • “Apple computer” → “Apple red fruit computer”

  • Problems:

    • False positives: Words deemed similar that are not

    • False negatives: Words deemed dissimilar that are similar

  • Since terms are highly correlated anyway, expansion may not retrieve many additional documents.

Indirect relevance feedback

  • On the web, DirectHit introduced a form of indirect relevance feedback.

  • DirectHit ranked documents higher that users look at more often.

    • Clicked on links are assumed likely to be relevant

      • Assuming the displayed summaries are good, etc.

  • Globally: Not necessarily user or query specific.

    • This is the general area of clickstream mining

  • Today – handled as part of machine-learned ranking


  • IIR Ch 9

  • MG Ch. 4.7

  • MIR Ch. 5.2 – 5.4


Text Retrieval and Mining
Winter 2005

Lecture 12

What is XML?

  • eXtensible Markup Language

  • A framework for defining markup languages

  • No fixed collection of markup tags

  • Each XML language targeted for application

  • All XML languages share features

  • Enables building of generic tools

Basic Structure

  • An XML document is an ordered, labeled tree

  • character data leaf nodes contain the actual data (text strings)

  • element nodes, are each labeled with

    • a name (often called the element type), and

    • a set of attributes, each consisting of a name and a value,

    • can have child nodes

XML Example

XML Example

<chapter id="cmds">
   <chaptitle>FileCab</chaptitle>
   <para>This chapter describes the commands that manage
   the <tm>FileCab</tm>inet application.</para>
</chapter>
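As a quick check that such a fragment really is an ordered, labeled tree, here is a short sketch using Python's standard xml.etree (the fragment is embedded as a string):

```python
import xml.etree.ElementTree as ET

# Element nodes carry a name and attributes; character data sits in leaves.
doc = """<chapter id="cmds">
  <chaptitle>FileCab</chaptitle>
  <para>This chapter describes the commands that manage
  the <tm>FileCab</tm>inet application.</para>
</chapter>"""

root = ET.fromstring(doc)
print(root.tag, root.attrib["id"])   # chapter cmds
print(root.find("chaptitle").text)   # FileCab
```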


  • Elements are denoted by markup tags

    <foo attr1="value">thetext</foo>

  • Element start tag: <foo attr1="value">

  • Attribute: attr1

  • The character data: thetext

  • Matching element end tag: </foo>


  • HTML is a markup language for a specific purpose (display in browsers)

  • XML is a framework for defining markup languages

  • HTML can be formalized as an XML language (XHTML)

  • XML defines logical structure only

  • HTML: same intention, but has evolved into a presentation language

XML: Design Goals

  • Separate syntax from semantics to provide a common framework for structuring information

  • Allow tailor-made markup for any imaginable application domain

  • Support internationalization (Unicode) and platform independence

  • Be the future of (semi)structured information (do some of the work now done by databases)

Why Use XML?

  • Represent semi-structured data (data that are structured, but don’t fit relational model)

  • XML is more flexible than DBs

  • XML is more structured than simple IR

  • You get a massive infrastructure for free

Applications of XML


  • CML – chemical markup language

  • WML – wireless markup language

  • ThML – theological markup language

    • Having a Humble Opinion of Self

      EVERY man naturally desires knowledge

      Aristotle, Metaphysics, i. 1.

      ; but what good is knowledge without fear of God? Indeed a humble rustic who serves God is better than a proud intellectual who neglects his soul to study the course of the stars.

      Augustine, Confessions V. 4.

XML Schemas

  • Schema = syntax definition of XML language

  • Schema language = formal language for expressing XML schemas

  • Examples

    • Document Type Definition

    • XML Schema (W3C)

  • Relevance for XML IR

    • Our job is much easier if we have a (one) schema

XML Tutorial

  • (Anders Møller and Michael Schwartzbach)

  • Previous (and some following) slides are based on their tutorial

XML Indexing and Search

Native XML Database

  • Uses XML document as logical unit

  • Should support

    • Elements

    • Attributes

    • PCDATA (parsed character data)

    • Document order

  • Contrast with

    • DB modified for XML

    • Generic IR system modified for XML

XML Indexing and Search

  • Most native XML databases have taken a DB approach

    • Exact match

    • Evaluate path expressions

    • No IR type relevance ranking

  • Only a few that focus on relevance ranking

Data vs. Text-centric XML

  • Data-centric XML: used for messaging between enterprise applications

    • Mainly a recasting of relational data

  • Text-centric XML: used for annotating content

    • Rich in text

    • Demands good integration of text retrieval functionality

    • E.g., find me the ISBN #s of Books with at least three Chapters discussing cocoa production, ranked by Price

IR XML Challenge 1: Term Statistics

  • There is no document unit in XML

  • How do we compute tf and idf?

  • Global tf/idf over all text content is useless

  • Indexing granularity

IR XML Challenge 2: Fragments

  • IR systems don’t store content (only index)

  • Need to go to document for retrieving/displaying fragment

    • E.g., give me the Abstracts of Papers on existentialism

    • Where do you retrieve the Abstract from?

  • Easier in DB framework

IR XML Challenges 3: Schemas

  • Ideally:

    • There is one schema

    • User understands schema

  • In practice: rare

    • Many schemas

    • Schemas not known in advance

    • Schemas change

    • Users don’t understand schemas

  • Need to identify similar elements in different schemas

    • Example: employee

IR XML Challenges 4: UI

  • Help user find relevant nodes in schema

    • Author, editor, contributor, “from:”/sender

  • What is the query language you expose to the user?

    • Specific XML query language? No.

    • Forms? Parametric search?

    • A textbox?

  • In general: design layer between XML and user

IR XML Challenges 5: using a DB

  • Why you don’t want to use a DB

    • Spelling correction

    • Mid-word wildcards

    • Contains vs “is about”

    • DB has no notion of ordering

    • Relevance ranking

Querying XML

  • Today:

    • XQuery

    • XIRQL

  • Lecture 15

    • Vector space approaches


XQuery

  • SQL for XML

  • Usage scenarios

    • Human-readable documents

    • Data-oriented documents

    • Mixed documents (e.g., patient records)

  • Relies on

    • XPath

    • XML Schema datatypes

  • Turing complete

  • XQuery is still a working draft.


  • The principal forms of XQuery expressions are:

    • path expressions

    • element constructors

    • FLWR ("flower") expressions

    • list expressions

    • conditional expressions

    • quantified expressions

    • datatype expressions

  • Evaluated with respect to a context


  • FOR $p IN document("bib.xml")//publisher
    LET $b := document("bib.xml")//book[publisher = $p]
    WHERE count($b) > 100
    RETURN $p

  • FOR generates an ordered list of bindings of publisher names to $p

  • LET associates to each binding a further binding of the list of book elements with that publisher to $b

  • at this stage, we have an ordered list of tuples of bindings: ($p,$b)

  • WHERE filters that list to retain only the desired tuples

  • RETURN constructs for each tuple a resulting value

Queries Supported by XQuery

  • Location/position (“chapter no.3”)

  • Simple attribute/value

    • /play/title contains “hamlet”

  • Path queries

    • title contains “hamlet”

    • /play//title contains “hamlet”

  • Complex graphs

    • Employees with two managers

  • Subsumes: hyperlinks

  • What about relevance ranking?

How XQuery makes ranking difficult

  • All documents in set A must be ranked above all documents in set B.

  • Fragments must be ordered in depth-first, left-to-right order.

XQuery: Order By Clause

  • for $d in document("depts.xml")//deptno
    let $e := document("emps.xml")//emp[deptno = $d]
    where count($e) >= 10
    order by avg($e/salary) descending
    return { $d, {count($e)}, {avg($e/salary)} }

XQuery Order By Clause

  • Order by clause only allows ordering by “overt” criterion

    • Say by an attribute value

  • Relevance ranking

    • Is often proprietary

    • Can’t be expressed easily as function of set to be ranked

    • Is better abstracted out of query formulation (cf. www)


XIRQL

  • University of Dortmund

    • Goal: open source XML search engine

  • Motivation

    • “Returnable” fragments are special

      • E.g., don’t return an arbitrary marked-up span such as “some text”

    • Structured Document Retrieval Principle

    • Empower users who don’t know the schema

      • Enable search for any person no matter how schema encodes the data

      • Don’t worry about attribute/element

Atomic Units

  • Specified in schema

  • Only atomic units can be returned as result of search (unless unit specified)

  • Tf.idf weighting is applied to atomic units

  • Probabilistic combination of “evidence” from atomic units

XIRQL Indexing

Structured Document Retrieval Principle

  • A system should always retrieve the most specific part of a document answering a query.

  • Example query: “xql”

  • Document (term weights by element):

    • chapter: 0.3 XQL

    • section: 0.5 example

    • section: 0.8 XQL, 0.7 syntax

    • Return the section, not the chapter

Augmentation weights

  • Ensure that Structured Document Retrieval Principle is respected.

  • Assume different query conditions are disjoint events → independence.

  • P(chapter, XQL) = P(XQL|chapter) + P(section|chapter)·P(XQL|section) − P(XQL|chapter)·P(section|chapter)·P(XQL|section) = 0.3 + 0.6·0.8 − 0.3·0.6·0.8 = 0.636

  • Section ranked ahead of chapter
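As a sanity check, the combination above can be computed with a small script (the weights are the slide’s illustrative numbers, not values from a real XIRQL index):

```python
# Sketch of XIRQL-style augmentation weighting using the slide's numbers.
# Evidence for "XQL" in the chapter itself is combined with evidence
# propagated up from the nested section, down-weighted by the
# augmentation weight P(section | chapter) = 0.6.

def noisy_or(probs):
    """Combine independent evidence: P(A or B) = 1 - (1-P(A))(1-P(B))."""
    out = 1.0
    for p in probs:
        out *= 1.0 - p
    return 1.0 - out

p_xql_chapter = 0.3   # direct weight of "XQL" in the chapter text
p_xql_section = 0.8   # weight of "XQL" inside the nested section
augmentation = 0.6    # down-weighting for evidence crossing one level

p_chapter = noisy_or([p_xql_chapter, augmentation * p_xql_section])
print(round(p_chapter, 3))  # 0.636 < 0.8, so the section outranks the chapter
```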


Datatypes

  • Example: person_name

  • Assign all elements and attributes with person semantics to this datatype

  • Allow user to search for “person” without specifying path

XIRQL: Summary

  • Relevance ranking

  • Fragment/context selection

  • Datatypes (person_name)

  • Semantic relativism

    • Attribute/element

Data structures for XML retrieval

A very basic introduction.

Data structures for XML retrieval

  • What are the primitives we need?

  • Inverted index: give me all elements matching text query Q

    • We know how to do this – treat each element as a document

  • Give me all elements (immediately) below any instance of the Book element

  • Combination of the above

Parent/child links

  • Number each element

  • Maintain a list of parent-child relationships

    • E.g., Chapter:21  Book:8

    • Enables immediate parent

  • But what about “the word Hamlet under a Scene element under a Play element”?

General positional indexes

  • View the XML document as a text document

  • Build a positional index for each element

    • Mark the beginning and end for each element, e.g.,

  • Play → Doc:1(27) → Doc:1(2033) → …

  • /Play → Doc:1(1122) → Doc:1(5790) → …

  • Verse → Doc:1(431) → Doc:4(33) → …

  • /Verse → Doc:1(867) → Doc:4(92) → …

  • Term:droppeth → Doc:1(720) → …

Positional containment


  • Example: the posting droppeth at Doc:1(720) lies between Verse begin (431) and end (867), which in turn lie between Play begin (27) and end (1122): droppeth under Verse under Play.
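A minimal sketch of how such containment checks can be evaluated over begin/end positional postings. The positions are the illustrative ones from the slide; the assumption that element spans of the same type do not overlap is mine:

```python
# Positional containment over begin/end postings (toy data from the slide).
import bisect

# element -> sorted begin and end positions within document 1
begin = {"Play": [27], "Verse": [431]}
end = {"Play": [1122], "Verse": [867]}
term_positions = {"droppeth": [720]}

def contained_in(pos, element):
    """True if pos falls inside some begin/end span of `element`.

    Assumes spans of one element type don't overlap, so the nearest
    begin at or before pos pairs with the end at the same index."""
    i = bisect.bisect_right(begin[element], pos) - 1
    return i >= 0 and pos <= end[element][i]

hits = [p for p in term_positions["droppeth"]
        if contained_in(p, "Verse") and contained_in(p, "Play")]
print(hits)  # [720]: droppeth under Verse under Play
```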

Summary of data structures

  • Path containment etc. can essentially be solved by positional inverted indexes

  • Retrieval consists of “merging” postings

  • All the compression tricks etc. from 276A are still applicable

  • Complications arise from insertion/deletion of elements, text within elements

    • Beyond the scope of this course


Resources

  • XML resources at W3C

  • Norbert Fuhr and Kai Grossjohann. XIRQL: A query language for information retrieval in XML documents. In Proceedings of the 24th International ACM SIGIR Conference, New Orleans, Louisiana, September 2001.

  • P. O’Neil, E. O’Neil, S. Pal, I. Cseri, G. Schaller, and N. Westbury. 2004. ORDPATHs: Insert-Friendly XML Node Labels. Proc. ACM SIGMOD.



Text Retrieval and Mining

Lecture 10

Recap of the last lecture

  • Improving search results

    • Especially for high recall. E.g., searching for aircraft should also match plane; thermodynamic should match heat

  • Options for improving results…

    • Global methods

      • Query expansion

        • Thesauri

        • Automatic thesaurus generation

      • Global indirect relevance feedback

    • Local methods

      • Relevance feedback

      • Pseudo relevance feedback

Probabilistic relevance feedback

  • Rather than reweighting in a vector space…

  • If user has told us some relevant and some irrelevant documents, then we can proceed to build a probabilistic classifier, such as a Naive Bayes model:

    • P(tk|R) = |Drk| / |Dr|

    • P(tk|NR) = |Dnrk| / |Dnr|

      • tk is a term; Dr is the set of known relevant documents; Drk is the subset that contain tk; Dnr is the set of known irrelevant documents; Dnrk is the subset that contain tk.
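A minimal sketch of these estimates on invented feedback judgments (the document sets below are hypothetical, just to make the counts concrete):

```python
# Naive Bayes relevance-feedback estimates: each document is a set of terms.
relevant = [{"xml", "index"}, {"xml", "query"}, {"index", "query"}]   # D_r
nonrelevant = [{"sports"}, {"xml", "sports"}]                          # D_nr

def p_term(term, docs):
    """P(t_k | class) = fraction of the class's docs that contain t_k."""
    return sum(term in d for d in docs) / len(docs)

print(p_term("xml", relevant))     # |D_rk| / |D_r|   = 2/3
print(p_term("xml", nonrelevant))  # |D_nrk| / |D_nr| = 1/2
```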

Why probabilities in IR?

  • In traditional IR systems, matching between each document and query is attempted in a semantically imprecise space of index terms.

  • Probabilities provide a principled foundation for uncertain reasoning. Can we use probabilities to quantify our uncertainties?

Probabilistic IR topics

  • Classical probabilistic retrieval model

    • Probability ranking principle, etc.

  • (Naïve) Bayesian Text Categorization

  • Bayesian networks for text retrieval

  • Language model approach to IR

    • An important emphasis in recent work

  • Probabilistic methods are one of the oldest but also one of the currently hottest topics in IR.

    • Traditionally: neat ideas, but they’ve never won on performance. It may be different now.

The document ranking problem

  • We have a collection of documents

  • User issues a query

  • A list of documents needs to be returned

  • Ranking method is core of an IR system:

    • In what order do we present documents to the user?

    • We want the “best” document to be first, second best second, etc….

  • Idea: Rank by probability of relevance of the document w.r.t. information need

    • P(relevant|documenti, query)

Recall a few probability basics

  • For events a and b:
  • Bayes’ Rule
  • Odds:

Probability Ranking Principle

  • Let x be a document in the collection.

  • Let R represent relevance of a document w.r.t. given (fixed) query and let NR represent non-relevance (relevance is binary: each document is either R or NR)

  • Need to find p(R|x) - probability that a document is relevant.
  • p(x|R), p(x|NR) - probability that if a relevant (non-relevant) document is retrieved, it is x.

Probability Ranking Principle (PRP)

  • Simple case: no selection costs or other utility concerns that would differentially weight errors

  • Bayes’ Optimal Decision Rule

    • x is relevant iff p(R|x) > p(NR|x)

  • PRP in action: Rank all documents by p(R|x)

  • Theorem:

    • Using the PRP is optimal, in that it minimizes the loss (Bayes risk) under 1/0 loss

    • Provable if all probabilities correct, etc. [e.g., Ripley 1996]

Probability Ranking Principle

  • More complex case: retrieval costs.

    • Let d be a document

    • C - cost of retrieval of relevant document

    • C’ - cost of retrieval of non-relevant document

  • Probability Ranking Principle: if

    C·p(R|d) + C′·(1 − p(R|d))  ≤  C·p(R|d′) + C′·(1 − p(R|d′))

  • for all d′ not yet retrieved, then d is the next document to be retrieved

  • We won’t further consider loss/utility from now on

Probability Ranking Principle

  • How do we compute all those probabilities?

    • Do not know exact probabilities, have to use estimates

    • Binary Independence Retrieval (BIR) – which we discuss later today – is the simplest model

  • Questionable assumptions

    • “Relevance” of each document is independent of relevance of other documents.

      • Really, it’s bad to keep on returning duplicates

    • Boolean model of relevance

    • Assumes a single-step information need

      • Seeing a range of results might let user refine query

Probabilistic Retrieval Strategy

  • Estimate how terms contribute to relevance

    • How do things like tf, df, and length influence your judgments about document relevance?

      • One answer is the Okapi formulae (S. Robertson)

  • Combine to find document relevance probability

  • Order documents by decreasing probability

The Probability Ranking Principle

  • “If a reference retrieval system's response to each request is a ranking of the documents in the collection in order of decreasing probability of relevance to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data have been made available to the system for this purpose, the overall effectiveness of the system to its user will be the best that is obtainable on the basis of those data.”

      • [1960s/1970s] S. Robertson, W.S. Cooper, M.E. Maron; van Rijsbergen (1979:113); Manning & Schütze (1999:538)

Probabilistic Ranking

  • Basic concept:

  • "For a given query, if we know some documents that are relevant, terms that occur in those documents should be given greater weighting in searching for other relevant documents.

  • By making assumptions about the distribution of terms and applying Bayes Theorem, it is possible to derive weights theoretically."

  • Van Rijsbergen

Binary Independence Model

  • Traditionally used in conjunction with PRP

  • “Binary” = Boolean: documents are represented as binary incidence vectors of terms (cf. lecture 1):

  • xi=1  iff term i is present in document x.
  • “Independence”: terms occur in documents independently

  • Different documents can be modeled as same vector

  • Bernoulli Naive Bayes model (cf. text categorization!)

Binary Independence Model

  • Queries: binary term incidence vectors

  • Given query q,

    • for each document d need to compute p(R|q,d).

    • replace with computing p(R|q,x) where x is the binary term incidence vector representing d (we are interested only in ranking)

  • Will use odds and Bayes’ Rule:

Binary Independence Model

  • Using Independence Assumption:

  • So :

Binary Independence Model

  • Since xi is either 0 or 1:

  • Let pi = P(xi = 1 | R, q);  ri = P(xi = 1 | NR, q)

  • Assume, for all terms not occurring in the query (qi = 0), that pi = ri (this can be changed, e.g., in relevance feedback)

  • Then...

Binary Independence Model

Binary Independence Model

  • Retrieval Status Value:
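The formula on this slide is an image in the original deck; written out, the standard BIM retrieval status value consistent with the pi/ri definitions above is:

```latex
RSV = \log \prod_{x_i = q_i = 1} \frac{p_i (1 - r_i)}{r_i (1 - p_i)}
    = \sum_{x_i = q_i = 1} c_i,
\qquad
c_i = \log \frac{p_i (1 - r_i)}{r_i (1 - p_i)}
```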

Binary Independence Model

  • All boils down to computing RSV.

  • So, how do we compute ci’s from our data ?

Binary Independence Model

  • Estimating RSV coefficients.

  • For each term i look at this table of document counts:

Estimation – key challenge

  • If non-relevant documents are approximated by the whole collection, then ri (prob. of occurrence in non-relevant documents for query) is n/N and

    • log [(1 − ri)/ri] = log [(N − n)/n] ≈ log (N/n) = IDF!

  • pi (probability of occurrence in relevant documents) can be estimated in various ways:

    • from relevant documents if know some

      • Relevance weighting can be used in feedback loop

    • constant (Croft and Harper combination match) – then just get idf weighting of terms

    • proportional to prob. of occurrence in collection

      • more accurately, to log of this (Greiff, SIGIR 1998) 
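The approximation can be checked numerically; the collection numbers below are invented for illustration:

```python
# BIM term weight c_i when non-relevant docs are approximated by the whole
# collection (r_i = n/N) and p_i is held constant at 0.5 (Croft-Harper).
import math

N = 1_000_000   # collection size (made up)
n = 1_000       # docs containing the term (made up)

r_i = n / N
p_i = 0.5       # constant assumption: the p_i part of c_i becomes log(1) = 0

c_i = math.log(p_i / (1 - p_i)) + math.log((1 - r_i) / r_i)
idf = math.log(N / n)

# With p_i = 0.5 the first term vanishes and c_i = log((N - n)/n) ~ idf
print(round(c_i, 3), round(idf, 3))
```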

Iteratively estimating pi

  1. Assume that pi is constant over all xi in the query

      • pi = 0.5 (even odds) for any given doc

  2. Determine guess of relevant document set:

      • V is fixed size set of highest ranked documents on this model (note: now a bit like tf.idf!)

  3. We need to improve our guesses for pi and ri, so

      • Use distribution of xi in docs in V. Let Vi be set of documents containing xi

        • pi = |Vi| / |V|

      • Assume if not retrieved then not relevant

        • ri = (ni – |Vi|) / (N – |V|)

  4. Go to 2. until converges then return ranking
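The four steps above can be sketched as follows; the six-document toy collection, the cutoff |V| = 2, and the +0.5 smoothing (to avoid zero probabilities) are my assumptions for illustration:

```python
# Toy sketch of the iterative p_i / r_i estimation loop.
import math

docs = [{"a", "b"}, {"a", "b", "c"}, {"c"}, {"c", "d"}, {"d"}, {"b", "d"}]
query = {"a", "b"}
N = len(docs)
n = {t: sum(t in d for d in docs) for t in query}   # doc frequency per term

p = {t: 0.5 for t in query}          # step 1: even odds
r = {t: n[t] / N for t in query}     # start from collection statistics

for _ in range(10):                  # step 4: iterate until (here) fixed count
    def score(d):                    # BIM-style score: sum of c_i over matches
        return sum(math.log(p[t] * (1 - r[t]) / (r[t] * (1 - p[t])))
                   for t in query if t in d)
    # step 2: V = fixed-size set of highest-ranked documents
    V = sorted(range(N), key=lambda i: score(docs[i]), reverse=True)[:2]
    Vdocs = [docs[i] for i in V]
    for t in query:                  # step 3: re-estimate from V
        Vi = sum(t in d for d in Vdocs)
        p[t] = (Vi + 0.5) / (len(V) + 1)              # smoothed |Vi|/|V|
        r[t] = (n[t] - Vi + 0.5) / (N - len(V) + 1)   # not retrieved => not relevant

print(sorted(V))  # [0, 1]: the two docs containing both query terms
```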

Probabilistic Relevance Feedback

  1. Guess a preliminary probabilistic description of R and use it to retrieve a first set of documents V, as above.
  2. Interact with the user to refine the description: learn some definite members of R and NR

  3. Reestimate pi and ri on the basis of these
      • Or can combine new information with original guess (use Bayesian prior):
  4. Repeat, thus generating a succession of approximations to R


  • Getting reasonable approximations of probabilities is possible.

  • Requires restrictive assumptions:

    • term independence

    • terms not in query don’t affect the outcome

    • boolean representation of documents/queries/relevance

    • document relevance values are independent

  • Some of these assumptions can be removed

  • Problem: either require partial relevance information or only can derive somewhat inferior term weights

Removing term independence

  • In general, index terms aren’t independent

  • Dependencies can be complex

  • van Rijsbergen (1979) proposed model of simple tree dependencies

    • Exactly Friedman and Goldszmidt’s Tree Augmented Naive Bayes (AAAI 13, 1996)

  • Each term dependent on one other

  • In 1970s, estimation problems held back success of this model

Food for thought

  • Think through the differences between standard tf.idf and the probabilistic retrieval model in the first iteration

  • Think through the differences between vector space (pseudo) relevance feedback and probabilistic (pseudo) relevance feedback

Good and Bad News

  • Standard Vector Space Model

    • Empirical for the most part; success measured by results

    • Few properties provable

  • Probabilistic Model Advantages

    • Based on a firm theoretical foundation

    • Theoretically justified optimal ranking scheme

  • Disadvantages

    • Making the initial guess to get V

    • Binary word-in-doc weights (not using term frequencies)

    • Independence of terms (can be alleviated)

    • Amount of computation

    • Has never worked convincingly better in practice

Bayesian Networks for Text Retrieval (Turtle and Croft 1990)

  • Standard probabilistic model assumes you can’t estimate P(R|D,Q)

    • Instead assume independence and use P(D|R)

  • But maybe you can with a Bayesian network*

  • What is a Bayesian network?

    • A directed acyclic graph

    • Nodes

      • Events or Variables

        • Assume values.

        • For our purposes, all Boolean

    • Links

      • model direct dependencies between nodes

Bayesian Networks

  •  a,b,c - propositions (events). Bayesian networks model causal relations between events

  • Inference in Bayesian Nets:

    • Given probability distributions for roots and conditional probabilities, we can compute the a priori probability of any instance

    • Fixing assumptions (e.g., observing a node’s value) will cause recomputation of probabilities

  • For more information see:

  • R.G. Cowell, A.P. Dawid, S.L. Lauritzen, and D.J. Spiegelhalter. 1999. Probabilistic Networks and Expert Systems. Springer Verlag.

  • J. Pearl. 1988. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan-Kaufman.

Toy Example

Independence Assumptions

  • Independence assumption:  P(t | g, f) = P(t | g)

  • Joint probability:

    P(f, d, n, g, t) = P(f) P(d) P(n | f) P(g | f, d) P(t | g)
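With invented CPT numbers (the probabilities below are not from the slide), the factorization can be checked to define a proper joint distribution:

```python
# Toy Bayesian network f,d -> g -> t and f -> n, with made-up CPTs,
# scoring the joint P(f,d,n,g,t) = P(f) P(d) P(n|f) P(g|f,d) P(t|g).

p_f = 0.4
p_d = 0.5
p_n_given_f = {True: 0.9, False: 0.2}                         # P(n=1 | f)
p_g_given_fd = {(True, True): 0.95, (True, False): 0.6,
                (False, True): 0.7, (False, False): 0.1}      # P(g=1 | f,d)
p_t_given_g = {True: 0.8, False: 0.3}                         # P(t=1 | g)

def bern(p, value):
    """P(X = value) for a Bernoulli variable with P(X = 1) = p."""
    return p if value else 1.0 - p

def joint(f, d, n, g, t):
    return (bern(p_f, f) * bern(p_d, d) * bern(p_n_given_f[f], n)
            * bern(p_g_given_fd[(f, d)], g) * bern(p_t_given_g[g], t))

# Summing over all 2^5 assignments must give 1, as for any proper joint.
total = sum(joint(f, d, n, g, t)
            for f in (True, False) for d in (True, False)
            for n in (True, False) for g in (True, False)
            for t in (True, False))
print(round(total, 10))  # 1.0
```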

Chained inference

  • Evidence - a node takes on some value

  • Inference

    • Compute belief (probabilities) of other nodes

      • conditioned on the known evidence

    • Two kinds of inference: Diagnostic and Predictive

  • Computational complexity

    • General network: NP-hard

      • Tree-like networks are easily tractable

      • Much other work on efficient exact and approximate Bayesian network inference

        • Clever dynamic programming

        • Approximate inference (“loopy belief propagation”)

Model for Text Retrieval

  • Goal

    • Given a user’s information need (evidence), find probability a doc satisfies need

  • Retrieval model

    • Model docs in a document network

    • Model information need in a query network

Bayesian Nets for IR: Idea

Bayesian Nets for IR

  • Construct Document Network (once !)

  • For each query

    • Construct best Query Network

    • Attach it to Document Network

    • Find subset of di’s which maximizes the probability value of node I (best subset).

    • Retrieve these di’s as the answer to query.

Bayesian nets for text retrieval

Example: “reason trouble -two”

Link matrices and probabilities

  • Prior doc probability P(d) = 1/n

  • P(r|d)

    • within-document term frequency

    • tf x idf - based

  • P(c|r)

    • 1-to-1

    • thesaurus

  • P(q|c): canonical forms of query operators

    • Always use things like AND and NOT – never store a full CPT*

  • *conditional probability table


  • Prior probs don’t have to be 1/n.

  • “User information need” doesn’t have to be a query - can be words typed, in docs read, any combination …

  • Phrases, inter-document links

  • Link matrices can be modified over time.

    • User feedback.

    • The promise of “personalization”

Computational details

  • Document network built at indexing time

  • Query network built/scored at query time

  • Representation:

    • Link matrices from docs to any single term are like the postings entry for that term

    • Canonical link matrices are efficient to store and compute

  • Attach evidence only at roots of network

    • Can do single pass from roots to leaves

Bayes Nets in IR

  • Flexible ways of combining term weights, which can generalize previous approaches

    • Boolean model

    • Binary independence model

    • Probabilistic models with weaker assumptions

  • Efficient large-scale implementation

    • InQuery text retrieval system from U Mass

      • Turtle and Croft (1990) [Commercial version defunct?]

  • Need approximations to avoid intractable inference

  • Need to estimate all the probabilities by some means (whether more or less ad hoc)

  • Much new Bayes net technology yet to be applied?


Resources

  • S. E. Robertson and K. Spärck Jones. 1976. Relevance Weighting of Search Terms. Journal of the American Society for Information Sciences 27(3): 129–146.

  • C. J. van Rijsbergen. 1979. Information Retrieval. 2nd ed. London: Butterworths, chapter 6. [Most details of math]

  • N. Fuhr. 1992. Probabilistic Models in Information Retrieval. The Computer Journal, 35(3),243–255. [Easiest read, with BNs]

  • F. Crestani, M. Lalmas, C. J. van Rijsbergen, and I. Campbell. 1998. Is This Document Relevant? ... Probably: A Survey of Probabilistic Models in Information Retrieval. ACM Computing Surveys 30(4): 528–552. [Adds very little material that isn’t in van Rijsbergen or Fuhr]


  • H. R. Turtle and W. B. Croft. 1990. Inference Networks for Document Retrieval. Proc. ACM SIGIR: 1–24.

  • E. Charniak. 1991. Bayesian Networks without Tears. AI Magazine 12(4): 50–63.

  • D. Heckerman. 1995. A Tutorial on Learning with Bayesian Networks. Microsoft Technical Report MSR-TR-95-06.


  • N. Fuhr. 2000. Probabilistic Datalog: Implementing Logical Information Retrieval for Advanced Applications. Journal of the American Society for Information Science 51(2): 95–110.

  • R. K. Belew. 2001. Finding Out About: A Cognitive Perspective on Search Engine Technology and the WWW. Cambridge UP 2001.

  • MIR 2.5.4, 2.8


Text Retrieval and Mining

Lecture 12

[Borrows slides from Viktor Lavrenko and Chengxiang Zhai]


  • Probabilistic models:
    Naïve Bayes Text Classification

    • Introduction to Text Classification

    • Probabilistic Language Models

    • Naïve Bayes text categorization


  • The Language Model Approach to IR

    • Basic query generation model

    • Alternative models

Standard Probabilistic IR

IR based on Language Model (LM)

  • A common search heuristic is to use words that you expect to find in matching documents as your query – why, I saw Sergey Brin advocating that strategy on late night TV one night in my hotel room, so it must be good!
  • The LM approach directly exploits that idea!

Formal Language (Model)

  • Traditional generative model: generates strings

    • Finite state machines or regular grammars, etc.

  • Example:

 I wish
 I wish I wish
 I wish I wish I wish
 I wish I wish I wish I wish
 *wish I wish   (the * marks a string this model cannot generate)

Stochastic Language Models

  • Models probability of generating strings in the language (commonly all strings over alphabet Σ)

Stochastic Language Models

  • Model probability of generating any string

Stochastic Language Models

  • A statistical model for generating text

    • Probability distribution over strings in a given language

Unigram and higher-order models

  • Unigram Language Models (Easy, effective!)
  • Bigram (generally, n-gram) Language Models
  • Other Language Models
  • Grammar-based models (PCFGs), etc.
  • Probably not the first thing to try in IR
  • Using Language Models in IR

    • Treat each document as the basis for a model (e.g., unigram sufficient statistics)

    • Rank document d based on P(d | q)

    • P(d | q) = P(q | d) × P(d) / P(q)

      • P(q) is the same for all documents, so ignore

      • P(d) [the prior] is often treated as the same for all d

        • But we could use criteria like authority, length, genre

      • P(q | d) is the probability of q given d’s model

    • Very general formal approach

    The fundamental problem of LMs

    • Usually we don’t know the model M

      • But have a sample of text representative of that model

    • Estimate a language model from a sample

    • Then compute the observation probability

    Language Models for IR

    • Language Modeling Approaches

      • Attempt to model query generation process

      • Documents are ranked by the probability that a query would be observed as a random sample from the respective document model

        • Multinomial approach

    Retrieval based on probabilistic LM

    • Treat the generation of queries as a random process.

    • Approach

      • Infer a language model for each document.

      • Estimate the probability of generating the query according to each of these models.

      • Rank the documents according to these probabilities.

      • Usually a unigram estimate of words is used

        • Some work on bigrams, paralleling van Rijsbergen

    Retrieval based on probabilistic LM

    • Intuition

      • Users …

        • Have a reasonable idea of terms that are likely to occur in documents of interest.

        • They will choose query terms that distinguish these documents from others in the collection.

    • Collection statistics …

      • Are integral parts of the language model.

      • Are not used heuristically as in many other approaches.

        • In theory. In practice, there’s usually some wiggle room for empirically set parameters

    Query generation probability (1)

    • Ranking formula

    • The probability of producing the query given the language model of document d using MLE is:

    Insufficient data

    • Zero probability   p(t | Md) = 0

      • May not wish to assign a probability of zero to a document that is missing one or more of the query terms [gives conjunction semantics]

    • General approach

      • A non-occurring term is possible, but no more likely than would be expected by chance in the collection.

      • If a query term does not occur in the document, estimate p(t | Md) from the term’s frequency in the whole collection

    Insufficient data

    • Zero probabilities spell disaster

      • We need to smooth probabilities

        • Discount nonzero probabilities

        • Give some probability mass to unseen things

    • There’s a wide space of approaches to smoothing probability distributions to deal with this problem, such as adding 1, ½, or ε to counts, Dirichlet priors, discounting, and interpolation

      • [See FSNLP ch. 6 or CS224N if you want more]

    • A simple idea that works well in practice is to use a mixture between the document multinomial and the collection multinomial distribution

    Mixture model

    • P(w | d) = λ·Pmle(w | Md) + (1 − λ)·Pmle(w | Mc)

    • Mixes the probability from the document with the general collection frequency of the word.

    • Correctly setting λ is very important

    • A high value of λ makes the search “conjunctive-like” – suitable for short queries

    • A low value is more suitable for long queries

    • Can tune λ to optimize performance

      • Perhaps make it dependent on document size (cf. Dirichlet prior or Witten-Bell smoothing)

    Basic mixture model summary

    • General formulation of the LM for IR

      • The user has a document in mind, and generates the query from this document.

      • The equation represents the probability that the document that the user had in mind was in fact this one.

    • general language model

    • individual-document model


    Example

    • Document collection (2 documents)

      • d1: Xerox reports a profit but revenue is down

      • d2: Lucent narrows quarter loss but revenue decreases further

    • Model: MLE unigram from documents; λ = ½

    • Query: revenue down

      • P(Q|d1) = [(1/8 + 2/16)/2] x [(1/8 + 1/16)/2]

      • = 1/8 x 3/32 = 3/256

      • P(Q|d2) = [(1/8 + 2/16)/2] x [(0 + 1/16)/2]

      • = 1/8 x 1/32 = 1/256

    • Ranking: d1 > d2
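The worked example can be reproduced directly; this is a sketch of the λ-mixture scoring over the two-document toy collection, not a production ranker:

```python
# Mixture-model query likelihood: P(q|d) = prod_w [ lam*Pmle(w|Md) + (1-lam)*Pmle(w|Mc) ]
lam = 0.5
d1 = "xerox reports a profit but revenue is down".split()
d2 = "lucent narrows quarter loss but revenue decreases further".split()
collection = d1 + d2            # MLE collection model over both documents
query = ["revenue", "down"]

def p_mle(w, text):
    """Maximum-likelihood unigram estimate: relative frequency of w."""
    return text.count(w) / len(text)

def score(query, doc):
    s = 1.0
    for w in query:
        s *= lam * p_mle(w, doc) + (1 - lam) * p_mle(w, collection)
    return s

print(score(query, d1), score(query, d2))  # 3/256 vs 1/256 -> d1 ranked first
```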

    Ponte and Croft Experiments

    • Data

      • TREC topics 202-250 on TREC disks 2 and 3

        • Natural language queries consisting of one sentence each

      • TREC topics 51-100 on TREC disk 3 using the concept fields

        • Lists of good terms

    Topic: Satellite Launch Contracts

    • Description: …

    • Concept(s):

      1. Contract, agreement

      2. Launch vehicle, rocket, payload, satellite

      3. Launch services, …

    Precision/recall results 202-250

    Precision/recall results 51-100

    LM vs. Prob. Model for IR

    • The main difference is whether “Relevance” figures explicitly in the model or not

      • LM approach attempts to do away with modeling relevance

    • LM approach assumes that documents and expressions of information problems are of the same type

    • Computationally tractable, intuitively appealing

    Problems of basic LM approach

    • Assumption of equivalence between document and information problem representation is unrealistic

    • Very simple models of language

    • Relevance feedback is difficult to integrate, as are user preferences, and other general issues of relevance

    • Can’t easily accommodate phrases, passages, Boolean operators

    • Current extensions focus on putting relevance back into the model, etc.
Model for IR</p></LI></UL></div> </div> </div></div> <div _tmplitem="15" class="slide-metadata"> <hr _tmplitem="15" > <small _tmplitem="15" >Speaker Notes:</small> <div _tmplitem="15" class="slide-note" id="slide_note_tree-273-slide-4246-26"> <p _tmplitem="15" >« Click to add note »</p> </div> <small _tmplitem="15" >Created by <a _tmplitem="15" href="user/29">Tgbyrdmc</a>.</small> <br _tmplitem="15" / ><br _tmplitem="15" /> </div> </div> <div _tmplitem="15" class="slide " id="tree-273-slide-4247-27-view"> <!-- <span _tmplitem="15" onclick="filterLatex(" id="4247"> Click me! </span> --> <div _tmplitem="15" class="slide-content"><div _tmplitem="15" class="slide-scaler"> <div _tmplitem="15" class="slide-header"> <h2 _tmplitem="15" > <div _tmplitem="15" class="slide-title" id="slide_title_tree-273-slide-4247-27"> Extension: 3-level model </div> </h2> </div> <div _tmplitem="15" class="slide-body" id="slide_body_tree-273-slide-4247-27" onclick="enableWYSISWYG(this)"> <div _tmplitem="15" ><UL _tmplitem="15" ><LI _tmplitem="15" ><p _tmplitem="15" ></p></LI></UL></div><div _tmplitem="15" ><UL _tmplitem="15" ><LI _tmplitem="15" ><p _tmplitem="15" >3-level model</p></LI></UL><OL _tmplitem="15" ><LI _tmplitem="15" ><p _tmplitem="15" >Whole collection model ( )</p><LI _tmplitem="15" ><p _tmplitem="15" >Specific-topic model; relevant-documents model ( )</p><LI _tmplitem="15" ><p _tmplitem="15" >Individual-document model ( )</p><UL _tmplitem="15" ><LI _tmplitem="15" ><p _tmplitem="15" >Relevance hypothesis</p></LI></UL><UL _tmplitem="15" ><UL _tmplitem="15" ><LI _tmplitem="15" ><p _tmplitem="15" >A request(query; topic) is generated from a specific-topic model { , }.</p></LI></UL></UL><UL _tmplitem="15" ><UL _tmplitem="15" ><LI _tmplitem="15" ><p _tmplitem="15" >Iff a document is relevant to the topic, the same model will apply to the document.</p></LI></UL></UL><UL _tmplitem="15" ><UL _tmplitem="15" ><UL _tmplitem="15" ><LI _tmplitem="15" ><p _tmplitem="15" >It will replace part 
of the individual-document model in explaining the document.</p></LI></UL></UL></UL><UL _tmplitem="15" ><UL _tmplitem="15" ><LI _tmplitem="15" ><p _tmplitem="15" >The probability of relevance of a document</p></LI></UL></UL><UL _tmplitem="15" ><UL _tmplitem="15" ><UL _tmplitem="15" ><LI _tmplitem="15" ><p _tmplitem="15" >The probability that this model explains part of the document</p></LI></UL></UL></UL><UL _tmplitem="15" ><UL _tmplitem="15" ><UL _tmplitem="15" ><LI _tmplitem="15" ><p _tmplitem="15" >The probability that the { , , } combination is better than the { , } combination</p></LI></UL></UL></UL></div><div _tmplitem="15" ><img _tmplitem="15" src=""/></div><div _tmplitem="15" ><img _tmplitem="15" src=""/></div><div _tmplitem="15" ><img _tmplitem="15" src=""/></div><div _tmplitem="15" ><img _tmplitem="15" src=""/></div><div _tmplitem="15" ><img _tmplitem="15" src=""/></div><div _tmplitem="15" ><img _tmplitem="15" src=""/></div><div _tmplitem="15" ><img _tmplitem="15" src=""/></div><div _tmplitem="15" ><img _tmplitem="15" src=""/></div><div _tmplitem="15" ><img _tmplitem="15" src=""/></div><div _tmplitem="15" ><img _tmplitem="15" src=""/></div> </div> </div></div> <div _tmplitem="15" class="slide-metadata"> <hr _tmplitem="15" > <small _tmplitem="15" >Speaker Notes:</small> <div _tmplitem="15" class="slide-note" id="slide_note_tree-273-slide-4247-27"> <p _tmplitem="15" >« Click to add note »</p> </div> <small _tmplitem="15" >Created by <a _tmplitem="15" href="user/29">Tgbyrdmc</a>.</small> <br _tmplitem="15" / ><br _tmplitem="15" /> </div> </div>

    Precision/recall results 202-250

    Precision/recall results 51-100

    The main difference is whether “Relevance” figures explicitly in the model or not

    • LM approach attempts to do away with modeling relevance

    • LM approach assumes that documents and expressions of information problems are of the same type

    • Computationally tractable, intuitively appealing

    • LM vs. Prob. Model for IR

    Problems of basic LM approach

    • Assumption of equivalence between document and information problem representation is unrealistic

      • Very simple models of language

      • Relevance feedback is difficult to integrate, as are user preferences, and other general issues of relevance

      • Can’t easily accommodate phrases, passages, Boolean operators

    • Current extensions focus on putting relevance back into the model, etc.

    • LM vs. Prob. Model for IR

    Extension: 3-level model

    • 3-level model

    1. Whole collection model (Mc)

    2. Specific-topic model; relevant-documents model (Mt)

    3. Individual-document model (Md)

      • Relevance hypothesis

        • A request(query; topic) is generated from a specific-topic model { Mc, Mt}.

        • Iff a document is relevant to the topic, the same model will apply to the document.

          • It will replace part of the individual-document model in explaining the document.

        • The probability of relevance of a document

          • The probability that this model explains part of the document

          • The probability that the {Mc ,Mt ,Md } combination is better than the {Mc ,Md } combination

    3-level model

    Alternative Models of Text Generation

    Retrieval Using Language Models

    • Retrieval: Query likelihood (1), Document likelihood (2), Model comparison (3)

    Query Likelihood

    • P(Q|Dm)

    • Major issue is estimating document model

      • i.e. smoothing techniques instead of tf.idf weights

    • Good retrieval results

      • e.g. UMass, BBN, Twente, CMU

    • Problems dealing with relevance feedback, query expansion, structured queries
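    The query-likelihood idea above can be sketched in a few lines. This is a toy illustration, not the exact method of any cited system: it uses Jelinek–Mercer interpolation with the collection model as one concrete choice of the smoothing the slides mention, on made-up documents.

    ```python
    import math
    from collections import Counter

    def query_likelihood(query, doc, collection, lam=0.5):
        """Log query likelihood log P(Q|Md) under a unigram document model,
        smoothed with the collection model (Jelinek-Mercer):
        P(w|Md) = (1 - lam) * tf(w,d)/|d| + lam * cf(w)/|C|."""
        d, c = Counter(doc), Counter(collection)
        dlen, clen = sum(d.values()), sum(c.values())
        score = 0.0
        for w in query:
            p = (1 - lam) * d[w] / dlen + lam * c[w] / clen
            score += math.log(p)  # sum of logs avoids underflow
        return score

    # Rank two toy documents for the query ["cat"]: the document that
    # contains "cat" scores higher, while the other still gets non-zero
    # probability thanks to smoothing (no zero-probability knockout).
    d1 = ["the", "cat", "sat"]
    d2 = ["the", "dog", "ran"]
    collection = d1 + d2
    ranking = sorted([("d1", d1), ("d2", d2)],
                     key=lambda x: query_likelihood(["cat"], x[1], collection),
                     reverse=True)
    ```

    The key estimation issue the slide names is visible here: without the collection-model term, any document missing one query word would get probability zero.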

    Document Likelihood

    • Rank by likelihood ratio P(D|R)/P(D|NR)

      • treat as a generation problem

      • P(w|R) is estimated by P(w|Qm)

      • Qm is the query or relevance model

      • P(w|NR) is estimated by collection probabilities P(w)

    • Issue is estimation of query model

      • Treat query as generated by mixture of topic and background

      • Estimate relevance model from related documents (query expansion)

      • Relevance feedback is easily incorporated

    • Good retrieval results

      • e.g. UMass at SIGIR 01

      • inconsistent with heterogeneous document collections

    Model Comparison

    • Estimate query and document models and compare

    • Suitable measure is KL divergence D(Qm||Dm)

      • equivalent to query-likelihood approach if simple empirical distribution used for query model

    • More general risk minimization framework has been proposed

      • Zhai and Lafferty 2001

    • Better results than query-likelihood or document-likelihood approaches
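    The KL-divergence measure D(Qm||Dm) used for model comparison can be sketched directly from its definition (toy probability tables; the document model is assumed pre-smoothed so every query word has non-zero probability):

    ```python
    import math

    def kl_divergence(qm, dm):
        """D(Qm || Dm) = sum over words w of Qm(w) * log(Qm(w) / Dm(w)).
        qm and dm map words to probabilities; dm must be smoothed so
        that dm[w] > 0 wherever qm[w] > 0."""
        return sum(p * math.log(p / dm[w]) for w, p in qm.items() if p > 0)

    # A document model close to the query model has smaller divergence,
    # so ranking by increasing D(Qm||Dm) prefers it.
    qm = {"boolean": 0.5, "retrieval": 0.5}
    dm_close = {"boolean": 0.4, "retrieval": 0.4, "other": 0.2}
    dm_far = {"boolean": 0.05, "retrieval": 0.05, "other": 0.9}
    ```

    With the empirical query distribution as Qm, ranking by −D(Qm||Dm) reduces to query likelihood, which is the equivalence the slide notes.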

    Two-stage smoothing: Another Reason for Smoothing

    • p(“algorithms”|d1) = p(“algorithms”|d2)

    • p(“data”|d1) < p(“data”|d2)

    • p(“mining”|d1) < p(“mining”|d2)

    • But p(q|d1) > p(q|d2)!

    • We should make p(“the”) and p(“for”) less different for all docs. 

    Two-stage Smoothing

    How can one do relevance feedback if using the language modeling approach?

    • Introduce a query model & treat feedback as query model updating

      • Retrieval function:

        • Query-likelihood => KL-Divergence

      • Feedback:

        • Expansion-based => Model-based

    Expansion-based vs. Model-based

    Feedback as Model Interpolation

    Translation model (Berger and Lafferty)

    • Basic LMs do not address issues of synonymy.

      • Or any deviation in expression of information need from language of documents

    • A translation model lets you generate query words not in document via “translation” to synonyms etc.

      • Or to do cross-language IR, or multimedia IR

      • Need to learn a translation model (using a dictionary or via statistical machine translation)

    Language models: pro & con

    • Novel way of looking at the problem of text retrieval based on probabilistic language modeling

        • Conceptually simple and explanatory

        • Formal mathematical model

        • Natural use of collection statistics, not heuristics (almost…)

      • LMs provide effective retrieval and can be improved to the extent that the following conditions can be met

        • Our language models are accurate representations of the data.

        • Users have some sense of term distribution.*

          • *Or we get more sophisticated with translation model

    Comparison With Vector Space

    • There’s some relation to traditional tf.idf models:

      • (unscaled) term frequency is directly in model

      • the probabilities do length normalization of term frequencies

      • the effect of doing a mixture with overall collection frequencies is a little like idf: terms rare in the general collection but common in some documents will have a greater influence on the ranking

    Comparison With Vector Space

    • Similar in some ways

      • Term weights based on frequency

      • Terms often used as if they were independent

      • Inverse document/collection frequency used

      • Some form of length normalization useful

    • Different in others

      • Based on probability rather than similarity

        • Intuitions are probabilistic rather than geometric

      • Details of use of document length and term, document, and collection frequency differ


    • J.M. Ponte and W.B. Croft. 1998. A language modelling approach to information retrieval. In SIGIR 21.
    • D. Hiemstra. 1998. A linguistically motivated probabilistic model of information retrieval. ECDL 2, pp. 569–584.

    • A. Berger and J. Lafferty. 1999. Information retrieval as statistical translation. SIGIR 22, pp. 222–229.

    • D.R.H. Miller, T. Leek, and R.M. Schwartz. 1999. A hidden Markov model information retrieval system. SIGIR 22, pp. 214–221.

    • [Several relevant newer papers at SIGIR 23–25, 2000–2002.]

    • Workshop on Language Modeling and Information Retrieval, CMU 2001.

    • The Lemur Toolkit for Language Modeling and Information Retrieval. CMU/UMass LM and IR system in C(++), currently actively developed.

    Introduction to Information Retrieval


    CS276: Information Retrieval and Web Search

    Lecture 10: Text Classification; The Naive Bayes algorithm

    Standing queries

    • The path from IR to text classification:

      • You have an information need to monitor, say:

        • Unrest in the Niger delta region

      • You want to rerun an appropriate query periodically to find new news items on this topic

      • You will be sent new documents that are found

        • I.e., it’s not ranking but classification (relevant vs. not relevant)

    • Such queries are called standing queries

      • Long used by “information professionals”

      • A modern mass instantiation is Google Alerts

    • Standing queries are (hand-written) text classifiers


    Spam filtering: Another text classification task

    • From: ""<>

    • Subject: real estate is the only way... gem oalvgkay

    • Anyone can buy real estate with no money down

    • Stop paying rent TODAY !

    • There is no need to spend hundreds or even thousands for similar courses

    • I am 22 years old and I have already purchased 6 properties using the

    • methods outlined in this truly INCREDIBLE ebook.

    • Change your life NOW !

    • =================================================

    • Click Below to order:


    • =================================================

    Text classification

    • Today:

      • Introduction to Text Classification

        • Also widely known as “text categorization”

        • Same thing

      • Naïve Bayes text classification

        • Including a little on Probabilistic Language Models


    • Given:

      • A description of an instance, d ∈ X

        • X is the instance language or instance space.

          • Issue: how to represent text documents.

          • Usually some type of high-dimensional space – bag of words

      • A fixed set of classes:

      • C = {c1, c2,…, cJ}

    • Determine:

      • The category of d: γ(d) ∈ C, where γ(d) is a classification function whose domain is X and whose range is C.

        • We want to know how to build classification functions (“classifiers”).

    Machine Learning:Supervised Classification

    • Given:

      • A description of an instance, d ∈ X

        • X is the instance language or instance space.

      • A fixed set of classes:

      • C = {c1, c2,…, cJ}

      • A training set D of labeled documents with each labeled document ⟨d,c⟩ ∈ X×C

    • Determine:

      • A learning method or algorithm which will enable us to learn a classifier γ:X→C

      • For a test document d, we assign it the class γ(d) ∈ C

    Document Classification

    • (Note: in real life there is often a hierarchy, not present in the above problem statement; and also, you get papers on ML approaches to Garb. Coll.)

    More Text Classification Examples: Many search engine functionalities use classification

    • Assigning labels to documents or web-pages:

    • Labels are most often topics such as Yahoo-categories

      • "finance," "sports," "news>world>asia>business"

    • Labels may be genres

      • "editorials" "movie-reviews" "news”

    • Labels may be opinion on a person/product

      • “like”, “hate”, “neutral”

    • Labels may be domain-specific

      • "interesting-to-me" : "not-interesting-to-me”

      • “contains adult language” : “doesn’t”

      • language identification: English, French, Chinese, …

      • search vertical: about Linux versus not

      • “link spam” : “not link spam"

    Classification Methods (1)

    • Manual classification

      • Used by the original Yahoo! Directory

      • Looksmart, ODP, PubMed

      • Very accurate when job is done by experts

      • Consistent when the problem size and team is small

      • Difficult and expensive to scale

        • Means we need automatic classification methods for big problems

    Classification Methods (2)

    • Hand-coded rule-based classifiers

      • One technique used by CS dept’s spam filter, Reuters, CIA, etc.

      • It’s what Google Alerts is doing

        • Widely deployed in government and enterprise

      • Companies provide “IDE” for writing such rules

      • E.g., assign category if document contains a given boolean combination of words

      • Commercial systems have complex query languages (everything in IR query languages + score accumulators)

      • Accuracy is often very high if a rule has been carefully refined over time by a subject expert

      • Building and maintaining these rules is expensive

    A Verity topic: A complex classification rule

    • Note:

      • maintenance issues (author, etc.)

      • Hand-weighting of terms

      • [Verity was bought by Autonomy.]

    Classification Methods (3)

    • Supervised learning of a document-label assignment function

      • Many systems partly or wholly rely on machine learning (Autonomy, Microsoft, Enkata, Yahoo!, …)

        • k-Nearest Neighbors (simple, powerful)

        • Naive Bayes (simple, common method)

        • Support-vector machines (new, generally more powerful)

        • … plus many other methods

      • No free lunch: requires hand-classified training data

      • But data can be built up (and refined) by amateurs

    • Many commercial systems use a mixture of methods

    Relevance feedback

    • In relevance feedback, the user marks a few documents as relevant/nonrelevant

    • The choices can be viewed as classes or categories

    • The IR system then uses these judgments to build a better model of the information need

    • So, relevance feedback can be viewed as a form of text classification (deciding between several classes)

    Probabilistic relevance feedback

    • Rather than reweighting in a vector space…

    • If user has told us some relevant and some nonrelevant documents, then we can proceed to build a probabilistic classifier

        • such as the Naive Bayes model we will look at today:

      • P(tk|R) = |Drk| / |Dr|

      • P(tk|NR) = |Dnrk| / |Dnr|

        • tk is a term; Dr is the set of known relevant documents; Drk is the subset that contain tk; Dnr is the set of known nonrelevant documents; Dnrk is the subset that contain tk.

    Recall a few probability basics

    • For events a and b:

    • Bayes’ Rule

    • Odds:

    Bayesian Methods

    • Learning and classification methods based on probability theory

    • Bayes theorem plays a critical role

    • Builds a generative model that approximates how data is produced

    • Has prior probability of each category given no information about an item.

    • Model produces a posterior probability

      • Distribution over the possible categories given an item

    • Naïve Bayes methods use a bag of words as the item description

    The bag of words representation

    The bag of words representation
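    The bag-of-words representation can be sketched with a counter over tokens: all positional information is discarded, only term counts survive. (Toy documents; lowercasing and whitespace splitting stand in for real tokenization.)

    ```python
    from collections import Counter

    def bag_of_words(text):
        """Represent a document as term counts; word order is discarded."""
        return Counter(text.lower().split())

    # Word order is lost: these two documents get identical representations.
    b1 = bag_of_words("John is quicker than Mary")
    b2 = bag_of_words("Mary is quicker than John")
    ```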

    Bayes’ Rule for text classification

    For a document d and a class c

    Naive Bayes Classifiers

    • Task: Classify a new instance d based on a tuple of attribute values into one of the classes cj ∈ C


    • MAP is “maximum a posteriori” = most likely class

    Naïve Bayes Classifier: Naïve Bayes Assumption

    • P(cj)

      • Can be estimated from the frequency of classes in the training examples.

    • P(x1,x2,…,xn|cj)

      • O(|X|^n • |C|) parameters

      • Could only be estimated if a very, very large number of training examples was available.

    • Naïve Bayes Conditional Independence Assumption:

    • Assume that the probability of observing the conjunction of attributes is equal to the product of the individual probabilities P(xi|cj).

    The Multivariate Bernoulli NB Classifier [Like Prob IR BIM; not language model; less used]

    • Conditional Independence Assumption: features detect term presence and are independent of each other given the class:

    • This model is appropriate for binary variables
      • Multivariate Bernoulli model

    Learning the Model

    • First attempt: maximum likelihood estimates

      • Simply use the frequencies in the data

    Problem with Maximum Likelihood

    • What if we have seen no training documents with the word muscle-ache and classified in the topic Flu?

    • Zero probabilities cannot be conditioned away, no matter the other evidence!

    Smoothing to Avoid Overfitting

    • Somewhat more subtle version

    Stochastic Language Models

    • Model probability of generating any string

    Stochastic Language Models

    • Model probability of generating strings (each word in turn) in a language (commonly all strings over alphabet ∑). E.g., a unigram model

    Unigram and higher-order models

  • Unigram Language Models (Easy, effective!)
  • Bigram (generally, n-gram) Language Models
  • Other Language Models
  • Grammar-based models (PCFGs), etc.
  • Probably not the first thing to try in IR
  • Naïve Bayes via a class conditional language model = multinomial NB

    • The probability of the words is done as a class-specific unigram language model

    Using Multinomial Naive Bayes Classifiers to Classify Text: Basic method

    • Attributes are text positions, values are words.

    • Still too many possibilities

    • Assume that classification is independent of the positions of the words

    • Use same parameters for each position

    • Result is bag of words model (over tokens not types)

    Naive Bayes and Language Modeling

    • Naïve Bayes classifiers can use any sort of feature

      • URL, email address, dictionaries, network features

    • But if, as in the previous slides

      • We use only word features

      • we use all of the words in the text (not a subset)

    • Then

      • Naïve Bayes is basically the same as language modeling

    Multinomial Naive Bayes: Learning

    • From training corpus, extract Vocabulary
    • Calculate required P(cj) and P(xk | cj) terms
      • For each cj in C do
        • docsj ← subset of documents for which the target class is cj
        • P(cj) ← |docsj| / |total # of documents|
        • Textj ← single document containing all docsj
        • n ← total number of word occurrences in Textj
          • for each word xk in Vocabulary
          • nk ← number of occurrences of xk in Textj
          • P(xk | cj) ← (nk + 1) / (n + |Vocabulary|)

    Naive Bayes: Classifying

    • positions ← all word positions in current document which contain tokens found in Vocabulary

    • Return cNB, where

    Naive Bayes: Time Complexity

    • Training Time: O(|D|Lave + |C||V|) where Lave is the average length of a document in D.

      • Assumes all counts are pre-computed in O(|D|Lave) time during one pass through all of the data.                          ← Why ?

      • Generally just O(|D|Lave) since usually |C||V| < |D|Lave

    • Test Time: O(|C| Lt) where Lt is the average length of a test document.

    • Very efficient overall, linearly proportional to the time needed to just read in all the data.

    Underflow Prevention: using logs

    • Multiplying lots of probabilities, which are between 0 and 1 by definition, can result in floating-point underflow.

    • Since log(xy) = log(x) + log(y), it is better to perform all computations by summing logs of probabilities rather than multiplying probabilities.

    • Class with highest final un-normalized log probability score is still the most probable.

    • Note that model is now just max of sum of weights…
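    The underflow point can be checked numerically with toy numbers: the product of 100 probabilities of 1e-5 is 1e-500, far below the smallest positive double, so the running product collapses to 0.0, while the sum of logs stays perfectly representable.

    ```python
    import math

    # 100 conditional probabilities of 1e-5 each; the true product is
    # 1e-500, which is below the smallest positive double.
    probs = [1e-5] * 100

    product = 1.0
    for p in probs:
        product *= p          # underflows: ends up exactly 0.0

    log_score = sum(math.log(p) for p in probs)  # ~ -1151.3, no underflow
    ```

    Since log is monotonic, comparing classes by summed log scores picks the same winner as comparing products would.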







    Worked example (the IIR Chapter 13 data):

      • Training doc 1: Chinese Beijing Chinese

      • Training doc 2: Chinese Chinese Shanghai

      • Training doc 3: Chinese Macao

      • Training doc 4: Tokyo Japan Chinese

      • Test doc 5: Chinese Chinese Chinese Tokyo Japan
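    These documents are the classic multinomial NB worked example from IIR Chapter 13. Assuming, as in that example, that the first three training docs belong to class c (China) and the fourth to class j, training with add-one smoothing and classifying the test doc can be sketched as:

    ```python
    import math
    from collections import Counter

    # Class labels assumed from the IIR Ch. 13 worked example.
    train = [("Chinese Beijing Chinese", "c"),
             ("Chinese Chinese Shanghai", "c"),
             ("Chinese Macao", "c"),
             ("Tokyo Japan Chinese", "j")]
    test = "Chinese Chinese Chinese Tokyo Japan"

    # Train: priors and add-one-smoothed conditional probabilities.
    vocab = {w for d, _ in train for w in d.split()}
    classes = {c for _, c in train}
    prior, cond = {}, {}
    for c in classes:
        docs = [d.split() for d, cl in train if cl == c]
        prior[c] = len(docs) / len(train)
        tokens = Counter(w for d in docs for w in d)
        n = sum(tokens.values())
        cond[c] = {w: (tokens[w] + 1) / (n + len(vocab)) for w in vocab}

    def score(doc, c):
        """Log posterior (up to a constant): log P(c) + sum log P(w|c)."""
        return math.log(prior[c]) + sum(math.log(cond[c][w]) for w in doc.split())

    best = max(classes, key=lambda c: score(test, c))
    ```

    Three occurrences of "Chinese" outweigh one each of "Tokyo" and "Japan", so the test document is assigned to class c.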


    Two Naive Bayes Models

    • Model 1: Multivariate Bernoulli

      • One feature Xw for each word in dictionary

        • for loop iterates over dictionary

      • Xw = true in document d if w appears in d

      • Naive Bayes assumption:

        • Given the document’s topic, appearance of one word in the document tells us nothing about chances that another word appears

    • This is the model used in the binary independence model in classic probabilistic relevance feedback on hand-classified data

    Two Models

    • Model 2: Multinomial = Class conditional unigram

      • One feature Xi for each word pos in document

        • feature’s values are all words in dictionary

      • Value of Xi is the word in position i

      • Naïve Bayes assumption:

        • Given the document’s topic, word in one position in the document tells us nothing about words in other positions

      • Second assumption:

        • Word appearance does not depend on position

          P(Xi = w | c) = P(Xj = w | c)

        • Just have one multinomial feature predicting all words

    • for all positions i,j, word w, and class c

    Parameter estimation

    Multivariate Bernoulli model:

    fraction of documents of topic cj in which word w appears

    Multinomial model:

    fraction of times in which word w appears among all words in documents of topic cj

    • Can create a mega-document for topic j by concatenating all documents in this topic
    • Use frequency of w in mega-document

    Which to use for classification?

    • Multinomial vs Multivariate Bernoulli?

    • Multinomial model is almost always more effective in text applications!

        • See results figures later

    • There has been exploration of multinomial Naïve Bayes variants which often work better in practice

      • Binarized multinomial Naïve Bayes, etc.

      • Topic of PA4

    Feature Selection: Why?

    • Text collections have a large number of features

      • 10,000 – 1,000,000 unique words … and more

    • May make using a particular classifier feasible

      • Some classifiers can’t deal with 1,000,000 features

    • Reduces training time

      • Training time for some methods is quadratic or worse in the number of features

    • Makes runtime models smaller and faster

    • Can improve generalization (performance)

      • Eliminates noise features

      • Avoids overfitting

    Feature Selection: How?

    • Two ideas:

      • Hypothesis testing statistics:

        • Are we confident that the value of one categorical variable is associated with the value of another

        • Chi-square test (χ2)

      • Information theory:

        • How much information does the value of one categorical variable give you about the value of another

        • Mutual information

    • They’re similar, but χ2 measures confidence in association, (based on available statistics), while MI measures extent of association (assuming perfect knowledge of probabilities)

    χ2 statistic (CHI)

    • χ2 is interested in (fo – fe)²/fe summed over all table entries: is the observed number what you’d expect given the marginals?

    • The null hypothesis is rejected with confidence .999, since 12.9 > 10.83 (the value for .999 confidence).

    χ2 statistic (CHI)

    • There is a simpler formula for 2×2 χ2:

      χ2 = N (AD − BC)² / ( (A+B) (A+C) (B+D) (C+D) )

    • N = A + B + C + D

    • Value for complete independence of term and category?
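    The standard 2×2 shortcut, χ2 = N(AD − BC)² / ((A+B)(A+C)(B+D)(C+D)), can be sketched directly, with A = docs containing the term and in the class, B = term without class, C = class without term, D = neither (toy counts). It also answers the slide's question: under complete independence, AD = BC and the statistic is 0.

    ```python
    def chi2_2x2(A, B, C, D):
        """Chi-square statistic for a 2x2 term/class contingency table:
        N * (AD - BC)^2 / ((A+B)(A+C)(B+D)(C+D))."""
        N = A + B + C + D
        num = N * (A * D - B * C) ** 2
        den = (A + B) * (A + C) * (B + D) * (C + D)
        return num / den
    ```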

    Feature selection via Mutual Information

    • In training set, choose k words which best discriminate (give most info on) the categories.

    • The Mutual Information between a word, class is:

      • For each word w and each category c
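    The word/class mutual information can be sketched in the document-count form used in IIR, where N11 counts documents that contain the term and are in the class, N10 term-but-not-class, N01 class-but-not-term, and N00 neither (base-2 logs; toy counts in the test):

    ```python
    import math

    def mutual_information(N11, N10, N01, N00):
        """I(term; class) estimated from a 2x2 document-count table:
        sum over cells of P(cell) * log2(P(cell) / (P(term) * P(class))),
        with all probabilities estimated by maximum likelihood."""
        N = N11 + N10 + N01 + N00
        mi = 0.0
        # (cell count, matching term-indicator marginal, class marginal)
        for n, nt, nc in [(N11, N11 + N10, N11 + N01),
                          (N10, N11 + N10, N10 + N00),
                          (N01, N01 + N00, N11 + N01),
                          (N00, N01 + N00, N10 + N00)]:
            if n:  # 0 * log(0) is taken as 0
                mi += n / N * math.log2(N * n / (nt * nc))
        return mi
    ```

    Picking the k terms with the highest MI per class is the greedy selection the next slide describes; correlations between terms are ignored.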

    Feature selection via MI (contd.)

    • For each category we build a list of k most discriminating terms.

    • For example (on 20 Newsgroups):

      • sci.electronics: circuit, voltage, amp, ground, copy, battery, electronics, cooling, …

      • rec.autos: car, cars, engine, ford, dealer, mustang, oil, collision, autos, tires, toyota, …

    • Greedy: does not account for correlations between terms

    • Why?

    Feature Selection

    • Mutual Information

      • Clear information-theoretic interpretation

      • May select rare uninformative terms

    • Chi-square

      • Statistical foundation

      • May select very slightly informative frequent terms that are not very useful for classification

    Feature Selection: Frequency

    • The simplest feature selection method:

      • Just use the commonest terms

      • No particular foundation

      • But it makes sense why this works

        • They’re the words that can be well-estimated and are most often available as evidence

      • In practice, this is often 90% as good as better methods

    Feature selection for NB

    • In general feature selection is necessary for multivariate Bernoulli NB.

    • Otherwise you suffer from noise, multi-counting

    • “Feature selection” really means something different for multinomial NB. It means dictionary truncation

      • The multinomial NB model only has 1 feature

    • This “feature selection” normally isn’t needed for multinomial NB, but may help a fraction with quantities that are badly estimated

    Evaluating Categorization

    • Evaluation must be done on test data that are independent of the training data (usually a disjoint set of instances).

      • Sometimes use cross-validation (averaging results over multiple training and test splits of the overall data)

    • It’s easy to get good performance on a test set that was available to the learner during training (e.g., just memorize the test set).

    • Measures: precision, recall, F1, classification accuracy

    • Classification accuracy: c/n where n is the total number of test instances and c is the number of test instances correctly classified by the system.

      • Adequate if one class per document

      • Otherwise F measure for each class

    Naive Bayes vs. other methods

    WebKB Experiment (1998)

    • Classify webpages from CS departments into:

      • student, faculty, course, project

    • Train on ~5,000 hand-labeled web pages

      • Cornell, Washington, U.Texas, Wisconsin

    • Crawl and classify a new site (CMU)

    • Results:

    NB Model Comparison: WebKB


    • Naïve Bayes has found a home in spam filtering

      • Paul Graham’s A Plan for Spam

        • A Naive Bayes-like classifier with weird parameter estimation

      • Widely used in spam filters

      • But many features beyond words:

        • black hole lists, etc.

        • particular hand-crafted text patterns

    Naïve Bayes in Spam Filtering

    • SpamAssassin Features:

      • Basic (Naïve) Bayes spam probability

      • Mentions: Generic Viagra

      • Mentions millions of $ ($NN,NNN,NNN.NN)

      • Phrase: impress ... girl

      • Phrase: 'Prestigious Non-Accredited Universities’

      • From: starts with many numbers

      • Subject is all capitals

      • HTML has a low ratio of text to image area

      • Relay in RBL

      • RCVD line looks faked


    Naïve Bayes on spam email

    Violation of NB Assumptions

    • The independence assumptions do not really hold of documents written in natural language.

      • Conditional independence

      • Positional independence

    • Examples?

    Example: Sensors

    Naïve Bayes Posterior Probabilities

    • Classification results of naïve Bayes (the class with maximum posterior probability) are usually fairly accurate.

    • However, due to the inadequacy of the conditional independence assumption, the actual posterior-probability numerical estimates are not.

      • Output probabilities are commonly very close to 0 or 1.

    • Correct estimation ⇒ accurate prediction, but correct probability estimation is NOT necessary for accurate prediction (just need right ordering of probabilities)

    Naive Bayes is Not So Naive

    • Very Fast Learning and Testing (basically just count the data)

    • Low Storage requirements

    • Very good in domains with many equally important features

    • More robust to irrelevant features than many learning methods

      • Irrelevant Features cancel each other without affecting results

    • More robust to concept drift (changing class definition over time)

    • Naive Bayes won 1st and 2nd place in KDD-CUP 97 competition out of 16 systems

      • Goal: Financial services industry direct mail response prediction: Predict if the recipient of mail will actually respond to the advertisement – 750,000 records.

    • A good dependable baseline for text classification (but not the best)!

    Resources for today’s lecture

    • IIR 13

    • Fabrizio Sebastiani. Machine Learning in Automated Text Categorization. ACM Computing Surveys, 34(1):1-47, 2002.

    • Yiming Yang & Xin Liu, A re-examination of text categorization methods. Proceedings of SIGIR, 1999.

    • Andrew McCallum and Kamal Nigam. A Comparison of Event Models for Naive Bayes Text Classification. In AAAI/ICML-98 Workshop on Learning for Text Categorization, pp. 41-48.

    • Tom Mitchell, Machine Learning. McGraw-Hill, 1997.

      • Clear simple explanation of Naïve Bayes

    • Open Calais: Automatic Semantic Tagging

      • Free (but they can keep your data), provided by Thomson Reuters (ex-ClearForest)

    • Weka: A data mining software package that includes an implementation of Naive Bayes

    • Reuters-21578 – the most famous text classification evaluation set

      • Still widely used by lazy people (but now it’s too small for realistic experiments – you should use Reuters RCV1)

    CS276: Information Retrieval and Web Search

    Pandu Nayak and Prabhakar Raghavan

    Lecture 11: Text Classification;

    Vector space classification

    [Borrows slides from Ray Mooney]

    Recap: Naïve Bayes classifiers

    • Classify based on prior weight of class and conditional parameter for what each word says:

    • Training is done by counting and dividing:

    • Don’t forget to smooth
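    The counting-and-dividing recipe, with add-one smoothing, can be sketched for the multinomial model as follows (an illustrative implementation, assuming tokenized training documents):

```python
from collections import Counter, defaultdict
import math

def train_nb(docs):
    """Multinomial Naive Bayes. docs: list of (class, [tokens]) pairs.
    Returns log priors and add-one-smoothed log conditional probabilities."""
    class_counts = Counter(c for c, _ in docs)
    term_counts = defaultdict(Counter)          # class -> term -> count
    vocab = set()
    for c, tokens in docs:
        term_counts[c].update(tokens)
        vocab.update(tokens)
    n = len(docs)
    log_prior = {c: math.log(k / n) for c, k in class_counts.items()}
    log_cond = {}
    for c in class_counts:
        total = sum(term_counts[c].values())
        # add-one smoothing: (count + 1) / (total + |V|)
        log_cond[c] = {t: math.log((term_counts[c][t] + 1) / (total + len(vocab)))
                       for t in vocab}
    return log_prior, log_cond, vocab

def classify_nb(log_prior, log_cond, vocab, tokens):
    """Pick the class with maximum log prior + summed log conditionals."""
    scores = {c: log_prior[c] + sum(log_cond[c][t] for t in tokens if t in vocab)
              for c in log_prior}
    return max(scores, key=scores.get)
```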

    The rest of text classification

    • Today:

      • Vector space methods for Text Classification

        • Vector space classification using centroids (Rocchio)

        • K Nearest Neighbors

        • Decision boundaries, linear and nonlinear classifiers

        • Dealing with more than 2 classes

    • Later in the course

      • More text classification

        • Support Vector Machines

        • Text-specific issues in classification

    Recall: Vector Space Representation

    • Each document is a vector, one component for each term (= word).

    • Normally normalize vectors to unit length.

    • High-dimensional vector space:

      • Terms are axes

      • 10,000+ dimensions, or even 100,000+

      • Docs are vectors in this space

    • How can we do classification in this space?

    Classification Using Vector Spaces

    • As before, the training set is a set of documents, each labeled with its class (e.g., topic)

    • In vector space classification, this set corresponds to a labeled set of points (or, equivalently, vectors) in the vector space

    • Premise 1: Documents in the same class form a contiguous region of space

    • Premise 2: Documents from different classes don’t overlap (much)

    • We define surfaces to delineate classes in the space

    Documents in a Vector Space

    Test Document of what class?

    Test Document = Government

    Aside: 2D/3D graphs can be misleading

    Using Rocchio for text classification

    • Relevance feedback methods can be adapted for text categorization

      • As noted before, relevance feedback can be viewed as 2-class classification

        • Relevant vs. nonrelevant documents

    • Use standard tf-idf weighted vectors to represent text documents

    • For training documents in each category, compute a prototype vector by summing the vectors of the training documents in the category.

      • Prototype = centroid of members of class

    • Assign test documents to the category with the closest prototype vector based on cosine similarity.

    Illustration of Rocchio Text Categorization

    Definition of centroid

    • Where Dc is the set of all documents that belong to class c and v(d) is the vector space representation of d.

    • Note that centroid will in general not be a unit vector even when the inputs are unit vectors.
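    A sketch of the centroid definition and Rocchio classification over sparse term-weight vectors (dicts are an assumption of this illustration, not the slides):

```python
import math

def centroid(vectors):
    """mu(c) = (1/|Dc|) * sum of v(d) over documents d in class c.
    vectors: list of dicts mapping term -> weight (sparse tf-idf vectors)."""
    mu = {}
    for v in vectors:
        for term, w in v.items():
            mu[term] = mu.get(term, 0.0) + w
    # Dividing by |Dc| does not generally yield a unit vector,
    # even when the inputs are unit vectors.
    return {t: w / len(vectors) for t, w in mu.items()}

def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u if t in v)
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def rocchio_classify(centroids, doc):
    """Assign doc to the class whose prototype is most cosine-similar."""
    return max(centroids, key=lambda c: cosine(centroids[c], doc))
```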

    Rocchio Properties

    • Forms a simple generalization of the examples in each class (a prototype).

    • Prototype vector does not need to be averaged or otherwise normalized for length since cosine similarity is insensitive to vector length.

    • Classification is based on similarity to class prototypes.

    • Does not guarantee classifications are consistent with the given training data. Why not?

    Rocchio Anomaly

    • Prototype models have problems with polymorphic (disjunctive) categories.

    Rocchio classification

    • Rocchio forms a simple representation for each class: the centroid/prototype

    • Classification is based on similarity to / distance from the prototype/centroid

    • It does not guarantee that classifications are consistent with the given training data

    • It is little used outside text classification

      • It has been used quite effectively for text classification

      • But in general worse than Naïve Bayes

    • Again, cheap to train and test documents

    kNN decision boundaries

    Example: k=6 (6NN)

    Nearest-Neighbor Learning Algorithm

    • Learning is just storing the representations of the training examples in D.

    • Testing instance x (under 1NN):

      • Compute similarity between x and all examples in D.

      • Assign x the category of the most similar example in D.

    • Does not explicitly compute a generalization or category prototypes.

    • Also called:

      • Case-based learning

      • Memory-based learning

      • Lazy learning

    • Rationale of kNN: contiguity hypothesis

    kNN Is Close to Optimal

    • Cover and Hart (1967)

    • Asymptotically, the error rate of 1-nearest-neighbor classification is less than twice the Bayes rate [error rate of classifier knowing model that generated data]

    • In particular, asymptotic error rate is 0 if Bayes rate is 0.

    • Proof sketch: asymptotically, the nearest neighbor of a query point coincides with it.

    • Both the query point’s label and its nearest neighbor’s label can deviate from the Bayes-optimal label → at most 2 times the Bayes rate

    k Nearest Neighbor

    • Using only the closest example (1NN) to determine the class is subject to errors due to:

      • A single atypical example.

      • Noise (i.e., an error) in the category label of a single training example.

    • More robust alternative is to find the k most-similar examples and return the majority category of these k examples.

    • Value of k is typically odd to avoid ties; 3 and 5 are most common.

    k Nearest Neighbor Classification

    • kNN = k Nearest Neighbor

    • To classify a document d into class c:

    • Define k-neighborhood N as k nearest neighbors of d

    • Count number of documents i in N that belong to c

    • Estimate P(c|d) as i/k

    • Choose as class argmaxc P(c|d) [ = majority class]
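    The steps above can be sketched directly; the similarity function is passed in, so any of the metrics on the next slide can be used (illustrative code, not from the slides):

```python
def knn_classify(train, doc, k, sim):
    """train: list of (class, vector) pairs; sim: similarity function.
    Returns (majority class, estimated P(c|d) = i/k)."""
    # N = the k nearest neighbors of doc under sim
    neighbors = sorted(train, key=lambda cv: sim(cv[1], doc), reverse=True)[:k]
    counts = {}
    for c, _ in neighbors:
        counts[c] = counts.get(c, 0) + 1
    best = max(counts, key=counts.get)   # majority class
    return best, counts[best] / k        # i/k estimate of P(c|d)
```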

    Similarity Metrics

    • Nearest neighbor method depends on a similarity (or distance) metric.

    • Simplest for continuous m-dimensional instance space is Euclidean distance.

    • Simplest for m-dimensional binary instance space is Hamming distance (number of feature values that differ).

    • For text, cosine similarity of tf.idf weighted vectors is typically most effective.
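    The three metrics named above, sketched for simple list-based vectors (illustrative helpers):

```python
import math

def euclidean(u, v):
    """Euclidean distance between two dense m-dimensional vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def hamming(u, v):
    """Number of feature values that differ between two binary vectors."""
    return sum(1 for a, b in zip(u, v) if a != b)

def cosine_sim(u, v):
    """Cosine similarity; for text, apply to tf.idf-weighted vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)
```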

    Illustration of 3 Nearest Neighbor for Text Vector Space

    3 Nearest Neighbor vs. Rocchio

    • Nearest Neighbor tends to handle polymorphic categories better than Rocchio/NB.

    Nearest Neighbor with Inverted Index

    • Naively, finding nearest neighbors requires a linear search through |D| documents in collection

    • But determining k nearest neighbors is the same as determining the k best retrievals using the test document as a query to a database of training documents.

    • Use standard vector space inverted index methods to find the k nearest neighbors.

    • Testing Time: O(B|Vt|), where |Vt| is the number of distinct terms in the test document and B is the average number of training documents in which a test-document word appears.

      • Typically B << |D|

    kNN: Discussion

    • No feature selection necessary

    • Scales well with large number of classes

      • Don’t need to train n classifiers for n classes

    • Classes can influence each other

      • Small changes to one class can have ripple effect

    • Scores can be hard to convert to probabilities

    • No training necessary

      • Actually: perhaps not true. (Data editing, etc.)

    • May be expensive at test time

    • In most cases it’s more accurate than NB or Rocchio

    kNN vs. Naive Bayes

    • Bias/Variance tradeoff

      • Variance ≈ Capacity

    • kNN has high variance and low bias.

      • Infinite memory

    • NB has low variance and high bias.

      • Decision surface has to be linear (hyperplane – see later)

    • Consider asking a botanist: Is an object a tree?

      • Too much capacity/variance, low bias

        • Botanist who memorizes

        • Will always say “no” to new object (e.g., different # of leaves)

      • Not enough capacity/variance, high bias

        • Lazy botanist

        • Says “yes” if the object is green

      • You want the middle ground

                                                                                    (Example due to C. Burges)

    Bias vs. variance: Choosing the correct model capacity

    Linear classifiers and binary and multiclass classification

    • Consider 2 class problems

      • Deciding between two classes, perhaps, government and non-government

        • One-versus-rest classification

    • How do we define (and find) the separating surface?

    • How do we decide which region a test doc is in?

    Separation by Hyperplanes

    • A strong high-bias assumption is linear separability:

      • in 2 dimensions, can separate classes by a line

      • in higher dimensions, need hyperplanes
    • Can find separating hyperplane by linear programming

      • (or can iteratively fit solution via perceptron):

      • separator can be expressed as ax + by = c

    Linear programming / Perceptron

    • Find a, b, c such that

      • ax + by > c for red points

      • ax + by < c for blue points

    Which Hyperplane?

    • In general, lots of possible solutions for a,b,c.

    Which Hyperplane?

    • Lots of possible solutions for a,b,c.

    • Some methods find a separating hyperplane, but not the optimal one [according to some criterion of expected goodness]

      • E.g., perceptron

    • Most methods find an optimal separating hyperplane

    • Which points should influence optimality?

      • All points

        • Linear/logistic regression

        • Naïve Bayes

      • Only “difficult points” close to decision boundary

        • Support vector machines

    Linear classifier: Example

    • Class: “interest” (as in interest rate)

    • Example features of a linear classifier

                        wi    ti                        wi    ti
                      0.70    prime                 −0.71    dlrs
                      0.67    rate                  −0.35    world
                      0.63    interest              −0.33    sees
                      0.60    rates                 −0.25    year
                      0.46    discount              −0.24    group
                      0.43    bundesbank            −0.24    dlr

    • To classify, find dot product of feature vector and weights
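    With the weights from the table above, the dot-product classification can be sketched as follows (a threshold of 0 is an assumption of this illustration):

```python
# Weights w_i for terms t_i, taken from the table on this slide.
weights = {
    "prime": 0.70, "rate": 0.67, "interest": 0.63, "rates": 0.60,
    "discount": 0.46, "bundesbank": 0.43,
    "dlrs": -0.71, "world": -0.35, "sees": -0.33, "year": -0.25,
    "group": -0.24, "dlr": -0.24,
}

def classify_interest(doc_vector, threshold=0.0):
    """Dot product of the document's feature vector with the weights;
    assign the class 'interest' if the score exceeds the threshold."""
    score = sum(weights.get(t, 0.0) * w for t, w in doc_vector.items())
    return ("interest" if score > threshold else "other"), score
```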

    Linear Classifiers

    • Many common text classifiers are linear classifiers

      • Naïve Bayes

      • Perceptron

      • Rocchio

      • Logistic regression

      • Support vector machines (with linear kernel)

      • Linear regression with threshold

    • Despite this similarity, noticeable performance differences

      • For separable problems, there is an infinite number of separating hyperplanes. Which one do you choose?

      • What to do for non-separable problems?

      • Different training methods pick different hyperplanes

    • Classifiers more powerful than linear often don’t perform better on text problems. Why?

    Rocchio is a linear classifier

    Two-class Rocchio as a linear classifier

    • Line or hyperplane defined by:

    • For Rocchio, set:

    • [Aside for ML/stats people: Rocchio classification is a simplification of the classic Fisher Linear Discriminant where you don’t model the variance (or assume it is spherical).]

    Naive Bayes is a linear classifier

    • Two-class Naive Bayes. We compute:

    • Decide class C if the odds are greater than 1, i.e., if the log odds are greater than 0.

    • So decision boundary is hyperplane:
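    The slide's formulas can be sketched as code: two-class NB parameters convert into a linear rule with weights w[t] = log P(t|C) − log P(t|¬C) and bias b = log P(C) − log P(¬C); deciding "log odds > 0" is then a dot product against a hyperplane (an illustrative sketch of the standard result):

```python
import math

def nb_to_hyperplane(prior_c, prior_not_c, cond_c, cond_not_c):
    """Express two-class Naive Bayes as a linear decision rule:
    classify as C when b + sum_i w[t_i] > 0 over the document's tokens."""
    b = math.log(prior_c) - math.log(prior_not_c)
    w = {t: math.log(cond_c[t]) - math.log(cond_not_c[t]) for t in cond_c}
    return w, b

def nb_linear_classify(w, b, tokens):
    score = b + sum(w.get(t, 0.0) for t in tokens)  # the log odds
    return score > 0                                # True => class C
```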

    A nonlinear problem

    • A linear classifier like Naïve Bayes does badly on this task

    • kNN will do very well (assuming enough training data)

    High Dimensional Data

    • Pictures like the one at right are absolutely misleading!

    • Documents are zero along almost all axes

    • Most document pairs are very far apart (i.e., not strictly orthogonal, but only share very common words and a few scattered others)

    • In classification terms: often document sets are separable, for most any classification

    • This is part of why linear classifiers are quite successful in this domain

    More Than Two Classes

    • Any-of or multivalue classification

      • Classes are independent of each other.

      • A document can belong to 0, 1, or >1 classes.

      • Decompose into n binary problems

      • Quite common for documents

    • One-of or multinomial or polytomous classification

      • Classes are mutually exclusive.

      • Each document belongs to exactly one class

      • E.g., digit recognition is polytomous classification

        • Digits are mutually exclusive

    Set of Binary Classifiers: Any of

    • Build a separator between each class and its complementary set (docs from all other classes).

    • Given test doc, evaluate it for membership in each class.

    • Apply decision criterion of classifiers independently

    • Done

      • Though maybe you could do better by considering dependencies between categories

    Set of Binary Classifiers: One of

    • Build a separator between each class and its complementary set (docs from all other classes).

    • Given test doc, evaluate it for membership in each class.

    • Assign document to class with:

    • maximum score

    • maximum confidence

    • maximum probability
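    The two decompositions above can be sketched side by side; each binary classifier is modeled as a scoring function whose sign is its membership decision (an illustrative sketch, assuming score > 0 means "member"):

```python
def any_of(classifiers, doc):
    """Any-of: apply each binary classifier's decision independently;
    a document may receive 0, 1, or more than 1 class.
    classifiers: dict class -> scoring function."""
    return [c for c, f in classifiers.items() if f(doc) > 0]

def one_of(classifiers, doc):
    """One-of: assign exactly one class, the one with the maximum score."""
    return max(classifiers, key=lambda c: classifiers[c](doc))
```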

    Summary: Representation of Text Categorization Attributes

    • Representations of text are usually very high dimensional (one feature for each word)

    • High-bias algorithms that prevent overfitting in high-dimensional space should generally work best*

    • For most text categorization tasks, there are many relevant features and many irrelevant ones

    • Methods that combine evidence from many or all features (e.g. naive Bayes, kNN) often tend to work better than ones that try to isolate just a few relevant features*

                                                      *Although the results are a bit more mixed than often thought

    Which classifier do I use for a given text classification problem?

    • Is there a learning method that is optimal for all text classification problems?

    • No, because there is a tradeoff between bias and variance.

    • Factors to take into account:

      • How much training data is available?

      • How simple/complex is the problem? (linear vs. nonlinear decision boundary)

      • How noisy is the data?

      • How stable is the problem over time?

        • For an unstable problem, it’s better to use a simple and robust classifier.

    Resources for today’s lecture

    • IIR 14

    • Fabrizio Sebastiani. Machine Learning in Automated Text Categorization. ACM Computing Surveys, 34(1):1-47, 2002.

    • Yiming Yang & Xin Liu, A re-examination of text categorization methods. Proceedings of SIGIR, 1999.

    • Trevor Hastie, Robert Tibshirani and Jerome Friedman, Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer-Verlag, New York.

    • Open Calais: Automatic Semantic Tagging

      • Free (but they can keep your data), provided by Thomson Reuters

    • Weka: A data mining software package that includes an implementation of many ML algorithms