
### Introduction to Information Retrieval

CS276

Information Retrieval and Web Search

Pandu Nayak and Prabhakar Raghavan

Lecture 1: Boolean retrieval

### Information Retrieval

• Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers)

### Unstructured data in 1680

• Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?
• One could grep all of Shakespeare’s plays for Brutus and Caesar, then strip out lines containing Calpurnia?
• Why is that not the answer?
• Slow (for large corpora)
• NOT Calpurnia is non-trivial
• Other operations (e.g., find the word Romans near countrymen) not feasible
• Ranked retrieval (best documents to return)
• Later lectures

### Term-document incidence

Brutus AND Caesar BUT NOT Calpurnia

Matrix entry is 1 if the play contains the word, 0 otherwise.

### Incidence vectors

• So we have a 0/1 vector for each term.

• To answer query: take the vectors for Brutus, Caesar and Calpurnia (complemented)  → bitwise AND.

• 110100 AND 110111 AND 101111 = 100100.
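The bitwise evaluation above can be sketched in a few lines of Python, hard-coding the slide's six-play incidence vectors as integers:

```python
# Term-document incidence vectors over six plays, as on the slide.
# Bit d (from the left) is 1 if the play in column d contains the term.
vectors = {
    "Brutus":    0b110100,
    "Caesar":    0b110111,
    "Calpurnia": 0b010000,   # complement is 101111, as used in the query
}

def boolean_and_not(a, b, c, width=6):
    """Answer the query a AND b AND NOT c over `width` documents."""
    mask = (1 << width) - 1          # all-ones vector, needed to complement c
    return a & b & (~c & mask)

result = boolean_and_not(vectors["Brutus"], vectors["Caesar"], vectors["Calpurnia"])
print(format(result, "06b"))  # 100100
```

This reproduces 110100 AND 110111 AND 101111 = 100100: the first and fourth plays match.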


### Basic assumptions of Information Retrieval

• Collection: Fixed set of documents
• Goal: Retrieve documents with information that is relevant to the user’s information need and helps the user complete a task.

### How good are the retrieved docs?

• Precision : Fraction of retrieved docs that are relevant to user’s information need

• Recall : Fraction of relevant docs in collection that are retrieved

• More precise definitions and measurements to follow in later lectures
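The two set-based definitions above can be computed directly; a minimal sketch (docIDs are illustrative):

```python
def precision_recall(retrieved, relevant):
    """Precision = |retrieved ∩ relevant| / |retrieved|;
    Recall = |retrieved ∩ relevant| / |relevant|."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    return hits / len(retrieved), hits / len(relevant)

p, r = precision_recall(retrieved={1, 2, 3, 4}, relevant={2, 4, 5, 6, 7})
# 2 of 4 retrieved docs are relevant -> precision 0.5
# 2 of 5 relevant docs were retrieved -> recall 0.4
```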

### Bigger collections

• Consider N = 1 million documents, each with about 1000 words.

• Avg 6 bytes/word including spaces/punctuation

• 6GB of data in the documents.

• Say there are M = 500K distinct terms among these.

### Can’t build the matrix

• 500K x 1M matrix has half-a-trillion 0’s and 1’s.

• But it has no more than one billion 1’s.                               Why?

• matrix is extremely sparse.

• What’s a better representation?

• We only record the 1 positions.

### Inverted index

• For each term t, we must store a list of all documents that contain t.
• Identify each by a docID, a document serial number
• Can we use fixed-size arrays for this?
 Brutus → 1 2 4 11 31 45 173 174
 Caesar → 1 2 4 5 6 16 57 132
 Calpurnia → 2 31 54 101

What happens if the word Caesar is added to document 14?

### Inverted index

• We need variable-size postings lists

• On disk, a contiguous run of postings is normal and best

• In memory, can use linked lists or variable length arrays

• Some tradeoffs in size/ease of insertion

 Brutus → 1 2 4 11 31 45 173 174
 Caesar → 1 2 4 5 6 16 57 132
 Calpurnia → 2 31 54 101

Brutus, Caesar, and Calpurnia are dictionary terms; the numbers are postings.

Sorted by docID (more later on why).
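The dictionary-plus-postings structure above can be sketched with a plain Python dict; the whitespace tokenization is a deliberate simplification (tokenization is covered in the next lecture):

```python
from collections import defaultdict

def build_index(docs):
    """docs: {docID: text}. Returns an inverted index:
    term -> postings list of docIDs, sorted by docID."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():   # naive tokenizer for illustration
            index[token].add(doc_id)
    # Sets make insertion easy; sort once at the end to get docID order.
    return {term: sorted(ids) for term, ids in index.items()}

index = build_index({1: "Brutus killed Caesar", 2: "Caesar and Calpurnia"})
# index["caesar"] -> [1, 2]
```

Using sets during construction sidesteps the fixed-size-array problem: adding Caesar to document 14 is just another `add`.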

### The index we just built

• How do we process a query?                                  ← Today’s focus

• Later - what kinds of queries can we process?

### Query processing: AND

• Consider processing the query: Brutus AND Caesar

• Locate Brutus in the Dictionary;

• Retrieve its postings.

• Locate Caesar in the Dictionary;

• Retrieve its postings.

• “Merge” the two postings:

### The merge

• Walk through the two postings simultaneously, in time linear in the total number of postings entries
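The linear-time merge walks both lists with two pointers, advancing the one holding the smaller docID. A sketch in Python:

```python
def intersect(p1, p2):
    """Intersect two postings lists sorted by docID,
    in O(len(p1) + len(p2)) time."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])     # docID in both lists: a match
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                   # advance the pointer at the smaller docID
        else:
            j += 1
    return answer

# intersect([1, 2, 4, 11, 31, 45, 173, 174], [1, 2, 4, 5, 6, 16, 57, 132])
# -> [1, 2, 4]
```

This is why postings are kept sorted by docID: the merge only ever moves forward.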

### Query optimization

• What is the best order for query processing?

• Consider a query that is an AND of n terms.

• For each of the n terms, get its postings, then AND them together.

 Brutus → 2 4 8 16 32 64 128
 Caesar → 1 2 3 5 8 16 21 34
 Calpurnia → 13 16

Query: Brutus AND Calpurnia AND Caesar

### Query optimization example

• Process in order of increasing freq:

• (This is why we kept document freq. in the dictionary)

 Brutus → 2 4 8 16 32 64 128
 Caesar → 1 2 3 5 8 16 21 34
 Calpurnia → 13 16

Execute the query as (Calpurnia AND Brutus) AND Caesar.
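The frequency-ordered strategy can be sketched as follows (a simplification: it filters against a set rather than merging sorted lists, but the ordering idea is the same):

```python
def intersect_many(postings_by_term):
    """AND together several postings lists,
    processing terms in order of increasing document frequency."""
    terms = sorted(postings_by_term, key=lambda t: len(postings_by_term[t]))
    result = postings_by_term[terms[0]]          # rarest term first
    for t in terms[1:]:
        members = set(postings_by_term[t])
        result = [d for d in result if d in members]
    return result

postings = {
    "Brutus":    [2, 4, 8, 16, 32, 64, 128],
    "Caesar":    [1, 2, 3, 5, 8, 16, 21, 34],
    "Calpurnia": [13, 16],
}
# Calpurnia (df = 2) goes first, keeping every intermediate result small;
# this mirrors the slide's order (Calpurnia AND Brutus) AND Caesar.
```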

### Boolean queries: Exact match

• The Boolean retrieval model answers any query that is a Boolean expression:
• Boolean Queries use AND, OR and NOT to join query terms

• Views each document as a set of words

• Is precise: document matches condition or not.

• Perhaps the simplest model to build an IR system on

• Primary commercial retrieval tool for 3 decades.

• Many search systems you still use are Boolean:

• Email, library catalog, Mac OS X Spotlight

### Boolean queries: More General Merges

• Exercise: Adapt the merge for the queries:

• Brutus AND NOT Caesar

• Brutus OR NOT Caesar

Can we still run through the merge in time O(x+y)?

What can we achieve?

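One possible answer for the AND NOT case, as a sketch: the same two-pointer walk works, emitting docs from the first list that the second list never matches, still in O(x + y). (Brutus OR NOT Caesar is harder: the answer can include almost every document, so linear-in-postings time is not enough on its own.)

```python
def and_not(p1, p2):
    """p1 AND NOT p2 over postings sorted by docID, in O(len(p1) + len(p2))."""
    answer, i, j = [], 0, 0
    while i < len(p1):
        if j == len(p2) or p1[i] < p2[j]:
            answer.append(p1[i])     # p2 has nothing left that can exclude p1[i]
            i += 1
        elif p1[i] == p2[j]:
            i += 1                   # present in p2: excluded
            j += 1
        else:
            j += 1                   # advance p2 to catch up
    return answer

# and_not([1, 2, 4, 11, 31], [1, 2, 3, 5, 8]) -> [4, 11, 31]
```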

### Example: WestLaw http://www.westlaw.com/

• Largest commercial (paying subscribers) legal search service (started 1975; ranking added 1992)

• Tens of terabytes of data; 700,000 users

• Majority of users still use boolean queries

• Example query:

• What is the statute of limitations in cases involving the federal tort claims act?

• LIMIT! /3 STATUTE ACTION /S FEDERAL /2 TORT /3 CLAIM

• /3 = within 3 words, /S = in same sentence

### Example: WestLaw http://www.westlaw.com/

• Another example query:

• Requirements for disabled people to be able to access a workplace

• disabl! /p access! /s work-site work-place (employment /3 place)

• Note that SPACE is disjunction, not conjunction!

• Long, precise queries; proximity operators; incrementally developed; not like web search

• Many professional searchers still like Boolean search

• You know exactly what you are getting

• But that doesn’t mean it actually works better….

### More general optimization

• e.g., (madding OR crowd) AND (ignoble OR strife)

• Get doc. freq.’s for all terms.

• Estimate the size of each OR by the sum of its doc. freq.’s (conservative).

• Process in increasing order of OR sizes.

### Exercise

• Recommend a query processing order for
• (tangerine OR trees) AND

• (kaleidoscope OR eyes)

Term          Freq
eyes          213312
kaleidoscope   87009
skies         271658
tangerine      46653
trees         316812

### Query processing exercises

• Exercise: If the query is friends AND romans AND (NOT countrymen), how could we use the freq of countrymen?

• Exercise: Extend the merge to an arbitrary Boolean query. Can we always guarantee execution in time linear in the total postings size?

• Hint: Begin with the case of a Boolean formula query where each term appears only once in the query.

### Exercise

• Try the search feature at

• Write down five search features you think it could do better

• Beyond term search


• Proximity: Find Gates NEAR Microsoft.

• Need index to capture position information in docs.

• Zones in documents: Find documents with
(author = Ullman) AND (text contains automata).

### Evidence accumulation

• 1 vs. 0 occurrence of a search term

• 2 vs. 1 occurrence

• 3 vs. 2 occurrences, etc.

• Usually more seems better

• Need term frequency information in docs

### Ranking search results

• Boolean queries give inclusion or exclusion of docs.

• Often we want to rank/group results

• Need to measure proximity from query to each doc.

• Need to decide whether docs presented to user are singletons, or a group of docs covering various aspects of the query.

### IR vs. databases: Structured vs unstructured data

• Structured data tends to refer to information in “tables”

 Employee   Manager   Salary
 Smith      Jones     50000
 Chang      Smith     60000
 Ivy        Smith     50000

• Typically allows numerical range and exact match (for text) queries, e.g.,

• Salary < 60000 AND Manager = Smith.

### Unstructured data

• Typically refers to free-form text

• Allows

• Keyword queries including operators

• More sophisticated “concept” queries, e.g.,

• find all web pages dealing with drug abuse

• Classic model for searching text documents

### Semi-structured data

• In fact almost no data is “unstructured”

• E.g., this slide has distinctly identified zones such as the Title and Bullets

• Facilitates “semi-structured” search such as

• Title contains data AND Bullets contain search

… to say nothing of linguistic structure

### More sophisticated semi-structured search

• Title is about Object Oriented Programming AND Author something like stro*rup

• where * is the wild-card operator

• Issues:

• how do you process “about”?

• how do you rank results?

• The focus of XML search (IIR chapter 10)

### Clustering, classification and ranking

• Clustering: Given a set of docs, group them into clusters based on their contents.

• Classification: Given a set of topics, plus a new doc D, decide which topic(s) D belongs to.

• Ranking: Can we learn how to best order a set of documents, e.g., a set of search results

### The web and its challenges

• Unusual and diverse documents

• Unusual and diverse users, queries, information needs

• Beyond terms, exploit ideas from social networks

• How do search engines work?
And how can we make them better?

### More sophisticated information retrieval

• Cross-language information retrieval

• Summarization

• Text mining

### Resources for today’s lecture

• Introduction to Information Retrieval, chapter 1

• Shakespeare:

• Try the neat browse by keyword sequence feature!

• Managing Gigabytes, chapter 3.2

• Modern Information Retrieval, chapter 8.2

### Introduction to Information Retrieval

CS276: Information Retrieval and Web Search

Pandu Nayak and Prabhakar Raghavan

Lecture 2: The term vocabulary and postings lists

### Recap of the previous lecture

• Basic inverted indexes:

• Structure: Dictionary and Postings

• Key step in construction: Sorting

• Boolean query processing

• Intersection by linear time “merging”

• Simple optimizations

• Overview of course topics

### Plan for this lecture

Elaborate basic indexing

• Preprocessing to form the term vocabulary

• Documents

• Tokenization

• What terms do we put in the index?

• Postings

• Faster merges: skip lists

• Positional postings and phrase queries

### Parsing a document

• What format is it in?
• pdf/word/excel/html?

• What language is it in?

• What character set is in use?

Each of these is a classification problem, which we will study later in the course.

But these tasks are often done heuristically …

### Complications: Format/language

• Documents being indexed can include docs from many different languages

• A single index may have to contain terms of several languages.

• Sometimes a document or its components can contain multiple languages/formats

• French email with a German pdf attachment.

• What is a unit document?

• A file?

• An email? (Perhaps one of many in an mbox.)

• An email with 5 attachments?

• A group of files (PPT or LaTeX as HTML pages)

## TOKEN AND TERMS

### Tokenization

• Input: “Friends, Romans, Countrymen”

• Output: Tokens

• Friends

• Romans

• Countrymen

• A token is a sequence of characters in a document

• Each such token is now a candidate for an index entry, after further processing

• Described below

• But what are valid tokens to emit?

### Tokenization

• Issues in tokenization:

• Finland’s capital →

• Finland? Finlands? Finland’s?

• Hewlett-Packard → Hewlett and Packard as two tokens?

• state-of-the-art: break up hyphenated sequence.

• co-education

• lowercase, lower-case, lower case ?

• It can be effective to get the user to put in possible hyphens

• San Francisco: one token or two?

• How do you decide it is one token?
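A minimal baseline tokenizer can be sketched with a regular expression; the pattern below is an assumption for illustration, and it dodges every hard case listed above (it splits Hewlett-Packard, keeps Finland's together, and has no idea San Francisco is one unit):

```python
import re

def tokenize(text):
    """Emit runs of letters, digits, and apostrophes as tokens.
    Real tokenizers need per-application policies for hyphens,
    multiword names, and so on."""
    return re.findall(r"[A-Za-z0-9']+", text)

tokenize("Friends, Romans, Countrymen")  # ['Friends', 'Romans', 'Countrymen']
```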

### Numbers

• 3/12/91                Mar. 12, 1991                     12/3/91

• 55 B.C.

• B-52

• My PGP key is 324a3df234cb23e

• (800) 234-2333

• Often have embedded spaces

• Older IR systems may not index numbers

• But often very useful: think about things like looking up error codes/stacktraces on the web

• (One answer is using n-grams: Lecture 3)

• Will often index “meta-data” separately

• Creation date, format, etc.

### Tokenization: language issues

• French

• L'ensemble    one token or two?

• L ? L’ ? Le ?

• Want l’ensemble to match with un ensemble

• Until at least 2003, it didn’t on Google

• Internationalization!
• German noun compounds are not segmented

• Lebensversicherungsgesellschaftsangestellter

• ‘life insurance company employee’

• German retrieval systems benefit greatly from a compound splitter module

• Can give a 15% performance boost for German

### Tokenization: language issues

• Chinese and Japanese have no spaces between words:

• 莎拉波娃现在居住在美国东南部的佛罗里达。
• Not always guaranteed a unique tokenization
• Further complicated in Japanese, with multiple alphabets intermingled

• Dates/amounts in multiple formats

End-user can express query entirely in hiragana!

### Tokenization: language issues

• Arabic (or Hebrew) is basically written right to left, but with certain items like numbers written left to right

• Words are separated, but letter forms within a word form complex ligatures

• [Arabic example sentence: the words read right to left, while the embedded numerals 132 and 1962 read left to right, so reading starts at the right]

• ‘Algeria achieved its independence in 1962 after 132 years of French occupation.’

• With Unicode, the surface presentation is complex, but the stored form is straightforward

### Stop words

• With a stop list, you exclude from the dictionary entirely the commonest words. Intuition:

• They have little semantic content: the, a, and, to, be

• There are a lot of them: ~30% of postings for top 30 words

• But the trend is away from doing this:

• Good compression techniques (lecture 5) mean the space for including stop words in a system is very small

• Good query optimization techniques (lecture 7) mean you pay little at query time for including stop words.

• You need them for:

• Phrase queries: “King of Denmark”

• Various song titles, etc.: “Let it be”, “To be or not to be”

• “Relational” queries: “flights to London”

### Normalization to terms

• We need to “normalize” words in indexed text as well as query words into the same form

• We want to match U.S.A. and USA

• Result is terms: a term is a (normalized) word type, which is an entry in our IR system dictionary

• We most commonly implicitly define equivalence classes of terms by, e.g.,

• deleting periods to form a term

• U.S.A., USA → USA

• deleting hyphens to form a term

• anti-discriminatory, antidiscriminatory → antidiscriminatory
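The two equivalence-classing rules above (delete periods, delete hyphens) combine naturally with case folding into a single normalization step. A sketch of one possible policy, not the only reasonable one:

```python
def normalize(token):
    """Map a token to its equivalence-class term:
    drop periods and hyphens, then case-fold."""
    return token.replace(".", "").replace("-", "").lower()

normalize("U.S.A.")               # 'usa'
normalize("anti-discriminatory")  # 'antidiscriminatory'
```

Note the requirement from the slide: the same function must be applied to indexed text and to query terms, or U.S.A. in a query will never match USA in a document.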

### Normalization: other languages

• Accents: e.g., French résumé vs. resume.

• Umlauts: e.g., German: Tuebingen vs. Tübingen

• Should be equivalent

• Most important criterion:

• How are your users likely to write their queries for these words?

• Even in languages that standardly have accents, users often may not type them

• Often best to normalize to a de-accented term

• Tuebingen, Tübingen, Tubingen → Tubingen

### Normalization: other languages

• Normalization of things like date forms

• 7月30日 vs. 7/30
• Japanese use of kana vs. Chinese characters
• Tokenization and normalization may depend on the language and so are intertwined with language detection

• Crucial: Need to “normalize” indexed text as well as query terms into the same form

### Case folding

• Reduce all letters to lower case

• exception: upper case in mid-sentence?

• e.g., General Motors
• Fed vs. fed
• SAIL vs. sail
• Often best to lower case everything, since users will use lowercase regardless of ‘correct’ capitalization
 Google example: Query C.A.T. → #1 result was for “cat” (well, Lolcats), not Caterpillar Inc.

### Normalization to terms

• An alternative to equivalence classing is to do asymmetric expansion

• An example of where this may be useful

• Enter: window Search: window, windows

• Enter: windows Search: Windows, windows, window

• Enter: Windows Search: Windows

• Potentially more powerful, but less efficient

### Lemmatization

• Reduce inflectional/variant forms to base form

• E.g.,

• am, are, is → be

• car, cars, car's, cars' → car

• the boy's cars are different colors → the boy car be different color

• Lemmatization implies doing “proper” reduction to dictionary headword form

### Stemming

• Reduce terms to their “roots” before indexing

• “Stemming” suggests crude affix chopping

• language dependent

• e.g., automate(s), automatic, automation all reduced to automat.

### Porter’s algorithm

• Commonest algorithm for stemming English

• Results suggest it’s at least as good as other stemming options

• Conventions + 5 phases of reductions

• phases applied sequentially

• each phase consists of a set of commands

• sample convention: Of the rules in a compound command, select the one that applies to the longest suffix.

### Typical rules in Porter

• sses → ss

• ies → i

• ational → ate

• tional → tion

• Rules sensitive to the measure of words

• (m>1) EMENT → (null)

• replacement → replac

• cement → cement
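The longest-suffix convention from the previous slide can be illustrated with a toy rule set. This is emphatically not the full Porter algorithm (no measure conditions, only four rules), just a sketch of how one compound command selects among its alternatives:

```python
# A fragment of Porter-style suffix rules: (pattern, replacement) pairs.
RULES = [("sses", "ss"), ("ies", "i"), ("ational", "ate"), ("tional", "tion")]

def apply_rules(word):
    """Apply the single rule whose suffix is longest among those that match,
    per Porter's compound-command convention."""
    best = max((r for r in RULES if word.endswith(r[0])),
               key=lambda r: len(r[0]), default=None)
    if best is None:
        return word
    suffix, repl = best
    return word[: -len(suffix)] + repl

apply_rules("caresses")    # 'caress'
apply_rules("relational")  # 'relate'  ('ational' beats 'tional' on length)
```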

### Other stemmers

• Other stemmers exist, e.g., Lovins stemmer

• http://www.comp.lancs.ac.uk/computing/research/stemming/general/lovins.htm

• Single-pass, longest suffix removal (about 250 rules)

• Full morphological analysis – at most modest benefits for retrieval

• Do stemming and other normalizations help?

• English: very mixed results. Helps recall but harms precision

• operative (dentistry) ⇒ oper

• operational (research) ⇒ oper

• operating (systems) ⇒ oper

• Definitely useful for Spanish, German, Finnish, …

• 30% performance gains for Finnish!

### Thesauri and soundex

• Do we handle synonyms and homonyms?

• E.g., by hand-constructed equivalence classes

• car = automobile      color = colour

• We can rewrite to form equivalence-class terms

• When the document contains automobile, index it under car-automobile (and vice-versa)

• Or we can expand a query

• When the query contains automobile, look under car as well

• One approach is soundex, which forms equivalence classes of words based on phonetic heuristics

• More in lectures 3 and 9
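Soundex is concrete enough to sketch: the classic coding keeps the first letter, maps similar-sounding consonants to a shared digit, drops vowels, and pads or truncates to four characters. A compact Python version of the standard rules:

```python
def soundex(name):
    """Classic Soundex: first letter + three digits.
    Similar consonants share a digit; vowels, h, w, y carry no digit."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    name = name.lower()
    result = name[0].upper()
    prev = codes.get(name[0], "")
    for ch in name[1:]:
        code = codes.get(ch, "")
        if code and code != prev:    # collapse runs of the same digit
            result += code
        if ch not in "hw":           # h and w do not break a run of equal codes
            prev = code
    return (result + "000")[:4]      # pad with zeros, keep 4 characters

soundex("Robert")  # 'R163'
soundex("Rupert")  # 'R163'  -- the two names fall in one equivalence class
```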

### Language-specificity

• Many of the above features embody transformations that are

• Language-specific and

• Often, application-specific

• These are “plug-in” addenda to the indexing process

• Both open source and commercial plug-ins are available for handling these

### Faster postings merges: Skip pointers/Skip lists

### Recall basic merge

• Walk through the two postings simultaneously, in time linear in the total number of postings entries

Can we do better? Yes (if index isn’t changing too fast).

### Augment postings with skip pointers (at indexing time)

• Why?

• To skip postings that will not figure in the search results.

• How?

• Where do we place skip pointers?

### Query processing with skip pointers

Suppose we’ve stepped through the lists until we process 8 on each list. We match it and advance.

We then have 41 on the upper list and 11 on the lower. 11 is smaller.

But the skip successor of 11 on the lower list is 31, so we can skip ahead past the intervening postings.
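The walkthrough above can be sketched in code. This version models √L evenly spaced skip pointers implicitly (a skip exists at every index divisible by the span) rather than storing them, which is a simplification of an on-disk layout:

```python
import math

def intersect_with_skips(p1, p2):
    """Intersect sorted postings lists, following a skip pointer whenever
    its target is still <= the docID on the other list."""
    def span(p):                      # sqrt(L) evenly spaced skips
        return max(int(math.sqrt(len(p))), 1)
    s1, s2 = span(p1), span(p2)
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # Skip only if a pointer exists here and its target can't overshoot.
            if i % s1 == 0 and i + s1 < len(p1) and p1[i + s1] <= p2[j]:
                i += s1
            else:
                i += 1
        else:
            if j % s2 == 0 and j + s2 < len(p2) and p2[j + s2] <= p1[i]:
                j += s2
            else:
                j += 1
    return answer
```

The result is identical to the plain merge; only the number of comparisons changes, and only when long stretches of one list can be bypassed.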

### Where do we place skips?

• More skips → shorter skip spans ⇒ more likely to skip. But lots of comparisons to skip pointers.

• Fewer skips → few pointer comparisons, but then long skip spans ⇒ few successful skips.

### Placing skips

• Simple heuristic: for postings of length L, use √L evenly-spaced skip pointers.

• This ignores the distribution of query terms.

• Easy if the index is relatively static; harder if L keeps changing because of updates.

• This definitely used to help; with modern hardware it may not (Bahle et al. 2002) unless you’re memory-based

• The I/O cost of loading a bigger postings list can outweigh the gains from quicker in memory merging!

### Phrase queries and positional indexes

### Phrase queries

• Want to be able to answer queries such as “stanford university” – as a phrase

• Thus the sentence “I went to university at Stanford” is not a match.

• The concept of phrase queries has proven easily understood by users; one of the few “advanced search” ideas that works

• Many more queries are implicit phrase queries

• For this, it no longer suffices to store only

• <term : docs> entries