### Introduction to Information Retrieval

CS276

Information Retrieval and Web Search

Pandu Nayak and Prabhakar Raghavan

Lecture 1: Boolean retrieval

### Information Retrieval

• Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers)

### Unstructured data in 1680

• Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?
• One could grep all of Shakespeare’s plays for Brutus and Caesar, then strip out lines containing Calpurnia?
• Why is that not the answer?
• Slow (for large corpora)
• NOT Calpurnia is non-trivial
• Other operations (e.g., find the word Romans near countrymen) not feasible
• Ranked retrieval (best documents to return)
• Later lectures

### Term-document incidence

Brutus AND Caesar BUT NOT                     1 if play contains word

Calpurnia                                                                 0 otherwise

### Incidence vectors

• So we have a 0/1 vector for each term.

• To answer query: take the vectors for Brutus, Caesar and Calpurnia (complemented)  → bitwise AND.

• 110100 AND 110111 AND 101111 = 100100.

### Incidence vectors

• So we have a 0/1 vector for each term.

• To answer query: take the vectors for Brutus, Caesar and Calpurnia (complemented)  → bitwise AND.

• 110100 AND 110111 AND 101111 = 100100.

### Basic assumptions of Information Retrieval

• Collection: Fixed set of documents
• Goal: Retrieve documents with information that is relevant to the user’s information need and helps the user complete a task.

### How good are the retrieved docs?

• Precision : Fraction of retrieved docs that are relevant to user’s information need

• Recall : Fraction of relevant docs in collection that are retrieved

• More precise definitions and measurements to follow in later lectures

### Bigger collections

• Consider N = 1 million documents, each with about 1000 words.

• Avg 6 bytes/word including spaces/punctuation

• 6GB of data in the documents.

• Say there are M = 500K distinct terms among these.

### Can’t build the matrix

• 500K x 1M matrix has half-a-trillion 0’s and 1’s.

• But it has no more than one billion 1’s.                               Why?

• matrix is extremely sparse.

• What’s a better representation?

• We only record the 1 positions.

### Inverted index

• For each term t, we must store a list of all documents that contain t.
• Identify each by a docID, a document serial number
• Can we use fixed-size arrays for this?
 Brutus → 1 2 4 11 31 45 173 174 Caesar → 1 2 4 5 6 16 57 32 Calpurnia → 2 31 54 101

What happens if the word Caesar is added to document 14?

### Inverted index

• We need variable-size postings lists

• On disk, a continuous run of postings is normal and best

• In memory, can use linked lists or variable length arrays

• Some tradeoffs in size/ease of insertion                         Posting(below)

 Brutus → 1 2 4 11 31 45 173 174 Caesar → 1 2 4 5 6 16 57 32 Calpurnia → 2 31 54 101

Brutus , Caeser and Calpurnia are dictionaries.

Numbers are postings.

Sorted by docID (more later on why).

### The index we just built

• How do we process a query?

• Later - what kinds of queries can we process?                ← Today’s focus

### Query processing: AND

• Consider processing the query: Brutus AND Caesar

• Locate Brutus in the Dictionary;

• Retrieve its postings.

• Locate Caesar in the Dictionary;

• Retrieve its postings.

• Merge” the two postings:

### The merge

• Walk through the two postings simultaneously, in time linear in the total number of postings entries

### Query optimization

• What is the best order for query processing?

• Consider a query that is an AND of n terms.

• For each of the n terms, get its postings, then AND them together.

 Brutus → 2 4 8 16 32 64 128 Caesar → 1 2 3 5 8 16 21 34 Calpurnia → 13 16
Query: Brutus AND Calpurnia AND Caesar

### Query optimization example

• Process in order of increasing freq:

• start with smallest set, then keep cutting further.

↑

(This is why we kept document freq. in dictionary)

 Brutus → 2 4 8 16 32 64 128 Caesar → 1 2 3 5 8 16 21 34 Calpurnia → 13 16

Execute the query as (Calpurnia AND Brutus) AND Caesar.

### Boolean queries: Exact match

• The Boolean retrieval model is being able to ask a query that is a Boolean expression:
• Boolean Queries use AND, OR and NOT to join query terms

• Views each document as a set of words

• Is precise: document matches condition or not.

• Perhaps the simplest model to build an IR system on

• Primary commercial retrieval tool for 3 decades.

• Many search systems you still use are Boolean:

• Email, library catalog, Mac OS X Spotlight

### Boolean queries: More General Merges

• Exercise: Adapt the merge for the queries:

• Brutus AND NOT Caesar

• Brutus OR NOT Caesar

Can we still run through the merge in time O(x+y)?

What can we achieve?

### Boolean queries: More General Merges

• Exercise: Adapt the merge for the queries:

• Brutus AND NOT Caesar

• Brutus OR NOT Caesar

Can we still run through the merge in time O(x+y)?

What can we achieve?

### Example: WestLaw http://www.westlaw.com/

• Largest commercial (paying subscribers) legal search service (started 1975; ranking added 1992)

• Tens of terabytes of data; 700,000 users

• Majority of users still use boolean queries

• Example query:

• What is the statute of limitations in cases involving the federal tort claims act?

• LIMIT! /3 STATUTE ACTION /S FEDERAL /2 TORT /3 CLAIM

• /3 = within 3 words, /S = in same sentence

### Example: WestLaw http://www.westlaw.com/

• Another example query:

• Requirements for disabled people to be able to access a workplace

• disabl! /p access! /s work-site work-place (employment /3 place)

• Note that SPACE is disjunction, not conjunction!

• Long, precise queries; proximity operators; incrementally developed; not like web search

• Many professional searchers still like Boolean search

• You know exactly what you are getting

• But that doesn’t mean it actually works better….

### More general optimization

• e.g., (madding OR crowd) AND (ignoble OR strife)

• Get doc. freq.’s for all terms.

• Estimate the size of each OR by the sum of its doc. freq.’s (conservative).

• Process in increasing order of OR sizes.

### Exercise

• Recommend a query processing order for
• (tangerine OR trees) AND

• (marmalade OR skies) AND

• (kaleidoscope OR eyes)

Term                            Freq

eyes                              213312

kaleidoscope                 87009

marmalade                   107913

skies                             271658

tangerine                       46653

trees                             316812

### Query processing exercises

• Exercise: If the query is friends AND romans AND (NOT countrymen), how could we use the freq of countrymen?

• Exercise: Extend the merge to an arbitrary Boolean query. Can we always guarantee execution in time linear in the total postings size?

• Hint: Begin with the case of a Boolean formula query where each term appears only once in the query.

### Exercise

• Try the search feature at

• Write down five search features you think it could do better

### What’s ahead in IR?

• Beyond term search

• What about phrases?

• Stanford University

• Proximity: Find Gates NEAR Microsoft.

• Need index to capture position information in docs.

• Zones in documents: Find documents with
(author = Ullman) AND (text contains automata).

### Evidence accumulation

• 1 vs. 0 occurrence of a search term

• 2 vs. 1 occurrence

• 3 vs. 2 occurrences, etc.

• Usually more seems better

• Need term frequency information in docs

### Ranking search results

• Boolean queries give inclusion or exclusion of docs.

• Often we want to rank/group results

• Need to measure proximity from query to each doc.

• Need to decide whether docs presented to user are singletons, or a group of docs covering various aspects of the query.

### IR vs. databases: Structured vs unstructured data

• Structured data tends to refer to information in “tables”

 Employee Manager Salary Smith Jones 50000 Chang Smith 60000 Ivy Smith 50000
• Typically allows numerical range and exact match

• (for text) queries, e.g.,

• Salary < 60000 AND Manager = Smith.

### Unstructured data

• Typically refers to free-form text

• Allows

• Keyword queries including operators

• More sophisticated “concept” queries, e.g.,

• find all web pages dealing with drug abuse

• Classic model for searching text documents

### Semi-structured data

• In fact almost no data is “unstructured”

• E.g., this slide has distinctly identified zones such as the Title and Bullets

• Facilitates “semi-structured” search such as

• Title contains data AND Bullets contain search

… to say nothing of linguistic structure

### More sophisticated semi-structured search

• Title is about Object Oriented Programming AND Author something like stro*rup

• where * is the wild-card operator

• Issues:

• how do you process “about”?

• how do you rank results?

• The focus of XML search (IIR chapter 10)

### Clustering, classification and ranking

• Clustering: Given a set of docs, group them into clusters based on their contents.

• Classification: Given a set of topics, plus a new doc D, decide which topic(s) D belongs to.

• Ranking: Can we learn how to best order a set of documents, e.g., a set of search results

### The web and its challenges

• Unusual and diverse documents

• Unusual and diverse users, queries, information needs

• Beyond terms, exploit ideas from social networks

• link analysis, clickstreams ...

• How do search engines work?
And how can we make them better?

### More sophisticated information retrieval

• Cross-language information retrieval

• Question answering

• Summarization

• Text mining

### Resources for today’s lecture

• Introduction to Information Retrieval, chapter 1

• Shakespeare:

• Try the neat browse by keyword sequence feature!

• Managing Gigabytes, chapter 3.2

• Modern Information Retrieval, chapter 8.2

### Introduction to Information Retrieval

CS276: Information Retrieval and Web Search

Pandu Nayak and Prabhakar Raghavan

Lecture 2: The term vocabulary and postings lists

### Recap of the previous lecture

• Basic inverted indexes:

• Structure: Dictionary and Postings

• Key step in construction: Sortingli>

• Boolean query processing

• Intersection by linear time “merging”

• Simple optimizations

• Overview of course topics

### Plan for this lecture

Elaborate basic indexing

• Preprocessing to form the term vocabulary

• Documents

• Tokenization

• What terms do we put in the index?

• Postings

• Faster merges: skip lists

• Positional postings and phrase queries

### Parsing a document

• What format is it in?
• pdf/word/excel/html?

• What language is it in?

• What character set is in use?

Each of these is a classification problem, which we will study later in the course.

But these tasks are often done heuristically …

### Complications: Format/language

• Documents being indexed can include docs from many different languages

• A single index may have to contain terms of several languages.

• Sometimes a document or its components can contain multiple languages/formats

• French email with a German pdf attachment.

• What is a unit document?

• A file?

• An email? (Perhaps one of many in an mbox.)

• An email with 5 attachments?

• A group of files (PPT or LaTeX as HTML pages)

## TOKEN AND TERMS

### Tokenization

• Input: “Friends, Romans, Countrymen

• Output: Tokens

• Friends

• Romans

• Countrymen

• A token is a sequence of characters in a document

• Each such token is now a candidate for an index entry, after further processing

• Described below

• But what are valid tokens to emit?

### Tokenization

• Issues in tokenization:

• Finland’s capital →

• Finland? Finlands? Finland’s?

• Hewlett-Packard → Hewlett and Packard as two tokens?

• state-of-the-art: break up hyphenated sequence.

• co-education

• lowercase, lower-case, lower case ?

• It can be effective to get the user to put in possible hyphens

• San Francisco: one token or two?

• How do you decide it is one token?

### Numbers

• 3/12/91                Mar. 12, 1991                     12/3/91

• 55 B.C.

• B-52

• My PGP key is 324a3df234cb23e

• (800) 234-2333

• Often have embedded spaces

• Older IR systems may not index numbers

• But often very useful: think about things like looking up error codes/stacktraces on the web

• (One answer is using n-grams: Lecture 3)

• Will often index “meta-data” separately

• Creation date, format, etc.

### Tokenization: language issues

• French

• L'ensemble    one token or two?

• L ? L’ ? Le ?

• Want l’ensemble to match with un ensemble

• Until at least 2003, it didn’t on Google

• Internationalization!
• German noun compounds are not segmented

• Lebensversicherungsgesellschaftsangestellter

• ‘life insurance company employee’

• German retrieval systems benefit greatly from a compound splitter module

• Can give a 15% performance boost for German

### Tokenization: language issues

• Chinese and Japanese have no spaces between words:

• 莎拉波娃现在居住在美国东南部的佛罗里达。
• Not always guaranteed a unique tokenization
• Further complicated in Japanese, with multiple alphabets intermingled

• Dates/amounts in multiple formats

End-user can express query entirely in hiragana!

### Tokenization: language issues

• Arabic (or Hebrew) is basically written right to left, but with certain items like numbers written left to right

• Words are separated, but letter forms within a word form complex ligatures

•                                        ← →    ← →                                  ← start

• ‘Algeria achieved its independence in 1962 after 132 years of French occupation.’

• With Unicode, the surface presentation is complex, but the stored form is straightforward

### Stop words

• With a stop list, you exclude from the dictionary entirely the commonest words. Intuition:

• They have little semantic content: the, a, and, to, be

• There are a lot of them: ~30% of postings for top 30 words

• But the trend is away from doing this:

• Good compression techniques (lecture 5) means the space for including stopwords in a system is very small

• Good query optimization techniques (lecture 7) mean you pay little at query time for including stop words.

• You need them for:

• Phrase queries: “King of Denmark”

• Various song titles, etc.: “Let it be”, “To be or not to be”

• “Relational” queries: “flights to London”

### Normalization to terms

• We need to “normalize” words in indexed text as well as query words into the same form

• We want to match U.S.A. and USA

• Result is terms: a term is a (normalized) word type, which is an entry in our IR system dictionary

• We most commonly implicitly define equivalence classes of terms by, e.g.,

• deleting periods to form a term

• U.S.A., USA ⌊ USA

• deleting hyphens to form a term

• anti-discriminatory, antidiscriminatory ⌊ antidiscriminatory

### Normalization: other languages

• Accents: e.g., French résumé vs. resume.

• Umlauts: e.g., German: Tuebingen vs. Tübingen

• Should be equivalent

• Most important criterion:

• How are your users like to write their queries for these words?

• Even in languages that standardly have accents, users often may not type them

• Often best to normalize to a de-accented term

• Tuebingen, Tübingen, Tubingen    Tubingen

### Normalization: other languages

• Normalization of things like date forms

• 7月30日 vs. 7/30
• Japanese use of kana vs. Chinese characters
• Tokenization and normalization may depend on the language and so is intertwined with language detection

• Crucial: Need to “normalize” indexed text as well as query terms into the same form

### Case folding

• Reduce all letters to lower case

• exception: upper case in mid-sentence?

• e.g., General Motors
• Fed vs. fed
• SAIL vs. sail
• Often best to lower case everything, since users will use lowercase regardless of ‘correct’ capitalization
 Google example:Query C.A.T. #1 result was for “cat” (well, Lolcats) not Caterpillar Inc.

### Normalization to terms

• An alternative to equivalence classing is to do asymmetric expansion

• An example of where this may be useful

• Enter: window Search: window, windows

• Enter: windows Search: Windows, windows, window

• Enter: Windows Search: Windows

• Potentially more powerful, but less efficient

### Lemmatization

• Reduce inflectional/variant forms to base form

• E.g.,

• am, are, is  → be

• car, cars, car's, cars'    car

• the boy's cars are different colorsthe boy car be different color

• Lemmatization implies doing “proper” reduction to dictionary headword form

### Stemming

• Reduce terms to their “roots” before indexing

• “Stemming” suggest crude affix chopping

• language dependent

• e.g., automate(s), automatic, automation all reduced to automat.

### Porter’s algorithm

• Commonest algorithm for stemming English

• Results suggest it’s at least as good as other stemming options

• Conventions + 5 phases of reductions

• phases applied sequentially

• each phase consists of a set of commands

• sample convention: Of the rules in a compound command, select the one that applies to the longest suffix.

### Typical rules in Porter

• sses → ss

• ies i

• ational    ate

• tional    tion

• Rules sensitive to the measure of words

• (m>1) EMENT

• replacementreplac

• cement cement

### Other stemmers

• Other stemmers exist, e.g., Lovins stemmer

• http://www.comp.lancs.ac.uk/computing/research/stemming/general/lovins.htm

• Single-pass, longest suffix removal (about 250 rules)

• Full morphological analysis – at most modest benefits for retrieval

• Do stemming and other normalizations help?

• English: very mixed results. Helps recall but harms precision

• operative (dentistry) ⇒ oper

• operational (research) ⇒ oper

• operating (systems) ⇒ oper

• Definitely useful for Spanish, German, Finnish, …

• 30% performance gains for Finnish!

### Thesauri and soundex

• Do we handle synonyms and homonyms?

• E.g., by hand-constructed equivalence classes

• car = automobile      color = colour

• We can rewrite to form equivalence-class terms

• When the document contains automobile, index it under car-automobile (and vice-versa)

• Or we can expand a query

• When the query contains automobile, look under car as well

• What about spelling mistakes?

• One approach is soundex, which forms equivalence classes of words based on phonetic heuristics

• More in lectures 3 and 9

### Language-specificity

• Many of the above features embody transformations that are

• Language-specific and

• Often, application-specific

• These are “plug-in” addenda to the indexing process

• Both open source and commercial plug-ins are available for handling these

### Faster postıngs merges:

Faster postıngs merges:
Skıp poınters/Skıp lısts

### Recall basic merge

• Walk through the two postings simultaneously, in time linear in the total number of postings entries

Can we do better? Yes (if index isn’t changing too fast).

### Augment postings with skip pointers (at indexing time)

• Why?

• To skip postings that will not figure in the search results.

• How?

• Where do we place skip pointers?

### Query processing with skip pointers

Suppose we’ve stepped through the lists until we process 8 on each list. We match it and advance.

We then have 41 and 11 on the lower. 11 is smaller.

But the skip successor of 11 on the lower list is 31, so we can skip ahead past the intervening postings.

### Where do we place skips?

• Tradeoff:

• More skips →shorter skip spans ⇒more likely to skip. But lots of comparisons to skip pointers.

• Fewer skips  few pointer comparison, but then long skip spans    few successful skips.

### Placing skips

• Simple heuristic: for postings of length L, use √L evenly-spaced skip pointers.

• This ignores the distribution of query terms.

• Easy if the index is relatively static; harder if L keeps changing because of updates.

• This definitely used to help; with modern hardware it may not (Bahle et al. 2002) unless you’re memory-based

• The I/O cost of loading a bigger postings list can outweigh the gains from quicker in memory merging!

### Phrase queries and positiona...

Phrase querıes and posıtıonal ındexes

### Phrase queries

• Want to be able to answer queries such as “stanford university” – as a phrase

• Thus the sentence “I went to university at Stanford” is not a match.

• The concept of phrase queries has proven easily understood by users; one of the few “advanced search” ideas that works

• Many more queries are implicit phrase queries

• For this, it no longer suffices to store only

• <term : docs> entries

### A first attempt: Biword indexes

• Index every consecutive pair of terms in the text as a phrase

• For example the text “Friends, Romans, Countrymen” would generate the biwords

• friends romans

• romans countrymen

• Each of these biwords is now a dictionary term

• Two-word phrase query-processing is now immediate.

### Longer phrase queries

• Longer phrases are processed as we did with wild-cards:

• stanford university palo alto can be broken into the Boolean query on biwords:

• stanford university AND university palo AND palo alto

• Without the docs, we cannot verify that the docs matching the above Boolean query do contain the phrase.

↑   Can have false positives!

### Extended biwords

• Parse the indexed text and perform part-of-speech-tagging (POST).

• Bucket the terms into (say) Nouns (N) and articles/prepositions (X).

• Call any string of terms of the form NX*N an extended biword.

• Each such extended biword is now made a term in the dictionary.

• Example: catcher in the rye

N           X   X    N

• Query processing: parse it into N’s and X’s

• Segment query into enhanced biwords

• Look up in index: catcher rye

### Issues for biword indexes

• False positives, as noted before

• Index blowup due to bigger dictionary

• Infeasible for more than biwords, big even for them

• Biword indexes are not the standard solution (for all biwords) but can be part of a compound strategy

### Solution 2: Positional indexes

• In the postings, store for each term the position(s) in which tokens of it appear:

<term, number of docs containing term;

doc1: position1, position2 … ;

doc2: position1, position2 … ;

etc.>

### Positional index example

• For phrase queries, we use a merge algorithm recursively at the document level

• But we now need to deal with more than just equality

### Processing a phrase query

• Extract inverted index entries for each distinct term: to, be, or, not.

• Merge their doc:position lists to enumerate all positions with “to be or not to be”.

• to:

• 2:1,17,74,222,551; 4:8,16,190,429,433; 7:13,23,191; ...

• be:

• 1:17,19; 4:17,191,291,430,434; 5:14,19,101; ...

• Same general method for proximity searches

### Proximity queries

• LIMIT! /3 STATUTE /3 FEDERAL /2 TORT

• Again, here, /k means “within k words of”.

• Clearly, positional indexes can be used for such queries; biword indexes cannot.

• Exercise: Adapt the linear merge of postings to handle proximity queries. Can you make it work for any value of k?

• This is a little tricky to do correctly and efficiently

• See Figure 2.12 of IIR

• There’s likely to be a problem on it!

### Positional index size

• You can compress position values/offsets: we’ll talk about that in lecture 5

• Nevertheless, a positional index expands postings storage substantially

• Nevertheless, a positional index is now standardly used because of the power and usefulness of phrase and proximity queries … whether used explicitly or implicitly in a ranking retrieval system.

### Positional index size

• Need an entry for each occurrence, not just once per document

• Index size depends on average document size                    ←  Why?

• Average web page has <1000 _tmplitem="94" terms

• SEC filings, books, even some epic poems … easily 100,000 terms

• Consider a term with frequency 0.1%

### Rules of thumb

• A positional index is 2–4 as large as a non-positional index

• Positional index size 35–50% of volume of original text

• Caveat: all of this holds for “English-like” languages

### Combination schemes

• These two approaches can be profitably combined

• For particular phrases (“Michael Jackson”, “Britney Spears”) it is inefficient to keep on merging positional postings lists

• Even more so for phrases like “The Who”

• Williams et al. (2004) evaluate a more sophisticated mixed indexing scheme

• A typical web query mixture was executed in ¼ of the time of using just a positional index

• It required 26% more space than having a positional index alone

### Resources for today’s lecture

• IIR 2

• MG 3.6, 4.3; MIR 7.2

• Skip Lists theory: Pugh (1990)

• Multilevel skip lists give same O(log n) efficiency as trees

• D. Bahle, H. Williams, and J. Zobel. Efficient phrase querying with an auxiliary index. SIGIR 2002, pp. 215-221.

### Introduction to Information Retrieval

CS276: Information Retrieval and Web Search

Pandu Nayak and Prabhakar Raghavan

Lecture 3: Dictionaries and tolerant retrieval

### Recap of the previous lecture

• The type/token distinction

• Terms are normalized types put in the dictionary

• Tokenization problems:

• Hyphens, apostrophes, compounds, CJK

• Term equivalence classing:

• Numbers, case folding, stemming, lemmatization

• Skip pointers

• Encoding a tree-like structure in a postings list

• Biword indexes for phrases

• Positional indexes for phrases/proximity queries

### This lecture

• Dictionary data structures

• “Tolerant” retrieval

• Wild-card queries

• Spelling correction

• Soundex

### Dictionary data structures for inverted indexes

• The dictionary data structure stores the term vocabulary, document frequency, pointers to each postings list … in what data structure?

### A naïve dictionary

• An array of struct:

char[20]      int                           Postings *

20 bytes      4/8 bytes                 4/8 bytes

• How do we store a dictionary in memory efficiently?

• How do we quickly look up elements at query time?

### Dictionary data structures

• Two main choices:

• Hashtables

• Trees

• Some IR systems use hashtables, some trees

### Hashtables

• Each vocabulary term is hashed to an integer

• (We assume you’ve seen hashtables before)

• Pros:

• Lookup is faster than for a tree: O(1)

• Cons:

• No easy way to find minor variants:

• judgment/judgement

• No prefix search [tolerant retrieval]

• If vocabulary keeps growing, need to occasionally do the expensive operation of rehashing everything

### Root

• Definition: Every internal nodel has a number of children in the interval [a,b] where a, b are appropriate natural numbers, e.g., [2,4].

### Trees

• Simplest: binary tree

• More usual: B-trees

• Trees require a standard ordering of characters and hence strings … but we typically have one

• Pros:

• Solves the prefix problem (terms starting with hyp)

• Cons:

• Slower: O(log M) [and this requires balanced tree]

• Rebalancing binary trees is expensive

• But B-trees mitigate the rebalancing problem

Wild-card queries

### Wild-card queries: *

• mon*: find all docs containing any word beginning with “mon”.

• Easy with binary tree (or B-tree) lexicon: retrieve all words in range:

mon w < moo

• *mon: find words ending in “mon”: harder

• Maintain an additional B-tree for terms backwards.

• Can retrieve all words in range: nom w < non.

• Exercise: from this, how can we enumerate all terms

• meeting the wild-card query pro*cent ?

### Query processing

• At this point, we have an enumeration of all terms in the dictionary that match the wild-card query.

• We still have to look up the postings for each enumerated term.

• E.g., consider the query:

• se*ate AND fil*er

• This may result in the execution of many Boolean AND queries.

### B-trees handle *’s at the end of a query term

• How can we handle *’s in the middle of query term?

• co*tion

• We could look up co* AND *tion in a B-tree and intersect the two term sets

• Expensive

• The solution: transform wild-card queries so that the *’s occur at the end

• This gives rise to the Permuterm Index.

### Permuterm index

• For term hello, index under:

• hello$, ello$h, llo$he, lo$hel, o$hell • where$ is a special symbol.

• Queries:

• X lookup on X$X* lookup on$X*

• *X lookup on X$* *X* lookup on X* • X*Y lookup on Y$X*           X*Y*Z ??? Exercise!

↑

Query = hel*o

X=hel, Y=o

Lookup o$hel* ### Permuterm query processing • Rotate query wild-card to the right • Now use B-tree lookup as before. • Permuterm problem: quadruples lexicon size ↑ Empirical observation for English. ### Bigram (k-gram) indexes • Enumerate all k-grams (sequence of k chars) occurring in any term • e.g., from text “April is the cruelest month” we get the 2-grams (bigrams) •$ is a special word boundary symbol

• Maintain a second inverted index from bigrams to dictionary terms that match each bigram.

• $a,ap,pr,ri,il,l$,$i,is,s$,$t,th,he,e$,$c,cr,ru, ue,el,le,es,st,t$, $m,mo,on,nt,h$

### Bigram index example

• The k-gram index finds terms based on a query consisting of k-grams (here k=2).

### Processing wild-cards

• Query mon* can now be run as

### Key concept: Centroid

• The centroid is the center of mass of a set of points

• Recall that we represent documents as points in a high-dimensional space

• Definition: Centroid

• where C is a set of documents.

### Rocchio Algorithm

• The Rocchio algorithm uses the vector space model to pick a relevance feedback query

• Rocchio seeks the query qop t that maximizes

• Tries to separate docs marked relevant and non-relevant

• Problem: we don’t know the truly relevant docs

### The Theoretically Best Query

x non-relevant documents

o relevant documents

### Rocchio 1971 Algorithm (SMART)

• Used in practice:

• Dr = set of known relevant doc vectors

• Dnr = set of known irrelevant doc vectors

• Different from Cr and Cnr           !

• qm = modified query vector; q0 = original query vector; α,β,γ: weights (hand-chosen or set empirically)

• New query moves toward relevant documents and away from irrelevant documents

### Subtleties to note

• Tradeoff α vs. β/γ : If we have a lot of judged documents, we want a higher β/γ.

• Some weights in query vector can go negative

• Negative term weights are ignored (set to 0)

### Relevance feedback on initial query

x known non-relevant documents

o known relevant documents

### Relevance Feedback in vector spaces

• We can modify the query based on relevance feedback and apply standard vector space model.

• Use only the docs that were marked.

• Relevance feedback can improve recall and precision

• Relevance feedback is most useful for increasing recall in situations where recall is important

• Users can be expected to review results and to take time to iterate

### Positive vs Negative Feedback

• Positive feedback is more valuable than negative feedback (so, set γ< β; e.g. γ= 0.25, β = 0.75).

• Many systems only allow positive feedback (γ=0).

Why?

### Aside: Vector Space can be Counterintuitive

q1 query “cholera”

o www.ph.ucla.edu/epi/snow.html

x other documents

### High-dimensional Vector Spaces

• The queries “cholera” and “john snow” are far from each other in vector space.

• How can the document “John Snow and Cholera” be close to both of them?

• Our intuitions for 2- and 3-dimensional space don't work in >10,000 dimensions.

• 3 dimensions: If a document is close to many queries, then some of these queries must be close to each other.

• Doesn't hold for a high-dimensional space.

### Relevance Feedback: Assumptions

• A1: User has sufficient knowledge for initial query.

• A2: Relevance prototypes are “well-behaved”.

• Term distribution in relevant documents will be similar

• Term distribution in non-relevant documents will be different from those in relevant documents

• Either: All relevant documents are tightly clustered around a single prototype.

• Or: There are different prototypes, but they have significant vocabulary overlap.

• Similarities between relevant and irrelevant documents are small

### Violation of A1

• User does not have sufficient initial knowledge.

• Examples:

• Misspellings (Brittany Speers).

• Cross-language information retrieval (hígado).

• Mismatch of searcher’s vocabulary vs. collection vocabulary

• Cosmonaut/astronaut

### Violation of A2

• There are several relevance prototypes.

• Examples:

• Burma/Myanmar

• Contradictory government policies

• Pop stars that worked at Burger King

• Often: instances of a general concept

• Good editorial content can address problem

• Report on contradictory government policies

### Relevance Feedback: Problems

• Long queries are inefficient for typical IR engine.                           ⇐  Why?

• Long response times for user.

• High cost for retrieval system.

• Partial solution:

• Only reweight certain prominent terms

• Perhaps top 20 by term frequency

• Users are often reluctant to provide explicit feedback

• It’s often harder to understand why a particular document was retrieved after applying relevance feedback

### Evaluation of relevance feedback strategies

• Use q0 and compute precision and recall graph

• Use qm and compute precision recall graph

• Assess on all documents in the collection

• Spectacular improvements, but … it’s cheating!

• Partly due to known relevant documents ranked higher

• Must evaluate with respect to documents not seen by user

• Use documents in residual collection (set of documents minus those assessed relevant)

• Measures usually then lower than for original query

• But a more realistic evaluation

• Relative performance can be validly compared

• Empirically, one round of relevance feedback is often very useful. Two rounds is sometimes marginally useful.

### Evaluation of relevance feedback

• Second method – assess only the docs not rated by the user in the first round

• Could make relevance feedback look worse than it really is

• Can still assess relative performance of algorithms

• Most satisfactory – use two collections each with their own relevance assessments

• q0 and user feedback from first collection

• qm run on second collection and measured

### Evaluation: Caveat

• True evaluation of usefulness must compare to other methods taking the same amount of time.

• Alternative to relevance feedback: User revises and resubmits query.

• Users may prefer revision/resubmission to having to judge relevance of documents.

• There is no clear evidence that relevance feedback is the “best use” of the user’s time.

### Relevance Feedback on the Web

• Some search engines offer a similar/related pages feature (this is a trivial form of relevance feedback)

• Google (link-based)                                          ←  α/β/γ ??

• Altavista

• Stanford WebBase

• But some don’t because it’s hard to explain to average user:

• Alltheweb

• bing

• Yahoo

• Excite initially had true relevance feedback, but abandoned it due to lack of use.

### Excite Relevance Feedback

• Spink et al. 2000

• Only about 4% of query sessions from a user used relevance feedback option

• Expressed as “More like this” link next to each result

• But about 70% of users only looked at first page of results and didn’t pursue things further

• So 4% is about 1/8 of people extending search

• Relevance feedback improved results about 2/3 of the time

### Pseudo relevance feedback

• Pseudo-relevance feedback automates the “manual” part of true relevance feedback.

• Pseudo-relevance algorithm:

• Retrieve a ranked list of hits for the user’s query

• Assume that the top k documents are relevant.

• Do relevance feedback (e.g., Rocchio)

• Works very well on average

• But can go horribly wrong for some queries.

• Several iterations can cause query drift.

• Why?

### Query Expansion

• In relevance feedback, users give additional input (relevant/non-relevant) on documents, which is used to reweight terms in the documents

• In query expansion, users give additional input (good/bad search term) on words or phrase

### Query assist

• Would you expect such a feature to increase the query volume at a search engine?

### How do we augment the user query?

• Manual thesaurus

• E.g. MedLine: physician, syn: doc, doctor, MD, medico

• Can be query rather than just synonyms

• Global Analysis: (static; of all documents in collection)

• Automatically derived thesaurus

• (co-occurrence statistics)

• Refinements based on query log mining

• Common on the web

• Local Analysis: (dynamic)

• Analysis of documents in result set

### Thesaurus-based query expansion

• For each term, t, in a query, expand the query with synonyms and related words of t from the thesaurus

• feline → feline cat

• May weight added terms less than original query terms.

• Generally increases recall

• Widely used in many science/engineering fields

• May significantly decrease precision, particularly with ambiguous terms.

• “interest rate”  “interest rate fascinate evaluate”

• There is a high cost of manually producing a thesaurus

• And for updating it for scientific changes

### Automatic Thesaurus Generation

• Attempt to generate a thesaurus automatically by analyzing the collection of documents

• Fundamental notion: similarity between two words

• Definition 1: Two words are similar if they co-occur with similar words.

• Definition 2: Two words are similar if they occur in a given grammatical relation with the same words.

• You can harvest, peel, eat, prepare, etc. apples and pears, so apples and pears must be similar.

• Co-occurrence based is more robust, grammatical relations are more accurate.

↑

Why?

### Co-occurrence Thesaurus

• Simplest way to compute one is based on term-term similarities in C = AAT where A is term-document matrix.

• wi,j = (normalized) weight for (ti ,dj)

• For each ti, pick terms with high values in C

### Automatic Thesaurus Generation

• Discussion

• Quality of associations is usually a problem.

• Term ambiguity may introduce irrelevant statistically correlated terms.

• “Apple computer”  “Apple red fruit computer”

• Problems:

• False positives: Words deemed similar that are not

• False negatives: Words deemed dissimilar that are similar

• Since terms are highly correlated anyway, expansion may not retrieve many additional documents.

### Indirect relevance feedback

• On the web, DirectHit introduced a form of indirect relevance feedback.

• DirectHit ranked documents higher that users look at more often.

• Clicked on links are assumed likely to be relevant

• Assuming the displayed summaries are good, etc.

• Globally: Not necessarily user or query specific.

• This is the general area of clickstream mining

• Today – handled as part of machine-learned ranking

### Resources

• IIR Ch 9

• MG Ch. 4.7

• MIR Ch. 5.2 – 5.4

### CS276B

Text Retrieval and Mining
Winter 2005

Lecture 12

### What is XML?

• eXtensible Markup Language

• A framework for defining markup languages

• No fixed collection of markup tags

• Each XML language targeted for application

• All XML languages share features

• Enables building of generic tools

### Basic Structure

• An XML document is an ordered, labeled tree

• character data leaf nodes contain the actual data (text strings)

• element nodes, are each labeled with

• a name (often called the element type), and

• a set of attributes, each consisting of a name and a value,

• can have child nodes

### XML Example

<chapter id="cmds">

<chaptitle>FileCab</chaptitle<para>This chapter describes the commands that manage the <tm>FileCab</tm>inet application.</para> </chapter>

### Elements

• Elements are denoted by markup tags

• thetext

• Element start tag: foo

• Attribute: attr1

• The character data: thetext

• Matching element end tag:

### XML vs HTML

• HTML is a markup language for a specific purpose (display in browsers)

• XML is a framework for defining markup languages

• HTML can be formalized as an XML language (XHTML)

• XML defines logical structure only

• HTML: same intention, but has evolved into a presentation language

### XML: Design Goals

• Separate syntax from semantics to provide a common framework for structuring information

• Allow tailor-made markup for any imaginable application domain

• Support internationalization (Unicode) and platform independence

• Be the future of (semi)structured information (do some of the work now done by databases)

### Why Use XML?

• Represent semi-structured data (data that are structured, but don’t fit relational model)

• XML is more flexible than DBs

• XML is more structured than simple IR

• You get a massive infrastructure for free

### Applications of XML

• XHTML

• CML – chemical markup language

• WML – wireless markup language

• ThML – theological markup language

• ### Having a Humble Opinion of Self

EVERY man naturally desires knowledge

Aristotle, Metaphysics, i. 1.

; but what good is knowledge without fear of God? Indeed a humble rustic who serves God is better than a proud intellectual who neglects his soul to study the course of the stars.

Augustine, Confessions V. 4.

### XML Schemas

• Schema = syntax definition of XML language

• Schema language = formal language for expressing XML schemas

• Examples

• Document Type Definition

• XML Schema (W3C)

• Relevance for XML IR

• Our job is much easier if we have a (one) schema

### XML Tutorial

• (Anders Møller and Michael Schwartzbach)

• Previous (and some following) slides are based on their tutorial

XML Indexing and Search

### Native XML Database

• Uses XML document as logical unit

• Should support

• Elements

• Attributes

• PCDATA (parsed character data)

• Document order

• Contrast with

• DB modified for XML

• Generic IR system modified for XML

### XML Indexing and Search

• Most native XML databases have taken a DB approach

• Exact match

• Evaluate path expressions

• No IR type relevance ranking

• Only a few that focus on relevance ranking

### Data vs. Text-centric XML

• Data-centric XML: used for messaging between enterprise applications

• Mainly a recasting of relational data

• Content-centric XML: used for annotating content

• Rich in text

• Demands good integration of text retrieval functionality

• E.g., find me the ISBN #s of Books with at least three Chapters discussing cocoa production, ranked by Price

### IR XML Challenge 1: Term Statistics

• There is no document unit in XML

• How do we compute tf and idf?

• Global tf/idf over all text context is useless

• Indexing granularity

### IR XML Challenge 2: Fragments

• IR systems don’t store content (only index)

• Need to go to document for retrieving/displaying fragment

• E.g., give me the Abstracts of Papers on existentialism

• Where do you retrieve the Abstract from?

• Easier in DB framework

### IR XML Challenges 3: Schemas

• Ideally:

• There is one schema

• User understands schema

• In practice: rare

• Many schemas

• Schemas not known in advance

• Schemas change

• Users don’t understand schemas

• Need to identify similar elements in different schemas

• Example: employee

### IR XML Challenges 4: UI

• Help user find relevant nodes in schema

• Author, editor, contributor, “from:”/sender

• What is the query language you expose to the user?

• Specific XML query language? No.

• Forms? Parametric search?

• A textbox?

• In general: design layer between XML and user

### IR XML Challenges 5: using a DB

• Why you don’t want to use a DB

• Spelling correction

• Mid-word wildcards

• Contains vs “is about”

• DB has no notion of ordering

• Relevance ranking

### Querying XML

• Today:

• XQuery

• XIRQL

• Lecture 15

• Vector space approaches

### XQuery

• SQL for XML

• Usage scenarios

• Human-readable documents

• Data-oriented documents

• Mixed documents (e.g., patient records)

• Relies on

• XPath

• XML Schema datatypes

• Turing complete

• XQuery is still a working draft.

### XQuery

• The principal forms of XQuery expressions are:

• path expressions

• element constructors

• FLWR ("flower") expressions

• list expressions

• conditional expressions

• quantified expressions

• datatype expressions

• Evaluated with respect to a context

### FLWR

• FOR $p IN document("bib.xml")//publisher LET$b := document("bib.xml”)//book[publisher = $p] WHERE count($b) > 100 RETURN $p • FOR generates an ordered list of bindings of publisher names to$p

• LET associates to each binding a further binding of the list of book elements with that publisher to $b • at this stage, we have an ordered list of tuples of bindings: ($p,$b) • WHERE filters that list to retain only the desired tuples • RETURN constructs for each tuple a resulting value ### Queries Supported by XQuery • Location/position (“chapter no.3”) • Simple attribute/value • /play/title contains “hamlet” • Path queries • title contains “hamlet” • /play//title contains “hamlet” • Complex graphs • Employees with two managers • Subsumes: hyperlinks • What about relevance ranking? ### How XQuery makes ranking difficult • All documents in set A must be ranked above all documents in set B. • Fragments must be ordered in depth-first, left-to-right order. ### XQuery: Order By Clause • for$d in document("depts.xml")//deptno

• let $e := document("emps.xml")//emp[deptno =$d]

• where count($e) >= 10 • order by avg($e/salary) descending

• return { $d, {count($e)}, {avg(\$e/salary)} }

### XQuery Order By Clause

• Order by clause only allows ordering by “overt” criterion

• Say by an attribute value

• Relevance ranking

• Is often proprietary

• Can’t be expressed easily as function of set to be ranked

• Is better abstracted out of query formulation (cf. www)

### XIRQL

• University of Dortmund

• Goal: open source XML search engine

• Motivation

• “Returnable” fragments are special

• E.g., don’t return a some text fragment

• Structured Document Retrieval Principle

• Empower users who don’t know the schema

• Enable search for any person no matter how schema encodes the data

• Don’t worry about attribute/element

### Atomic Units

• Specified in schema

• Only atomic units can be returned as result of search (unless unit specified)

• Tf.idf weighting is applied to atomic units

• Probabilistic combination of “evidence” from atomic units

### Structured Document Retrieval Principle

• A system should always retrieve the most specific part of a document answering a query.

• Example query: xql

• Document:

• 0.3 XQL

• 0.5 example

• 0.8 XQL 0.7 syntax

• Return section, not chapter

### Augmentation weights

• Ensure that Structured Document Retrieval Principle is respected.

• Assume different query conditions are disjoint events -> independence.

• P(chapter,XQL)=P(XQL|chapter)+P(section|chapter)*P(XQL|section) – P(XQL|chapter)*P(section|chapter)*P(XQL|section) = 0.3+0.6*0.8-0.3*0.6*0.8 = 0.636

• Section ranked ahead of chapter

### Datatypes

• Example: person_name

• Assign all elements and attributes with person semantics to this datatype

• Allow user to search for “person” without specifying path

### XIRQL: Summary

• Relevance ranking

• Fragment/context selection

• Datatypes (person_name)

• Semantic relativism

• Attribute/element

### Data structures for XML retrieval

A very basic introduction.

### Data structures for XML retrieval

• What are the primitives we need?

• Inverted index: give me all elements matching text query Q

• We know how to do this – treat each element as a document

• Give me all elements (immediately) below any instance of the Book element

• Combination of the above

### Parent/child links

• Number each element

• Maintain a list of parent-child relationships

• E.g., Chapter:21  Book:8

• Enables immediate parent

• But what about “the word Hamlet under a Scene element under a Play element?

### General positional indexes

• View the XML document as a text document

• Build a positional index for each element

• Mark the beginning and end for each element, e.g.,

• Play →  Doc:1(27) →  Doc:1(2033) ---→

• /Play Doc:1(1122) →  Doc:1(5790) ---

• Verse →  Doc:1(431)  → Doc:4(33)  ---

• /Verse → Doc:1(867)  →  Doc:4(92) ---

• Term:droppeth →  Doc:1(720)

### Positional containment

↑

droppeth under Verse under Play.

### Summary of data structures

• Path containment etc. can essentially be solved by positional inverted indexes

• Retrieval consists of “merging” postings

• All the compression tricks etc. from 276A are still applicable

• Complications arise from insertion/deletion of elements, text within elements

• Beyond the scope of this course

### Resources

• www.w3.org/XML - XML resources at W3C

• Norbert Fuhr and Kai Grossjohann. XIRQL: A query language for information retrieval in XML documents. In Proceedings of the 24th International ACM SIGIR Conference, New Orleans, Louisiana, September 2001.

• ORDPATHs: Insert-Friendly XML Node Labels.

• www.cs.umb.edu/~poneil/ordpath.pdf

### CS276A

Text Retrieval and Mining

Lecture 10

### Recap of the last lecture

• Improving search results

• Especially for high recall. E.g., searching for aircraft so it matches with plane; thermodynamic with heat

• Options for improving results…

• Global methods

• Query expansion

• Thesauri

• Automatic thesaurus generation

• Global indirect relevance feedback

• Local methods

• Relevance feedback

• Pseudo relevance feedback

### Probabilistic relevance feedback

• Rather than reweighting in a vector space…

• If user has told us some relevant and some irrelevant documents, then we can proceed to build a probabilistic classifier, such as a Naive Bayes model:

• P(tk|R) = |Drk| / |Dr|

• P(tk|NR) = |Dnrk| / |Dnr|

• tk is a term; Dr is the set of known relevant documents; Drk is the subset that contain tk; Dnr is the set of known irrelevant documents; Dnrk is the subset that contain tk.

### Why probabilities in IR?

• In traditional IR systems, matching between each document and query is attempted in a semantically imprecise space of index terms.

• Probabilities provide a principled foundation for uncertain reasoning. Can we use probabilities to quantify our uncertainties?

### Probabilistic IR topics

• Classical probabilistic retrieval model

• Probability ranking principle, etc.

• (Naïve) Bayesian Text Categorization

• Bayesian networks for text retrieval

• Language model approach to IR

• An important emphasis in recent work

• Probabilistic methods are one of the oldest but also one of the currently hottest topics in IR.

• Traditionally: neat ideas, but they’ve never won on performance. It may be different now.

### The document ranking problem

• We have a collection of documents

• User issues a query

• A list of documents needs to be returned

• Ranking method is core of an IR system:

• In what order do we present documents to the user?

• We want the “best” document to be first, second best second, etc….

• Idea: Rank by probability of relevance of the document w.r.t. information need

• P(relevant|documenti, query)

### Recall a few probability basics

• For events a and b:
• Bayes’ Rule
• Odds:

### Probability Ranking Principle

• Let x be a document in the collection.

• Let R represent relevance of a document w.r.t. given (fixed) query and let NR represent non-relevance.       ⇐  R={0,1} vs. NR/R

• Need to find p(R|x) - probability that a document is relevant.
• p(x|R), p(x|NR) - probability that if a relevant (non-relevant) document is retrieved, it is x.

### Probability Ranking Principle (PRP)

• Simple case: no selection costs or other utility concerns that would differentially weight errors

• Bayes’ Optimal Decision Rule

• x is relevant iff p(R|x) > p(NR|x)

• PRP in action: Rank all documents by p(R|x)

• Theorem:

• Using the PRP is optimal, in that it minimizes the loss (Bayes risk) under 1/0 loss

• Provable if all probabilities correct, etc. [e.g., Ripley 1996]

### Probability Ranking Principle

• More complex case: retrieval costs.

• Let d be a document

• C - cost of retrieval of relevant document

• C’ - cost of retrieval of non-relevant document

• Probability Ranking Principle: if

• for all d’ not yet retrieved, then d is the next document to be retrieved

• We won’t further consider loss/utility from now on

### Probability Ranking Principle

• How do we compute all those probabilities?

• Do not know exact probabilities, have to use estimates

• Binary Independence Retrieval (BIR) – which we discuss later today – is the simplest model

• Questionable assumptions

• “Relevance” of each document is independent of relevance of other documents.

• Really, it’s bad to keep on returning duplicates

• Boolean model of relevance

• That one has a single step information need

• Seeing a range of results might let user refine query

### Probabilistic Retrieval Strategy

• Estimate how terms contribute to relevance

• How do things like tf, df, and length influence your judgments about document relevance?

• One answer is the Okapi formulae (S. Robertson)

• Combine to find document relevance probability

• Order documents by decreasing probability

### The Probability Ranking Principle

• “If a reference retrieval system's response to each request is a ranking of the documents in the collection in order of decreasing probability of relevance to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data have been made available to the system for this purpose, the overall effectiveness of the system to its user will be the best that is obtainable on the basis of those data.”

• [1960s/1970s] S. Robertson, W.S. Cooper, M.E. Maron; van Rijsbergen (1979:113); Manning & Schütze (1999:538)

### Probabilistic Ranking

• Basic concept:

• "For a given query, if we know some documents that are relevant, terms that occur in those documents should be given greater weighting in searching for other relevant documents.

• By making assumptions about the distribution of terms and applying Bayes Theorem, it is possible to derive weights theoretically."

• Van Rijsbergen

### Binary Independence Model

• Traditionally used in conjunction with PRP

• “Binary” = Boolean: documents are represented as binary incidence vectors of terms (cf. lecture 1):

• xi=1  iff term i is present in document x.
• “Independence”: terms occur in documents independently

• Different documents can be modeled as same vector

• Bernoulli Naive Bayes model (cf. text categorization!)

### Binary Independence Model

• Queries: binary term incidence vectors

• Given query q,

• for each document d need to compute p(R|q,d).

• replace with computing p(R|q,x) where x is binary term incidence vector representing d Interested only in ranking

• Will use odds and Bayes’ Rule:

### Binary Independence Model

• Using Independence Assumption:

• So :

### Binary Independence Model

• Since xi is either 0 or 1:

• Let Pi = P(xi = 1 | R,q);  rt = p (xi =1 | NR,q)

• Assume, for all terms not occurring in the query (qi=0) pi = ri

• Then...                                                                   (This can be changed (e.g.,

in relevance feedback)

### Binary Independence Model

• Retrieval Status Value:

### Binary Independence Model

• All boils down to computing RSV.

• So, how do we compute ci’s from our data ?

### Binary Independence Model

• Estimating RSV coefficients.

• For each term i look at this table of document counts:

### Estimation – key challenge

• If non-relevant documents are approximated by the whole collection, then ri (prob. of occurrence in non-relevant documents for query) is n/N and

• log (1– ri)/ri = log (N– n)/n ≈ log N/n = IDF!

• pi (probability of occurrence in relevant documents) can be estimated in various ways:

• from relevant documents if know some

• Relevance weighting can be used in feedback loop

• constant (Croft and Harper combination match) – then just get idf weighting of terms

• proportional to prob. of occurrence in collection

• more accurately, to log of this (Greiff, SIGIR 1998)

### Iteratively estimating pi

1. Assume that pi constant over all xi in query

• pi = 0.5 (even odds) for any given doc

2. Determine guess of relevant document set:

• V is fixed size set of highest ranked documents on this model (note: now a bit like tf.idf!)

3. We need to improve our guesses for pi and ri, so

• Use distribution of xi in docs in V. Let Vi be set of documents containing xi

• pi = |Vi| / |V|

• Assume if not retrieved then not relevant

• ri = (ni – |Vi|) / (N – |V|)

4. Go to 2. until converges then return ranking

### Probabilistic Relevance Feedback

1. Guess a preliminary probabilistic description of R and use it to retrieve a first set of documents V, as above.
2. Interact with the user to refine the description: learn some definite members of R and NR

3. Reestimate pi and ri on the basis of these
• Or can combine new information with original guess (use Bayesian prior):
4. Repeat, thus generating a succession of approximations to R

### PRP and BIR

• Getting reasonable approximations of probabilities is possible.

• Requires restrictive assumptions:

• term independence

• terms not in query don’t affect the outcome

• boolean representation of documents/queries/relevance

• document relevance values are independent

• Some of these assumptions can be removed

• Problem: either require partial relevance information or only can derive somewhat inferior term weights

### Removing term independence

 In general, index terms aren’t independentDependencies can be complexvan Rijsbergen (1979) proposed model of simple tree dependenciesExactly Friedman and Goldszmidt’s Tree Augmented Naive Bayes (AAAI 13, 1996)Each term dependent on one otherIn 1970s, estimation problems held back success of this model

### Food for thought

• Think through the differences between standard tf.idf and the probabilistic retrieval model in the first iteration

• Think through the differences between vector space (pseudo) relevance feedback and probabilistic (pseudo) relevance feedback

### Good and Bad News

• Standard Vector Space Model

• Empirical for the most part; success measured by results

• Few properties provable

• Probabilistic Model Advantages

• Based on a firm theoretical foundation

• Theoretically justified optimal ranking scheme

• Disadvantages

• Making the initial guess to get V

• Binary word-in-doc weights (not using term frequencies)

• Independence of terms (can be alleviated)

• Amount of computation

• Has never worked convincingly better in practice

### Bayesian Networks for Text Retrieval (Turtle and Croft 1990)

• Standard probabilistic model assumes you can’t estimate P(R|D,Q)

• Instead assume independence and use P(D|R)

• But maybe you can with a Bayesian network*

• What is a Bayesian network?

• A directed acyclic graph

• Nodes

• Events or Variables

• Assume values.

• For our purposes, all Boolean

• Links

• model direct dependencies between nodes

### Bayesian Networks

•  a,b,c - propositions (events). Bayesian networks model causal relations between events

 Inference in Bayesian Nets:Given probability distributions for roots and conditional probabilities can compute apriori probability of any instance Fixing assumptions (e.g., b was observed) will cause recomputation of probabilities
• For more information see:

• R.G. Cowell, A.P. Dawid, S.L. Lauritzen, and D.J. Spiegelhalter. 1999. Probabilistic Networks and Expert Systems. Springer Verlag.

• J. Pearl. 1988. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan-Kaufman.

### Independence Assumptions

 Independence assumption:  P(t|g, f)=P(t|g) Joint probability P(f d n g t) =P(f) P(d) P(n|f) P(g|f d) P(t|g)

### Chained inference

• Evidence - a node takes on some value

• Inference

• Compute belief (probabilities) of other nodes

• conditioned on the known evidence

• Two kinds of inference: Diagnostic and Predictive

• Computational complexity

• General network: NP-hard

• Tree-like networks are easily tractable

• Much other work on efficient exact and approximate Bayesian network inference

• Clever dynamic programming

• Approximate inference (“loopy belief propagation”)

### Model for Text Retrieval

• Goal

• Given a user’s information need (evidence), find probability a doc satisfies need

• Retrieval model

• Model docs in a document network

• Model information need in a query network

### Bayesian Nets for IR

• Construct Document Network (once !)

• For each query

• Construct best Query Network

• Attach it to Document Network

• Find subset of di’s which maximizes the probability value of node I (best subset).

• Retrieve these di’s as the answer to query.

### Link matrices and probabilities

 Prior doc probability P(d) = 1/nP(r|d)within-document term frequencytf x idf - based P(c|r)1-to-1thesaurusP(q|c): canonical forms of query operatorsAlways use things like AND and NOT – never store a full CPT*
• *conditional probability table

### Extensions

• Prior probs don’t have to be 1/n.

• “User information need” doesn’t have to be a query - can be words typed, in docs read, any combination …

• Phrases, inter-document links

• Link matrices can be modified over time.

• User feedback.

• The promise of “personalization”

### Computational details

• Document network built at indexing time

• Query network built/scored at query time

• Representation:

• Link matrices from docs to any single term are like the postings entry for that term

• Canonical link matrices are efficient to store and compute

• Attach evidence only at roots of network

• Can do single pass from roots to leaves

### Bayes Nets in IR

• Flexible ways of combining term weights, which can generalize previous approaches

• Boolean model

• Binary independence model

• Probabilistic models with weaker assumptions

• Efficient large-scale implementation

• InQuery text retrieval system from U Mass

• Turtle and Croft (1990) [Commercial version defunct?]

• Need approximations to avoid intractable inference

• Need to estimate all the probabilities by some means (whether more or less ad hoc)

• Much new Bayes net technology yet to be applied?

### Resources

• S. E. Robertson and K. Spärck Jones. 1976. Relevance Weighting of Search Terms. Journal of the American Society for Information Sciences 27(3): 129–146.

• C. J. van Rijsbergen. 1979. Information Retrieval. 2nd ed. London: Butterworths, chapter 6. [Most details of math] http://www.dcs.gla.ac.uk/Keith/Preface.html

• N. Fuhr. 1992. Probabilistic Models in Information Retrieval. The Computer Journal, 35(3),243–255. [Easiest read, with BNs]

• F. Crestani, M. Lalmas, C. J. van Rijsbergen, and I. Campbell. 1998. Is This Document Relevant? ... Probably: A Survey of Probabilistic Models in Information Retrieval. ACM Computing Surveys 30(4): 528–552.

• http://www.acm.org/pubs/citations/journals/surveys/1998-30-4/p528-crestani/

• [Adds very little material that isn’t in van Rijsbergen or Fuhr ]

### Resources

• H.R. Turtle and W.B. Croft. 1990. Inference Networks for Document Retrieval. Proc. ACM SIGIR: 1-24.

• E. Charniak. Bayesian nets without tears. AI Magazine 12(4): 50-63 (1991). http://www.aaai.org/Library/Magazine/Vol12/12-04/vol12-04.html

• D. Heckerman. 1995. A Tutorial on Learning with Bayesian Networks. Microsoft Technical Report MSR-TR-95-06

• http://www.research.microsoft.com/~heckerman/

• N. Fuhr. 2000. Probabilistic Datalog: Implementing Logical Information Retrieval for Advanced Applications. Journal of the American Society for Information Science 51(2): 95–110.

• R. K. Belew. 2001. Finding Out About: A Cognitive Perspective on Search Engine Technology and the WWW. Cambridge UP 2001.

• MIR 2.5.4, 2.8

### CS276A

Text Retrieval and Mining

Lecture 12

[Borrows slides from Viktor Lavrenko and Chengxiang Zhai]

### Recap

• Probabilistic models:
Naïve Bayes Text Classification

• Introduction to Text Classification

• Probabilistic Language Models

• Naïve Bayes text categorization

### Today

• The Language Model Approach to IR

• Basic query generation model

• Alternative models

### IR based on Language Model (LM)

• A common search heuristic is to use words that you expect to find in matching documents as your query – why, I saw Sergey Brin advocating that strategy on late night TV one night in my hotel room, so it must be good!
• The LM approach directly exploits that idea!

### Formal Language (Model)

• Traditional generative model: generates strings

• Finite state machines or regular grammars, etc.

• Example:

 I wish I wish I wish I wish I wish I wish I wish I wish I wish I wish … *wish I wish

### Stochastic Language Models

• Models probability of generating strings in the language (commonly all strings over alphabet ∑)

### Stochastic Language Models

• Model probability of generating any string

### Stochastic Language Models

• A statistical model for generating text

• Probability distribution over strings in a given language

### Unigram and higher-order models

• Unigram Language Models (Easy , effective !)
• Bigram (generally, n-gram) Language Models
• Other Language Models
• Grammar-based models (PCFGs), etc.
• Probably not the first thing to try in IR
• ### Using Language Models in IR

• Treat each document as the basis for a model (e.g., unigram sufficient statistics)

• Rank document d based on P(d | q)

• P(d | q) = P(q | d) x P(d) / P(q)

• P(q) is the same for all documents, so ignore

• P(d) [the prior] is often treated as the same for all d

• But we could use criteria like authority, length, genre

• P(q | d) is the probability of q given d’s model

• Very general formal approach

### The fundamental problem of LMs

• Usually we don’t know the model M

• But have a sample of text representative of that model

• Estimate a language model from a sample

• Then compute the observation probability

### Language Models for IR

• Language Modeling Approaches

• Attempt to model query generation process

• Documents are ranked by the probability that a query would be observed as a random sample from the respective document model

• Multinomial approach

### Retrieval based on probabilistic LM

• Treat the generation of queries as a random process.

• Approach

• Infer a language model for each document.

• Estimate the probability of generating the query according to each of these models.

• Rank the documents according to these probabilities.

• Usually a unigram estimate of words is used

• Some work on bigrams, paralleling van Rijsbergen

### Retrieval based on probabilistic LM

• Intuition

• Users …

• Have a reasonable idea of terms that are likely to occur in documents of interest.

• They will choose query terms that distinguish these documents from others in the collection.

• Collection statistics …

• Are integral parts of the language model.

• Are not used heuristically as in many other approaches.

• In theory. In practice, there’s usually some wiggle room for empirically set parameters

### Query generation probability (1)

• Ranking formula

• The probability of producing the query given the language model of document d using MLE is:

### Insufficient data

• Zero probability   p(t | Md) = 0

• May not wish to assign a probability of zero to a document that is missing one or more of the query terms [gives conjunction semantics]

• General approach

• A non-occurring term is possible, but no more likely than would be expected by chance in the collection.

• If :

### Insufficient data

• Zero probabilities spell disaster

• We need to smooth probabilities

• Discount nonzero probabilities

• Give some probability mass to unseen things

• There’s a wide space of approaches to smoothing probability distributions to deal with this problem, such as adding 1, ½ or  to counts, Dirichlet priors, discounting, and interpolation

• [See FSNLP ch. 6 or CS224N if you want more]

• A simple idea that works well in practice is to use a mixture between the document multinomial and the collection multinomial distribution

### Mixture model

• P(w|d) = lPmle(w|Md) + (1 – l)Pmle(w|Mc)

• Mixes the probability from the document with the general collection frequency of the word.

• Correctly setting lis very important

• A high value of lambda makes the search “conjunctive-like” – suitable for short queries

• A low value is more suitable for long queries

• Can tune l to optimize performance

• Perhaps make it dependent on document size (cf. Dirichlet prior or Witten-Bell smoothing)

### Basic mixture model summary

• General formulation of the LM for IR

• The user has a document in mind, and generates the query from this document.

• The equation represents the probability that the document that the user had in mind was in fact this one.

• general language model

• individual-document model

### Example

• Document collection (2 documents)

• d1: Xerox reports a profit but revenue is down

• d2: Lucent narrows quarter loss but revenue decreases further

• Model: MLE unigram from documents; l = ½

• Query: revenue down

• P(Q|d1) = [(1/8 + 2/16)/2] x [(1/8 + 1/16)/2]

• = 1/8 x 3/32 = 3/256

• P(Q|d2) = [(1/8 + 2/16)/2] x [(0 + 1/16)/2]

• = 1/8 x 1/32 = 1/256

• Ranking: d1 > d2

### Ponte and Croft Experiments

• Data

• TREC topics 202-250 on TREC disks 2 and 3

• Natural language queries consisting of one sentence each

• TREC topics 51-100 on TREC disk 3 using the concept fields

 Lists of good termsTopic: Satellite Launch Contracts

• Description:

• Concept(s):

1. Contract, agreement

2. Launch vehicle, rocket, payload, satellite

3. Launch services, …

Precision/recall results 202-250

Precision/recall results 51-100

The main difference is whether “Relevance” figures explicitly in the model or not

• LM approach attempts to do away with modeling relevance

• LM approach asssumes that documents and expressions of information problems are of the same type

• Computationally tractable, intuitively appealing

• LM vs. Prob. Model for IR

Problems of basic LM approach

• Assumption of equivalence between document and information problem representation is unrealistic

• Very simple models of language

• Relevance feedback is difficult to integrate, as are user preferences, and other general issues of relevance

• Can’t easily accommodate phrases, passages, Boolean operators

• Current extensions focus on putting relevance back into the model, etc.

• LM vs. Prob. Model for IR

Extension: 3-level model

• 3-level model

1. Whole collection model ( )

2. Specific-topic model; relevant-documents model ( )

3. Individual-document model ( )

• Relevance hypothesis

• A request(query; topic) is generated from a specific-topic model { , }.

• Iff a document is relevant to the topic, the same model will apply to the document.

• It will replace part of the individual-document model in explaining the document.

• The probability of relevance of a document

• The probability that this model explains part of the document

• The probability that the { , , } combination is better than the { , } combination

### The main difference is whether “Relevance” figures explicitly in the model or not

• LM approach attempts to do away with modeling relevance

• LM approach asssumes that documents and expressions of information problems are of the same type

• Computationally tractable, intuitively appealing

• LM vs. Prob. Model for IR

### Problems of basic LM approach

• Assumption of equivalence between document and information problem representation is unrealistic

• Very simple models of language

• Relevance feedback is difficult to integrate, as are user preferences, and other general issues of relevance

• Can’t easily accommodate phrases, passages, Boolean operators

• Current extensions focus on putting relevance back into the model, etc.

• LM vs. Prob. Model for IR

### Extension: 3-level model

• 3-level model

1. Whole collection model (Mc)

2. Specific-topic model; relevant-documents model (Mt)

3. Individual-document model (Md)

• Relevance hypothesis

• A request(query; topic) is generated from a specific-topic model { Mc, Mt}.

• Iff a document is relevant to the topic, the same model will apply to the document.

• It will replace part of the individual-document model in explaining the document.

• The probability of relevance of a document

• The probability that this model explains part of the document

• The probability that the {Mc ,Mt ,Md } combination is better than the {Mc ,Md } combination

### Retrieval Using Language Models

• Retrieval: Query likelihood (1), Document likelihood (2), Model comparison (3)

### Query Likelihood

• P(Q|Dm)

• Major issue is estimating document model

• i.e. smoothing techniques instead of tf.idf weights

• Good retrieval results

• e.g. UMass, BBN, Twente, CMU

• Problems dealing with relevance feedback, query expansion, structured queries

### Document Likelihood

• Rank by likelihood ratio P(D|R)/P(D|NR)

• treat as a generation problem

• P(w|R) is estimated by P(w|Qm)

• Qm is the query or relevance model

• P(w|NR) is estimated by collection probabilities P(w)

• Issue is estimation of query model

• Treat query as generated by mixture of topic and background

• Estimate relevance model from related documents (query expansion)

• Relevance feedback is easily incorporated

• Good retrieval results

• e.g. UMass at SIGIR 01

• inconsistent with heterogeneous document collections

### Model Comparison

• Estimate query and document models and compare

• Suitable measure is KL divergence D(Qm||Dm)

• equivalent to query-likelihood approach if simple empirical distribution used for query model

• More general risk minimization framework has been proposed

• Zhai and Lafferty 2001

• Better results than query-likelihood or document-likelihood approaches

### Two-stage smoothing:Another Reason for Smoothing

• p( “algorithms”|d1) = p(“algorithm”|d2)

• p( “data”|d1) < p(“data”|d2)

• p( “mining”|d1) < p(“mining”|d2)

• But p(q|d1)>p(q|d2)!

• We should make p(“the”) and p(“for”) less different for all docs.

### How can one do relevance feedback if using language modeling approach?

• Introduce a query model & treat feedback as query model updating

• Retrieval function:

• Query-likelihood => KL-Divergence

• Feedback:

• Expansion-based => Model-based

### Translation model (Berger and Lafferty)

• Basic LMs do not address issues of synonymy.

• Or any deviation in expression of information need from language of documents

• A translation model lets you generate query words not in document via “translation” to synonyms etc.

• Or to do cross-language IR, or multimedia IR

• Need to learn a translation model (using a dictionary or via statistical machine translation)

### Language models: pro & con

• Novel way of looking at the problem of text retrieval based on probabilistic language modeling

• Conceptually simple and explanatory

• Formal mathematical model

• Natural use of collection statistics, not heuristics (almost…)

• LMs provide effective retrieval and can be improved to the extent that the following conditions can be met

• Our language models are accurate representations of the data.

• Users have some sense of term distribution.*

• *Or we get more sophisticated with translation model

### Comparison With Vector Space

• There’s some relation to traditional tf.idf models:

• (unscaled) term frequency is directly in model

• the probabilities do length normalization of term frequencies

• the effect of doing a mixture with overall collection frequencies is a little like idf: terms rare in the general collection but common in some documents will have a greater influence on the ranking

### Comparison With Vector Space

• Similar in some ways

• Term weights based on frequency

• Terms often used as if they were independent

• Inverse document/collection frequency used

• Some form of length normalization useful

• Different in others

• Based on probability rather than similarity

• Intuitions are probabilistic rather than geometric

• Details of use of document length and term, document, and collection frequency differ

### Resources

• J.M. Ponte and W.B. Croft. 1998. A language modelling approach to information retrieval. In SIGIR 21.
• D. Hiemstra. 1998. A linguistically motivated probabilistic model of information retrieval. ECDL 2, pp. 569–584.

• A. Berger and J. Lafferty. 1999. Information retrieval as statistical translation. SIGIR 22, pp. 222–229.

• D.R.H. Miller, T. Leek, and R.M. Schwartz. 1999. A hidden Markov model information retrieval system. SIGIR 22, pp. 214–221.

• [Several relevant newer papers at SIGIR 2325, 2000–2002.]

• Workshop on Language Modeling and Information Retrieval, CMU 2001. http://la.lti.cs.cmu.edu/callan/Workshops/lmir01/ .

• The Lemur Toolkit for Language Modeling and Information Retrieval. http://www-2.cs.cmu.edu/~lemur/ . CMU/Umass LM and IR system in C(++), currently actively developed.

### Introduction to Information Retrieval

Introduction to Information Retrieval

CS276: Information Retrieval and Web Search

Lecture 10: Text Classification;

The Naive Bayes algorithm

### Standing queries

• The path from IR to text classification:

• You have an information need to monitor, say:

• Unrest in the Niger delta region

• You want to rerun an appropriate query periodically to find new news items on this topic

• You will be sent new documents that are found

• I.e., it’s not ranking but classification (relevant vs. not relevant)

• Such queries are called standing queries

• Long used by “information professionals”

• A modern mass instantiation is Google Alerts

• Standing queries are (hand-written) text classifiers

### Spam filteringAnother text classification task

• From: ""<takworlld@hotmail.com>

• Subject: real estate is the only way... gem oalvgkay

• Anyone can buy real estate with no money down

• Stop paying rent TODAY !

• There is no need to spend hundreds or even thousands for similar courses

• I am 22 years old and I have already purchased 6 properties using the

• methods outlined in this truly INCREDIBLE ebook.

• Change your life NOW !

• =================================================

• Click Below to order:

• http://www.wholesaledaily.com/sales/nmd.htm

• =================================================

### Text classification

• Today:

• Introduction to Text Classification

• Also widely known as “text categorization”

• Same thing

• Naïve Bayes text classification

• Including a little on Probabilistic Language Models

### Categorization/Classification

• Given:

• A description of an instance, d ∈ X

• X is the instance language or instance space.

• Issue: how to represent text documents.

• Usually some type of high-dimensional space – bag of words

• A fixed set of classes:

• C = {c1, c2,…, cJ}

• Determine:

• The category of d: γ(d) ∈ C, where γ(d) is a classification function whose domain is X and whose range is C.

• We want to know how to build classification functions (“classifiers”).

### Machine Learning:Supervised Classification

• Given:

• A description of an instance, d ∈ X

• X is the instance language or instance space.

• A fixed set of classes:

• C = {c1, c2,…, cJ}

• A training set D of labeled documents with each labeled document ⟨d,c⟩ ∈ X×C

• Determine:

• A learning method or algorithm which will enable us to learn a classifier γ:X→C

• For a test document d, we assign it the class γ(d) ∈ C

### Document Classification

• (Note: in real life there is often a hierarchy, not present in the above problem statement; and also, you get papers on ML approaches to Garb. Coll.)

### More Text Classification ExamplesMany search engine functionalities use classification

• Assigning labels to documents or web-pages:

• Labels are most often topics such as Yahoo-categories

• "finance," "sports," "news>world>asia>business"

• Labels may be genres

• "editorials" "movie-reviews" "news”

• Labels may be opinion on a person/product

• “like”, “hate”, “neutral”

• Labels may be domain-specific

• "interesting-to-me" : "not-interesting-to-me”

• “contains adult language” : “doesn’t”

• language identification: English, French, Chinese, …

• search vertical: about Linux versus not

• “link spam” : “not link spam"

### Classification Methods (1)

• Manual classification

• Used by the original Yahoo! Directory

• Looksmart, about.com, ODP, PubMed

• Very accurate when job is done by experts

• Consistent when the problem size and team is small

• Difficult and expensive to scale

• Means we need automatic classification methods for big problems

### Classification Methods (2)

• Hand-coded rule-based classifiers

• One technique used by CS dept’s spam filter, Reuters, CIA, etc.

• It’s what Google Alerts is doing

• Widely deployed in government and enterprise

• Companies provide “IDE” for writing such rules

• E.g., assign category if document contains a given boolean combination of words

• Commercial systems have complex query languages (everything in IR query languages +score accumulators)

• Accuracy is often very high if a rule has been carefully refined over time by a subject expert

• Building and maintaining these rules is expensive

### A Verity topic A complex classification rule

 Note:maintenance issues (author, etc.)Hand-weighting of terms[Verity was bought by Autonomy.]

### Classification Methods (3)

• Supervised learning of a document-label assignment function

• Many systems partly or wholly rely on machine learning (Autonomy, Microsoft, Enkata, Yahoo!, …)

• k-Nearest Neighbors (simple, powerful)

• Naive Bayes (simple, common method)

• Support-vector machines (new, generally more powerful)

• … plus many other methods

• No free lunch: requires hand-classified training data

• But data can be built up (and refined) by amateurs

• Many commercial systems use a mixture of methods

### Relevance feedback

• In relevance feedback, the user marks a few documents as relevant/nonrelevant

• The choices can be viewed as classes or categories

• The IR system then uses these judgments to build a better model of the information need

• So, relevance feedback can be viewed as a form of text classification (deciding between several classes)

### Probabilistic relevance feedback

• Rather than reweighting in a vector space…

• If user has told us some relevant and some nonrelevant documents, then we can proceed to build a probabilistic classifier

• such as the Naive Bayes model we will look at today:

• P(tk|R) = |Drk| / |Dr|

• P(tk|NR) = |Dnrk| / |Dnr|

• tk is a term; Dr is the set of known relevant documents; Drk is the subset that contain tk; Dnr is the set of known nonrelevant documents; Dnrk is the subset that contain tk.

### Recall a few probability basics

• For events a and b:

• Bayes’ Rule

• Odds:

### Bayesian Methods

• Learning and classification methods based on probability theory

• Bayes theorem plays a critical role

• Builds a generative model that approximates how data is produced

• Has prior probability of each category given no information about an item.

• Model produces a posterior probability

• Distribution over the possible categories given an item

• Naïve Bayes methods use a bag of words as the item description

### Bayes’ Rule for text classification

For a document d and a class c

### Naive Bayes Classifiers

• Task: Classify a new instance d based on a tuple of attribute values into one of the classes cj ∈ C

### d={x1,x2,......,xn}

• MAP is “maximum a posteriori” = most likely class

### Naïve Bayes Classifier: Naïve Bayes Assumption

• P(cj)

• Can be estimated from the frequency of classes in the training examples.

• P(x1,x2,…,xn|cj)

• O(|X|n•|C|) parameters

• Could only be estimated if a very, very large number of training examples was available.

• Naïve Bayes Conditional Independence Assumption:

• Assume that the probability of observing the conjunction of attributes is equal to the product of the individual probabilities P(xi|cj).

### The Multivariate Bernoulli NB Classifier [Like Prob IR BIM; not language model; less used]

• Conditional Independence Assumption: features detect term presence and are independent of each other given the class:

• This model is appropriate for binary variables
• Multivariate Bernoulli model

### Learning the Model

• First attempt: maximum likelihood estimates

• Simply use the frequencies in the data

### Problem with Maximum Likelihood

• What if we have seen no training documents with the word muscle-ache and classified in the topic Flu?

• Zero probabilities cannot be conditioned away, no matter the other evidence!

### Smoothing to Avoid Overfitting

• Somewhat more subtle version

### Stochastic Language Models

• Model probability of generating any string

### Stochastic Language Models

• Model probability of generating strings (each word in turn) in a language (commonly all strings over alphabet ∑). E.g., a unigram model

### Unigram and higher-order models

• Unigram Language Models (Easy , effective !)
• Bigram (generally, n-gram) Language Models
• Other Language Models
• Grammar-based models (PCFGs), etc.
• Probably not the first thing to try in IR
• ### Naïve Bayes via a class conditional language model = multinomial NB

• The probability of the words is done as a class-specific unigram language model

### Using Multinomial Naive Bayes Classifiers to Classify Text: Basic method

• Attributes are text positions, values are words.

• Still too many possibilities

• Assume that classification is independent of the positions of the words

• Use same parameters for each position

• Result is bag of words model (over tokens not types)

### Naive Bayes and Language Modeling

• Naïve Bayes classifiers can use any sort of feature

• URL, email address, dictionaries, network features

• But if, as in the previous slides

• We use only word features

• we use all of the words in the text (not a subset)

• Then

• Naïve Bayes is basically the same as language modeling

### Multinomial Naive Bayes: Learning

• From training corpus, extract Vocabulary
• Calculate required P(cj) and P(xk | cj) terms
• For each cj in C do
• docsj ← subset of documents for which the target class is cj
• Textj ← single document containing all docsj
• for each word xk in Vocabulary
• nk ← number of occurrences of xk in Textj

### Naive Bayes: Classifying

• positions ← all word positions in current document

which contain tokens found in Vocabulary

• Return cNB, where

### Naive Bayes: Time Complexity

• Training Time: O(|D|Lave + |C||V|)) where Lave is the average length of a document in D.

• Assumes all counts are pre-computed in O(|D|Lave) time during one pass through all of the data.                          ← Why ?

• Generally just O(|D|Lave) since usually |C||V| < |D|Lave

• Test Time: O(|C| Lt) where Lt is the average length of a test document.

• Very efficient overall, linearly proportional to the time needed to just read in all the data.

### Underflow Prevention: using logs

• Multiplying lots of probabilities, which are between 0 and 1 by definition, can result in floating-point underflow.

• Since log(xy) = log(x) + log(y), it is better to perform all computations by summing logs of probabilities rather than multiplying probabilities.

• Class with highest final un-normalized log probability score is still the most probable.

• Note that model is now just max of sum of weights…

### Example

 Doc Words Class Training 1 Chinese Beijing Chinese c 2 Chinese Chinese Shanghai c 3 Chinese Macao c 4 Tokyo Japan Chinese j Test 5 Chinese Chinese Chinese Tokyo Japan ?

### Two Naive Bayes Models

• Model 1: Multivariate Bernoulli

• One feature Xw for each word in dictionary

• for loop iterates over dictionary

• Xw = true in document d if w appears in d

• Naive Bayes assumption:

• Given the document’s topic, appearance of one word in the document tells us nothing about chances that another word appears

• This is the model used in the binary independence model in classic probabilistic relevance feedback on hand-classified data

### Two Models

• Model 2: Multinomial = Class conditional unigram

• One feature Xi for each word pos in document

• feature’s values are all words in dictionary

• Value of Xi is the word in position i

• Naïve Bayes assumption:

• Given the document’s topic, word in one position in the document tells us nothing about words in other positions

• Second assumption:

• Word appearance does not depend on position

P(Xi = w | c) = P(Xj = w | c

• Just have one multinomial feature predicting all words

• for all positions i,j, word w, and class c

### Parameter estimation

Multivariate Bernoulli model:

 fraction of documents of topic cj in which word w appears

Multinomial model:

 fraction of times in which word w appears among all words in documents of topic cj

• Can create a mega-document for topic j by concatenating all documents in this topic
• Use frequency of w in mega-document

### Which to use for classification?

• Multinomial vs Multivariate Bernoulli?

• Multinomial model is almost always more effective in text applications!

• See results figures later

• There has been exploration of multinomial naïve bayes variants which often work better in practice

• Binarized multinomial Naïve Bayes, etc.

• Topic of PA4

### Feature Selection: Why?

• Text collections have a large number of features

• 10,000 – 1,000,000 unique words … and more

• May make using a particular classifier feasible

• Some classifiers can’t deal with 1,000,000 features

• Reduces training time

• Training time for some methods is quadratic or worse in the number of features

• Makes runtime models smaller and faster

• Can improve generalization (performance)

• Eliminates noise features

• Avoids overfitting

### Feature Selection: How?

• Two ideas:

• Hypothesis testing statistics:

• Are we confident that the value of one categorical variable is associated with the value of another

• Chi-square test (χ2)

• Information theory:

• How much information does the value of one categorical variable give you about the value of another

• Mutual information

• They’re similar, but χ2 measures confidence in association, (based on available statistics), while MI measures extent of association (assuming perfect knowledge of probabilities)

### |2 statistic (CHI)

• |2 is interested in (fo – fe)2/fe summed over all table entries: is the observed number what you’d expect given the marginals?

• The null hypothesis is rejected with confidence .999,

• since 12.9 > 10.83 (the value for .999 confidence).

### |2 statistic (CHI)

• There is a simpler formula for 2x2 χ2:

• N = A + B + C + D

• Value for complete independence of term and category?

### Feature selection via Mutual Information

• In training set, choose k words which best discriminate (give most info on) the categories.

• The Mutual Information between a word, class is:

• For each word w and each category c

### Feature selection via MI (contd.)

• For each category we build a list of k most discriminating terms.

• For example (on 20 Newsgroups):

• sci.electronics: circuit, voltage, amp, ground, copy, battery, electronics, cooling, …

• rec.autos: car, cars, engine, ford, dealer, mustang, oil, collision, autos, tires, toyota, …

• Greedy: does not account for correlations between terms

• Why?

### Feature Selection

• Mutual Information

• Clear information-theoretic interpretation

• May select rare uninformative terms

• Chi-square

• Statistical foundation

• May select very slightly informative frequent terms that are not very useful for classification

### Feature Selection: Frequency

• The simplest feature selection method:

• Just use the commonest terms

• No particular foundation

• But it make sense why this works

• They’re the words that can be well-estimated and are most often available as evidence

• In practice, this is often 90% as good as better methods

### Feature selection for NB

• In general feature selection is necessary for multivariate Bernoulli NB.

• Otherwise you suffer from noise, multi-counting

• “Feature selection” really means something different for multinomial NB. It means dictionary truncation

• The multinomial NB model only has 1 feature

• This “feature selection” normally isn’t needed for multinomial NB, but may help a fraction with quantities that are badly estimated

### Evaluating Categorization

• Evaluation must be done on test data that are independent of the training data (usually a disjoint set of instances).

• Sometimes use cross-validation (averaging results over multiple training and test splits of the overall data)

• It’s easy to get good performance on a test set that was available to the learner during training (e.g., just memorize the test set).

• Measures: precision, recall, F1, classification accuracy

• Classification accuracy: c/n where n is the total number of test instances and c is the number of test instances correctly classified by the system.

• Adequate if one class per document

• Otherwise F measure for each class

### WebKB Experiment (1998)

• Classify webpages from CS departments into:

• student, faculty, course,project

• Train on ~5,000 hand-labeled web pages

 Cornell, Washington, U.Texas, WisconsinCrawl and classify a new site (CMU)
• Results:

### SpamAssassin

• Naïve Bayes has found a home in spam filtering

• Paul Graham’s A Plan for Spam

• A Naive Bayes-like classifier with weird parameter estimation

• Widely used in spam filters

• But many features beyond words:

• black hole lists, etc.

• particular hand-crafted text patterns

### Naïve Bayes in Spam Filtering

• SpamAssassin Features:

• Basic (Naïve) Bayes spam probability

• Mentions: Generic Viagra

• Mentions millions of (dollar) ((dollar) NN,NNN,NNN.NN)

• Phrase: impress ... girl

• Phrase: 'Prestigious Non-Accredited Universities’

• From: starts with many numbers

• Subject is all capitals

• HTML has a low ratio of text to image area

• Relay in RBL, http://www.mail-abuse.com/enduserinfo_rbl.html

• RCVD line looks faked

• http://spamassassin.apache.org/tests_3_3_x.html

### Violation of NB Assumptions

• The independence assumptions do not really hold of documents written in natural language.

• Conditional independence

• Positional independence

• Examples?

### Naïve Bayes Posterior Probabilities

• Classification results of naïve Bayes (the class with maximum posterior probability) are usually fairly accurate.

• However, due to the inadequacy of the conditional independence assumption, the actual posterior-probability numerical estimates are not.

• Output probabilities are commonly very close to 0 or 1.

• Correct estimation ⇒ accurate prediction, but correct probability estimation is NOT necessary for accurate prediction (just need right ordering of probabilities)

### Naive Bayes is Not So Naiventitled

• Very Fast Learning and Testing (basically just count the data)

• Low Storage requirements

• Very good in domains with many equally important features

• More robust to irrelevant features than many learning methods

• Irrelevant Features cancel each other without affecting results

• More robust to concept drift (changing class definition over time)

• Naive Bayes won 1st and 2nd place in KDD-CUP 97 competition out of 16 systems

• Goal: Financial services industry direct mail response prediction: Predict if the recipient of mail will actually respond to the advertisement – 750,000 records.

• A good dependable baseline for text classification (but not the best)!

### Resources for today’s lecture

• IIR 13

• Fabrizio Sebastiani. Machine Learning in Automated Text Categorization. ACM Computing Surveys, 34(1):1-47, 2002.

• Yiming Yang & Xin Liu, A re-examination of text categorization methods. Proceedings of SIGIR, 1999.

• Andrew McCallum and Kamal Nigam. A Comparison of Event Models for Naive Bayes Text Classification. In AAAI/ICML-98 Workshop on Learning for Text Categorization, pp. 41-48.

• Tom Mitchell, Machine Learning. McGraw-Hill, 1997.

• Clear simple explanation of Naïve Bayes

• Open Calais: Automatic Semantic Tagging

• Free (but they can keep your data), provided by Thompson/Reuters (ex-ClearForest)

• Weka: A data mining software package that includes an implementation of Naive Bayes

• Reuters-21578 – the most famous text classification evaluation set

• Still widely used by lazy people (but now it’s too small for realistic experiments – you should use Reuters RCV1)

### CS276: Information Retrieval and Web Search

Pandu Nayak and Prabhakar Raghavan

Lecture 11: Text Classification;

Vector space classification

[Borrows slides from Ray Mooney]

### Recap: Naïve Bayes classifiers

• Classify based on prior weight of class and conditional parameter for what each word says:

• Training is done by counting and dividing:

• Don’t forget to smooth

### The rest of text classification

• Today:

• Vector space methods for Text Classification

• Vector space classification using centroids (Rocchio)

• K Nearest Neighbors

• Decision boundaries, linear and nonlinear classifiers

• Dealing with more than 2 classes

• Later in the course

• More text classification

• Support Vector Machines

• Text-specific issues in classification

### Recall: Vector Space Representation

• Each document is a vector, one component for each term (= word).

• Normally normalize vectors to unit length.

• High-dimensional vector space:

• Terms are axes

• 10,000+ dimensions, or even 100,000+

• Docs are vectors in this space

• How can we do classification in this space?

### Classification Using Vector Spaces

• As before, the training set is a set of documents, each labeled with its class (e.g., topic)

• In vector space classification, this set corresponds to a labeled set of points (or, equivalently, vectors) in the vector space

• Premise 1: Documents in the same class form a contiguous region of space

• Premise 2: Documents from different classes don’t overlap (much)

• We define surfaces to delineate classes in the space

### Using Rocchio for text classification

• Relevance feedback methods can be adapted for text categorization

• As noted before, relevance feedback can be viewed as 2-class classification

• Relevant vs. nonrelevant documents

• Use standard tf-idf weighted vectors to represent text documents

• For training documents in each category, compute a prototype vector by summing the vectors of the training documents in the category.

• Prototype = centroid of members of class

• Assign test documents to the category with the closest prototype vector based on cosine similarity.

### Definition of centroid

• Where Dc is the set of all documents that belong to class c and v(d) is the vector space representation of d.

• Note that centroid will in general not be a unit vector even when the inputs are unit vectors.

### Rocchio Properties

• Forms a simple generalization of the examples in each class (a prototype).

• Prototype vector does not need to be averaged or otherwise normalized for length since cosine similarity is insensitive to vector length.

• Classification is based on similarity to class prototypes.

• Does not guarantee classifications are consistent with the given training data.

Why not?

### Rocchio Anomaly

• Prototype models have problems with polymorphic (disjunctive) categories.

### Rocchio classification

• Rocchio forms a simple representation for each class: the centroid/prototype

• Classification is based on similarity to / distance from the prototype/centroid

• It does not guarantee that classifications are consistent with the given training data

• It is little used outside text classification

• It has been used quite effectively for text classification

• But in general worse than Naïve Bayes

• Again, cheap to train and test documents

### Nearest-Neighbor Learning Algorithm

• Learning is just storing the representations of the training examples in D.

• Testing instance x (under 1NN):

• Compute similarity between x and all examples in D.

• Assign x the category of the most similar example in D.

• Does not explicitly compute a generalization or category prototypes.

• Also called:

• Case-based learning

• Memory-based learning

• Lazy learning

• Rationale of kNN: contiguity hypothesis

### kNN Is Close to Optimal

• Cover and Hart (1967)

• Asymptotically, the error rate of 1-nearest-neighbor classification is less than twice the Bayes rate [error rate of classifier knowing model that generated data]

• In particular, asymptotic error rate is 0 if Bayes rate is 0.

• Assume: query point coincides with a training point.

• Both query point and training point contribute error → 2 times Bayes rate

### k Nearest Neighbor

• Using only the closest example (1NN) to determine the class is subject to errors due to:

• A single atypical example.

• Noise (i.e., an error) in the category label of a single training example.

• More robust alternative is to find the k most-similar examples and return the majority category of these k examples.

• Value of k is typically odd to avoid ties; 3 and 5 are most common.

### k Nearest Neighbor Classification

• kNN = k Nearest Neighbor

• To classify a document d into class c:

• Define k-neighborhood N as k nearest neighbors of d

• Count number of documents i in N that belong to c

• Estimate P(c|d) as i/k

• Choose as class argmaxc P(c|d) [ = majority class]

### Similarity Metrics

• Nearest neighbor method depends on a similarity (or distance) metric.

• Simplest for continuous m-dimensional instance space is Euclidean distance.

• Simplest for m-dimensional binary instance space is Hamming distance (number of feature values that differ).

• For text, cosine similarity of tf.idf weighted vectors is typically most effective.

### 3 Nearest Neighbor vs. Rocchio

• Nearest Neighbor tends to handle polymorphic categories better than Rocchio/NB.

### Nearest Neighbor with Inverted Index

• Naively, finding nearest neighbors requires a linear search through |D| documents in collection

• But determining k nearest neighbors is the same as determining the k best retrievals using the test document as a query to a database of training documents.

• Use standard vector space inverted index methods to find the k nearest neighbors.

• Testing Time: O(B|Vt|) where B is the average number of training documents in which a test-document word appears.

• Typically B << |D|

### kNN: Discussion

• No feature selection necessary

• Scales well with large number of classes

• Don’t need to train n classifiers for n classes

• Classes can influence each other

• Small changes to one class can have ripple effect

• Scores can be hard to convert to probabilities

• No training necessary

• Actually: perhaps not true. (Data editing, etc.)

• May be expensive at test time

• In most cases it’s more accurate than NB or Rocchio

### kNN vs. Naive Bayes

• Bias/Variance tradeoff

• Variance ≈ Capacity

• kNN has high variance and low bias.

• Infinite memory

• NB has low variance and high bias.

• Decision surface has to be linear (hyperplane – see later)

• Consider asking a botanist: Is an object a tree?

• Too much capacity/variance, low bias

• Botanist who memorizes

• Will always say “no” to new object (e.g., different # of leaves)

• Not enough capacity/variance, high bias

• Lazy botanist

• Says “yes” if the object is green

• You want the middle ground

(Example due to C. Burges)

### Linear classifiers and binary and multiclass classification

• Consider 2 class problems

• Deciding between two classes, perhaps, government and non-government

• One-versus-rest classification

• How do we define (and find) the separating surface?

• How do we decide which region a test doc is in?

### Separation by Hyperplanes

• A strong high-bias assumption is linear separability:

• in 2 dimensions, can separate classes by a line

• in higher dimensions, need hyperplanes
• Can find separating hyperplane by linear programming

• (or can iteratively fit solution via perceptron):

• separator can be expressed as ax + by = c

### Linear programming / Perceptron

• Find a,b,c, such that
• ax + by > c for red points

• ax + by < c for blue points.

### Which Hyperplane?

• In general, lots of possible solutions for a,b,c.

### Which Hyperplane?

• Lots of possible solutions for a,b,c.

• Some methods find a separating hyperplane, but not the optimal one [according to some criterion of expected goodness]

• E.g., perceptron

• Most methods find an optimal separating hyperplane

 Which points should influence optimality?All pointsLinear/logistic regressionNaïve BayesOnly “difficult points” close to decision boundarySupport vector machines

### Linear classifier: Example

• Class: “interest” (as in interest rate)

• Example features of a linear classifier

wi     ti                                              wi     ti

0.70 prime                                        −0.71 dlrs

0.67 rate                                           −0.35 world

0.63 interest                                     −0.33 sees

0.60 rates                                          −0.25 year

0.46 discount                                    −0.24 group

0.43 bundesbank                               −0.24 dlr

• To classify, find dot product of feature vector and weights

### Linear Classifiers

• Many common text classifiers are linear classifiers

• Naïve Bayes

• Perceptron

• Rocchio

• Logistic regression

• Support vector machines (with linear kernel)

• Linear regression with threshold

• Despite this similarity, noticeable performance differences

• For separable problems, there is an infinite number of separating hyperplanes. Which one do you choose?

• What to do for non-separable problems?

• Different training methods pick different hyperplanes

• Classifiers more powerful than linear often don’t perform better on text problems. Why?

### Two-class Rocchio as a linear classifier

• Line or hyperplane defined by:

• For Rocchio, set:

• [Aside for ML/stats people: Rocchio classification is a simplification of the classic Fisher Linear Discriminant where you don’t model the variance (or assume it is spherical).]

### Naive Bayes is a linear classifier

• Two-class Naive Bayes. We compute:

• Decide class C if the odds is greater than 1, i.e., if the log odds is greater than 0.

• So decision boundary is hyperplane:

### A nonlinear problem

• A linear classifier like Naïve Bayes does badly on this task

• kNN will do very well (assuming enough training data)

### High Dimensional Data

 Pictures like the one at right are absolutely misleading!Documents are zero along almost all axesMost document pairs are very far apart (i.e., not strictly orthogonal, but only share very common words and a few scattered others)In classification terms: often document sets are separable, for most any classificationThis is part of why linear classifiers are quite successful in this domain

### More Than Two Classes

• Any-of or multivalue classification

• Classes are independent of each other.

• A document can belong to 0, 1, or >1 classes.

• Decompose into n binary problems

• Quite common for documents

• One-of or multinomial or polytomous classification

• Classes are mutually exclusive.

• Each document belongs to exactly one class

• E.g., digit recognition is polytomous classification

• Digits are mutually exclusive

### Set of Binary Classifiers: Any of

• Build a separator between each class and its complementary set (docs from all other classes).

• Given test doc, evaluate it for membership in each class.

• Apply decision criterion of classifiers independently

• Done

• Though maybe you could do better by considering dependencies between categories

### Set of Binary Classifiers: One of

• Build a separator between each class and its complementary set (docs from all other classes).

• Given test doc, evaluate it for membership in each class.

• Assign document to class with:

 maximum scoremaximum confidencemaximum probability

### Summary: Representation ofText Categorization Attributes

• Representations of text are usually very high dimensional (one feature for each word)

• High-bias algorithms that prevent overfitting in high-dimensional space should generally work best*

• For most text categorization tasks, there are many relevant features and many irrelevant ones

• Methods that combine evidence from many or all features (e.g. naive Bayes, kNN) often tend to work better than ones that try to isolate just a few relevant features*

*Although the results are a bit more mixed than often thought

### Which classifier do I use for a given text classification problem?

• Is there a learning method that is optimal for all text classification problems?

• No, because there is a tradeoff between bias and variance.

• Factors to take into account:

• How much training data is available?

• How simple/complex is the problem? (linear vs. nonlinear decision boundary)

• How noisy is the data?

• How stable is the problem over time?

• For an unstable problem, it’s better to use a simple and robust classifier.

### Resources for today’s lecture

• IIR 14

• Fabrizio Sebastiani. Machine Learning in Automated Text Categorization. ACM Computing Surveys, 34(1):1-47, 2002.

• Yiming Yang & Xin Liu, A re-examination of text categorization methods. Proceedings of SIGIR, 1999.

• Trevor Hastie, Robert Tibshirani and Jerome Friedman, Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer-Verlag, New York.

• Open Calais: Automatic Semantic Tagging

• Free (but they can keep your data), provided by Thompson/Reuters

• Weka: A data mining software package that includes an implementation of many ML algorithms