### CS276A

Text Retrieval and Mining

Lecture 12

[Borrows slides from Victor Lavrenko and Chengxiang Zhai]

### Recap

• Probabilistic models: Naïve Bayes Text Classification

• Introduction to Text Classification

• Probabilistic Language Models

• Naïve Bayes text categorization

### Today

• The Language Model Approach to IR

• Basic query generation model

• Alternative models

### IR based on Language Model (LM)

• A common search heuristic is to use words that you expect to find in matching documents as your query. (Why? I saw Sergey Brin advocating that strategy on late-night TV one night in my hotel room, so it must be good!)
• The LM approach directly exploits that idea!

### Formal Language (Model)

• Traditional generative model: generates strings

• Finite state machines or regular grammars, etc.

• Example: a machine that repeatedly generates "I wish"

 I wish
 I wish I wish
 I wish I wish I wish
 …

 (but not: *wish I wish; the machine cannot generate strings beginning with "wish")

### Stochastic Language Models

• A statistical model for generating text

• Models the probability of generating strings in the language (commonly all strings over alphabet ∑)

• Probability distribution over strings in a given language

### Unigram and higher-order models

• Unigram Language Models (easy, effective!)

• Bigram (generally, n-gram) Language Models

• Other Language Models

• Grammar-based models (PCFGs), etc.

• Probably not the first thing to try in IR

### Using Language Models in IR

• Treat each document as the basis for a model (e.g., unigram sufficient statistics)

• Rank document d based on P(d | q)

• P(d | q) = P(q | d) x P(d) / P(q)

• P(q) is the same for all documents, so ignore

• P(d) [the prior] is often treated as the same for all d

• But we could use criteria like authority, length, genre

• P(q | d) is the probability of q given d’s model

• Very general formal approach
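
As a concrete illustration (not from the original slides), here is a minimal Python sketch of this ranking rule: score each document by log P(q|d) + log P(d), with a uniform prior by default and an unsmoothed maximum-likelihood unigram estimate standing in for d's model (smoothing comes later in the lecture):

```python
import math
from collections import Counter

def log_p_query_given_doc(query_terms, doc_tokens):
    """Log P(q|d) under a maximum-likelihood unigram model of d.
    Returns -inf if any query term is missing from d: the
    zero-probability problem that smoothing will fix."""
    counts = Counter(doc_tokens)
    n = len(doc_tokens)
    score = 0.0
    for t in query_terms:
        if counts[t] == 0:
            return float("-inf")
        score += math.log(counts[t] / n)
    return score

def rank(query_terms, docs, log_prior=None):
    """Rank documents by log P(q|d) + log P(d); uniform P(d) if no prior."""
    return sorted(
        ((log_p_query_given_doc(query_terms, toks)
          + (0.0 if log_prior is None else log_prior[name]), name)
         for name, toks in docs.items()),
        reverse=True)
```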

### The fundamental problem of LMs

• Usually we don’t know the model M

• But have a sample of text representative of that model

• Estimate a language model from a sample

• Then compute the observation probability

### Language Models for IR

• Language Modeling Approaches

• Attempt to model query generation process

• Documents are ranked by the probability that a query would be observed as a random sample from the respective document model

• Multinomial approach

### Retrieval based on probabilistic LM

• Treat the generation of queries as a random process.

• Approach

• Infer a language model for each document.

• Estimate the probability of generating the query according to each of these models.

• Rank the documents according to these probabilities.

• Usually a unigram estimate of words is used

• Some work on bigrams, paralleling van Rijsbergen

### Retrieval based on probabilistic LM

• Intuition

• Users …

• Have a reasonable idea of terms that are likely to occur in documents of interest.

• They will choose query terms that distinguish these documents from others in the collection.

• Collection statistics …

• Are integral parts of the language model.

• Are not used heuristically as in many other approaches.

• In theory. In practice, there’s usually some wiggle room for empirically set parameters

### Query generation probability (1)

• Ranking formula:

 P(Q, d) = P(d) · P(Q|d) ≈ P(d) · P(Q|Md)

• The probability of producing the query given the language model of document d using MLE is:

 P(Q|Md) = ∏_{t∈Q} Pmle(t|Md) = ∏_{t∈Q} tf(t,d) / dl(d)

 where tf(t,d) is the raw frequency of term t in d and dl(d) is the total number of tokens in d

### Insufficient data

• Zero probability: p(t|Md) = 0

• May not wish to assign a probability of zero to a document that is missing one or more of the query terms [gives conjunction semantics]

• General approach

• A non-occurring term is possible, but no more likely than would be expected by chance in the collection.

• If tf(t,d) = 0, use P(t|Md) = cf(t) / cs, where cf(t) is the count of t in the whole collection and cs is the total number of tokens in the collection

### Insufficient data

• Zero probabilities spell disaster

• We need to smooth probabilities

• Discount nonzero probabilities

• Give some probability mass to unseen things

• There’s a wide space of approaches to smoothing probability distributions to deal with this problem, such as adding 1, ½, or ε to counts, Dirichlet priors, discounting, and interpolation

• [See FSNLP ch. 6 or CS224N if you want more]

• A simple idea that works well in practice is to use a mixture between the document multinomial and the collection multinomial distribution

### Mixture model

• P(w|d) = λ·Pmle(w|Md) + (1 – λ)·Pmle(w|Mc)

• Mixes the probability from the document with the general collection frequency of the word.

• Correctly setting λ is very important

• A high value of λ makes the search “conjunctive-like” – suitable for short queries

• A low value is more suitable for long queries

• Can tune λ to optimize performance

• Perhaps make it dependent on document size (cf. Dirichlet prior or Witten-Bell smoothing)

### Basic mixture model summary

• General formulation of the LM for IR

• The user has a document in mind, and generates the query from this document.

• The equation represents the probability that the document that the user had in mind was in fact this one:

 P(Q, d) = P(d) · ∏_{t∈Q} [ λ·P(t|Md) + (1 – λ)·P(t|Mc) ]

 with Mc the general (collection) language model and Md the individual-document model

### Example

• Document collection (2 documents)

• d1: Xerox reports a profit but revenue is down

• d2: Lucent narrows quarter loss but revenue decreases further

• Model: MLE unigram from documents; λ = ½

• Query: revenue down

• P(Q|d1) = [(1/8 + 2/16)/2] × [(1/8 + 1/16)/2] = 1/8 × 3/32 = 3/256

• P(Q|d2) = [(1/8 + 2/16)/2] × [(0 + 1/16)/2] = 1/8 × 1/32 = 1/256

• Ranking: d1 > d2
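
A short Python sketch (not part of the original slides) that reproduces these numbers with the λ = ½ mixture:

```python
from collections import Counter

d1 = "xerox reports a profit but revenue is down".split()
d2 = "lucent narrows quarter loss but revenue decreases further".split()
collection = d1 + d2  # 16 tokens

def p_mix(term, doc, lam=0.5):
    """lam * Pmle(term|Md) + (1 - lam) * Pmle(term|Mc)."""
    return (lam * Counter(doc)[term] / len(doc)
            + (1 - lam) * Counter(collection)[term] / len(collection))

def p_query(query, doc, lam=0.5):
    prob = 1.0
    for t in query.split():
        prob *= p_mix(t, doc, lam)
    return prob

print(p_query("revenue down", d1))  # 0.01171875 = 3/256
print(p_query("revenue down", d2))  # 0.00390625 = 1/256
```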

### Ponte and Croft Experiments

• Data

• TREC topics 202-250 on TREC disks 2 and 3

• Natural language queries consisting of one sentence each

• TREC topics 51-100 on TREC disk 3 using the concept fields

• The concept fields are lists of good terms. Example topic: Satellite Launch Contracts

• Description:

• Concept(s):

1. Contract, agreement

2. Launch vehicle, rocket, payload, satellite

3. Launch services, …

[Precision/recall results tables for topics 202–250 and 51–100 not reproduced.]

### LM vs. Prob. Model for IR

• The main difference is whether “Relevance” figures explicitly in the model or not

• LM approach attempts to do away with modeling relevance

• LM approach assumes that documents and expressions of information problems are of the same type

• Computationally tractable, intuitively appealing


### Problems of basic LM approach

• Assumption of equivalence between document and information problem representation is unrealistic

• Very simple models of language

• Relevance feedback is difficult to integrate, as are user preferences, and other general issues of relevance

• Can’t easily accommodate phrases, passages, Boolean operators

• Current extensions focus on putting relevance back into the model, etc.


### Extension: 3-level model

• 3-level model

1. Whole collection model (Mc)

2. Specific-topic model; relevant-documents model (Mt)

3. Individual-document model (Md)

• Relevance hypothesis

• A request (query; topic) is generated from a specific-topic model {Mc, Mt}.

• Iff a document is relevant to the topic, the same model will apply to the document.

• It will replace part of the individual-document model in explaining the document.

• The probability of relevance of a document

• The probability that this model explains part of the document

• The probability that the {Mc, Mt, Md} combination is better than the {Mc, Md} combination

### Retrieval Using Language Models

• Retrieval: Query likelihood (1), Document likelihood (2), Model comparison (3)

### Query Likelihood

• P(Q|Dm)

• Major issue is estimating document model

• i.e. smoothing techniques instead of tf.idf weights

• Good retrieval results

• e.g. UMass, BBN, Twente, CMU

• Problems dealing with relevance feedback, query expansion, structured queries

### Document Likelihood

• Rank by likelihood ratio P(D|R)/P(D|NR)

• treat as a generation problem

• P(w|R) is estimated by P(w|Qm)

• Qm is the query or relevance model

• P(w|NR) is estimated by collection probabilities P(w)

• Issue is estimation of query model

• Treat query as generated by mixture of topic and background

• Estimate relevance model from related documents (query expansion)

• Relevance feedback is easily incorporated

• Good retrieval results

• e.g. UMass at SIGIR 01

• inconsistent with heterogeneous document collections
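
A minimal sketch of this likelihood-ratio ranking, assuming the query/relevance model Qm and the background probabilities arrive as term-to-probability dictionaries (and simplifying by skipping terms outside Qm rather than smoothing it):

```python
import math
from collections import Counter

def log_likelihood_ratio(doc_tokens, q_model, p_background):
    """Approximate log[P(D|R) / P(D|NR)]: each occurrence of w in D
    contributes log(P(w|Qm) / P(w)), with P(w|NR) estimated by the
    collection probability P(w)."""
    score = 0.0
    for w, count in Counter(doc_tokens).items():
        if q_model.get(w, 0.0) > 0:  # a real system would smooth Qm instead
            score += count * math.log(q_model[w] / p_background[w])
    return score
```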

### Model Comparison

• Estimate query and document models and compare

• Suitable measure is KL divergence D(Qm||Dm)

• equivalent to query-likelihood approach if simple empirical distribution used for query model

• More general risk minimization framework has been proposed

• Zhai and Lafferty 2001

• Better results than query-likelihood or document-likelihood approaches
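
For concreteness, a small sketch of the KL-divergence measure, assuming both models are dictionaries over the vocabulary and the document model is smoothed (nonzero wherever the query model is):

```python
import math

def kl_divergence(q_model, d_model):
    """D(Qm || Dm) = sum over w of P(w|Qm) * log(P(w|Qm) / P(w|Dm)).
    Rank documents by increasing divergence: smaller = better match."""
    return sum(p * math.log(p / d_model[w])
               for w, p in q_model.items() if p > 0)
```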

### Two-stage smoothing: Another reason for smoothing

• Consider a query such as q = “the algorithms for data mining” scored against two documents d1 and d2:

• p(“algorithms”|d1) = p(“algorithm”|d2)

• p(“data”|d1) < p(“data”|d2)

• p(“mining”|d1) < p(“mining”|d2)

• But p(q|d1) > p(q|d2)!

• The function words dominate: we should make p(“the”|d) and p(“for”|d) less different across documents.
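
A toy illustration of how interpolation with the collection model achieves this, using the simple mixture from earlier rather than the full two-stage method (the five-word documents are hypothetical, not from the slides): with a small λ, function words like “the” get nearly the same probability in every document, while content words still separate them:

```python
from collections import Counter

def p_mle(term, toks):
    return Counter(toks)[term] / len(toks)

def p_smooth(term, toks, coll, lam=0.1):
    # Small lam pulls every document toward the collection model, so
    # p("the"|d) varies little across documents.
    return lam * p_mle(term, toks) + (1 - lam) * p_mle(term, coll)

d1 = "the algorithms the for data".split()  # heavy on function words
d2 = "the data mining for mining".split()   # heavy on content words
coll = d1 + d2

for t in ["the", "for", "mining"]:
    print(t, p_mle(t, d1), p_mle(t, d2),
          p_smooth(t, d1, coll), p_smooth(t, d2, coll))
# MLE:      p("the"|d1)=0.4 vs p("the"|d2)=0.2 (2x gap from a function word)
# Smoothed: 0.31 vs 0.29 (nearly equal), while "mining" still separates d2
```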

### How can one do relevance feedback when using the language modeling approach?

• Introduce a query model & treat feedback as query model updating

• Retrieval function:

• Query-likelihood => KL-Divergence

• Feedback:

• Expansion-based => Model-based
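
One simple instantiation of model-based feedback, in the spirit of Zhai and Lafferty (the interpolation weight alpha is a hypothetical parameter): interpolate the original query model with a model estimated from the feedback documents, then rank by KL divergence as before:

```python
def update_query_model(q_model, feedback_model, alpha=0.5):
    """Model-based feedback: new Qm = (1 - alpha) * old Qm + alpha * Fm,
    where Fm is a term distribution estimated from feedback documents."""
    vocab = set(q_model) | set(feedback_model)
    return {w: (1 - alpha) * q_model.get(w, 0.0)
               + alpha * feedback_model.get(w, 0.0)
            for w in vocab}
```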

### Translation model (Berger and Lafferty)

• Basic LMs do not address issues of synonymy.

• Or any deviation in expression of information need from language of documents

• A translation model lets you generate query words not in document via “translation” to synonyms etc.

• Or to do cross-language IR, or multimedia IR

• Need to learn a translation model (using a dictionary or via statistical machine translation)
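
A sketch of the translation-style query likelihood, P(q|d) = ∏_i Σ_w t(q_i|w)·P(w|d), with a toy hand-built translation table standing in for a learned one (the words and probabilities below are invented for illustration):

```python
def p_query_translation(query_terms, d_model, t_table):
    """P(q|d) = product over query terms q_i of
    sum over document words w of t(q_i|w) * P(w|d)."""
    prob = 1.0
    for qi in query_terms:
        prob *= sum(t_table.get((qi, w), 0.0) * pw
                    for w, pw in d_model.items())
    return prob

# "car" in a document can now generate the query word "automobile".
d_model = {"car": 0.5, "insurance": 0.5}
t_table = {("automobile", "car"): 0.8, ("car", "car"): 0.9,
           ("insurance", "insurance"): 0.9}
print(p_query_translation(["automobile", "insurance"], d_model, t_table))  # 0.18
```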

### Language models: pro & con

• Novel way of looking at the problem of text retrieval based on probabilistic language modeling

• Conceptually simple and explanatory

• Formal mathematical model

• Natural use of collection statistics, not heuristics (almost…)

• LMs provide effective retrieval and can be improved to the extent that the following conditions can be met

• Our language models are accurate representations of the data.

• Users have some sense of term distribution.*

• *Or we get more sophisticated with a translation model

### Comparison With Vector Space

• There’s some relation to traditional tf.idf models:

• (unscaled) term frequency is directly in model

• the probabilities do length normalization of term frequencies

• the effect of doing a mixture with overall collection frequencies is a little like idf: terms rare in the general collection but common in some documents will have a greater influence on the ranking
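
This idf-like effect can be made explicit by taking logs of the mixture score and separating out the document-independent part (a standard decomposition, not spelled out in the original slides):

 log P(q|d) = Σ_{t∈q} log[ λ·P(t|Md) + (1 – λ)·P(t|Mc) ]
 = Σ_{t∈q∩d} log[ 1 + λ·P(t|Md) / ((1 – λ)·P(t|Mc)) ] + Σ_{t∈q} log[ (1 – λ)·P(t|Mc) ]

The second sum is identical for every document, so ranking depends only on the first: a term contributes most when it is frequent in the document (tf-like) but rare in the collection (idf-like).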

### Comparison With Vector Space

• Similar in some ways

• Term weights based on frequency

• Terms often used as if they were independent

• Inverse document/collection frequency used

• Some form of length normalization useful

• Different in others

• Based on probability rather than similarity

• Intuitions are probabilistic rather than geometric

• Details of use of document length and term, document, and collection frequency differ

### Resources

• J.M. Ponte and W.B. Croft. 1998. A language modeling approach to information retrieval. SIGIR 21.

• D. Hiemstra. 1998. A linguistically motivated probabilistic model of information retrieval. ECDL 2, pp. 569–584.

• A. Berger and J. Lafferty. 1999. Information retrieval as statistical translation. SIGIR 22, pp. 222–229.

• D.R.H. Miller, T. Leek, and R.M. Schwartz. 1999. A hidden Markov model information retrieval system. SIGIR 22, pp. 214–221.

• [Several relevant newer papers at SIGIR 23–25, 2000–2002.]

• Workshop on Language Modeling and Information Retrieval, CMU 2001. http://la.lti.cs.cmu.edu/callan/Workshops/lmir01/ .

• The Lemur Toolkit for Language Modeling and Information Retrieval. http://www-2.cs.cmu.edu/~lemur/ . CMU/UMass LM and IR system in C(++), currently under active development.