CS276A
Text Retrieval and Mining
Lecture 12
[Borrows slides from Viktor Lavrenko and Chengxiang Zhai]
Recap
Probabilistic models:
Naïve Bayes Text Classification
Introduction to Text Classification
Probabilistic Language Models
Naïve Bayes text categorization
Today
The Language Model Approach to IR
Basic query generation model
Alternative models
Standard Probabilistic IR

IR based on Language Model (LM)

A common search heuristic is to use words that you expect to find in matching documents as your query. Why? I saw Sergey Brin advocating that strategy on late night TV one night in my hotel room, so it must be good!
The LM approach directly exploits that idea!
Formal Language (Model)
Traditional generative model: generates strings
Finite state machines or regular grammars, etc.
Example: a simple finite-state machine over the words “I” and “wish” generates
I wish, I wish I wish, I wish I wish I wish, …
but not strings like *wish I wish (the * marks a string outside the language)
Stochastic Language Models
Models probability of generating strings in the language (commonly all strings over alphabet Σ)

Stochastic Language Models
Model probability of generating any string

Stochastic Language Models
A statistical model for generating text
Probability distribution over strings in a given language

Unigram and higher-order models



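To make the distinction concrete, here is a minimal sketch (the toy probabilities are made up, not from the lecture) contrasting a unigram model, which multiplies independent word probabilities, with a bigram model, which conditions each word on the previous one:

```python
# Toy models (probabilities are illustrative assumptions only).
unigram = {"i": 0.4, "wish": 0.35, "the": 0.25}      # P(w)
bigram = {("<s>", "i"): 0.9, ("i", "wish"): 0.8,     # P(w | previous word)
          ("wish", "i"): 0.7}

def p_unigram(words):
    p = 1.0
    for w in words:
        p *= unigram.get(w, 0.0)          # independent draws
    return p

def p_bigram(words):
    p, prev = 1.0, "<s>"                  # <s> marks the string start
    for w in words:
        p *= bigram.get((prev, w), 0.0)   # conditioned on previous word
        prev = w
    return p

print(p_unigram("i wish i wish".split()))  # 0.4*0.35*0.4*0.35 = 0.0196
print(p_bigram("i wish i wish".split()))   # 0.9*0.8*0.7*0.8  = 0.4032
```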
Using Language Models in IR
Treat each document as the basis for a model (e.g., unigram sufficient statistics)
Rank document d based on P(d | q)
P(d|q) = P(q|d) · P(d) / P(q)   (Bayes' rule)
P(q) is the same for all documents, so ignore
P(d) [the prior] is often treated as the same for all d
But we could use criteria like authority, length, genre
P(q | d) is the probability of q given d’s model
Very general formal approach
The fundamental problem of LMs
Usually we don’t know the model M
But have a sample of text representative of that model

Estimate a language model from a sample
Then compute the observation probability

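A minimal sketch of that two-step recipe (the function names are mine, not the lecture's): estimate a maximum-likelihood unigram model from a text sample, then score an observation under it.

```python
from collections import Counter

def mle_unigram(sample_text):
    """MLE unigram model from a text sample: P(w) = count(w) / total tokens."""
    tokens = sample_text.lower().split()
    total = len(tokens)
    return {w: c / total for w, c in Counter(tokens).items()}

def observation_prob(model, observation):
    """Probability of the observation as a random sample from the model."""
    p = 1.0
    for w in observation.lower().split():
        p *= model.get(w, 0.0)  # zero for unseen words (motivates smoothing)
    return p

m = mle_unigram("i wish i wish i wish")
print(observation_prob(m, "i wish"))  # P(i)=0.5, P(wish)=0.5, so 0.25
```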
Language Models for IR
Language Modeling Approaches
Attempt to model query generation process
Documents are ranked by the probability that a query would be observed as a random sample from the respective document model
Multinomial approach

Retrieval based on probabilistic LM
Treat the generation of queries as a random process.
Approach
Infer a language model for each document.
Estimate the probability of generating the query according to each of these models.
Rank the documents according to these probabilities.
Usually a unigram estimate of words is used
Some work on bigrams, paralleling van Rijsbergen
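As a sketch, the whole approach is then just a scoring loop over the collection (score_fn stands in for whichever query-likelihood estimate is used, e.g. the mixture scorer sketched later):

```python
def rank(query, docs, score_fn):
    """Rank documents by the probability that their model generates the query."""
    return sorted(docs, key=lambda d: score_fn(query, d), reverse=True)
```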
Retrieval based on probabilistic LM
Intuition
Users …
Have a reasonable idea of terms that are likely to occur in documents of interest.
They will choose query terms that distinguish these documents from others in the collection.
Collection statistics …
Are integral parts of the language model.
Are not used heuristically as in many other approaches.
In theory. In practice, there’s usually some wiggle room for empirically set parameters
Query generation probability (1)
Ranking formula:

P(q, d) = P(d) · P(q|d) ≈ P(d) · ∏_{t in q} P(t|Md)
The probability of producing the query given the language model of document d using MLE is:

P(q|Md) = ∏_{t in q} Pmle(t|Md) = ∏_{t in q} tf(t,d) / dl(d)

where tf(t,d) is the raw term frequency of t in d and dl(d) is the total number of tokens in d.
Insufficient data
Zero probability p(t | Md) = 0
May not wish to assign a probability of zero to a document that is missing one or more of the query terms [gives conjunction semantics]
General approach
A non-occurring term is possible, but no more likely than would be expected by chance in the collection.
If tf(t,d) = 0: estimate P(t|Md) as cf(t) / cs, i.e. the term's relative frequency in the collection as a whole (cf(t) = collection frequency of t, cs = total tokens in the collection)
Insufficient data
Zero probabilities spell disaster
We need to smooth probabilities
Discount nonzero probabilities
Give some probability mass to unseen things
There’s a wide space of approaches to smoothing probability distributions to deal with this problem, such as adding 1, ½, or ε to counts, Dirichlet priors, discounting, and interpolation
[See FSNLP ch. 6 or CS224N if you want more]
A simple idea that works well in practice is to use a mixture between the document multinomial and the collection multinomial distribution
Mixture model
P(w|d) = λ·Pmle(w|Md) + (1 − λ)·Pmle(w|Mc)
Mixes the probability from the document with the general collection frequency of the word.
Correctly setting λ is very important
A high value of λ makes the search “conjunctive-like”, suitable for short queries
A low value is more suitable for long queries
Can tune λ to optimize performance
Perhaps make it dependent on document size (cf. Dirichlet prior or Witten-Bell smoothing)
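A minimal sketch of this mixture scorer (the function and parameter names, and the default λ of ½, are my choices, not prescribed by the slides; collection_counts is assumed to be a Counter over all collection tokens):

```python
from collections import Counter

def lm_score(query, doc, collection_counts, collection_len, lam=0.5):
    """Query likelihood with document/collection mixture smoothing:
    P(q|d) = product over t in q of [lam*Pmle(t|Md) + (1-lam)*Pmle(t|Mc)]."""
    doc_counts, doc_len = Counter(doc), len(doc)
    score = 1.0
    for t in query:
        p_doc = doc_counts[t] / doc_len                 # Pmle(t|Md)
        p_col = collection_counts[t] / collection_len   # Pmle(t|Mc)
        score *= lam * p_doc + (1 - lam) * p_col
    return score
```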
Basic mixture model summary
General formulation of the LM for IR:

P(q, d) = P(d) · P(q|d) ≈ P(d) · ∏_{t in q} ((1 − λ)·P(t|Mc) + λ·P(t|Md))

The user has a document in mind, and generates the query from this document.
The equation represents the probability that the document that the user had in mind was in fact this one.
P(t|Mc) is the general (collection) language model; P(t|Md) is the individual-document model.
Example
Document collection (2 documents)
d1: Xerox reports a profit but revenue is down
d2: Lucent narrows quarter loss but revenue decreases further
Model: MLE unigram from documents; l = ½
Query: revenue down
P(Q|d1) = [(1/8 + 2/16)/2] x [(1/8 + 1/16)/2]
= 1/8 x 3/32 = 3/256
P(Q|d2) = [(1/8 + 2/16)/2] x [(0 + 1/16)/2]
= 1/8 x 1/32 = 1/256
Ranking: d1 > d2
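Plugging the example into the lm_score sketch from above (with its default λ = ½) reproduces these numbers:

```python
from collections import Counter

d1 = "xerox reports a profit but revenue is down".split()
d2 = "lucent narrows quarter loss but revenue decreases further".split()
collection = Counter(d1 + d2)           # 16 tokens in total
q = "revenue down".split()

print(lm_score(q, d1, collection, 16))  # 3/256 = 0.01171875
print(lm_score(q, d2, collection, 16))  # 1/256 = 0.00390625  -> d1 ranks first
```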
Ponte and Croft Experiments
Data
TREC topics 202-250 on TREC disks 2 and 3
Natural language queries consisting of one sentence each
TREC topics 51-100 on TREC disk 3 using the concept fields
Precision/recall results 202-250
Precision/recall results 51-100
LM vs. Prob. Model for IR
The main difference is whether “Relevance” figures explicitly in the model or not
LM approach attempts to do away with modeling relevance
LM approach assumes that documents and expressions of information problems are of the same type
Computationally tractable, intuitively appealing
LM vs. Prob. Model for IR
Problems of basic LM approach
Assumption of equivalence between document and information problem representation is unrealistic
Very simple models of language
Relevance feedback is difficult to integrate, as are user preferences and other general issues of relevance
Can’t easily accommodate phrases, passages, Boolean operators
Current extensions focus on putting relevance back into the model, etc.
Extension: 3-level model
3-level model
Whole collection model (Mc)
Specific-topic model; relevant-documents model (Mt)
Individual-document model (Md)
Relevance hypothesis
A request (query, topic) is generated from a specific-topic model {Mc, Mt}.
Iff a document is relevant to the topic, the same model will apply to the document.
It will replace part of the individual-document model in explaining the document.
The probability of relevance of a document
The probability that this model explains part of the document
The probability that the {Mc, Mt, Md} combination is better than the {Mc, Md} combination
3-level model

Alternative Models of Text Generation

Retrieval Using Language Models

Retrieval: Query likelihood (1), Document likelihood (2), Model comparison (3)
Query Likelihood
P(Q|Dm)
Major issue is estimating document model
i.e. smoothing techniques instead of tf.idf weights
Good retrieval results
e.g. UMass, BBN, Twente, CMU
Problems dealing with relevance feedback, query expansion, structured queries
Document Likelihood
Rank by likelihood ratio P(D|R)/P(D|NR)
treat as a generation problem
P(w|R) is estimated by P(w|Qm)
Qm is the query or relevance model
P(w|NR) is estimated by collection probabilities P(w)
Issue is estimation of query model
Treat query as generated by mixture of topic and background
Estimate relevance model from related documents (query expansion)
Relevance feedback is easily incorporated
Good retrieval results
e.g. UMass at SIGIR 01
inconsistent with heterogeneous document collections
Model Comparison
Estimate query and document models and compare

Suitable measure is KL divergence D(Qm||Dm)
equivalent to query-likelihood approach if simple empirical distribution used for query model
More general risk minimization framework has been proposed
Zhai and Lafferty 2001
Better results than query-likelihood or document-likelihood approaches
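A sketch of the comparison step (assuming both models are plain word-to-probability dicts, and that the document model has been smoothed so every query word has nonzero probability):

```python
import math

def kl_divergence(qm, dm):
    """D(Qm || Dm) = sum over w of Qm(w) * log(Qm(w) / Dm(w)).
    Documents are ranked by increasing divergence from the query model."""
    return sum(p * math.log(p / dm[w]) for w, p in qm.items() if p > 0)
```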
Two-stage smoothing: Another Reason for Smoothing
Consider a query like “the algorithms for data mining” and two documents d1 and d2:
p(“algorithms”|d1) = p(“algorithms”|d2)
p(“data”|d1) < p(“data”|d2)
p(“mining”|d1) < p(“mining”|d2)
But p(q|d1) > p(q|d2)! (d1 happens to assign higher probability to the common words “the” and “for”)
We should make p(“the”) and p(“for”) less different for all docs.
Two-stage Smoothing

p(w|d) = (1 − λ) · (c(w; d) + μ·p(w|C)) / (|d| + μ) + λ·p(w|U)
(stage 1: Dirichlet smoothing of the document model with the collection model C; stage 2: interpolation with a user/background model U)
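A sketch of the two stages (the default values of μ and λ are placeholders, and p_collection / p_user are assumed to be functions returning word probabilities):

```python
def two_stage_prob(w, doc_counts, doc_len, p_collection, p_user,
                   mu=1000.0, lam=0.1):
    """Two-stage smoothed document model:
    stage 1: Dirichlet smoothing of the document with the collection model;
    stage 2: interpolation with a user/background model, which evens out
             the probabilities of common words like "the" and "for"."""
    p_dirichlet = (doc_counts.get(w, 0) + mu * p_collection(w)) / (doc_len + mu)
    return (1 - lam) * p_dirichlet + lam * p_user(w)
```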
How can one do relevance feedback when using the language modeling approach?
Introduce a query model & treat feedback as query model updating
Retrieval function:
Query-likelihood => KL-Divergence
Feedback:
Expansion-based => Model-based
Expansion-based vs. Model-based

Feedback as Model Interpolation

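A minimal sketch of model-based feedback as interpolation (alpha is a hypothetical mixing weight; both models are word-to-probability dicts):

```python
def update_query_model(query_model, feedback_model, alpha=0.5):
    """Model-based feedback: new query model is the interpolation
    (1 - alpha) * original query model + alpha * feedback-document model."""
    words = set(query_model) | set(feedback_model)
    return {w: (1 - alpha) * query_model.get(w, 0.0)
               + alpha * feedback_model.get(w, 0.0)
            for w in words}
```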
Translation model (Berger and Lafferty)
Basic LMs do not address issues of synonymy.
Or any deviation in expression of information need from language of documents
A translation model lets you generate query words not in the document via “translation” to synonyms etc.
Or to do cross-language IR, or multimedia IR

P(q|d) = ∏_j Σ_w t(q_j|w) · P(w|Md)   (each query term q_j is generated by “translating” some document word w)
Need to learn a translation model (using a dictionary or via statistical machine translation)
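A sketch in the spirit of Berger and Lafferty (the translation table would come from a dictionary or statistical MT; the names here are illustrative assumptions):

```python
def translation_likelihood(query, doc_model, trans):
    """P(q|d) = product over query terms qj of
               sum over document words w of t(qj|w) * P(w|Md).
    A document can thus generate query words (synonyms, other languages)
    that it never literally contains.
    trans[(q_word, d_word)] is the learned translation probability."""
    p = 1.0
    for qj in query:
        p *= sum(trans.get((qj, w), 0.0) * p_w for w, p_w in doc_model.items())
    return p
```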
Language models: pro & con
Novel way of looking at the problem of text retrieval based on probabilistic language modeling
Conceptually simple and explanatory
Formal mathematical model
Natural use of collection statistics, not heuristics (almost…)
LMs provide effective retrieval and can be improved to the extent that the following conditions can be met
Our language models are accurate representations of the data.
Users have some sense of term distribution.*
*Or we get more sophisticated with translation model
Comparison With Vector Space
There’s some relation to traditional tf.idf models:
(unscaled) term frequency is directly in model
the probabilities do length normalization of term frequencies
the effect of doing a mixture with overall collection frequencies is a little like idf: terms rare in the general collection but common in some documents will have a greater influence on the ranking
Comparison With Vector Space
Similar in some ways
Term weights based on frequency
Terms often used as if they were independent
Inverse document/collection frequency used
Some form of length normalization useful
Different in others
Based on probability rather than similarity
Intuitions are probabilistic rather than geometric
Details of use of document length and term, document, and collection frequency differ
Resources
J.M. Ponte and W.B. Croft. 1998. A language modeling approach to information retrieval. In SIGIR 21.
D. Hiemstra. 1998. A linguistically motivated probabilistic model of information retrieval. ECDL 2, pp. 569–584.
A. Berger and J. Lafferty. 1999. Information retrieval as statistical translation. SIGIR 22, pp. 222–229.
D.R.H. Miller, T. Leek, and R.M. Schwartz. 1999. A hidden Markov model information retrieval system. SIGIR 22, pp. 214–221.
[Several relevant newer papers at SIGIR 23–25, 2000–2002.]
Workshop on Language Modeling and Information Retrieval, CMU 2001. http://la.lti.cs.cmu.edu/callan/Workshops/lmir01/
The Lemur Toolkit for Language Modeling and Information Retrieval. http://www-2.cs.cmu.edu/~lemur/ CMU/UMass LM and IR system in C(++), currently actively developed.