### Mixture model

• P(w|d) = λ Pmle(w|Md) + (1 – λ) Pmle(w|Mc)

• Mixes the probability from the document with the general collection frequency of the word.

• Correctly setting λ is very important

• A high value of λ makes the search “conjunctive-like” – suitable for short queries

• A low value is more suitable for long queries

• Can tune λ to optimize performance

• Perhaps make it dependent on document size (cf. Dirichlet prior or Witten-Bell smoothing)
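A minimal sketch of this mixture (Jelinek-Mercer smoothing) in Python, scoring in log space to avoid underflow on long queries; the function and variable names are illustrative:

```python
import math
from collections import Counter

def jm_log_score(query_terms, doc_terms, coll_terms, lam=0.5):
    """log P(Q|d) under the mixture lam * P_mle(w|Md) + (1 - lam) * P_mle(w|Mc)."""
    doc, coll = Counter(doc_terms), Counter(coll_terms)
    score = 0.0
    for w in query_terms:
        p = lam * doc[w] / len(doc_terms) + (1 - lam) * coll[w] / len(coll_terms)
        score += math.log(p)  # p > 0 as long as w occurs somewhere in the collection
    return score
```

A higher `lam` trusts the document model more (the conjunctive-like behavior noted above for short queries); a lower `lam` leans on the collection model, which suits long queries.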

### Basic mixture model summary

• General formulation of the LM for IR

• The user has a document in mind, and generates the query from this document.

• The equation represents the probability that the document that the user had in mind was in fact this one.

• Pmle(w|Mc): general language model

• Pmle(w|Md): individual-document model

### Example

• Document collection (2 documents)

• d1: Xerox reports a profit but revenue is down

• d2: Lucent narrows quarter loss but revenue decreases further

• Model: MLE unigram from documents; λ = ½

• Query: revenue down

• P(Q|d1) = [(1/8 + 2/16)/2] × [(1/8 + 1/16)/2]

• = 1/8 × 3/32 = 3/256

• P(Q|d2) = [(1/8 + 2/16)/2] × [(0 + 1/16)/2]

• = 1/8 × 1/32 = 1/256

• Ranking: d1 > d2
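The arithmetic above can be checked exactly with rational arithmetic; a small sketch using the two slide documents:

```python
from fractions import Fraction

d1 = "Xerox reports a profit but revenue is down".split()
d2 = "Lucent narrows quarter loss but revenue decreases further".split()
coll = d1 + d2          # 16 tokens total
lam = Fraction(1, 2)

def score(query, doc):
    """P(Q|d) under the half-and-half mixture of MLE document and collection models."""
    s = Fraction(1)
    for w in query:
        p_d = Fraction(doc.count(w), len(doc))     # MLE from the document
        p_c = Fraction(coll.count(w), len(coll))   # MLE from the collection
        s *= lam * p_d + (1 - lam) * p_c
    return s

print(score(["revenue", "down"], d1))  # 3/256
print(score(["revenue", "down"], d2))  # 1/256
```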

### Ponte and Croft Experiments

• Data

• TREC topics 202-250 on TREC disks 2 and 3

• Natural language queries consisting of one sentence each

• TREC topics 51-100 on TREC disk 3 using the concept fields

### Lists of good terms

• Topic: Satellite Launch Contracts

• Description:

• Concept(s):

1. Contract, agreement

2. Launch vehicle, rocket, payload, satellite

3. Launch services, …

### Precision/recall results, topics 202-250

### Precision/recall results, topics 51-100


### The main difference is whether “Relevance” figures explicitly in the model or not

• LM approach attempts to do away with modeling relevance

• LM approach assumes that documents and expressions of information problems are of the same type

• Computationally tractable, intuitively appealing

• LM vs. Prob. Model for IR

### Problems of basic LM approach

• Assumption of equivalence between document and information problem representation is unrealistic

• Very simple models of language

• Relevance feedback is difficult to integrate, as are user preferences, and other general issues of relevance

• Can’t easily accommodate phrases, passages, Boolean operators

• Current extensions focus on putting relevance back into the model, etc.

• LM vs. Prob. Model for IR

### Extension: 3-level model

• 3-level model

1. Whole collection model (Mc)

2. Specific-topic model; relevant-documents model (Mt)

3. Individual-document model (Md)

• Relevance hypothesis

• A request (query, topic) is generated from a specific-topic model {Mc, Mt}.

• Iff a document is relevant to the topic, the same model will apply to the document.

• It will replace part of the individual-document model in explaining the document.

• The probability of relevance of a document

• The probability that this model explains part of the document

• The probability that the {Mc, Mt, Md} combination is better than the {Mc, Md} combination
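One way to read the three-level model is as a three-way interpolation over Mc, Mt, and Md; a hypothetical sketch (the weights are illustrative, not values from the slides):

```python
def three_level_prob(p_doc, p_topic, p_coll, lam_d=0.5, lam_t=0.3):
    """P(w) = lam_d*P(w|Md) + lam_t*P(w|Mt) + (1 - lam_d - lam_t)*P(w|Mc)."""
    lam_c = 1 - lam_d - lam_t
    assert lam_c >= 0, "mixture weights must sum to at most 1"
    return lam_d * p_doc + lam_t * p_topic + lam_c * p_coll
```

Under the relevance hypothesis, a relevant document's terms should receive higher probability once Mt is in the mixture than under the {Mc, Md} combination alone.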

### Retrieval Using Language Models

• Retrieval: Query likelihood (1), Document likelihood (2), Model comparison (3)

### Query Likelihood

• P(Q|Dm)

• Major issue is estimating document model

• i.e. smoothing techniques instead of tf.idf weights

• Good retrieval results

• e.g. UMass, BBN, Twente, CMU

• Problems dealing with relevance feedback, query expansion, structured queries
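One widely used document-model estimate is Dirichlet-prior smoothing, mentioned earlier as a length-dependent alternative to a fixed λ; a sketch, with an illustrative value of μ:

```python
from collections import Counter

def dirichlet_prob(w, doc_tokens, coll_counts, coll_len, mu=2000):
    """P(w|Md) = (c(w; d) + mu * P(w|Mc)) / (|d| + mu).

    The effective document weight |d| / (|d| + mu) grows with document
    length, so long documents are smoothed less than short ones."""
    p_c = coll_counts[w] / coll_len
    return (doc_tokens.count(w) + mu * p_c) / (len(doc_tokens) + mu)
```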

### Document Likelihood

• Rank by likelihood ratio P(D|R)/P(D|NR)

• treat as a generation problem

• P(w|R) is estimated by P(w|Qm)

• Qm is the query or relevance model

• P(w|NR) is estimated by collection probabilities P(w)

• Issue is estimation of query model

• Treat query as generated by mixture of topic and background

• Estimate relevance model from related documents (query expansion)

• Relevance feedback is easily incorporated

• Good retrieval results

• e.g. UMass at SIGIR 01

• inconsistent with heterogeneous document collections
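A sketch of ranking by the (log) likelihood ratio, with P(w|R) estimated by P(w|Qm) and P(w|NR) by collection probabilities; the small floor for unseen words is an assumption of this sketch, not part of the model:

```python
import math

def log_likelihood_ratio(doc_tokens, query_model, coll_model, floor=1e-9):
    """log [P(D|R) / P(D|NR)] under term independence:
    sum over document tokens of log(P(w|Qm) / P(w))."""
    score = 0.0
    for w in doc_tokens:
        p_r = query_model.get(w, floor)   # P(w|R) ~ query/relevance model
        p_nr = coll_model.get(w, floor)   # P(w|NR) ~ collection statistics
        score += math.log(p_r / p_nr)
    return score
```

Because the query model can be re-estimated from related documents, relevance feedback and query expansion slot in naturally, as noted above.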

### Model Comparison

• Estimate query and document models and compare

• Suitable measure is KL divergence D(Qm||Dm)

• equivalent to query-likelihood approach if simple empirical distribution used for query model

• More general risk minimization framework has been proposed

• Zhai and Lafferty 2001

• Better results than query-likelihood or document-likelihood approaches
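A sketch of ranking by negative KL divergence −D(Qm‖Dm); it assumes the document model has already been smoothed so every query term has nonzero probability:

```python
import math

def neg_kl(query_model, doc_model):
    """-D(Qm || Dm) = -sum_w P(w|Qm) * log(P(w|Qm) / P(w|Dm)).

    Only terms with P(w|Qm) > 0 contribute, so the sum runs over the
    query model; larger (closer to 0) means a better match."""
    s = 0.0
    for w, p_q in query_model.items():
        s -= p_q * math.log(p_q / doc_model[w])
    return s
```

With the empirical query distribution as Qm, ranking by −D(Qm‖Dm) is rank-equivalent to query likelihood, since the query-entropy term is constant across documents.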

### Two-stage smoothing: Another Reason for Smoothing

• p(“algorithms”|d1) = p(“algorithms”|d2)

• p(“data”|d1) < p(“data”|d2)

• p(“mining”|d1) < p(“mining”|d2)

• But p(q|d1) > p(q|d2)!

• We should make p(“the”) and p(“for”) less different for all docs.
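A small numeric sketch of that effect (the counts are hypothetical): interpolating with the collection model pulls estimates for a common word like “the” toward a shared value, so it contributes less to differences between documents:

```python
# Hypothetical MLE estimates for "the" in two documents, plus its
# collection probability; smoothing pulls the two estimates together.
p_coll_the = 0.05
p_the_d1, p_the_d2 = 0.03, 0.09   # 3x apart under MLE

def smooth(p_mle, p_coll, lam=0.2):
    """Jelinek-Mercer interpolation; a low lam downweights per-document noise."""
    return lam * p_mle + (1 - lam) * p_coll

s1 = smooth(p_the_d1, p_coll_the)
s2 = smooth(p_the_d2, p_coll_the)
# After smoothing, the ratio s2 / s1 is much closer to 1 than the MLE ratio of 3.
```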