Mixture model

  • P(w|d) = λ·Pmle(w|Md) + (1 − λ)·Pmle(w|Mc)

  • Mixes the probability from the document with the general collection frequency of the word.

  • Correctly setting λ is very important

  • A high value of λ makes the search “conjunctive-like” – suitable for short queries

  • A low value is more suitable for long queries

  • Can tune λ to optimize performance

    • Perhaps make it dependent on document size (cf. Dirichlet prior or Witten-Bell smoothing)
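The document-size-dependent setting suggested above can be sketched in a few lines; this is a minimal illustration of Dirichlet-prior smoothing, where the prior mass μ and the example numbers are assumptions for illustration, not values from the slides:

```python
def dirichlet_prob(count_w_d, doc_len, p_w_coll, mu=2000.0):
    """Dirichlet-prior smoothing: (c(w,d) + mu * P(w|Mc)) / (|d| + mu).
    Algebraically this is a mixture with a document-dependent weight
    lambda_d = |d| / (|d| + mu) on the document model."""
    return (count_w_d + mu * p_w_coll) / (doc_len + mu)

# Longer documents trust their own counts more:
lam_short = 100 / (100 + 2000)      # small effective lambda for a short doc
lam_long = 10000 / (10000 + 2000)   # large effective lambda for a long doc
```

With μ fixed, short documents lean on the collection model while long documents rely mostly on their own counts – exactly the document-size dependence the bullet points at.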

Basic mixture model summary

  • General formulation of the LM for IR

    • The user has a document in mind, and generates the query from this document.

    • The equation represents the probability that the document that the user had in mind was in fact this one.

  • general language model

  • individual-document model

Example

  • Document collection (2 documents)

    • d1: Xerox reports a profit but revenue is down

    • d2: Lucent narrows quarter loss but revenue decreases further

  • Model: MLE unigram from documents; λ = ½

  • Query: revenue down

    • P(Q|d1) = [(1/8 + 2/16)/2] × [(1/8 + 1/16)/2]

    • = 1/8 × 3/32 = 3/256

    • P(Q|d2) = [(1/8 + 2/16)/2] × [(0 + 1/16)/2]

    • = 1/8 × 1/32 = 1/256

  • Ranking: d1 > d2
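The arithmetic above can be reproduced with a short script – a minimal sketch, assuming lowercase whitespace tokenization and MLE unigram estimates:

```python
from collections import Counter

def mixture_prob(word, doc_counts, doc_len, coll_counts, coll_len, lam=0.5):
    """Jelinek-Mercer mixture: lam * P_mle(w|Md) + (1 - lam) * P_mle(w|Mc)."""
    p_doc = doc_counts[word] / doc_len
    p_coll = coll_counts[word] / coll_len
    return lam * p_doc + (1 - lam) * p_coll

d1 = "Xerox reports a profit but revenue is down".lower().split()
d2 = "Lucent narrows quarter loss but revenue decreases further".lower().split()
coll = d1 + d2
c1, c2, cc = Counter(d1), Counter(d2), Counter(coll)

def query_likelihood(query, doc_counts, doc_len):
    p = 1.0
    for w in query:
        p *= mixture_prob(w, doc_counts, doc_len, cc, len(coll))
    return p

q = "revenue down".lower().split()
p1 = query_likelihood(q, c1, len(d1))  # 3/256
p2 = query_likelihood(q, c2, len(d2))  # 1/256
```

Note that d2 still gets a nonzero score even though “down” never occurs in it – the collection component of the mixture is what prevents the zero-probability problem.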

Ponte and Croft Experiments

  • Data

    • TREC topics 202-250 on TREC disks 2 and 3

      • Natural language queries consisting of one sentence each

    • TREC topics 51-100 on TREC disk 3 using the concept fields

      • Lists of good terms

Topic: Satellite Launch Contracts

  • Description: …

  • Concept(s):

  1. Contract, agreement

  2. Launch vehicle, rocket, payload, satellite

  3. Launch services, …

Precision/recall results 202-250

Precision/recall results 51-100

The main difference is whether “Relevance” figures explicitly in the model or not

  • LM approach attempts to do away with modeling relevance

  • LM approach assumes that documents and expressions of information problems are of the same type

  • Computationally tractable, intuitively appealing

  • LM vs. Prob. Model for IR

Problems of basic LM approach

  • Assumption of equivalence between document and information problem representation is unrealistic

    • Very simple models of language

    • Relevance feedback is difficult to integrate, as are user preferences, and other general issues of relevance

    • Can’t easily accommodate phrases, passages, Boolean operators

  • Current extensions focus on putting relevance back into the model, etc.

  • LM vs. Prob. Model for IR

Extension: 3-level model

  • 3-level model

  1. Whole collection model (Mc)

  2. Specific-topic model; relevant-documents model (Mt)

  3. Individual-document model (Md)

    • Relevance hypothesis

      • A request (query; topic) is generated from a specific-topic model {Mc, Mt}.

      • Iff a document is relevant to the topic, the same model will apply to the document.

        • It will replace part of the individual-document model in explaining the document.

      • The probability of relevance of a document

        • The probability that this model explains part of the document

        • The probability that the {Mc, Mt, Md} combination is better than the {Mc, Md} combination
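Written out, the three-level combination can be sketched as a three-component mixture; the weights λ_i are an assumption for illustration, since the slides do not fix a combination rule:

```latex
P(w \mid d) = \lambda_1\, P(w \mid M_c) + \lambda_2\, P(w \mid M_t) + \lambda_3\, P(w \mid M_d),
\qquad \lambda_1 + \lambda_2 + \lambda_3 = 1
```

Under the relevance hypothesis, a relevant document shifts mass from the individual-document component (λ3) to the specific-topic component (λ2).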


Alternative Models of Text Generation

Retrieval Using Language Models

  • Retrieval: Query likelihood (1), Document likelihood (2), Model comparison (3)

Query Likelihood

  • P(Q|Dm)

  • Major issue is estimating document model

    • i.e. smoothing techniques instead of tf.idf weights

  • Good retrieval results

    • e.g. UMass, BBN, Twente, CMU

  • Problems dealing with relevance feedback, query expansion, structured queries

Document Likelihood

  • Rank by likelihood ratio P(D|R)/P(D|NR)

    • treat as a generation problem

    • P(w|R) is estimated by P(w|Qm)

    • Qm is the query or relevance model

    • P(w|NR) is estimated by collection probabilities P(w)

  • Issue is estimation of query model

    • Treat query as generated by mixture of topic and background

    • Estimate relevance model from related documents (query expansion)

    • Relevance feedback is easily incorporated

  • Good retrieval results

    • e.g. UMass at SIGIR 01

    • inconsistent with heterogeneous document collections
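A minimal sketch of the likelihood-ratio ranking described above, assuming the query/relevance model P(w|Qm) and the background P(w) have already been estimated; the dictionaries and their values here are hypothetical:

```python
import math

def doc_likelihood_score(doc_counts, p_w_qm, p_w_bg):
    """Rank by log [P(D|R) / P(D|NR)] = sum_w c(w,D) * log(P(w|Qm) / P(w)),
    summing over document words covered by the query model."""
    score = 0.0
    for w, c in doc_counts.items():
        if w in p_w_qm and p_w_bg.get(w, 0) > 0:
            score += c * math.log(p_w_qm[w] / p_w_bg[w])
    return score

# Hypothetical estimates: relevance model vs. collection background
p_qm = {"revenue": 0.3, "down": 0.2}
p_bg = {"revenue": 0.01, "down": 0.02, "but": 0.05}
score = doc_likelihood_score({"revenue": 2, "down": 1, "but": 1}, p_qm, p_bg)
```

Words with a higher probability under the relevance model than under the background contribute positively, so documents dominated by such words rise to the top.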

Model Comparison

  • Estimate query and document models and compare

  • Suitable measure is KL divergence D(Qm||Dm)

    • equivalent to query-likelihood approach if simple empirical distribution used for query model

  • More general risk minimization framework has been proposed

    • Zhai and Lafferty 2001

  • Better results than query-likelihood or document-likelihood approaches
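The KL-divergence ranking above can be sketched directly from its definition; the smoothed model estimates below are hypothetical, and the document model must give nonzero probability to every query-model word:

```python
import math

def kl_divergence(p_qm, p_dm):
    """D(Qm || Dm) = sum_w P(w|Qm) * log(P(w|Qm) / P(w|Dm)).
    Documents are ranked by increasing divergence from the query model."""
    return sum(p * math.log(p / p_dm[w]) for w, p in p_qm.items() if p > 0)

# Hypothetical smoothed models (remaining mass sits on other words):
qm = {"revenue": 0.6, "down": 0.4}
dm1 = {"revenue": 0.5, "down": 0.3}
dm2 = {"revenue": 0.2, "down": 0.05}
# dm1 is closer to qm, so it gets the lower divergence and the higher rank.
```

Because D(Qm||Dm) = 0 exactly when the two distributions agree on the query-model support, “compare the models” here means: rank documents by how little their smoothed model diverges from the query model.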

Two-stage smoothing: Another Reason for Smoothing

  • p(“algorithms”|d1) = p(“algorithms”|d2)

  • p(“data”|d1) < p(“data”|d2)

  • p(“mining”|d1) < p(“mining”|d2)

  • But p(q|d1) > p(q|d2)! – the common words in q dominate the product

  • We should make p(“the”) and p(“for”) less different for all docs. 

Two-stage Smoothing