CS276A

Text Retrieval and Mining

Lecture 12

[Borrows slides from Viktor Lavrenko and Chengxiang Zhai]

Recap

  • Probabilistic models: Naïve Bayes Text Classification

    • Introduction to Text Classification

    • Probabilistic Language Models

    • Naïve Bayes text categorization

Today

  • The Language Model Approach to IR

    • Basic query generation model

    • Alternative models

Standard Probabilistic IR

IR based on Language Model (LM)

  • A common search heuristic is to use words that you expect to find in matching documents as your query – why, I saw Sergey Brin advocating that strategy on late night TV one night in my hotel room, so it must be good!
  • The LM approach directly exploits that idea!

Formal Language (Model)

  • Traditional generative model: generates strings

    • Finite state machines or regular grammars, etc.

  • Example:


 I wish
 I wish I wish
 I wish I wish I wish
 I wish I wish I wish I wish
 …
 *wish I wish   (the * marks a string the model does not generate)
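
As a rough illustration (invented for these notes, not from the slide), the example language — one or more repetitions of “I wish” — can be coded as a tiny generator/acceptor in Python:

  # Sketch: the "(I wish)+" language from the example above.
  def generate(n):
      """The string with n >= 1 repetitions of 'I wish'."""
      return " ".join(["I wish"] * n)

  def accepts(s):
      """True iff s consists of one or more 'I wish' repetitions."""
      w = s.split()
      return (len(w) >= 2 and len(w) % 2 == 0
              and all(w[i] == "I" and w[i + 1] == "wish"
                      for i in range(0, len(w), 2)))

  print(generate(3))             # I wish I wish I wish
  print(accepts("wish I wish"))  # False -- the starred string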

Stochastic Language Models

  • Model the probability of generating strings in the language (commonly all strings over an alphabet Σ)

  • A statistical model for generating text

    • Probability distribution over strings in a given language
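
A minimal sketch of the idea (the toy distribution is invented for illustration): a unigram model is just a word → probability table, and a string’s probability is the product of its words’ probabilities.

  # Toy unigram model (probabilities are illustrative, not from the lecture)
  model = {"the": 0.2, "a": 0.1, "frog": 0.01, "toad": 0.01,
           "said": 0.03, "likes": 0.02, "that": 0.04}

  def string_prob(words, lm):
      p = 1.0
      for w in words:
          p *= lm.get(w, 0.0)  # unseen words get probability 0 (no smoothing yet)
      return p

  print(string_prob("the frog said that the frog likes the toad".split(), model))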

Unigram and higher-order models

  • Unigram Language Models (easy, effective!)
  • Bigram (generally, n-gram) Language Models
  • Other Language Models
    • Grammar-based models (PCFGs), etc.
    • Probably not the first thing to try in IR

Using Language Models in IR

    • Treat each document as the basis for a model (e.g., unigram sufficient statistics)

    • Rank document d based on P(d | q)

    • P(d | q) = P(q | d) × P(d) / P(q)

      • P(q) is the same for all documents, so ignore

      • P(d) [the prior] is often treated as the same for all d

        • But we could use criteria like authority, length, genre

      • P(q | d) is the probability of q given d’s model

    • Very general formal approach
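
    The ranking logic above can be sketched as follows (function names are placeholders; the estimate of P(q | d) is developed in the rest of the lecture):

      # Rank documents by P(q | d) * P(d); P(q) is constant across documents.
      def rank_documents(docs, query, query_likelihood, prior=lambda d: 1.0):
          scored = [(query_likelihood(query, d) * prior(d), i)
                    for i, d in enumerate(docs)]
          scored.sort(reverse=True)        # highest probability first
          return [docs[i] for _, i in scored]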

    The fundamental problem of LMs

    • Usually we don’t know the model M

      • But have a sample of text representative of that model

    • Estimate a language model from a sample

    • Then compute the observation probability

    Language Models for IR

    • Language Modeling Approaches

      • Attempt to model query generation process

      • Documents are ranked by the probability that a query would be observed as a random sample from the respective document model

        • Multinomial approach

    Retrieval based on probabilistic LM

    • Treat the generation of queries as a random process.

    • Approach

      • Infer a language model for each document.

      • Estimate the probability of generating the query according to each of these models.

      • Rank the documents according to these probabilities.

      • Usually a unigram estimate of words is used

        • Some work on bigrams, paralleling van Rijsbergen

    Retrieval based on probabilistic LM

    • Intuition

      • Users …

        • Have a reasonable idea of terms that are likely to occur in documents of interest.

        • They will choose query terms that distinguish these documents from others in the collection.

    • Collection statistics …

      • Are integral parts of the language model.

      • Are not used heuristically as in many other approaches.

        • In theory. In practice, there’s usually some wiggle room for empirically set parameters

    Query generation probability (1)

    • Ranking formula: P(Q, d) = P(d) P(Q | d) ≈ P(d) P(Q | Md)

    • The probability of producing the query given the language model of document d using MLE is:

      P(Q | Md) = ∏t∈Q Pmle(t | Md) = ∏t∈Q tf(t,d) / dl(d)

      where tf(t,d) is the number of occurrences of term t in d and dl(d) is the total number of tokens in d.
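
    A direct transcription of this MLE estimate (a sketch; tokenized input is assumed):

      from collections import Counter

      def mle_query_likelihood(query_terms, doc_terms):
          tf = Counter(doc_terms)   # term frequencies tf(t,d)
          dl = len(doc_terms)       # document length dl(d)
          p = 1.0
          for t in query_terms:
              p *= tf[t] / dl       # zero if t is absent from d (see next slide)
          return p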

    Insufficient data

    • Zero probability: P(t | Md) = 0 whenever t does not occur in d

      • May not wish to assign a probability of zero to a document that is missing one or more of the query terms [gives conjunction semantics]

    • General approach

      • A non-occurring term is possible, but no more likely than would be expected by chance in the collection.

      • If tf(t,d) = 0, estimate P(t | Md) = cf(t) / cs, where cf(t) is the raw count of t in the collection and cs is the total number of tokens in the collection.

    Insufficient data

    • Zero probabilities spell disaster

      • We need to smooth probabilities

        • Discount nonzero probabilities

        • Give some probability mass to unseen things

    • There’s a wide space of approaches to smoothing probability distributions to deal with this problem, such as adding 1, ½, or ε to counts, Dirichlet priors, discounting, and interpolation

      • [See FSNLP ch. 6 or CS224N if you want more]

    • A simple idea that works well in practice is to use a mixture between the document multinomial and the collection multinomial distribution

    Mixture model

    • P(w|d) = λPmle(w|Md) + (1 – λ)Pmle(w|Mc)

    • Mixes the probability from the document with the general collection frequency of the word.

    • Correctly setting λ is very important

    • A high value of λ makes the search “conjunctive-like” – suitable for short queries

    • A low value is more suitable for long queries

    • Can tune λ to optimize performance (see the sketch after this list)

      • Perhaps make it dependent on document size (cf. Dirichlet prior or Witten-Bell smoothing)
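
    A minimal sketch of this mixture, assuming tokenized inputs (λ and the interface are illustrative):

      from collections import Counter

      def p_mixture(t, doc_terms, collection_terms, lam):
          p_md = Counter(doc_terms)[t] / len(doc_terms)                 # Pmle(t | Md)
          p_mc = Counter(collection_terms)[t] / len(collection_terms)  # Pmle(t | Mc)
          return lam * p_md + (1 - lam) * p_mc

      def query_likelihood(query_terms, doc_terms, collection_terms, lam=0.5):
          p = 1.0
          for t in query_terms:
              p *= p_mixture(t, doc_terms, collection_terms, lam)
          return p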

    Basic mixture model summary

    • General formulation of the LM for IR

      • The user has a document in mind, and generates the query from this document.

      • The equation represents the probability that the document that the user had in mind was in fact this one:

        P(Q, d) = P(d) ∏t∈Q ((1 – λ) P(t | Mc) + λ P(t | Md))

          • (1 – λ) P(t | Mc): general language model

          • λ P(t | Md): individual-document model

    Example

    • Document collection (2 documents)

      • d1: Xerox reports a profit but revenue is down

      • d2: Lucent narrows quarter loss but revenue decreases further

    • Model: MLE unigram from documents; λ = ½

    • Query: revenue down

      • P(Q|d1) = [(1/8 + 2/16)/2] × [(1/8 + 1/16)/2]
                = 1/8 × 3/32 = 3/256

      • P(Q|d2) = [(1/8 + 2/16)/2] × [(0 + 1/16)/2]
                = 1/8 × 1/32 = 1/256

    • Ranking: d1 > d2
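
    The worked example can be checked mechanically (same numbers as above):

      from collections import Counter

      def p_mix(t, doc, coll, lam=0.5):
          return (lam * Counter(doc)[t] / len(doc)
                  + (1 - lam) * Counter(coll)[t] / len(coll))

      d1 = "xerox reports a profit but revenue is down".split()
      d2 = "lucent narrows quarter loss but revenue decreases further".split()
      coll = d1 + d2

      for name, d in (("d1", d1), ("d2", d2)):
          p = 1.0
          for t in ["revenue", "down"]:
              p *= p_mix(t, d, coll)
          print(name, p)   # d1: 0.01171875 (= 3/256), d2: 0.00390625 (= 1/256)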

    Ponte and Croft Experiments

    • Data

      • TREC topics 202-250 on TREC disks 2 and 3

        • Natural language queries consisting of one sentence each

      • TREC topics 51-100 on TREC disk 3 using the concept fields

        • Lists of good terms

    Topic: Satellite Launch Contracts

    • Description: …

    • Concept(s):

      1. Contract, agreement

      2. Launch vehicle, rocket, payload, satellite

      3. Launch services, …

    Precision/recall results 202-250

    [Figure: precision/recall curves for topics 202-250]

    Precision/recall results 51-100

    [Figure: precision/recall curves for topics 51-100]

    LM vs. Prob. Model for IR

    • The main difference is whether “relevance” figures explicitly in the model or not

      • The LM approach attempts to do away with modeling relevance

    • The LM approach assumes that documents and expressions of information problems are of the same type

    • Computationally tractable, intuitively appealing

    Problems of basic LM approach

    • Assumption of equivalence between document and information problem representation is unrealistic

    • Very simple models of language

    • Relevance feedback is difficult to integrate, as are user preferences and other general issues of relevance

    • Can’t easily accommodate phrases, passages, Boolean operators

    • Current extensions focus on putting relevance back into the model, etc.

    Extension: 3-level model

    • 3-level model

    1. Whole collection model (Mc)

    2. Specific-topic model; relevant-documents model (Mt)

    3. Individual-document model (Md)

      • Relevance hypothesis

        • A request (query, topic) is generated from a specific-topic model {Mc, Mt}.

        • Iff a document is relevant to the topic, the same model will apply to the document.

          • It will replace part of the individual-document model in explaining the document.

        • The probability of relevance of a document

          • The probability that this model explains part of the document

          • The probability that the {Mc, Mt, Md} combination is better than the {Mc, Md} combination

    [Figure: 3-level model]

    Alternative Models of Text Generation

    Retrieval Using Language Models

    • Retrieval: Query likelihood (1), Document likelihood (2), Model comparison (3)

    Query Likelihood

    • P(Q|Dm)

    • Major issue is estimating document model

      • i.e. smoothing techniques instead of tf.idf weights

    • Good retrieval results

      • e.g. UMass, BBN, Twente, CMU

    • Problems dealing with relevance feedback, query expansion, structured queries

    Document Likelihood

    • Rank by likelihood ratio P(D|R)/P(D|NR)

      • treat as a generation problem

      • P(w|R) is estimated by P(w|Qm)

      • Qm is the query or relevance model

      • P(w|NR) is estimated by collection probabilities P(w)

    • Issue is estimation of query model

      • Treat query as generated by mixture of topic and background

      • Estimate relevance model from related documents (query expansion)

      • Relevance feedback is easily incorporated

    • Good retrieval results

      • e.g. UMass at SIGIR 01

      • inconsistent with heterogeneous document collections
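
    A sketch of this likelihood-ratio ranking (the interface is an assumption; both models must be smoothed so no probability is zero):

      import math

      # Score a document by log [P(D|R) / P(D|NR)]: sum over document tokens
      # of log p(w | Qm) - log p(w), with Qm the query/relevance model.
      def log_likelihood_ratio(doc_terms, p_rel, p_nonrel):
          return sum(math.log(p_rel(w) / p_nonrel(w)) for w in doc_terms)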

    Model Comparison

    • Estimate query and document models and compare

    • Suitable measure is KL divergence D(Qm||Dm)

      • equivalent to query-likelihood approach if simple empirical distribution used for query model

    • More general risk minimization framework has been proposed

      • Zhai and Lafferty 2001

    • Better results than query-likelihood or document-likelihood approaches
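
    The comparison measure itself is easy to state in code (models are word → probability dicts; the document model must be smoothed so it is nonzero wherever the query model is):

      import math

      # D(Qm || Dm) = sum over words w of P(w|Qm) * log [P(w|Qm) / P(w|Dm)]
      def kl_divergence(q_model, d_model):
          return sum(p * math.log(p / d_model[w])
                     for w, p in q_model.items() if p > 0)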

    Two-stage smoothing: Another reason for smoothing

    • Consider the query q = “the algorithms for data mining” against two documents d1, d2:

    • P(“algorithms” | d1) = P(“algorithms” | d2)

    • P(“data” | d1) < P(“data” | d2)

    • P(“mining” | d1) < P(“mining” | d2)

    • But P(q | d1) > P(q | d2)!

    • We should make P(“the”) and P(“for”) less different for all docs.

    Two-stage Smoothing
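
    As a hedged sketch (the parameter values are illustrative assumptions), two-stage smoothing in the style of Zhai and Lafferty first smooths the document model with a Dirichlet prior on the collection model (to explain unseen document words), then interpolates with a query-background model (to downweight common words like “the” and “for” uniformly across documents):

      # p(w | d) = (1 - lam) * (tf(w,d) + mu * p(w|C)) / (|d| + mu) + lam * p(w|U)
      def two_stage_prob(tf_wd, doc_len, p_w_collection, p_w_background,
                         mu=1000.0, lam=0.1):
          stage1 = (tf_wd + mu * p_w_collection) / (doc_len + mu)  # Dirichlet prior
          return (1 - lam) * stage1 + lam * p_w_background         # background mix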

    How can one do relevance feedback if using the language modeling approach?

    • Introduce a query model & treat feedback as query model updating

      • Retrieval function:

        • Query-likelihood => KL-Divergence

      • Feedback:

        • Expansion-based => Model-based

    Expansion-based vs. Model-based

    Feedback as Model Interpolation
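
    A minimal sketch of feedback as model interpolation (α and the dict-based models are assumptions for illustration): the updated query model mixes the original query model with a model estimated from feedback documents.

      # q' = (1 - alpha) * q_model + alpha * feedback_model, per word
      def interpolate_query_model(q_model, fb_model, alpha=0.5):
          words = set(q_model) | set(fb_model)
          return {w: (1 - alpha) * q_model.get(w, 0.0)
                     + alpha * fb_model.get(w, 0.0)
                  for w in words}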

    Translation model (Berger and Lafferty)

    • Basic LMs do not address issues of synonymy.

      • Or any deviation in expression of information need from language of documents

    • A translation model lets you generate query words not in document via “translation” to synonyms etc.

      • Or to do cross-language IR, or multimedia IR

      • Need to learn a translation model (using a dictionary or via statistical machine translation)
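
    A sketch of the “translation” step (a toy interface, not Berger and Lafferty’s exact estimation procedure): each query word is generated by translating some document word, so a document can match query words it never contains.

      # P(q | d) = prod over query words q of
      #            sum over doc words v of t(q | v) * P(v | Md)
      def translation_query_likelihood(query_terms, doc_model, trans_prob):
          # doc_model: v -> P(v | Md); trans_prob: (q, v) -> t(q | v)
          p = 1.0
          for q in query_terms:
              p *= sum(trans_prob.get((q, v), 0.0) * pv
                       for v, pv in doc_model.items())
          return p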

    Language models: pro & con

    • Novel way of looking at the problem of text retrieval based on probabilistic language modeling

        • Conceptually simple and explanatory

        • Formal mathematical model

        • Natural use of collection statistics, not heuristics (almost…)

      • LMs provide effective retrieval and can be improved to the extent that the following conditions can be met

        • Our language models are accurate representations of the data.

        • Users have some sense of term distribution.*

          • *Or we get more sophisticated with translation model

    Comparison With Vector Space

    • There’s some relation to traditional tf.idf models:

      • (unscaled) term frequency is directly in model

      • the probabilities do length normalization of term frequencies

      • the effect of doing a mixture with overall collection frequencies is a little like idf: terms rare in the general collection but common in some documents will have a greater influence on the ranking

    Comparison With Vector Space

    • Similar in some ways

      • Term weights based on frequency

      • Terms often used as if they were independent

      • Inverse document/collection frequency used

      • Some form of length normalization useful

    • Different in others

      • Based on probability rather than similarity

        • Intuitions are probabilistic rather than geometric

      • Details of use of document length and term, document, and collection frequency differ

    Resources

    • J.M. Ponte and W.B. Croft. 1998. A language modeling approach to information retrieval. SIGIR 21.

    • D. Hiemstra. 1998. A linguistically motivated probabilistic model of information retrieval. ECDL 2, pp. 569–584.

    • A. Berger and J. Lafferty. 1999. Information retrieval as statistical translation. SIGIR 22, pp. 222–229.

    • D.R.H. Miller, T. Leek, and R.M. Schwartz. 1999. A hidden Markov model information retrieval system. SIGIR 22, pp. 214–221.

    • [Several relevant newer papers at SIGIR 23–25, 2000–2002.]

    • Workshop on Language Modeling and Information Retrieval, CMU 2001. http://la.lti.cs.cmu.edu/callan/Workshops/lmir01/ .

    • The Lemur Toolkit for Language Modeling and Information Retrieval. http://www-2.cs.cmu.edu/~lemur/ . CMU/UMass LM and IR system in C(++), currently actively developed.