Introduction to Information Retrieval

CS276: Information Retrieval and Web Search

Lecture 10: Text Classification;

The Naive Bayes algorithm

Standing queries

  • The path from IR to text classification:

    • You have an information need to monitor, say:

      • Unrest in the Niger delta region

    • You want to rerun an appropriate query periodically to find new news items on this topic

    • You will be sent new documents that are found

      • I.e., it’s not ranking but classification (relevant vs. not relevant)

  • Such queries are called standing queries

    • Long used by “information professionals”

    • A modern mass instantiation is Google Alerts

  • Standing queries are (hand-written) text classifiers

Spam filtering: Another text classification task

  • From: ""<takworlld@hotmail.com>

  • Subject: real estate is the only way... gem oalvgkay

  • Anyone can buy real estate with no money down

  • Stop paying rent TODAY !

  • There is no need to spend hundreds or even thousands for similar courses

  • I am 22 years old and I have already purchased 6 properties using the

  • methods outlined in this truly INCREDIBLE ebook.

  • Change your life NOW !

  • =================================================

  • Click Below to order:

  • http://www.wholesaledaily.com/sales/nmd.htm

  • =================================================

Text classification

  • Today:

    • Introduction to Text Classification

      • Also widely known as “text categorization”

      • Same thing

    • Naïve Bayes text classification

      • Including a little on Probabilistic Language Models

Categorization/Classification

  • Given:

    • A description of an instance, d ∈ X

      • X is the instance language or instance space.

        • Issue: how to represent text documents.

        • Usually some type of high-dimensional space – bag of words

    • A fixed set of classes:

    • C = {c1, c2,…, cJ}

  • Determine:

    • The category of d: γ(d) ∈ C, where γ(d) is a classification function whose domain is X and whose range is C.

      • We want to know how to build classification functions (“classifiers”).

Machine Learning: Supervised Classification

  • Given:

    • A description of an instance, d ∈ X

      • X is the instance language or instance space.

    • A fixed set of classes:

    • C = {c1, c2,…, cJ}

    • A training set D of labeled documents with each labeled document ⟨d,c⟩ ∈ X×C

  • Determine:

    • A learning method or algorithm which will enable us to learn a classifier γ:X→C

    • For a test document d, we assign it the class γ(d) ∈ C

Document Classification

 
  • (Note: in real life there is often a hierarchy, not present in the above problem statement; and also, you get papers on ML approaches to garbage collection.)

More Text Classification Examples: Many search engine functionalities use classification

  • Assigning labels to documents or web-pages:

  • Labels are most often topics such as Yahoo-categories

    • "finance," "sports," "news>world>asia>business"

  • Labels may be genres

    • "editorials" "movie-reviews" "news”

  • Labels may be opinion on a person/product

    • “like”, “hate”, “neutral”

  • Labels may be domain-specific

    • "interesting-to-me" : "not-interesting-to-me”

    • “contains adult language” : “doesn’t”

    • language identification: English, French, Chinese, …

    • search vertical: about Linux versus not

    • “link spam” : “not link spam"

Classification Methods (1)

  • Manual classification

    • Used by the original Yahoo! Directory

    • Looksmart, about.com, ODP, PubMed

    • Very accurate when job is done by experts

    • Consistent when the problem size and team are small

    • Difficult and expensive to scale

      • Means we need automatic classification methods for big problems

Classification Methods (2)

  • Hand-coded rule-based classifiers

    • One technique used by CS dept’s spam filter, Reuters, CIA, etc.

    • It’s what Google Alerts is doing

      • Widely deployed in government and enterprise

    • Companies provide “IDE” for writing such rules

    • E.g., assign category if document contains a given boolean combination of words (see the code sketch after this list)

    • Commercial systems have complex query languages (everything in IR query languages + score accumulators)

    • Accuracy is often very high if a rule has been carefully refined over time by a subject expert

    • Building and maintaining these rules is expensive
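As a concrete illustration of such a hand-coded rule, here is a minimal Python sketch; the rule, category name, and function names are hypothetical, not taken from any real system, and echo the standing-query example from earlier in the lecture.

    # Sketch of a hand-written rule classifier (hypothetical rule and category).
    # The document is assigned to the category if it matches a boolean combination of words.

    def matches_rule(doc_tokens):
        """Return True if the document satisfies the (hand-written) rule."""
        tokens = set(t.lower() for t in doc_tokens)
        # Rule: ("unrest" OR "violence") AND "niger" AND "delta"
        return (("unrest" in tokens or "violence" in tokens)
                and "niger" in tokens and "delta" in tokens)

    def classify(doc_tokens):
        return "niger-delta-unrest" if matches_rule(doc_tokens) else "other"

    print(classify("Fresh unrest reported in the Niger delta region".split()))   # niger-delta-unrest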

A Verity topic: A complex classification rule

  • Note:

    • maintenance issues (author, etc.)

    • Hand-weighting of terms

    • [Verity was bought by Autonomy.]

Classification Methods (3)

  • Supervised learning of a document-label assignment function

    • Many systems partly or wholly rely on machine learning (Autonomy, Microsoft, Enkata, Yahoo!, …)

      • k-Nearest Neighbors (simple, powerful)

      • Naive Bayes (simple, common method)

      • Support-vector machines (new, generally more powerful)

      • … plus many other methods

    • No free lunch: requires hand-classified training data

    • But data can be built up (and refined) by amateurs

  • Many commercial systems use a mixture of methods

Relevance feedback

  • In relevance feedback, the user marks a few documents as relevant/nonrelevant

  • The choices can be viewed as classes or categories

  • The IR system then uses these judgments to build a better model of the information need

  • So, relevance feedback can be viewed as a form of text classification (deciding between several classes)

Probabilistic relevance feedback

  • Rather than reweighting in a vector space…

  • If user has told us some relevant and some nonrelevant documents, then we can proceed to build a probabilistic classifier

      • such as the Naive Bayes model we will look at today:

    • P(tk|R) = |Drk| / |Dr|

    • P(tk|NR) = |Dnrk| / |Dnr|

      • tk is a term; Dr is the set of known relevant documents; Drk is the subset that contain tk; Dnr is the set of known nonrelevant documents; Dnrk is the subset that contain tk.
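A minimal Python sketch of the two estimates above, assuming the judged documents are represented as sets of terms (the function name and example data are illustrative):

    # Sketch: estimate P(tk|R) and P(tk|NR) from judged documents.
    # Dr / Dnr are lists of relevant / nonrelevant documents, each a set of terms.

    def term_probabilities(tk, Dr, Dnr):
        Drk = [d for d in Dr if tk in d]      # relevant docs containing tk
        Dnrk = [d for d in Dnr if tk in d]    # nonrelevant docs containing tk
        p_rel = len(Drk) / len(Dr) if Dr else 0.0
        p_nonrel = len(Dnrk) / len(Dnr) if Dnr else 0.0
        return p_rel, p_nonrel

    Dr = [{"oil", "delta", "unrest"}, {"delta", "protest"}]
    Dnr = [{"football", "delta"}]
    print(term_probabilities("delta", Dr, Dnr))   # (1.0, 1.0)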

Recall a few probability basics

  • For events a and b:

    • P(a, b) = P(a ∩ b) = P(a|b) P(b) = P(b|a) P(a)

  • Bayes’ Rule

    • P(a|b) = P(b|a) P(a) / P(b)

  • Odds:

    • O(a) = P(a) / P(ā) = P(a) / (1 − P(a))

Bayesian Methods

  • Learning and classification methods based on probability theory

  • Bayes theorem plays a critical role

  • Builds a generative model that approximates how data is produced

  • Has prior probability of each category given no information about an item.

  • Model produces a posterior probability

    • Distribution over the possible categories given an item

  • Naïve Bayes methods use a bag of words as the item description

The bag of words representation

Bayes’ Rule for text classification

For a document d and a class c:  P(c|d) = P(d|c) P(c) / P(d)

Naive Bayes Classifiers

  • Task: Classify a new instance d based on a tuple of attribute values into one of the classes cj ∈ C

    d = ⟨x1, x2, …, xn⟩

  • MAP is “maximum a posteriori” = most likely class:

    • cMAP = argmax_{cj ∈ C} P(cj | x1, …, xn) = argmax_{cj ∈ C} P(x1, …, xn | cj) P(cj)

      • (Bayes’ rule; the denominator P(x1, …, xn) is the same for every class, so it drops out.)


Naïve Bayes Classifier: Naïve Bayes Assumption

  • P(cj)

    • Can be estimated from the frequency of classes in the training examples.

  • P(x1,x2,…,xn|cj)

    • O(|X|^n · |C|) parameters

    • Could only be estimated if a very, very large number of training examples was available.

  • Naïve Bayes Conditional Independence Assumption:

  • Assume that the probability of observing the conjunction of attributes is equal to the product of the individual probabilities P(xi|cj):

    • P(x1, x2, …, xn | cj) = P(x1|cj) · P(x2|cj) · … · P(xn|cj)

The Multivariate Bernoulli NB Classifier [Like Prob IR BIM; not language model; less used]

  • Conditional Independence Assumption: features detect term presence and are independent of each other given the class:

    • P(X1, …, X|V| | cj) = P(X1|cj) · P(X2|cj) · … · P(X|V| | cj), with each Xw ∈ {0, 1} indicating presence of word w

  • This model is appropriate for binary variables
    • Multivariate Bernoulli model

Learning the Model

  • First attempt: maximum likelihood estimates

    • Simply use the frequencies in the data:

      • P̂(cj) = N(C = cj) / N

      • P̂(xi | cj) = N(Xi = xi, C = cj) / N(C = cj)

Problem with Maximum Likelihood

  • What if we have seen no training documents with the word muscle-ache that are classified in the topic Flu?

    • Then the maximum likelihood estimate is P̂(muscle-ache | Flu) = 0

  • Zero probabilities cannot be conditioned away, no matter the other evidence! A single zero factor makes the whole product ∏i P(xi|cj) zero.

Smoothing to Avoid Overfitting

  • Standard fix: add-one (Laplace) smoothing, i.e., add 1 to every count before normalizing, so no estimate is exactly zero

  • Somewhat more subtle version: shrink each count towards a prior estimate, with a parameter controlling the extent of smoothing
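For concreteness, the add-one estimate can be written as below; the second line is one common reading of the "more subtle" variant (shrinking towards a prior p with weight m), stated here as an assumption about the intended form rather than a quote from the slide:

    \hat{P}(X_i = x_i \mid c_j) = \frac{N(X_i = x_i, C = c_j) + 1}{N(C = c_j) + k}
      \quad (k = \text{number of values } X_i \text{ can take})

    \hat{P}(X_i = x_i \mid c_j) = \frac{N(X_i = x_i, C = c_j) + m\,p}{N(C = c_j) + m}
      \quad (\text{prior } p,\ \text{smoothing weight } m)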

Stochastic Language Models

  • Model probability of generating any string


Stochastic Language Models

  • Model probability of generating strings (each word in turn) in a language (commonly all strings over alphabet ∑). E.g., a unigram model
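A minimal Python sketch of the generative view, with made-up word probabilities (not the slide's figures): under a unigram model, the probability of a string is the product of per-word probabilities.

    # Sketch of a unigram language model with illustrative (made-up) probabilities.
    unigram = {"the": 0.2, "a": 0.1, "frog": 0.01, "toad": 0.01,
               "said": 0.03, "likes": 0.02, "that": 0.04, "dog": 0.005}

    def string_probability(words, model):
        """P(string) under a unigram model = product of word probabilities."""
        p = 1.0
        for w in words:
            p *= model.get(w, 0.0)   # unseen words get probability 0 here (no smoothing)
        return p

    print(string_probability("the frog said that the frog likes the dog".split(), unigram))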

Unigram and higher-order models

  • Unigram Language Models (easy, effective!)

  • Bigram (generally, n-gram) Language Models

  • Other Language Models

    • Grammar-based models (PCFGs), etc.

    • Probably not the first thing to try in IR

  • Naïve Bayes via a class conditional language model = multinomial NB

    • The probability of the words is modeled with a class-specific unigram language model

    Using Multinomial Naive Bayes Classifiers to Classify Text: Basic method

    • Attributes are text positions, values are words.

    • Still too many possibilities

    • Assume that classification is independent of the positions of the words

    • Use same parameters for each position

    • Result is bag of words model (over tokens not types)

    Naive Bayes and Language Modeling

    • Naïve Bayes classifiers can use any sort of feature

      • URL, email address, dictionaries, network features

    • But if, as in the previous slides

      • We use only word features

      • we use all of the words in the text (not a subset)

    • Then

      • Naïve Bayes is basically the same as language modeling

    Multinomial Naive Bayes: Learning

    • From training corpus, extract Vocabulary
    • Calculate required P(cj) and P(xk | cj) terms
      • For each cj in C do
        • docsj ← subset of documents for which the target class is cj
        • P(cj) ← |docsj| / |total number of documents|
        • Textj ← single document containing all docsj
        • n ← total number of word occurrences (tokens) in Textj
        • for each word xk in Vocabulary
          • nk ← number of occurrences of xk in Textj
          • P(xk | cj) ← (nk + α) / (n + α |Vocabulary|)

    Naive Bayes: Classifying

    • positions ← all word positions in current document which contain tokens found in Vocabulary

    • Return cNB, where

      • cNB = argmax_{cj ∈ C}  P(cj) ∏_{i ∈ positions} P(xi | cj)
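    A minimal, self-contained Python sketch of the learning and classifying procedures above, using add-one smoothing (α = 1); the function and variable names are illustrative and the tokenizer is deliberately trivial.

        # Minimal multinomial Naive Bayes sketch (illustrative names, add-one smoothing).
        from collections import Counter, defaultdict

        def train_multinomial_nb(docs, alpha=1.0):
            """docs: list of (list_of_tokens, class_label). Returns priors, cond. probs, vocabulary."""
            vocabulary = {w for tokens, _ in docs for w in tokens}
            class_docs = defaultdict(list)
            for tokens, c in docs:
                class_docs[c].append(tokens)

            priors, cond_prob = {}, {}
            for c, doc_list in class_docs.items():
                priors[c] = len(doc_list) / len(docs)                  # P(cj)
                text_j = [w for tokens in doc_list for w in tokens]    # mega-document for class cj
                counts = Counter(text_j)
                n = len(text_j)
                cond_prob[c] = {w: (counts[w] + alpha) / (n + alpha * len(vocabulary))
                                for w in vocabulary}                   # P(xk|cj), add-one smoothed
            return priors, cond_prob, vocabulary

        def classify_multinomial_nb(tokens, priors, cond_prob, vocabulary):
            """Return the class maximizing P(cj) * prod_i P(xi|cj) over in-vocabulary positions."""
            positions = [w for w in tokens if w in vocabulary]
            best_class, best_score = None, -1.0
            for c in priors:
                score = priors[c]
                for w in positions:
                    score *= cond_prob[c][w]
                if score > best_score:
                    best_class, best_score = c, score
            return best_class

        # The worked example used later in this lecture:
        training = [("Chinese Beijing Chinese".split(), "c"),
                    ("Chinese Chinese Shanghai".split(), "c"),
                    ("Chinese Macao".split(), "c"),
                    ("Tokyo Japan Chinese".split(), "j")]
        priors, cond_prob, vocab = train_multinomial_nb(training)
        print(classify_multinomial_nb("Chinese Chinese Chinese Tokyo Japan".split(),
                                      priors, cond_prob, vocab))   # -> "c"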

    Naive Bayes: Time Complexity

    • Training Time: O(|D|Lave + |C||V|), where Lave is the average length of a document in D.

      • Assumes all counts are pre-computed in O(|D|Lave) time during one pass through all of the data.                          ← Why ?

      • Generally just O(|D|Lave) since usually |C||V| < |D|Lave

    • Test Time: O(|C| Lt) where Lt is the average length of a test document.

    • Very efficient overall, linearly proportional to the time needed to just read in all the data.

    Underflow Prevention: using logs

    • Multiplying lots of probabilities, which are between 0 and 1 by definition, can result in floating-point underflow.

    • Since log(xy) = log(x) + log(y), it is better to perform all computations by summing logs of probabilities rather than multiplying probabilities.

    • Class with highest final un-normalized log probability score is still the most probable.

    • Note that model is now just max of sum of weights…
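    Applied to the classification sketch above, the scoring loop becomes a sum of logs (a hedged variant, assuming the same hypothetical function and variable names as before):

        import math

        def classify_multinomial_nb_log(tokens, priors, cond_prob, vocabulary):
            """Same decision as before, but computed as a max of summed log probabilities."""
            positions = [w for w in tokens if w in vocabulary]
            best_class, best_score = None, float("-inf")
            for c in priors:
                score = math.log(priors[c])
                for w in positions:
                    score += math.log(cond_prob[c][w])   # log(xy) = log x + log y
                if score > best_score:
                    best_class, best_score = c, score
            return best_class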

    Example


                Doc    Words                                  Class
    Training     1     Chinese Beijing Chinese                  c
                 2     Chinese Chinese Shanghai                 c
                 3     Chinese Macao                            c
                 4     Tokyo Japan Chinese                      j
    Test         5     Chinese Chinese Chinese Tokyo Japan      ?
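    Working this example through with add-one smoothing (the vocabulary is V = {Chinese, Beijing, Shanghai, Macao, Tokyo, Japan}, so |V| = 6, and class c contains 8 training tokens, class j contains 3):

        P(c) = 3/4, \quad P(j) = 1/4

        \hat{P}(\text{Chinese} \mid c) = \frac{5+1}{8+6} = \frac{3}{7}, \qquad
        \hat{P}(\text{Tokyo} \mid c) = \hat{P}(\text{Japan} \mid c) = \frac{0+1}{8+6} = \frac{1}{14}

        \hat{P}(\text{Chinese} \mid j) = \hat{P}(\text{Tokyo} \mid j) = \hat{P}(\text{Japan} \mid j) = \frac{1+1}{3+6} = \frac{2}{9}

        P(c \mid d_5) \propto \frac{3}{4} \cdot \left(\frac{3}{7}\right)^3 \cdot \frac{1}{14} \cdot \frac{1}{14} \approx 0.0003

        P(j \mid d_5) \propto \frac{1}{4} \cdot \left(\frac{2}{9}\right)^3 \cdot \frac{2}{9} \cdot \frac{2}{9} \approx 0.0001

    so the test document is assigned to class c.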

    Two Naive Bayes Models

    • Model 1: Multivariate Bernoulli

      • One feature Xw for each word in dictionary

        • for loop iterates over dictionary

      • Xw = true in document d if w appears in d

      • Naive Bayes assumption:

        • Given the document’s topic, appearance of one word in the document tells us nothing about chances that another word appears

    • This is the model used in the binary independence model in classic probabilistic relevance feedback on hand-classified data

    Two Models

    • Model 2: Multinomial = Class conditional unigram

      • One feature Xi for each word pos in document

        • feature’s values are all words in dictionary

      • Value of Xi is the word in position i

      • Naïve Bayes assumption:

        • Given the document’s topic, word in one position in the document tells us nothing about words in other positions

      • Second assumption:

        • Word appearance does not depend on position:

          P(Xi = w | c) = P(Xj = w | c)   for all positions i, j, word w, and class c

        • Just have one multinomial feature predicting all words


    Parameter estimation

    Multivariate Bernoulli model:

    fraction of documents of topic cj in which word w appears

    Multinomial model:

    fraction of times in which word w appears among all words in documents of topic cj


    • Can create a mega-document for topic j by concatenating all documents in this topic
    • Use frequency of w in mega-document
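    A small Python sketch contrasting the two estimators (illustrative names; no smoothing shown):

        from collections import Counter

        def bernoulli_estimate(w, docs_in_class):
            """Fraction of documents of the class in which word w appears."""
            return sum(1 for tokens in docs_in_class if w in tokens) / len(docs_in_class)

        def multinomial_estimate(w, docs_in_class):
            """Fraction of word occurrences in the class's mega-document that are w."""
            mega_document = [t for tokens in docs_in_class for t in tokens]   # concatenate docs
            return Counter(mega_document)[w] / len(mega_document)

        docs_c = ["Chinese Beijing Chinese".split(), "Chinese Chinese Shanghai".split()]
        print(bernoulli_estimate("Chinese", docs_c))    # 1.0  (appears in 2 of 2 docs)
        print(multinomial_estimate("Chinese", docs_c))  # 0.666...  (4 of 6 tokens)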

    Which to use for classification?

    • Multinomial vs Multivariate Bernoulli?

    • Multinomial model is almost always more effective in text applications!

        • See results figures later

    • There has been exploration of multinomial naïve bayes variants which often work better in practice

      • Binarized multinomial Naïve Bayes, etc.

      • Topic of PA4

    Feature Selection: Why?

    • Text collections have a large number of features

      • 10,000 – 1,000,000 unique words … and more

    • May make using a particular classifier feasible

      • Some classifiers can’t deal with 1,000,000 features

    • Reduces training time

      • Training time for some methods is quadratic or worse in the number of features

    • Makes runtime models smaller and faster

    • Can improve generalization (performance)

      • Eliminates noise features

      • Avoids overfitting

    Feature Selection: How?

    • Two ideas:

      • Hypothesis testing statistics:

        • Are we confident that the value of one categorical variable is associated with the value of another

        • Chi-square test (χ2)

      • Information theory:

        • How much information does the value of one categorical variable give you about the value of another

        • Mutual information

    • They’re similar, but χ2 measures confidence in association, (based on available statistics), while MI measures extent of association (assuming perfect knowledge of probabilities)

    χ² statistic (CHI)

    • χ² is interested in (fo − fe)²/fe summed over all table entries: is the observed number what you’d expect given the marginals?

    • The null hypothesis is rejected with confidence .999, since 12.9 > 10.83 (the value for .999 confidence).

    χ² statistic (CHI)

    • There is a simpler formula for 2x2 χ², in terms of the four cell counts A, B, C, D of the term/class contingency table:

      • χ² = N (AD − BC)² / ((A + B) (A + C) (B + D) (C + D))

    • N = A + B + C + D

    • Value for complete independence of term and category?
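    A small Python sketch of this 2x2 shortcut (cell labels as above; the function name is illustrative):

        def chi_square_2x2(A, B, C, D):
            """Chi-square for a 2x2 term/class contingency table with cells A, B, C, D."""
            N = A + B + C + D
            denom = (A + B) * (A + C) * (B + D) * (C + D)
            return N * (A * D - B * C) ** 2 / denom if denom else 0.0

        # Complete independence of term and category (AD == BC) gives 0:
        print(chi_square_2x2(10, 20, 30, 60))   # 0.0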

    Feature selection via Mutual Information

    • In training set, choose k words which best discriminate (give most info on) the categories.

    • The Mutual Information between a word and a class is:

      • For each word w and each category c
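    Written out, following the textbook's notation (IIR 13.5), with U the presence/absence indicator for the word in a document and C the indicator for membership in the class:

        I(U; C) = \sum_{e_t \in \{0,1\}} \sum_{e_c \in \{0,1\}}
            P(U = e_t, C = e_c) \, \log_2 \frac{P(U = e_t, C = e_c)}{P(U = e_t)\, P(C = e_c)}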

    Feature selection via MI (contd.)

    • For each category we build a list of k most discriminating terms.

    • For example (on 20 Newsgroups):

      • sci.electronics: circuit, voltage, amp, ground, copy, battery, electronics, cooling, …

      • rec.autos: car, cars, engine, ford, dealer, mustang, oil, collision, autos, tires, toyota, …

    • Greedy: does not account for correlations between terms

    • Why?

    Feature Selection

    • Mutual Information

      • Clear information-theoretic interpretation

      • May select rare uninformative terms

    • Chi-square

      • Statistical foundation

      • May select very slightly informative frequent terms that are not very useful for classification

    Feature Selection: Frequency

    • The simplest feature selection method:

      • Just use the commonest terms

      • No particular foundation

      • But it makes sense why this works

        • They’re the words that can be well-estimated and are most often available as evidence

      • In practice, this is often 90% as good as better methods
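    A hedged sketch of the frequency criterion in Python (names illustrative): keep the k most common terms and drop the rest.

        from collections import Counter

        def top_k_terms(docs, k):
            """docs: list of token lists. Return the k most frequent terms."""
            counts = Counter(w for tokens in docs for w in tokens)
            return [w for w, _ in counts.most_common(k)]

        docs = ["the quick brown fox".split(), "the lazy dog".split()]
        print(top_k_terms(docs, 3))   # e.g. ['the', 'quick', 'brown']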

    Feature selection for NB

    • In general feature selection is necessary for multivariate Bernoulli NB.

    • Otherwise you suffer from noise, multi-counting

    • “Feature selection” really means something different for multinomial NB. It means dictionary truncation

      • The multinomial NB model only has 1 feature

    • This “feature selection” normally isn’t needed for multinomial NB, but may help a fraction with quantities that are badly estimated

    Evaluating Categorization

    • Evaluation must be done on test data that are independent of the training data (usually a disjoint set of instances).

      • Sometimes use cross-validation (averaging results over multiple training and test splits of the overall data)

    • It’s easy to get good performance on a test set that was available to the learner during training (e.g., just memorize the test set).

    • Measures: precision, recall, F1, classification accuracy

    • Classification accuracy: c/n where n is the total number of test instances and c is the number of test instances correctly classified by the system.

      • Adequate if one class per document

      • Otherwise F measure for each class
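    A small Python sketch of these measures for a single class (binary decisions; the function and variable names are illustrative):

        def evaluate(gold, predicted, target_class):
            """Precision, recall, F1 for one class, plus overall accuracy."""
            tp = sum(1 for g, p in zip(gold, predicted) if g == target_class and p == target_class)
            fp = sum(1 for g, p in zip(gold, predicted) if g != target_class and p == target_class)
            fn = sum(1 for g, p in zip(gold, predicted) if g == target_class and p != target_class)
            correct = sum(1 for g, p in zip(gold, predicted) if g == p)
            precision = tp / (tp + fp) if tp + fp else 0.0
            recall = tp / (tp + fn) if tp + fn else 0.0
            f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
            accuracy = correct / len(gold)
            return precision, recall, f1, accuracy

        print(evaluate(["c", "c", "j", "j"], ["c", "j", "j", "j"], "c"))
        # (1.0, 0.5, 0.666..., 0.75)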

    Naive Bayes vs. other methods

    WebKB Experiment (1998)

    • Classify webpages from CS departments into:

      • student, faculty, course, project

    • Train on ~5,000 hand-labeled web pages

      • Cornell, Washington, U.Texas, Wisconsin

    • Crawl and classify a new site (CMU)

    • Results:

    NB Model Comparison: WebKB

    SpamAssassin

    • Naïve Bayes has found a home in spam filtering

      • Paul Graham’s A Plan for Spam

        • A Naive Bayes-like classifier with weird parameter estimation

      • Widely used in spam filters

      • But many features beyond words:

        • black hole lists, etc.

        • particular hand-crafted text patterns

    Naïve Bayes in Spam Filtering

    • SpamAssassin Features:

      • Basic (Naïve) Bayes spam probability

      • Mentions: Generic Viagra

      • Mentions millions of $ ($NN,NNN,NNN.NN)

      • Phrase: impress ... girl

      • Phrase: 'Prestigious Non-Accredited Universities’

      • From: starts with many numbers

      • Subject is all capitals

      • HTML has a low ratio of text to image area

      • Relay in RBL, http://www.mail-abuse.com/enduserinfo_rbl.html

      • RCVD line looks faked

      • http://spamassassin.apache.org/tests_3_3_x.html

    Naïve Bayes on spam email

    Violation of NB Assumptions

    • The independence assumptions do not really hold of documents written in natural language.

      • Conditional independence

      • Positional independence

    • Examples?

    Example: Sensors

    Naïve Bayes Posterior Probabilities

    • Classification results of naïve Bayes (the class with maximum posterior probability) are usually fairly accurate.

    • However, due to the inadequacy of the conditional independence assumption, the actual posterior-probability numerical estimates are not.

      • Output probabilities are commonly very close to 0 or 1.

    • Correct estimation ⇒ accurate prediction, but correct probability estimation is NOT necessary for accurate prediction (just need right ordering of probabilities)

    Naive Bayes is Not So Naive

    • Very Fast Learning and Testing (basically just count the data)

    • Low Storage requirements

    • Very good in domains with many equally important features

    • More robust to irrelevant features than many learning methods

      • Irrelevant Features cancel each other without affecting results

    • More robust to concept drift (changing class definition over time)

    • Naive Bayes won 1st and 2nd place in KDD-CUP 97 competition out of 16 systems

      • Goal: Financial services industry direct mail response prediction: Predict if the recipient of mail will actually respond to the advertisement – 750,000 records.

    • A good dependable baseline for text classification (but not the best)!

    Resources for today’s lecture

    • IIR 13

    • Fabrizio Sebastiani. Machine Learning in Automated Text Categorization. ACM Computing Surveys, 34(1):1-47, 2002.

    • Yiming Yang & Xin Liu, A re-examination of text categorization methods. Proceedings of SIGIR, 1999.

    • Andrew McCallum and Kamal Nigam. A Comparison of Event Models for Naive Bayes Text Classification. In AAAI/ICML-98 Workshop on Learning for Text Categorization, pp. 41-48.

    • Tom Mitchell, Machine Learning. McGraw-Hill, 1997.

      • Clear simple explanation of Naïve Bayes

    • Open Calais: Automatic Semantic Tagging

      • Free (but they can keep your data), provided by Thomson Reuters (ex-ClearForest)

    • Weka: A data mining software package that includes an implementation of Naive Bayes

    • Reuters-21578 – the most famous text classification evaluation set

      • Still widely used by lazy people (but now it’s too small for realistic experiments – you should use Reuters RCV1)