### Introduction to Information Retrieval

CS276: Information Retrieval and Web Search

Lecture 10: Text Classification; The Naive Bayes algorithm

### Standing queries

• The path from IR to text classification:

• You have an information need to monitor, say:

• Unrest in the Niger delta region

• You want to rerun an appropriate query periodically to find new news items on this topic

• You will be sent new documents that are found

• I.e., it’s not ranking but classification (relevant vs. not relevant)

• Such queries are called standing queries

• Long used by “information professionals”

• Standing queries are (hand-written) text classifiers

### Spam filtering: Another text classification task

• From: ""<takworlld@hotmail.com>

• Subject: real estate is the only way... gem oalvgkay

• Anyone can buy real estate with no money down

• Stop paying rent TODAY !

• There is no need to spend hundreds or even thousands for similar courses

• I am 22 years old and I have already purchased 6 properties using the

• methods outlined in this truly INCREDIBLE ebook.

• Change your life NOW !

• =================================================

• Click Below to order:

• http://www.wholesaledaily.com/sales/nmd.htm

• =================================================

### Text classification

• Today:

• Introduction to Text Classification

• Also widely known as “text categorization”

• Same thing

• Naïve Bayes text classification

• Including a little on Probabilistic Language Models

### Categorization/Classification

• Given:

• A description of an instance, d ∈ X

• X is the instance language or instance space.

• Issue: how to represent text documents.

• Usually some type of high-dimensional space – bag of words

• A fixed set of classes:

• C = {c1, c2,…, cJ}

• Determine:

• The category of d: γ(d) ∈ C, where γ(d) is a classification function whose domain is X and whose range is C.

• We want to know how to build classification functions (“classifiers”).

### Machine Learning: Supervised Classification

• Given:

• A description of an instance, d ∈ X

• X is the instance language or instance space.

• A fixed set of classes:

• C = {c1, c2,…, cJ}

• A training set D of labeled documents with each labeled document ⟨d,c⟩ ∈ X×C

• Determine:

• A learning method or algorithm which will enable us to learn a classifier γ:X→C

• For a test document d, we assign it the class γ(d) ∈ C

### Document Classification

• (Note: in real life there is often a hierarchy of classes, not present in the above problem statement; and also, you do get papers on ML approaches to garbage collection, i.e. documents that straddle classes.)

### More Text Classification Examples: Many search engine functionalities use classification

• Assigning labels to documents or web-pages:

• Labels are most often topics such as Yahoo-categories

• Labels may be genres

• "editorials" "movie-reviews" "news”

• Labels may be opinion on a person/product

• “like”, “hate”, “neutral”

• Labels may be domain-specific

• "interesting-to-me" : "not-interesting-to-me”

• “contains adult language” : “doesn’t”

• language identification: English, French, Chinese, …

• search vertical: about Linux versus not

### Classification Methods (1)

• Manual classification

• Used by the original Yahoo! Directory

• Very accurate when job is done by experts

• Consistent when the problem size and team is small

• Difficult and expensive to scale

• Means we need automatic classification methods for big problems

### Classification Methods (2)

• Hand-coded rule-based classifiers

• One technique used by CS dept’s spam filter, Reuters, CIA, etc.

• Widely deployed in government and enterprise

• Companies provide “IDE” for writing such rules

• E.g., assign category if document contains a given boolean combination of words

• Commercial systems have complex query languages (everything in IR query languages +score accumulators)

• Accuracy is often very high if a rule has been carefully refined over time by a subject expert

• Building and maintaining these rules is expensive

### A Verity topic: a complex classification rule

• Note: maintenance issues (author, etc.)

• Hand-weighting of terms

• [Verity was bought by Autonomy.]

### Classification Methods (3)

• Supervised learning of a document-label assignment function

• Many systems partly or wholly rely on machine learning (Autonomy, Microsoft, Enkata, Yahoo!, …)

• k-Nearest Neighbors (simple, powerful)

• Naive Bayes (simple, common method)

• Support-vector machines (new, generally more powerful)

• … plus many other methods

• No free lunch: requires hand-classified training data

• But data can be built up (and refined) by amateurs

• Many commercial systems use a mixture of methods

### Relevance feedback

• In relevance feedback, the user marks a few documents as relevant/nonrelevant

• The choices can be viewed as classes or categories

• The IR system then uses these judgments to build a better model of the information need

• So, relevance feedback can be viewed as a form of text classification (deciding between several classes)

### Probabilistic relevance feedback

• Rather than reweighting in a vector space…

• If user has told us some relevant and some nonrelevant documents, then we can proceed to build a probabilistic classifier

• such as the Naive Bayes model we will look at today:

• P(tk|R) = |Drk| / |Dr|

• P(tk|NR) = |Dnrk| / |Dnr|

• tk is a term; Dr is the set of known relevant documents; Drk is the subset that contains tk; Dnr is the set of known nonrelevant documents; Dnrk is the subset that contains tk.

### Recall a few probability basics

• For events a and b:

• Bayes’ Rule

• Odds:
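Spelled out, for events a and b these are the standard identities:

$$P(a, b) = P(a \mid b)\,P(b) = P(b \mid a)\,P(a) \;\Rightarrow\; P(a \mid b) = \frac{P(b \mid a)\,P(a)}{P(b)} \qquad\text{(Bayes' Rule)}$$

$$O(a) = \frac{P(a)}{P(\bar a)} = \frac{P(a)}{1 - P(a)} \qquad\text{(Odds)}$$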

### Bayesian Methods

• Learning and classification methods based on probability theory

• Bayes theorem plays a critical role

• Builds a generative model that approximates how data is produced

• Has prior probability of each category given no information about an item.

• Model produces a posterior probability

• Distribution over the possible categories given an item

• Naïve Bayes methods use a bag of words as the item description

### Bayes’ Rule for text classification

For a document d and a class c
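Applying Bayes' Rule with the document as evidence and the class as hypothesis:

$$P(c \mid d) = \frac{P(d \mid c)\,P(c)}{P(d)}$$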

### Naive Bayes Classifiers

• Task: Classify a new instance d based on a tuple of attribute values into one of the classes cj ∈ C

• d = ⟨x1, x2, …, xn⟩

• MAP is “maximum a posteriori” = most likely class
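Spelled out, the MAP class under Bayes' Rule is:

$$c_{MAP} = \arg\max_{c_j \in C} P(c_j \mid x_1, \ldots, x_n) = \arg\max_{c_j \in C} \frac{P(x_1, \ldots, x_n \mid c_j)\,P(c_j)}{P(x_1, \ldots, x_n)} = \arg\max_{c_j \in C} P(x_1, \ldots, x_n \mid c_j)\,P(c_j)$$

(the denominator can be dropped because it is the same for every class).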

### Naïve Bayes Classifier: Naïve Bayes Assumption

• P(cj)

• Can be estimated from the frequency of classes in the training examples.

• P(x1,x2,…,xn|cj)

• O(|X|^n · |C|) parameters

• Could only be estimated if a very, very large number of training examples was available.

• Naïve Bayes Conditional Independence Assumption:

• Assume that the probability of observing the conjunction of attributes is equal to the product of the individual probabilities P(xi|cj).
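Written out, the assumption is that the class-conditional likelihood factorizes into per-attribute terms:

$$P(x_1, x_2, \ldots, x_n \mid c_j) \approx \prod_{i=1}^{n} P(x_i \mid c_j)$$

which leaves far fewer parameters to estimate.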

### The Multivariate Bernoulli NB Classifier [Like Prob IR BIM; not language model; less used]

• Conditional Independence Assumption: features detect term presence and are independent of each other given the class:

• This model is appropriate for binary variables
• Multivariate Bernoulli model

### Learning the Model

• First attempt: maximum likelihood estimates

• Simply use the frequencies in the data
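In symbols, the maximum likelihood estimates are just relative frequencies over the training data:

$$\hat P(c_j) = \frac{N(C = c_j)}{N} \qquad\qquad \hat P(x_i \mid c_j) = \frac{N(X_i = x_i,\, C = c_j)}{N(C = c_j)}$$

where N(·) counts training documents with the given property.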

### Problem with Maximum Likelihood

• What if we have seen no training documents with the word muscle-ache and classified in the topic Flu?

• Zero probabilities cannot be conditioned away, no matter the other evidence!

### Smoothing to Avoid Overfitting

• Somewhat more subtle version
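The basic fix is add-one (Laplace) smoothing of the conditional estimates; the more subtle variant replaces the 1 with a prior-weighted pseudocount (an m-estimate):

$$\hat P(x_i \mid c_j) = \frac{N(X_i = x_i,\, C = c_j) + 1}{N(C = c_j) + k}$$

where k is the number of values the attribute xi can take (k = 2 for binary term-presence features).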

### Stochastic Language Models

• Model probability of generating any string

### Stochastic Language Models

• Model probability of generating strings (each word in turn) in a language (commonly all strings over alphabet ∑). E.g., a unigram model

### Unigram and higher-order models

• Unigram Language Models (easy, effective!)
• Bigram (generally, n-gram) Language Models
• Other Language Models
• Grammar-based models (PCFGs), etc.
• Probably not the first thing to try in IR

### Naïve Bayes via a class conditional language model = multinomial NB

• The probability of the words is done as a class-specific unigram language model

### Using Multinomial Naive Bayes Classifiers to Classify Text: Basic method

• Attributes are text positions, values are words.

• Still too many possibilities

• Assume that classification is independent of the positions of the words

• Use same parameters for each position

• Result is bag of words model (over tokens not types)

### Naive Bayes and Language Modeling

• Naïve Bayes classifiers can use any sort of feature

• URL, email address, dictionaries, network features

• But if, as in the previous slides

• We use only word features

• we use all of the words in the text (not a subset)

• Then

• Naïve Bayes is basically the same as language modeling

### Multinomial Naive Bayes: Learning

• From training corpus, extract Vocabulary
• Calculate required P(cj) and P(xk | cj) terms
• For each cj in C do
• docsj ← subset of documents for which the target class is cj
• P(cj) ← |docsj| / |total number of documents|
• Textj ← single document containing all docsj
• n ← total number of word occurrences in Textj
• for each word xk in Vocabulary
• nk ← number of occurrences of xk in Textj
• P(xk | cj) ← (nk + 1) / (n + |Vocabulary|)
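As a concrete illustration, here is a minimal Python sketch of this learning procedure together with the log-space classification rule discussed later in the lecture. The function and variable names (`train_multinomial_nb`, `classify`) are my own, not from the lecture.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs):
    """docs: list of (tokens, class_label) pairs.
    Returns class priors, smoothed conditional probabilities, and the vocabulary."""
    vocab = {w for tokens, _ in docs for w in tokens}
    n_docs = len(docs)
    prior, cond = {}, defaultdict(dict)
    for c in {label for _, label in docs}:
        class_docs = [tokens for tokens, label in docs if label == c]
        prior[c] = len(class_docs) / n_docs
        counts = Counter(w for tokens in class_docs for w in tokens)
        total = sum(counts.values())
        for w in vocab:
            # add-one (Laplace) smoothing, as in the learning algorithm above
            cond[c][w] = (counts[w] + 1) / (total + len(vocab))
    return prior, cond, vocab

def classify(tokens, prior, cond, vocab):
    """Return the class maximizing log prior plus summed log conditionals."""
    scores = {}
    for c in prior:
        score = math.log(prior[c])
        for w in tokens:
            if w in vocab:  # positions whose tokens are not in Vocabulary are skipped
                score += math.log(cond[c][w])
        scores[c] = score
    return max(scores, key=scores.get)
```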

### Naive Bayes: Classifying

• positions ← all word positions in the current document that contain tokens found in Vocabulary

• Return cNB, where
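$$c_{NB} = \arg\max_{c_j \in C}\; P(c_j) \prod_{i \in positions} P(x_i \mid c_j)$$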

### Naive Bayes: Time Complexity

• Training Time: O(|D| Lave + |C||V|), where Lave is the average length of a document in D.

• Assumes all counts are pre-computed in O(|D| Lave) time during one pass through all of the data. (Why?)

• Generally just O(|D| Lave), since usually |C||V| < |D| Lave

• Test Time: O(|C| Lt) where Lt is the average length of a test document.

• Very efficient overall, linearly proportional to the time needed to just read in all the data.

### Underflow Prevention: using logs

• Multiplying lots of probabilities, which are between 0 and 1 by definition, can result in floating-point underflow.

• Since log(xy) = log(x) + log(y), it is better to perform all computations by summing logs of probabilities rather than multiplying probabilities.

• Class with highest final un-normalized log probability score is still the most probable.

• Note that model is now just max of sum of weights…
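In log space the decision rule becomes a maximum over sums of log probabilities (weights):

$$c_{NB} = \arg\max_{c_j \in C} \Big[\, \log P(c_j) + \sum_{i \in positions} \log P(x_i \mid c_j) \,\Big]$$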

### Example

| | Doc | Words | Class |
|---|---|---|---|
| Training | 1 | Chinese Beijing Chinese | c |
| | 2 | Chinese Chinese Shanghai | c |
| | 3 | Chinese Macao | c |
| | 4 | Tokyo Japan Chinese | j |
| Test | 5 | Chinese Chinese Chinese Tokyo Japan | ? |
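Working the test document through with the multinomial model and add-one smoothing (vocabulary = {Chinese, Beijing, Shanghai, Macao, Tokyo, Japan}, so |V| = 6; class c has 8 training tokens, class j has 3):

$$P(c) = \tfrac{3}{4}, \qquad P(j) = \tfrac{1}{4}$$

$$P(\text{Chinese} \mid c) = \tfrac{5+1}{8+6} = \tfrac{3}{7}, \qquad P(\text{Tokyo} \mid c) = P(\text{Japan} \mid c) = \tfrac{0+1}{8+6} = \tfrac{1}{14}$$

$$P(\text{Chinese} \mid j) = P(\text{Tokyo} \mid j) = P(\text{Japan} \mid j) = \tfrac{1+1}{3+6} = \tfrac{2}{9}$$

$$P(c \mid d_5) \propto \tfrac{3}{4}\cdot\big(\tfrac{3}{7}\big)^3\cdot\tfrac{1}{14}\cdot\tfrac{1}{14} \approx 0.0003 \qquad P(j \mid d_5) \propto \tfrac{1}{4}\cdot\big(\tfrac{2}{9}\big)^3\cdot\big(\tfrac{2}{9}\big)^2 \approx 0.0001$$

so document 5 is assigned class c.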

### Two Naive Bayes Models

• Model 1: Multivariate Bernoulli

• One feature Xw for each word in dictionary

• (the generative loop iterates over the dictionary)

• Xw = true in document d if w appears in d

• Naive Bayes assumption:

• Given the document’s topic, appearance of one word in the document tells us nothing about chances that another word appears

• This is the model used in the binary independence model in classic probabilistic relevance feedback on hand-classified data

### Two Models

• Model 2: Multinomial = Class conditional unigram

• One feature Xi for each word position in document

• feature’s values are all words in dictionary

• Value of Xi is the word in position i

• Naïve Bayes assumption:

• Given the document’s topic, word in one position in the document tells us nothing about words in other positions

• Second assumption:

• Word appearance does not depend on position

P(Xi = w | c) = P(Xj = w | c) for all positions i, j, word w, and class c

• Just have one multinomial feature predicting all words

### Parameter estimation

Multivariate Bernoulli model:

 fraction of documents of topic cj in which word w appears

Multinomial model:

 fraction of times in which word w appears among all words in documents of topic cj

• Can create a mega-document for topic j by concatenating all documents in this topic
• Use frequency of w in mega-document
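In symbols (unsmoothed versions of the two estimates just described):

$$\hat P(X_w = 1 \mid c_j) = \frac{\#\{\text{docs of topic } c_j \text{ containing } w\}}{\#\{\text{docs of topic } c_j\}} \qquad \text{(Bernoulli)}$$

$$\hat P(X_i = w \mid c_j) = \frac{\#\{\text{occurrences of } w \text{ in docs of topic } c_j\}}{\#\{\text{word occurrences in docs of topic } c_j\}} \qquad \text{(multinomial)}$$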

### Which to use for classification?

• Multinomial vs Multivariate Bernoulli?

• Multinomial model is almost always more effective in text applications!

• See results figures later

• There has been exploration of multinomial naïve bayes variants which often work better in practice

• Binarized multinomial Naïve Bayes, etc.

• Topic of PA4

### Feature Selection: Why?

• Text collections have a large number of features

• 10,000 – 1,000,000 unique words … and more

• May make using a particular classifier feasible

• Some classifiers can’t deal with 1,000,000 features

• Reduces training time

• Training time for some methods is quadratic or worse in the number of features

• Makes runtime models smaller and faster

• Can improve generalization (performance)

• Eliminates noise features

• Avoids overfitting

### Feature Selection: How?

• Two ideas:

• Hypothesis testing statistics:

• Are we confident that the value of one categorical variable is associated with the value of another

• Chi-square test (χ2)

• Information theory:

• How much information does the value of one categorical variable give you about the value of another

• Mutual information

• They’re similar, but χ2 measures confidence in association, (based on available statistics), while MI measures extent of association (assuming perfect knowledge of probabilities)

### χ2 statistic (CHI)

• χ2 is interested in (fo – fe)²/fe summed over all table entries: is the observed number what you'd expect given the marginals?

• The null hypothesis is rejected with confidence .999, since 12.9 > 10.83 (the critical value for .999 confidence).

### χ2 statistic (CHI)

• There is a simpler formula for 2x2 χ2:

• N = A + B + C + D

• Value for complete independence of term and category?
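Assuming the standard 2×2 layout (A, B in the first row and C, D in the second, e.g. A = docs in the category containing the term, B = docs outside the category containing the term, C and D the corresponding counts without the term), the shortcut formula is:

$$\chi^2 = \frac{N\,(AD - BC)^2}{(A+B)\,(A+C)\,(B+D)\,(C+D)}$$

Under complete independence of term and category, AD = BC and the statistic is 0.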

### Feature selection via Mutual Information

• In training set, choose k words which best discriminate (give most info on) the categories.

• The Mutual Information between a word and a class is computed for each word w and each category c:
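One standard way to write it, summing over term presence/absence ew ∈ {0,1} and class membership ec ∈ {0,1} (as in IIR 13), with probabilities estimated from document counts in the training set:

$$I(w, c) = \sum_{e_w \in \{0,1\}} \sum_{e_c \in \{0,1\}} P(e_w, e_c)\,\log_2 \frac{P(e_w, e_c)}{P(e_w)\,P(e_c)}$$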

### Feature selection via MI (contd.)

• For each category we build a list of k most discriminating terms.

• For example (on 20 Newsgroups):

• sci.electronics: circuit, voltage, amp, ground, copy, battery, electronics, cooling, …

• rec.autos: car, cars, engine, ford, dealer, mustang, oil, collision, autos, tires, toyota, …

• Greedy: does not account for correlations between terms

• Why?

### Feature Selection

• Mutual Information

• Clear information-theoretic interpretation

• May select rare uninformative terms

• Chi-square

• Statistical foundation

• May select very slightly informative frequent terms that are not very useful for classification

### Feature Selection: Frequency

• The simplest feature selection method:

• Just use the commonest terms

• No particular foundation

• But it makes sense why this works

• They’re the words that can be well-estimated and are most often available as evidence

• In practice, this is often 90% as good as better methods

### Feature selection for NB

• In general feature selection is necessary for multivariate Bernoulli NB.

• Otherwise you suffer from noise, multi-counting

• “Feature selection” really means something different for multinomial NB. It means dictionary truncation

• The multinomial NB model only has 1 feature

• This “feature selection” normally isn’t needed for multinomial NB, but may help a fraction with quantities that are badly estimated

### Evaluating Categorization

• Evaluation must be done on test data that are independent of the training data (usually a disjoint set of instances).

• Sometimes use cross-validation (averaging results over multiple training and test splits of the overall data)

• It’s easy to get good performance on a test set that was available to the learner during training (e.g., just memorize the test set).

• Measures: precision, recall, F1, classification accuracy

• Classification accuracy: c/n where n is the total number of test instances and c is the number of test instances correctly classified by the system.

• Adequate if one class per document

• Otherwise F measure for each class

### WebKB Experiment (1998)

• Classify webpages from CS departments into:

• student, faculty, course, project

• Train on ~5,000 hand-labeled web pages

• Cornell, Washington, U. Texas, Wisconsin

• Crawl and classify a new site (CMU)
• Results:

### SpamAssassin

• Naïve Bayes has found a home in spam filtering

• Paul Graham’s A Plan for Spam

• A Naive Bayes-like classifier with weird parameter estimation

• Widely used in spam filters

• But many features beyond words:

• black hole lists, etc.

• particular hand-crafted text patterns

### Naïve Bayes in Spam Filtering

• SpamAssassin Features:

• Basic (Naïve) Bayes spam probability

• Mentions: Generic Viagra

• Mentions millions of dollars ($NN,NNN,NNN.NN)

• Phrase: impress ... girl

• Phrase: 'Prestigious Non-Accredited Universities’

• From: starts with many numbers

• Subject is all capitals

• HTML has a low ratio of text to image area

• Relay in RBL, http://www.mail-abuse.com/enduserinfo_rbl.html

• RCVD line looks faked

• http://spamassassin.apache.org/tests_3_3_x.html

### Violation of NB Assumptions

• The independence assumptions do not really hold of documents written in natural language.

• Conditional independence

• Positional independence

• Examples?

### Naïve Bayes Posterior Probabilities

• Classification results of naïve Bayes (the class with maximum posterior probability) are usually fairly accurate.

• However, due to the inadequacy of the conditional independence assumption, the actual posterior-probability numerical estimates are not.

• Output probabilities are commonly very close to 0 or 1.

• Correct estimation ⇒ accurate prediction, but correct probability estimation is NOT necessary for accurate prediction (just need right ordering of probabilities)

### Naive Bayes is Not So Naive

• Very Fast Learning and Testing (basically just count the data)

• Low Storage requirements

• Very good in domains with many equally important features

• More robust to irrelevant features than many learning methods

• Irrelevant Features cancel each other without affecting results

• More robust to concept drift (changing class definition over time)

• Naive Bayes won 1st and 2nd place in KDD-CUP 97 competition out of 16 systems

• Goal: Financial services industry direct mail response prediction: Predict if the recipient of mail will actually respond to the advertisement – 750,000 records.

• A good dependable baseline for text classification (but not the best)!

### Resources for today’s lecture

• IIR 13

• Fabrizio Sebastiani. Machine Learning in Automated Text Categorization. ACM Computing Surveys, 34(1):1-47, 2002.

• Yiming Yang & Xin Liu, A re-examination of text categorization methods. Proceedings of SIGIR, 1999.

• Andrew McCallum and Kamal Nigam. A Comparison of Event Models for Naive Bayes Text Classification. In AAAI/ICML-98 Workshop on Learning for Text Categorization, pp. 41-48.

• Tom Mitchell, Machine Learning. McGraw-Hill, 1997.

• Clear simple explanation of Naïve Bayes

• Open Calais: Automatic Semantic Tagging

• Free (but they can keep your data), provided by Thomson Reuters (ex-ClearForest)

• Weka: A data mining software package that includes an implementation of Naive Bayes

• Reuters-21578 – the most famous text classification evaluation set

• Still widely used by lazy people (but now it’s too small for realistic experiments – you should use Reuters RCV1)
