Classification Methods (1)

  • Manual classification

    • Used by the original Yahoo! Directory

    • Looksmart, about.com, ODP, PubMed

    • Very accurate when job is done by experts

    • Consistent when the problem size and team are small

    • Difficult and expensive to scale

      • Means we need automatic classification methods for big problems

Classification Methods (2)

  • Hand-coded rule-based classifiers

    • One technique used by CS dept’s spam filter, Reuters, CIA, etc.

    • It’s what Google Alerts is doing

      • Widely deployed in government and enterprise

    • Companies provide an “IDE” for writing such rules

    • E.g., assign category if document contains a given boolean combination of words

    • Commercial systems have complex query languages (everything in IR query languages + score accumulators)

    • Accuracy is often very high if a rule has been carefully refined over time by a subject expert

    • Building and maintaining these rules is expensive
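
A minimal sketch of the "boolean combination of words" idea above. The category names, rule contents, and helper names (`tokenize`, `RULES`, `classify`) are hypothetical and chosen only for illustration; real commercial rule languages are far richer than this.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

# Hypothetical hand-written rules: category -> boolean predicate over the token set.
RULES = {
    "grain": lambda t: ("wheat" in t or "corn" in t) and "software" not in t,
    "spam": lambda t: "viagra" in t or ("free" in t and "offer" in t),
}

def classify(text):
    """Return every category whose rule fires on the document, sorted by name."""
    tokens = tokenize(text)
    return sorted(cat for cat, rule in RULES.items() if rule(tokens))

print(classify("Wheat prices rose as corn exports fell"))
```

Each rule is cheap to evaluate and easy to inspect, but every new category or exception means another hand edit, which is where the maintenance cost comes from.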

A Verity topic: a complex classification rule

  • Note:

    • maintenance issues (author, etc.)

    • Hand-weighting of terms

    • [Verity was bought by Autonomy.]
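
Hand-weighting of terms can be pictured as a score accumulator: each matching term adds its hand-assigned weight, and the topic fires when the total passes a threshold. The sketch below is not Verity's syntax; the terms, weights, threshold, and function names are all assumptions made up for illustration.

```python
import re

# Hypothetical topic definition: hand-assigned term weights plus a cutoff.
TOPIC_WEIGHTS = {
    "art": 0.7,
    "painting": 0.9,
    "sculpture": 0.9,
    "museum": 0.5,
    "gallery": 0.6,
}
THRESHOLD = 1.0  # assumed cutoff; in practice an expert tunes this by hand

def topic_score(text, weights=TOPIC_WEIGHTS):
    """Sum the weights of the topic terms that occur in the text."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return sum(w for term, w in weights.items() if term in tokens)

def matches_topic(text, threshold=THRESHOLD):
    """True when the accumulated score reaches the threshold."""
    return topic_score(text) >= threshold

print(matches_topic("The museum opened a new painting gallery"))
```

Every weight here is a judgment call by the rule author, so accuracy depends on that person continuing to maintain the rule.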

Classification Methods (3)

  • Supervised learning of a document-label assignment function

    • Many systems partly or wholly rely on machine learning (Autonomy, Microsoft, Enkata, Yahoo!, …)

      • k-Nearest Neighbors (simple, powerful)

      • Naive Bayes (simple, common method)

      • Support-vector machines (new, generally more powerful)

      • … plus many other methods

    • No free lunch: requires hand-classified training data

    • But data can be built up (and refined) by amateurs

  • Many commercial systems use a mixture of methods
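
To make "supervised learning of a document-label assignment function" concrete, here is a minimal sketch of a multinomial Naive Bayes classifier with add-one smoothing, written with the standard library only. The training documents, labels, and function names are toy assumptions; a production system would use far more hand-classified data and a tested library.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and return its list of word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def train(labeled_docs):
    """Count documents per class and term occurrences per class."""
    class_counts = Counter()
    term_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labeled_docs:
        class_counts[label] += 1
        for tok in tokenize(text):
            term_counts[label][tok] += 1
            vocab.add(tok)
    return class_counts, term_counts, vocab

def predict(model, text):
    """Return the class with the highest log prior plus log likelihood."""
    class_counts, term_counts, vocab = model
    n_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        score = math.log(class_counts[label] / n_docs)
        total = sum(term_counts[label].values())
        for tok in tokenize(text):
            if tok in vocab:  # terms never seen in training are ignored
                # Add-one (Laplace) smoothing avoids zero probabilities.
                score += math.log((term_counts[label][tok] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy hand-classified training data (hypothetical).
TRAINING = [
    ("win a free prize now", "spam"),
    ("free offer click now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday with the team", "ham"),
]
model = train(TRAINING)
print(predict(model, "free prize offer"))
```

The only human input is the labeled examples; the decision function itself is estimated from counts, which is why the labeling work can be spread across non-experts.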