Token and Terms

TOKEN AND TERMS

Tokenization

  • Input: “Friends, Romans, Countrymen

  • Output: Tokens

    • Friends

    • Romans

    • Countrymen

  • A token is a sequence of characters in a document

  • Each such token is now a candidate for an index entry, after further processing

    • Described below

  • But what are valid tokens to emit?

Tokenization

  • Issues in tokenization:

    • Finland’s capital → 

    • Finland? Finlands? Finland’s?

    • Hewlett-Packard → Hewlett and Packard as two tokens?

      • state-of-the-art: break up hyphenated sequence.

      • co-education

      • lowercase, lower-case, lower case ?

      • It can be effective to get the user to put in possible hyphens

    • San Francisco: one token or two?

      • How do you decide it is one token?

Numbers

  • 3/12/91                Mar. 12, 1991                     12/3/91

  • 55 B.C.

  • B-52

  • My PGP key is 324a3df234cb23e

  • (800) 234-2333

    • Often have embedded spaces

    • Older IR systems may not index numbers

      • But often very useful: think about things like looking up error codes/stacktraces on the web

      • (One answer is using n-grams: Lecture 3)

    • Will often index “meta-data” separately

      • Creation date, format, etc.

Tokenization: language issues

  • French

    • L'ensemble    one token or two?

      • L ? L’ ? Le ?

      • Want l’ensemble to match with un ensemble

        • Until at least 2003, it didn’t on Google

        • Internationalization!
  • German noun compounds are not segmented

    • Lebensversicherungsgesellschaftsangestellter

    • ‘life insurance company employee’

    • German retrieval systems benefit greatly from a compound splitter module

        • Can give a 15% performance boost for German

Tokenization: language issues

  • Chinese and Japanese have no spaces between words:

    • 莎拉波娃现在居住在美国东南部的佛罗里达。
    • Not always guaranteed a unique tokenization
  • Further complicated in Japanese, with multiple alphabets intermingled

  • Dates/amounts in multiple formats

      End-user can express query entirely in hiragana!

Tokenization: language issues

  • Arabic (or Hebrew) is basically written right to left, but with certain items like numbers written left to right

  • Words are separated, but letter forms within a word form complex ligatures

  •                                        ← →    ← →                                  ← start

  • ‘Algeria achieved its independence in 1962 after 132 years of French occupation.’

  • With Unicode, the surface presentation is complex, but the stored form is straightforward