Normalization to terms

  • We need to “normalize” words in indexed text as well as query words into the same form

    • We want to match U.S.A. and USA

  • Result is terms: a term is a (normalized) word type, which is an entry in our IR system dictionary

  • Most commonly, we implicitly define equivalence classes of terms by, e.g.,

    • deleting periods to form a term

      • U.S.A., USA → USA

    • deleting hyphens to form a term

      • anti-discriminatory, antidiscriminatory → antidiscriminatory
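
The two deletion rules above can be sketched in a few lines. This is a minimal illustration, not a production normalizer: the function name normalize_token and the toy index are assumptions made for the example. The point it shows is that the same function is applied to indexed text and to query words.

```python
def normalize_token(token: str) -> str:
    """Map a token to its term by deleting periods and hyphens."""
    return token.replace(".", "").replace("-", "")


# Toy index (hypothetical): normalized term -> set of document ids.
documents = {1: ["U.S.A.", "anti-discriminatory"], 2: ["USA"]}
index = {}
for doc_id, tokens in documents.items():
    for token in tokens:
        index.setdefault(normalize_token(token), set()).add(doc_id)


def lookup(query_word: str) -> set:
    """Normalize the query word the same way before consulting the index."""
    return index.get(normalize_token(query_word), set())


print(sorted(lookup("U.S.A.")))
print(sorted(lookup("antidiscriminatory")))
```

Because both sides go through normalize_token, a query for either U.S.A. or USA reaches the same dictionary entry.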

Normalization: other languages

  • Accents: e.g., French résumé vs. resume.

  • Umlauts: e.g., German: Tuebingen vs. Tübingen

    • Should be equivalent

  • Most important criterion:

    • How are your users likely to write their queries for these words?

  • Even in languages that standardly have accents, users often do not type them

    • Often best to normalize to a de-accented term

      • Tuebingen, Tübingen, Tubingen → Tubingen
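
One way to de-accent, sketched with Python's standard unicodedata module: decompose each character and drop the combining marks. The digraph table for spellings like Tuebingen is a hypothetical and deliberately crude assumption; it handles only lowercase digraphs and would also rewrite unrelated words (e.g., blue), so a real system would need language-aware rules.

```python
import unicodedata

# Hypothetical, deliberately crude table for German "ue"-style spellings.
GERMAN_DIGRAPHS = {"ae": "a", "oe": "o", "ue": "u"}


def strip_accents(text: str) -> str:
    """Decompose characters (NFD) and drop the combining accent marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def deaccent_german(text: str) -> str:
    """Strip accents, then collapse lowercase ae/oe/ue digraphs (over-eager)."""
    text = strip_accents(text)
    for digraph, plain in GERMAN_DIGRAPHS.items():
        text = text.replace(digraph, plain)
    return text


for word in ["Tuebingen", "Tübingen", "Tubingen"]:
    print(word, "->", deaccent_german(word))
```

All three spellings end up as the single term Tubingen, and strip_accents alone maps résumé to resume.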

Normalization: other languages

  • Normalization of things like date forms

    • 7月30日 vs. 7/30
    • Japanese use of kana vs. Chinese characters
  • Tokenization and normalization may depend on the language and so are intertwined with language detection

  • Crucial: Need to “normalize” indexed text as well as query terms into the same form
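
As a small illustration of normalizing date forms, the sketch below rewrites the Japanese month/day form into the slash form using a regular expression. The function name and the choice of M/D as the target form are assumptions made for the example; real date normalization has to cover many more formats and locale conventions.

```python
import re

# Matches a one- or two-digit month and day written as <month>月<day>日.
KANJI_DATE = re.compile(r"(\d{1,2})月(\d{1,2})日")


def normalize_date_forms(text: str) -> str:
    """Rewrite dates such as 7月30日 into the M/D form 7/30."""
    return KANJI_DATE.sub(
        lambda m: f"{int(m.group(1))}/{int(m.group(2))}", text
    )


print(normalize_date_forms("7月30日"))
```

As with the other rules, the same function would have to be applied to indexed text and to queries.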

Case folding

  • Reduce all letters to lower case

    • exception: upper case in mid-sentence?

      • e.g., General Motors
      • Fed vs. fed
      • SAIL vs. sail
    • Often best to lower case everything, since users will use lowercase regardless of ‘correct’ capitalization
  • Google example:

    • Query C.A.T.

    • #1 result was for “cat” (well, Lolcats) not Caterpillar Inc.
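
A minimal sketch of lowercasing everything, combined with period deletion so that the C.A.T. query can be traced through. The function name case_fold is an assumption for the example, and it shows the information loss the exceptions above point at: Fed and fed become the same term.

```python
def case_fold(token: str) -> str:
    """Delete periods, then reduce all letters to lower case."""
    return token.replace(".", "").lower()


for word in ["C.A.T.", "General", "Fed", "fed", "SAIL", "sail"]:
    print(word, "->", case_fold(word))
```

Under this scheme C.A.T. becomes cat, which is consistent with a query for C.A.T. retrieving pages about cats.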

Normalization to terms

  • An alternative to equivalence classing is to do asymmetric expansion

  • An example of where this may be useful:

    • Enter: window Search: window, windows

    • Enter: windows Search: Windows, windows, window

    • Enter: Windows Search: Windows

  • Potentially more powerful, but less efficient
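
The window/windows example can be sketched as a query-time expansion table. The table contents come from the example above; the names EXPANSIONS, expand_query_term, and search, and the toy index, are assumptions made for the illustration. Here the index keeps the original, unnormalized forms and each query term is expanded into several lookups, which is where the extra cost comes from.

```python
# Query term entered -> index terms actually searched (from the example above).
EXPANSIONS = {
    "window": ["window", "windows"],
    "windows": ["Windows", "windows", "window"],
    "Windows": ["Windows"],
}

# Toy index (hypothetical) that keeps surface forms distinct.
INDEX = {"window": {1}, "windows": {2}, "Windows": {3}}


def expand_query_term(term: str) -> list:
    """Return the index terms to search for a query term (itself by default)."""
    return EXPANSIONS.get(term, [term])


def search(term: str) -> set:
    """Union the postings of every expanded variant: one lookup per variant."""
    results = set()
    for variant in expand_query_term(term):
        results |= INDEX.get(variant, set())
    return results


for query in ["window", "windows", "Windows"]:
    print(query, "->", sorted(search(query)))
```

The expansion is asymmetric: a query for windows reaches documents containing Windows, but a query for Windows does not reach documents containing windows or window.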