Recap of the previous lecture

  • The type/token distinction

    • Terms are normalized types put in the dictionary

  • Tokenization problems:

    • Hyphens, apostrophes, compounds, CJK

  • Term equivalence classing:

    • Numbers, case folding, stemming, lemmatization

  • Skip pointers

    • Encoding a tree-like structure in a postings list

  • Biword indexes for phrases

  • Positional indexes for phrase and proximity queries
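
As a reminder of how the last point works, here is a minimal sketch of a positional index answering a phrase query. It is an illustration only: the names tokenize, build_positional_index and phrase_query are hypothetical, the tokenizer is a naive whitespace split with case folding, and the position lookup is a plain linear scan rather than a merge over sorted postings.

```python
from collections import defaultdict


def tokenize(text):
    # Assumption: whitespace split plus case folding. A real tokenizer must
    # also deal with hyphens, apostrophes, compounds, and CJK text.
    return text.lower().split()


def build_positional_index(docs):
    # Maps term -> {doc_id -> list of token positions, in increasing order}.
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(tokenize(text)):
            index[term][doc_id].append(pos)
    return index


def phrase_query(index, phrase):
    terms = tokenize(phrase)
    if not terms or any(t not in index for t in terms):
        return []
    # Step 1: keep only documents that contain every term of the phrase.
    candidates = set(index[terms[0]])
    for t in terms[1:]:
        candidates &= set(index[t])
    # Step 2: in each candidate, look for a start position of the first term
    # such that the k-th term occurs exactly k positions later.
    hits = []
    for doc_id in sorted(candidates):
        for start in index[terms[0]][doc_id]:
            if all(start + k in index[t][doc_id] for k, t in enumerate(terms)):
                hits.append(doc_id)
                break
    return hits


docs = {
    1: "to be or not to be",
    2: "be not afraid to be",
    3: "To Be is to do",
}
index = build_positional_index(docs)
print(phrase_query(index, "not to be"))
```

A proximity query would relax the exact offset check in step 2 to a window test (for example, positions within k tokens of each other); a biword index would instead store each adjacent pair of terms as a single dictionary entry.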
