Tokenization: language issues

  • French

    • L'ensemble: one token or two?

      • L? L’? Le?

      • Want l’ensemble to match un ensemble

        • Until at least 2003, it didn’t on Google

        • Internationalization!

  • German noun compounds are not segmented

    • Lebensversicherungsgesellschaftsangestellter

    • ‘life insurance company employee’

    • German retrieval systems benefit greatly from a compound splitter module

      • Can give a 15% performance boost for German
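Both issues above can be illustrated with a minimal Python sketch. It is not how any particular search engine works. The clitic list `ELIDED_CLITICS`, the toy `LEXICON`, the `LINKERS` tuple, and the function names `tokenize_french` and `split_compound` are all hypothetical, chosen only for illustration. The French part splits an elided clitic off the following word so that l’ensemble and un ensemble share the term "ensemble"; the German part is a simple longest-prefix dictionary splitter that allows an optional linking "s" between parts.

```python
import re

# Hypothetical, incomplete list of French clitics that elide before a vowel.
ELIDED_CLITICS = ("l", "d", "j", "m", "n", "s", "t", "c", "qu")

# Matches a clitic plus a straight or curly apostrophe at the start of a word.
_ELISION = re.compile(
    r"\b(" + "|".join(ELIDED_CLITICS) + r")['’](?=\w)", re.IGNORECASE
)


def tokenize_french(text):
    """Lowercase the text and split elided clitics into separate tokens."""
    # "L'ensemble" becomes "l' ensemble" before the final token scan.
    spaced = _ELISION.sub(lambda m: m.group(1) + "' ", text.lower())
    return re.findall(r"\w+'?", spaced)


# Toy lexicon (assumption); a real splitter would need a large dictionary
# or corpus statistics to choose between competing splits.
LEXICON = {"leben", "versicherung", "gesellschaft", "angestellter"}

# Optional linking elements tried between parts (assumed, incomplete).
LINKERS = ("", "s")


def split_compound(word, lexicon=LEXICON):
    """Return the lexicon words that make up `word`, or None if no split is found."""
    word = word.lower()
    if not word:
        return []
    # Try the longest known prefix first, then recurse on the remainder.
    for end in range(len(word), 0, -1):
        head = word[:end]
        if head not in lexicon:
            continue
        rest = word[end:]
        for link in LINKERS:
            if rest.startswith(link):
                tail = split_compound(rest[len(link):], lexicon)
                if tail is not None:
                    return [head] + tail
    return None


if __name__ == "__main__":
    print(tokenize_french("L’ensemble"))
    print(tokenize_french("un ensemble"))
    print(split_compound("Lebensversicherungsgesellschaftsangestellter"))
```

With this design the index can drop or keep the clitic token (`l'`) as a separate decision, and a query for any one part of the German compound can match a document containing the whole compound.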

