Vocabulary vs. collection size

  • How big is the term vocabulary?

    • That is, how many distinct words are there?

  • Can we assume an upper bound?

    • Not really: with 70 possible characters, there are at least 70^20 ≈ 10^37 different words of length 20

  • In practice, the vocabulary will keep growing with the collection size

    • Especially with Unicode :)
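
The count on this slide can be checked with a few lines of Python. This is only a sketch of the arithmetic: the alphabet size of 70 is read off the slide's own figure, and the constant names are illustrative, not part of the original deck.

```python
import math

# Assumption: 70 distinct characters, the base used in the slide's estimate.
ALPHABET_SIZE = 70
WORD_LENGTH = 20

# Number of distinct strings of exactly WORD_LENGTH characters.
count = ALPHABET_SIZE ** WORD_LENGTH

# Order of magnitude: log10(70^20) = 20 * log10(70).
magnitude = math.log10(count)

print(f"70^20 is roughly 10^{magnitude:.1f}")
```

The exponent comes out just under 37, which is why the slide rounds the count to 10^37.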



Vocabulary vs. collection size

  • Heaps’ law: M = kT^b

  • M is the size of the vocabulary, T is the number of tokens in the collection

  • Typical values: 30 ≤ k ≤ 100 and b ≈ 0.5

  • In a log-log plot of vocabulary size M vs. T, Heaps’ law predicts a line with slope b, i.e. about ½

    • Taking logs gives log M = log k + b log T; a line is the simplest possible relationship between the two in log-log space

    • An empirical finding (“empirical law”)
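
As a rough illustration of the slide above, the sketch below evaluates M = kT^b and recovers k and b from (T, M) measurements with an ordinary least-squares line fit in log-log space. The function names, the choice of least squares, and the values k = 50, b = 0.5 are assumptions made for the example (picked from the typical range quoted on the slide); they are not measurements from a real collection.

```python
import math


def heaps_vocabulary(num_tokens, k, b):
    """Predicted vocabulary size M = k * T**b."""
    return k * num_tokens ** b


def fit_heaps(token_counts, vocab_sizes):
    """Fit log M = log k + b * log T by least squares; return (k, b)."""
    xs = [math.log10(t) for t in token_counts]
    ys = [math.log10(m) for m in vocab_sizes]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope of the best-fit line is the Heaps exponent b.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    # Intercept of the line is log10(k).
    k = 10 ** (mean_y - b * mean_x)
    return k, b


# Synthetic data generated from the law itself (illustrative values only).
token_counts = [10_000, 100_000, 1_000_000, 10_000_000]
vocab_sizes = [heaps_vocabulary(t, k=50, b=0.5) for t in token_counts]

k_fit, b_fit = fit_heaps(token_counts, vocab_sizes)
print(f"fitted k = {k_fit:.2f}, b = {b_fit:.3f}")
```

With real data, the (T, M) pairs would come from counting tokens and distinct terms while scanning the collection; the fit is then only approximate, since Heaps’ law is an empirical regularity rather than an exact rule.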





Creator: Tgbyrdmc


Licensed under the Creative Commons Attribution-ShareAlike (CC BY-SA) license

