How big is the term vocabulary?
That is, how many distinct words are there?
Can we assume an upper bound?
Not really: At least 70^20 ≈ 10^37 different words of length 20
In practice, the vocabulary will keep growing with the collection size
Especially with Unicode :)
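The size of that bound can be checked with a few lines of arithmetic. This is a minimal sketch that reads the figure as 70 to the power 20, i.e. every string of length 20 over an assumed alphabet of 70 characters; the constant names are illustrative, not from the slides.

```python
import math

# Assumption: the bound counts all strings of length 20 over 70 characters.
ALPHABET_SIZE = 70
WORD_LENGTH = 20

# Python integers are arbitrary precision, so this value is exact.
num_strings = ALPHABET_SIZE ** WORD_LENGTH

# Order of magnitude of the count (base-10 logarithm).
magnitude = math.log10(num_strings)

print(f"{num_strings:.3e}")
print(round(magnitude, 1))
```

The count is about 8 × 10^36, which is on the order of 10^37, so a fixed upper bound on vocabulary size is of no practical use.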
Heaps’ law: M = kT^b
M is the size of the vocabulary, T is the number of tokens in the collection
Typical values: 30 ≤ k ≤ 100 and b ≈ 0.5
In a log-log plot of vocabulary size M vs. T, Heaps’ law predicts a line with slope about ½
A line is the simplest possible relationship between the two in log-log space
An empirical finding (“empirical law”)
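A short sketch may help make the log-log picture concrete. Taking logarithms of M = kT^b gives log M = log k + b log T, so k and b can be estimated by fitting a straight line to (log T, log M) pairs. The function names, the default k = 50 (an arbitrary value inside the typical range quoted above), and the synthetic data points are all assumptions made for illustration; they are not measurements from any real collection.

```python
import math

def heaps_vocabulary_size(num_tokens, k=50.0, b=0.5):
    """Vocabulary size predicted by Heaps' law, M = k * T**b.

    The defaults are illustrative values, not fitted parameters.
    """
    return k * num_tokens ** b

def fit_heaps(token_counts, vocab_sizes):
    """Least-squares line through (log10 T, log10 M); returns (k, b)."""
    xs = [math.log10(t) for t in token_counts]
    ys = [math.log10(m) for m in vocab_sizes]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                    # slope of the line in log-log space
    k = 10 ** (mean_y - b * mean_x)  # intercept is log10 k
    return k, b

# Synthetic (T, M) pairs generated from the law itself, so the fit
# should recover the parameters used to generate them.
tokens = [10**3, 10**4, 10**5, 10**6]
vocab = [heaps_vocabulary_size(t) for t in tokens]
k_hat, b_hat = fit_heaps(tokens, vocab)
print(k_hat, b_hat)
```

On real data the points scatter around the line, and the fitted k and b depend on the collection and on how tokens are normalized.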