Small screen detected. You are viewing the mobile version of SlideWiki. If you wish to edit slides you will need to use a larger device.
As we build the index, we parse docs one at a time.
While building the index, we cannot easily exploit compression tricks (you can, but much more complex)
The final postings for any term are incomplete until the end.
At 12 bytes per non-positional postings entry (term, doc, freq), demands a lot of space for large collections.
T = 100,000,000 in the case of RCV1
So … we can do this in memory in 2009, but typical collections are much larger. E.g., the New York Times provides an index of >150 years of newswire
Thus: We need to store intermediate results on disk.
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License