Current Slide
Small screen detected. You are viewing the mobile version of SlideWiki. If you wish to edit slides you will need to use a larger device.
Sort-based index construction
As we build the index, we parse docs one at a time.
While building the index, we cannot easily exploit compression tricks (you can, but much more complex)
The final postings for any term are incomplete until the end.
At 12 bytes per non-positional postings entry (term, doc, freq), demands a lot of space for large collections.
T = 100,000,000 in the case of RCV1
So … we can do this in memory in 2009, but typical collections are much larger. E.g., the New York Times provides an index of >150 years of newswire
Thus: We need to store intermediate results on disk.
Speaker notes:
Content Tools
Tools
Sources (0)
Tags (0)
Comments (0)
History
Usage
Questions (0)
Playlists (0)
Quality
Sources
There are currently no sources for this slide.