Compression

  • We now consider compressing the space used by the dictionary and the postings file

    • Basic Boolean index only

    • No study of positional indexes, etc.

    • We will consider several compression schemes



  DICTIONARY COMPRESSION



Why compress the dictionary?

  • Search begins with the dictionary

  • We want to keep it in memory

  • The dictionary competes for memory with other applications

  • Embedded/mobile devices may have very little memory

  • Even if the dictionary isn’t in memory, we want it to be small for a fast search startup time

  • So, compressing the dictionary is important



Dictionary storage - first cut

  • Array of fixed-width entries
    • ~400,000 terms; 28 bytes/term = 11.2 MB.

    (Each fixed-width entry: term, 20 bytes; document frequency, 4 bytes; pointer to postings list, 4 bytes.)
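  • A minimal sketch of this layout (illustrative Python; the record format is an assumption based on the byte counts above):

      import struct

      # Hypothetical fixed-width record: 20-byte term field, 4-byte document
      # frequency, 4-byte offset into the postings file -> 28 bytes per entry.
      ENTRY = struct.Struct("20s I I")

      def pack_entry(term, freq, postings_ptr):
          # Terms longer than 20 bytes are silently truncated -- one of the
          # problems with this layout.
          return ENTRY.pack(term.encode("ascii")[:20], freq, postings_ptr)

      entry = pack_entry("computer", 42, 0)   # dummy freq/pointer values
      assert len(entry) == 28                 # 400,000 entries x 28 bytes = 11.2 MB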

Dictionary search structure



Fixed-width terms are wasteful

  • Most of the bytes in the Term column are wasted – we allot 20 bytes for 1 letter terms.

    • And we still can’t handle supercalifragilisticexpialidocious or hydrochlorofluorocarbons.

  • Written English averages ~4.5 characters/word.

    • Exercise: Why is/isn’t this the number to use for estimating the dictionary size?

  • Ave. dictionary word in English: ~8 characters

    • How do we use ~8 characters per dictionary term?

  • Short words dominate token counts but not type average.



Compressing the term list: Dictionary-as-a-String

  • Store dictionary as a (long) string of characters:

    • Pointer to next word shows end of current word

    • Hope to save up to 60% of dictionary space.
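  • A minimal sketch of this layout (illustrative Python; freq/postings fields left as dummies):

      # Dictionary-as-a-string: terms are concatenated into one long string,
      # and each table entry keeps an offset into it. A term ends where the
      # next entry's offset begins.
      term_string = ""
      table = []   # (freq, postings_ptr, term_offset) per term
      for t in ["automata", "automate", "automatic", "automation"]:  # sorted
          table.append((0, 0, len(term_string)))
          term_string += t

      def get_term(i):
          start = table[i][2]
          end = table[i + 1][2] if i + 1 < len(table) else len(term_string)
          return term_string[start:end]

      assert get_term(1) == "automate"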


Space for dictionary as a string

  • 4 bytes per term for Freq.

  • 4 bytes per term for pointer to Postings.

  • 3 bytes per term pointer into the string

  • Avg. 8 bytes per term in the term string

  • A term now costs avg. 11 bytes (8 characters + 3-byte pointer), not 20

  • 400K terms × 19 bytes ⇒ 7.6 MB (against 11.2 MB for fixed width)



Blocking

  • Store pointers to every kth term string.

    • Example below: k=4.

  • Need to store term lengths (1 extra byte)
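  • A minimal sketch of blocked storage (illustrative Python, k = 4; the example terms are placeholders):

      K = 4  # block size

      def build_blocked(terms):
          """Concatenate length-prefixed terms; keep one pointer per block of K."""
          s, block_ptrs = "", []
          for i, t in enumerate(terms):
              if i % K == 0:
                  block_ptrs.append(len(s))   # pointer to every Kth term
              s += chr(len(t)) + t            # 1 length byte, then the term
          return s, block_ptrs

      def get_term(s, block_ptrs, i):
          pos = block_ptrs[i // K]
          for _ in range(i % K):              # skip earlier terms in the block
              pos += 1 + ord(s[pos])
          return s[pos + 1 : pos + 1 + ord(s[pos])]

      s, ptrs = build_blocked(["aid", "box", "den", "ex", "job", "ox", "pit", "win"])
      assert get_term(s, ptrs, 5) == "ox"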



Net

  • Example for block size k = 4

  • Where we used 3 bytes/pointer without blocking:

    • 3 × 4 = 12 bytes for a block of 4 term pointers,

  • now we use 3 + 4 = 7 bytes per block: one 3-byte pointer plus four 1-byte term lengths.

  • This saves 5 bytes per 4-term block, i.e. 100K blocks × 5 bytes ≈ 0.5 MB, reducing the dictionary from 7.6 MB to 7.1 MB.

  • We can save more with larger k.

  • Why not go with larger k?



Exercise

  • Estimate the space usage (and savings compared to 7.6 MB) with blocking, for block sizes of k = 4, 8 and 16.



Dictionary search without blocking

  • Assuming each dictionary term is equally likely in a query (not really so in practice!), the average number of comparisons is (1 + 2·2 + 4·3 + 4)/8 ≈ 2.6

  • Exercise: if the frequencies of query terms were non-uniform but known, how would you structure the dictionary search tree?



Dictionary search with blocking

  • Binary search down to 4-term block;

    • Then linear search through terms in block.

  • Blocks of 4 (binary tree), avg. = (1 + 2·2 + 2·3 + 2·4 + 5)/8 = 3 compares
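  • A minimal sketch of the two-phase lookup (illustrative Python, continuing the blocked-storage sketch above, so K, s and ptrs are assumed defined):

      from bisect import bisect_right

      def find(s, block_ptrs, first_terms, query):
          """first_terms[j] holds the first term of block j for the binary search."""
          j = bisect_right(first_terms, query) - 1  # binary search to the block
          if j < 0:
              return None
          pos = block_ptrs[j]
          for i in range(K):                        # linear scan inside the block
              if pos >= len(s):
                  break
              length = ord(s[pos])
              if s[pos + 1 : pos + 1 + length] == query:
                  return j * K + i                  # position of the term
              pos += 1 + length
          return None

      first_terms = ["aid", "job"]  # first term of each block in the sketch above
      assert find(s, ptrs, first_terms, "ox") == 5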



Exercise

  • Estimate the impact on search performance (and slowdown compared to k=1) with blocking, for block sizes of k = 4, 8 and 16. 



Front coding

  • Front-coding:

    • Sorted words commonly have long common prefix – store differences only

    • (for last k-1 in a block of k)

    • 8automata8automate9automatic10automation

                        → 8automat*a1◊e2◊ic3◊ion

      (* marks the end of the shared prefix and encodes automat; the number before each ◊ gives the extra length beyond automat.)

  • Begins to resemble general string compression.
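  • A minimal sketch of front-coding one block (illustrative Python, reproducing the example above):

      import os

      def front_encode_block(terms):
          """Front-code a sorted block: full first term, then suffix diffs."""
          prefix = os.path.commonprefix(terms)
          first = terms[0]
          out = f"{len(first)}{prefix}*{first[len(prefix):]}"
          for t in terms[1:]:
              extra = t[len(prefix):]              # suffix beyond shared prefix
              out += f"{len(extra)}\u25ca{extra}"  # \u25ca is the ◊ separator
          return out

      encoded = front_encode_block(["automata", "automate", "automatic", "automation"])
      assert encoded == "8automat*a1\u25cae2\u25caic3\u25caion"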



RCV1 dictionary compression summary

  Technique                                           Size in MB
  Fixed width                                         11.2
  Dictionary-as-String with pointers to every term     7.6
  Also, blocking k = 4                                 7.1
  Also, blocking + front coding                        5.9



  POSTINGS COMPRESSION



Postings compression

  • The postings file is much larger than the dictionary, by a factor of at least 10.

  • Key desideratum: store each posting compactly.

  • A posting for our purposes is a docID.

  • For Reuters (800,000 documents), we would use 32 bits per docID when using 4-byte integers.

  • Alternatively, we can use log2 800,000 ≈ 20 bits per docID.

  • Our goal: use far fewer than 20 bits per docID.



Postings: two conflicting forces

  • A term like arachnocentric occurs in maybe one doc out of a million – we would like to store this posting using log2 1M ~ 20 bits.

  • A term like the occurs in virtually every doc, so 20 bits/posting is too expensive.

    • Prefer 0/1 bitmap vector in this case



Postings file entry

  • We store the list of docs containing a term in increasing order of docID.

    • computer: 33,47,154,159,202 …

  • Consequence: it suffices to store gaps.

    • 33,14,107,5,43 …

  • Hope: most gaps can be encoded/stored with far fewer than 20 bits.
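  • A minimal sketch of the gap transformation (illustrative Python):

      def to_gaps(doc_ids):
          """Replace sorted docIDs by the first ID plus successive differences."""
          return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

      def from_gaps(gaps):
          """Invert the transformation by keeping a running sum."""
          out, total = [], 0
          for g in gaps:
              total += g
              out.append(total)
          return out

      postings = [33, 47, 154, 159, 202]
      assert to_gaps(postings) == [33, 14, 107, 5, 43]
      assert from_gaps(to_gaps(postings)) == postings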



Three postings entries



Variable length encoding

  • Aim:

    • For arachnocentric, we will use ~20 bits/gap entry.

    • For the, we will use ~1 bit/gap entry.

  • If the average gap for a term is G, we want to use ~log2 G bits/gap entry.

  • Key challenge: encode every integer (gap) with about as few bits as needed for that integer.

  • This requires a variable length encoding.

  • Variable length codes achieve this by using short codes for small numbers.



Variable Byte (VB) codes

  • For a gap value G, we want to use close to the fewest bytes needed to hold log2 G bits

  • Begin with one byte to store G and dedicate 1 bit in it to be a continuation bit c

  • If G ≤127, binary-encode it in the 7 available bits and set c =1

  • Else encode G’s lower-order 7 bits and then use additional bytes to encode the higher order bits using the same algorithm

  • At the end set the continuation bit of the last byte to 1 (c =1) – and for the other bytes c = 0.



Example


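  • A minimal sketch of VB encoding and decoding, following the algorithm above (illustrative Python):

      def vb_encode_number(n):
          """Encode one gap; the last byte has its continuation bit set."""
          out = []
          while True:
              out.insert(0, n % 128)   # 7 payload bits per byte
              if n < 128:
                  break
              n //= 128
          out[-1] += 128               # c = 1 marks the final byte
          return bytes(out)

      def vb_decode(stream):
          """Decode a byte stream of VB codes back into a list of gaps."""
          gaps, n = [], 0
          for b in stream:
              if b < 128:              # c = 0: more bytes follow
                  n = 128 * n + b
              else:                    # c = 1: this byte ends the number
                  gaps.append(128 * n + (b - 128))
                  n = 0
          return gaps

      # 5 -> 10000101; 300 -> 00000010 10101100
      assert vb_decode(vb_encode_number(5) + vb_encode_number(300)) == [5, 300]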

Other variable unit codes

  • Instead of bytes, we can also use a different “unit of alignment”: 32 bits (words), 16 bits, 4 bits (nibbles).

  • Variable byte alignment wastes space if you have many small gaps – nibbles do better in such cases.

  • Variable byte codes:

    • Used by many commercial/research systems

    • Good low-tech blend of variable-length coding and sensitivity to computer memory alignment (vs. bit-level codes, which we look at next).

  • There is also recent work on word-aligned codes that pack a variable number of gaps into one word



Unary code

  • Represent n as n 1s with a final 0.

  • Unary code for 3 is 1110.

  • Unary code for 40 is

  • 11111111111111111111111111111111111111110 .

  • Unary code for 80 is:

  • 111111111111111111111111111111111111111111111111111111111111111111111111111111110

  • This doesn’t look promising, but….
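  • A minimal sketch (illustrative Python; the helper is reused in the gamma sketch below):

      def unary(n):
          """Unary code: n ones followed by a terminating zero."""
          return "1" * n + "0"

      assert unary(3) == "1110"
      assert len(unary(40)) == 41   # 40 ones plus the final 0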



Gamma codes

  • We can compress better with bit-level codes

    • The Gamma code is the best known of these.

  • Represent a gap G as a pair length and offset

  • offset is G in binary, with the leading bit cut off

    • For example 13 → 1101 → 101

  • length is the length of offset

    • For 13 (offset 101), this is 3.

  • We encode length with unary code: 1110.

  • Gamma code of 13 is the concatenation of length and offset: 1110101
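  • A minimal sketch of gamma encoding and decoding (illustrative Python, reusing the unary helper above):

      def gamma_encode(g):
          """Gamma code: unary(len(offset)), then offset (binary minus leading 1)."""
          offset = bin(g)[3:]              # e.g. 13 -> '0b1101' -> '101'
          return unary(len(offset)) + offset

      def gamma_decode(bits):
          """Decode a concatenation of gamma codes back into gaps."""
          gaps, i = [], 0
          while i < len(bits):
              length = bits.index("0", i) - i   # read the unary length prefix
              i += length + 1                   # skip it, incl. its final 0
              gaps.append(int("1" + bits[i : i + length], 2))  # restore leading 1
              i += length
          return gaps

      assert gamma_encode(13) == "1110101"
      assert gamma_decode(gamma_encode(13) + gamma_encode(5)) == [13, 5]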



Gamma code examples



Gamma code properties

  • G is encoded using 2⌊log2 G⌋ + 1 bits

    • Length of offset is ⌊log2 G⌋ bits

    • Length of length is ⌊log2 G⌋ + 1 bits

  • All gamma codes have an odd number of bits

  • Almost within a factor of 2 of best possible, log2 G

  • Gamma code is uniquely prefix-decodable, like VB

  • Gamma code can be used for any distribution

  • Gamma code is parameter-free
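  • A quick check of the 2⌊log2 G⌋ + 1 bit count against the sketch above (illustrative Python):

      for g in range(1, 1000):
          expected = 2 * (g.bit_length() - 1) + 1   # bit_length()-1 == floor(log2 g)
          assert len(gamma_encode(g)) == expected   # always an odd number of bits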



Gamma seldom used in practice

  • Machines have word boundaries – 8, 16, 32, 64 bits

    • Operations that cross word boundaries are slower

  • Compressing and manipulating at the granularity of bits can be slow

  • Variable byte encoding is aligned and thus potentially more efficient

  • Regardless of efficiency, variable byte is conceptually simpler at little additional space cost



RCV1 compression



Index compression summary

  • We can now create an index for highly efficient Boolean retrieval that is very space efficient

  • Only 4% of the total size of the collection

  • Only 10-15% of the total size of the text in the collection

  • However, we’ve ignored positional information

  • Hence, space savings are less for indexes used in practice

    • But techniques substantially the same.




