Postings compression

  • The postings file is much larger than the dictionary, factor of at least 10.

  • Key desideratum: store each posting compactly.

  • A posting for our purposes is a docID.

  • For Reuters (800,000 documents), we would use 32 bits per docID when using 4-byte integers.

  • Alternatively, we can use log2 800,000 ≈ 20 bits per docID.

  • Our goal: use far fewer than 20 bits per docID.

Postings: two conflicting forces

  • A term like arachnocentric occurs in maybe one doc out of a million – we would like to store this posting using log2 1M ~ 20 bits.

  • A term like the occurs in virtually every doc, so 20 bits/posting is too expensive.

    • Prefer 0/1 bitmap vector in this case

Postings file entry

  • We store the list of docs containing a term in increasing order of docID.

    • computer: 33,47,154,159,202 …

  • Consequence: it suffices to store gaps.

    • 33,14,107,5,43 …

  • Hope: most gaps can be encoded/stored with far fewer than 20 bits.

Three postings entries

Variable length encoding

  • Aim:

    • For arachnocentric, we will use ~20 bits/gap entry.

    • For the, we will use ~1 bit/gap entry.

  • If the average gap for a term is G, we want to use ~log2G bits/gap entry.

  • Key challenge: encode every integer (gap) with about as few bits as needed for that integer.

  • This requires a variable length encoding

  • Variable length codes achieve this by using short codes for small numbers

Variable Byte (VB) codes

  • For a gap value G, we want to use close to the fewest bytes needed to hold log2 G bits

  • Begin with one byte to store G and dedicate 1 bit in it to be a continuation bit c

  • If G ≤127, binary-encode it in the 7 available bits and set c =1

  • Else encode G’s lower-order 7 bits and then use additional bytes to encode the higher order bits using the same algorithm

  • At the end set the continuation bit of the last byte to 1 (c =1) – and for the other bytes c = 0.


Other variable unit codes

  • Instead of bytes, we can also use a different “unit of alignment”: 32 bits (words), 16 bits, 4 bits (nibbles).

  • Variable byte alignment wastes space if you have many small gaps – nibbles do better in such cases.

  • Variable byte codes:

    • Used by many commercial/research systems

    • Good low-tech blend of variable-length coding and sensitivity to computer memory alignment matches (vs. bit-level codes, which we look at next).

  • There is also recent work on word-aligned codes that pack a variable number of gaps into one word

Unary code

  • Represent n as n 1s with a final 0.

  • Unary code for 3 is 1110.

  • Unary code for 40 is

  • 11111111111111111111111111111111111111110 .

  • Unary code for 80 is:

  • 111111111111111111111111111111111111111111111111111111111111111111111111111111110

  • This doesn’t look promising, but….

Gamma codes

  • We can compress better with bit-level codes

    • The Gamma code is the best known of these.

  • Represent a gap G as a pair length and offset

  • offset is G in binary, with the leading bit cut off

    • For example 13 → 1101 → 101

  • length is the length of offset

    • For 13 (offset 101), this is 3.

  • We encode length with unary code: 1110.

  • Gamma code of 13 is the concatenation of length and offset: 1110101

Gamma code examples

Gamma code properties

  • G is encoded using 2   log G⌋ + 1 bits

    • Length of offset is   log G   bits

    • Length of length is ⌊log G  + 1 bits

  • All gamma codes have an odd number of bits

  • Almost within a factor of 2 of best possible, log2 G

  • Gamma code is uniquely prefix-decodable, like VB

  • Gamma code can be used for any distribution

  • Gamma code is parameter-free

Gamma seldom used in practice

  • Machines have word boundaries – 8, 16, 32, 64 bits

    • Operations that cross word boundaries are slower

  • Compressing and manipulating at the granularity of bits can be slow

  • Variable byte encoding is aligned and thus potentially more efficient

  • Regardless of efficiency, variable byte is conceptually simpler at little additional space cost

RCV1 compression