### Postings compression

The postings file is much larger than the dictionary, factor of at least 10.

Key desideratum: store each posting compactly.

A posting for our purposes is a docID.

For Reuters (800,000 documents), we would use 32 bits per docID when using 4-byte integers.

Alternatively, we can use log2 800,000 ≈ 20 bits per docID.

Our goal: use far fewer than 20 bits per docID.

### Postings: two conflicting forces

A term like

occurs in maybe one doc out of a million – we would like to store this posting using log2 1M ~ 20 bits.**arachnocentric**

A term like

occurs in virtually every doc, so 20 bits/posting is too expensive.**the**

Prefer 0/1 bitmap vector in this case

### Postings file entry

We store the list of docs containing a term in increasing order of docID.

: 33,47,**computer****154,159**,202 …

*Consequence:*it suffices to store*gaps*.

33,14,107,

**5**,43 …

*Hope:*most gaps can be encoded/stored with far fewer than 20 bits.

### Three postings entries

### Variable length encoding

Aim:

For

, we will use ~20 bits/gap entry.**arachnocentric**

For

, we will use ~1 bit/gap entry.**the**

If the average gap for a term is

*G*, we want to use ~log2*G*bits/gap entry.

Key challenge: encode every integer (gap) with about as few bits as needed for that integer.

This requires a

*variable length encoding*

Variable length codes achieve this by using short codes for small numbers

### Variable Byte (VB) codes

For a gap value

*G,*we want to use close to the fewest bytes needed to hold log2*G*bits

Begin with one byte to store

*G*and dedicate 1 bit in it to be a continuation bit*c*

If

*G*≤127, binary-encode it in the 7 available bits and set*c*=1

Else encode

*G*’s lower-order 7 bits and then use additional bytes to encode the higher order bits using the same algorithm

At the end set the continuation bit of the last byte to 1 (

*c*=1) – and for the other bytes*c*= 0.

### Example

### Other variable unit codes

Instead of bytes, we can also use a different “unit of alignment”: 32 bits (words), 16 bits, 4 bits (nibbles).

Variable byte alignment wastes space if you have many small gaps – nibbles do better in such cases.

Variable byte codes:

Used by many commercial/research systems

Good low-tech blend of variable-length coding and sensitivity to computer memory alignment matches (vs. bit-level codes, which we look at next).

There is also recent work on word-aligned codes that pack a variable number of gaps into one word

### Unary code

Represent

*n*as*n*1s with a final 0.

Unary code for 3 is 1110.

Unary code for 40 is

11111111111111111111111111111111111111110 .

Unary code for 80 is:

111111111111111111111111111111111111111111111111111111111111111111111111111111110

This doesn’t look promising, but….

### Gamma codes

We can compress better with bit-level codes

The Gamma code is the best known of these.

Represent a gap

*G*as a pair*length*and*offset*

*offset*is*G*in binary, with the leading bit cut off

For example 13 → 1101 → 101

*length*is the length of offset

For 13 (offset 101), this is 3.

We encode

*length*with*unary code*: 1110.

Gamma code of 13 is the concatenation of

*length*and*offset*: 1110101

### Gamma code examples

### Gamma code properties

*G*is encoded using 2 ⌊ log*G*⌋ + 1 bits

Length of offset is ⌊ log

*G*⌋ bits

Length of length is ⌊log

*G*⌋ + 1 bits

All gamma codes have an odd number of bits

Almost within a factor of 2 of best possible, log

_{2}*G*

Gamma code is uniquely prefix-decodable, like VB

Gamma code can be used for any distribution

Gamma code is parameter-free

### Gamma seldom used in practice

Machines have word boundaries – 8, 16, 32, 64 bits

Operations that cross word boundaries are slower

Compressing and manipulating at the granularity of bits can be slow

Variable byte encoding is aligned and thus potentially more efficient

Regardless of efficiency, variable byte is conceptually simpler at little additional space cost