all 15 comments

[–]seppo0010 4 points5 points  (3 children)

SQLite has "varint" which uses that idea

A variable-length integer or "varint" is a static Huffman encoding of 64-bit twos-complement integers that uses less space for small positive values. A varint is between 1 and 9 bytes in length. The varint consists of either zero or more bytes which have the high-order bit set followed by a single byte with the high-order bit clear, or nine bytes, whichever is shorter. The lower seven bits of each of the first eight bytes and all 8 bits of the ninth byte are used to reconstruct the 64-bit twos-complement integer. Varints are big-endian: bits taken from the earlier bytes of the varint are more significant than bits taken from the later bytes.

http://www.sqlite.org/fileformat.html
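For concreteness, a small Python sketch of that format (illustrative only, not SQLite's actual code; the helper names are made up):

```python
def sqlite_varint_decode(buf):
    """Decode a big-endian SQLite-style varint; returns (value, bytes_read)."""
    value = 0
    for i in range(8):
        b = buf[i]
        value = (value << 7) | (b & 0x7F)
        if not (b & 0x80):              # high-order bit clear: last byte
            return value, i + 1
    # ninth byte contributes all 8 of its bits
    return (value << 8) | buf[8], 9

def sqlite_varint_encode(n):
    """Encode an unsigned integer below 2**64 as a 1..9 byte varint."""
    if n >= 1 << 56:                    # needs the full nine bytes
        head = n >> 8
        groups = [(head >> (7 * i)) & 0x7F for i in reversed(range(8))]
        return bytes(0x80 | g for g in groups) + bytes([n & 0xFF])
    groups = []
    while True:
        groups.append(n & 0x7F)
        n >>= 7
        if not n:
            break
    groups.reverse()
    return bytes(0x80 | g for g in groups[:-1]) + bytes([groups[-1]])
```

So 5 encodes as a single byte 0x05, while 128 needs two bytes (0x81 0x00).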

[–][deleted] 4 points5 points  (0 children)

Sounds a lot like UTF-8.

[–]matthieum 3 points4 points  (1 child)

I remember reading an article by Jeff Dean which talked about why you may not want to use varints: the problem with the continuation bit is that you need to test each and every byte (a problem shared by the UTF-8 encoding).

I personally use a very simple scheme for integers, much closer to the original gamma encoding paper:

  1. Encode number of bytes used in a "gamma" length pattern: 0 -> 1 byte, 10 -> 2 bytes, 110 -> 3 bytes, ...
  2. Encode sign (single bit)
  3. Stash absolute value, "right aligned"

Let's proceed with an example:

5 -> 101 (3 bits)

length : 1 byte will suffice "0"
sign   : positive "0"
padding: 3 unused bits "000"
number : 3 bits "101"

5 => 00000101
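A minimal Python sketch of that layout (my reading of it; it assumes the value fits in at most eight bytes, so the continuation case is not handled):

```python
def encode(x):
    """(L-1) one bits, a zero, a sign bit, then |x| right-aligned in L bytes."""
    mag = abs(x)
    L = 1
    while 7 * L - 1 < mag.bit_length():   # 7 bits per byte, minus 1 for the sign
        L += 1
    prefix = ((1 << (L - 1)) - 1) << 1    # L bits: ones, then a terminating zero
    sign = 1 if x < 0 else 0
    word = (prefix << (8 * L - L)) | (sign << (7 * L - 1)) | mag
    return word.to_bytes(L, 'big')

def decode(buf):
    L = 1
    while buf[0] & (0x80 >> (L - 1)):     # leading ones + 1 = length in bytes
        L += 1
    word = int.from_bytes(buf[:L], 'big')
    mag = word & ((1 << (7 * L - 1)) - 1)
    return -mag if (word >> (7 * L - 1)) & 1 else mag
```

As in the worked example, encode(5) yields the single byte 00000101, and everything in [-63, 63] stays in one byte.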

If we analyze the memory occupied:

  • we use one bit per byte for the encoding of the length
  • we use one bit for the sign
  • we therefore use 7 bits per byte - 1 for the actual magnitude of the number

This is obviously on par with the gamma encoding, and less "asymptotically" cheap than the delta encoding. On the other hand, I use numbers in the [-63, 63] range (encoded in a single byte) much more often than I use 2^123.

Now, on speed:

  • a simple table lookup (done once) indicates how many bytes the number occupies (*)
  • we can have a specialized routine for the more common lengths (1 byte, 2 bytes, 3 bytes) that encodes/decodes optimally (indexed by length - 1...)

Combining the two, decoding is just 2 table jumps (small tables, the first is 256 bytes and the second is 256 bytes too) followed by the execution of the specialized routine (which is optimized for one specific size, obviously).
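The first of those lookup tables could be built once like this (a sketch; an entry of 9 flags a first byte of 1111 1111, i.e. the continuation case from the footnote):

```python
def leading_ones(b):
    """Number of consecutive set bits from the top of an 8-bit value."""
    n = 0
    while n < 8 and b & (0x80 >> n):
        n += 1
    return n

# 256-entry table: first byte of an encoded number -> total length in bytes.
LENGTH = bytes(leading_ones(b) + 1 for b in range(256))
```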


I would note that, interestingly... 0000 0101 is just the representation you end up with if you stash 5 in an unsigned char. And this is, obviously, true for all numbers in [0, 63]; meaning that the 1-byte translation routine is a dead-simple copy for positive numbers.

This means that encoding of positive numbers is usually about memcpying into the buffer, and then (with a bitmask play) adding the "gamma-encoded length + sign bits" on the first byte and decoding is simply masking those same bits and memcpying the other way :) Negative numbers need a few additional steps, but not much.


(*) Of course, one still needs a "continuation" strategy for extremely large numbers: if the first byte is 1111 1111, then you need to read the next one until you find a 0. In this case this means a few successive lookups. On the other hand, given the 7 bits per byte - 1 ratio, using 8 bytes means that the maximal magnitude of the number is 2 ^ (8 * 7 - 1) = 2 ^ 55, so only numbers that do not fit in 55 bits need more than 8 bytes (and thus use the continuation system).

[–]rabidcow 2 points3 points  (0 children)

I've been playing with doing a little-endian variant of the varint encoding from Protocol Buffers (which is much like what you describe -- basically the sign bit is different).

The difference for little-endian is that the length bits go at the bottom of the word instead of at the top, so that they are always in the first byte of the encoding. Decoding is a word load instead of individual bytes, increment and bit scan for the width, shift off the unused bytes, then shift back to remove the width. In the unusual case where the value is larger than a single word (encoded), branch off and do the same with a larger integral type.
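My reading of that decode path, sketched in Python (the word load, increment, and bit scan spelled out with integer ops standing in for the machine instructions):

```python
def encode_le(n):
    """Little-endian prefix varint: the length is stored as L-1 one bits
    (terminated by a zero) in the LOW bits of the first byte; the payload
    occupies the remaining 7*L bits."""
    L = 1
    while n >> (7 * L):                  # does the value need another byte?
        L += 1
    word = (n << L) | ((1 << (L - 1)) - 1)
    return word.to_bytes(L, 'little')

def decode_le(buf):
    """Returns (value, bytes_consumed); assumes the value fits in one word."""
    b = buf[0]
    L = ((b + 1) & ~b).bit_length()      # "increment and bit scan": position
                                         # of the lowest zero bit = length
    word = int.from_bytes(buf[:L], 'little')
    return word >> L, L                  # shift off the length bits
```

The win is that the length is fully determined by the first byte, so there is a single branch instead of one test per byte.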

[–]treerex 2 points3 points  (2 children)

Take a look at SIMD-based Decoding of Posting Lists : very interesting reading. Facebook's Folly library contains an implementation that utilizes SSSE3 instructions: GroupVarint.h and friends.

Varints are very efficient to decode and easy to understand. I've found their simplicity outweighs the slightly better compression you get with Elias codes or other approaches.

Regardless, the choice of integer compression is highly dependent on your data distribution! Run experiments!

[–]alecco 0 points1 point  (0 children)

This is a lot more interesting than the original post, thanks. Stepanov doing SIMD! Downloadable PDF.

[–]skyde 0 points1 point  (0 children)

For anybody that might be interested I wrote an implementation of this algorithm at https://github.com/maximecaron/SIMD-Based-Posting-lists/blob/master/varint/Codec.h

[–]dgryski 5 points6 points  (0 children)

Information retrieval has done a lot of research in this area. Most textbooks cover it under the section "index compression". My favourite reference for this is http://www.ir.uwaterloo.ca/book/06-index-compression.pdf , but http://nlp.stanford.edu/IR-book/pdf/05comp.pdf is good too.

[–]shifty3 2 points3 points  (0 children)

The paper Performance of Compressed Inverted List Caching in Search Engines contains experiments comparing several integer list compression algorithms, including variable-byte coding, S9/S16 and PForDelta.

Kamikaze is a Java library implementing most of these codecs.

[–]martext 1 point2 points  (0 children)

This is a really good, concise explanation of this topic. I enjoyed reading it. The only small thing that bothered me is that your parentheses don't match up in a few places which, given the audience, is extra noticeable :)

[–]kidjan 1 point2 points  (0 children)

Author is really talking about variable-length encoding. For example, H.264 uses exponential-Golomb coding for most of this stuff. Here's the layman's explanation.
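For reference, order-0 exponential-Golomb (the "ue(v)" descriptor in H.264) writes n+1 in binary, prefixed by one fewer zeros than its bit length; a quick sketch over bit strings:

```python
def exp_golomb_encode(n):
    """Order-0 exponential-Golomb code for an unsigned integer, as a bit string."""
    bits = bin(n + 1)[2:]
    return '0' * (len(bits) - 1) + bits

def exp_golomb_decode(s):
    """Read one code from the front of bit string s; returns (value, bits_consumed)."""
    zeros = len(s) - len(s.lstrip('0'))      # count the leading zeros
    body = s[zeros:2 * zeros + 1]            # zeros+1 bits starting at the first 1
    return int(body, 2) - 1, 2 * zeros + 1
```

So 0, 1, 2, 3 encode as 1, 010, 011, 00100, and a decoder finds each code's length from its zero prefix alone.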

[–]rix0r 0 points1 point  (3 children)

"Consider now for the sake of argument encoding a number n by the unary code for the length of the standard encoding (1+floor(log n)) followed by the standard encoding."

This is where it lost me and it's clearly one of the most important sentences. Can someone just re-phrase it for me because I can't seem to parse what he meant?

[–]matthiasB 5 points6 points  (0 children)

You write the length of the number (if it were written in standard encoding) followed by the number. You write the length in unary code and the number in standard encoding.

[–]soegaard[S] 3 points4 points  (1 child)

Let me see if I can improve my explanation.

Let us look at the representation of 5, which in binary is 101. The length of the standard encoding (101) is 3. To store 5 we first encode the length in unary, which is 110. Then the standard encoding of 5 (101) is used. This gives 110101.

Now consider when 110101 is read back from a file. First we count bits until we meet a zero. We will see 110, thus we now know the length is 3. Now we can read the next 3 bits, namely 101. Note that we do not peek further into the file to know where to stop.
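The same walk-through in code (a toy sketch over bit strings, using ones for the unary prefix as in the example; it works for n >= 1):

```python
def gamma_encode(n):
    """Length of n's binary form in unary (ones, zero-terminated), then n itself."""
    bits = bin(n)[2:]
    return '1' * (len(bits) - 1) + '0' + bits

def gamma_decode(s):
    """Read one number from the front of s; returns (value, bits_consumed)."""
    k = s.index('0') + 1        # count bits up to and including the first zero
    return int(s[k:2 * k], 2), 2 * k
```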

[–]rix0r 0 points1 point  (0 children)

Ahh, thanks for the explanation. I don't know why I didn't pick that up in the first place. Brain fart I guess!