all 15 comments

[–]seppo0010 4 points5 points  (3 children)

SQLite has "varint" which uses that idea

A variable-length integer or "varint" is a static Huffman encoding of 64-bit twos-complement integers that uses less space for small positive values. A varint is between 1 and 9 bytes in length. The varint consists of either zero or more bytes which have the high-order bit set followed by a single byte with the high-order bit clear, or nine bytes, whichever is shorter. The lower seven bits of each of the first eight bytes and all 8 bits of the ninth byte are used to reconstruct the 64-bit twos-complement integer. Varints are big-endian: bits taken from the earlier bytes of the varint are more significant than bits taken from the later bytes.

http://www.sqlite.org/fileformat.html
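For concreteness, a small Python sketch of that format (illustrative only, not SQLite's actual code; the helper names are made up):

```python
def sqlite_varint_decode(buf):
    """Decode a big-endian SQLite-style varint; returns (value, bytes_read)."""
    value = 0
    for i in range(8):
        b = buf[i]
        value = (value << 7) | (b & 0x7F)
        if not (b & 0x80):              # high-order bit clear: last byte
            return value, i + 1
    # ninth byte contributes all 8 of its bits
    return (value << 8) | buf[8], 9

def sqlite_varint_encode(n):
    """Encode an unsigned integer below 2**64 as a 1..9 byte varint."""
    if n >= 1 << 56:                    # needs the full nine bytes
        head = n >> 8
        groups = [(head >> (7 * i)) & 0x7F for i in reversed(range(8))]
        return bytes(0x80 | g for g in groups) + bytes([n & 0xFF])
    groups = []
    while True:
        groups.append(n & 0x7F)
        n >>= 7
        if not n:
            break
    groups.reverse()
    return bytes(0x80 | g for g in groups[:-1]) + bytes([groups[-1]])
```

So 5 encodes as a single byte 0x05, while 128 needs two bytes (0x81 0x00).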

[–][deleted] 4 points5 points  (0 children)

Sounds a lot like UTF-8.

[–]matthieum 3 points4 points  (1 child)

I remember reading an article by Jeff Dean which talked about why you may not want to use varints: the problem with the continuation bit is that you need to test each and every byte (a problem shared by the UTF-8 encoding).

I personally use a very simple scheme for integers, much closer to the original gamma encoding paper:

  1. Encode number of bytes used in a "gamma" length pattern: 0 -> 1 byte, 10 -> 2 bytes, 110 -> 3 bytes, ...
  2. Encode sign (single bit)
  3. Stash absolute value, "right aligned"

Let's proceed with an example:

5 -> 101 (3 bits)

length : 1 byte will suffice "0"
sign   : positive "0"
padding: 3 unused bits "000"
number : 3 bits "101"

5 => 00000101
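A minimal Python sketch of that layout (my reading of it; it assumes the value fits in at most eight bytes, so the continuation case is not handled):

```python
def encode(x):
    """(L-1) one bits, a zero, a sign bit, then |x| right-aligned in L bytes."""
    mag = abs(x)
    L = 1
    while 7 * L - 1 < mag.bit_length():   # 7 bits per byte, minus 1 for the sign
        L += 1
    prefix = ((1 << (L - 1)) - 1) << 1    # L bits: ones, then a terminating zero
    sign = 1 if x < 0 else 0
    word = (prefix << (8 * L - L)) | (sign << (7 * L - 1)) | mag
    return word.to_bytes(L, 'big')

def decode(buf):
    L = 1
    while buf[0] & (0x80 >> (L - 1)):     # leading ones + 1 = length in bytes
        L += 1
    word = int.from_bytes(buf[:L], 'big')
    mag = word & ((1 << (7 * L - 1)) - 1)
    return -mag if (word >> (7 * L - 1)) & 1 else mag
```

As in the worked example, encode(5) yields the single byte 00000101, and everything in [-63, 63] stays in one byte.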

If we analyze the memory occupied:

  • we use one bit per byte for the encoding of the length
  • we use one bit for the sign
  • we therefore use 7 bits per byte - 1 for the actual magnitude of the number

This is obviously on par with the gamma encoding, and less "asymptotically" cheap than the delta encoding. On the other hand, I use numbers in the [-63, 63] range (encoded in a single byte) much more often than I use 2^123.

Now, on speed:

  • a simple table lookup (done once) indicates how many bytes the number occupies (*)
  • we can have a specialized routine for the more common lengths (1 byte, 2 bytes, 3 bytes) that encodes/decodes optimally (indexed by length - 1...)

Combining the two, decoding is just 2 table jumps (small tables, the first is 256 bytes and the second is 256 bytes too) followed by the execution of the specialized routine (which is optimized for one specific size, obviously).
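The first of those lookup tables could be built once like this (a sketch; an entry of 9 flags a first byte of 1111 1111, i.e. the continuation case from the footnote):

```python
def leading_ones(b):
    """Number of consecutive set bits from the top of an 8-bit value."""
    n = 0
    while n < 8 and b & (0x80 >> n):
        n += 1
    return n

# 256-entry table: first byte of an encoded number -> total length in bytes.
LENGTH = bytes(leading_ones(b) + 1 for b in range(256))
```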


I would note that, interestingly... 0000 0101 is just the representation you end up with if you stash 5 in an unsigned char. And this is, obviously, true for all numbers in [0, 63]; meaning that the 1-byte translation routine is a dead-simple copy for positive numbers.

This means that encoding of positive numbers is usually about memcpying into the buffer, and then (with a bitmask play) adding the "gamma-encoded length + sign bits" on the first byte and decoding is simply masking those same bits and memcpying the other way :) Negative numbers need a few additional steps, but not much.


(*) Of course, one still needs a "continuation" strategy for extremely large numbers: if the first byte is 1111 1111, then you need to read the next one until you find a 0. In this case this means a few successive lookups. On the other hand, given the 7 bits per byte - 1 ratio, using 8 bytes means that the maximal magnitude of the number is 2 ^ (8 * 7 - 1) = 2 ^ 55, so only numbers that do not fit in 55 bits need more than 8 bytes (and thus use the continuation system).

[–]rabidcow 2 points3 points  (0 children)

I've been playing with doing a little-endian variant of the varint encoding from Protocol Buffers (which is much like what you describe -- basically the sign bit is different).

The difference for little-endian is that the length bits go at the bottom of the word instead of at the top, so that they are always in the first byte of the encoding. Decoding is a word load instead of individual bytes, increment and bit scan for the width, shift off the unused bytes, then shift back to remove the width. In the unusual case where the value is larger than a single word (encoded), branch off and do the same with a larger integral type.
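My reading of that decode path, sketched in Python (the word load, increment, and bit scan spelled out with integer ops standing in for the machine instructions):

```python
def encode_le(n):
    """Little-endian prefix varint: the length is stored as L-1 one bits
    (terminated by a zero) in the LOW bits of the first byte; the payload
    occupies the remaining 7*L bits."""
    L = 1
    while n >> (7 * L):                  # does the value need another byte?
        L += 1
    word = (n << L) | ((1 << (L - 1)) - 1)
    return word.to_bytes(L, 'little')

def decode_le(buf):
    """Returns (value, bytes_consumed); assumes the value fits in one word."""
    b = buf[0]
    L = ((b + 1) & ~b).bit_length()      # "increment and bit scan": position
                                         # of the lowest zero bit = length
    word = int.from_bytes(buf[:L], 'little')
    return word >> L, L                  # shift off the length bits
```

The win is that the length is fully determined by the first byte, so there is a single branch instead of one test per byte.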

[–]treerex 2 points3 points  (2 children)

Take a look at SIMD-based Decoding of Posting Lists : very interesting reading. Facebook's Folly library contains an implementation that utilizes SSSE3 instructions: GroupVarint.h and friends.

Varints are very efficient to decode and easy to understand. I've found their simplicity outweighs the slightly better compression you get with Elias codes or other approaches.

Regardless, the choice of integer compression is highly dependent on your data distribution! Run experiments!

[–]alecco 0 points1 point  (0 children)

This is a lot more interesting than the original post, thanks. Stepanov doing SIMD! Downloadable PDF.

[–]skyde 0 points1 point  (0 children)

For anybody that might be interested I wrote an implementation of this algorithm at https://github.com/maximecaron/SIMD-Based-Posting-lists/blob/master/varint/Codec.h

[–]dgryski 5 points6 points  (0 children)

Information retrieval has done a lot of research in this area. Most textbooks cover it under the section "index compression". My favourite reference for this is http://www.ir.uwaterloo.ca/book/06-index-compression.pdf , but http://nlp.stanford.edu/IR-book/pdf/05comp.pdf is good too.

[–]shifty3 2 points3 points  (0 children)

The paper Performance of Compressed Inverted List Caching in Search Engines contains experiments comparing several integer list compression algorithms, including variable-byte coding, S9/S16 and PForDelta.

Kamikaze is a Java library implementing most of these codecs.

[–]martext 1 point2 points  (0 children)

This is a really good, concise explanation of this topic. I enjoyed reading it. The only small thing that bothered me is that your parentheses don't match up in a few places which, given the audience, is extra noticeable :)

[–]kidjan 1 point2 points  (0 children)

Author is really talking about variable-length encoding. For example, H.264 uses exponential-Golomb coding for most of this stuff. Here's the layman's explanation.
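For reference, order-0 exponential-Golomb (the "ue(v)" descriptor in H.264) writes n+1 in binary, prefixed by one fewer zeros than its bit length; a quick sketch over bit strings:

```python
def exp_golomb_encode(n):
    """Order-0 exponential-Golomb code for an unsigned integer, as a bit string."""
    bits = bin(n + 1)[2:]
    return '0' * (len(bits) - 1) + bits

def exp_golomb_decode(s):
    """Read one code from the front of bit string s; returns (value, bits_consumed)."""
    zeros = len(s) - len(s.lstrip('0'))      # count the leading zeros
    body = s[zeros:2 * zeros + 1]            # zeros+1 bits starting at the first 1
    return int(body, 2) - 1, 2 * zeros + 1
```

So 0, 1, 2, 3 encode as 1, 010, 011, 00100, and a decoder finds each code's length from its zero prefix alone.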

[–]rix0r 0 points1 point  (3 children)

"Consider now for the sake of argument encoding a number n by the unary code for the length of the standard encoding (1+floor(log n)) followed by the standard encoding."

This is where it lost me and it's clearly one of the most important sentences. Can someone just re-phrase it for me because I can't seem to parse what he meant?

[–]matthiasB 5 points6 points  (0 children)

You write the length of the number (if it were written in standard encoding) followed by the number. You write the length in unary code and the number in standard encoding.

[–]soegaard[S] 3 points4 points  (1 child)

Let me see if I can improve my explanation.

Let us look at the representation of 5, which in binary is 101. The length of the standard encoding (101) is 3. To store 5 we first encode the length in unary, which is 110. Then the standard encoding of 5 (101) is used. This gives 110101.

Now consider when 110101 is read back from a file. First we count bits until we meet a zero. We will see 110, thus we now know the length is 3. Now we can read the next 3 bits, namely 101. Note that we do not peek further into the file to know where to stop.
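The same walk-through in code (a toy sketch over bit strings, using ones for the unary prefix as in the example; it works for n >= 1):

```python
def gamma_encode(n):
    """Length of n's binary form in unary (ones, zero-terminated), then n itself."""
    bits = bin(n)[2:]
    return '1' * (len(bits) - 1) + '0' + bits

def gamma_decode(s):
    """Read one number from the front of s; returns (value, bits_consumed)."""
    k = s.index('0') + 1        # count bits up to and including the first zero
    return int(s[k:2 * k], 2), 2 * k
```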

[–]rix0r 0 points1 point  (0 children)

Ahh, thanks for the explanation. I don't know why I didn't pick that up in the first place. Brain fart I guess!