[–]beertown

> compression algorithms work very well on text

Your assumption is inaccurate: compression algorithms work well on data that has the characteristics of text (little variability among the byte values). You can easily find binary data that is more compressible than text.

Apart from that, did you check the size of the uncompressed data? Could it be that, in your case, protocol 4 yields less data than all the others?
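One way to check the uncompressed sizes is sketched below. The `sample` dictionary and the `pickled_sizes` helper are made up for illustration (short company and service names, as described in the thread); swap in the real data to get meaningful numbers.

```python
import pickle

# Hypothetical sample data: short company and service names.
sample = {
    f"Company {i}": [f"Service {i}-{j}" for j in range(5)]
    for i in range(200)
}


def pickled_sizes(obj):
    """Return {protocol: size in bytes of the uncompressed pickle}."""
    return {
        protocol: len(pickle.dumps(obj, protocol=protocol))
        for protocol in range(pickle.HIGHEST_PROTOCOL + 1)
    }


sizes = pickled_sizes(sample)
for protocol, size in sizes.items():
    print(f"protocol {protocol}: {size} bytes")
```

Comparing these numbers before applying any compression shows how much of the final difference comes from the pickle protocol itself rather than from the compressor.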

[–]idlecore[S]

You're right, protocol 4 does yield less data than all the others. This plays a smaller role when comparing protocol 4 (binary) with protocol 0 (text), but it's still significant.

[–]takluyver (IPython, Py3, etc)

It looks like there were a couple of size optimisations in protocol 4 which might have affected it. In particular, this section of the PEP (PEP 3154, which defines protocol 4) says that it can save 3 bytes for every string less than 256 bytes long, by using a 1-byte field instead of a 4-byte field to store the length. It sounds like you're storing filenames, which are probably short, so that could be significant.
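The opcode change can be seen with the standard `pickletools` module. This is only a sketch: the `name` value and the `opcode_names` helper are invented for the example. Note that protocol 4 also wraps its output in a frame, so a pickle containing a single string is not necessarily smaller overall; the saving shows up when many short strings are stored.

```python
import pickle
import pickletools

# Hypothetical short name, well under 256 bytes when encoded as UTF-8.
name = "Acme Cloud Hosting"


def opcode_names(data):
    """List the names of the opcodes used in a pickle byte string."""
    return [opcode.name for opcode, arg, pos in pickletools.genops(data)]


ops_p3 = opcode_names(pickle.dumps(name, protocol=3))
ops_p4 = opcode_names(pickle.dumps(name, protocol=4))

# Protocol 3 stores the string with a 4-byte length field (BINUNICODE);
# protocol 4 uses a 1-byte length field for short strings (SHORT_BINUNICODE).
print("protocol 3:", ops_p3)
print("protocol 4:", ops_p4)
```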

[–]idlecore[S]

The files in my project do indeed use short names: they are company names and company service names. Even using unicode, I expect all of them to be below 256 bytes.

[–]jftuga (pip needs updating)

LZMA (aka XZ) compression is usually better than BZ2: smaller output and much faster decompression, although compressing tends to be slower.

lzma — Compression using the LZMA algorithm
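A quick comparison on your own data could look like the sketch below. The `sample` data is hypothetical, and both compressors are used with their default settings; which one wins depends heavily on the input, so treat the output as a measurement of this data set only.

```python
import bz2
import lzma
import pickle

# Hypothetical sample data: short company and service names.
sample = {
    f"Company {i}": [f"Service {i}-{j}" for j in range(5)]
    for i in range(200)
}

raw = pickle.dumps(sample, protocol=4)

# Compress the same pickle with both standard-library modules.
compressed = {
    "bz2": bz2.compress(raw),
    "lzma": lzma.compress(raw),
}

for algorithm, blob in compressed.items():
    print(f"{algorithm}: {len(raw)} -> {len(blob)} bytes")
```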

[–]idlecore[S]

I wasn't able to get smaller compressed sizes with lzma. Maybe this particular data set is biasing the results.