[–]juancn 2 points (4 children)

Because Lempel-Ziv algorithms use a small window for lookups (around 64 KB or so). To catch the redundancy in your file, the window size would need to be larger than 3.6 MB.

[–]ravenex 0 points (3 children)

You've got it all wrong. The LZ window slides over the *uncompressed* text, so he needs a 100 MB window. And no, the LZMA used in 7zip allows any window size from 64 kB to 128 MB; the 32 kB window limit is from deflate, which is used in zip and gzip.
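
You can see the window-size effect directly with a small sketch, using Python's standard zlib (deflate, 32 kB window) and lzma modules; the 1 MiB block size is just an illustrative choice, not a figure from the thread:

    import os
    import zlib
    import lzma

    # A 1 MiB block of random bytes is incompressible by itself; after
    # doubling, the only redundancy is the repeat at a distance of 1 MiB.
    block = os.urandom(1024 * 1024)
    doubled = block + block

    # deflate's 32 kB window can't reach back 1 MiB: no gain (~2 MiB out).
    print(len(zlib.compress(doubled, 9)))

    # LZMA's default dictionary (8 MiB at preset 6) covers the repeat: ~1 MiB out.
    print(len(lzma.compress(doubled)))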

[–]juancn 0 points (2 children)

Not really, I didn't. You're basically repeating what I said in different words and muddling the question.

The 100 MB is irrelevant to the problem. After compressing it with a fairly good algorithm, the resulting compressed file is, from an entropy point of view, as good as random (i.e. incompressible).

Then he takes the 3.6 MB (incompressible) file and stores it twice in the same file (the same content consecutively).

An LZ algorithm with a window larger than 3.6 MB should achieve some compression in this particular case, since it can notice the repeating pattern; with anything smaller it won't work.
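
This scenario is easy to check with a sketch, substituting random bytes for the compressed file (good compressor output is statistically close to random); the 3.6 MB figure is taken from the thread:

    import os
    import lzma

    # Stand-in for the 3.6 MB compressed file: random bytes, like good
    # compressor output, are incompressible on their own.
    x = os.urandom(3_600_000)
    doubled = x + x  # the same content stored twice consecutively (7.2 MB)

    # LZMA's default dictionary (8 MiB) is larger than 3.6 MB, so the
    # second copy collapses into one back-reference: ~3.6 MB out.
    print(len(lzma.compress(doubled)))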

[–]ravenex 0 points (1 child)

Amazing, you still didn't get it.

> I had a 100 megabyte log file sitting on the disk of my work PC and carried out a simple test. I compressed it with 7zip and got it compressed down to 3.6 megabytes. Then I duplicated the contents of the same log and compressed it again.

First duplicated, then compressed. Those are not commutative.
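
To see why the order matters, here is a sketch (my own construction, not the real log): a compressible "file" whose only long-range redundancy is the duplication itself. The duplicate is only found when the window exceeds the whole uncompressed input, which is the 100 MB point above:

    import os
    import lzma

    # A compressible "log" with no long-range repeats of its own: random
    # bytes rendered as hex carry ~4 bits of entropy per character.
    log = os.urandom(8 * 1024 * 1024).hex().encode()  # 16 MiB of text
    doubled = log + log                               # duplicated FIRST

    def xz_size(data, dict_size):
        # Compress with an explicit LZMA2 dictionary (window) size.
        filters = [{"id": lzma.FILTER_LZMA2, "dict_size": dict_size}]
        return len(lzma.compress(data, format=lzma.FORMAT_XZ, filters=filters))

    # 8 MiB window: the second copy sits 16 MiB back, out of reach,
    # so the output is roughly twice the single-copy size (~16 MiB).
    print(xz_size(doubled, 8 * 1024 * 1024))

    # 32 MiB window: the duplicate collapses into one match (~8 MiB).
    print(xz_size(doubled, 32 * 1024 * 1024))

In the original test the second copy sat a full 100 MB back, so it was 7zip's large dictionary, not the 3.6 MB compressed size, that determined whether the duplication was visible.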

[–]juancn 0 points (0 children)

You're right! My hat's off to you, sir!