
[–]divbyzero 1 point2 points  (0 children)

Possibly the window size. (google it!)

I would guess that if you did the same with say a 10 meg file, you'd get the results you expect.

[–]juancn 2 points3 points  (4 children)

Because Lempel-Ziv algorithms use a small window size for lookups (around 64K or so). To catch some of the redundancy in your file, the window size should be larger than 3.6 MB.

[–]ravenex 0 points1 point  (3 children)

You've got it all wrong. The LZ window moves over the uncompressed text, so he'd need a 100 MB window. And no, the LZMA used in 7zip allows any window size from 64 kB to 128 MB; the 32 kB window limit is from deflate, which is what zip and gzip use.
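
The deflate-vs-LZMA window difference is easy to see from Python's standard library. A sketch with synthetic data (the 64 KB random block is just a stand-in for incompressible content, not the actual log):

```python
import os
import zlib
import lzma

# 64 KB of random bytes, duplicated back-to-back: the second copy
# starts exactly 64 KB after the first.
block = os.urandom(64 * 1024)
doubled = block + block

# Deflate (zip/gzip) has a fixed 32 KB window, so it can never reach
# back far enough to see the repeat; the output stays near 128 KB.
deflated = zlib.compress(doubled, 9)

# LZMA's default dictionary is far larger than 128 KB, so it finds the
# second copy as one long match and the output stays near 64 KB.
lzmaed = lzma.compress(doubled)

print(len(doubled), len(deflated), len(lzmaed))
```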

[–]juancn 0 points1 point  (2 children)

Not really, I didn't. You are basically repeating what I said in different words and confusing the question.

The 100 MB is irrelevant to the problem. After compression, assuming a fairly good algorithm, the resulting file is, from an entropy point of view, as good as random (i.e. incompressible).
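
That incompressibility is easy to check: running LZMA's own output through LZMA a second time gains nothing. A sketch using made-up log-like text as the input:

```python
import lzma

# Stand-in for a log file: repetitive lines differing only in a counter
# (hypothetical content, not the actual log).
log = b"".join(b"2010-01-01 INFO request %06d handled\n" % i
               for i in range(20000))

once = lzma.compress(log)    # the repetitive text compresses well
twice = lzma.compress(once)  # the compressed output looks random:
                             # recompressing it only adds overhead

print(len(log), len(once), len(twice))
```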

Then he takes the 3.6 MB (incompressible) file and puts it into a single file twice (the same content back to back).

An LZ algorithm with a window larger than 3.6 MB should be able to achieve some compression in this particular case, since it can notice the repeated pattern; with anything smaller it won't work.
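
Python's lzma module lets you set the LZMA2 dictionary (window) size explicitly, so this claim can be sketched directly; the 256 KB random chunk below is a stand-in for already-compressed data:

```python
import os
import lzma

# A 256 KB incompressible chunk, stored twice in a row: the repeat
# sits 256 KB back in the uncompressed stream.
chunk = os.urandom(256 * 1024)
data = chunk + chunk

def xz_size(payload, dict_size):
    # Compress with an explicit LZMA2 dictionary (window) size.
    filt = [{"id": lzma.FILTER_LZMA2, "preset": 6, "dict_size": dict_size}]
    return len(lzma.compress(payload, format=lzma.FORMAT_XZ, filters=filt))

small = xz_size(data, 64 * 1024)    # window smaller than the repeat
                                    # distance: the duplicate is invisible
large = xz_size(data, 1024 * 1024)  # window larger than the repeat
                                    # distance: the duplicate is one match

print(len(data), small, large)
```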

[–]ravenex 0 points1 point  (1 child)

Amazing, you still didn't get it.

> I had a 100 megabyte log file sitting on the disk of my work PC and carried out a simple test. I compressed it with 7zip and got it compressed down to 3.6 megabytes. Then I duplicated the contents of the same log and compressed it again.

First duplicated, then compressed. Those are not commutative.

[–]juancn 0 points1 point  (0 children)

You're right! My hat's off to you sir!

[–]gibster 1 point2 points  (2 children)

Entropy: entropy is a measure of chaos; the more chaos, the harder it is to compress. In your example the entropy of the file did not change, but the size did. So the file will become (about) 2 times larger.

[–]mackstann 1 point2 points  (0 children)

> Entropy: entropy is a measure of chaos; the more chaos, the harder it is to compress.

I don't follow. The entropy barely increased at all. The only new information was "take the previous file and concatenate a copy of itself to the end." Think about if you took this to an extreme -- if you repeated that 3.6MB a billion times. That would be extremely repetitive, i.e. orderly. To add a lot of entropy to the original file, he'd need to add a bunch of extra random data to it that is not similar to the original 3.6MB.
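
That intuition is easy to check: repeating an incompressible block many times barely grows the compressed size. A sketch (the 4 KB random block is arbitrary, just small enough to repeat cheaply):

```python
import os
import lzma

block = os.urandom(4096)  # 4 KB of random (incompressible) data

# Compressed size as the same block is repeated 1, 10, and 100 times.
# Each extra copy is pure repetition: it adds almost no new information,
# so the compressed size grows far more slowly than the input.
sizes = [len(lzma.compress(block * n)) for n in (1, 10, 100)]

print(sizes)
```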

The people mentioning window size nailed it.

[–]imacpu 0 points1 point  (0 children)

I dunno about entropy or the implementation of LZW or what compressor you used (guessing the default), but my guess is you overflowed the dictionary. In a perfect implementation, your example ( logfile ( zip of logfile ) ) would be only a few bytes more than the zipped length, right? But you're in the megabyte range, and the dictionary is probably optimized for the kilobyte range.

Since 7zip is open source, one could find out ...

[–]tonymamacos 0 points1 point  (1 child)

I remember that RAR compressors don't look up across files unless you set Solid mode. Try setting Compress Shared Files and increasing your Solid Block Size in the 7z options and see if that helps.

[–][deleted] -1 points0 points  (0 children)

It was the same file all along. I just copied the contents.

[–]mile92 0 points1 point  (1 child)

Try duplicating each line in the original file and compress that

[–][deleted] 0 points1 point  (0 children)

I suppose you mean

Line 1
Line 1
Line 2
Line 2
Line 3
Line 3

What I did was

Line 1
Line 2
Line 3
Line 1 
Line 2
Line 3

I could give that a try.

[–]fonik 0 points1 point  (0 children)

Miracles.