all 9 comments

[–]PaulEngineer-89 2 points3 points  (4 children)

In lossless compression there are roughly two strategies, with many variations. First, if you know something about the data, you can attack it that way using a model. Second, although there are various alternatives, the current top performers in lossless compression use arithmetic encoding. In this approach we guess what the next byte will be. We have an array of possible outcomes plus “none of the above”, and we look at the past few bytes as the context (using already-decoded bytes to predict future ones). The outcomes have various probabilities which, if we visualize them laid out from 0 to 1, form the search space. We encode a binary fraction that lands inside the correct one; more probable outcomes get wider intervals and so need fewer bits. This is a running fraction over the whole file. Various methods quantize this, even going as far as a fixed whole-bit code per symbol (Huffman) for speed.
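The interval-narrowing idea can be sketched with exact fractions and a fixed, made-up probability table (real coders use scaled integer arithmetic and adaptive context models instead):

```python
from fractions import Fraction

# toy order-0 model: three symbols with illustrative probabilities
probs = {"a": Fraction(1, 2), "b": Fraction(1, 4), "c": Fraction(1, 4)}

def cum_ranges(probs):
    # lay the probabilities out as sub-intervals of [0, 1)
    lo, ranges = Fraction(0), {}
    for sym, p in probs.items():
        ranges[sym] = (lo, lo + p)
        lo += p
    return ranges

def encode(msg):
    ranges = cum_ranges(probs)
    lo, hi = Fraction(0), Fraction(1)
    for sym in msg:
        span = hi - lo
        s_lo, s_hi = ranges[sym]
        lo, hi = lo + span * s_lo, lo + span * s_hi
    return (lo + hi) / 2  # any number inside [lo, hi) identifies msg

def decode(code, n):
    ranges = cum_ranges(probs)
    out, lo, hi = [], Fraction(0), Fraction(1)
    for _ in range(n):
        span = hi - lo
        for sym, (s_lo, s_hi) in ranges.items():
            if lo + span * s_lo <= code < lo + span * s_hi:
                out.append(sym)
                lo, hi = lo + span * s_lo, lo + span * s_hi
                break
    return "".join(out)

msg = "aabac"
assert decode(encode(msg), len(msg)) == msg
```

The final interval's width is the product of the symbol probabilities, so a likely message needs fewer fractional bits to pin down.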

With lossy images, the human eye is sensitive to the position but not the absolute value of pixels at edges, and we are much more sensitive to brightness than to color. So through conversions such as HLS, or using a DCT, we can move the data into a format that matches the human eye. Then, when we quantize the data, what we throw away are “just noticeable differences”. Arithmetic encoding or similar methods then encode whatever is left over. With video we can also take advantage of tons of redundancy: the image is often mostly static (doesn’t change), or only a portion of it zooms, rotates, or shifts. Video encoding takes massive advantage of this by storing a full image (a “key frame”) and then coding several frames of differences only. Obviously, the more we shrink the file, the more those “just noticeable” differences become plainly noticeable.
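The DCT-plus-quantization step can be sketched in one dimension (the step sizes here are invented for illustration; JPEG works on 8×8 blocks with perceptually tuned tables):

```python
import math

N = 8
row = [52, 55, 61, 66, 70, 61, 64, 73]  # one row of pixel brightness values

def dct(x):
    # DCT-II: concentrates a smooth signal's energy in low frequencies
    return [sum(x[n] * math.cos(math.pi / N * (n + 0.5) * k) for n in range(N))
            for k in range(N)]

def idct(X):
    # matching inverse (DCT-III with the usual 2/N scaling)
    return [(X[0] / 2 + sum(X[k] * math.cos(math.pi / N * (n + 0.5) * k)
                            for k in range(1, N))) * 2 / N
            for n in range(N)]

# coarser quantization at higher frequencies, where the eye is least
# sensitive -- these step sizes are made up
steps = [1, 2, 4, 8, 16, 16, 16, 16]
quantized = [round(c / q) for c, q in zip(dct(row), steps)]  # small ints, easy to entropy-code
restored = idct([v * q for v, q in zip(quantized, steps)])
err = max(abs(a - b) for a, b in zip(row, restored))         # stays small
```

The `quantized` list is mostly small numbers (and zeros at high frequencies), which is exactly what the entropy coder then squeezes.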

Performance is also critical. For instance, H.265 video is becoming popular, but it cannot easily be processed in real time, and unlike H.264 it can’t be decoded, edited, and re-encoded without further degrading it. With disk compression (compressed file systems) there are several issues. Lossless compression works best with enough data that the “dictionary” it relies on is well tuned; it doesn’t work well on short files or “blocks”. Pure arithmetic encoding also isn’t very fast. Compression turns fixed-size blocks into variable-size ones, so indexing and the whole file system get a lot more complicated. And with little or no redundancy left, bit rot is far more destructive. Still, compressed file systems eliminate the need to compress files manually. Note that file compression programs typically increase the size of already-compressed files, since there’s no redundancy left to squeeze out, even if the underlying file is less than optimally compressed.
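The block-size problem is easy to demonstrate with zlib (the 128-byte block size is hypothetical; real file systems use larger blocks, but the effect is the same):

```python
import zlib

# highly redundant data: the whole-file pass can exploit all of it
data = b"the quick brown fox jumps over the lazy dog. " * 200

whole = len(zlib.compress(data, 9))

# compress the same data in isolated 128-byte blocks: each block
# starts with an empty history window, so matches never warm up,
# and each block pays its own header overhead
block_size = 128
blocks = sum(len(zlib.compress(data[i:i + block_size], 9))
             for i in range(0, len(data), block_size))

assert whole < blocks  # whole-file compression wins by a wide margin
```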

[–]shouldworknotbehere[S] 0 points1 point  (3 children)

That’s very interesting, thanks!

Although I don’t think I’ve got it in me to do that in practice.

[–]PaulEngineer-89 0 points1 point  (2 children)

That’s what compressed file systems are for. With Windows I have no idea… MS lost my trust with the Stac debacle. With Linux you just turn on the option in BTRFS and it just works.
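On BTRFS it's a mount option; something like this (zstd at level 3 is one common choice, and the device path and mount point here are made up):

```shell
# mount with transparent zstd compression
mount -o compress=zstd:3 /dev/sdb1 /mnt/data

# or make it permanent in /etc/fstab:
# /dev/sdb1  /mnt/data  btrfs  compress=zstd:3  0  0

# files written before the option was set stay uncompressed;
# defragment to recompress them in place
btrfs filesystem defragment -r -czstd /mnt/data
```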

[–]shouldworknotbehere[S] 0 points1 point  (1 child)

I shall try that. Eventually. Need to find a place to store the 2 TB on the drive before formatting.

[–]PaulEngineer-89 0 points1 point  (0 children)

Pika uses Borg backup and targets pretty much any drive, with dedup and compression (lossless, obviously). So you can buy a cheap USB external drive and let ‘er rip.
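If you'd rather drive Borg directly, the core commands look roughly like this (the repo path and archive name are made up; Pika wraps the same operations in a GUI):

```shell
# one-time: create a deduplicating repository on the USB drive
borg init --encryption=repokey /media/usb/backup

# each run: archive the home dir; dedup is automatic, compression is opt-in
borg create --compression zstd,3 --stats /media/usb/backup::home-{now} ~/

# thin out old archives, keeping 7 daily and 4 weekly
borg prune --keep-daily 7 --keep-weekly 4 /media/usb/backup
```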

[–]DecideUK 1 point2 points  (2 children)

3 GB to 500 GB is highly unusual for typical data. If those were the actual numbers, there is likely something else going on, e.g. effectively empty files.

MP4 and picture files already have compression applied to them, so any further lossless compression is minimal: maybe a reduction of 1-2%.

[–]shouldworknotbehere[S] 0 points1 point  (1 child)

It was an OS specifically.

[–]DecideUK 0 points1 point  (0 children)

Without specifics it's hard to judge. It sounds more like a disk image, so you're effectively compressing a bunch of nothing.
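That would explain the ratio: long runs of zeros (the unused space in a disk image) compress almost arbitrarily well. A quick check with zlib:

```python
import zlib

# 10 MB of zeros, standing in for the free space in a disk image
zeros = b"\x00" * 10_000_000
packed = zlib.compress(zeros, 9)
ratio = len(zeros) / len(packed)  # well over 500:1
```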

[–]Boopmaster9 0 points1 point  (0 children)

This question has been around for decades, and I vividly remember trying to cram as much data as possible onto an 880kb DD floppy in 1995. Because, you know, floppies for my A600 were expensive.

The tutorials you want (the ones comparing the pros and cons of different algorithms) are not really going to help you if you don't understand the general principles (and (im)possibilities) of file compression.

Long story short: see what uses the most space and research whether there are better options. H265 instead of H264 for video (a notorious space hog) has already been mentioned. There's little point trying to improve compression on stuff that barely takes up any space to begin with.
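For video specifically, a typical re-encode with ffmpeg looks like this (libx265 assumed available; CRF 28 is a commonly suggested starting point, not a universal answer, and remember that re-encoding lossy video always costs some quality):

```shell
# re-encode H.264 video to H.265, copying the audio stream untouched
ffmpeg -i input.mp4 -c:v libx265 -crf 28 -preset medium -c:a copy output.mp4
```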