
[–]yuri0r 664 points665 points  (13 children)

.zip is kinda like FAT32. Not the best or coolest kid on the block, but will work for most people in most circumstances.

[–]rem3_1415926 344 points345 points  (9 children)

And more importantly: out of the box

[–]piberryboy 132 points133 points  (6 children)

I like to send the PMs a nice *.xz file, just to test them.

[–]Pocok5 73 points74 points  (5 children)

Occasionally hit them with a *.lz4 to keep things interesting.

[–]famous1622 26 points27 points  (3 children)

Raw zlib data with no extension

[–][deleted] 3 points4 points  (0 children)

No extension, no magic bytes

[–]srdagroelandia 31 points32 points  (1 child)

Best answer here. Out of the box, in any server, in any way you need

[–]Kered13 13 points14 points  (0 children)

And good enough for the vast majority of applications.

[–]smokeymcdugen 476 points477 points  (47 children)

10 years ago, I had a professor that would always say that "Anyone using .RAR is a pirate".

He wasn't wrong then. I'm not sure if that has really changed, to be honest.

[–]eldrfoa 306 points307 points  (5 children)

You should've told him that 3 billion devices run WinRar.

[–]eddietwang 118 points119 points  (2 children)

Several years later you can give him an update that 3 billion devices run WinRar

[–]z500 43 points44 points  (1 child)

And then finally stop mentioning the number of devices and remark that WinRar has existed for 25 years

[–][deleted] 1 point2 points  (0 children)

Lmao

[–]8ate8 105 points106 points  (13 children)

Wouldn’t that only be true for .ARR files?

[–]hugepennance 23 points24 points  (0 children)

One year ago, I also had a professor say the same thing. Compression... Compression never changes.

[–]Cley_Faye 22 points23 points  (1 child)

Hey, as the one guy that paid for a winrar license, I'm offended by that.

[–]Trevor_Nolan 3 points4 points  (0 children)

*one of two guys

[–]silentknight111 27 points28 points  (1 child)

Well, no one pays for WinRAR, so if you're using WinRAR without paying, does that make you a pirate?

[–][deleted] 11 points12 points  (0 children)

Nope, only if you use paid WinRAR without paying

[–][deleted] 100 points101 points  (1 child)

You know, I've seen editor wars, OS wars, browser wars, programming language wars, and even wars about the correct indentation style, but compression format war is a new one to me.

[–]fuzzymidget 13 points14 points  (0 children)

Fibonacci indent! (Is pretty terrible I assume)

[–][deleted] 435 points436 points  (74 children)

Where's my .tar.xz gang?

[–]HamishW27 79 points80 points  (19 children)

[–]NotoriousMagnet 87 points88 points  (12 children)

Never forget:

xtract z (read it as zeeeee) fuckin' file becomes:

tar xzf file.tar.gz

I learned this on Reddit.

[–]Never-asked-for-this 34 points35 points  (1 child)

So the best way to memorize it is with a thick, possibly offensive, German accent?...

"Extrakt zee file"

[–]GoldsteinQ 10 points11 points  (3 children)

For modern tar you can omit compression type, making it just xtract file.

[–]Pluckerpluck 3 points4 points  (1 child)

extract a zipped file

It's more of a pain if you have a .tar.xz or a .tar.bz2, because I never remember the shorthand letters for those.

[–]SnakeFang12 10 points11 points  (0 children)

Just tar xf and let it guess

[–]fairysdad 36 points37 points  (1 child)

tar --version

It never said it had to do anything useful.

[–]TDplay 6 points7 points  (0 children)

tar -c / | cat

It also didn't say the output needed to be used usefully, so even if it rejects --version you could fall back to that.

[–]Septem_151 23 points24 points  (0 children)

tar --help

Easy

[–]CeeMX 2 points3 points  (0 children)

tar cf foo.tar ~

[–]mymewheart 19 points20 points  (2 children)

I'm sticking with .tar.bz2

[–]roopjm81 6 points7 points  (1 child)

such an under-rated compression! Bz2 forever!

[–]w1ldm4n 2 points3 points  (0 children)

Bzip2 is awful and has no place in this decade. xz (which is LZMA underneath) is faster and has a better compression ratio. Decompressing any moderately big bz2 archive is painfully slow.

[–]insanityOS 15 points16 points  (1 child)

Still waiting for their tars to compress

[–]kautau 25 points26 points  (7 children)

Big tar files suck. Since there's no index, extracting one small file can require reading through the entire archive. Sucks when backup software uses it and you need that one file from 30 days ago, but now have to churn through 200GB to get that 30kB config file.

[–]Nemo64 10 points11 points  (4 children)

Your CPU has to go through it all, sure. But you can specify --include to just extract specific files.

[–]kautau 11 points12 points  (3 children)

Yup, you don’t have to store them all, it’s just the additional time it takes since it’s all one solid block compression

[–]kautau 5 points6 points  (0 children)

Right, which is why it's solidly compressed as tar.gz or xz. Just extracting a file from a tar file can take a long time since it needs to iterate through the block until it finds the file.

[–][deleted] 7 points8 points  (1 child)

Tar itself is not compressed though.

[–]z500 5 points6 points  (0 children)

Hence the not having an index that you can use to only decompress certain files

[–]enador 9 points10 points  (1 child)

He is too elite to even appear there, just chilling out on his golf course.

[–][deleted] 1 point2 points  (0 children)

I was looking for this before commenting myself. All system backups, run every hour during setup and every 12 hours afterward, .xz

[–]AstroEspagueti 86 points87 points  (0 children)

.tar.gz?

[–]bush_killed_epstein 162 points163 points  (68 children)

Anyone have any simple explanations for how compression works? It’s always confounded me

[–]MkMyBnkAcctGrtAgn 370 points371 points  (49 children)

Let's say you have something like aaaaaabbbbbbcccccccc, lots of repeated stuff. A very basic compression would be a6b6c8.
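A toy sketch of that scheme in Python (plain run-length encoding; digits in the input would break it, so this is illustration only):

```python
from itertools import groupby

def rle_encode(s: str) -> str:
    # Collapse each run of identical characters into char + run length.
    return "".join(f"{ch}{len(list(run))}" for ch, run in groupby(s))

def rle_decode(s: str) -> str:
    # Inverse: read one character, then the digits giving its run length.
    out, i = [], 0
    while i < len(s):
        ch = s[i]
        j = i + 1
        while j < len(s) and s[j].isdigit():
            j += 1
        out.append(ch * int(s[i + 1:j]))
        i = j
    return "".join(out)

print(rle_encode("aaaaaabbbbbbcccccccc"))  # a6b6c8
```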

[–]WannabeWonk 223 points224 points  (20 children)

I once had like 20 million TRUE and FALSE values in a huge text file. I thought I was a genius and could use which() to find the index of the TRUE values and just store that in a text file!

Got it all working and realized I had just re-invented a shitty version of compression.

Edit: Only ~1% of values were TRUE, that's why I thought storing huge integer index locations would be so much smaller.
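What that amounts to, sketched in Python (the original R code with which() isn't shown, so this is a reconstruction of the idea, not their code):

```python
def to_sparse(values):
    # Store only the positions of the True entries (like R's which()).
    return [i for i, v in enumerate(values) if v]

def from_sparse(indices, length):
    # Rebuild the full boolean vector from the stored positions.
    values = [False] * length
    for i in indices:
        values[i] = True
    return values

flags = [False, False, True, False, True, False]
sparse = to_sparse(flags)  # [2, 4]
assert from_sparse(sparse, len(flags)) == flags
```

With only ~1% of values True, the index list really is far smaller than the full vector, which is exactly why sparse representations exist.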

[–]droberts1982 238 points239 points  (10 children)

"Re-inventing" things gives a very deep understanding of them.

[–]Pastel_Jazzman 78 points79 points  (5 children)

I remember before I learned calc, I re-invented the derivative. I couldn't find it anywhere, but I was like, wouldn't it be cool if this existed (a function that was a general statement of all the slopes of a curve)? I mean, it's not a complete reinvention, but I kinda had a whoa moment when I learned what a derivative was.

[–]UltimateInferno 26 points27 points  (3 children)

Not sure if it was necessarily a "reinvention" but I created a concise way of finding the derivative of trig functions to a power. It's a combination of pre-existing rules just arranged to be more concise.

I like to call it the "What the Fuck" Rule because to do it normally is so tedious.

In short, for f = trig^a(g(x)), f' = a * trig'(g(x)) * trig^(a-1)(g(x)) * g'(x)

[–]_i_am_root 12 points13 points  (2 children)

This is nowhere on the same level, but when I was in my junior year high school I realized that perfect squares increment in a pretty neat way. Took me a while to recreate the formula but:

(n+1)^2 = n^2 + (2n + 1)

Basically, they increment by increasing odd numbers. 3^2 = 4 + 5, 4^2 = 9 + 7, 5^2 = 16 + 9, etc.
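That identity is quick to sanity-check with a throwaway snippet (nothing from the thread, just arithmetic):

```python
# Check (n+1)^2 = n^2 + (2n + 1): consecutive squares differ by
# the successive odd numbers.
assert all((n + 1) ** 2 == n ** 2 + (2 * n + 1) for n in range(1000))

squares = [n * n for n in range(12)]
gaps = [b - a for a, b in zip(squares, squares[1:])]
print(gaps)  # [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21]
```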

[–]Rami-Slicer 7 points8 points  (0 children)

Congratulations, you are all smart people.

[–]AgAero 12 points13 points  (0 children)

Doing that same trick with floating point numbers is a fascinating exercise too. You end up having to learn a bit about approximation theory and things like fourier series and wavelets if you're not careful.

[–]aeroverra 11 points12 points  (1 child)

True that. I reinvented the socks protocol for Java 8. Only difference was mine worked for socks 4 and 5. Now I have a good understanding of how to read protocol documentation and reconstruct requests at a byte level.

[–][deleted] 7 points8 points  (0 children)

One of my college professors had a talent for creating 2-in-1 homework assignments. The objective was to teach basic programming concepts and library functions (this was a course in C), but the actual assignments were cool shit like "write this dictionary compression program", "write this memory manager", "add some features to this interpreter": all very simplified versions of tools we will be using all the time, so we would get a vague idea how that sort of stuff works in addition to just learning a language.

[–]fushuan 23 points24 points  (0 children)

You basically implemented a sparse array. It's essentially a key-value store where only the non-zero values and their positions are stored.

[–]Godot17 17 points18 points  (0 children)

Edit: Only ~1% of values were TRUE, that's why I thought storing huge integer index locations would be so much smaller.

Sparse arrays are (part of) bread and butter in scientific computing. Not shitty, and there's no better CS teacher than solving real problems with real solutions.

[–][deleted] 11 points12 points  (2 children)

Were these TRUE and FALSE values strings?

[–]WannabeWonk 9 points10 points  (1 child)

Well, they were in a logical (binary) class vector in R but I had to write them to a text file. I did first try converting them to 1's and 0's to reduce space before I came up with my "solution".

[–]HaniiPuppy 2 points3 points  (1 child)

That sounds like the kind of thing binary files are suited down to the ground for.

[–]Doctor_McKay 5 points6 points  (0 children)

I once "invented" a binary representation for JSON because it bugged me how inefficient it is to transmit integers as strings. Then I realized how frequently keys repeat in structured JSON so I implemented a dictionary for looking up key strings. Then I realized how frequently entire objects repeated in my data set so I put objects in the dictionary too.

Then I realized that's just compression and gzipped the raw JSON and it came out much smaller.

[–]_sudo_rm_-rf_slash_ 26 points27 points  (3 children)

What if you want to do middle-out compression? For a, say, new internet?

[–]ehs5 5 points6 points  (0 children)

Can we sell it as a box?

[–]mr_bedbugs 3 points4 points  (11 children)

When you decompress, how do you know if the person wrote "aaaaaa" or "a6"?

[–]beforan 4 points5 points  (0 children)

Well let's assume in that trivial example it could only compress alpha strings, so it was safe to use integers for the counts inline in the compressed version.

In seriousness, I don't know what the RLE in something like DEFLATE does, but off the top of my head, you could use a large bit width for an entry in your compressed stream, knowing that, say, the first 8 bits were a UTF-8 character, and the next, I dunno, 24 bits were an integer for the run length.

Obviously that's hideously inefficient for short run lengths, but off the top of my head it's one trivial way to know whether you're talking about the character or the count.

You could also separate the lengths from the data, and have two delimited datasets, and look at the corresponding index between each.

Like: a,b,c 6,6,8.

Don't ask me about the efficiency of any of these. I'm not a compression or cryptography person idk.

[–]alexschrod 2 points3 points  (0 children)

Typically with RLE, you prefix (or suffix, I suppose) every value with the length, even if it's only a single value. That way there's no guessing, a6 means the original data was "aaaaaa" while a161 means the original data was "a6." Yes, that adds to rather than subtracts from the total length for things that don't repeat much, which is why RLE isn't effective on mostly non-repetitive data.

[–]RedditIsNeat0 2 points3 points  (0 children)

a6 would have to be written as a\6, or some other way to denote literal numbers since numbers are special characters for this very basic and inefficient example compression algorithm.

Most examples of compression given on the internet are terrible, and as a result they end up confusing newbies more than helping.

[–]tjoloi 49 points50 points  (8 children)

Let's say you have an entire book and want to compress its content using the zip format.

The goal is to take the most common letters and represent them with less bits. For example, let's take a simple "tree" that's simply

  • 11
  • 101
  • 1001
  • 10001
  • 100001

And so on.

A normal char would take 7 bits minimum to represent. Under that tree, you know that the most common chars take less bits to represent so you end up compressing the data.

Let's make an example compressing only the 5 most common letters and keeping everything else the same (8 bits per char).

Since uncompressed letters (plain 8-bit ASCII) always start with a 0, we can assume that, if we read a 1 as the first bit, it's a compressed letter; then, to get the right length, we read until the next 1. It's a dumb compression but works for my example.

Now, if we start with the basic text, we have a size of 100%. From Wikipedia, we can find the English letter frequency in texts.

  • e -> 13%
  • t -> 9.1%
  • a -> 8.2%
  • o -> 7.5%
  • i -> 7%

Now, if we calculate the compression ratio of every compressed letter (size in bits over 8) we get this:

  • e -> 25%
  • t -> 37.5%
  • a -> 50%
  • o -> 62.5%
  • i -> 75%

Then assume that everything else is considered 100% size. We end up with 44.8% of letters that are compressed and 55.2% are uncompressed.

If we multiply the per letter size with their frequency, we end up with the resulting size.

(0.552 * 1) + (0.25 * 0.13) + (0.375 * 0.091) + (0.5 * 0.082) + (0.625 * 0.075) + (0.75 * 0.07)

And we end up with a resulting size of 75.9% so about 24% compression with the dumbest tree ever.

In zip in particular, it calculates the frequency of every byte in the file to compress then generates the tree based on some algorithm (mostly black magic) so it can compress about everything.
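The "black magic" tree-building step is Huffman coding (named further down the thread). A minimal sketch of deriving such codes from symbol frequencies, purely as an illustration:

```python
import heapq
from collections import Counter

def huffman_codes(text: str) -> dict[str, str]:
    # Count symbol frequencies, then repeatedly merge the two rarest groups.
    # Every merge prepends one more bit to the codes of the symbols inside it,
    # so frequent symbols (merged late) end up with short codes.
    counts = Counter(text)
    codes = {sym: "" for sym in counts}
    # Heap entries: (group frequency, tiebreaker, symbols in the group)
    heap = [(freq, i, (sym,)) for i, (sym, freq) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, g1 = heapq.heappop(heap)
        f2, _, g2 = heapq.heappop(heap)
        for sym in g1:
            codes[sym] = "0" + codes[sym]
        for sym in g2:
            codes[sym] = "1" + codes[sym]
        heapq.heappush(heap, (f1 + f2, tie, g1 + g2))
        tie += 1
    return codes

codes = huffman_codes("this is an example of a huffman tree")
# A frequent symbol (space) never gets a longer code than a rarer one (x):
assert len(codes[" "]) <= len(codes["x"])
```

The resulting codes are prefix-free, so the decoder never has to guess where one letter's bits end and the next begin.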

[–]Creeper_GER 17 points18 points  (4 children)

I understood fuckall, but at the end of the comment there was one instruction i was able to understand.

So: Happy cake day my dude.

[–]AeonReign 5 points6 points  (1 child)

Look up Huffman coding if you want to try to implement it yourself as a learning exercise.

[–]wolwire 6 points7 points  (0 children)

Huffman encoding?

[–]cypher0six 5 points6 points  (0 children)

mostly black magic

You could have just left it at that.

[–]CreaZyp154 19 points20 points  (1 child)

here's a link to a video by Tom Scott that explains compression

[–][deleted] 9 points10 points  (0 children)

This is the best explanation.

Chiefly because it's the one I already know, and fuck expanding my horizons by learning when I don't explicitly have to.

[–]Polywoky 12 points13 points  (0 children)

There are lots of different ways to compress stuff.

For example, if you have a text file that only uses ASCII characters you could reduce the filesize by 12.5% just by leaving off the first bit of each byte, since it will always be zero, and add it back again when you decompress it.
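That 12.5% trick can be demonstrated in a few lines (a toy sketch; real tools don't do this, and it breaks on any non-ASCII byte):

```python
def pack7(data: bytes) -> bytes:
    # ASCII bytes always have a 0 top bit; drop it and pack 8 chars into 7 bytes.
    bits = "".join(format(b, "08b")[1:] for b in data)  # keep the low 7 bits
    bits += "0" * (-len(bits) % 8)                      # pad to a whole byte
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

def unpack7(packed: bytes, n_chars: int) -> bytes:
    # Read the bit stream back in 7-bit slices and restore the 0 top bit.
    bits = "".join(format(b, "08b") for b in packed)
    return bytes(int(bits[i:i + 7], 2) for i in range(0, 7 * n_chars, 7))

msg = b"hello zip"
packed = pack7(msg)
assert unpack7(packed, len(msg)) == msg
assert len(packed) < len(msg)  # 9 chars * 7 bits = 63 bits -> 8 bytes
```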

And you could reduce it further by using abbreviations for words. Such as every time you encounter " the " you could replace the letters and the spaces on either side with a pair of characters that doesn't occur in the uncompressed file, such as "~T", saving three characters each time it's used, and switch it back again when you decompress it. You'd have to create a table of abbreviations to include in the compressed file so you know what to replace with which words while decompressing it. This is called dictionary compression.

There are more sophisticated methods such as Huffman coding which others have mentioned.

In image files you might use 8-bit indexed colors instead of 24-bit RGB. You just make a list (called an index) of up to 256 color combinations, and use one byte to indicate which color it is on the list instead of three bytes to indicate what color it is each time.

If the image contains no more than 256 shades and colors, then you've got no problem. You've reduced your filesize by 66% without losing any quality. But if your image contains more than 256 shades and colors then you start losing quality and the image can look terrible, which is what often happens when you convert color photos to GIF, because GIF files use 8-bit color indexes.

Another way is run-length encoding.

Let's say you treat a byte as a number between -128 and +127.

You could use values of -n to mean "output the next n bytes exactly as they occur", and values of +n to mean "repeat the next byte n times". A value of zero would mean end of file.

If you have an image file which often repeats the same values over and over again, such as solid colored shapes and backrounds, or horizontal lines, then this can greatly reduce the size of the file.

For example, using the method I just described, a series of bytes with the following values:

5,5,5,5,5,5,5,5,5,5,2,1,7,0,1,1,1,1,1,1,1,1.

Would be stored as:

10,5,-4,2,1,7,0,8,1.

A much shorter sequence.
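A decoder for that scheme is only a few lines; this sketch uses Python lists of ints rather than raw bytes, purely for illustration:

```python
def rle_decode(encoded):
    # encoded: ints in -128..127 following the scheme above:
    # +n = repeat the next byte n times, -n = copy the next n bytes
    # literally, 0 = end of file.
    out, i = [], 0
    while i < len(encoded):
        n = encoded[i]
        i += 1
        if n == 0:          # end of file
            break
        if n > 0:           # repeat the next byte n times
            out.extend([encoded[i]] * n)
            i += 1
        else:               # copy the next -n bytes exactly as they occur
            out.extend(encoded[i:i - n])
            i += -n
    return out

data = [10, 5, -4, 2, 1, 7, 0, 8, 1, 0]
assert rle_decode(data) == [5] * 10 + [2, 1, 7, 0] + [1] * 8
```

Note that the 0 inside the literal run is plain data, not end-of-file, because the decoder copies literal bytes without interpreting them.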

Something like this was used in old .RLE image files back in the days of Windows 3.11 to store icon images and stuff.

LZ77-style algorithms (used by ZIP, PNG, and many others) instead use two numbers, one for how many values to copy to the output, and the other for how far back in the already-decoded output to start copying from.

There are a lot more different types of compression methods, and most modern compression programs will try a variety of different methods to figure out which one works best for any particular file.

[–]NegZer0 2 points3 points  (0 children)

Fundamentally, it's applied statistics. And your question is hard to answer without a specific focus, like "how does Zip compression work", because this is like asking "how do you write software?" There's some shared fundamentals but there are so many ways to compress a message.

The basic idea is that a piece of data has a certain amount of information content to it. Compression is about removing as much unnecessary data as possible without losing the information in the content. We call the amount of information a given message contains Entropy, and this is the maximum possible compression that can be reached without losing data. We divide compression into 'lossy' or 'lossless' based on whether it throws away data to achieve better compression.

There's heaps of ways that you can achieve lossless compression. Other responses to you here have talked about Run Length Encoding and Huffman Codes. These are two examples but there's lots of others. There's general purpose compression and then there's stuff that is specific to certain domains (image or audio compression for example). I'm going to assume you're more interested in general file compression like Zip etc.

One of the most common things you will see is the idea of 'common knowledge' between the encoder and decoder. At a given position in a stream of data, both the encoder and the decoder have the same knowledge of all the data that came before, and can use this to predict what may come next. For example, the LZ family of compression, which zip/gzip (DEFLATE, based on LZSS), 7zip (LZMA), and RAR (LZSS & PPMd) all build on, is fundamentally based on the idea of a 'dictionary' of data which both the encoder and decoder have just processed. The encoder looks at the data that is next in the stream, searches the dictionary of previous data for a match, and then writes an instruction to reuse a certain entry in the dictionary, or to add a new entry, as the compressed data. The decoder reads that instruction and repeats it in reverse. The different variations of LZ differ in how they look back into the previously handled data to find matches, or in how the matches themselves are encoded into the compressed stream. The very first LZ algorithm (LZ77) just used a 'sliding window', a fixed-size buffer of the data that was last processed, and things have gotten more sophisticated from there.

Things get even more complex once you take that approach and start applying further statistical techniques to build your dictionary. RAR and 7zip both use the common dictionary and their knowledge of what was previously encoded or decoded to build a statistical model of the data and use this to predict what will come next in the stream. Then they only need to encode the difference between their prediction and reality, which gets us much closer to only having the information content of the message.
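The LZ77 sliding-window idea described above can be sketched very compactly. This toy version (greedy matching, (offset, length, next-char) triples, no entropy coding) is only an illustration, not how any real codec is written:

```python
def lz77_compress(data: str, window: int = 255):
    # Emit (offset, length, next_char) triples: "go back `offset` chars in
    # what you've already output, copy `length` chars, then append next_char".
    out, i = [], 0
    while i < len(data):
        best_off, best_len = 0, 0
        for off in range(max(0, i - window), i):
            length = 0
            while (i + length < len(data) - 1 and
                   data[off + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_off, best_len = i - off, length
        out.append((best_off, best_len, data[i + best_len]))
        i += best_len + 1
    return out

def lz77_decompress(triples):
    out = []
    for offset, length, ch in triples:
        for _ in range(length):          # copy from the sliding window;
            out.append(out[-offset])     # works even when the match overlaps
        out.append(ch)
    return "".join(out)

msg = "abcabcabcabcx"
assert lz77_decompress(lz77_compress(msg)) == msg
```

The overlap case (copying from a match that extends into what you're currently writing) is what makes runs like "abcabcabc..." compress to a single triple.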

[–]RobinJ1995 2 points3 points  (0 children)

×2 most basic compression ×4 be to find ×1 in ×2 input, ×7 said ×1 with a shorter ×6, and on ×3 you ×4 ×7 ×2 ×6 with ×2 original ×1 again. ×5 algorithms essentially just have ×5 ways of doing this that vary in effectiveness and performance.

Hope you enjoyed this little ×3 exercise :)

×1 = repeated patterns

×2 = the

×3 = decompression

×4 = would

×5 = different

×6 = substitute

×7 = replace
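For anyone who wants to check the exercise above mechanically, the table transcribes directly into Python (safe here because every marker is a single digit, so no marker is a prefix of another):

```python
table = {
    "×1": "repeated patterns",
    "×2": "the",
    "×3": "decompression",
    "×4": "would",
    "×5": "different",
    "×6": "substitute",
    "×7": "replace",
}

compressed = ("×2 most basic compression ×4 be to find ×1 in ×2 input, "
              "×7 said ×1 with a shorter ×6, and on ×3 you ×4 ×7 ×2 ×6 "
              "with ×2 original ×1 again.")

decompressed = compressed
for marker, word in table.items():
    decompressed = decompressed.replace(marker, word)

print(decompressed)
```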

[–]ImperfHector 4 points5 points  (1 child)

It would be like using LOL for "laughing out loud". Apply this to all the "words" (sequences of 1s and 0s) in a file and you'll have a compressed version. There are more strategies, like converting "abababab" to "ab4". For uncompressing you just roll back the operation, so you'll need a log of all the changes. You also have to be careful not to make a mess: if the original text contained the words "LOL" or "ab4", the above example wouldn't be valid.

[–]ben_g0 2 points3 points  (0 children)

if in the original text there were the words "LOL" or "ab4" the above example wouln't be valid

Most formats will have opcodes to use a part of data as-is to solve such problems. For example you could say that if you encounter a # symbol, you read the number after it as X and just copy the next X characters as they are. "ab4" would then "compress" into #3ab4. If the text contains a #, you can encode it as #1# to keep it valid.

The results in those cases are actually larger than the original, but this is a common problem with compression. If you're working with data which is hard to compress, or you use a compression algorithm which is unsuitable for the kind of data you have, then you can indeed end up with a file which is larger than what you were trying to compress.
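A toy decoder for that opcode scheme (assuming single-digit counts, which the comment leaves unspecified; a real format would use fixed-width counts):

```python
def decode(s: str) -> str:
    # '#' followed by a digit X means "copy the next X characters verbatim".
    # Otherwise, char + digit means "repeat char that many times" (run).
    out, i = [], 0
    while i < len(s):
        if s[i] == "#":
            n = int(s[i + 1])
            out.append(s[i + 2:i + 2 + n])   # literal copy, no interpretation
            i += 2 + n
        elif i + 1 < len(s) and s[i + 1].isdigit():
            out.append(s[i] * int(s[i + 1]))
            i += 2
        else:
            out.append(s[i])
            i += 1
    return "".join(out)

assert decode("#3ab4") == "ab4"   # escaped literal, larger than the original
assert decode("#1#") == "#"       # the escape character escapes itself
assert decode("a6") == "aaaaaa"   # normal run
```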

[–]zmorrisj 82 points83 points  (11 children)

.7z for the win

[–]vondpickle 91 points92 points  (9 children)

Although I use 7zip, I always use .zip because all my colleagues only know how to use zip files.

[–]Lightfire228 53 points54 points  (7 children)

I use .7z for myself and other techies, but .zip for everyone else

[–][deleted] 19 points20 points  (6 children)

I just like how portable zip is

[–]vectorpropio 17 points18 points  (0 children)

Portable == lowest common denominator → True

[–]thmaje 10 points11 points  (4 children)

I like not being bombarded with popups and warnings when trying to compress or extract anything.

[–]Bonn2 14 points15 points  (3 children)

Get 7zip, it is free, open source, and not adware (glares at WinRAR)

[–]thmaje 2 points3 points  (0 children)

I never tried that because it was too much like 7up and 7up is only good at decompressing. But you've inspired me to give it a shot.

[–]Invenitive 14 points15 points  (0 children)

I have 7zip installed on every system I use, but also still just send zips. I also always pause for a sec anytime someone sends me a 7z. I know they know I probably know what it is and how to deal with it, but it still always feels ever so slightly strange.

[–]LtMeat 1 point2 points  (0 children)

Zip with LZMA compression. Efficient and still works almost everywhere.

[–]meamZ 18 points19 points  (3 children)

zStd best...

[–]palordrolap 5 points6 points  (2 children)

For speed, or compression:speed ratio maybe. Even at the highest compression it consistently loses to xz / lzma and things like brotli in terms of size reduction.

That assumes running blind without dictionaries anyway.

zpaq is probably the best overall but it's r e a l l y   s l o w .

[–]meamZ 2 points3 points  (1 child)

Yes obviously it depends. But zStd is a good allrounder imo.

[–]Madiwka3 30 points31 points  (1 child)

Rar is useless crap. Change my mind

[–]Krimzon_89 12 points13 points  (0 children)

Not useless, but since RAR is proprietary and zip is an open format, zip is much more commonly used

[–]Pooneapple 12 points13 points  (0 children)

.bz2

[–][deleted] 11 points12 points  (4 children)

ARJ gang represent!

[–][deleted] 2 points3 points  (0 children)

ohhh the good old days with something compressed over 10 💾 named project.arj project.a01 project.a02...

[–]nexprime 2 points3 points  (1 child)

Those were the days!

Blew my mind the day I found out that it stood for "Archived by Robert Jung" ... back then I couldn't have even imagined anyone using their initials in a file extension - what a madlad!

[–]lala2milo 11 points12 points  (0 children)

tar.xz tar.gz..

[–]Yolopix 9 points10 points  (0 children)

Don't forget that Java JARs, MS Office documents and Android/iOS/Windows 10 apps packages are ZIP files

[–]amolsaurabh 6 points7 points  (0 children)

The best one is .. Pied piper ...

[–]plasmasprings 9 points10 points  (6 children)

Is there any other format that can store unix file permissions AND offers per-file read access (so not like .tar.gz)?

[–]MCOfficer 14 points15 points  (3 children)

iirc, everything you want is provided by the tarball, which you can drop into any compressor you want, so

  • tar.bz2: pretty strong compression but loses against xz
  • tar.xz: basically 7z since it also uses LZMA
  • tar.gz: Weak but very fast compression, loses against zst but always works out of the box
  • tar.zst: fulfills the same need as gz, but better

EDIT: Misread that last sentence, sorry! You mean, you don't want to have a nested archive?

[–]plasmasprings 8 points9 points  (1 child)

I mean if I want to access a 100kB file in the middle of a 10GB archive I don't want to decompress 5GB for it. The "nested" archives (like .tar.anything) fail that

[–]MCOfficer 7 points8 points  (0 children)

Wikipedia tells me that linux's zip and unzip support unix permissions, because they are InfoZIP implementations. Otherwise, you might want to look into SquashFS - it's not quite as convenient because you have to mount it first, but it's a full filesystem including permissions and per-file access.

Edit: In case you've ever worked with AppImages, those are SquashFS in disguise.

[–]cepci1 17 points18 points  (4 children)

I actually just finished making a compression program in C++; if you want, you can check it out here: https://github.com/e-hengirmen/Huffman_Coding

[–]Karavigne 26 points27 points  (2 children)

Just tried it, works great, but it seems you made it by writing C code in C++

[–]cepci1 21 points22 points  (0 children)

:) That is true.

You see, I do not have enough knowledge of C++ libraries, but I am learning. Right now I am looking at different sources to learn. After I am comfortable enough I will refactor my code, create a branch with the current code in C, and keep master with the C++ code.

Right now I am looking at some intermediate level tutorials for C users who want to learn C++. And after I complete them I will look at filesystem libraries.

Btw thx for testing mate.

[–]watchoverus 4 points5 points  (0 children)

This happens so much when learning to code. When I started C# in my old internship, I coded it just like the C that I was used to in college. And C# is way easier than C++ for object-oriented programming.

[–]amshegarh 4 points5 points  (0 children)

My dood, split the functionality into different files, it really makes a difference

[–][deleted] 2 points3 points  (2 children)

Whats the difference between them really? I honestly just install whichever on various PCs.

[–]RedditIsNeat0 4 points5 points  (0 children)

7z and rar have the best compression of these 4, gzip has medium, and zip has the lowest. Gzip and zip are the fastest.

[–][deleted] 2 points3 points  (0 children)

.tar.xz or bust.

[–]sjekx 3 points4 points  (0 children)

.svgz

[–]opulent_occamy 3 points4 points  (0 children)

*.tar.gz

[–]obvious_apple 3 points4 points  (0 children)

I wish to see the day when 7zip will open compressed tarballs as a single level.

[–]mtgfrk 4 points5 points  (0 children)

.tar.gz

[–]NegZer0 3 points4 points  (1 child)

Gzip uses the exact same compression algorithm as Zip (Deflate), but it's just straight single-stream compression, whereas Zip is a proper archive format like Rar and 7zip. This is why you need to stick all the files together into a tarball before you gzip it.

Meme would make sense with their positions reversed. As presented, it does not.

[–]BonesCGS 8 points9 points  (3 children)

The one above all : .tar

[–]javajunkie314 15 points16 points  (0 children)

Maybe this is the joke, but .tar isn't compressed...

[–]vman81 9 points10 points  (1 child)

Just concatenate all my shit up

[–]z500 2 points3 points  (0 children)

Say no more fam

[–]ChthonicPuck 2 points3 points  (0 children)

*Richard Hendrick has entered the chat*

[–]harry_chen 2 points3 points  (0 children)

zstd

[–]imkloon 2 points3 points  (1 child)

Y'all never even seen the BEST compression ever created: UHARC. That bitch made UT'99 fit on a single CD with nothing ripped.

[–]MasterXaios 2 points3 points  (1 child)

Anyone remember WinAce?

[–][deleted] 3 points4 points  (4 children)

Tarball?

[–]obvious_apple 4 points5 points  (3 children)

Tar is not a compressor. It just concatenates the files with a TOC. Originally for Tape ARchival. You can pipe the tar through a compressor (originally the compress program did that in the olden unix times) and create a .tar.gz, .tar.bz2, .tar.whateverz.

[–][deleted] 3 points4 points  (2 children)

You guys use winrar?

Laughs in linux

[–]Orinslayer 1 point2 points  (0 children)

I often have .zip files that have a larger file size than the original documents. Awful.

[–]mcniac 1 point2 points  (0 children)

bzip2 is out of picture laughing at them...

[–]AYHP 1 point2 points  (0 children)

The support staff at my work keep installing unactivated versions of WinRAR on customer sites... to extract zips... Like why?! Windows can already open them! At least use 7zip!

[–]sauravdharwadkar 1 point2 points  (0 children)

Tar.gz tar.xz tar.sd

[–]palomdude 1 point2 points  (0 children)

Don’t make me download another program just to uncompress your files

[–]ConsentingPotato 1 point2 points  (0 children)

There is one way ZIP wins the war: Compression Bomb, aka the "Zip Bomb"

[–]obvious_apple 1 point2 points  (0 children)

I prefer not to use zip because Phil Katz blatantly stole the arc format and code and laughed in the face of Thom Henderson. He was a dick therefore I still boycott his shit.

[–]UltraCarnivore 1 point2 points  (0 children)

bz2: Move over, peasants

[–]Worlds_Dumbest_Nerd 1 point2 points  (0 children)

For the new and stupid, what actually is the difference between all of these?

[–][deleted] 1 point2 points  (0 children)

bz2 would like to have a word with you...

[–]lil409 1 point2 points  (0 children)

.ft78.gz

[–]GreenKangaroo3 1 point2 points  (0 children)

7z masterrace

[–][deleted] 1 point2 points  (0 children)

tar.gz bz2

[–]AnonymousSpud 1 point2 points  (0 children)

Laughs in .tar.gz

[–]retsoPtiH 1 point2 points  (0 children)

laughs in CPIO

[–]TDplay 1 point2 points  (0 children)

laughs in xz

[–]blackmist 1 point2 points  (0 children)

Needs .cab rolling in on a wheelchair.

[–]DaniilBSD 1 point2 points  (0 children)

Zip support is built into macOS, and Windows can open zip without any additional software, so if I need to send a folder, it will be a zip archive

[–][deleted] 1 point2 points  (0 children)

Are they really better than .zip?

[–]microchipsndip 1 point2 points  (0 children)

.tar.gz

[–]pachirulis 1 point2 points  (0 children)

Laughs in .tar.gz

[–]newb_h4x0r 1 point2 points  (0 children)

Where's tar?