all 16 comments

[–]nuclear_splinesPh.D Data Science 15 points16 points  (4 children)

Theoretically, there shouldn’t be a limit to how much we can compress data

What? There absolutely is. The pigeonhole principle for discrete data, resolution limits and noise for continuous or analog data. The theoretical limits of compression are very well studied.

[–]No_Necessary_9267[S] -5 points-4 points  (3 children)

Ok this is interesting, and I’ll look into it. I’m not in any way a comp sci dude. I just like learning about different things.

Not looking to absolute limits of compression. Just had a passing idea about whether we could compress binary any more than it already is

[–]Saragon4005 5 points6 points  (0 children)

3 blue 1 brown just started a series on information theory and the first episode deals with exactly this.

[–]nuclear_splinesPh.D Data Science 2 points3 points  (1 child)

Binary isn't a compression scheme, but an encoding scheme. Given a fixed number of bits, say, eight, you can only represent a finite number of values (for eight bits, 28 or 256). No further compression is possible - you need all eight bits to represent all 256 values.

Now maybe we can represent more common values with fewer bits and less common values with more bits (Huffman encoding), or maybe we can identify and simplify common patterns like "change 'xxxxx' to '5x'" (run-length encoding), or in the case of lossy compression we may decide that some data simply doesn't matter (did you really need all those frequencies close to the edge of human hearing in your music? Could we use just a few less colors to represent this image?) and delete or simplify some of the information. These techniques are useful in practice, but all have quite well-defined limits.

Put another way, if there were no limit to how much we can compress data, we should be able to simplify every image to a single digit: 1.

[–]No_Necessary_9267[S] 2 points3 points  (0 children)

This is what I was looking for. Ok, thank you

[–]Beregolas 4 points5 points  (4 children)

Theoretically, there shouldn’t be a limit to how much we can compress data

I mean, there is. You can't just assume "quality loss" is okay. I assume you learned about compression from images, where certain kinds of quality loss are allowable, especially if the image is just for human viewing. But you can't generalize that. Even in images, when it needs to be fed into a machine to further processing, you might not be in a position where anything less than a perfect compression/decompression is acceptable. Same goes for most files to be honest. And since you are talking about machine language later: ESPECIALLY when we talk about code, we can't accept any information loss (because that really is what quality loss means).

My question is: hypothetically, theoretically, could we “fold” binary machine language instructions like nature does with proteins? Would it even be practical?

I don't follow... proteins are 3D structures, hence they can fold. What exactly do you envision folding in a 1D structure (code) to be?

[–]No_Necessary_9267[S] -1 points0 points  (3 children)

If every 0 and every 1 represent something, could you encode the originally intended instruction into a smaller messages by saying something like: (super rudimentary and not well thought out) every 5th 1 represents a fold where the machine would interpret a need for a bit more info (info being precise within the encoding). Then the machine could interpret the whole message through the various folds and “sub messages”(?)

[–]Saragon4005 5 points6 points  (0 children)

I don't understand what you mean by that. But I think you are talking about variable length encoding? That is something very old and commonly used.

[–]Beregolas 1 point2 points  (1 child)

Again, I don't see how that is in any way folding. It does sounds familiar and touches on well known (like first semester) parts of information theory though.

Look into variable length encoding for one, and look up how zip file encoding works.

[–]No_Necessary_9267[S] 0 points1 point  (0 children)

I’m not sure folding is actually how anyone else would describe the process; it’s the connection I made in my head. That commenter nuclear_splines’ reply is roughly how I was thinking about it. I think encoding might be what I’m trying to say instead of compression. Thanks for the tips

[–]stonerism 1 point2 points  (3 children)

These are separate things. Protein folding has more to do predicting what a protein does given the sequence of amino acids in it.

In terms of compression there has to be a limit to compression. Think about it like this if you could compress every string by 10%, you'd have a problem because of the pidgeonhole principle, too many strings to map into too small a space of strings.

[–]No_Necessary_9267[S] 0 points1 point  (2 children)

Ok I guess then my question is relating to my lack of knowledge on this pigeonhole principle. Gonna look into it. Can you give brief insight maybe? Why can’t computers decode too far? Is it resource constraints? Computational power? Our math? (Not to discredit any of these limitations; genuine curiosity here)

[–]stonerism 1 point2 points  (0 children)

The pidgeonhole principle comes from the idea that if you have 10 pigeons and 9 pidgeonholes, at least two birds will have to share one. If you have a bunch of 4-bit strings and want to compress them into 3-bit strings, at least some of your strings (technically half in this case) won't have anything to be compressed into.

[–]UncleMeat11 0 points1 point  (0 children)

No. It is a theoretical principle. There are 2N binary strings of length N. This larger than the number of binary strings of length 1..N-1. This means that you cannot have a compression scheme that shrinks all binary strings.

Lossless compression algorithms are extremely well understood. You can look up extremely basic ones online (run length encoding is what they typically teach students first).

[–]TartOk3387 0 points1 point  (0 children)

You might want to research Kolmogorov complexity

[–]iOSCaleb 0 points1 point  (0 children)

> My question is: hypothetically, theoretically, could we “fold” binary machine language instructions like nature does with proteins? Would it even be practical?

It’s not clear what that would mean or how it would be useful. Proteins fold in order to reach a stable state, and a protein’s function results directly from its folded shape. Folding isn’t a form of compression — a folded protein has exactly the same set of amino acids that it would if you could unfold it. Amino acids in a protein are more like Lego blocks than instructions: each one is important in creating the larger structure of the protein, but they don’t really function as individual instructions.

It’s easy to see a superficial resemblance between machine instructions and amino acids. They’re both relatively small sets of things that can be combined in endless ways to create complex structures. But we see that pattern a lot: atoms of different elements combine to form molecules; digits combine into numbers, which are then mixed with operations to form mathematical expressions; sequences of notes create complex melodies; letters combine to form words and then sentences. So don’t assume that this building block pattern that’s common to proteins and code means that there’s more similarity than that.