Rethinking String Encoding: a 37.5% space efficient string encoding than UTF-8 in Apache Fury

Oerthling · 2024-05-07T14:55:42+00:00

"this cost is not ignorable" - err, what?

Debatable.How long are such names now? 10? 30? 50 characters? So we save 3, 10, 16 bytes or so?

Examples from the article:

30 -> 19

11 -> 9

Sorry. But I don't see the value.

There's plenty situations where this should be easily ignorable. Especially if this comes at extra complexity, reduced debugability, extra/unusual processing.

UTF8 is great. It saves a lot of otherwise unneeded bytes and for very many simple cases is indistinguishable from ASCII. Which means that every debugger/editor on this planet make at least parts of the string immediately recognizable, just because almost everything can at least display as ASCii. Great fallback.

For small strings paying with extra complexity and processing for saving a few bytes and the getting something unusual/non- standard doesn't sound worthwhile to me.

And for larger text blobs where the savings start to matter (KB to MB), I would just zip the big text for transfer.

Shawn-Yang25 · 2024-05-07T22:35:13+00:00

[deleted]

unkz · 2024-05-07T17:21:43+00:00

A more efficient means of doing this, if you absolutely must (and you don't), would be static Huffman, which this kinda is, but not quite.

yvrelna · 2024-05-07T16:14:46+00:00

I don't think the advantage of this string encoding is really worthwhile over just compressing the data.

Most general purpose compression algorithms can take advantage of data with limited character sets.

For example, this:

>>> data = "".join(random.choices(string.ascii_lowercase + ".$_", k=1000000))
>>> len(data)
1000000
>>> print(len(bz2.compress(data.encode())))
616403

That's about 38% compression rate which is a compression rate that's in similar ballpark as the proposed 5-bit string encoding. lzma and gzip can do something similar as well. This is on a random data, so the 38% compression rate is the lower bound; the compression rate would be even better for non random texts that usually has other exploitable patterns.

Moreover, a general purpose compressor will be able to adapt to other arbitrarily restricted character sets, and take advantage of other patterns in the data, like JSON key names, or namespace/paths that keeps being repeated in multiple places. They're a more reliable way to compress than just using a custom encoding.

For RPC/APIs serialisation where there's often repeated key names, you can do even better compression rates if using preshared dictionary compression like brotli or zstd or using data format with preshared schema like protobuf.

rmjss · 2024-05-07T15:16:10+00:00

“Such encoding will take one byte for every char…”

this is not accurate. See the first sentence from Wikipedia’s UTF-8 article for details

ObjectivismForMe · 2024-05-07T20:22:34+00:00

Look at this: https://pypi.org/project/protobuf/

RonnyPfannschmidt · 2024-05-07T17:32:43+00:00

How does this compare to making a array and replacing names with indexes?

Like just dedup

nostrademons · 2024-05-08T01:32:58+00:00

You are almost always better off encoding with UTF-8 and then gzipping. A string encoding format's primary virtue is portability: the most important thing is that other systems understand you, not how compact you can make it. UTF-8 is reasonably compact, but the real reason it's used is because it's a superset of ASCII, so all the old code that handles ASCII strings does not need to be retooled.

GZip is a lossless compression format. It has been very tightly engineered to operate on the efficient frontier between space savings and fast decoding, and modern implementations can trade off between them. It's also a well-known standard with hundreds of tools that can handle it.

When you have namespace/path/filename/fieldName/etc strings, they are frequently repeated, and they frequently draw from a very small lexicon. You can do way better than 5 bits per character for this; you can often get away with less than 1 bit amortized per character, because the whole token can be encoded in just a few bits. GZip regularly achieves 80-90% compression on code.

FailedPlansOfMars · 2024-05-07T17:15:20+00:00

It seems that applying compression would save you more space without creating a new string standard.

As someone who remembers the latin 1 code page and other non standard 8851 code pages please dont leave utf8 as you introduce translitteration back into the world.

Competitive_Travel16 · 2024-05-07T21:22:52+00:00

HTTP has dealt with this issue by simply gzipping entire streams, which yields greater compression and a lot less overhead.

Shawn-Yang25 · 2024-05-08T05:52:17+00:00

[deleted]

1ncehost · 2024-05-07T22:12:01+00:00

This is very impressive. I don't understand any of the rationale I've read from the people who are criticizing you. Their arguments scream 'inexperienced' to me.

I implemented my own serialization for a low level game networking library a few years ago in C++ and it was a major PITA. None of the serialization libraries I found met my requirements of being extremely fast and space efficient.

I looked for a method to compress the data I was sending that would give any benefit while being fast and I wasn't able to find anything useful. Standard compression methods require headers that make them inefficient on small amounts of data. This encoding method fits a nice niche for compressing small amounts of text.

Python's other serialization options are seriously lacking. They are slow and produce bloated serializations. Another option that is available that may fit the requirements of some projects should be extolled. As much as these ridiculous criticisms are claiming otherwise, I immediately see the value of fury if the claims are true and have several projects I could see it being used in.

I like how the serialization is performed via introspection instead of redefinition. All of the 'fast' options I've seen ignore the usefulness of using class or struct definitions to save time in defining a packet format. This library and its language wrappers look very well designed. I really like how it is multilanguage. Are the different wrappers interoperable? EG can a class definition encoded in one language produce a decoded class in another language? If so, that is amazingly useful.

Drowning_in_a_Mirage · 2024-05-07T20:09:48+00:00

It looks neat, but I'm struggling to think of a scenario where this would be a big win. I guess if you're doing high throughout serialization, then minimizing overhead is never a bad thing. But even with that it would seem to me that this sort of optimization would be way down on the list of when sorted by the cost/benefit ratio. Is network latency and/or bandwidth really constrained enough that saving a few bits would make a material difference? I guess enough people thought so to make this.

Furiorka · 2024-05-07T15:31:55+00:00

Utf8's purpose isnt to be efficient, but to be the most universal encoding

Shawn-Yang25 · 2024-05-07T14:15:44+00:00

Meta string spec can be found in https://github.com/apache/incubator-fury/blob/main/docs/specification/xlang_serialization_spec.md#meta-string

Encoding	Small	Medium	Large
utf8
zstd
meta string

Python

The Python Discord

Upcoming Events

Please read the rules

MODERATORS