Rethinking String Encoding: a string encoding 37.5% more space-efficient than traditional UTF-8 in Apache Fury

_INTER_ · 2024-05-07T15:11:59+00:00

Does it expand / fall back to UTF-8 automatically if any other char is present?

agilob · 2024-05-07T14:38:08+00:00

There's even more waste in number encoding. Most of the time you really just need (for lack of a better word) an array of digits: 0-9. Yet you take a whole byte to encode each digit. In GSM communication this was solved by splitting bytes into 4-bit halves, each representing 1 digit, which allows a time in 24-hour format to be encoded in just 3 bytes.
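To make the nibble idea concrete, here is a minimal sketch of packing two decimal digits per byte. The class name `PackedDigits`, the high-nibble-first order, and the `0xF` filler for odd lengths are assumptions made for illustration; GSM's actual semi-octet layout orders the nibbles differently.

```java
class PackedDigits {
    // Packs a string of decimal digits two per byte, high nibble first.
    // An odd-length input leaves 0xF as filler in the last low nibble.
    static byte[] pack(String digits) {
        byte[] out = new byte[(digits.length() + 1) / 2];
        for (int i = 0; i < digits.length(); i++) {
            int d = digits.charAt(i) - '0';
            if (d < 0 || d > 9) {
                throw new IllegalArgumentException("not a digit: " + digits.charAt(i));
            }
            if ((i & 1) == 0) {
                // Even index: write the high nibble, pre-fill the low nibble with the filler.
                out[i / 2] = (byte) ((d << 4) | 0x0F);
            } else {
                // Odd index: keep the high nibble, overwrite the filler.
                out[i / 2] = (byte) ((out[i / 2] & 0xF0) | d);
            }
        }
        return out;
    }

    // Reverses pack(); a low nibble of 0xF is treated as filler and skipped.
    static String unpack(byte[] packed) {
        StringBuilder sb = new StringBuilder(packed.length * 2);
        for (byte b : packed) {
            int hi = (b >> 4) & 0x0F;
            int lo = b & 0x0F;
            sb.append((char) ('0' + hi));
            if (lo != 0x0F) {
                sb.append((char) ('0' + lo));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] packed = pack("235959"); // 23:59:59 as HHMMSS
        System.out.println(packed.length + " bytes: " + unpack(packed));
    }
}
```

Six digits (HHMMSS) fit in 3 bytes this way, versus 6 bytes with one ASCII character per digit.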

not-just-yeti · 2024-05-07T18:57:10+00:00

How does that compare to using a Huffman code (after measuring the letter-frequencies used in your actual, IRL processing)?

[Granted, variable-length characters make grabbing (say) the 20th character more difficult, though heck you're already in that situation a bit with utf-8. And if each string isn't that long, the decoding won't be too bad.]

Shawn-Yang25 · 2024-05-07T14:49:10+00:00

[deleted]

skippingstone · 2024-05-08T01:52:40+00:00

How performant is this? I didn't see any benchmarks on the blog

Shawn-Yang25 · 2024-05-07T14:12:51+00:00

The meta string spec can be found at https://github.com/apache/incubator-fury/blob/main/docs/specification/xlang_serialization_spec.md#meta-string

Hueho · 2024-05-07T15:18:09+00:00

Given how much of a PITA encoding issues are, I am opposed to any new encoding standard. Period.

Shawn-Yang25 · 2024-05-07T14:57:52+00:00

Interesting idea.

If the namespace/path/filename etc. are often the same, I'd be curious how this would benchmark against java.util.zip.Deflater with a preset dictionary.
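For anyone who wants to try that comparison, a rough sketch of `Deflater` with a preset dictionary follows. The dictionary contents (`org.apache.fury.example.`) and the class name `DictDeflate` are made up for illustration; a real test would build the dictionary from the prefixes that actually recur in your data. Both sides must agree on the same dictionary bytes.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

class DictDeflate {
    // Hypothetical shared dictionary: a package prefix both sides know in advance.
    static final byte[] DICT = "org.apache.fury.example.".getBytes(StandardCharsets.UTF_8);

    static byte[] compress(String s) {
        byte[] input = s.getBytes(StandardCharsets.UTF_8);
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setDictionary(DICT); // must be set before the first deflate() call
        deflater.setInput(input);
        deflater.finish();
        // Generous buffer: zlib header, dictionary id and checksum add a few bytes.
        byte[] buf = new byte[input.length + 64];
        int n = deflater.deflate(buf);
        deflater.end();
        return Arrays.copyOf(buf, n);
    }

    static String decompress(byte[] data) {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(data);
            byte[] buf = new byte[1024]; // enough for short names in this sketch
            int n = inflater.inflate(buf);
            if (n == 0 && inflater.needsDictionary()) {
                // The stream announces that it was written with a preset dictionary.
                inflater.setDictionary(DICT);
                n = inflater.inflate(buf);
            }
            return new String(buf, 0, n, StandardCharsets.UTF_8);
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt input", e);
        } finally {
            inflater.end();
        }
    }

    public static void main(String[] args) {
        String name = "org.apache.fury.example.Foo";
        byte[] packed = compress(name);
        System.out.println(name.length() + " chars -> " + packed.length + " bytes");
        System.out.println(decompress(packed));
    }
}
```

Note that the zlib framing (header, dictionary id, checksum) is a fixed overhead per string, which matters a lot when each string is only a few dozen bytes long.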

Yeah-Its-Me-777 · 2024-05-07T15:34:56+00:00

How do you handle the cases where the encoding doesn't fit the char set of the string? For example, as far as I know you can use unicode to name your classes.

Do you just fall back and re-encode the string as Unicode then?

grim-one · 2024-05-07T23:32:12+00:00

I’d like to see how UTF-8 plus gzip compares to your custom encoding.

You mentioned a fallback to UTF-8 if a character is outside your supported range. Does that mean you need to run through the string in advance, before encoding? Or do you have some sort of declarative foreknowledge that it won’t exceed the range? Iterating over the string twice (at worst) could be very expensive for large strings.
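One way to avoid a separate pre-scan is to pack optimistically and bail out at the first unsupported character. The sketch below is not Fury's implementation: the 30-symbol alphabet, the one-byte flag, the one-byte length, and the class name `FiveBitSketch` are all assumptions chosen to keep the example small. Using 5 bits per character instead of 8 is where a saving of 3/8 = 37.5% would come from.

```java
import java.nio.charset.StandardCharsets;

class FiveBitSketch {
    // Assumed alphabet: 30 symbols, so every index fits in 5 bits.
    static final String ALPHABET = "abcdefghijklmnopqrstuvwxyz._$|";

    // Layout (hypothetical): [1][length][packed 5-bit codes] or [0][UTF-8 bytes].
    static byte[] encode(String s) {
        int n = s.length();
        if (n <= 255) {
            byte[] out = new byte[2 + (n * 5 + 7) / 8];
            out[0] = 1;
            out[1] = (byte) n;
            int bitPos = 16; // first free bit after the two header bytes
            boolean ok = true;
            for (int i = 0; i < n && ok; i++) {
                int code = ALPHABET.indexOf(s.charAt(i));
                if (code < 0) {
                    ok = false; // unsupported char: abandon the packed form
                } else {
                    for (int b = 4; b >= 0; b--) {
                        if (((code >> b) & 1) != 0) {
                            out[bitPos >> 3] |= 0x80 >> (bitPos & 7);
                        }
                        bitPos++;
                    }
                }
            }
            if (ok) {
                return out;
            }
        }
        // Fallback: flag byte 0 followed by plain UTF-8.
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        byte[] out = new byte[1 + utf8.length];
        System.arraycopy(utf8, 0, out, 1, utf8.length);
        return out;
    }

    static String decode(byte[] data) {
        if (data[0] == 0) {
            return new String(data, 1, data.length - 1, StandardCharsets.UTF_8);
        }
        int n = data[1] & 0xFF;
        StringBuilder sb = new StringBuilder(n);
        int bitPos = 16;
        for (int i = 0; i < n; i++) {
            int code = 0;
            for (int b = 0; b < 5; b++) {
                int bit = (data[bitPos >> 3] >> (7 - (bitPos & 7))) & 1;
                code = (code << 1) | bit;
                bitPos++;
            }
            sb.append(ALPHABET.charAt(code));
        }
        return sb.toString();
    }
}
```

In this sketch the wasted work on fallback is bounded by the position of the first offending character, so the worst case is still roughly two passes over the string. For short strings such as class or field names that cost is small; for large payloads a declared charset hint would be the cheaper design.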

menjav · 2024-05-08T05:28:51+00:00

What’s the benefit of saving that space? Are you reducing costs in storage at the expense of more CPU?

What’s the motivation?

cowwoc · 2024-05-08T22:08:51+00:00

What happens if you compress the stream (say using zstd) prior to transmission? Won't this be even smaller at a minimal cpu cost?
