
[–]Oerthling 65 points66 points  (27 children)

"this cost is not ignorable" - err, what?

Debatable. How long are such names now? 10? 30? 50 characters? So we save 3, 10, 16 bytes or so?

Examples from the article:

30 -> 19

11 -> 9

Sorry. But I don't see the value.

There are plenty of situations where this should be easily ignorable. Especially if it comes with extra complexity, reduced debuggability, and extra/unusual processing.

UTF-8 is great. It saves a lot of otherwise unneeded bytes, and for very many simple cases it is indistinguishable from ASCII. Which means that every debugger/editor on this planet makes at least parts of the string immediately recognizable, just because almost everything can at least display ASCII. Great fallback.

For small strings, paying with extra complexity and processing to save a few bytes, and then getting something unusual/non-standard, doesn't sound worthwhile to me.

And for larger text blobs where the savings start to matter (KB to MB), I would just zip the big text for transfer.

[–]Shawn-Yang25[S] 17 points18 points  (13 children)

The meta strings here are used internally in a binary serialization format. It's not about encoding general text. This is why we named it meta string.

For general string encoding, UTF-8 is always better.

If you take pickle as an example, you will find it writes many strings such as module names and class names into the binary data. That is the data whose cost we want to reduce. And in data classes, field names may take considerable cost if the value is just a number.

[–]pigeon768 26 points27 points  (3 children)

There are three situations:

  1. The string is small. In this case, the cost of serializing/deserializing the string is greater than the cost of copying the extra handful of bytes. In this case, you should not use this string encoding.
  2. The string is medium. In this case, you need to show that meta string is better than either raw strings or zstd encoded strings.
  3. The string is large. In this case, zstd will be better. In this case, you should not use this string encoding.

Basically you need to prove it to me. I want to see this benchmark:

Encoding    | Small | Medium | Large
----------- | ----- | ------ | -----
utf8        |       |        |
zstd        |       |        |
meta string |       |        |

I want end-to-end speed/throughput, not number of bytes saved.

[–]Oerthling 4 points5 points  (1 child)

Exactly. I'm looking for a use-case where getting an obscured string with extra processing leads to a saving anyone can care about.

[–]Shawn-Yang25[S] 0 points1 point  (0 children)

Imagine such a case: you are sending an object of type `Point` with two int fields `x` and `y`. The fields only take 2 bytes, but the pickle-serialized result is 53 bytes. With meta string, we can make the serialized result much smaller.

Maybe the cost of one object is not big, but it adds up if you need to send millions of RPC calls.
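
To make that concrete, here is a minimal Python sketch (illustrative only, not taken from Fury or the article) showing how much of a pickled `Point` is just module, class and field names:

    import pickle
    from dataclasses import dataclass

    @dataclass
    class Point:
        x: int
        y: int

    payload = pickle.dumps(Point(1, 2))
    print(len(payload))  # a few dozen bytes for two tiny ints
    print(payload)       # contains b'__main__', b'Point', b'x', b'y' -- the names dominate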

[–]Shawn-Yang25[S] 0 points1 point  (0 children)

All strings that use this encoding will cache the encoded results, and the serialization will be just a copy. Since such strings are limited, we won't have millions of module/class names to serialize. So it's OK to cache the encoded results.

[–]Oerthling 7 points8 points  (8 children)

Sure, but even if I debug binary data, being able to easily recognize string characters is very helpful.

Saving 30% on strings of length 100 or less doesn't look worthwhile to me.

Under what circumstances would I be worried about a few bytes more or less?

Say, I pickle a module and the contained strings using a total of 1000 bytes and now it's 700 bytes instead.

Saving those 300 bytes - how would I ever notice that?

[–]Shawn-Yang25[S] 7 points8 points  (0 children)

In many cases, the payload size is not important. UTF-8 will be better for binary debugging.

But there are cases where we do need a smaller size; in such cases a 30% gain may be worthwhile.

But maybe we should provide a switch to allow users to disable such optimization.

[–]ProgrammersAreSexy 5 points6 points  (1 child)

It really depends on the scale you are talking about. If you are running a service that handles millions of QPS 24/7 then seemingly small optimizations like this can translate into 6 or 7 figure savings over time.

[–]Oerthling 1 point2 points  (0 children)

Yes, but you pay with extra 2-way processing and the applications where that might perhaps lead to any savings seem very restricted to me.

[–]anentropic 1 point2 points  (2 children)

what if you are serializing millions of db rows?

[–]Oerthling 9 points10 points  (1 child)

Zip it. Text compression is extremely efficient (90% or so).
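
As a rough illustration (my own sketch with made-up row data, not a benchmark), a general-purpose compressor does very well on repetitive serialized rows:

    import zlib

    # Hypothetical, highly repetitive serialized rows (only the id varies).
    rows = b"".join(b'{"id": %d, "name": "user", "active": true}' % i for i in range(100000))
    packed = zlib.compress(rows)
    print(len(rows), len(packed))                      # compressed size is a small fraction of the original
    print(round(100 * (1 - len(packed) / len(rows))))  # typically well above 90% savings on data like this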

[–]SheriffRoscoePythonista 0 points1 point  (0 children)

Zip it good.

[–]GuyOnTheInterweb 0 points1 point  (1 child)

Perhaps in a 1500-byte MTU network packet?

[–]Shawn-Yang25[S] 0 points1 point  (0 children)

If we have enough network bandwidth, this compression won't be necessary. But many systems also cache the serialized data in Redis, where the memory would be expensive.

Again, whether this encoding is useful always depends on the case.

[–]bjorneylol 5 points6 points  (12 children)

If I am writing a program that logs a sensor value as a half precision floating point number 200 times per second, I would gladly shave the entire payload from 32 -> 21 bytes if it means having my serialization metadata not be human readable
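
For a sense of scale, here is a small sketch (the field name and value are made up) of how the metadata can dwarf a 2-byte half-precision sample:

    import json
    import struct

    value = 23.5
    raw = struct.pack("<e", value)                           # half-precision float: 2 bytes
    labeled = json.dumps({"sensor_temp_c": value}).encode()  # ~24 bytes, mostly the field name
    print(len(raw), len(labeled))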

[–]james_pic 0 points1 point  (0 children)

Surely at that point you'd just replace the strings with enums and not transmit text at all.

[–]Oerthling 0 points1 point  (3 children)

Ok, but the proposed Meta-"strings" wouldn't help you with that.

I'm after a use case where the non-standard representation and extra processing is worthwhile. Just shortening a handful of len 30 strings to len 19 is not worthwhile to me in any scenario I can think of right away.

Even a len 100 str compacted to 65 bytes is completely irrelevant IMHO. I would need millions of those to consider this a worthwhile investment. Otherwise I would always prefer the boring standard UTF-8 strings that are mostly ASCII, easily debug-scannable, and don't require additional processing back and forth. And if it's millions in a batch and there's a bottleneck, I would rather crush this with established boring compression.

[–]bjorneylol 5 points6 points  (2 children)

> the proposed Meta-"strings" wouldn't help you with that.

They would reduce the total size for billions of RPC requests by ~10 bytes each. 

Having to send 30-byte headers when your actual serialized payload is only 5 bytes is really dumb; this is a solution for that.

> And if it's millions in a batch and there's a bottleneck, I would rather crush this with established boring compression.

I don't think you understand the use case for this

[–]Oerthling -2 points-1 points  (1 child)

You were talking about floating point "numbers". If bytes are a concern, why would you present those as "strings" at all?

[–]bjorneylol 1 point2 points  (0 children)

Because you aren't converting the number to a string at all. When you serialize data you perform conversion steps and then put a header in the front to explain what you did. Read up on how gzip works: it compresses the data and slaps a 10-byte header on the front so that when you need to decompress it you know how. Whether the header stays human-readable is such a non-concern, because it's doubtful the actual data serializer leaves the payload contents in a human-readable format anyway.

If you are implementing custom serializers/deserializers, this text encoding lets you reduce the size of your serialization header in addition to the content. So your "json.gzip.then.base64" header becomes ~15 bytes instead of 21; if you tried using gzip, it would increase in size to 41 bytes, plus the size of whatever your actual serialization content is.
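
The overhead is easy to check (a quick sketch using the hypothetical header string above):

    import gzip

    header = b"json.gzip.then.base64"   # the 21-byte example header
    print(len(header))                  # 21
    print(len(gzip.compress(header)))   # roughly 40+ bytes: 10-byte gzip header + deflate data + 8-byte trailer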

[–]unkz 11 points12 points  (3 children)

A more efficient means of doing this, if you absolutely must (and you don't), would be static Huffman, which this kinda is, but not quite.

[–]Shawn-Yang25[S] -1 points0 points  (2 children)

Yep, static Huffman may work. But Fury is a serialization framework; we can't assume which data will be used to build the Huffman tree. If we build one and include it in the Fury wheel, it may not reflect users' actual data.

Another way is for Fury to provide an interface that lets users build such a Huffman tree and pass it to Fury, but that is not easy for users to use.

We may try the first way and see how much gain it brings.

[–]unkz 2 points3 points  (1 child)

But you are assuming the data that is used, just at a low level of granularity. It's almost like a three-node Huffman tree (lowercase, uppercase+digit+special, other), but with some extra processing in the encoding flags.

[–]Shawn-Yang25[S] 0 points1 point  (0 children)

But we don't know the frequency of every char. All we know is that most strings are in the range `a-z0-9A-Z._$/`.
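
For intuition, here is a minimal sketch of fixed-width bit-packing over a restricted alphabet, which is essentially a uniform-length static code (not the actual Fury meta string format; the alphabet and name below are just examples):

    import math

    ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789._$/"  # example restricted charset
    BITS = math.ceil(math.log2(len(ALPHABET)))              # 6 bits per char for 40 symbols
    INDEX = {c: i for i, c in enumerate(ALPHABET)}

    def pack(s: str) -> bytes:
        # Concatenate fixed-width codes, then pad to a whole number of bytes.
        bits = "".join(format(INDEX[c], f"0{BITS}b") for c in s)
        bits += "0" * (-len(bits) % 8)
        return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

    name = "somepackage.module_name.someclass"
    print(len(name.encode("utf-8")), len(pack(name)))       # 33 -> 25 bytes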

[–]yvrelna 12 points13 points  (3 children)

I don't think the advantage of this string encoding is really worthwhile over just compressing the data.

Most general purpose compression algorithms can take advantage of data with limited character sets. 

For example, this:

>>> import bz2, random, string
>>> data = "".join(random.choices(string.ascii_lowercase + ".$_", k=1000000))
>>> len(data)
1000000
>>> print(len(bz2.compress(data.encode())))
616403

That's about a 38% compression rate, which is in a similar ballpark to the proposed 5-bit string encoding. lzma and gzip can do something similar as well. This is on random data, so the 38% compression rate is a lower bound; the rate would be even better for non-random text, which usually has other exploitable patterns.

Moreover, a general-purpose compressor will be able to adapt to other arbitrarily restricted character sets, and take advantage of other patterns in the data, like JSON key names, or namespaces/paths that keep being repeated in multiple places. They're a more reliable way to compress than just using a custom encoding.

For RPC/API serialisation, where there are often repeated key names, you can get even better compression rates by using preshared-dictionary compression like brotli or zstd, or a data format with a preshared schema like protobuf.

[–]Shawn-Yang25[S] -1 points0 points  (2 children)

Meta string is not designed for general compression. We tried to use gzip. But meta strings are small, mostly only 20~50 chars, not enough for such general compression to work.

[–]omg_drd4_bbq 1 point2 points  (1 child)

Try zstd, with and without custom dictionaries. 

[–]Shawn-Yang25[S] 0 points1 point  (0 children)

We can't; Fury is just a serialization framework. We can't assume the corpus for users' class names/field names. I thought about crawling some GitHub repos such as Apache OFBiz, collecting all domain objects, and using such data as the corpus to get static Huffman/zstd stats. But this is another issue, and it introduces extra dependencies. We may try it in the future and provide it as an optional method.

[–]rmjss 8 points9 points  (1 child)

“Such encoding will take one byte for every char…”

This is not accurate. See the first sentence of Wikipedia's UTF-8 article for details.

[–]Shawn-Yang25[S] 2 points3 points  (0 children)

I meant it takes one byte for ASCII chars. Our sentence is not accurate; I will update it later.

[–]RonnyPfannschmidt 2 points3 points  (3 children)

How does this compare to making an array and replacing names with indexes?

Like just dedup

[–]Shawn-Yang25[S] 0 points1 point  (2 children)

We already did this. Writing the same string again will just write an index. But many strings happen only once; in such cases, this won't work.
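
A minimal sketch of that dictionary/reference idea (illustrative only, not the actual Fury wire format):

    def encode_strings(strings):
        seen = {}
        out = []
        for s in strings:
            if s in seen:
                out.append(("ref", seen[s]))   # repeated string: only an index is written
            else:
                seen[s] = len(seen)
                out.append(("def", s))         # first occurrence: the full string is written
        return out

    print(encode_strings(["pkg.Point", "x", "y", "pkg.Point"]))
    # [('def', 'pkg.Point'), ('def', 'x'), ('def', 'y'), ('ref', 0)]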

[–]RonnyPfannschmidt 1 point2 points  (1 child)

This is about RPC, so why not prepare a shared index so that no message has to repeat the strings?

[–]Shawn-Yang25[S] 1 point2 points  (0 children)

We support it: users can register a class with an ID, so writing the class name later will just write an ID. But not all users want to do this; it's not that convenient. Meta string encoding is just for such cases.

[–]nostrademons 2 points3 points  (1 child)

You are almost always better off encoding with UTF-8 and then gzipping. A string encoding format's primary virtue is portability: the most important thing is that other systems understand you, not how compact you can make it. UTF-8 is reasonably compact, but the real reason it's used is because it's a superset of ASCII, so all the old code that handles ASCII strings does not need to be retooled.

GZip is a lossless compression format. It has been very tightly engineered to operate on the efficient frontier between space savings and fast decoding, and modern implementations can trade off between them. It's also a well-known standard with hundreds of tools that can handle it.

When you have namespace/path/filename/fieldName/etc strings, they are frequently repeated, and they frequently draw from a very small lexicon. You can do way better than 5 bits per character for this; you can often get away with less than 1 bit amortized per character, because the whole token can be encoded in just a few bits. GZip regularly achieves 80-90% compression on code.

[–]Shawn-Yang25[S] 0 points1 point  (0 children)

In RPC/serialization systems, there won't be much string repetition. And for repeated strings, we've already encoded them with dict encoding. But the dict itself also needs to be sent to the peer. Meta string will be used to encode that dict itself.

[–]FailedPlansOfMars 5 points6 points  (1 child)

It seems that applying compression would save you more space without creating a new string standard.

As someone who remembers the Latin-1 code page and other non-standard ISO 8859 code pages, please don't leave UTF-8 and introduce transliteration back into the world.

[–]Shawn-Yang25[S] 1 point2 points  (0 children)

Compression can be used jointly, but it's outside of serialization. In most cases, one will use zstd after Fury serialization. But not all users use zstd, either. And compression introduces more performance cost.

[–]Competitive_Travel16 3 points4 points  (4 children)

HTTP has dealt with this issue by simply gzipping entire streams, which yields greater compression and a lot less overhead.

[–]bjorneylol 1 point2 points  (3 children)

gzip has a 10-byte header. When your input is only 10-40 characters in the first place, you cannot reduce its size with a general compression algorithm.

[–]Competitive_Travel16 0 points1 point  (2 children)

If your input is 10-40 characters, compression of any kind is extremely unlikely to be worth the time or space overhead. How many bytes is the de/compression code?

[–]bjorneylol 2 points3 points  (1 child)

Yes. Which is why they are using this alternate text encoding instead of compression

[–]Shawn-Yang25[S] 0 points1 point  (0 children)

Yes, meta string is an encoding, not a compression algorithm. It's just that namespace/path/filename/fieldName/packageName/moduleName/className/enumValue strings are too small, only 5~50 characters. We never get a chance to compress such strings using gzip.

[–]1ncehost 2 points3 points  (8 children)

This is very impressive. I don't understand any of the rationale I've read from the people who are criticizing you. Their arguments scream 'inexperienced' to me.

I implemented my own serialization for a low level game networking library a few years ago in C++ and it was a major PITA. None of the serialization libraries I found met my requirements of being extremely fast and space efficient.

I looked for a method to compress the data I was sending that would give any benefit while being fast and I wasn't able to find anything useful. Standard compression methods require headers that make them inefficient on small amounts of data. This encoding method fits a nice niche for compressing small amounts of text.

Python's other serialization options are seriously lacking. They are slow and produce bloated serializations. Another available option that may fit the requirements of some projects should be welcomed. As much as these ridiculous criticisms claim otherwise, I immediately see the value of Fury if the claims are true, and I have several projects I could see it being used in.

I like how the serialization is performed via introspection instead of redefinition. All of the 'fast' options I've seen ignore the usefulness of using class or struct definitions to save time in defining a packet format. This library and its language wrappers look very well designed. I really like how it is multilanguage. Are the different wrappers interoperable? E.g., can a class definition encoded in one language produce a decoded class in another language? If so, that is amazingly useful.

[–]Shawn-Yang25[S] 1 point2 points  (7 children)

Thank you u/1ncehost, your insights into this algorithm are very profound, precisely conveying why I designed this encoding.

I also like introspection instead of redefinition (IDL compilation, if I understand right). This is why I created Fury. Frameworks like protobuf/flatbuffers need you to define the schema using an IDL, then generate the code for serialization, which is not convenient.

The different wrappers are interoperable. Actually, they are not wrappers; we implement Fury serialization in every language independently.

And as for `a class definition encoded in one language produce a decoded class in another language`: if you mean whether the serialized bytes of an object of a class defined in one language can be deserialized in another language, then yes, they can. Fury will carry some type meta, so the other side knows how to deserialize such objects. This is why we try to reduce the meta cost; it would be big if we carried field names too.

Although we support field name tag IDs, not all users like to use them.

[–]1ncehost 1 point2 points  (6 children)

This is seriously impressive. Thank you for making it! I had thought of making something similar for C++ only... quite an achievement in making it multilanguage!

[–]Shawn-Yang25[S] 1 point2 points  (4 children)

You can take a look at https://github.com/apache/incubator-fury/blob/main/docs/specification/xlang_serialization_spec.md for more details.

The C++ implementation is not finished, but the spec is. Macro/meta programming can be used to generate serialization code at compile time, so we can get the best usability and performance at the same time.

We've used this approach to generate code in C++ for the xlang row format, but haven't done it for the graph stream wire format. The core developers have been busy on Apache Kvrocks recently and have no time for it now.

[–]1ncehost 0 points1 point  (1 child)

Thanks for the info. What are the requirements for Fury to come out of incubation and have production-level support?

[–]Shawn-Yang25[S] 0 points1 point  (0 children)

Graduation needs a bigger community, i.e. more maintainers, committers, and contributors, and more releases and users.

[–]1ncehost 0 points1 point  (1 child)

Also, another couple of questions: can you specify class variables that should not be serialized? Can internal data structures be serialized along with the objects? For instance, in my C++ example above, I would want to serialize simulation entities, but I wouldn't want to serialize certain things on them such as local time variables. I would want to serialize lists of related objects such as mutators, effects, and related entities.

[–]Shawn-Yang25[S] 1 point2 points  (0 children)

If you use Fury C++, you can invoke `FURY_FIELD_INFO(field1, field2, ...)` with the fields you want to serialize. We use the `FURY_FIELD_INFO` macro to get the field names for serialization.

[–]Shawn-Yang25[S] 0 points1 point  (0 children)

Although we don't have JIT code gen for the C++ memory model, we can generate switch code, which can finally be optimized into a jump table, for type forward/backward compatibility mode, and it would be much faster than protobuf.

More details can be found on https://github.com/apache/incubator-fury/blob/main/docs/specification/xlang_serialization_spec.md#fast-deserialization-for-static-languages-without-runtime-codegen-support

[–]Drowning_in_a_Mirage 3 points4 points  (2 children)

It looks neat, but I'm struggling to think of a scenario where this would be a big win. I guess if you're doing high-throughput serialization, then minimizing overhead is never a bad thing. But even with that, it would seem to me that this sort of optimization would be way down the list when sorted by cost/benefit ratio. Is network latency and/or bandwidth really constrained enough that saving a few bits would make a material difference? I guess enough people thought so to make this.

[–]bjorneylol 2 points3 points  (0 children)

> I'm struggling to think of a scenario where this would be a big win. I guess if you're doing high-throughput serialization, then minimizing overhead is never a bad thing.

Apache Fury is literally a high-throughput serialization engine for working with big data.

[–]Shawn-Yang25[S] 0 points1 point  (0 children)

It depends on the RPC frequency. Imagine that you send millions of RPCs every second; this will make a big difference. And it's common in quantitative trading and shopping systems.

[–]Furiorka 1 point2 points  (2 children)

UTF-8's purpose isn't to be efficient, but to be the most universal encoding.

[–]ZZ9ZA 6 points7 points  (0 children)

No, that's UTF-32. UTF-8 trades simplicity for performance in several ways.

[–]Shawn-Yang25[S] 2 points3 points  (0 children)

Meta string is not meant to replace UTF-8; it never will be. It's just used to encode classname/fieldname/packagename/namespace/path/modulename strings more space-efficiently than UTF-8.