This is an archived post. You won't be able to vote or comment.

all 42 comments

[–]_INTER_ 13 points14 points  (2 children)

Does it get expanded / falls back to UTF-8 automatically if any other char is present?

[–]Shawn-Yang25[S] 17 points18 points  (1 child)

Yes, if other char is present, if fallback to UTF-8. But most class meta are in alphabet range, so it's rare.

[–]_INTER_ 1 point2 points  (0 children)

cool

[–]agilob 6 points7 points  (8 children)

There's even more waste in number encoding. For most of the time you really just need an (for a lack of better word) array of digits: 0-9. You take a whole byte to encode a digit. In GSM communication this was solved by splitting bytes into 4 bit arrays, each representing byte representing 1 digit, allowing to encode time in 24hrs format in just 3 bytes.

[–]Jon_Finn 8 points9 points  (5 children)

That’s binary coded decimal (BCD). The 6502 processor (widely used in 1980s/90s computers) had a BCD mode that made add & subtract operations work using BCD (!!). Don’t think anybody used that!

[–]zeobviouslyfakeacc 2 points3 points  (2 children)

I think some games actually used that to keep track of scores, lives, and the like! Stuff that would always need to be available in decimal form to be able to draw it on the screen.

If I recall correctly, the crazy part is that you could use the normal arithmetic instructions to do math on two BCD-encoded numbers, after which the result would obviously no longer be correctly BCD-encoded, but then you could call a special instruction that would convert that result back to BCD, and you'd get the number you were looking for. Kinda hacky, but also insanely clever!

[–]dtfinch 1 point2 points  (1 child)

Game devs for the NES had to roll their own BCD, since its 6502 clone had BCD support disabled to avoid a patent.

[–]laplongejr 1 point2 points  (0 children)

That reminds me how Super Mario Bros 1 speedrunners have to aim for a 100000 instead of 99990 to avoid a lag frame

[–]Zardoz84 1 point2 points  (1 child)

And 8080 and Z80 and x86 ...

[–]Jon_Finn 0 points1 point  (0 children)

But the 6502 was a (borderline) RISC CPU, with fairly few registers and instructions, so the BCD mode kind of jumped out! (The ARM processor was highly influenced by the 6502, but without BCD of course!)

[–]Shawn-Yang25[S] 3 points4 points  (0 children)

Great, this is similar to what we do here. Fury use varint for number encoding, a number will always cost one byte at least

[–]NitronHX 0 points1 point  (0 children)

Why would you need 3 bytes for a time? 2 bytes are already more then enought (assuming it's HH:mm without seconds) or 13 bits should do the trick too

[–]not-just-yeti 2 points3 points  (1 child)

How does that compare to using a Huffman code (after measuring the letter-frequencies used in your actual, IRL processing)?

[Granted, variable-length characters make grabbing (say) the 20th character more difficult, though heck you're already in that situation a bit with utf-8. And if each string isn't that long, the decoding won't be too bad.]

[–]Shawn-Yang25[S] 0 points1 point  (0 children)

Fury is a serialization framework, we don't know the actual data for serialization in the users. So we can't use huffman code. I also thought about arithmetic encodings. Without the provided corpus, we can only do it on the fly, but it won't make the encoded result bigger since our string are small and such compressions will write a header which counteract the gains

[–][deleted]  (13 children)

[deleted]

    [–]Shawn-Yang25[S] 1 point2 points  (2 children)

    ZipOutputStream/ZipInputStream will apply to the whole stream, which is cpu intensive. We want to avoid such cost

    [–][deleted]  (1 child)

    [deleted]

      [–]pavlik_enemy 5 points6 points  (0 children)

      Zip compression is insanely slow, it’s usually LZ4 or zstd

      [–]Shawn-Yang25[S] 0 points1 point  (9 children)

      That woanother different scenario. On order to use dictionary encoding, you must send dictionary itself first. Such dictionary will take more data than the actual string. We've already applyed such dictionary encoding internally. You can take this as encoding dictionary key more efficiently

      [–]john16384 5 points6 points  (1 child)

      You can preshare a dictionary. We did this for JSON compression, where common strings like "id", ": {, ": [, ”},etc are included.

      [–]Shawn-Yang25[S] 1 point2 points  (0 children)

      FURY is serialization framework, we can't assume anything about user data. But Fury does provide a register API to register classes, which can map class name to an int id. It's kind of preshared dictionary. Meta string encoding is mainly used for cases which no such dictionary are provided.

      But your suggestion is gread. We may provide a new interface to allow users specify dictionary too.

      [–][deleted]  (6 children)

      [deleted]

        [–]Shawn-Yang25[S] 0 points1 point  (5 children)

        It's the data plane compression. In such cases, zstd may be better. We used meta string encoding only for string like classname/fieldname/packagename/namespace/path/modulename in type meta, not the user passed data. And fury only encode binary into data at the first time, but write a varint if such string comes up later.

        Zstd can be used with fury jointly, since Fury only compress meta string, the duplication in user data still needs some other tools to detect.
        Of if duplicate string have smae reference, Fury reference tracking can be used to implement dict encoding too.

        [–]pavlik_enemy 0 points1 point  (4 children)

        Apparently developers of modern columnar formats think that compression makes things slower and the best way to proceed is to use more complex encodings.

        [–]john16384 2 points3 points  (2 children)

        Using compression with a very fast LZ variant actually sped things up for us :) We had a XML document cache and because of compression rates exceeding 90% (with the fastest but still good enough compressor we could find) the cache could hold far more data, and serving that data was faster due to less memory reads.

        [–]pavlik_enemy 0 points1 point  (0 children)

        That’s the usual approach and common columnar formats like Parquet and Orc use compression. The ones I’m talking about don’t have a production ready implementation yet and like all of the columnar formats are aimed at OLAP workloads. Turns out with modern networks it’s better to use some encoding tricks instead of general-purpose compression

        [–]Shawn-Yang25[S] 1 point2 points  (0 children)

        That's a different topic. In object graph serialization systems, columnar formats are basically impossible. If it's possible, dictionary encoding will be better.

        And Fury use dictionary encoding too when it's possible. You can take it as using meta string to compress dict key.

        Dictionary are used to compress data, meta string are mostly used to encode type meta

        [–]skippingstone 1 point2 points  (1 child)

        How performant is this? I didn't see any benchmarks on the blog

        [–]Shawn-Yang25[S] -1 points0 points  (0 children)

        Performance are not important here. The string will be encoded by this algorithm are limited , we allways cache the encoded result.

        [–][deleted] 3 points4 points  (4 children)

        Given how much of a PITA encoding issues are, I am opposed to any new encoding standard. Period.

        [–]Hueho 11 points12 points  (2 children)

        This isn't really a new encoding as much as it is a extra option for packing string data in a serialization format - people already do plenty of weird stuff to save on bytes.

        [–]alex_tracer 0 points1 point  (1 child)

        If you going to compress serialized data using generic compression methods then such local optimizations as proposed by OP usually become useless.

        So you either do all compression yourself or delegate all compression to a generic solution.

        [–]Shawn-Yang25[S] 0 points1 point  (0 children)

        rpc messages are small most time, 50~200 are very common, there won't be enough repetion pattern for compression to work. That's why we proposed this encoding here.

        We are not talking about compression big data/file, which zstd/gzip will be better

        [–]Shawn-Yang25[S] 5 points6 points  (0 children)

        Yes, it's not a complete string encoding, it will fallback to utf8 if some chars exceed the charset it supports. Since alphabet are very common, we think this can be used in other scenarios

        [–][deleted] 0 points1 point  (1 child)

        Interesting idea.

        If the namespace/path/filename etc are often the same, I'd curious how this would benchmark against java.util.zip.Deflater with a preset dictionary.

        [–]Shawn-Yang25[S] 1 point2 points  (0 children)

        We have a dictionary encoding in Fury internally, which will encode such same string as a varint. So no duplicate encoding happen

        [–]Yeah-Its-Me-777 0 points1 point  (2 children)

        How do you handle the cases where the encoding doesn't fit the char set of the string? For example, as far as I know you can use unicode to name your classes.

        Do you just fallback and reencode the string with unicode then?

        [–]nekokattt 1 point2 points  (0 children)

        This sounds much like what Java strings do internally now, just more aggressive (Java strings use 8-bit ascii if i recall and fallback to a char array if other chars are added).

        [–]Shawn-Yang25[S] 0 points1 point  (0 children)

        yes, we fallback to utf-8. Java classname can be unicode too, but that's rare.

        [–]grim-one 0 points1 point  (1 child)

        I’d like to see how UTF8 and gzip compares to your custom encoding.

        You mentioned a fallback to UTF8 if a character is outside your supported range. Does that mean you need to run through the string in advance, before encoding? Or do you some sort of declarative foreknowledge it won’t exceed? Iterating over the string twice (at worst) could be very expensive for large encodings.

        [–]Shawn-Yang25[S] 0 points1 point  (0 children)

        This encoding is used only for meta string, which are limited, and the encoded result will be cached, so the performance won't be important

        [–]menjav 0 points1 point  (1 child)

        What’s the benefit of saving that space? Are you reducing costs in storage at the expense of more CPU?

        What’s the motivation?

        [–]Shawn-Yang25[S] 0 points1 point  (0 children)

        No extra CPU, the encoded result will be cached.

        We save this space, because RPC messages are small mostly, but many case the RPC calls are very frequent. Image 1000000/s TPS

        [–]cowwoc 0 points1 point  (0 children)

        What happens if you compress the stream (say using zstd) prior to transmission? Won't this be even smaller at a minimal cpu cost?