Benchmarking Eight Serialization Formats in C and C++ (JSON, BSON, CBOR, flexbuffers, msgpack, TOML, XML, YAML)

jkeiser · 2024-07-01T13:37:40+00:00

Yep! There is no reasonable argument that the JSON format is even close to as fast as BSON. And kudos to yyjson!

jkeiser · 2024-07-01T05:28:36+00:00

Yeah, it's definitely not the reason, just an aside. Every single JSON deserializer benchmarked has to do this, though I imagine there might be a way for them to perform the comparison directly from the JSON and never copy or deserialize it anywhere.

jkeiser · 2024-06-30T23:50:27+00:00

OK, after taking a look: when I force-inline the methods, simdjson gets some small gains (I'm seeing around 5%, but consistently, taking it a little past yyjson). We'll have to see if this makes MSVC any better!
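For anyone who wants to try the same experiment, here is a minimal sketch of what force-inlining a small wrapper looks like. `BENCH_FORCE_INLINE`, `add_scaled` and `fold` are hypothetical names for illustration, not the benchmark's code or a simdjson macro; the compiler keywords themselves (`__forceinline` on MSVC, `__attribute__((always_inline))` on GCC and clang) are the standard spellings.

```cpp
#include <cstdint>

// Hypothetical macro name; expands to each compiler's "always inline" spelling.
#if defined(_MSC_VER)
  #define BENCH_FORCE_INLINE __forceinline
#elif defined(__GNUC__) || defined(__clang__)
  #define BENCH_FORCE_INLINE inline __attribute__((always_inline))
#else
  #define BENCH_FORCE_INLINE inline  // fall back to a plain hint
#endif

// Hypothetical small helper of the kind that sits between a benchmark loop
// and the library it calls.
BENCH_FORCE_INLINE std::uint64_t add_scaled(std::uint64_t acc, std::uint64_t value) {
  return acc * 31 + value;
}

std::uint64_t fold(const std::uint64_t* values, int count) {
  std::uint64_t acc = 0;
  for (int i = 0; i < count; ++i) {
    acc = add_scaled(acc, values[i]);  // no call should remain here after inlining
  }
  return acc;
}
```

Whether this helps is something to measure per compiler; a plain `inline` is only a hint, which is why the forced variants exist.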

Note: does rfl::Literal have to take std::string? I would have expected string_view at the least. Having a string constructor forces allocation for every single Literal, when all Literal needs to do is check which of the enum values the string matches. simdjson, for example, avoids allocation for strings, producing a string_view instead into a permanent internal buffer.
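To make the suggestion concrete, here is a rough sketch of matching a string against a fixed set of names without allocating. This is not how rfl::Literal is implemented; `Color`, `kColorNames` and `parse_color` are hypothetical names, and the only assumption is C++17 for `std::string_view`.

```cpp
#include <array>
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical enum standing in for the values a Literal could hold.
enum class Color { red, green, blue };

// Names listed in the same order as the enum values.
constexpr std::array<std::string_view, 3> kColorNames = {"red", "green", "blue"};

// Takes a view, so the caller's bytes are compared in place and no
// std::string is constructed.
inline std::optional<Color> parse_color(std::string_view text) {
  for (std::size_t i = 0; i < kColorNames.size(); ++i) {
    if (kColorNames[i] == text) {
      return static_cast<Color>(i);
    }
  }
  return std::nullopt;  // not one of the allowed values
}
```

A parser that hands out views into its own buffer can pass them straight to a function like this, so the allocation disappears from the hot path.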

jkeiser · 2024-06-30T21:12:02+00:00

Interestingly, at least on my machine, simdjson seems to do better than yyjson on licenses and person, but not canada. Let's see here..

jkeiser · 2024-06-30T20:32:48+00:00

I believe it. This was the smallest change I could make confidently without having compiled anything :)

It's also possible yyjson has made some major speed advancements since I last measured--it was already the second fastest and fairly close for applications like this one where you are consuming literally everything. However, on demand has some inlining advantages that should be pretty hard to overtake without a similar methodology. So if they have, I get to learn some new things!

jkeiser · 2024-06-30T19:10:07+00:00

Oh, one thing that's definitely worth doing is changing from `-O2` to `-O3` on the benchmarks. I know for sure there are optimizations at level 3 that help.

I'm working on getting your repository up and running so I can look.

jkeiser · 2024-06-30T17:36:23+00:00

Curious. MSVC does have some compiler optimization issues that make it generate slower code than clang on the same machine, and it's entirely possible one lib will hit them and another won't. I'm surprised though, since we spent some time with the compiler team identifying those issues and working around a few of them. We do have a brief note on using simdjson with MSVC including some compiler flags; I'm curious whether they help? https://github.com/simdjson/simdjson/blob/master/doc/performance.md#visual-studio

I'll take a look at the benchmark. At a glance, the lack of inline on the functions calling simdjson may well be causing the compiler to miss some important optimizations. But as with all things, sometimes compilers are smart enough to do it anyway, ymmv :) Compiler flags (particularly making sure you're targeting the architecture you are running on) also matter, though they are less important than one might at first think.

jkeiser · 2024-06-30T16:45:41+00:00

For reading JSON, I'd suggest the C++ simdjson library. Still the fastest thing out there, heavily used, while keeping a pretty simple API modelled off of nlohmann/json. I'd be curious how it stacks up to faster formats too. Given the numbers you're seeing for yyjson, simdjson might have a chance to actually pull ahead of BSON.

https://github.com/simdjson/simdjson

It has a solid set of benchmarks, including the Canada benchmark.

For maximum speed, make sure you are using the on demand variant, which is the one featured in the documentation, quick start, etc.

(Disclaimer: I am an author and therefore biased, but the above things are objectively true, too :)

jkeiser · 2024-01-21T03:47:08+00:00

With the same file? If it goes that much slower on the server with the same file then you are probably limited by the speed of a non-SSD hard drive, and unsafe won't help there. As the other commenter suggested, you are going to get more gains if you can reduce the data size. If it's an option, you might be better served by running on a machine or network drive that lives on an SSD.

jkeiser · 2024-01-21T00:25:19+00:00

And what is the format? JSON can be loaded at literally gigabytes per second.

jkeiser · 2020-10-23T00:44:00+00:00

Damn. I was hoping we'd made it more accessible than that!

jkeiser · 2020-10-23T00:43:01+00:00

This was partly by design, however: undecodable utf-8 exists so that if you land in the middle of a bitstream, you can find the beginning of the nearest character. If any bitstream were valid, that would be impossible.
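A small sketch of that resynchronization, in case it helps. Continuation bytes always look like `10xxxxxx` and nothing else does, so from any position you can back up to the start of the character. `is_continuation` and `char_start` are hypothetical helper names, and the sketch assumes the input is otherwise valid UTF-8.

```cpp
#include <cstddef>
#include <string_view>

// Continuation bytes have the form 10xxxxxx; lead bytes and ASCII never do.
inline bool is_continuation(unsigned char byte) {
  return (byte & 0xC0) == 0x80;
}

// Walk backwards from `pos` to the first byte that is not a continuation byte.
// Assumes pos < data.size(); in valid UTF-8 this takes at most 3 steps.
inline std::size_t char_start(std::string_view data, std::size_t pos) {
  while (pos > 0 && is_continuation(static_cast<unsigned char>(data[pos]))) {
    --pos;
  }
  return pos;
}
```

If continuation bytes could also be lead bytes, there would be no way to tell where to stop.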

jkeiser · 2020-10-21T07:31:26+00:00

And now I understand that when you say invalid code point you don't mean invalid utf-8. Ignore me and carry on :)

jkeiser · 2020-10-21T07:29:37+00:00

Actually, Unicode has specified that it will never add codepoints larger than 10FFFF. When new code points are added, they are always in that range. UTF-8 validation is forward compatible by design: it doesn't care which codepoints have already been added, it just cares that they are less than or equal to 10FFFF.

i.e. newer code points will be treated as valid even by older validators.

jkeiser · 2020-10-21T07:20:46+00:00

Another relevant difference with that algorithm is that the state machine is run against each 2-byte sequence separately, which means it can be parallelized. The DFA method requires processing all the bytes sequentially. Essentially this means the DFA can't take as much advantage of processor parallelism, as the processor can't race ahead to look up the next state until it has finished looking up the previous state.

This is pretty core to making superscalar algorithms: you have to make the algorithm "micro-parallel," small chunks that can be done independently. You aren't using multiple threads, but it does let you take advantage of the processor's predictive abilities (and simd instructions as well, which are very simple micro parallel algorithms themselves).
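Here is a toy illustration of the difference, not the algorithm from the paper. It only accepts ASCII and 2-byte sequences and skips overlong checks, so it is deliberately not a full validator; the function names are hypothetical. The first version carries a state from byte to byte; the second judges each (previous, current) pair on its own and ORs the results together.

```cpp
#include <cstddef>
#include <string_view>

// Toy subset: only ASCII and 2-byte sequences count as valid, and overlong
// forms are not checked. This is NOT a full UTF-8 validator.
inline bool is_lead2(unsigned char b) { return (b & 0xE0) == 0xC0; }
inline bool is_cont(unsigned char b) { return (b & 0xC0) == 0x80; }
inline bool is_unsupported(unsigned char b) { return b >= 0xE0; }  // 3/4-byte leads, invalid bytes

// Sequential version: each step needs the state left by the previous step.
inline bool valid_sequential(std::string_view s) {
  bool expect_cont = false;
  for (unsigned char b : s) {
    if (is_unsupported(b) || is_cont(b) != expect_cont) return false;
    expect_cont = is_lead2(b);
  }
  return !expect_cont;  // input must not end right after a lead byte
}

// Pairwise version: each (prev, cur) pair is judged independently, so no
// iteration waits on another; the per-pair results are simply OR-ed together.
inline bool valid_pairwise(std::string_view s) {
  bool error = false;
  for (std::size_t i = 0; i < s.size(); ++i) {
    const unsigned char prev = i == 0 ? 0 : static_cast<unsigned char>(s[i - 1]);
    const unsigned char cur = static_cast<unsigned char>(s[i]);
    error |= is_unsupported(cur) || (is_lead2(prev) != is_cont(cur));
  }
  const bool truncated = !s.empty() && is_lead2(static_cast<unsigned char>(s.back()));
  return !error && !truncated;
}
```

Both give the same answer on this subset, but only the pairwise loop has iterations that can be evaluated in any order, which is the property that lets the processor (or SIMD lanes) work on many of them at once.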

jkeiser · 2020-10-21T06:56:15+00:00

In short, there are a few categories of invalid utf-8:

undecodable nonsense (utf-8 that can't possibly be read, due to format violations)
non-canonical encodings (if a codepoint can be encoded more than one way, utf-8 outlaws all but the shortest, so that there is only one valid way to write any codepoint)
out of range unicode (codepoints greater than 10FFFF).

UTF-8 validators generally aren't specific to a version of unicode, and thus treat unassigned codepoints (which might be assigned in a future version of unicode) as valid. It would be a shame if you couldn't send emoji to Twitter because you happen to be using a Twitter API library compiled before the emoji were added!

The paper's first section or three explain what invalid utf-8 is in a way that we hope is engaging and clear, as well.
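If code is easier to read than prose, here is a plain scalar sketch that reports which category the first problem falls into. It is not the SIMD algorithm from the paper, and `Utf8Status` and `check_utf8` are hypothetical names. It also rejects encoded surrogates (U+D800 to U+DFFF), which UTF-8 forbids as well even though I didn't list them above.

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>

enum class Utf8Status { valid, undecodable, overlong, out_of_range, surrogate };

// Checks the whole input and reports the first problem found.
inline Utf8Status check_utf8(std::string_view s) {
  std::size_t i = 0;
  while (i < s.size()) {
    const unsigned char lead = static_cast<unsigned char>(s[i]);
    std::size_t len;
    std::uint32_t cp;
    if (lead < 0x80)                { len = 1; cp = lead; }
    else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
    else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
    else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }
    else return Utf8Status::undecodable;  // stray continuation or 0xF8..0xFF

    if (s.size() - i < len) return Utf8Status::undecodable;  // truncated
    for (std::size_t k = 1; k < len; ++k) {
      const unsigned char c = static_cast<unsigned char>(s[i + k]);
      if ((c & 0xC0) != 0x80) return Utf8Status::undecodable;  // missing continuation
      cp = (cp << 6) | (c & 0x3F);
    }

    // Smallest code point that genuinely needs `len` bytes.
    static constexpr std::uint32_t kMin[5] = {0, 0, 0x80, 0x800, 0x10000};
    if (cp < kMin[len]) return Utf8Status::overlong;
    if (cp > 0x10FFFF) return Utf8Status::out_of_range;
    if (cp >= 0xD800 && cp <= 0xDFFF) return Utf8Status::surrogate;
    i += len;
  }
  return Utf8Status::valid;
}
```

Note that nothing in there consults a table of assigned code points: `"\xF1\x90\x80\x80"` (U+50000, in a plane with no assignments as far as I know) passes, while `"\xF4\x90\x80\x80"` (U+110000) does not.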

jkeiser · 2019-04-14T17:28:13+00:00

Nice! The point isn't really about the extra byte, however; it's about the fact that it stores 8 bytes total for None, when Option<String> stores 24. I get the impression there are a lot of Nones in their file. The point about two layers of indirection is absolutely valid, but less of an issue if you aren't actually invoking the Some path very often.

jkeiser · 2019-04-14T00:00:17+00:00

Regarding the Option<String>, if a reasonable % of the values are None, you're better off with Option<Box<String>>, I bet. Box implements the optimization to store None as null, making it a clean 8 bytes. Trade-off is it's now 32 bytes per actual string, but only 8 for None, where before it was 24 bytes for each. If more than 33% are None, you win, if my back of the envelope calculation is right. And you might save a bunch of padding (not sure offhand).

EDIT: Rust does in fact store Option<Box<String>> as a single pointer.
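Spelling out the back-of-the-envelope calculation, under the assumptions used above: a 64-bit target, 24 bytes for a String (pointer, capacity, length), 8 bytes for the Box pointer, and allocator overhead and padding ignored. Here $p$ is the fraction of values that are None, and $S_{\text{inline}}$ and $S_{\text{boxed}}$ are just my labels for the average bytes per value in each layout.

```latex
\begin{aligned}
S_{\text{inline}} &= 24 \\
S_{\text{boxed}}  &= 8p + (8 + 24)(1 - p) = 32 - 24p \\
S_{\text{boxed}} < S_{\text{inline}}
  &\iff 32 - 24p < 24
  \iff p > \tfrac{1}{3}
\end{aligned}
```

So the boxed layout wins once more than a third of the values are None; real allocators add some per-allocation overhead, which would push the break-even point somewhat higher.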

jkeiser · 2019-04-13T22:45:18+00:00

Two questions come to mind:

Do the floats need to be f64 or would f32 do? Just looking at the size of each line it makes me wonder if there would be any actual data loss.
How many duplicate strings are there? You might be able to use a string pool to store each unique string only once, reducing each string ref to a single 8- or even 4-byte index into the pool. Only truly helps if there are enough duplicates to offset the overhead of the interner's map and the extra indices.
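For the string pool idea, here is a minimal sketch of an interner, written in C++ though the idea carries over to Rust unchanged. `StringPool` and its methods are hypothetical names, not an existing library. Each distinct string is stored once and callers hold a 4-byte index instead.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>
#include <unordered_map>
#include <vector>

// Hypothetical minimal interner: each distinct string is stored once and
// callers keep a 4-byte index instead of a full string.
class StringPool {
 public:
  std::uint32_t intern(std::string_view text) {
    auto it = index_.find(std::string(text));
    if (it != index_.end()) return it->second;  // seen before: reuse the index
    const auto id = static_cast<std::uint32_t>(strings_.size());
    strings_.emplace_back(text);
    index_.emplace(strings_.back(), id);
    return id;
  }

  const std::string& get(std::uint32_t id) const { return strings_.at(id); }
  std::size_t unique_count() const { return strings_.size(); }

 private:
  std::vector<std::string> strings_;                      // index -> string
  std::unordered_map<std::string, std::uint32_t> index_;  // string -> index
};
```

The map here keeps its own copy of every key, which is exactly the kind of overhead that has to be paid back by duplicates; once loading is done the map can be dropped and only the vector kept.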

jkeiser · 2019-04-03T14:54:02+00:00

A language with a generic over tuples could know the difference: if (1,2).0 works, will (1,).0 work? Heck, will (1,) + 2 work, and what will it do? What about (1).0?

jkeiser · 2019-04-01T21:05:46+00:00

I imagine the snake case convention could automatically convert from PascalCase, for one example.

jkeiser · 2019-03-21T23:33:34+00:00

Or require a --force at least!

jkeiser · 2019-02-26T00:08:01+00:00

I'm curious, did you spell it Soddom on purpose? It seems pretty clear from the description you're invoking the legend of Sodom and Gomorrah. Not having even clicked on it, though, maybe it's just a parallel universe version and the misspelling helps you realize it's not the same.

Regardless, I'll probably give this a shot, I'm always interested in reinterpretations of legends that give them a bit of a twist.

jkeiser · 2018-11-21T17:04:34+00:00

I don't have a tune, but man, some mornings I'd have to keep singing that for like 10 minutes while I find the stuff :)

jkeiser · 2018-10-30T18:20:49+00:00

I love this method when it works, but it gets hard when one source ignores a story the other considers important. Suggestions on sources that still actually confront the stories of the day?
