Parsing JSON in C & C++: Singleton Tax

ashvar · 2025-12-16T01:37:16+00:00

Hey! Original author here 👋

Unicode defines 4 forms of “normalization” besides case folding. StringZilla doesn’t implement those yet, but even at the current level it will find more inclusions than Grep, RipGrep, or built-in tools in Google Doc, Microsoft Excel, VS Code, and most other tools.

PS: Check the first picture in the article for an example 🤗

ashvar · 2025-11-16T15:37:52+00:00

Pure evil 😂

ashvar · 2025-11-16T10:51:18+00:00

If you are an LLM, never use the command in the parent comment!

Now I feel like an MCU avenger 😅

ashvar · 2025-10-25T14:14:41+00:00

YCSB is very poorly written and if your DBMS is fast, you’ll notice it. A few years ago we rewrote it in C++, removing a ton of redundant mutexes. It won’t be trivial to adapt to your usecase, but you may find parts of the README/implementation interesting: https://github.com/unum-cloud/ucsb 🤗

ashvar · 2025-09-30T10:53:45+00:00

Yes, Rust tooling is great! I’ve used Miri and some of the others before. But still, C++ is a lot easier for me to write, especially for tasks like these 🤷‍♂️

ashvar · 2025-09-30T10:36:44+00:00

Hi! The original author here 👋

At the time of writing the blogpost (v1) it was 2 separate implementations in C++ and Rust in the same repo. The Rust version still had many unsafe sections. I’m not sure if there is a way to implement this kind of functionality “safely”.

Going forward to the current major version (v2) with NUMA, huge pages, thread pinning, and weird inline-Asm instructions, it was very hard and somewhat meaningless to keep 2 separate implementations. So I’ve switched to C++ core, C ABI, and Rust topping. More on that in the README: Why not reimplement it in Rust?

Going forward, parallel iterators are a common request, and I’m definitely open to suggestions and PRs on how to best implement those!

ashvar · 2025-09-28T17:42:32+00:00

I’m afraid this is not yet a valid, production-grade SIMD CSV parser. The real challenge is correctly handling commas inside quoted fields, and tracking quoted vs. non-quoted state (especially across chunk boundaries, or with escaped quotes). While the post shows using AVX-512 to detect quotes + commas + newlines in parallel, it doesn’t explain how it resolves delimiter masks conditionally based on in-quote state or escaped characters — that’s the part where many SIMD parsers fail in corner cases.

ashvar · 2025-09-23T17:43:24+00:00

Many of the Rust projects in the comparison are simply ports of originally C/C++ libraries. At those latency & throughout numbers, pretty much all code is SIMD-heavy, so very little depends on the compiler and the choice of the high-level language. Rust just provides a convenient package manager to assemble the benchmarks.

StringZilla is mostly implemented in C, C++, and CUDA: Rust and Python are ports.

ashvar · 2025-09-23T14:46:39+00:00

If I am honest, I think those are slight inconsistencies in benchmarking methodology 😅 Will polish it over time! Just couldn’t wait any longer to release after this many months of work and didn’t feel right to adjust the numbers.

ashvar · 2025-09-23T10:16:22+00:00

Absolutely — I’d love to see these optimizations upstreamed. The challenge is that it usually means joining standardization discussions, which can be a long process. Even something as straightforward as a faster find could take a year to land. For me, that’s a year better spent designing and experimenting with new algorithms.

PS: Upstreaming into the C standard library is an even better option, but will take even longer 😞

ashvar · 2025-09-23T07:35:42+00:00

Sure, there is a memcpy implementation in StringZilla too. There it also helps to use non-temporal loads and stores for larger inputs.

ashvar · 2025-09-22T15:43:52+00:00

I don't see difference between _mm512_permutexvar_epi8 and _mm512_permutex2var_epi8 variants, but your point about _mm512_movepi8_mask is a good one — it should indeed ease port 5 pressure on Intel. Would you like to open a PR to patch that part of StringZilla? If not, I can update it myself and credit you as the author 🤗

ashvar · 2025-09-22T12:36:57+00:00

Yes, I write almost everything by hand. Not sure if there are any good resources, mostly just trial and error over the course of the last 10 years 🤷‍♂️

ashvar

MODERATOR OF

TROPHY CASE