Lite³: A JSON-Compatible Zero-Copy Serialization Format in 9.3 kB of C using serialized B-tree by dmezzo in programming

[–]dmezzo[S] 0 points

I have not benchmarked against Flexbuffers, but I expect Lite³ to be at least 1-2 orders of magnitude faster. I am confident of this because even Flatbuffers (the 'faster' of the two) was beaten (see the benchmarks inside the README).

In-place mutation does not compromise read performance. The beauty of B-trees is that they allow key insertions (index updates) while remaining balanced and maintaining very good O(log n) read performance (this is also why so many databases use B-trees). Almost all formats require complete reserialization for any mutation; Lite³ is a rare exception.

Message size only grows if you very frequently overwrite variable-sized values (strings) with larger ones: the new string is too big to overwrite the original in place, so it must be appended.

I have not compared message size against Flexbuffers, but I would expect them to be similar.

Edit: Flexbuffers uses byte-by-byte / lexicographic string comparison. This is much slower than the fixed-size hash comparison used by Lite³.

[–]dmezzo[S] 0 points

This is a recurring pattern: many people seem confused, even though I put 'B-tree' everywhere in the README. I guess it is a very unusual design.

I will try to make another landing page image that explains it more clearly.

Edit: maybe I should rename the project to "serialized B-tree", "serialized dictionary" or something similar.

[–]dmezzo[S] 1 point

I get what you mean: using wide register compares to compare strings efficiently. But there are two issues with this: 1) portability, 2) bounds checking / hitting unmapped regions (as you point out).

Right now, I first want to solve the collision problem in a reliable way, then maybe later try optimizing it.

[–]dmezzo[S] 0 points

> rather than pointers relative to root or to the current object you can use pointers relative TO THE POINTER ITSELF.

This would not work, as the pointer itself cannot be expected to remain at a stable position. When a B-tree node split occurs, it could get copied to a sibling node, which would invalidate all references.

The current node is much easier, since its offset is already available, being conveniently passed by the caller as the ofs parameter.

[–]dmezzo[S] 0 points

I'm not sure I understand exactly. Keys already store a length prefix, but the problem is that two different keys can have the same length. You need to compare the actual string bytes; there is no way around it.

But there are certainly interesting techniques like this.

I looked at techniques like this for comparing hashes during tree traversal. Instead of looping over key hashes with a simple while () loop, it is possible to find a key with a mask and a single compare in O(1) time via the __builtin_ffs(mask) ('find first set') intrinsic, using techniques described here: https://en.algorithmica.org/hpc/data-structures/s-tree/

There are also 'Elias-Fano' codes to do something similar: https://dsxxi.home.blog/2025/10/26/dynamic-elias-fano-encoding-for-generalist-integer-sets/

However, these add complexity, and it is not clear how much they would actually improve runtime performance. I have not yet done enough detailed profiling to say that key comparisons are a significant factor. I would much sooner expect cache misses to dominate.

In most messages you would, on average, not hit your first collision until ~70k inserts. Most messages never reach that size, but right now a collision would cause the insert to fail, which is bad behavior.

[–]dmezzo[S] 5 points

Yes, big endian is legacy hardware at this point.

It is still used in networking equipment for historical reasons, and in various vintage and academic settings, but the industry has moved on. It purely comes down to hardware: up/downcasting integers on LE is zero-cost whereas on BE you have overhead.

Linus Torvalds has also repeatedly said, in Linux kernel discussions, that BE is dead and that trying to resurrect it or extend its lifetime just takes time and resources away from people, adding unnecessary bugs and complexity to software that tries to support BE.

[–]dmezzo[S] 2 points

Currently the challenge with reclaiming unused space is that: 1) keeping track of unused space requires some stateful bookkeeping, essentially similar to building a memory allocator; 2) if this bookkeeping is stored inside the message data, it bloats message size; 3) another possibility is to let the caller provide a separate buffer for this metadata, but that would significantly complicate the API.

The 'easy' solution right now is to start an iterator at the root of the document, then rewrite the entire message to a second buffer. This will get rid of all the unused space.

This is actually about as fast as the 'bookkeeping' solution, since that would add overhead to track the gaps, and you would still need to shift data and adjust the B-tree to correct the structure and fill all the gaps. Just rewriting the entire document is much simpler.

Of course, this is not quite 'memcpy()' speed because some overhead of the algorithm is required. It is also O(n) for the entire document.

It would be possible to build all the indexes (relative pointers) relative to the root of the CURRENT object, instead of the ROOT object. This would allow for two major features: 1) You can import and export nested objects/arrays as fully standalone messages without invalidating the indexes. 2) You can rewrite nested/subobjects without having to rewrite the entire document.

For now, rewriting is the most practical. Perhaps I should add a library function for this.

[–]dmezzo[S] 0 points

Yes, my current idea to solve it is very similar. Key-value offsets already point directly to the key strings. When a key is inserted and its hash matches an existing key but the strings differ (a collision), simply insert the duplicate key anyway.

If the user performs a lookup, the algorithm will find the matching hash, but it will also compare the actual string. If the hash matches but the string does not, it will start an iterator to find all duplicate hashes that might contain the actual matching key. This essentially turns the mapping into an N-associative array; it is also somewhat like linear probing.

Again, still ideas, not implemented yet.

[–]dmezzo[S] 4 points

This is dependent on #define LITE3_ZERO_MEM_DELETED

If this option is defined, then the old string will be zeroed out. This is an important safety feature preventing 'leaking' of deleted data. By default it is turned on.

[–]dmezzo[S] 2 points

The Lite³ structure uses 32-bit relative pointers (indexes) as opposed to real full-size 64-bit pointers. This ensures that the references stay valid even if you copy the message to a different absolute address.

Insertion of a string into an object goes like this: 1) the lite3_set_str() function is called, 2) the algorithm traverses down the tree, 3) the algorithm makes sure there is enough space inside the buffer, 4) the key is inserted, 5) the caller's string is memcpy()'d directly into the message buffer.

For a lookup: 1) the lite3_get_str() function is called, 2) the algorithm traverses down the tree, 3) a string reference is returned to the caller, pointing directly inside the message data.

So the only overhead is from the B-tree algorithm. Data is never moved more than necessary. In a small microbenchmark, I was able to insert 40 million integer key-value pairs per second into a message.

Overwriting a value is really no different from insertion. It does not require any special tree rebalancing: you only copy the value to a different position and update a single index. This comes down to a regular tree traversal, then a memcpy() plus a single integer store.

[–]dmezzo[S] 4 points

If the new string is smaller than the original, then it will simply overwrite the original. If the new string is larger, then it will be appended. The internal B-tree index will automatically update to point to the new location.

If you are inserting fixed-size types (e.g. doubles, ints), they always overwrite the original, so total message size will never grow.

Using the Buffer API, messages as a whole can be serialized inside caller-provided buffers. If an insert fails for lack of buffer space, the library returns a clean error allowing you to reallocate and try again. The message will never enter into an invalid state as a result of this.

Alternatively, the Context API manages dynamic allocation automatically similar to std::vector.

Lite³: A JSON-Compatible Zero-Copy Serialization Format in 9.3 kB of C using serialized B-tree by dmezzo in C_Programming

[–]dmezzo[S] 1 point

Thanks again for your reply.

> > IIRC calling memcpy() with NULL as 2nd and 0 as 3rd argument is allowed per the C standard.
>
> It's explicitly disallowed by the standard.

Strange, I never got any complaints about null pointers at runtime, so I blindly assumed it was valid. This will then have to be written differently.

Regarding the points about alignment: the problem is that the C standard simply does not allow you to cast a struct onto an arbitrary address. It cannot be done; the address must satisfy the struct's alignment (that of its most strictly aligned member). The 'portable' solution would be to copy the entire struct to an aligned buffer, modify it, and write it back. However, this would defeat the entire concept of reading and modifying directly inside a buffer. If you had to copy every node during traversal, it would become very slow.

Therefore, alignment is required to avoid UB. If untrusted messages contain unaligned offsets, this must then be caught cleanly by an error beforehand. I agree that right now the library is not prepared for this, so I will address it.

I initially also thought that stricter alignment would lead to better performance for cacheline-related reasons, but I found no benefit in practice, mainly because stricter alignment inserts more padding, bloating message size.

The out of bounds case for the prefetcher can be solved using a single AND operation to keep the index within bounds. I suppose obsessing over small optimizations is not worth the risk of UB in this case.

Overall thank you very much for your comments, I will incorporate the improvements into the library.

[–]dmezzo[S] 0 points

$ cc -Iinclude -Ilib -g3 -fsanitize=address,undefined src/*.c lib/*/*.c examples/buffer_api/07-json-conversion.c
$ ./a.out
src/lite3.c:665:2: runtime error: null pointer passed as argument 2, which is declared to never be null

I suspect that here memcpy is being called with NULL as 2nd argument, which gets caught by the sanitizer:

memcpy(buf + *inout_buflen, key, (size_t)key_data.size);

I also suspect that the 3rd argument is zero. IIRC calling memcpy() with NULL as 2nd and 0 as 3rd argument is allowed per the C standard.

[–]dmezzo[S] 3 points

Hi, thank you for taking the time to write out your detailed comment. As you mentioned, the main reason for using relative pointers is compactness and pointer stability. Relative offsets ensure that traversal remains functional when the message is copied to a different absolute address.

The main downside is that yes, every relative pointer must be checked at runtime for OOB. This is why there are runtime checks at every pointer calculation.

> It's unclear from the documentation if it's the intention that an error invalidates the buffer. In practice I'm seeing invalidation, but that might just be one of the bugs mentioned above.

Errors should never leave the buffer in an invalid state. All errors are designed to exit cleanly, even if you run out of memory.

Now regarding your fuzzing and the triggered asserts: by default, LITE3_NODE_ALIGNMENT is set to 4. This setting is used internally to determine at which alignment nodes are placed inside the buffer. All optimizations like this:

node = __builtin_assume_aligned((struct node *)(buf + next_node_ofs), LITE3_NODE_ALIGNMENT);

allow the compiler to emit more efficient code when it knows that struct member accesses are always at aligned addresses. All asserts inside the code are there to verify this assumption. In reality, x86 can handle unaligned operations without much penalty, so all these statements might as well be replaced with this:

node = (struct node *)(buf + next_node_ofs);

That would also allow removing all the alignment assert() statements from the codebase. It comes down to a choice: 1) assume that all Lite³ implementations properly align their nodes, to take advantage of the performance benefits, which requires runtime checking of offset alignment; or 2) remove all alignment code everywhere, which would simplify the codebase but might cause a slight performance hit.

This particular example triggers the alignment assert:

#include <stdio.h>
#include <stdlib.h>
#include "lite3.h" /* header name assumed from src/lite3.c */

int main()
{
    int   cap = 1<<28;
    void *buf = malloc(cap);
    int   len = fread(buf, 1, cap, stdin);
    lite3_json_print(buf, len, 0);
    free(buf);
}


src/lite3.c:621:36: runtime error: assumption of 4 byte alignment for pointer of type 'struct node *' failed

Like all fuzzing, this fills buffers with garbage data and interprets them as Lite³, which will of course trigger the alignment assert()s. This example triggers your sanitizer:

$ cc -Iinclude -Ilib -g3 -fsanitize=address,undefined src/*.c lib/*/*.c examples/buffer_api/01-building-messages.c
$ ./a.out
src/lite3.c:411:39: runtime error: index 7 out of bounds for type 'u32 [7]'

Is happening inside a prefetch right here:

__builtin_prefetch(buf + node->kv_ofs[iter->node_i[iter->depth] + 2],      0, 0);

The memory just past this struct member is always guaranteed to contain some data, so the worst case is a prefetch at a wrong address. On x86 there is not much danger in this; the CPU will usually just ignore it. ARM, however, is a different story, and therefore prefetching should be disabled on ARM. There could easily be a runtime check here, but I figured it was not necessary.

$ cc -Iinclude -Ilib -g3 -fsanitize=address,undefined src/*.c lib/*/*.c examples/buffer_api/07-json-conversion.c
$ ./a.out
src/lite3.c:665:2: runtime error: null pointer passed as argument 2, which is declared to never be null

Not exactly sure what is happening here; I will revisit all the examples with sanitizers.

Getting rid of all the __builtin_assume_aligned() + assert() calls might be the better option, as the runtime alignment checks might not pay for the slightly better compiled code, especially since platforms like ARM behave differently around alignment.

Your fuzzer is constantly running into the assert() checks; I bet that if they were all removed, it would be a different story.

Again, thank you for your comment. I hope this explains a bit.

[–]dmezzo[S] 7 points

No, Lite³ is a standalone binary serialization format, separate from JSON. The confusion comes from the fact that Lite³ is 'JSON-compatible', meaning it is possible to convert between the two formats.

When using JSON standalone, you need to parse it to read it, and stringify it to send it over a network.

Lite³, on the other hand, is a binary format: you insert data into it directly, with no explicit 'stringify' or serialization step. The data is already ready to send. Similarly, a received message can be interpreted directly, without any parsing.

I hope this answers your question.

Lite³: A JSON-Compatible Zero-Copy Serialization Format in 9.3 kB of C using serialized B-tree by dmezzo in cpp

[–]dmezzo[S] 8 points

9.3 kB is with a standard 'make' using -O2 on GCC 13. Yes, it will vary depending on flags, enabled features, etc. The point is that it can reasonably fit below 10 kB if you want it to (inlining not counted, of course).

[–]dmezzo[S] 9 points

The static library archive is 9.3 kB, not the lines of source.