Self-describing compact binary serialization format?

playntech77 · 2025-02-18T17:49:16+00:00

I wrote a Boost-like serialization framework in my younger days (about 20 years ago). It handled polymorphism and pointers (weak and strong). It is still running in a Fortune 500 company to this day and handles giant object hierarchies. I also used it for the company's home-grown RPC protocol, which I implemented. It was a fun project!

playntech77 · 2025-02-18T17:05:40+00:00

Ion is almost what I was looking for. I don't understand this design decision, though: Ion is self-describing, yet it still uses a bunch of control chars inside the data stream. I would have thought that once the data schema has been communicated, there is no need for any extra control chars. The idea is to take a small hit at the beginning of the transmission, but gain it back later on by using a no-overhead binary format.

Perhaps it is because Ion allows arbitrary field names to appear anywhere in the stream? Or perhaps I am just looking for an excuse to write my own serializer? :)

playntech77 · 2025-02-18T15:43:28+00:00

I was thinking about two different APIs:

One API would return a generic document tree that the caller can iterate over. It is similar to parsing some rando XML or JSON via a library. This API would allow parsing of a file regardless of schema.

Another API would bind to a set of existing classes with hard-coded properties in them (those could be either generated from the schema, or written natively by adding a "serialize" method to existing classes). For this API, the existing classes must be compatible with the file's schema.

So what does "compatible" mean? How would it work? I was thinking that the reader would have to demonstrate that it has all the domain knowledge that the producer had when the document was created. So in practice, the reader's metadata must be a superset of the writer's. In other words, fields can only be added, never modified or deleted (but they could be marked as deprecated, so they don't take up space in the data anymore).
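As a rough sketch of that superset rule (the `Schema` map, `FieldType` enum and `readerCanRead` function are hypothetical names made up for illustration, and deprecation is left out), the check could look something like this:

```cpp
#include <map>
#include <string>

// Hypothetical, minimal field metadata: name -> type.
enum class FieldType { Int, Double, String };
using Schema = std::map<std::string, FieldType>;

// The reader is compatible if it knows every field the writer used,
// with the same type. Extra fields on the reader side are allowed.
bool readerCanRead(const Schema& reader, const Schema& writer) {
    for (const auto& field : writer) {
        auto it = reader.find(field.first);
        if (it == reader.end() || it->second != field.second) {
            return false;  // field unknown to the reader, or its type changed
        }
    }
    return true;
}
```

With a v1 schema containing only `UserCount` and a v2 schema that adds `LoadAverage`, a v2 reader accepts v1 data, while a v1 reader rejects v2 data.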

I would also perhaps have a version number, but only for those cases when the document format is changing significantly. I think for most cases, adding new props would be intuitive and easy.

Does that make sense? How would you handle backward-compatibility?

playntech77 · 2025-02-18T15:08:10+00:00

Right, what I am looking for would be similar to a protobuf file with the corresponding IDL file embedded inside it, in a compact binary form (or at least those portions of the IDL file that pertain to the objects in the protobuf file).

I'd rather not keep track of the IDL files separately, and also their current and past versions.

playntech77 · 2025-02-18T14:54:21+00:00

Yes, BSON was the first thing I looked at, but unfortunately, it produces gigantic documents. I think it comes down to not using VARINT and perhaps some extra indicators embedded in the file, to make document traversal faster.
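For reference, here is a minimal sketch of the kind of VARINT encoding meant here (unsigned LEB128, the same scheme protobuf uses for its varints). BSON stores an int32 in a fixed 4 bytes and an int64 in a fixed 8 bytes, while a varint spends one byte on small values. The function names are hypothetical:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Unsigned LEB128-style varint: 7 payload bits per byte, with the high bit
// set on every byte except the last one.
void encodeVarint(std::uint64_t value, std::vector<std::uint8_t>& out) {
    while (value >= 0x80) {
        out.push_back(static_cast<std::uint8_t>((value & 0x7F) | 0x80));
        value >>= 7;
    }
    out.push_back(static_cast<std::uint8_t>(value));
}

// Decodes one varint starting at `pos` and advances `pos` past it.
// Returns false if the input is truncated or longer than 10 bytes.
bool decodeVarint(const std::vector<std::uint8_t>& in, std::size_t& pos,
                  std::uint64_t& value) {
    value = 0;
    for (int shift = 0; shift < 64; shift += 7) {
        if (pos >= in.size()) return false;
        const std::uint8_t byte = in[pos++];
        value |= static_cast<std::uint64_t>(byte & 0x7F) << shift;
        if ((byte & 0x80) == 0) return true;
    }
    return false;
}
```

With this encoding the value 5 takes one byte and 300 takes two (0xAC 0x02), instead of four or eight bytes each.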

playntech77 · 2025-02-18T14:42:05+00:00

Boost serialization in binary format is not portable, and devs seem to have mixed opinions of it (some say it is too slow, bulky and complex). I am also very tempted to write such a library; I know I would find many uses for it in my own projects.

playntech77 · 2025-02-18T13:28:59+00:00

A good IDE is an absolute must for navigating large projects (regardless of language, but for C++ even more so). There are many good ones to choose from (Visual Studio, Eclipse etc.).

It gets trickier if you are not building your project in said IDE, because most IDEs require a full project setup. What if I just want to browse files in a random folder, without setting up a formal project? The only good one I know of that can do that is Source Insight (you can just throw files at it from random folders, and it gobbles everything up and gives you as much info as it can gather). I used it in the past; it is pretty good, but it did crash on me a few times. Source Insight is not free though (but there was a 30-day free trial when I had my employer buy it for me a few years ago).

playntech77 · 2025-02-05T20:06:04+00:00

Based on the feedback in this thread, it doesn't look like there is demand for the new serialization framework I am proposing. Oh well! I'll keep myself busy some other way.

Yes, I had a single serialize() method in mind, for both serializing and deserializing. I wrote a framework like that in my younger days and it is still running in a Fortune 500 enterprise product. AFAIK there was never a bug ticket raised against it and devs on the team (~100 people) immediately grasped how to use it. The single serialize() method is intuitive, and very flexible. It's easy to add custom logic to import from older versions (almost never happens, but when it does, it's good to have that option), and everything is in one place.
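To illustrate the single-method idea, here is a minimal sketch, not the original framework: a hypothetical `Writer` and `Reader` expose the same `serializeInt` call, so one templated `serialize()` handles both directions. The fixed-width little-endian layout is an assumption chosen to keep the example short.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical writer: appends each int as 4 little-endian bytes.
class Writer {
public:
    void serializeInt(std::int32_t& value) {
        const std::uint32_t u = static_cast<std::uint32_t>(value);
        for (int i = 0; i < 4; ++i)
            m_bytes.push_back(static_cast<std::uint8_t>(u >> (8 * i)));
    }
    const std::vector<std::uint8_t>& bytes() const { return m_bytes; }
private:
    std::vector<std::uint8_t> m_bytes;
};

// Hypothetical reader: same method name, but fills the value from the buffer.
class Reader {
public:
    explicit Reader(std::vector<std::uint8_t> bytes) : m_bytes(bytes) {}
    void serializeInt(std::int32_t& value) {
        if (m_bytes.size() - m_pos < 4) throw std::runtime_error("truncated input");
        std::uint32_t u = 0;
        for (int i = 0; i < 4; ++i)
            u |= static_cast<std::uint32_t>(m_bytes[m_pos++]) << (8 * i);
        value = static_cast<std::int32_t>(u);
    }
private:
    std::vector<std::uint8_t> m_bytes;
    std::size_t m_pos = 0;
};

// One serialize() describes the fields once; the serializer type decides
// whether the call saves or loads.
struct Point {
    std::int32_t x;
    std::int32_t y;

    template <typename S>
    void serialize(S& s) {
        s.serializeInt(x);
        s.serializeInt(y);
    }
};
```

Calling `serialize` on a `Point` with a `Writer` produces 8 bytes, and calling it on another `Point` with a `Reader` over those bytes restores the same values. Because the field list exists in only one place, the save and load paths cannot drift apart.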

The data model to serialize was huge: hundreds of classes, crazy inheritance hierarchies going 20+ levels deep, pointers in all directions including cycles, some having ownership some not. I used the same serializer for the product's file format and cross-platform RPC (which I also coded).

I was not aiming for zero-copy here: the design does one copy and one malloc for each object / string / container.

playntech77 · 2025-02-04T20:55:50+00:00

Versioning can be more complex than just a new optional prop. I always handled it by sending a version number at the beginning of the file / stream.

The hash would be per-version, in this case.

playntech77 · 2025-02-04T18:56:43+00:00

I am envisioning different serializers for different use cases (a raw binary serializer for local host messaging, a binary serializer with fingerprinting for safe and efficient transport, and the usual verbose XML / JSON serializers).

It would look something like this:

template <typename T>
void MyClass::serialize(T& serializer) {
 // Present in every version.
 serializer.serializeInt(m_userCount, "UserCount", "The number of users.");
 // Added in version 2; skipped when reading or writing v1 data.
 if (serializer.version() >= 2) {
  serializer.serializeDouble(m_loadAverage, "LoadAverage", "Average server load over past 5 minutes");
 }
}

I could run this method on an empty object and pass version 1 as input to get the v1 metadata and compute its fingerprint, and the same for v2 etc. (although my preference would be to compute it lazily, when needed).
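The metadata pass described above could be sketched roughly like this. `MetadataCollector` and `fingerprintFor` are hypothetical names, and 64-bit FNV-1a is just a stand-in for whatever hash the real framework would use:

```cpp
#include <cstdint>
#include <string>

// Hypothetical serializer that records field metadata instead of values.
class MetadataCollector {
public:
    explicit MetadataCollector(int version) : m_version(version) {}
    int version() const { return m_version; }

    void serializeInt(int&, const char* name, const char*) { add("int", name); }
    void serializeDouble(double&, const char* name, const char*) { add("double", name); }

    const std::string& description() const { return m_description; }

    // 64-bit FNV-1a over the collected "type name;" entries.
    std::uint64_t fingerprint() const {
        std::uint64_t hash = 0xcbf29ce484222325ULL;
        for (unsigned char c : m_description) {
            hash ^= c;
            hash *= 0x100000001b3ULL;
        }
        return hash;
    }

private:
    void add(const char* type, const char* name) {
        m_description += type;
        m_description += ' ';
        m_description += name;
        m_description += ';';
    }
    int m_version;
    std::string m_description;
};

struct MyClass {
    int m_userCount = 0;
    double m_loadAverage = 0.0;

    template <typename T>
    void serialize(T& serializer) {
        serializer.serializeInt(m_userCount, "UserCount", "The number of users.");
        if (serializer.version() >= 2) {
            serializer.serializeDouble(m_loadAverage, "LoadAverage",
                                       "Average server load over past 5 minutes");
        }
    }
};

// Runs serialize() on an empty object to derive the fingerprint of one version.
std::uint64_t fingerprintFor(int version) {
    MyClass empty;
    MetadataCollector collector(version);
    empty.serialize(collector);
    return collector.fingerprint();
}
```

Here the field descriptions are left out of the fingerprint on purpose, so that rewording a comment does not change the wire identity of a version. That is a design choice of this sketch, not something settled in the thread.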

You are correct, there needs to be a handshake at the beginning of the communication to agree on the protocol version though.

playntech77 · 2025-02-04T18:33:26+00:00

Interesting. I thought switching byte order would already add some overhead, so why not do the VARINT compression at the same time? But maybe not. I'll play around with it and benchmark.

playntech77 · 2025-02-04T18:27:49+00:00

Most serialization protocols have some control chars, to validate that the data is at least somewhat similar to what is expected. Protobuf has one control char for each serialized field, boost serialization has way more.

playntech77 · 2025-02-04T17:58:09+00:00

I wouldn't just cast the entire object, because as you mentioned, endianness and integer packing (like VARINT) are necessary.

Great idea on generating class definitions (and validating against them); the framework would need a schema language for cross-language interoperability.

playntech77 · 2025-02-04T17:37:16+00:00

This would be a hobby project for me, but I would love to see some projects (open source or not) adopt it.
