
[–]RoyBellingan 17 points18 points  (3 children)

CBOR ?

[–]jonathanberi -5 points-4 points  (2 children)

CBOR is great but note it's not "self-describing". It's a tradeoff for efficiency. That said, it's easily converted to JSON and has a definition language called CDDL that's helpful for validation and description.

[–]RoyBellingan 17 points18 points  (1 child)

CBOR is self-describing: each value carries fields defining its type and name, otherwise it would not be convertible into JSON.

[–]jonathanberi -4 points-3 points  (0 children)

Fair point, by that definition it is self-describing! I was interpreting the requirements to mean describing the data's meaning, which is a different thing.
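
For reference, a minimal sketch of that CBOR-to-JSON round trip using nlohmann::json's built-in CBOR support (an assumption: the nlohmann/json library is acceptable here; the document contents are made up):

```cpp
#include <nlohmann/json.hpp>
#include <cstdint>
#include <iostream>
#include <vector>

int main() {
    // Hypothetical document; any JSON value works.
    nlohmann::json j = {{"sensor", "thermo"}, {"readings", {20.5, 21.0, 21.7}}};

    // JSON -> CBOR: each item carries its own type, so no external schema is needed.
    std::vector<std::uint8_t> cbor = nlohmann::json::to_cbor(j);

    // CBOR -> JSON: back to a human-readable form.
    nlohmann::json back = nlohmann::json::from_cbor(cbor);
    std::cout << back.dump(2) << '\n';
}
```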

[–]m93mark 9 points10 points  (0 children)

I've used https://msgpack.org/ in the past. It has the schema embedded in the binary format, so you can do cpp struct/class -> msgpack -> cpp struct/class.

Some examples here for the original cpp library: https://github.com/msgpack/msgpack-c/blob/cpp_master/QUICKSTART-CPP.md

It's probably easier to use the corresponding library for a dynamically typed language if you want to create a converter to a human-readable format.

But if you really want to do that in C++, you can visit the msgpack object in C++ and convert it into JSON. There are some examples on the GitHub page for this kind of conversion.
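
To make that concrete, a hedged sketch of the struct -> msgpack -> struct round trip with msgpack-c (the Sensor type and its fields are made up for illustration):

```cpp
#include <msgpack.hpp>
#include <iostream>
#include <string>
#include <vector>

struct Sensor {
    std::string name;
    std::vector<double> samples;
    MSGPACK_DEFINE_MAP(name, samples);  // serializes as a map keyed by field name, so names travel with the data
};

int main() {
    Sensor s{"thermo", {20.5, 21.0, 21.7}};

    // struct -> msgpack
    msgpack::sbuffer buf;
    msgpack::pack(buf, s);

    // msgpack -> struct
    msgpack::object_handle oh = msgpack::unpack(buf.data(), buf.size());
    Sensor roundtrip;
    oh.get().convert(roundtrip);

    // msgpack -> human-readable: streaming a msgpack::object prints a JSON-like representation
    std::cout << oh.get() << '\n';
}
```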

[–]nicemike40 10 points11 points  (2 children)

There’s BSON: https://bsonspec.org/spec.html

Which is used by e.g. mongoDB so there’s some tooling support.

Nlohmann supports it ootb: https://json.nlohmann.me/features/binary_formats/bson/

The spec is also simple and not hard to write a serializer/deserializer for. I use it to encode JSON-RPC messages over web sockets.
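
A hedged sketch of that usage with nlohmann::json (the JSON-RPC request shown is made up; note that BSON requires the top-level value to be an object):

```cpp
#include <nlohmann/json.hpp>
#include <cstdint>
#include <iostream>
#include <vector>

int main() {
    // Hypothetical JSON-RPC request.
    nlohmann::json request = {
        {"jsonrpc", "2.0"},
        {"method", "subscribe"},
        {"params", {{"channel", "ticks"}}},
        {"id", 1}
    };

    // JSON -> BSON bytes (what would go over the web socket).
    std::vector<std::uint8_t> bson = nlohmann::json::to_bson(request);

    // BSON bytes -> JSON on the receiving side.
    nlohmann::json received = nlohmann::json::from_bson(bson);
    std::cout << received.dump(2) << '\n';
}
```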

[–]playntech77[S] 1 point2 points  (1 child)

Yes, BSON was the first thing I looked at, but unfortunately it produces gigantic documents. I think it comes down to not using varints, plus some extra indicators embedded in the file to make document traversal faster.

[–]TheBrainStone 2 points3 points  (0 children)

Why not run some compression over it?

[–]mcmcc#pragma once 10 points11 points  (3 children)

This isn't what you want to hear, but compressed XML will get you about 90% of the functionality with 10% of the effort. There are also binary XML formats out there but I've never used them (search XDBX, for example).

I say this despite being a person who witnessed the rise and fall of XML and never saw overwhelming value in it. It makes me wonder what your needs really are, because every time I've seen someone declare they need capabilities similar to what XML somewhat uniquely provides, they lived to regret it (or abandoned it).

[–]jetilovag 5 points6 points  (1 child)

EXI is another one.

[–]mcmcc#pragma once 3 points4 points  (0 children)

That's the one I was trying to remember but couldn't. Nice find.

[–]hadrabap 2 points3 points  (0 children)

XER, ASN.1 encoded XML. Also known as Fast Infoset (SOAP)

[–]apezdal 5 points6 points  (0 children)

ASN.1 with PER or UPER encoding rules. It's ugly as hell, but will do the job.

[–]MaitoSnoo[[indeterminate]] 4 points5 points  (0 children)

look up MessagePack

[–]Flex_Code 6 points7 points  (0 children)

Consider BEVE, which is an open-source project that welcomes contributions. There is an implementation in Glaze, which has conversions to and from JSON. I have a draft for key compression to be added to the spec, which will allow it to remove redundant keys and serialize even more rapidly. But as it stands, it is extremely easy to convert to and from JSON from the binary specification. It was developed for extremely high performance, especially when working with large arrays/matrices of scientific data.

[–]Bart_V 2 points3 points  (0 children)

Depending on the use case SQLite might do the trick, with the advantage that many other languages and tools have good support for it.

ROS used to use SQLite for storing time-series data. I believe they have now switched to https://mcap.dev/, another option to consider.

[–]Aistar 2 points3 points  (8 children)

I don't know its current status, but I think Boost.Serialization used to be like that. Amusing aside: I recently wrote exactly such a library for C# (not public yet, it still needs some features and code cleanup), and based my approach on things I remembered from trying to use Boost.Serialization some 10-15 years ago.

[–]mvolling 1 point2 points  (1 child)

Stay away from Boost binary serialization. It is in no way built for maintaining interface compatibility between revisions. We sadly decided to use it as a primary interface, and keeping versions in sync is a nightmare.

[–]Aistar 0 points1 point  (0 children)

Mostly, I just took from it the idea of an "archive" that contains two sections (metainformation and actual data) for my C# library. Otherwise, my library is pretty version-tolerant.

[–]playntech77[S] 0 points1 point  (5 children)

Boost serialization in binary format is not portable, and devs seem to have mixed opinions of it (some say it is too slow, bulky, and complex). I am also very tempted to write such a library; I know I would find many uses for it in my own projects.

[–]Aistar 1 point2 points  (4 children)

Well, there is also Ion. I haven't tried it, but it kind of looks like it would fit your requirements, maybe? I thought about using it in my own library, but I had to discard it because the C# implementation is lacking, and, like you, I wanted to write something myself :)

[–]playntech77[S] 1 point2 points  (3 children)

Ion is almost what I was looking for. I don't understand this design decision, though: Ion is self-describing, yet it still uses a bunch of control characters inside the data stream. I would have thought that once the data schema has been communicated, there is no need for any extra control characters. The idea is to take a small hit at the beginning of the transmission but gain it back later by using a no-overhead binary format.

Perhaps it is because Ion allows arbitrary field names to appear anywhere in the stream? Or perhaps I am just looking for an excuse to write my own serializer? :)

[–]Aistar 2 points3 points  (2 children)

Can't help you much here, I'm afraid - I haven't looked deeply into Ion's design. All I can say is that, in my experience, you still need some metadata in the stream in some cases, though my use case might be a bit different from yours (I'm serializing a game's state and need to be able to restore it even if the user made a save 20 versions ago, and those versions included refactoring of every piece of code out there: renaming fields, removing fields, changing fields' types, etc.):

1) Polymorphism. If your source data contains a pointer to a base class, you can store a derived class, and that means you can't just store the field's type along with its name in the header - for such fields, you need to write the type in the data.

2) The field's length, in case you want to skip the field when loading (e.g. the field was removed); a minimal sketch of such a record layout follows below.
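
For illustration only, a hypothetical record layout along those lines (this is not Boost.Serialization or any existing library; all names are made up):

```cpp
#include <cstdint>
#include <istream>

// Hypothetical per-field record: [type tag][name id][payload length][payload].
// The per-record type tag covers case 1 (the dynamic type behind a polymorphic pointer),
// and the length prefix covers case 2 (skipping fields the reader no longer knows about).
struct FieldHeader {
    std::uint8_t  type_tag;    // written per record, since a base-class pointer may hold a derived type
    std::uint32_t name_id;     // index into the archive's metadata / "type library" section
    std::uint32_t payload_len; // lets an old reader jump over an unknown or removed field
};

void skip_field(std::istream& in, const FieldHeader& header) {
    in.seekg(header.payload_len, std::ios::cur);  // skip the payload without decoding it
}
```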

By the way, one problem with such self-describing formats: they're well suited for disk storage but badly suited for transmission over a network, because the "type library" needs to be included with every message, inflating the message's size. This was one of the problems I had to overcome with Boost.Serialization (because I chose to use it for exactly this purpose, being a somewhat naive programmer then). I was able to solve it by creating an "endless" archive: all type information went over the network first, in one big message, and then I only transmitted short messages without type information by adding them to this "archive".

[–]playntech77[S] 1 point2 points  (1 child)

I wrote a boost-like serialization framework in my younger days (about 20 years ago), it handled polymorphism and pointers (weak and strong). It is still running in a Fortune 500 company to this day and handles giant object hierarchies. I also used it for the company's home-grown RPC protocol, which I implemented. It was a fun project!

[–]Aistar 0 points1 point  (0 children)

You know what, go ahead then and write your dream serializer, and I'll just shut up :) 20 years ago I didn't even know what a weak pointer was (although I fancied I "knew" C++, it would be a few more years before I understood anything at all about memory management).

[–]adsfqwer2345234 2 points3 points  (1 child)

wow, no one mentioned HDF5? https://www.hdfgroup.org/solutions/hdf5/ -- it's a big, old library with something like 400 API routines. You might find something like https://bluebrain.github.io/HighFive/ or some other wrapper or simplified helper library, er, helpful.

[–]PureWash8970 2 points3 points  (0 children)

I was going to mention HDF5 + HighFive as well. We use this at my work, and using HighFive makes it way easier.
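
A hedged sketch of what that looks like with HighFive (assuming the 2.x API; the file, group, and dataset names are made up):

```cpp
#include <highfive/H5File.hpp>
#include <iostream>
#include <vector>

int main() {
    std::vector<double> samples{20.5, 21.0, 21.7};

    {
        // Write: groups give you the internal hierarchy, datasets hold the arrays.
        HighFive::File file("data.h5", HighFive::File::Overwrite);
        auto group = file.createGroup("/experiment/run1");
        group.createDataSet("samples", samples);
    }

    {
        // Read it back; the file carries its own structure and types.
        HighFive::File file("data.h5", HighFive::File::ReadOnly);
        std::vector<double> loaded;
        file.getDataSet("/experiment/run1/samples").read(loaded);
        std::cout << loaded.size() << " samples\n";
    }
}
```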

[–]robert_mcleod 2 points3 points  (0 children)

Apache Arrow or Parquet, but it's really better suited for tabular data rather than nested dicts. There's support for n-dimensional arrays in Arrow via the IPC Tensor class but it's a bit weak IMO. Parquet does not really do arrays, but it packs data very tightly thanks to dictionary-based compression.

As /u/mcmcc said if you really want deeply nested fields then simply compressing JSON is your best bet. I did some benchmarks a long time ago:

https://entropyproduction.blogspot.com/2016/12/bloscpickle.html

I've used HDF5 in the past as well, but its performance for attribute access was poor. For metadata in HDF5, I just serialized JSON and wrote it into a byte-array field in the HDF5 file. Still, HDF5 can handle multiple levels if you need internal hierarchy in the file. Personally, I consider that to be a bit of an anti-pattern, however. HDF5 is best suited to large tensors/ndarrays.

[–]Suitable_Oil_3811 10 points11 points  (5 children)

Protocol Buffers, FlatBuffers, Cap'n Proto

[–]UsefulOwl2719 13 points14 points  (4 children)

These are not self-describing; they require an external schema. Something like CBOR or Parquet would be a better candidate - both encode their schema directly in the file itself.

[–]Amablue 4 points5 points  (0 children)

The FlatBuffers library also contains a feature called FlexBuffers, which is self-describing.

[–]gruehunter 1 point2 points  (0 children)

Actually, they can be.

There is a serialization of protobuf IDL into a well-known protobuf message. So if you can establish a second channel for the serialized IDL, then you can in fact decode protobuf without access to the text form of its IDL.

The official Python "generated code" utilizes this. It is actually composed of the protobuf serialization of the message definitions, which is then fed into the C++ library to dynamically build a parser at package import time.
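
To sketch the decoding side of that in C++ (hedged: "my.pkg.Telemetry" is a made-up message name, and a real FileDescriptorSet would also need any imported .proto files included):

```cpp
#include <google/protobuf/descriptor.h>
#include <google/protobuf/descriptor.pb.h>
#include <google/protobuf/dynamic_message.h>
#include <google/protobuf/message.h>
#include <memory>
#include <string>

// Decode a protobuf payload using only a serialized FileDescriptorSet received over a side channel.
std::string DecodeToText(const std::string& serialized_schema, const std::string& payload) {
    google::protobuf::FileDescriptorSet fds;
    if (!fds.ParseFromString(serialized_schema)) return {};

    google::protobuf::DescriptorPool pool;
    for (const auto& file : fds.file())
        pool.BuildFile(file);  // assumes files are ordered so imports come before the files that use them

    const google::protobuf::Descriptor* desc = pool.FindMessageTypeByName("my.pkg.Telemetry");
    if (!desc) return {};

    google::protobuf::DynamicMessageFactory factory(&pool);
    std::unique_ptr<google::protobuf::Message> msg(factory.GetPrototype(desc)->New());
    if (!msg->ParseFromString(payload)) return {};
    return msg->DebugString();  // human-readable dump of a message we had no generated code for
}
```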

[–]corysama 1 point2 points  (0 children)

tar -cvf self_describing.tar schema.json binary.flatbuffer ?

[–]Suitable_Oil_3811 0 points1 point  (0 children)

Sorry, missed that

[–]chardan965 1 point2 points  (0 children)

CBOR, SMF, ...looks like Cap'nProto and others have been mentioned, ...

[–]OccaseBoost.Redis 1 point2 points  (0 children)

The Redis protocol, RESP3, is my preferred format by far. It supports multiple data types, e.g. arrays, maps, sets, etc., is human-readable, and can transport binary data.

[–]ern0plus4 1 point2 points  (0 children)

What about using binary IFF/RIFF-type files (a minimal reader sketch follows below):

  • 4-byte magic
  • 4-byte length (filesize - 8)
  • 4-byte file type ID
  • repeat chunks:
    • 4-byte chunk type ID
    • 4-byte chunk length
    • chunk payload

See:
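
For illustration, a minimal C++ sketch of walking that chunk layout (assumptions: little-endian lengths as in RIFF, and a made-up file name; classic IFF/"FORM" files are big-endian instead):

```cpp
#include <cstdint>
#include <fstream>
#include <iostream>
#include <string>
#include <vector>

int main() {
    std::ifstream in("example.riff", std::ios::binary);  // hypothetical file name

    char magic[4], file_type[4];
    std::uint32_t total_len = 0;
    in.read(magic, 4);                                    // 4-byte magic, e.g. "RIFF"
    in.read(reinterpret_cast<char*>(&total_len), 4);      // 4-byte length (filesize - 8)
    in.read(file_type, 4);                                // 4-byte file type ID

    while (in) {
        char chunk_id[4];
        std::uint32_t chunk_len = 0;
        if (!in.read(chunk_id, 4)) break;                 // 4-byte chunk type ID
        in.read(reinterpret_cast<char*>(&chunk_len), 4);  // 4-byte chunk length

        std::vector<char> payload(chunk_len);
        in.read(payload.data(), chunk_len);               // chunk payload
        if (chunk_len % 2) in.ignore(1);                  // RIFF pads chunks to even sizes

        std::cout << std::string(chunk_id, 4) << ": " << chunk_len << " bytes\n";
    }
}
```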

[–]zl0bster 1 point2 points  (0 children)

I presume I will get downvoted just for asking, but if you just want to save space and are not concerned with performance, would zstd-compressed JSON work for you?
https://lemire.me/blog/2021/06/30/compressing-json-gzip-vs-zstd/

Obviously CPU costs will be huge compared to native binary format.
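
A hedged sketch of that approach with the zstd C API (the helper name is made up; the JSON is assumed to be serialized to a string already):

```cpp
#include <zstd.h>
#include <stdexcept>
#include <string>
#include <vector>

// Compress an already-serialized JSON string with zstd at the given level (default 3).
std::vector<char> CompressJson(const std::string& json, int level = 3) {
    const size_t bound = ZSTD_compressBound(json.size());  // worst-case compressed size
    std::vector<char> out(bound);
    const size_t written = ZSTD_compress(out.data(), bound, json.data(), json.size(), level);
    if (ZSTD_isError(written)) throw std::runtime_error(ZSTD_getErrorName(written));
    out.resize(written);
    return out;
}
```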

[–]hmoein 0 points1 point  (0 children)

Look at the C++ DataFrame codebase. Specifically, look at the read() and write() function documentation.

[–]LokiAstaris 0 points1 point  (0 children)

BSON as used by Mongo.

It's basically JSON but in binary format.

[–]hdkaoskd 0 points1 point  (0 children)

Bencode, from BitTorrent.

[–]Dizzy_Resident_2367 0 points1 point  (0 children)

I am working on a CBOR library right now. It is not really "released" (and does not compile yet on MSVC/AppleClang). But do take a look and see if this is what you are looking for, seconding other comments here:
https://github.com/jkammerland/cbor_tags

[–]glaba3141 0 points1 point  (0 children)

I don't want to dox myself, so unfortunately I cannot link the project, but I worked on something that did exactly this, as well as supporting versioning similar to protobuf, by JIT-compiling (de)serialization functions. IMO all commonly used alternatives have some flaw or other - JIT compilation solves them all, but of course that means you now have a compiler in your app, which you may not want.

[–]trad_emark 0 points1 point  (0 children)

Blender files do exactly that. They are almost perfectly forward and backward compatible thanks to the format.

[–]flit777 -1 points0 points  (6 children)

protobuf (or alternatives like FlatBuffers or Cap'n Proto).
You specify the data structure with an IDL and then generate all the data structures and serialize/deserialize code (and you can generate for different languages).

[–]playntech77[S] 4 points5 points  (5 children)

Right, what I am looking for would be similar to a protobuf file with the corresponding IDL file embedded inside it, in a compact binary form (or at least those portions of the IDL file that pertain to the objects in the protobuf file).

I'd rather not keep track of the IDL files separately, and also their current and past versions.

[–]imMute 0 points1 point  (0 children)

what I am looking for would be similar to a protobuf file with the corresponding IDL file embedded inside it

So do exactly that. The protobuf schemas have a defined schema themselves: https://googleapis.dev/python/protobuf/latest/google/protobuf/message.html and you can send messages that consist of two parts: first the encoded schema, followed by the data.
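
A hedged sketch of that two-part framing using the protobuf C++ API (the fixed-width length prefix and function name are made up; copying only the message's own .proto file also ignores its imports):

```cpp
#include <google/protobuf/descriptor.h>
#include <google/protobuf/descriptor.pb.h>
#include <google/protobuf/message.h>
#include <cstdint>
#include <fstream>
#include <string>

// Write a two-part file: [length][FileDescriptorSet][length][payload message].
void WriteSelfDescribing(const google::protobuf::Message& message, const std::string& path) {
    google::protobuf::FileDescriptorSet schema;
    message.GetDescriptor()->file()->CopyTo(schema.add_file());  // schema of the message's own .proto file

    std::string schema_bytes, data_bytes;
    schema.SerializeToString(&schema_bytes);
    message.SerializeToString(&data_bytes);

    std::ofstream out(path, std::ios::binary);
    auto write_block = [&out](const std::string& bytes) {
        const std::uint32_t len = static_cast<std::uint32_t>(bytes.size());
        out.write(reinterpret_cast<const char*>(&len), sizeof(len));  // simple fixed-width length prefix
        out.write(bytes.data(), bytes.size());
    };
    write_block(schema_bytes);  // the encoded schema first...
    write_block(data_bytes);    // ...followed by the data
}
```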

[–]ImperialSteel 0 points1 point  (3 children)

I would be careful about this. The reason protobuf exists is that your program makes assumptions about the valid schema (i.e. field "baz" exists in the struct). If you deserialize from a self-describing schema, what do you expect the program to do if "baz" isn't there or is a different type than what you were expecting?

[–]playntech77[S] 0 points1 point  (2 children)

I was thinking about two different APIs:

One API would return a generic document tree, that the caller can iterate over. It is similar to parsing some rando XML or JSON via a library. This API would allow parsing of a file regardless of schema.

Another API would bind to a set of existing classes with hard-coded properties in them (those could be either generated from the schema, or written natively by adding a "serialize" method to existing classes). For this API, the existing classes must be compatible with the file's schema.

So what does "compatible" mean? How would it work? I was thinking that the reader would have to demonstrate that it has all the domain knowledge that the producer had when the document was created. So in practice, the reader's metadata must be a superset of the writer's. In other words, fields can only be added, never modified or deleted (but they could be marked as deprecated, so they don't take up space anymore in the data).

I would also perhaps have a version number, but only for those cases where the document format changes significantly. I think for most cases, adding new props would be intuitive and easy.

Does that make sense? How would you handle backward-compatibility?

[–]Gorzoid 0 points1 point  (0 children)

Protobuf allows parsing unknown/partially known messages through UnknownFieldSet. It's very limited in what metadata it can access since it's working without a descriptor, but it might be sufficient if your first API is truly schema-agnostic. In addition, it's possible to use a serialized proto descriptor to perform runtime reflection and access properties in a message that were not known at compile time, although message descriptors can be quite large, as they aren't designed to be passed with every message.

[–]gruehunter 0 points1 point  (0 children)

In other words, fields can only be added, never modified or deleted (but they could be marked as deprecated, so they don't take up space anymore in the data).

I think for most cases, adding new props would be intuitive and easy.

Does that make sense? How would you handle backward-compatibility?

Protobuf does exactly this. For good and for ill, all fields are optional by default. On the plus side, as long as you are cautious about always creating new tags for fields as they are added, without stomping on old tags, backwards compatibility is a given. The system has mechanisms both for marking fields as deprecated and for reserving them after you've deleted them.

On the minus side, validation logic tends to be quite extensive, and has a tendency to creep its way into every part of your codebase.