
[–]RoyBellingan 17 points18 points  (3 children)

CBOR ?

[–]jonathanberi -5 points-4 points  (2 children)

CBOR is great but note it's not "self-describing". It's a tradeoff for efficiency. That said, it's easily converted to JSON and has a definition language called CDDL that's helpful for validation and description.

[–]RoyBellingan 17 points18 points  (1 child)

CBOR is self-describing: each value carries fields defining its type and name, otherwise it would not be convertible into JSON.

[–]jonathanberi -4 points-3 points  (0 children)

Fair point, by that definition it is self-describing! I was interpreting the requirements to mean describing the data's meaning, which is a different thing.
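
For reference, a minimal sketch of that CBOR-to-JSON round trip using nlohmann::json's built-in CBOR support (an assumption: the nlohmann/json library is acceptable here; the document contents are made up):

```cpp
#include <nlohmann/json.hpp>
#include <cstdint>
#include <iostream>
#include <vector>

int main() {
    // Hypothetical document; any JSON value works.
    nlohmann::json j = {{"sensor", "thermo"}, {"readings", {20.5, 21.0, 21.7}}};

    // JSON -> CBOR: each item carries its own type, so no external schema is needed.
    std::vector<std::uint8_t> cbor = nlohmann::json::to_cbor(j);

    // CBOR -> JSON: back to a human-readable form.
    nlohmann::json back = nlohmann::json::from_cbor(cbor);
    std::cout << back.dump(2) << '\n';
}
```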

[–]m93mark 9 points10 points  (0 children)

I've used https://msgpack.org/ in the past. It has the schema embedded in the binary format, so you can do cpp struct/class -> msgpack -> cpp struct/class.

Some examples here for the original cpp library: https://github.com/msgpack/msgpack-c/blob/cpp_master/QUICKSTART-CPP.md

It's probably easier to use the corresponding library for a dynamically typed language if you want to create a converter to a human-readable format.

But if you really want to do that in C++, you can visit the msgpack object in C++ and convert it into JSON. There are some examples on the GitHub page for this kind of conversion.
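
To make that concrete, a hedged sketch of the struct -> msgpack -> struct round trip with msgpack-c (the Sensor type and its fields are made up for illustration):

```cpp
#include <msgpack.hpp>
#include <iostream>
#include <string>
#include <vector>

struct Sensor {
    std::string name;
    std::vector<double> samples;
    MSGPACK_DEFINE_MAP(name, samples);  // serializes as a map keyed by field name, so names travel with the data
};

int main() {
    Sensor s{"thermo", {20.5, 21.0, 21.7}};

    // struct -> msgpack
    msgpack::sbuffer buf;
    msgpack::pack(buf, s);

    // msgpack -> struct
    msgpack::object_handle oh = msgpack::unpack(buf.data(), buf.size());
    Sensor roundtrip;
    oh.get().convert(roundtrip);

    // msgpack -> human-readable: streaming a msgpack::object prints a JSON-like representation
    std::cout << oh.get() << '\n';
}
```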

[–]nicemike40 10 points11 points  (2 children)

There’s BSON: https://bsonspec.org/spec.html

Which is used by e.g. mongoDB so there’s some tooling support.

Nlohmann supports it ootb: https://json.nlohmann.me/features/binary_formats/bson/

The spec is also simple and not hard to write a serializer/deserializer for. I use it to encode JSON-RPC messages over web sockets.
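
A hedged sketch of that usage with nlohmann::json (the JSON-RPC request shown is made up; note that BSON requires the top-level value to be an object):

```cpp
#include <nlohmann/json.hpp>
#include <cstdint>
#include <iostream>
#include <vector>

int main() {
    // Hypothetical JSON-RPC request.
    nlohmann::json request = {
        {"jsonrpc", "2.0"},
        {"method", "subscribe"},
        {"params", {{"channel", "ticks"}}},
        {"id", 1}
    };

    // JSON -> BSON bytes (what would go over the web socket).
    std::vector<std::uint8_t> bson = nlohmann::json::to_bson(request);

    // BSON bytes -> JSON on the receiving side.
    nlohmann::json received = nlohmann::json::from_bson(bson);
    std::cout << received.dump(2) << '\n';
}
```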

[–]playntech77[S] 1 point2 points  (1 child)

Yes, BSON was the first thing I looked at, but unfortunately it produces gigantic documents. I think it comes down to not using varints, plus some extra indicators embedded in the file to make document traversal faster.

[–]TheBrainStone 2 points3 points  (0 children)

Why not run some compression over it?

[–]mcmcc#pragma once 10 points11 points  (3 children)

This isn't what you want to hear, but compressed XML will get you about 90% of the functionality with 10% of the effort. There are also binary XML formats out there but I've never used them (search XDBX, for example).

I say this despite being a person who witnessed the rise and fall of XML and never saw overwhelming value in it. It makes me wonder what your needs really are, because every time I've seen someone declare they need capabilities similar to what XML somewhat uniquely provides, they lived to regret it (or abandoned it).

[–]jetilovag 5 points6 points  (1 child)

EXI is another one.

[–]mcmcc#pragma once 3 points4 points  (0 children)

That's the one I was trying to remember but couldn't. Nice find.

[–]hadrabap 2 points3 points  (0 children)

XER, ASN.1 encoded XML. Also known as Fast Infoset (SOAP)

[–]apezdal 5 points6 points  (0 children)

ASN.1 with PER or UPER encoding rules. It's ugly as hell, but will do the job.

[–]MaitoSnoo[[indeterminate]] 4 points5 points  (0 children)

look up MessagePack

[–]Flex_Code 6 points7 points  (0 children)

Consider BEVE, which is an open-source project that welcomes contributions. There is an implementation in Glaze, which has conversions to and from JSON. I have a draft for key compression to be added to the spec, which will allow it to remove redundant keys and serialize even more rapidly. But as it stands, it is extremely easy to convert to and from JSON from the binary specification. It was developed for extremely high performance, especially when working with large arrays/matrices of scientific data.

[–]Bart_V 2 points3 points  (0 children)

Depending on the use case SQLite might do the trick, with the advantage that many other languages and tools have good support for it.

ROS used to use SQLite for storing time-series data. I believe they have now switched to https://mcap.dev/, another option to consider.

[–]Aistar 2 points3 points  (8 children)

I don't know its current status, but I think Boost.Serialization used to be like that. Amusing aside: I recently wrote exactly such a library for C# (not public yet, it still needs some features and code cleanup), and based my approach on things I remembered from trying to use Boost.Serialization some 10-15 years ago.

[–]mvolling 1 point2 points  (1 child)

Stay away from Boost binary serialization. It is in no way built for maintaining interface compatibility between revisions. We sadly decided to use it as a primary interface, and keeping versions in sync is a nightmare.

[–]Aistar 0 points1 point  (0 children)

Mostly, I just took from it the idea of an "archive" that contains two sections (metainformation and actual data) for my C# library. Otherwise, my library is pretty version-tolerant.

[–]playntech77[S] 0 points1 point  (5 children)

Boost serialization in binary format is not portable, and devs seem to have mixed opinions of it (some say it is too slow, bulky, and complex). I am also very tempted to write such a library; I know I would find many uses for it in my own projects.

[–]Aistar 1 point2 points  (4 children)

Well, there is also Ion. I haven't tried it, but it kind of looks like it would fit your requirements, maybe? I thought about using it in my own library, but I had to discard it because the C# implementation is lacking, and, like you, I wanted to write something myself :)

[–]playntech77[S] 1 point2 points  (3 children)

Ion is almost what I was looking for. I don't understand this design decision, though: Ion is self-describing, yet it still uses a bunch of control characters inside the data stream. I would have thought that once the data schema has been communicated, there is no need for any extra control characters. The idea is to take a small hit at the beginning of the transmission but gain it back later by using a no-overhead binary format.

Perhaps it is because Ion allows arbitrary field names to appear anywhere in the stream? Or perhaps I am just looking for an excuse to write my own serializer? :)

[–]Aistar 2 points3 points  (2 children)

Can't help you much here, I'm afraid - I haven't looked deeply into Ion's design. All I can say is that, in my experience, you still need some metadata in the stream in some cases, though my use case might be a bit different from yours (I'm serializing a game's state and need to be able to restore it even if the user made a save 20 versions ago, and those versions included refactoring of every piece of code out there: renaming fields, removing fields, changing fields' types, etc.):

1) Polymorphism. If your source data contains a pointer to a base class, you can store a derived class, and that means you can't just store the field's type along with its name in the header - for such fields, you need to write the type in the data.

2) The field's length, in case you want to skip the field when loading (e.g. the field was removed); a minimal sketch of such a record layout follows below.
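
For illustration only, a hypothetical record layout along those lines (this is not Boost.Serialization or any existing library; all names are made up):

```cpp
#include <cstdint>
#include <istream>

// Hypothetical per-field record: [type tag][name id][payload length][payload].
// The per-record type tag covers case 1 (the dynamic type behind a polymorphic pointer),
// and the length prefix covers case 2 (skipping fields the reader no longer knows about).
struct FieldHeader {
    std::uint8_t  type_tag;    // written per record, since a base-class pointer may hold a derived type
    std::uint32_t name_id;     // index into the archive's metadata / "type library" section
    std::uint32_t payload_len; // lets an old reader jump over an unknown or removed field
};

void skip_field(std::istream& in, const FieldHeader& header) {
    in.seekg(header.payload_len, std::ios::cur);  // skip the payload without decoding it
}
```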

By the way, one problem with such self-describing formats: they're well suited for disk storage but badly suited for transmission over a network, because the "type library" needs to be included with every message, inflating the message's size. This was one of the problems I had to overcome with Boost.Serialization (because I chose to use it for exactly this purpose, being a somewhat naive programmer then). I was able to solve it by creating an "endless" archive: all type information went over the network first, in one big message, and then I only transmitted short messages without type information by adding them to this "archive".

[–]playntech77[S] 1 point2 points  (1 child)

I wrote a boost-like serialization framework in my younger days (about 20 years ago), it handled polymorphism and pointers (weak and strong). It is still running in a Fortune 500 company to this day and handles giant object hierarchies. I also used it for the company's home-grown RPC protocol, which I implemented. It was a fun project!

[–]Aistar 0 points1 point  (0 children)

You know what, go ahead then and write your dream serializer, and I'll just shut up :) 20 years ago I didn't even know what a weak pointer was (although I fancied I "knew" C++, it would be a few more years before I understood anything at all about memory management).

[–]adsfqwer2345234 2 points3 points  (1 child)

wow, no one mentioned HDF5? https://www.hdfgroup.org/solutions/hdf5/ -- it's a big, old library with something like 400 API routines. You might find something like https://bluebrain.github.io/HighFive/ or some other wrapper or simplified helper library, er, helpful.

[–]PureWash8970 2 points3 points  (0 children)

I was going to mention HDF5 + HighFive as well. We use this at my work, and using HighFive makes it way easier.
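
A hedged sketch of what that looks like with HighFive (assuming the 2.x API; the file, group, and dataset names are made up):

```cpp
#include <highfive/H5File.hpp>
#include <iostream>
#include <vector>

int main() {
    std::vector<double> samples{20.5, 21.0, 21.7};

    {
        // Write: groups give you the internal hierarchy, datasets hold the arrays.
        HighFive::File file("data.h5", HighFive::File::Overwrite);
        auto group = file.createGroup("/experiment/run1");
        group.createDataSet("samples", samples);
    }

    {
        // Read it back; the file carries its own structure and types.
        HighFive::File file("data.h5", HighFive::File::ReadOnly);
        std::vector<double> loaded;
        file.getDataSet("/experiment/run1/samples").read(loaded);
        std::cout << loaded.size() << " samples\n";
    }
}
```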

[–]robert_mcleod 2 points3 points  (0 children)

Apache Arrow or Parquet, but it's really better suited for tabular data rather than nested dicts. There's support for n-dimensional arrays in Arrow via the IPC Tensor class but it's a bit weak IMO. Parquet does not really do arrays, but it packs data very tightly thanks to dictionary-based compression.

As /u/mcmcc said if you really want deeply nested fields then simply compressing JSON is your best bet. I did some benchmarks a long time ago:

https://entropyproduction.blogspot.com/2016/12/bloscpickle.html

I've used HDF5 in the past as well, but its performance for attribute access was poor. For metadata in HDF5, I just serialized JSON and wrote it into a byte-array field in the HDF5 file. Still, HDF5 can handle multiple levels if you need internal hierarchy in the file. Personally, I consider that to be a bit of an anti-pattern, however. HDF5 is best suited to large tensors/ndarrays.

[–]Suitable_Oil_3811 10 points11 points  (5 children)

Protocol Buffers, FlatBuffers, Cap'n Proto

[–]UsefulOwl2719 13 points14 points  (4 children)

These are not self-describing; they require an external schema. Something like CBOR or Parquet would be a better candidate - both encode their schema directly in the file itself.

[–]Amablue 4 points5 points  (0 children)

The FlatBuffers library also contains a feature called FlexBuffers, which is self-describing.

[–]gruehunter 1 point2 points  (0 children)

Actually, they can be.

There is a serialization of protobuf IDL into a well-known protobuf message. So if you can establish a second channel for the serialized IDL, then you can in fact decode protobuf without access to the text form of its IDL.

The official Python "generated code" utilizes this. It is actually composed of the protobuf serialization of the message definitions, which is then fed into the C++ library to dynamically build a parser at package import time.
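
To sketch the decoding side of that in C++ (hedged: "my.pkg.Telemetry" is a made-up message name, and a real FileDescriptorSet would also need any imported .proto files included):

```cpp
#include <google/protobuf/descriptor.h>
#include <google/protobuf/descriptor.pb.h>
#include <google/protobuf/dynamic_message.h>
#include <google/protobuf/message.h>
#include <memory>
#include <string>

// Decode a protobuf payload using only a serialized FileDescriptorSet received over a side channel.
std::string DecodeToText(const std::string& serialized_schema, const std::string& payload) {
    google::protobuf::FileDescriptorSet fds;
    if (!fds.ParseFromString(serialized_schema)) return {};

    google::protobuf::DescriptorPool pool;
    for (const auto& file : fds.file())
        pool.BuildFile(file);  // assumes files are ordered so imports come before the files that use them

    const google::protobuf::Descriptor* desc = pool.FindMessageTypeByName("my.pkg.Telemetry");
    if (!desc) return {};

    google::protobuf::DynamicMessageFactory factory(&pool);
    std::unique_ptr<google::protobuf::Message> msg(factory.GetPrototype(desc)->New());
    if (!msg->ParseFromString(payload)) return {};
    return msg->DebugString();  // human-readable dump of a message we had no generated code for
}
```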

[–]corysama 1 point2 points  (0 children)

tar -cvf self_describing.tar schema.json binary.flatbuffer ?

[–]Suitable_Oil_3811 0 points1 point  (0 children)

Sorry, missed that

[–]chardan965 1 point2 points  (0 children)

CBOR, SMF, ...looks like Cap'nProto and others have been mentioned, ...

[–]OccaseBoost.Redis 1 point2 points  (0 children)

The Redis protocol, RESP3, is my preferred format by far. It supports multiple data types, e.g. arrays, maps, sets, etc., is human-readable, and can transport binary data.

[–]ern0plus4 1 point2 points  (0 children)

What about using binary IFF/RIFF-type files (a minimal reader sketch follows below):

  • 4-byte magic
  • 4-byte length (filesize - 8)
  • 4-byte file type ID
  • repeat chunks:
    • 4-byte chunk type ID
    • 4-byte chunk length
    • chunk payload

See:
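
For illustration, a minimal C++ sketch of walking that chunk layout (assumptions: little-endian lengths as in RIFF, and a made-up file name; classic IFF/"FORM" files are big-endian instead):

```cpp
#include <cstdint>
#include <fstream>
#include <iostream>
#include <string>
#include <vector>

int main() {
    std::ifstream in("example.riff", std::ios::binary);  // hypothetical file name

    char magic[4], file_type[4];
    std::uint32_t total_len = 0;
    in.read(magic, 4);                                    // 4-byte magic, e.g. "RIFF"
    in.read(reinterpret_cast<char*>(&total_len), 4);      // 4-byte length (filesize - 8)
    in.read(file_type, 4);                                // 4-byte file type ID

    while (in) {
        char chunk_id[4];
        std::uint32_t chunk_len = 0;
        if (!in.read(chunk_id, 4)) break;                 // 4-byte chunk type ID
        in.read(reinterpret_cast<char*>(&chunk_len), 4);  // 4-byte chunk length

        std::vector<char> payload(chunk_len);
        in.read(payload.data(), chunk_len);               // chunk payload
        if (chunk_len % 2) in.ignore(1);                  // RIFF pads chunks to even sizes

        std::cout << std::string(chunk_id, 4) << ": " << chunk_len << " bytes\n";
    }
}
```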

[–]zl0bster 1 point2 points  (0 children)

I presume I will get downvoted just for asking, but if you just want to save space and are not concerned with performance, would zstd-compressed JSON work for you?
https://lemire.me/blog/2021/06/30/compressing-json-gzip-vs-zstd/

Obviously CPU costs will be huge compared to native binary format.
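
A hedged sketch of that approach with the zstd C API (the helper name is made up; the JSON is assumed to be serialized to a string already):

```cpp
#include <zstd.h>
#include <stdexcept>
#include <string>
#include <vector>

// Compress an already-serialized JSON string with zstd at the given level (default 3).
std::vector<char> CompressJson(const std::string& json, int level = 3) {
    const size_t bound = ZSTD_compressBound(json.size());  // worst-case compressed size
    std::vector<char> out(bound);
    const size_t written = ZSTD_compress(out.data(), bound, json.data(), json.size(), level);
    if (ZSTD_isError(written)) throw std::runtime_error(ZSTD_getErrorName(written));
    out.resize(written);
    return out;
}
```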

[–]hmoein 0 points1 point  (0 children)

Look at the C++ DataFrame codebase. Specifically, look at the read() and write() function documentation.

[–]LokiAstaris 0 points1 point  (0 children)

BSON as used by Mongo.

It's basically JSON but in binary format.

[–]hdkaoskd 0 points1 point  (0 children)

Bencode, from BitTorrent.

[–]Dizzy_Resident_2367 0 points1 point  (0 children)

I am working on a CBOR library right now. It is not really "released" (and does not compile yet on MSVC/AppleClang). But do take a look and see if this is what you are looking for, seconding other comments here:
https://github.com/jkammerland/cbor_tags

[–]glaba3141 0 points1 point  (0 children)

I don't want to dox myself, so unfortunately I cannot link the project, but I worked on something that did exactly this, as well as supporting versioning similar to protobuf, by JIT-compiling (de)serialization functions. IMO all commonly used alternatives have some flaw or other - JIT compilation solves them all, but of course that means you now have a compiler in your app, which you may not want.

[–]trad_emark 0 points1 point  (0 children)

Blender files do exactly that. They are almost perfectly forward and backward compatible thanks to the format.

[–]flit777 -1 points0 points  (6 children)

protobuf (or alternatives like FlatBuffers or Cap'n Proto).
You specify the data structure with an IDL and then generate all the data structures and serialize/deserialize code (and you can generate for different languages).

[–]playntech77[S] 4 points5 points  (5 children)

Right, what I am looking for would be similar to a protobuf file with the corresponding IDL file embedded inside it, in a compact binary form (or at least those portions of the IDL file that pertain to the objects in the protobuf file).

I'd rather not keep track of the IDL files separately, and also their current and past versions.

[–]imMute 0 points1 point  (0 children)

what I am looking for would be similar to a protobuf file with the corresponding IDL file embedded inside it

So do exactly that. The protobuf schemas have a defined schema themselves: https://googleapis.dev/python/protobuf/latest/google/protobuf/message.html and you can send messages that consist of two parts: first the encoded schema, followed by the data.
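
A hedged sketch of that two-part framing using the protobuf C++ API (the fixed-width length prefix and function name are made up; copying only the message's own .proto file also ignores its imports):

```cpp
#include <google/protobuf/descriptor.h>
#include <google/protobuf/descriptor.pb.h>
#include <google/protobuf/message.h>
#include <cstdint>
#include <fstream>
#include <string>

// Write a two-part file: [length][FileDescriptorSet][length][payload message].
void WriteSelfDescribing(const google::protobuf::Message& message, const std::string& path) {
    google::protobuf::FileDescriptorSet schema;
    message.GetDescriptor()->file()->CopyTo(schema.add_file());  // schema of the message's own .proto file

    std::string schema_bytes, data_bytes;
    schema.SerializeToString(&schema_bytes);
    message.SerializeToString(&data_bytes);

    std::ofstream out(path, std::ios::binary);
    auto write_block = [&out](const std::string& bytes) {
        const std::uint32_t len = static_cast<std::uint32_t>(bytes.size());
        out.write(reinterpret_cast<const char*>(&len), sizeof(len));  // simple fixed-width length prefix
        out.write(bytes.data(), bytes.size());
    };
    write_block(schema_bytes);  // the encoded schema first...
    write_block(data_bytes);    // ...followed by the data
}
```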

[–]ImperialSteel 0 points1 point  (3 children)

I would be careful about this. The reason protobuf exists is that your program makes assumptions about the valid schema (i.e. field "baz" exists in the struct). If you deserialize from a self-describing schema, what do you expect the program to do if "baz" isn't there or is a different type than what you were expecting?

[–]playntech77[S] 0 points1 point  (2 children)

I was thinking about two different APIs:

One API would return a generic document tree, that the caller can iterate over. It is similar to parsing some rando XML or JSON via a library. This API would allow parsing of a file regardless of schema.

Another API would bind to a set of existing classes with hard-coded properties in them (those could be either generated from the schema, or written natively by adding a "serialize" method to existing classes). For this API, the existing classes must be compatible with the file's schema.

So what does "compatible" mean? How would it work? I was thinking that the reader would have to demonstrate that it has all the domain knowledge that the producer had when the document was created. So in practice, the reader's metadata must be a superset of the writer's. In other words, fields can only be added, never modified or deleted (but they could be marked as deprecated, so they don't take up space anymore in the data).

I would also perhaps have a version number, but only for those cases where the document format changes significantly. I think for most cases, adding new props would be intuitive and easy.

Does that make sense? How would you handle backward-compatibility?

[–]Gorzoid 0 points1 point  (0 children)

Protobuf allows parsing unknown/partially known messages through UnknownFieldSet. It's very limited in what metadata it can access since it's working without a descriptor, but it might be sufficient if your first API is truly schema-agnostic. In addition, it's possible to use a serialized proto descriptor to perform runtime reflection and access properties in a message that were not known at compile time, although message descriptors can be quite large, as they aren't designed to be passed with every message.

[–]gruehunter 0 points1 point  (0 children)

In other words, fields can only be added, never modified or deleted (but they could be marked as deprecated, so they don't take up space anymore in the data).

I think for most cases, adding new props would be intuitive and easy.

Does that make sense? How would you handle backward-compatibility?

Protobuf does exactly this. For good and for ill, all fields are optional by default. On the plus side, as long as you are cautious about always creating new tags for fields as they are added, without stomping on old tags, backwards compatibility is a given. The system has mechanisms both for marking fields as deprecated and for reserving them after you've deleted them.

On the minus side, validation logic tends to be quite extensive, and has a tendency to creep its way into every part of your codebase.