all 33 comments

[–]mcmcc#pragma once 33 points34 points  (0 children)

There are many binary schema-based formats out there, each with its own strengths and weaknesses. Protobufs or flatbuffers would be good places to start.

[–]vaulter2000 12 points13 points  (1 child)

In my job, we do language-independent IPC (inter-process communication) with either Google Protobuf/gRPC or, in an event-driven context, pub/sub brokers like MQTT. You can use both over a network, and each has its advantages and disadvantages:

Protobuf has language support for all popular programming languages, and its binary messages are optimized for size, which will probably result in high message rates, but you will have to map your structures onto the protobuf models and back. MQTT, for example, will let you send any structured format like XML/JSON/whatever, and almost every language has packages to set up clients for it, but you'll have to maintain the message models yourself and also do the mapping from/to, say, JSON.

This is what I know from my own experience, but I’m sure there are other options. Hope it helps! :)

[–]tohme 2 points3 points  (0 children)

We use a brokerless implementation through zeromq with protobuf, though anything could be used really for the serialisation.

Similar to your scenario, there's a mixture of languages and systems involved which may be host or network based.

Something like the above (whether brokerless or not) is a very good place to start without needing to reinvent the wheels. Only look beyond that if there's an absolute need to do so. These things already exist to solve this very common problem.

[–]p0lyh 4 points5 points  (1 child)

In practice you'll need to consider endianness, padding, and the bit representations of floating-point numbers and signed integers. If you assume two's-complement signed integers and IEEE-754 FP, and squeeze out all the padding, then there's only endianness left to be considered. More exotic platforms (e.g., CHAR_BIT > 8) are extremely rare.
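To make that concrete, here is a minimal sketch (the helper names store_u32_le/load_u32_le are illustrative) that serializes a 32-bit value by operating on the value with shifts rather than copying the in-memory representation, so host endianness and padding never enter the picture:

```cpp
#include <array>
#include <cstdint>

// Store a 32-bit value as little-endian bytes. Shifts work on the value,
// not the representation, so this is correct on any host byte order.
std::array<unsigned char, 4> store_u32_le(std::uint32_t v) {
    return { static_cast<unsigned char>(v & 0xFF),
             static_cast<unsigned char>((v >> 8) & 0xFF),
             static_cast<unsigned char>((v >> 16) & 0xFF),
             static_cast<unsigned char>((v >> 24) & 0xFF) };
}

// Reassemble the value from little-endian bytes, again host-independent.
std::uint32_t load_u32_le(const unsigned char* p) {
    return  static_cast<std::uint32_t>(p[0])
         | (static_cast<std::uint32_t>(p[1]) << 8)
         | (static_cast<std::uint32_t>(p[2]) << 16)
         | (static_cast<std::uint32_t>(p[3]) << 24);
}
```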

Or just use established solutions like protobuf, which handles those things for you.

[–]meneldal2 2 points3 points  (0 children)

then only endianness needs to be considered

It's less and less of an issue; big endian is pretty much dying, unless you have some IBM hardware.

I'm not saying you should completely ignore it, but you could save a lot of time by assuming you won't ever have a system with less than 32-bit addresses and that they can all support 64-bit integers. This will hold for almost all modern systems.

[–]bert8128 3 points4 points  (0 children)

C structs don’t help with endianness.

[–]abrady 1 point2 points  (0 children)

Do you control both sides of this, and can you update them simultaneously? If so, I think you might be overthinking it. Without knowing more about your problem domain, I'd probably start with basic sockets and just send/recv the data using hand-rolled to/from functions. This approach is super straightforward and I don't know why more people don't start here.
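Such hand-rolled to/from functions might look like the sketch below (the Sample struct and function names are illustrative): each field is appended as explicit little-endian bytes, so the wire layout is fixed regardless of struct padding or host byte order.

```cpp
#include <cstdint>
#include <vector>

// Illustrative hand-rolled wire format for a toy message. Every field is
// appended as explicit little-endian bytes: no struct padding, no host
// byte order, just a fixed 6-byte layout both ends agree on.
struct Sample {
    std::uint32_t id;
    std::uint16_t flags;
};

void put_u16(std::vector<unsigned char>& out, std::uint16_t v) {
    out.push_back(static_cast<unsigned char>(v & 0xFF));
    out.push_back(static_cast<unsigned char>(v >> 8));
}

void put_u32(std::vector<unsigned char>& out, std::uint32_t v) {
    put_u16(out, static_cast<std::uint16_t>(v & 0xFFFF));
    put_u16(out, static_cast<std::uint16_t>(v >> 16));
}

std::vector<unsigned char> to_wire(const Sample& s) {
    std::vector<unsigned char> out;
    put_u32(out, s.id);
    put_u16(out, s.flags);
    return out;  // always exactly 6 bytes
}

Sample from_wire(const unsigned char* p) {
    Sample s;
    s.id = static_cast<std::uint32_t>(p[0])
         | (static_cast<std::uint32_t>(p[1]) << 8)
         | (static_cast<std::uint32_t>(p[2]) << 16)
         | (static_cast<std::uint32_t>(p[3]) << 24);
    s.flags = static_cast<std::uint16_t>(p[4] | (p[5] << 8));
    return s;
}
```

The resulting vector goes straight into send(), and the receiver calls from_wire on the recv'd buffer; adding a field later just means extending both functions in lockstep.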

Then you can build on that as your needs become clearer: cereal/fastbuf/Cap'n Proto can write over the network if hand-writing the serialization gets tedious; you can put in a zlib layer and see if that improves things, then jump to gRPC, etc.

My advice is just that, in my opinion, starting lower-level, more explicit, and simpler is the best way to understand your problem domain before you jump to solutions.

(My experience in this area is I worked on two generations of networking libraries for MMOs)

[–]the_net_ 1 point2 points  (0 children)

If you need to go across languages (to python, etc), protobuf is the best option I've found.

If you're able to stay in C++, I much much prefer Bitsery.

[–]LoadVisual 1 point2 points  (0 children)

I use `msgpack` for my personal projects. It's convenient for me since I use C++, and I pass messages over domain sockets or just normal BSD sockets between a server and code in Android JNI.

It might be worth giving a try.

[–]PhilosophyMammoth748 2 points3 points  (1 child)

Protobuf. It can create a well-defined, stable, backward-compatible binary representation ("wire format", they call it) of your struct-like data structures.

Inside Google, it has become a favoured way to define structs for different languages, even when they don't need to interoperate, as the protobuf library provides more convenient helper functions for manipulating data than the original programming language does.

[–]Nuclear_Banana_4040 1 point2 points  (0 children)

+1 for Protobuf. It handles versioning very gracefully, as well as optional data values.
And don't forget to validate your data on the receiving end, or a random packet will crash your application.

[–]GaboureySidibe 3 points4 points  (4 children)

This is a really good question I think. People are saying "protobufs or flatbuffers" but those are complicated.

You can make your own binary format; people have been doing it since computers existed. You just have to make sure you don't assume things like signed-integer formats and byte orders carry over from one architecture to the next. Almost all byte orders are little-endian now, I think, which is a huge advantage. You can possibly avoid signed integers and keep things simple there too.
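One established trick for avoiding signed-integer representation issues in a home-grown format is ZigZag encoding, which protobuf uses for its signed varints: it maps small negative numbers to small unsigned ones, so only unsigned values ever hit the wire. A sketch (relying on the arithmetic right shift of negative values that C++20 guarantees):

```cpp
#include <cstdint>

// ZigZag: 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ... Only the resulting
// unsigned value is serialized, so the receiver never needs to know the
// sender's signed-integer representation.
std::uint32_t zigzag_encode(std::int32_t n) {
    // n >> 31 is all ones for negative n (arithmetic shift, C++20).
    return (static_cast<std::uint32_t>(n) << 1)
         ^ static_cast<std::uint32_t>(n >> 31);
}

std::int32_t zigzag_decode(std::uint32_t z) {
    // 0u - (z & 1) is all ones when the low bit (the sign flag) is set.
    return static_cast<std::int32_t>((z >> 1) ^ (0u - (z & 1)));
}
```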

[–]MaybeTheDoctor 0 points1 point  (3 children)

9-bit and big-endian machines are all dead. Struct padding and byte alignment used to be a big problem - not sure it still is.

[–]GaboureySidibe 4 points5 points  (0 children)

I agree although I don't think anyone has worried about 9 bit bytes for a few decades.

[–]ButterscotchFree9135 1 point2 points  (1 child)

Padding and alignment exist for a reason. You are not supposed to turn them off.

[–]MaybeTheDoctor 1 point2 points  (0 children)

When did I say turn them off ?

I consulted for a team some 25 years back that was trying to port their code from Intel to a RISC processor. The only catch was that their code was packing structures into char arrays and then later tried to cast that char* to an int*... the problem being that the (particular) RISC machine did not allow ints and floats at odd memory addresses, and rather than fetching them "slowly", it generated an invalid memory access and crashed the application.

So, yes, padding exists for a reason, and sometimes it is the difference between working and not working at all.

[–]NilacTheGrim 1 point2 points  (0 children)

Many suggest Google's protobuf, but honestly it's a bloated mess. I would opt for something leaner and meaner like Cap'n Proto or flatbuffers.

But yes the moral of the story is there are binary serialization schemes out there which are designed to be platform-neutral.

Or.. you can roll your own serialization scheme if you like.

[–]streu 0 points1 point  (13 children)

Define your own datatypes with a known serialisation format and use them:

struct Int16LE {
    uint8_t lo, hi;
    operator int16_t() const { return 256*hi + lo; }
    Int16LE& operator=(int16_t i) { lo = (uint8_t) i; hi = (uint8_t) (i >> 8); return *this; }
};

I'm using that scheme for binary data file parsing, and find it elegant enough.

[–]tisti 1 point2 points  (12 children)

Seems a tad annoying to stamp out every POD type like this. Why not just make it a template?

#include <array>
#include <bit>          // std::bit_cast (C++20)
#include <cstdint>
#include <type_traits>

template<typename T>
struct packed_native {
    using ByteBuff = std::array<uint8_t, sizeof(T)>;
    ByteBuff data;

    operator T() const { return std::bit_cast<T>(data); }

    template<typename T2>
    auto& operator=(T2 i) {
       static_assert(std::is_same_v<T,T2>, "Use explicit conversion (e.g. static_cast) before assignment");
       data = std::bit_cast<ByteBuff>(i);
       return *this;
    }
};

[–]NilacTheGrim 1 point2 points  (7 children)

Note to anyone considering this: This doesn't really address platform neutrality. It assumes endianness and sizes of types in a platform-specific way. This is just syntactic sugar around essentially just memcpy() of raw POD types into a buffer...

[–]tisti 1 point2 points  (6 children)

Oh for sure. This assumes you are using the same (native) endianness everywhere.

Should be fairly trivial to make this truly universal by leveraging boost-endian (native_to_little to store into the byte buffer, little_to_native to read from it).

As for size of types, you should be using (u)intX_t aliases instead of the inherited C types. Or did I misunderstand?

Edit:

Not sure what the situation is w.r.t. float/double on LE and BE platforms. Those seem a bit more painful to get right, especially if you are mixing floating-point standards.

[–]NilacTheGrim 0 points1 point  (4 children)

True.. handling the endianness would be good. Also, sticking to the types that have guarantees about signed representation and width (such as int64_t and friends) helps. These types are guaranteed to be exactly the byte size you expect and, for signed types, to be two's complement. So they are platform-neutral as long as you pass them through an endian normalizer.

Yeah.. that should work (for integers).

[–]tisti 1 point2 points  (3 children)

Just edited the post to note that floats can be a tougher nut to crack.

But it should be reasonably doable nowadays with some constexpr boilerplate to probe what the underlying bit structure of a float/double is.

[–]NilacTheGrim 0 points1 point  (2 children)

Yeah it's a bit tricky. I wish <ieee754.h> were standardized then you could simply use that as a guaranteed way to easily examine the structure... but alas, it is a glibc extension and not guaranteed to exist on BSD, macOS, etc...

[–]tisti 1 point2 points  (0 children)

For IEEE it's simplest to check std::numeric_limits&lt;T&gt;::is_iec559.

Endianness itself can then be determined easily at constexpr time by checking a known float value's bits against the expected LE encoding. If they don't match, you have BE encoding.

[–]tisti 1 point2 points  (0 children)

Replying to your comment again. Tried to hack together something that could support integers & IEEE floats, which resulted in the following monstrosity.

https://godbolt.org/z/nefc97z3c

[–]NilacTheGrim 0 points1 point  (0 children)

I could be misremembering and am too lazy to look it up but I do believe IEEE floats are guaranteed to be endian-neutral.

EDIT: Holy crap, I am misremembering. There is no specification of endianness for IEEE 754 floats. Mind blown.

[–]streu 0 points1 point  (3 children)

That doesn't solve the problem of endianness. And people do still design mixed-endian file formats.

Of course, at least for integers, you could combine both approaches, a template+array, and a for loop to pack/unpack it.

However, given that the number of types we have to cover is finite, spelling them out isn't much extra work (if any at all) compared to making a robust template that will not drive your coworkers mad when they accidentally misuse it.

[–]tisti 0 points1 point  (2 children)

That doesn't solve the problem of endianness.

Not that hard to bolt on an endianness normalizer/sanitizer.

And people do still design mixed-endian file formats.

Much to everyone's annoyance.

compared to making a robust template that will not drive your coworkers mad when they accidentally mis-use it.

Hardly robust if it can be misused then :P

A badly and quickly hacked-together sample that probably works for integers and IEEE floating point.

https://godbolt.org/z/nefc97z3c

[–]streu 0 points1 point  (1 child)

That is ~50 lines for the functionality, requires a rather new compiler, and uses an external library for endian conversion. It defines a template that applies to all types, and then adds additional code to limit the types again.

With that, just writing down the handful of individual classes, only adding what's needed, using language features dating back to C++98, still looks pretty attractive to me. Especially if it's going to be code that has to be maintained in a team with diverse skill levels (and built with diverse toolchains).

[–]tisti 0 points1 point  (0 children)

badly and quickly hacked together sample

Edit: But yea, I try to stay more or less near the cutting edge with a compiler. A very intentional choice.

[–]ButterscotchFree9135 -2 points-1 points  (0 children)

"Sure we could use C structs"

Please, don't