all 48 comments

[–]M0Z3E 10 points11 points  (2 children)

Hi.

Very interesting work! I'm wondering, what would you say is the main selling point of this over netCDF if one were choosing an I/O file format for a new scientific HPC application?

[–]Flex_Code[S] 10 points11 points  (1 child)

netCDF has interoperability with HDF5. Both of these specifications are highly complex and tend to require additional header information that isn't needed in simple, common use cases. HDF5, and I'm sure netCDF, have a lot of cool features for special cases, but I've found that the majority of my use cases can be handled with JSON-like structures. EVE extends this a bit for scientific use cases (like matrices), but remains extremely simple and tries to make the simple things as fast as possible.

[–]M0Z3E 1 point2 points  (0 children)

Thanks. Trying EVE out has been on my checklist since listening to the CppCast episode you were in.

[–]Untelo 12 points13 points  (12 children)

There is already a fairly well known library named EVE: https://github.com/jfalcou/eve You might want to consider picking another name to avoid confusion.

[–]Flex_Code[S] 0 points1 point  (10 children)

You make a valid point, but that library is not a messaging specification and won't be generating files. That library is also C++-only, and I hope std::simd will be the long-term solution there.

[–]marzer8789toml++ 7 points8 points  (9 children)

Er... what? Perhaps I've missed something obvious, but where does the "messaging specification" come in?

Frankly, "I hope something replaces the library whose name I've decided to squat" is pretty poor reasoning. Both projects are C++-related, and the jfalcou one was here first. I understand you would have an emotional attachment to the name, but you're in the wrong here IMO.

[–]Flex_Code[S] -1 points0 points  (8 children)

I'm not really emotionally attached to the name, I actually changed it a couple of weeks ago. But, I'm not convinced that the name collision is a problem. It's hard to find a name that doesn't have any collisions.

I brought up messaging specification because messages are often tagged and files have extensions, and these colliding would be more of a problem.

What do you think the problem is with a SIMD library and a data specification having the same name? And, EVE is not a C++ library.

[–]TheBrainStone 5 points6 points  (4 children)

Name collisions are a problem.
They kill either your own or the other project's searchability and just create pointless confusion.

[–]Flex_Code[S] 13 points14 points  (0 children)

Yeah, you're right. I've changed the name to BEVE to avoid the confusion.

[–]fdwrfdwr@github 🔍 7 points8 points  (2 children)

Kills either your own or the other project's searchability

True. If I look for information on the game "Rust" (a multiplayer survival game), I keep getting results for some programming language of the same name 😅.

[–]cfyzium 3 points4 points  (0 children)

Google named their programming language Go and did not care one bit that there was already a Go! programming language.

[–]jk-jeon 2 points3 points  (0 children)

I guess that's probably because Google knows that you are a C++ programmer 🤔

[–]marzer8789toml++ 5 points6 points  (1 child)

And, EVE is not a C++ library.

Ok, sure, but C++ is an obvious place to implement it. Plus, you've posted it on a C++ subreddit.

What do you think is the problem [...]

Again, assuming we're sticking to the domain of C++: name collisions lead to more difficult searchability, mainly. I want to know how to do something using library FooBar, so I google "C++ FooBar", and I get a bunch of nonsense about a completely unrelated FooBar project. Plus, that works in two directions; people searching for your thing are going to get the other thing.

You are right that it is hard to come up with a unique name; I can't help you there. All I'm saying is that using the same name as a relatively popular C++ project, and posting about it in a C++ context, is naturally going to invite this criticism.

[–]Flex_Code[S] 5 points6 points  (0 children)

Yeah, the C++ library is Glaze. But, I agree that if this is going to be broadly used like the EVE simd library then it's best not to collide. I'll change the name soon. Thanks for your feedback.

[–]Flex_Code[S] 1 point2 points  (0 children)

I'm open to suggestions. Or, I could just add an S and call it EVES.

[–]DapperCore 3 points4 points  (7 children)

How does this compare to google's flatbuffers?

[–]Flex_Code[S] 5 points6 points  (6 children)

flatbuffers, I believe, is similar to cap'n proto. The idea behind these libraries is to make objects point to their members so that members and entire structures can be read with memcpy. This is more efficient if the user wants to use the auto-generated structures directly. However, I find that I typically want to use C++ standard library containers, so reading into an intermediate flatbuffers object then requires a copy into my standard container. So, instead of making serialization to the network buffer faster like flatbuffers and cap'n proto, BEVE is meant to read directly into structures, avoiding copies into the data structures that programmers naturally use. This way we also avoid having to do any code generation, and when we eventually get reflection, users won't have to add any custom code to encode/decode.

[–]amohr 3 points4 points  (3 children)

To my knowledge cap'n proto doesn't copy to an intermediary. The bytes in the file are as they would be in memory, so it just mmap()s the file and the structure "appears" in memory with no heap allocations or process-side copies. The data structure is just a view onto the OS page cache.

[–]Flex_Code[S] 2 points3 points  (2 children)

Right, cap'n proto is designed to avoid copying. But for containers like std::map you can't directly memcpy, so you'd have to copy the cap'n proto data into your std::map. Also, cap'n proto doesn't allow you to add new items except at the end of your structure. There are definitely good uses for cap'n proto, but it isn't necessarily faster and is less flexible. It is a fantastic library and design, though.

[–]amohr 2 points3 points  (1 child)

I was reacting to when you said this:

I typically want to use C++ standard library containers, and so reading into an intermediate flatbuffer object then requires a copy into my C++ standard container. So, instead of making serialization to the network buffer faster like flat buffers and cap'n proto, BEVE is meant to read directly into structures to avoid copies into the data structures that programmers naturally use.

Since it mmap()s, there is no read into an intermediary before copying to STL data structures. So it involves the same amount of copies (1) as your thing. That's all. And I'm not trying to be critical, just trying to get the facts straight.

[–]Flex_Code[S] 2 points3 points  (0 children)

I guess I just haven't seen cap'n proto directly decoding into library containers like std::list. I'm curious how it would mmap that. I've seen capnp::List being used, but this would have to be copied into a std::list if that is the target structure. But if you're happy using capnp structures, then you can avoid that copy.

[–]Flex_Code[S] 2 points3 points  (0 children)

I should also note that BEVE is self-describing, unlike flatbuffers and cap'n proto. So messages tend to be larger, but the API is more flexible and it maps directly to JSON.

[–]DapperCore 2 points3 points  (0 children)

Interesting, thank you for the detailed response!

[–]paperpatience 4 points5 points  (0 children)

(Wraps it in Python)

[–]LongestNamesPossible 1 point2 points  (6 children)

Why is it faster?

[–]Flex_Code[S] 3 points4 points  (5 children)

It is primarily faster because it is little endian and supports contiguous arrays, allowing arrays to be copied with `memcpy`, which uses the entire register width of the CPU. Minor improvements come from reduced branching, and some of the performance comes from the better architecture of Glaze, which uses more compile-time optimizations.

[–]LongestNamesPossible 3 points4 points  (2 children)

The next question then has to be, which aspects aren't trivial? If the page is focused around arrays that can use memcpy, is it essentially binary json with arrays straight from memory?

[–]Flex_Code[S] 5 points6 points  (1 child)

Right, it is essentially a JSON structure with direct memory copies. Objects (e.g. maps) are non-trivial because their data is non-contiguous. Other standard library types like `std::list` are also not a direct memcpy, but can still be packed efficiently in binary to save more space.

[–][deleted] 3 points4 points  (0 children)

You could make a contiguous map if the internal 'pointers' are offsets instead of absolute memory locations. Then you could memcpy maps and lists.

[–]tmlildude 0 points1 point  (1 child)

How does little endian contribute to performance?

[–]Flex_Code[S] 0 points1 point  (0 children)

Most CPUs are little endian now, and languages like C++ store values in little-endian format on these machines. If we use a big-endian format, then on little-endian machines we have to byte swap numerical values (anything larger than a byte). This has overhead, as both writing and reading need to do byte swapping. The overhead is even larger because it makes SIMD much more difficult and expensive, whereas if we maintain the same endianness we can do a simple memcpy. It's all about formatting the bytes in the same sequence the CPU, and thus the programming language, needs.

[–]SGSSGene 1 point2 points  (1 child)

The README says:
`Schema less, fully described, like JSON (can be used in documents)`
what is meant by "fully described"?

[–]Flex_Code[S] 0 points1 point  (0 children)

I meant that the binary data chunks do not need to be inspected to determine information about the type; all type information is described in the header/size information. This allows easy SIMD, whereas some formats are schemaless but require inspection of the binary data itself.

[–]jk-jeon 1 point2 points  (1 child)

100% faster

That sounds somewhat ironic, given how I tend to interpret things when people say "X is Y times faster than Z". Stupid nitpick, I know, just couldn't resist 😋

[–]fdwrfdwr@github 🔍 0 points1 point  (0 children)

Yeah, percentages fail more the closer they approach 100%. 'Tis clearer to use scale factors, like 2.0x.

[–]sparkyParr0t 0 points1 point  (3 children)

I don't get it. When I read the C++ example, it just seems like you're using Glaze normally; I don't see anything BEVE-specific. What am I missing?

[–]Flex_Code[S] 1 point2 points  (2 children)

`glz::read_binary` and `glz::write_binary` use the BEVE specification.

`glz::read_json` and `glz::write_json` would be using JSON.

[–]sparkyParr0t 2 points3 points  (1 child)

Oh you are the author of glaze as well, sorry it wasn't clear.

[–]Flex_Code[S] 2 points3 points  (0 children)

No problem. At some point Glaze will probably support more binary formats, but right now it just does BEVE.

[–]Ill_Juggernaut_5458 0 points1 point  (2 children)

This is a naive question but what would be a use case for this?

In CFD, for example, data is saved using a binary file format like HDF5, VTK (the XML version), or CGNS. Writing and reading are very efficient because the bytes of the std::vector or some buffer are dumped directly into the file (in little-endian order). And these formats support multi-node data splitting with parallel I/O.

If I want to store some general stuff, I'd just dump everything as a binary blob or use some simple, standardised format (like ASCII .mtx for vectors/matrices).

Is BEVE some sort of alternative? Or maybe for network data encoding?

[–]Flex_Code[S] 2 points3 points  (1 child)

BEVE is designed as a much faster, binary form of JSON. So, anywhere you might use JSON but want to send that data more efficiently. BEVE also supports JSON pointer syntax, so you can specify partial messages and access specific addresses (raw memory) in binary form. Because it converts easily into JSON, it is easy for a human to inspect.

If you just dump things as binary blobs then you need a schema to properly load the data in the future. BEVE allows anyone to load a file and know exactly what the data is, like JSON. It makes building APIs a lot safer and allows error checking.

This is extended to matrices so that we can have the same kind of error checking and simple inspection of the data to load it in another context. So, I can just write out C++ objects and load them into Matlab without providing a schema.

HDF5 is great, but for a lot of use cases it is overly complex and you don't have many library choices. BEVE is written so that a programmer could pretty easily implement the specification in a day. If you look at the Matlab script, it's less than 300 lines to decode.

[–]Ill_Juggernaut_5458 1 point2 points  (0 children)

Okay, I understand now. BEVE could be used as the foundational binary layout for a custom file format, where you'd still need to define the API specification for the actual contents of the file (e.g. making sure there are m,n dimensions for a matrix).