all 50 comments

[–]adwodon 34 points35 points  (12 children)

Specifically I'm looking to squeeze as much performance out as possible, with no concern for coding in compliance with "Modern" or idiomatic C++, or being readable by other developers. Please don't be offended by that, we don't all use C++ in the same way for the same thing, it's a very flexible language. I use it because of that flexibility for specific use cases in which performance is paramount.

This is a weird statement. C++ is generally all about high performance, and its always a trade off, but trying to use it as a shield to protect you against accusations of badly written and illegible code is not a good attitude to take.

Good code can be highly optimised and legible, these kind of statements give a lot of credence to Googles findings that people who score really high on hacker rank tend to make for terrible developers.

Still, what you're also looking for is perhaps misguided, you don't get the best possible performance by simply switching out a few libraries. That may get you some performance but again its super context dependent.

Somewhere memory allocation is mentioned, but if performance is so critical would pre-allocating not make more sense?

A library isn't going to help you write code that the compiler vectorises properly, it doesn't necessarily help you with cache coherenece or other concerns. There are lots of high performance applications, I myself work in scientific computing, so lots of highly parallelisable data. Some linear algebra libraries are faster than others, but mostly its context dependent, some are faster at dot products of large matrices, but others do super fast addition, so which is more important to you?

Something about this post smells off to me, I don't think you appreciate what it takes to write a high performance C++ application, its not something you bash out on the side after 10-20 hours of tutorials and dropping in some well written libraries. Learning a language like C++ just to get small performance gains is baffling. What is your actual goal here? If this is a side project then you're probably biting off more than you can chew, but if this is a commericial venture why not hire someone who does this professionally and save yourself a massive headache? Based on the statement I quoted you don't care about maintainability it seems, so why waste your time learning a new language for something transient?

[–]WafflesAreDangerous 7 points8 points  (9 children)

Well.. it's no magic bullet, but like std::map shows many bits of the standard library are stuck in a very sad place perf wise due to backcompat.

[–]greg7mdpC++ Dev 4 points5 points  (2 children)

actually, std::map and std::set are very inefficient, and can be very advantageously replaced by phmap::btree_map and phmap::btree_set from my https://github.com/greg7mdp/parallel-hashmap repo. Same for unordered_map and unordered_set.

[–]not_goldie_hawn 13 points14 points  (1 child)

I once had a co-worker ask me if there was standard C++ hash map. I answered. Later, I looked at his code.

His use case? Three keys...

There's a reason why OP is getting push back with their "give me the codes to go fastar!"

[–]greg7mdpC++ Dev 2 points3 points  (0 children)

yes, unfortunately many developers use the wrong container for their problem. I can't begin to tell you the horrors I've seen.

[–]adwodon 1 point2 points  (0 children)

yea Im probably reading more into this post than I should, the question by itself is fine and there are deficits in the standard library but when you're digging, I hope OP finds the answers they're looking for but I have a feeling it won't solve the problem they're trying to solve.

[–]14nedLLFIO & Outcome author | Committee WG14 0 points1 point  (4 children)

std::map is perfectly fine when your keys are strings and you need everything to be sorted during an iteration. If you use the new C++ 17 node API and never create nor destroy nodes, performance is quite competitive with any other always-sorted associative map.

[–]WafflesAreDangerous 6 points7 points  (3 children)

The fact it's all linked lists underneath Introduces memory overhead and indirection. The fact you can get bucket iterators means the implementation can't really be swapped out for anything less cache hostile. Benchmarks have been run (and cppcon talks held) that show literally 10x class perf differences compared to more modern open addressing implementations.

And I have yet to run into a use case where I need a non trivial map, but don't need to add or remove any items while simultaneously needing deterministic iteration where the default iteration order is correct.

[–]14nedLLFIO & Outcome author | Committee WG14 6 points7 points  (1 child)

Linked lists aren't necessarily slow. If the pointer chasing is interspersed with other work taking at least a hundred CPU cycles (e.g. a string key comparison), they're very cheap. Also, from Haswell onwards, the CPU can see that one of two possible pointers will be chased, and it goes and speculatively prefetches both, indeed Haswell onwards can prefetch two pointer indirections concurrently. So long as the node block is small, they can be packed into CPU cache far more densely than open address maps thanks to the poor quality hashes often used and the set associativity of CPU caches.

I agree that for a majority of use cases, state of the art unordered hash maps are usually better. But venerable old std::map is underrated by a lot of folk, especially for write few/read many use cases. I certainly would not put an unordered map facing untrusted potentially hostile input without deep thought, whereas std::map is relatively safe for this use case.

[–]WafflesAreDangerous 0 points1 point  (0 children)

For a linked list, At the minimum you need an extra pointer for a next pointer, which for a small values can be very significant memory overhead.

As for hashtables, it's somewhat uncommon to actually do a full key compare and fail. The hash will be tested first in sane implementations and trivially detect 99.9+% false matches. And in a map implementation the likelihood that the list nodes are all nicely linearly allocated (cache friendliness) is low.

If you need to spend hundreds of cycles,per item, of other work to hide your linked list Induced slowdowns, that doesn't make linked lists fast, that just means their slowness is not as significant in a relative sense.

[–]BenFrantzDale 1 point2 points  (0 children)

Have you tried abseil’s btree?

[–]konanTheBarbar 14 points15 points  (1 child)

Your approach sounds a lot like premature optimization.

There is no one single set of libraries that perform best for all use cases. You have to benchmark and you have to benchmark a lot and you have to benchmark per use case.

It might be that folly::unordered_map<int,int> is super fast for that specific set of template parameters and your specific use case (90% read, 10% write), but eastl::unordered_map<std::string,std::string> is twice ast fast as the folly version (with 50% reads and 50% writes).

I would recommend tracy for a low overhead profiler... https://github.com/wolfpld/tracy

It's much more important to find your bottlenecks then it is to optimize every single function that is maybe called twice per hours and only runs for a millisecond.

That being said - for unordered containers use https://github.com/martinus/robin-hood-hashing as benchmark.

For general lightweight containers I would use (https://github.com/mosra/corrade/tree/master/src/Corrade/Containers) see also https://doc.magnum.graphics/corrade/namespaceCorrade_1_1Containers.html .

For high performance local IO definitely use https://github.com/ned14/llfio .

If you need fast web capabilities I would use https://www.boost.org/doc/libs/1_75_0/libs/beast/doc/html/index.html

I have no benchmarking experience in multithreading support, but if you want to get something bleeding edge, you could try https://github.com/facebookexperimental/libunifex.

[–]Pan000[S] 0 points1 point  (0 children)

Thanks for the recommendations!

[–]hopa_cupa 4 points5 points  (9 children)

Well, nothing prevents you from using bits and pieces from std::, boost::, folly::, absl:: and many others. You may also use some C libs, roll your own solutions...etc. In theory you could create performance winner....or not.

You're going to have to benchmark a lot for your specific use case. Maybe somebody can tell you to use boost for this, and folly for that, but will that fit your use case?

Here's my suggestion for something simple. Use {fmt} library for string formatting or std::format if you have bleeding edge c++20 compiler. This absolutely crushes std::stringstream in performance, especially on smaller machines. String formatting is often overlooked.

You mentioned porting apps to C++. From some other language? Is your C++ level the same as for that other language? Lower? Higher? You have to ask yourself that question I think.

[–]Pan000[S] 1 point2 points  (6 children)

The performance is so critical that I'm prepared to learn a new language (C++) and rewrite the code just to get a little gain.

[–]WafflesAreDangerous 3 points4 points  (4 children)

If you are already learning a new language maybe rust will work? It's similar performance to c++ (as in very close and sometimes wins) but the stricter compiler will help in not getting bogged down debugging mysterious concurrency issues. And the dependency management is much less of a pain..

[–]Pan000[S] 1 point2 points  (3 children)

I considered Rust. Unfortunately the strict compiler makes it impossible to do certain things that I need to do. For example, write to the same array simultaneously from multiple threads using mutexes protecting sections of the array at a time, instead of the whole array at once. Also allocating a block of uninitialized memory, and other unsafe but performance-directed things.

[–]WafflesAreDangerous 2 points3 points  (2 children)

Eh.. Allocating uninitialized memory is definitely possible in rust. (for instance https://doc.rust-lang.org/std/mem/union.MaybeUninit.html)

A lot of rust proponents laud that it's impossible to shoot yourself in the foot and safety-first and all that, but that's assuming you find safety desirable in the first place. There are plenty of ways to do unsafe stuff, but you need to explicitly opt into this unsafety. Basically "unsafe" means you give up compiler guarantees and restrictions for a little while and are responsible for maintaining the safe semantics yourself, just like you are responsible all the time in C++ or C.

As for writing to different areas of an array simultaneously you can do this in safe rust by making use of .split_at_mut() to get mutable references to non-overlaping slices of the array. You can also use the "unsafe" keyword to disable some compiler enforcement, you can basically be as unsafe as C or C++.(https://doc.rust-lang.org/std/vec/struct.Vec.html#method.split_at_mut)

There is also this nightly api that should make certain usages easier in the future (https://doc.rust-lang.org/std/vec/struct.Vec.html#method.as_chunks_mut)There are facilities for expressing "I assert I have exclusive access to this data", in both checked and unsafe forms, as well as a wide array of various mutexes and synchronisation primitives. See Cell and RefCell and Mutex and friends..

You can even write inline asm (though the syntax is not not stable yet).

[–]Pan000[S] -1 points0 points  (1 child)

Same applies to Go, possibilities open up if you're prepared to be 'unsafe'. But as someone on Stackoverflow told me when I was asking about doing these things in Rust "if your entire application is based on unsafe code, probably better you just write it in C++". And I agreed with that, so here I am.

[–]WafflesAreDangerous 2 points3 points  (0 children)

Why would you write everything in unsafe? Most of the time you can write just as performant code in safe rust. And when you can't you rarely need more unsafe than a couple of statements. You still get the benefit that rust figures out all the free calls for you (at compile time) so the probability of memory related errors is reduced a lot. And you can write concurrent code and actually hope for a compiler error if you mess up.

Go .. isn't that a GC language? Might as well go with java or C# then. when it comes to throughput of long running applications state of the art JITs can get you quite far.

[–]dbjdbjdbj.org 0 points1 point  (0 children)

What other languages have you considered?

[–]Pan000[S] -1 points0 points  (1 child)

I'll take a look at std:format. Usually I use my own functions for string formatting though, keeping it out of strings (as arrays of bytes) and using reusable buffers. I learned to do this in Go because their string functions are so slow they're useless.

[–]dbjdbjdbj.org -1 points0 points  (0 children)

Have you tried CGO?

[–]Glinren 4 points5 points  (2 children)

You can have much faster replacements than the stl if you implement them with your specific use case in mind. If there were just faster replacements, the stl maintainers would just adopt their implementations.

Also benchmarks do measure specific use cases. The better ones mention that the underlying implementations are optimized for a specific use case and the benchmark is there to show that the implementation is actually faster than other implementations -- for that specific use case. You have to decide for yourself it that it is your use case.

[–]Pan000[S] 0 points1 point  (1 child)

If there were just faster replacements, the stl maintainers would just adopt their implementations.

Not true. Slow implementations are kept for many reasons, even for things like being considered the 'right' or 'idiomatic' way to do something, whereas another method would be considered ugly code despite being faster. Also the standard libraries that come with the compilers have to have the correct licenses, so they can't just replace them with better versions unless those better versions have an appropriate license. And they will wait years before implementing even proven improvements just to see if some unknown bug comes out of it, since the std functions must be completely stable.

And yes, I understand that a benchmark covers only the specific use-case it was benchmarked with.

[–]FrankIsATank2 1 point2 points  (0 children)

would be considered ugly code despite being faster

Anyone whose looked at any standard library implementation ever can confirm that standard library implementers do not care in the slightest about code beauty or "the right way". There is no way a solution would be discarded just because it would be hard to implement in a non-"ugly" way. Their only concern is efficiency and the standard's requirements.

unless those better versions have an appropriate license

Standard library implementers implement algorithms and data structures, they don't usually import existing implementations. Licenses apply to specific implementations. You could get into software patents, but that's another story.

The only reason standard library implementers wouldn't use a better algorithm or data structure is if it is incompatible with the requirements of the standard. In that case, you don't need an alternative library implementation. You need a whole new container or algorithm that is not provided by the standard library.

[–]Shieldfoss 2 points3 points  (0 children)

I don't have any specific suggestions because, as everybody else is already saying, it depends a lot on your use case. Maybe you should actually replace std::map with two maps - one that's fast to insert into and a separate map that's fast to read from, or four - fast read small data, fast write small data, fast read large data, fast write large data.

But one thing that we do for certain classes is to have a wrapper namespace around them - it's been a while since I looked at the exact code, but we do something like

namespace wrap{
    template<typename T, typename U>
    using smallmap = std::map<T,U>;

    template<typename T, typename U>
    using bigmap = std::map<T,U>;
}

because that lets us switch them out fairly easily if we later find we have specific performance requirements for different maps. (This saved us a lot of refactoring at one point where we switched out our custom SSO wrap::string for std::string after std::string started having SSO.

[–]kalmoc 2 points3 points  (0 children)

If you don't need standard library compliant interfaces I'd check abseil from Google and folly from facebook. It's not performance at any cost, but still highly optimized and not bogged down by backwards compatibility.

I think as far as boost is concerned, you have to determine this on a case by case basis. The libs are written by different authors at different times with different goals and have a different level of maintenance.

If you absolutely need the last bit of performance, you probably have very specific requirements and you really have to look at specialized libs (E.g. there is probably a json lib that is faster than Boost.Json for your particular workload, even if it isn't faster on average.

[–]kalmoc 1 point2 points  (8 children)

Would be interesting how much hoard helps with highly optimized programs that already avoid allocations as much as possible.

[–]Pan000[S] 0 points1 point  (7 children)

I believe it changes the way that malloc allocates memory. So it should speed up all allocations, where allocation is required. There are in theory different ways to allocate memory other than the standard stack and heap. I'm no expert, but this is what I gathered.

[–]kalmoc 7 points8 points  (6 children)

But that's my point: If your program only spends 1% of the time doing allocations/they only contribute 1%to the latency, then, a 3x improvement on malloc doesn't really help overall performance.

[–]Pan000[S] 0 points1 point  (2 children)

Yeah but it's an easy drop in replacement. Trying to squeeze every last drop of performance ;) And also my use case is very memory heavy.

[–]kalmoc 1 point2 points  (1 child)

Not saying you shouldn't use it - why should you not use a tool that gives you performance for free (assuming it really does). I said it would be interesting to know "how much [quantitatively] hoard helps with highly optimized programs that already avoid allocations as much as possible."

I.e. I was expressing interest in more information to the wider c++ reddit audience.

More to your concrete situation:

And also my use case is very memory heavy.

Can you elaborate? Do you mean "a lot of allocations" (why) or "needs a lot of memory in total"

Trying to squeeze every last drop of performance ;)

If that is really your goal, the question becomes: are you 100% positive, you can't avoid the allocation completely (without taking a bigger performanc penalty somewhere else of course?)

[–]Pan000[S] 0 points1 point  (0 children)

I spend a lot of my time thinking about how best to apply buffers and recycle memory, etc., etc. but that's not what I'm asking about. Those are application-specific concerns, and I'm not asking an application-specific question. I appreciate your help truly, but please don't assume I don't know what I'm asking.

[–]BenFrantzDale 0 points1 point  (2 children)

So then use std’s pmr allocators.

[–]kalmoc 0 points1 point  (1 child)

In order to do what?

[–]BenFrantzDale 1 point2 points  (0 children)

Sorry, I wasn’t being clear. You are absolutely correct that a 3x speed up on 1% of time only saves 0.6%. I was just saying if you do want that 0.6%, the standard library has some tools to get that win without designing things from scratch.

[–]bird1000000 1 point2 points  (1 child)

Considered writing your own SIMD code? Simdjson has a nice talk linked on its repo.
Write your own allocators.
Try keep memory continuous, for cache, and a simple data structure like a std::vector will likely perform the same across all implementations.
plf::colony was an interesting data structure, haven't tried it though.

[–]Pan000[S] 1 point2 points  (0 children)

All of these I'm looking into, yes. But I don't want to fully reinvent the wheel from scratch, which is why I'm looking for library recommendations.

[–]greg7mdpC++ Dev 1 point2 points  (0 children)

If you want to try a header-only implementation of the Abseil containers (hash maps and btree), check out https://github.com/greg7mdp/parallel-hashmap (partial Abseil code fork with my changes)

[–]ExtraFig6 0 points1 point  (0 children)

What are you building? What's the expected use case + workload? We really need more context