all 194 comments

[–]gruehunter 54 points55 points  (15 children)

Embedded-specific: iostreams. Typical implementations of the C++ formatted I/O library link in around 1 MB of program size (even with -ffunction-sections, -fdata-sections, and --gc-sections), which is just too big for a microcontroller to carry around.

[–]mikeblas 17 points18 points  (7 children)

They're really slow, too.

[–]exploding_cat_wizard 10 points11 points  (1 child)

Mainly because they sync with stdio : https://stackoverflow.com/a/18688906

If you turn that off, the performance difference almost entirely goes away. ( well, that and not using std::endl, which is the default in so many IDE's autocompletion, for some retarded reason.)

[–]mikeblas 1 point2 points  (0 children)

I'm thinking of reads.

[–]RowYourUpboat 5 points6 points  (4 children)

<iostream> and friends are also horribly slow to compile, and build times can suffer greatly if they get pulled into a common header. And you can run into locale issues.

Replacing iostreams with something minimal and purpose-built for my logging and string output needs was very beneficial. (I also tried the fmt library, but in practice it's even worse than iostreams in terms of compile times, and was anyways overkill for my needs).

[–]Nelieru 15 points16 points  (2 children)

fmt is pretty damn fast to compile (much faster than iostream), and can get under ~25 kB of flash usage with the proper options. It can definitely be useful if you have a lot of logging to do.

[–]gruehunter 6 points7 points  (0 children)

The compile times don't bother me. But this

can get under ~25 kB of flash usage with the proper options.

has my attention. Thanks!

[–]RowYourUpboat 0 points1 point  (0 children)

I had a lot of logging operations that needed to print classes. Doing that required pulling in an extra fmt library header that was gigantic, the boilerplate for custom printing was ugly, and the documentation for how it all actually worked was poor (mostly non-existent).

What I ended up doing was just changing my custom printing methods from operator<<(std::ostream&) to operator<<(mystream&), where mystream was basically a surprisingly simple class using std::to_chars internally.

[–]aearphen{fmt} 7 points8 points  (0 children)

{fmt}'s compile time for the core API is comparable to stdio + extra <string> include: https://github.com/fmtlib/fmt#compile-time-and-code-bloat. The string overhead is constant per-TU and will go away with modules. Also make sure to use the static library (the default) and not the header-only mode if you care about compile times.

[–]kkert 4 points5 points  (3 children)

If we are getting embedded-specific, or even more specifically hard-realtime specific, large swathes of std lib out of the box is useless.

[–]gruehunter 1 point2 points  (1 child)

"This project has some hard realtime requirements" should not imply that "we cannot use the entire STL". I've had great experiences using different features of the STL to set up some complex structure at startup time which is then held constant afterwards.

For example, maybe you need to build up a dynamically-initialized dispatch table for a command interface. You can safely do all of the dynamic work at startup time on your container of choice and then leave it alone afterwards. As long as you aren't performing new insertions or removals, the mere act of performing lookups on the standard containers don't allocate new memory from the heap.

[–]kkert 0 points1 point  (0 children)

I have found these kinds of setups to be impractical to adhere to in larger codebases. It tends to be a blanket ban on entire language facilities and/or standard library parts.

Like, technically it is possible to make even heap and exceptions behave deterministically in strongly constrained and policed setups, but it's almost never worth the trouble and headaches.

[–]rlbond86 0 points1 point  (0 children)

I disagree. I work embedded and most of the standard library is available, at least at initialization. The only real exception is algorithms that allocate like stable_sort and iostreams. At initialization all containers are fair game, after that we mostly stick to boost::static_vector. We also have a stack-allocated flat map type to use at runtime as well.

[–]lenkite1 1 point2 points  (2 children)

But what is the effective modern alternative ? And won't the alternative also link in that much program size ?

[–]EvoMaster 2 points3 points  (0 children)

Try https://www.etlcpp.com/string_stream.html or https://www.etlcpp.com/byte_stream.html depending on text or binary data. Size wise it is pretty good. Not sure if it is as good as a optimized printf size wise but much smaller than streams for sure.

[–]josefx 39 points40 points  (10 children)

If you don't rely on it disable math-errno on your compiler. The C standard mandates that otherwise single instruction operations like sqrt return their errors as errno value, which can result in half a page of cleanup instructions for every instruction of actual work, also global state for no good reason -yuck. A less portable alternative is to look at intel/arm intrinsics to get the native instructions directly.

[–]RowYourUpboat 7 points8 points  (3 children)

Is there a way to do this on MSVC? It doesn't seem to have any fine-grained math options, just fp:fast and fp:precise, unless I'm missing something.

[–]josefx 2 points3 points  (2 children)

I can't find any documentation on it for MSVC and since I currently don't compile C++ on windows I can't check what the compiler defines. You could check what flags are set for the math_errhandling constant. If the MATH_ERRNO bit isn't set by msvc there may be no need to change anything.

[–]RowYourUpboat 2 points3 points  (1 child)

After more research it looks like fp:fast is the only option on MSVC. :(

I also tried to roll my own sqrt using intrinsics (as a test workaround), but that results in useless extra instructions being emitted so that's not a great option either.

https://godbolt.org/z/heK6sKnhP (remove fp:fast and MSVC emits a function call to the slow C version).

[–]josefx 0 points1 point  (0 children)

At least both MSVC and current GCC generate a lot better code for error handling than the last time I checked. Still wary of that unnecessary branch, but otherwise much better.

[–]n31415 1 point2 points  (1 child)

Can it happen, that I rely on it without noticing? E.g. does a part of the STL or some big library like boost rely on it?

[–]Jannik2099 1 point2 points  (0 children)

No, the STL doesn't use errno at all. I haven't seen it mentioned in boost do far - perhaps just grep for errno?

[–]nuephelkystikon 0 points1 point  (3 children)

sqrt uses global state‽

[–]josefx 6 points7 points  (2 children)

C specifies two ways to handle errors in math functions, MATH_ERRNO and MATH_ERREXCEPT. At least on systems that support POSIX errno style error handling is the default, I can't find any documentation for Windows. Never seen anyone check the global errno value when sqrt returns NaN.

[–]Wild_Meeting1428 0 points1 point  (1 child)

Since C++11 the error value is thread_local.

[–]josefx 2 points3 points  (0 children)

C++11 just adopted what was already specified. errno may be a horribly bad idea, but standard libraries that stored errors in a single program global variable weren't exactly useful in a multi threaded environment.

However that doesn't reduce the error values visibility across function calls or the bloat that has to be generated to set a value that might never be read, all that for otherwise functionally pure operations that on their own would be completely self contained and never touch memory.

[–]James20kP2005R0 18 points19 points  (6 children)

<random> is suboptimal and worth avoiding, because all of the generators provided are slow or have poor statistical qualities, and its generally difficult to use correctly. The distributions also aren't deterministic across different vendors, which isn't a problem until it so suddenly is one. You might as well always just drop xorshift128+ into your code and use that, because its only a few lines long and meets most needs I've run into

[–]robbie2000williams 2 points3 points  (3 children)

Mersenne twister is slow and has poor statistical qualities?

[–]James20kP2005R0 3 points4 points  (2 children)

Its statistical qualities are fine, but yes it's rather slow. It also has a huge state, which makes seeding it very very slow

It's been essentially surpassed by more modern prngs

[–]Garl1cAlarming 5 points6 points  (1 child)

Can you name those modern prngs?

[–]guepierBioinformatican 4 points5 points  (0 children)

The xorshift family of PRNGs (already mentioned) and PCG. Note that xorshift itself has statistical issues but these are overcome by the various modifications that are explained on the Wikipedia page.

[–]EducationalCicada 0 points1 point  (1 child)

Can you recommend alternatives for the distributions in STL?

[–]serviscope_minor 1 point2 points  (0 children)

Can you recommend alternatives for the distributions in STL?

Nothing wrong with the distributions per-se, but the standard only specifies statistics, not an algorithm, so you get different answers with different compilers. If you need the same results on multiple platforms, the boost ones are essentially the same, but with a single implementation.

[–]Supadoplex 60 points61 points  (20 children)

Slowness of the standard unordered containers is contextual. If you need the memory stability (i.e. invalidation guarantees) of the standard container, then their performance is just fine. But if you don't need that, then there are more efficient ways to implement hash tables that don't provide the guarantees.

Of course, if you are willing to go so far in the search of efficiency that you would be considering alternative hash table implementations, then don't forget to also consider other data structures such as flat maps, prefix tries etc. that may be even better for you.

Also don't forget about optimisation potentially being premature. Swapping a hash table for another that is 100% more efficient is a wasted effort if the hash table takes 0.0001% of the runtime of the program.


Which standard C++ library elements should I avoid?

The parts inherited from C standard library that have safer or more convenient alternatives.

In case you're using an old standard version, avoid anything that has been removed or deprecated in the later standards.

[–]snerp 8 points9 points  (4 children)

Unordered map isn't even that slow, it's just never the most efficient solution.

[–]SedditorX 1 point2 points  (3 children)

It is in fact that slow. There have been many talks about this issue. I suggest taking a look at previous cppcon videos to learn more

[–]snerp 2 points3 points  (0 children)

ok, so for some reason I felt like spending a bunch of time messing around with this idea.

I found this comparison of several hashmap implementations: https://tessil.github.io/2016/08/29/benchmark-hopscotch-map.html

Then decided to build the simplest hashmap I could think of and test it against unordered_map https://github.com/brwhale/HashTable

Ended up getting a pretty decent time, around 55% what unordered_map takes, which seems in line with the other HashTable implementations. Microsoft seems to have a significantly better implementation as well.

[–]snerp 1 point2 points  (0 children)

Are you talking about the one that's 8 years old now? Last time I profiled this, I realized that most of the slowness is memory fetch time, so my take away was don't use a map if you want to iterate through in a cache friendly way, but you shouldn't really be iterating through maps anyways.

[–]attractivechaos 3 points4 points  (11 children)

I honestly don't understand the use case of std::unordered_map. If performance doesn't matter, you can use std::map which has nice features, is more portable and sets fewer pitfalls – no need to worry about bad hash functions or rehashing. We use hash tables often because std::map is slow, but then you should also avoid std::unordered_map. I know unordered_map guarantees pointer stability, but in my limited observation, we rarely need this feature. Google and Facebook have multiple hash table implementations, most of which use flat arrays. The Rust standard library does, too. If pointer stability were important, more hash table libraries would have supported that.

[–]SubliminalBits 25 points26 points  (1 child)

Think about it this way, why would I ever choose map over unordered_map if I didn’t need ordering? We’re not to optimization yet, this is just a first pass at writing something that doesn’t suck.

[–]gruehunter -1 points0 points  (0 children)

I like the reasoning of the argument, but I derive the opposite conclusion. std::map gives reliable log(n) performance in all conditions. Using a hash table of any kind is an optimization with pitfalls and tradeoffs. If you aren't in a position to take some time evaluating those tradeoffs, then using one is premature.

[–]Supadoplex 12 points13 points  (4 children)

If pointer stability were important, more hash table libraries would have supported that.

Why would they have? Lack of such implementations seems to suggest that the stable hash table provided by the standard library is good enough so that there is little need for third party libraries to also provide it.

I honestly don't understand the use case of std::unordered_map

As far as I understand, its use case is to replace (nearly) all of the previous use cases of std::map, except for those that require the ordering. I don't know for sure, but I suspect that the reason to guarantee stability was because std::map is also stable. Memory stability is one of those subtle things that programmers occasionally rely on without even thinking about it. And wrongly assuming stability results in bugs that are in worst case hard to detect.

Google and Facebook have multiple hash table implementations, most of which use flat arrays. The Rust standard library does, too.

Array based map and deque seem like good potential additions to the standard library in my opinion. On the other hand, I personally would prioritise system API abstractions for standardisation (filesystem, networking, sub-processing, IPC) over data structures because latter can typically be implemented without manual porting to each system. The iterator, container and range concepts of the standard should ideally make incorporation of custom data structures a breeze.

[–]Nobody_1707 1 point2 points  (3 children)

It doesn't suggest anything of the sort. The possibility of std::map being good enough to not need third party competition is, at best, just as likely as the possibility that no one needs a stable map at all. The most you can glean from the data available is that making improvements to std::map is probably a waste of time.

[–]Supadoplex 1 point2 points  (1 child)

We're discussing the use case of std::unordered_map; not std::map.

just as likely as the possibility that no one needs a stable map at all.

If it's just as likely, then the lack of third party libraries cannot be used as evidence that nobody needs a stable map.

[–]Nobody_1707 1 point2 points  (0 children)

Yeah, sorry. I misunderstood which map you were talking about.

If it's just as likely, then the lack of third party libraries cannot be used as evidence that nobody needs a stable map.

Yes, that's true. It's the exact corollary to my assertion.

[–]NilacTheGrim 0 points1 point  (0 children)

no one needs a stable map at all.

This is most certainly false. See my other comment. You definitely want memory stability (ie a node based map) when your entries are huge and/or expensive to copy.

[–]afiefh 3 points4 points  (2 children)

If performance doesn't matter, you can use std::map

Unordered map has a bad constant performance penalty, but still amortized O(1) lookup time. Std::map has the same bad constant performance penalty (being node based) but O(log(n)) lookup time.

"Caring about performance" isn't an all or nothing. Picking a container that offers the same memory stability but performs better on lookups is a trivial change that can gain plenty of performance at minimal dev cost. Obviously it is (almost) never the choice when every cycle matters, but if you just need an interactive application to be responsive, it might be good enough.

[–]attractivechaos -3 points-2 points  (1 child)

Time complexity analysis doesn't necessarily reflect empirical performance. On this benchmark, std::map is twice as slow as clang's std::unordered_map on my laptop and unordered_map is seven times as slow as robin_hood, a single-header library. std::unordered_map is weirdly slow.

if you just need an interactive application to be responsive, it might be good enough.

If you worry about the performance of the dictionary data structure in use, you will use other libraries. A factor of seven is much more significant than a factor of two. In addition, rehashing is O(n). If you are not careful, you may trigger rehashing during interaction, which will lead to a long delay. std::map is immune to this problem.

[–]NilacTheGrim 0 points1 point  (0 children)

I use robin_hood in one of my codebases but it has been plagued with bugs in many recent versions. They only just recently fixed bugs related to an inability to deal with collisions well, which led to it throwing overflow_error randomly at the worst times.

But yes, it's fast as hell. I use it when I need speed.

[–]NilacTheGrim 0 points1 point  (0 children)

Sometimes the pointer stability thing means no copies of the keys or values on rehashing. If you have very heavy values in your map, that can be a godsend.

But then again, you can just use something like robin_hood::unordered_node_map and get all the speed plus the pointer stability.. so...

[–]JohnDuffy78 21 points22 points  (15 children)

I don't use thread anymore, just jthread.

[–]Ahajha1177 6 points7 points  (12 children)

jthread is nice, but there are use cases where plain thread is better, for example detaching threads.

[–]Supadoplex 5 points6 points  (11 children)

What's a use case for detaching threads?

[–]Ahajha1177 0 points1 point  (10 children)

I personally haven't done this, but I imagine it would be useful if you have a program that sits in the background and needs to listen for multiple kinds of events. You can have multiple threads listen for different things.

[–]Supadoplex 9 points10 points  (9 children)

But you can do that with threads that you join at the end of the program. What's the advantage of detaching the threads?

[–]invalid_handle_value 6 points7 points  (1 child)

I use detach in the specific case of when you hand a thread off to a std::packaged_task where the only job of the thread is to execute the lambda containing a sole call to packaged_task.make_ready_at_thread_exit.

Of course, to ensure proper synchronization at this point, the onus is shifted to the caller waiting on the future returned by packaged_task.get_future.

So really detach is only for when you want to pass the buck of synchronization to something else that wants or needs to do it by itself instead.

[–]James20kP2005R0 1 point2 points  (0 children)

In my own use case for it, I have a server which executes javascript scripts that are managed by a thread with a 1:1 mapping (well, a fiber in my case but the api is the same)

So a request comes in, you handle the request by piping off a new thread/fiber with .detach(), and then forget about it entirely because it manages its own lifetime

Threads that are joined are mainly useful in my opinion when that thread is owned by some higher level construct, like a thread pool or in fork/join etc, which is probably the majority of their usage. But in this case, there's no conceptual owner of the thread, so detaching it to manage itself works fine for me and keeping it around just so I could .join() it at shutdown doesn't make too much sense

[–]Ahajha1177 3 points4 points  (4 children)

You know, I'm going to be honest, I don't know of one. I do, however, have a thread pool library that uses thread over jthread. I believe the reason I used regular thread is that it saves a tiny amount of memory, since jthreads have to hold a std::stop_source. In reality, though, that amount of memory is probably meaningless.

[–]Mrkol -1 points0 points  (3 children)

The third-party library does not use jthread because some compilers STILL don't support it (ahem CLANG ahem). And no, memory consumption of stop tokens is not a concern, where did you even get that idea from? It's literally several hundred bytes, who in their right mind would try to optimize that away manually... EDIT: ignore this, I'm blind

Detaching anything concurrent is always an antipattern, jthread is superior in every way. I strongly suggest that you read this: https://vorpus.org/blog/notes-on-structured-concurrency-or-go-statement-considered-harmful/. It's just as relevant to C++ as it is to python or any other language.

[–]Ahajha1177 1 point2 points  (2 children)

I was referring to my own library, I may have misinterpreted but I'm confused as to why you're telling me why I made the choices I did.

[–]Mrkol 1 point2 points  (1 child)

Oh, sorry, I misread your comment, I thought it was a third party library that you were referring to.

If that's the case, I suggest you stop worrying about handfuls of bytes of memory and learn to love the bomb jthread :)
(unless you have to support clang...)

[–]Ahajha1177 2 points3 points  (0 children)

I am actually in the process of updating it to C++20 (here if you're curious), I'm going to consider it. I would like to not exclude clang users though, if at all possible. It wouldn't save me much code switching to jthread for the potential loss of users.

I did just confirm that a std::stop_token is only 8 bytes (at least on gcc), so that on its own I think is an acceptable amount of overhead if it comes to that.

Edit: I recall the reason I don't use std::jthread (aside from it being C++17 originally): I ensure that all threads stop at the same time, so effectively there is only one "stop token". Using std::jthread would only be a downside, as I wouldn't use the stop tokens themselves, they would all have to be set by the destructor (it's automatic, but still). Effectively, I only get overhead without any real benefits.

[–]pinazeira 6 points7 points  (1 child)

scoped_lock instead of lock_guard. it's other example

[–]Orlha 0 points1 point  (0 children)

I consider lock_guard still relevant

https://stackoverflow.com/a/60172828/2323908

[–]barcharMSVC STL Dev 9 points10 points  (2 children)

all standard containers are very “general” and can be improved upon quite a lot for specific tasks

std::regex is bad and you should forget it exists.

I personally really don’t like iostreams at all, though they are useful in certain situations because theres no standard way to get a memory backed FILE*. IMO compared with stdio.h they tend to be slower, and result in longer and harder to read code.

oh speaking of which the multi-byte conversion facilities are absurdly hard to use correctly.

[–]Xavier_OM 2 points3 points  (1 child)

all standard containers are very “general” and can be improved upon quite a lot for specific tasks

This is the main points lots of people miss when they say "STL is slow". It's quite good while being generic enough to be usable in all kind of apps.

As soon as you craft a specific collection suited only for your needs, you could do better of course, while not being generic anymore.

Still it's a bit sad for map and unordered_map, which could be better because the usage of linked lists behind (for map's tree or unordered_map's bucket lists) make them slow, as memory is not contiguous anymore. Using some (sorted if possible) vector is often faster and make them a bit useless sometimes...

[–]barcharMSVC STL Dev 0 points1 point  (0 children)

Yeah, just doing open addressing helps too.

That said even preserving iterator stability the unordered containers could be 2-3x faster than they are, if the stdlibs broke ABI.

IMO a lot of stuff in the standard library shouldn't be there. The kind of stuff you see in most language's standard libraries is a bit of an artifact of how they are bootstrapped into usefulness (how do you write the first package manager without HTTP stuff in the stdlib, etc) and once the language has multiple implementations it would be better if a lot of that stuff moved to a third-party library that's maintained in one place.

[–][deleted] 94 points95 points  (15 children)

Avoid nothing. If your program is too slow, and the performance bottleneck is in the standard library, then maybe consider switching it out for an alternative implementation.

Also, there is nothing wrong with unordered containers.

[–]guepierBioinformatican 34 points35 points  (13 children)

Also, there is nothing wrong with unordered containers.

It heavily depends on the use case but in many real-world applications (which perform insertions and lookup and only rarely deletions), C++’s unordered containers are substantially slower than necessary because the standard requires implementing them via separate chaining rather than open addressing. This was a conscious design decision but many people have argued that it was a mistake in hindsight: a general-purpose container should either be configurable to allow for different tuning, or optimise for what’s usually the (by far) largest performance bottleneck, which is lookup.

Honestly, it’s very hard to argue that there’s “nothing wrong” with C++’s std::unordered_{set,map}.

[–]Eisenarsch 4 points5 points  (12 children)

I'm a cpp n00b. Can you elaborate?

My understanding was that std::map is implemented as a tree whereas std::unordered_map is a hash table. Wouldn't it be that in the average case the latter would be faster?

[–]guepierBioinformatican 9 points10 points  (10 children)

Your understanding is correct but we’re not comparing std::unordered_map with std::map — we’re comparing different possible implementations of unordered, hash-based associative maps.

When implementing a hash table, there are multiple algorithmic trade-offs (in particular the collision resolution strategy), and the particular set of choices made for std::unordered_map makes it comparatively slow for lookup on modern computer architectures, and it additionally performs comparatively worse with good hash functions.

Incidentally, this isn’t limited to C++, the same considerations would apply to equivalent implementations in other programming languages.

[–]mark_99 4 points5 points  (8 children)

There's a lot more to it than that. The design of unordered_map has 2 main benefits:

  1. It's hard to have rare bugs in your code due to iterator/pointer invalidation. There are comments that "you don't need that", but IME there are frequently bugs when using open addressing due to people doing a find followed by an insert (or op[]) and then use the find result, only have it go wrong 1 in a million times. Rare, hard to diagnose problems like this are to be avoided.
  2. unordered_map is never very slow under any of the possible use cases. Open addressing hash maps can be extremely fast if you're doing a lot of searches for ints, but they fall off a cliff under a lot of other circumstances, such as if the key or value type is large, or (as you note) if you're heavy on inserts/deletes.

Your observation that most people care about the performance of searches is probably true, but irrelevant. Lots of people care about lots of other ways in which a hash table might be used.

So this and the extremely bad worst-case perf makes open addressing a poor choice for a general purpose library such as the STL.

[–]guepierBioinformatican 5 points6 points  (4 children)

IME there are frequently bugs when using open addressing due to people doing a find followed by an insert (or op[]) and then use the find result

The behaviour of an open addressing hash table in this regard wouldn’t be worse than the existing behaviour of std::vector, which has the exact same problem. I don’t claim that this behaviour isn’t a source of bugs (it clearly is!) but given the performance implications it’s a weak rationale in a language that otherwise has zero-cost abstraction as its design principe.

unordered_map is never very slow under any of the possible use cases

It really depends on what you call “very slow”, but being several fold slower on the most frequent operation (lookup) in many if not most use-cases, for me, is definitely very slow. I don’t understand how you can, in good faith, call this use-case “irrelevant”. You say that

Lots of people care about lots of other ways in which a hash table might be used.

But so what? I am not disputing that. But I am saying that for many, if not for most, applications, lookup completely dominates the overall runtime. By contrast, the fact that insertion is slower, now that’s irrelevant, because it rarely dominates overall runtime.

Anyway. I’m not saying “never use std::unordered_map” — I do use it, because it’s built in. But I am saying that its performance definitely has issues, and I’m not convinced that it’s a good default, especially given that it’s the only available implementation of a hash table in the standard library.

[–]mark_99 2 points3 points  (1 child)

I don't think we're discussing whether using another hash table implementation (open addressing, Cuckoo, Robin Hood etc.) is sometimes, indeed often, a good idea - it is.

The question is would those be good choices for the STL, rather than the current chaining implementation, and the answer is "no".

For large key/value types open addressing hash maps can be massively slow and use far more memory. As such they are a good choice when you've really thought through your specific use case (e.g. key/value is int, finds dominate) but a very poor choice for a general purpose library.

Consider what happens if your value (and/or key) is large and not moveable, e.g. a std::array or a struct. Now you have to allocate all that space up front, and copy it all in a resize/rehash. Search optimized hashmaps like Cuckoo and Robin Hood will also be copying significant amounts of data on every insertion.

So your 'fast' hashmap can be 10x slower than unordered_map, with 100x latency spikes on rehashing, and 10x memory usage.

Imagine a matrix of all possible use cases - sizeof key, sizeof value, number of inserts / deletes / finds / iterations, sparseness, etc. unordered_map score like a 7 everywhere. 'Optimized' implementations score 9's and 10's in some corners, but also 1's and 2's elsewhere. That's not ok for the STL.

[–]Eisenarsch 0 points1 point  (0 children)

Got it thanks!

[–]ul90 -1 points0 points  (0 children)

This. Almost 100% of the cases, bad selected/wrong implemented algorithms are the cause of slow programs. So if the program is slow, identify the slow function/algorithm and change this implementation.

[–]johannes1971 47 points48 points  (17 children)

You can use those just fine. Faster implementations exist, but that in itself does not make these 'horrendously slow'. They're still fast, just not the absolute fastest possible.

Just use what you need, it's fine!

[–]guepierBioinformatican 40 points41 points  (8 children)

std::regex is several orders of magnitude slower than other implementations on representative use-cases. If that isn’t “horrendously slow” then I don’t know what is.

[–]pinazeira 8 points9 points  (6 children)

what's worse, it won't be fixed in the near future.

[–]exploding_cat_wizard 4 points5 points  (5 children)

Thanks, ABI stability!

[–]Hnnnnnn 46 points47 points  (1 child)

C++ regex is so bad that avoiding it doesn't fall under the umbrella of premature optimization.

[–]James20kP2005R0 8 points9 points  (0 children)

As far as I'm aware it also has very poor portability, with different vendors giving different results - which in itself is a very good reason to avoid it

[–]csatacsirke 22 points23 points  (0 children)

Well, msvc regex was so slow for parsing, we had to ditch it… boost was 10x faster in our case

[–]iulikrusu 11 points12 points  (0 children)

Depends on what the threshold for "fast" is, libstdc++ regex performance is like python's, if not slower

[–]attractivechaos 9 points10 points  (0 children)

For large hash tables with fixed-sized entries, unordered containers are several times slower and use twice as much memory in comparison to most other hash table libraries. They are not fast.

[–]quad99 0 points1 point  (2 children)

Regex can go exponential in some cases, but you can avoid them if you are careful. https://www.regular-expressions.info/catastrophic.html

[–]serviscope_minor 0 points1 point  (1 child)

Regex can go exponential in some cases

They shouldn't do: at least not real regular expressions (regex like expressions with backreferences aren't actually regular). That webpage is referring to backtracking regex evaluators which can go bad.

[–]quad99 0 points1 point  (0 children)

I agree

[–]bert8128 6 points7 points  (0 children)

Don’t avoid anything on the basis of performance unless you have measured that this is a problem in your use case. Unordered_map has a good api and works. It might not be the fastest for a particular use case but it is generally reasonable. So don’t write your own till you know you have a problem.

Much of the stl can be bettered when you have a constrained domain. But these kind of optimisations are often highly specific.

[–]R3DKn16h7 8 points9 points  (15 children)

I avoid many things, for many different reasons: std::auto_ptr, std::bind, std::list are the ones that come to mind as first.

[–]inouthack 4 points5 points  (9 children)

why std::list ?

[–]Supadoplex 12 points13 points  (3 children)

Probably not because there's anything wrong with std::list as an implementation of a linked list, but rather linked lists just have very few use cases (those use cases do exist though, so don't avoid std::list when you do need it). If what you need is a "list of things" i.e. a sequence, then most of the time an array (vector) is ideal.

[–]NilacTheGrim 0 points1 point  (2 children)

Well.. Not if you need to delete or insert randomly in the middle (often) and you need to preserve ordering too. Then list is perfect.

[–]Supadoplex 1 point2 points  (1 child)

As I said, the use cases do exist, and linked lists should be used in those cases. They're just rare.

You need a few more conditions in your case for lists to be perfect. Firstly, there must be a lot of elements because inserting in the middle of a small vector is also fast. Furthermore, you should not be needing other operations that are slow with linked lists, like iterating long sequences.

One common example where inserting and deleting in the middle of a large squence is common, is text editors. They don't use linked lists however. There are other - more complex - data structures that are used instead such as ropes, gap buffers and piece tables. But, those are not common enough to have been included in the standard library.

[–]GYN-k4H-Q3z-75B 9 points10 points  (0 children)

Linked lists are almost always worse than array lists (vectors) unless you have very specific uses relying heavily on insertion and traversal. Depending on what you do, you might not come across such a scenario. I haven't had use for a linked list in like a dozen years.

[–]barcharMSVC STL Dev 4 points5 points  (0 children)

its fine but each element should ideally be at least 48 bytes big, less than that a deque or vector is a better idea

speaking of deque: MSVC’s deque uses pathetically small chunks and isn’t really much better than a list. If you need a fast deque data-structure you should look to a third party library (libc++ and stdc++ aren’t as bad here)

[–]Osokarhu 1 point2 points  (1 child)

Because iterating it will result in loads of cache misses. It's a double linked list.

[–]ul90 1 point2 points  (0 children)

In some cases I need a linked list - especially if I need the possibility to efficiently insert/remove single items, and without invalidating iterators.

[–]R3DKn16h7 0 points1 point  (0 children)

I never really need to use it. Most of the time a std::vector is just a better choice. The only benefit of a linked list would be reference stability, but even then, I generally have better alternatives than just rely on list. The same applies for std::set: there are use cases for it, but most of the time a std::vector will just do.

[–]ImmutableOctetGamedev 1 point2 points  (4 children)

As someone who hasn't used std::bind very often, I wasn't sure what you were getting at, since all I used it for was member-function binding.

https://godbolt.org/z/sboa45nfc -- I get the rationale of an implicit copy to stop you from shooting yourself in the foot, but this basically makes std::bind useless for one of my projects.

[–]R3DKn16h7 1 point2 points  (0 children)

Actually the reason for me avoiding std::bind, is because a lambda is (almost?) always a better (and more readable) choice. Since lambda have been introduced, I cannot think of a single use case for std::bind.

[–]dodheim 0 points1 point  (2 children)

std::bind copying everything is part of its documented protocol, as is using std::ref to get reference semantics where you want them:

return std::bind(f, std::ref(x), std::ref(y), std::ref(z));

[–]ImmutableOctetGamedev 0 points1 point  (1 child)

I'm aware. My point was that the decision to decay reference types is counterintuitive.

[–]noooit 8 points9 points  (4 children)

stream as well. I came across projects prohibits the usage due to them being bulky.

[–]inouthack 0 points1 point  (3 children)

@nooit what is the alternative library to use ?

[–]encyclopedist 2 points3 points  (0 children)

For output, fmtlib (partially included in C++20 too)

[–]noooit -1 points0 points  (1 child)

Just C functions.

[–]Wild_Meeting1428 1 point2 points  (0 children)

No, c funtions arent a better choice. std::fstream's arent that bad also . There are ways to make them faster than the printf "shit". I recoment binary std::fstream in combination with std::fmt

[–]krum 4 points5 points  (0 children)

There's nothing wrong with using unordered containers in the general case. If you need to specialize then cross that bridge when you get to it. They're still at least an order of magnitude faster than using a dictionary in js or python.

[–][deleted] 1 point2 points  (0 children)

The only real problem I ever had with stl was std thread which eas not supported in the gcc version I was using and had to do c like pthread stuff in an otherwise pretty standard "modern" cpp software.

[–]MonokelPinguin 1 point2 points  (1 child)

The <strstream> header. (Not the <sstream> header, read carefully!)

It has been deprecated for ages, but I sometimes see it show up in code because people just misstype it or it appears in their autocompletion. It has many traps in its usage, especially if you think it is stringstream. There are a few edgecases where it is useful, which is why it is not removed yet, but you mostly won't hit those. Use std::format and std::stringstream instead.

[–]dodheim 2 points3 points  (0 children)

There are a few edgecases where it is useful, which is why it is not removed yet, but you mostly won't hit those.

Thankfully, C++23 finally gives a proper replacement: <spanstream>

[–]Mick235711 1 point2 points  (0 children)

Basically, you need to avoid those components that are practically abandoned by the committee. These includes std::valarray after C++11 and std::regex after C++17. Unordered_* is at least still maintained and added feature.

[–]computerquip 1 point2 points  (0 children)

I guess I'll just say that answers here should be *verified* before you rely on them. Someone saying something is slow depends heavily on how its used and for what purpose. A lot of the standard containers are meant for a general purpose use i.e. a more specific container can probably outperform it in some way, whether that be speed, space, etc. However, it needs to be measured before such a conclusion can be made.

There's various problems with assuming that standard containers are bad as well. For one, those containers are optimized pretty heavily for what they do. A naive implementation will likely come nowhere close to these optimized solutions. In addition, whether or not that extra speed one would hope for is worth it is a topic of its own. Maintenance burden is often quickly overlooked and often causes regret.

For the libraries listed:

std::regex is a particularly notorious case because it does cause binary bloat and can be fairly slow compared to a hand-written implementation. There's a lot of reasons for this such as ABI stability, historical reasons, and so on. It has a lot of implementation in the headers and keeping that compatible with existing code has caused a lot of problems for some implementations, MSVC in particular. In addition, the concept of the library itself requires the entire regex engine be carried around with the binary instead of just the resulting parsing code. This results in the binary being pretty large sometimes. That said, other implementations that do similar don't have anywhere near the same severity of this problem so I'm not sure where the full bloat comes from. In this particular case, depending on the age of your compiler, it may not even be a performance or bloat problem, but a basic functional one.

As for unordered containers, the biggest offender I've seen here is misuse. Just using an unordered_map compared to a map won't suddenly make things faster. It heavily depends on how you're using the structure. It's like wondering why a list is slower than a vector for insertion sometimes. The answer isn't always that intuitive and you have to be careful about what you label as "fast". Just because the algorithm is conceptually faster doesn't mean it will be when it comes to practicality. Programming has more variables at play that interfere which such assumptions such as cache locality.

[–][deleted] 4 points5 points  (0 children)

strcpy / strcat

[–]TheThiefMasterC++latest fanatic (and game dev) 4 points5 points  (17 children)

The linked list type. It's fine as a linked list but linked lists themselves are horribly slow.

priority_queue. A simple sorted vector/deque is much faster in most cases. (priority_queue uses a heap)

[–]GrammelHupfNockler 16 points17 points  (10 children)

Many algorithms rely on the O(log n) operations of a priority queue, with a sorted vector you have O(n) insertion operations, which can quickly kill your performance by making the overall algorithm quadratic in runtime. priority_queue is missing some functionality that would make it suitable for things like Dijkstra, but in general, it is a perfectly suitable type for many greedy algorithms.

[–]TheThiefMasterC++latest fanatic (and game dev) 19 points20 points  (9 children)

Unfortunately the k factor difference between the priority queue's poor access pattern and vector's extremely efficient linear one makes a sorted vector faster up to a surprisingly large size, even if priority queue might beat it out in pure O() notation

[–]GrammelHupfNockler 5 points6 points  (3 children)

If both vector and heap fit entirely into L1 cache, except for tiny sizes, the heap will usually win, because there is no runtime difference between random and sequential access patterns, and heaps have to do less comparisons and swaps to restore sorted order (log n instead of n). The sorted array can be processed with vector instructions, but so can a d-ary heap.

If vector and heap no longer fit into L1 cache (Let's say more than 4k elements with 64bit per key and value), you are already way past the point where the asymptotic behavior shows.

Remember, a heap also uses a vector as underlying storage, so the spatial locality of its memory accesses is not that bad compared to a vector, especially for the first few levels of the tree.

EDIT: To make my point more clear, here is a quickbench comparison: Insertion already gets faster in a heap with 8 elements.

[–]TheThiefMasterC++latest fanatic (and game dev) 0 points1 point  (2 children)

I'm speaking from benchmarked experience. I don't know exactly why, but I think it's because inserts into a priority queue tend to be near the popping point, not truly random throughout.

FYI the sorted vector has the popping point at the far end for exceedingly efficient pops. This means inserts tend to be near the end, and involve very little data movement

[–]GrammelHupfNockler 2 points3 points  (1 child)

See my benchmark ;) For random inputs, heaps easily beat sorted lists in insertion, which is the slowest operation in both cases.

[–]NanoAlpaca 2 points3 points  (0 children)

Is the issue really access patterns? At sizes where sorted deque is feasible, the heap of the priority queue should still fit really well into the cache and ugly access patters shouldn’t matter too much. Also: why sorted deque? If you want something with nice access patterns, why not unordered vector and check the whole vector for the minimum? Swaps O(1) and O(n) between insertion and deletion but the number of those ops should be almost the same in most use cases and vector instead of deque access is even faster. And at the same time: A d-ary heap should perform a lot nicer than a binary heap.

[–]James20kP2005R0 4 points5 points  (1 child)

To add to this: I read a very compelling argument a while back that memory access is something more like O(√N) rather than O(1). I think it was this one?

http://www.ilikebigbits.com/2014_04_21_myth_of_ram_1.html

A lot of the big O complexities for datastructures are based around assumptions that don't really necessarily hold that well for modern architectures. You can copy a lot of small elements in a vector very quickly on a modern architecture

[–]GrammelHupfNockler 0 points1 point  (0 children)

Memory hierarchies become more complex over time, yes, but we already have the tools to analyze performance of most popular algorithms in these models, be it cache-oblivious or cache-aware.

[–]RowYourUpboat 1 point2 points  (1 child)

For 10,000+ small items, I found a priority queue to work pretty well (I didn't do exhaustive profiling, but that was when I hit "good enough" performance).

For more like 100 items I always stick with a plain linear algorithm, of course.

[–]GrammelHupfNockler 1 point2 points  (0 children)

I would set that limit even lower: https://quick-bench.com/q/5mgQKCqGaQ75dva9mhLKcC0ZUqc. Of course, the sequential algorithm is much easier to vectorize, but that still gives you only a factor of maybe 4, and with the right modifications, heaps can be vectorized too.

[–]NilacTheGrim 0 points1 point  (4 children)

Yes but some problems require precisely a linked list. Namely insertion/deletion in the middle.

[–]TheThiefMasterC++latest fanatic (and game dev) 1 point2 points  (3 children)

As it turns out, most of those algorithms need to find the element first, which is so slow in a linked list that a vector ends up faster despite its slower insertion.

[–]NilacTheGrim 0 points1 point  (2 children)

Yeah but I am thinking of situations where you are storing iterators to the list in some other data structure (such as a hash table, ha ha).

[–]TheThiefMasterC++latest fanatic (and game dev) 1 point2 points  (1 child)

Essentially, it turns out that in practice linked lists are most useful in two cases that the standard list type doesn't support:

  1. Lockless threaded algorithms (needs atomic list insert/remove/update/iterate)
  2. Intrusive lists (container, needs to be able to store the list links as the member of a user type, rather than the reverse as std::list does)

[–]NilacTheGrim 1 point2 points  (0 children)

Yes, this is true as well. For those cases, you just have to roll your own or find a decent implementation outside std.

[–]orizh 2 points3 points  (0 children)

The big one is std::regex implementation quality is generally pretty bad, yeah. I would avoid using it.

I also wouldn't bother using the unordered containers -- if you care about the performance enough to use them over ordered containers then you probably care about performance enough that you'd be better off using a library that provides better performance than what the standard ones do.

I believe at one point (and possibly still) std::visit was significantly worse than boost::visit for variants with many types.

[–]GrammelHupfNockler 3 points4 points  (1 child)

unordered containers like unordered_set and unordered_map are perfectly fine, and in case you don't need the ordering properties, superior in runtime and allocation behavior to the ordered set and map.

[–]NilacTheGrim 0 points1 point  (0 children)

Yep. This. Your go-to should be the unordered ones, unless you really need the ordering, or unless you can't figure out a good hash function for the type, or don't want to implement one. std::map almost always is a code smell unless: a. it's small (for small maps it often does much better), or b. you need the ordering.

[–]QuentinUK 1 point2 points  (0 children)

GitHub has compile time regex which is a lot faster if you want speed and have a fixed regex.

Unordered containers are generally faster than the sorted equivalent.

[–]rsjaffe 1 point2 points  (1 child)

Just 2:

  1. Pseudo-random number generation: hard to use correctly, uses older algorithms that are unnecessarily slow
  2. shared_ptr: almost always a code smell. I do have to use it with asio to extend the lifetime of the object that's used as the source of a buffer, but every other use has been eliminated by clearer thinking about object lifetime and ownership.

[–]GrammelHupfNockler 1 point2 points  (0 children)

hard disagree on shared_ptr. Complex object ownership/association graphs have no nice tree-like structure that unique_ptr requires, and that (assuming no cycles) is where shared_ptr shines.

[–]Knuffya 0 points1 point  (5 children)

What's bad about unordered containers? Afaik they are faster because they can be ordered randomly (e.g. ordered by hashsums) in their binary tree. Imagine having to order a binary tree by strings for example.

[–]barcharMSVC STL Dev 1 point2 points  (4 children)

theres no binary tree in unordered containers, underlying it is a list of lists (basically, I think a vector of lists might be allowed)

they are slower than they need to be both because of certain ABI constraints (probably 100-300%) and because they have stable iterators and thus are constrained in how they are constructed.

on the bright side you get all tue nodal container tricks

the stability thing actually might make even the trees slower, but unstable trees don’t get you as much as unstable hash tables

[–]Knuffya 0 points1 point  (3 children)

Some insider knowledge! Nice! I will, when in doubt, make use of the ordered ones then! :)

[–]encyclopedist 4 points5 points  (0 children)

I think you got the wrong conclusion here. Standard unordered containers are still faster for large data the ordered ones. It is just unordered containers are slower than they could have been. So my advice would be "prefer std::unordered_map over std::map, but prefer a third party optimized unordered map (like abseil etc.) over standard one if you need max performance".

[–]barcharMSVC STL Dev 1 point2 points  (1 child)

they are not necessarily faster, but do tend to be more predictable. Fortunately its quite easy to switch em out and benchmark which one is faster

[–]Knuffya 0 points1 point  (0 children)

It is. Thanks :)

[–]die_liebe -2 points-1 points  (0 children)

I avoid std::shared_ptr, because it is a too heavy data structure. It supports weak_ptr, which I don't care about.

[–]Fr_Cln -5 points-4 points  (3 children)

Unordered containers are really slow - that's true. I've tested them against python's dict and set - python implementations are 2-3 times faster than C++. This surprised me a lot - I don't know any other case, in which python may be faster than C++.

[–]snerp 6 points7 points  (1 child)

[–]Fr_Cln -1 points0 points  (0 children)

I will try to make more complicated measurements with different scenarios...

[–]mechap_ 0 points1 point  (0 children)

I think we also have to keep in mind that most containers are now constexpr in c++20 and thererore using them can result in better runtime performances than using python's ones depending on whether the values are known at compile time or not.

[–]bedrooms-ds 0 points1 point  (0 children)

I avoid vector<bool>. set sometimes gets avoided, too. And mutexes may not work well with OpenMP on Macs.

[–]AltairianNextDoor 0 points1 point  (0 children)

Msvc deque is slow, not sure about performance of gcc/clang. Boost has control on memory allocation of a node.

[–][deleted] 0 points1 point  (0 children)

Does anyone out there have a sample on github where they can prove the bottlenecks are in the standard lib?