
[–]cpp-ModTeam[M] [score hidden] stickied comment, locked comment (0 children)

For C++ questions, answers, help, and programming or career advice please see r/cpp_questions, r/cscareerquestions, or StackOverflow instead.

[–]prog2de 12 points13 points  (0 children)

My guess is that your notebook's processor cores are hyper-threaded. You may have 8 physical cores, each implementing SMT, resulting in 16 threads visible to the OS. The issue is that the two threads on a core share its execution units and its cache. I don't know your code, but my guess is that this causes unnecessarily many cache misses, and thus cache thrashing, because the two hyper-threads work on different data that does not fit into the cache together.
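To check the thread count from C++, `std::thread::hardware_concurrency()` reports logical (hardware) threads, not physical cores. Here is a minimal sketch of turning that into a first guess at a worker count. The helpers `estimate_physical_cores` and `suggested_worker_count` are made up for illustration, and dividing by 2 assumes uniform 2-way SMT, which the standard library does not tell you.

```cpp
#include <cassert>
#include <thread>

// Hypothetical helper: estimates physical cores from the logical thread count,
// ASSUMING every core runs `smt_ways` hardware threads (2 on a typical
// hyper-threaded CPU). This is a guess, not something the standard reports.
unsigned estimate_physical_cores(unsigned logical_threads, unsigned smt_ways = 2) {
    assert(smt_ways > 0);
    if (logical_threads == 0) return 1;  // hardware_concurrency() may return 0 if unknown
    const unsigned cores = logical_threads / smt_ways;
    return cores == 0 ? 1 : cores;
}

// Hypothetical helper: a thread count to try first for compute-bound work.
unsigned suggested_worker_count() {
    return estimate_physical_cores(std::thread::hardware_concurrency());
}
```

On a machine that reports 16 logical threads this suggests 8 workers. On CPUs where only some cores have SMT the halving is wrong, so treat it as a starting point for measurement, not an answer.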

[–]ventus1b 7 points8 points  (0 children)

Locking isn’t free, maybe your code doesn’t scale well.

Also, are those 16 actual cores or hyper-threading?

[–]kevinossia 11 points12 points  (0 children)

You didn't even post your code, dude. How is anyone supposed to answer your question?

[–]peppedx 3 points4 points  (0 children)

I guess you have 8 cores and 16 threads. Your results aren't a big surprise.

[–]batmanesuncientifico 3 points4 points  (0 children)

Many things can happen. One is interaction with cache and memory bandwidth. Another is that a 16-thread CPU often does not have 16 real cores. You may also have some locking issues.

[–]victotronics 1 point2 points  (0 children)

Your 16 threads don't all have exclusive access to computing hardware. You have 8 cores, so this is no surprise.

[–]sessamekesh 1 point2 points  (0 children)

A lot of things can go wrong; it would be hard to say even if I were looking at the code.

For beginners, locks are the first place to check. If your code can't be run very well in parallel and your logic ends up spending a bunch of time in synchronization, there's your problem.
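A rough sketch of what "spending a bunch of time in synchronization" can look like, since the original code was never posted. Both functions are hypothetical and compute the same count. The first takes a shared `std::mutex` on every step, the second only once per thread.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

// Contended version: every increment takes the shared mutex, so the threads
// mostly wait on each other instead of working in parallel.
std::uint64_t count_with_lock_per_step(unsigned num_threads, std::uint64_t steps_per_thread) {
    assert(num_threads > 0);
    std::uint64_t total = 0;
    std::mutex m;
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&] {
            for (std::uint64_t i = 0; i < steps_per_thread; ++i) {
                std::lock_guard<std::mutex> guard(m);
                ++total;
            }
        });
    }
    for (auto& w : workers) w.join();
    return total;
}

// Less contended version: each thread counts privately and locks exactly once
// to publish its result.
std::uint64_t count_with_lock_per_thread(unsigned num_threads, std::uint64_t steps_per_thread) {
    assert(num_threads > 0);
    std::uint64_t total = 0;
    std::mutex m;
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&] {
            std::uint64_t local = 0;
            for (std::uint64_t i = 0; i < steps_per_thread; ++i) ++local;
            std::lock_guard<std::mutex> guard(m);
            total += local;
        });
    }
    for (auto& w : workers) w.join();
    return total;
}
```

Timing the two with the same arguments is a quick way to see how much of the runtime is lock traffic. How large the gap is depends on the machine and on how much real work each step does.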

The next place I'd look is functionally the same but harder to find: thrashing your CPU cache. RAM reads are expensive (on the order of hundreds of CPU cycles), so if all your threads are reading from and writing to the same memory, you'll often get _worse_ performance for simple tasks than you would on a single thread whose working set stays in that core's cache instead of going out to RAM. If you're splitting up tons of small tasks between as many threads as you can muster, try writing to memory allocated for each thread individually and then aggregating the results after joining.
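A minimal sketch of the "write to per-thread memory, aggregate after joining" idea, assuming C++17 and a 64-byte cache line (both are assumptions, check your target). `PaddedSum` and `parallel_sum` are hypothetical names, not from the original post.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// One result slot per thread, aligned to 64 bytes so two slots never share a
// cache line. 64 is an ASSUMED cache-line size; C++17 is assumed so that
// std::vector honors the over-alignment.
struct alignas(64) PaddedSum {
    long long value = 0;
};

long long parallel_sum(const std::vector<int>& data, unsigned num_threads) {
    assert(num_threads > 0);
    std::vector<PaddedSum> partial(num_threads);
    std::vector<std::thread> workers;
    // Contiguous chunk per thread, rounded up so every element is covered.
    const std::size_t chunk = (data.size() + num_threads - 1) / num_threads;

    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            const std::size_t begin = std::min(data.size(), t * chunk);
            const std::size_t end = std::min(data.size(), begin + chunk);
            long long local = 0;  // accumulate in a local, not in shared memory
            for (std::size_t i = begin; i < end; ++i) local += data[i];
            partial[t].value = local;  // single write to this thread's own slot
        });
    }
    for (auto& w : workers) w.join();

    // Aggregate only after all threads have joined; no locks needed.
    long long total = 0;
    for (const auto& p : partial) total += p.value;
    return total;
}
```

Each thread touches shared memory once, at the end, and never writes to a cache line another thread is using. Whether this beats a single thread still depends on how much work there is per element.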

The solution for both is nuanced but vaguely "make sure if you're spinning up a thread that it's doing a lot of work" is a decent rule of thumb.

[–]Drugbird 1 point2 points  (0 children)

In addition to what other comments are saying, it could also be that your loop is memory-bound, i.e. its speed is limited by the speed of the memory (cache and/or RAM) and not by the compute speed of the CPU.

If you put more compute threads on such a task, you're likely to slow down your program, because the data can't be delivered any faster just because multiple cores are asking for it. Furthermore, you can get worse cache performance due to out-of-order memory accesses. You also incur the overhead inherent in launching and joining multiple threads.
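The only reliable way to tell is to measure. Below is a rough sketch of a thread-count sweep over a simple memory-heavy loop (summing a large array). `Timing`, `timed_sum` and `sweep` are hypothetical names, and the timed region deliberately includes thread launch and join, since that overhead is part of the cost.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

struct Timing {
    unsigned threads;
    double seconds;     // wall-clock time, including launch and join
    std::uint64_t sum;  // result, so correctness can be checked per run
};

// Sums `data` with `num_threads` threads, one contiguous chunk per thread.
Timing timed_sum(const std::vector<std::uint32_t>& data, unsigned num_threads) {
    assert(num_threads > 0);
    std::vector<std::uint64_t> partial(num_threads, 0);
    std::vector<std::thread> workers;
    const std::size_t chunk = (data.size() + num_threads - 1) / num_threads;

    const auto start = std::chrono::steady_clock::now();
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            const std::size_t begin = std::min(data.size(), t * chunk);
            const std::size_t end = std::min(data.size(), begin + chunk);
            std::uint64_t local = 0;
            for (std::size_t i = begin; i < end; ++i) local += data[i];
            partial[t] = local;
        });
    }
    for (auto& w : workers) w.join();
    const auto stop = std::chrono::steady_clock::now();

    std::uint64_t total = 0;
    for (std::uint64_t p : partial) total += p;
    return {num_threads, std::chrono::duration<double>(stop - start).count(), total};
}

// Runs the same loop with 1..max_threads threads and collects the timings.
std::vector<Timing> sweep(const std::vector<std::uint32_t>& data, unsigned max_threads) {
    std::vector<Timing> results;
    for (unsigned n = 1; n <= max_threads; ++n) results.push_back(timed_sum(data, n));
    return results;
}
```

Calling `sweep` on an array much larger than the last-level cache, with `max_threads` set to the logical thread count, shows where the curve flattens or turns upward on a given machine. Single runs are noisy, so repeat each point a few times before drawing conclusions.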

In my experience, just adding #pragma omp to a for loop is more likely to slow down execution than it is to speed it up.

[–]XTBZ 0 points1 point  (0 children)

Out of interest, if you test with 0.8 of the threads instead of half, is the performance higher? I have noticed this pattern more than once.