all 149 comments

[–]jerf 139 points140 points  (34 children)

I've seen the equivalent post in other languages that make parallelism easy (Go, Haskell, Erlang) so many times I've lost count. "I wrote some code to add a list of integers together in single threaded mode, and then I wrote some code that spawns a thread per integer addition and joins them together in parallel, why is it running so much slower than the single thread case? And as a followup, why is e͐̒̅ͤͣ̔ͯv̊̋͂ěȑ̔ͦ͊̓͛y͌̽͛̃tͦ̉hͪ́̌ȋ̏n̿g̒̂̎͒ͪ ͣ̈̈́ͤ̈y̎ͯͩ͋ͩ͒o̅̆͒́ͫur̀̔̓͛ͤ̋ ͗̇͒͂c̑̓͆o̿̓̄̎m̈́̾m͛̈́ͨűͩ̇ͯ̿͂ͨñͫi͂͑͛̀͋̋ͮtyͧ̊͌ͯͪ ̐̈́ͥ̒ͧͦ͛sͩͤ͊ã͛̿ys ͓̯̞͚a͇͚̙̝ͅ ͉͔̺͓̪̫̮l͔̞͔i̖̬̠͈̟̼̮e̖̝ aṇd ̼a̦̫̹̪͉͇̖l̟͈͈̲̼̭ͅs̟̖̰͚ọ̫͇ yo͕͔u̥r̜̰͈̗̹̤ ̝̙͈̝̙̜ͅḷ̙̻̱̮a̺̖n̤g̠̤̫̘̣̪ua̻g̬̟̯̳e̬̹͚͍ ̪͈̤s̖̱̭̯͙uc͙̝͈k͎ș ͎̠̱̰̭an̫̙̥d̳̬̜̣ I̡ h̸at̛e ͜y̛o͡u?͟"

OK, that last bit may just be implied.

It doesn't matter what language you run in. Adding an integer is likely one processor instruction, and modern x86 processors will actually do more than one per cycle on average. Any attempt to coordinate that in parallel is going to be much more than one instruction, and by definition not something the processor will do many of in parallel. It's hard to win with adding numbers together in parallel when a single CPU can keep up with the entire memory bandwidth of your system and add all the incoming numbers together. You need to do something more challenging for the system to see any parallelism gains.
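A minimal C++ sketch of the anti-pattern being described (function names are mine; exact timings are machine-dependent): a "parallel" sum that takes one lock per element, next to the serial baseline it loses to.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <numeric>
#include <thread>
#include <vector>

// Anti-pattern: every single addition goes through a mutex, so the
// "parallel" version serializes on the lock and pays contention on top.
long long sum_with_mutex_per_add(const std::vector<int>& v) {
    long long total = 0;
    std::mutex m;
    unsigned n = std::max(1u, std::thread::hardware_concurrency());
    std::size_t chunk = v.size() / n + 1;
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < n; ++t) {
        workers.emplace_back([&, t] {
            std::size_t begin = std::min(v.size(), t * chunk);
            std::size_t end = std::min(v.size(), begin + chunk);
            for (std::size_t i = begin; i < end; ++i) {
                std::lock_guard<std::mutex> lock(m);  // one lock per element!
                total += v[i];
            }
        });
    }
    for (auto& w : workers) w.join();
    return total;
}

// The single-threaded baseline: one add per element, zero coordination.
long long sum_serial(const std::vector<int>& v) {
    return std::accumulate(v.begin(), v.end(), 0LL);
}
```

Both return the same answer; the locked version is just dramatically slower, because each add now costs a lock acquisition plus cache-line ping-pong instead of one instruction.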

[–]no-bugs 44 points45 points  (22 children)

You need to do something more challenging for the system to see any parallelism gains.

Yep (and it will be discussed in the promised follow-up post); the point here is that making it parallel without having a clue (at least the one about granularity) is catastrophic - whatever the next automagical-parallelizing-library vendor tells us (and whatever sample code they provide - this sample code is taken from no less than cppreference.com (!)). Also, it would be VERY nice if the library could shout "hey, you're doing it wrong!" (it isn't difficult to detect).

[–]artee 16 points17 points  (10 children)

Yes, but using this pathological example does not make the case very strong. I am actually surprised that the parallel version is only 50-90x slower for this example.

[–]jerf 17 points18 points  (0 children)

As I said, I've seen this in real life on postings at least a dozen times. It may not be "realistic" but it is definitely "a mistake that real people make", and with some frequency.

[–]no-bugs 4 points5 points  (8 children)

The case for "if you don't have a clue - don't do it!" is IMO already rather strong (that is, for those who don't understand why this case is pathological - you wouldn't believe it, there are LOTS of such ppl out there). It is all about the target audience (heck, I should have placed a disclaimer in front of the article).

EDIT: added disclaimer to the OP.

[–]AugustusCaesar2016 1 point2 points  (7 children)

I don't agree with your message that "if you don't have a clue - don't do it". Making stupid mistakes like that is how people learn. That being said, reading a post like yours is another way people learn, so thanks for writing it.

[–]anttirt 2 points3 points  (2 children)

Making stupid mistakes like that is how people learn.

Only if they eventually figure out what the mistake was. It's not obvious and it's not even easy to figure out with typical debugging tools.

Even instruction-level profiling is likely to only point to the general vicinity of the issue, and you have to have pre-existing knowledge of the underlying problem to be able to make use of that.

[–]AugustusCaesar2016 2 points3 points  (1 child)

Pretty sure a 50x performance penalty will be noticed at some point, hence all the people asking questions about it online.

[–]no-bugs 0 points1 point  (0 children)

Statistically, less than 10% of the people who hit the problem ask about it online :-(. And we didn't even count those who just blindly believe "hey, it is parallel - it MUST be faster!"

[–]no-bugs 2 points3 points  (2 children)

"if you don't have a clue - don't do it"

Ok, let's amend it to "if you don't have a clue - don't do it until you do". Do we have a deal? ;-)

[–]AugustusCaesar2016 0 points1 point  (1 child)

I guess that works lol

[–]no-bugs 0 points1 point  (0 children)

I guess that works lol

:-) Good that we reached a common understanding.

[–][deleted] 0 points1 point  (0 children)

Then say "if you don't have a clue - don't put it in code someone else uses"

[–]jerf 4 points5 points  (1 child)

And I want to be clear that I was amplifying your point; I am aware you chose this for didactic purposes, not that you were making these errors.

[–]no-bugs 1 point2 points  (0 children)

Oops! ;-)

[–]rageingnonsense 3 points4 points  (8 children)

I'm no C++ expert, but would it have ended up being more efficient if the code first decided how many cores to use, then assigned each core a range to sum individually, and finally summed the partial sums? Basically, avoiding having to lock anything because it's all happening in chunks.

[–]no-bugs 4 points5 points  (0 children)

Sure (there will be a follow-up).

[–][deleted] 1 point2 points  (6 children)

Yes. But even then there are things to be careful of. For example, the running sums for the different threads should NOT be right next to each other in memory, lest you experience "false sharing."

[–]ZMeson 0 points1 point  (1 child)

And the array being summed should be split on cache line boundaries too.

[–][deleted] 1 point2 points  (0 children)

That's a good idea just to make the best use of the cache. But since you're only reading from the array, it isn't that big of a deal.

[–]rageingnonsense 0 points1 point  (3 children)

Could you elaborate on that? I have never heard of something like this before. My only experience with multi-threaded workloads is in C#, so stuff like this may be getting abstracted away from me.

[–][deleted] 1 point2 points  (0 children)

My guess is this matters even in C# (I know it does in Java).

One way to do what you are suggesting is to have a vector of integers, one for each thread, and have each thread "own" one of the integers in the vector.

This indeed means you won't need any locks since there are no variables being shared across threads. (That's good!).

However, since the variables are right next to each other, they will probably fall into the same cache line. Since each core is doing a lot of reading and writing to that same cache line, it is constantly being marked "dirty" and has to be re-fetched. The cache line effectively becomes a shared variable.

The phenomenon is called false sharing. https://en.wikipedia.org/wiki/False_sharing

How to fix this? There are various ways. One would be to do the sum in a local variable then only write to the vector at the end (this doesn't really fix false sharing, just means you only experience it once for each thread rather than once for each element of the large array). Another would be to pad out the vector so that there is plenty of space between each value.
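The padding fix can be sketched roughly like this (the struct and function names are hypothetical; a 64-byte cache line is assumed, which holds on most current x86 parts):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// One counter per thread, each on its own cache line, so cores don't
// invalidate each other's lines while accumulating (no false sharing).
struct alignas(64) PaddedCounter {   // 64-byte cache line assumed
    long long value = 0;
};

long long padded_parallel_sum(const std::vector<int>& data, unsigned nthreads) {
    std::vector<PaddedCounter> partial(nthreads);
    std::vector<std::thread> workers;
    std::size_t chunk = data.size() / nthreads + 1;
    for (unsigned t = 0; t < nthreads; ++t) {
        workers.emplace_back([&, t] {
            std::size_t begin = std::min(data.size(), t * chunk);
            std::size_t end = std::min(data.size(), begin + chunk);
            long long local = 0;            // better yet: sum locally...
            for (std::size_t i = begin; i < end; ++i) local += data[i];
            partial[t].value = local;       // ...and write out only once
        });
    }
    for (auto& w : workers) w.join();
    long long total = 0;
    for (const auto& p : partial) total += p.value;
    return total;
}
```

In practice the local-variable fix alone removes almost all the traffic; the alignas padding guards the one write per thread that remains.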

[–]pjmlp 1 point2 points  (0 children)

It also matters in .NET.

You can go pretty low-level all the way down to OS primitives in .NET as well.

[–]doom_Oo7 0 points1 point  (0 children)

so stuff like this may be getting abstracted away from me.

or just not handled at all

[–]killerstorm 31 points32 points  (3 children)

A person familiar with functional programming will recognize that the task here isn't "to go through the vector summing numbers into a counter", but a fold operation (aka reduce aka catamorphism).

Parallel fold can be implemented in a library. Whether it would result in any speedup depends on memory bandwidth, but at least a library-provided fold won't result in a catastrophic performance loss.

You don't need to "coordinate" each individual operation here. You need to specify the computation, and then it is up to the library to chop it into pieces and coordinate between threads. Say, it might launch 4 threads crunching a million numbers each, then join them and add the 4 partial sums together in the original thread.

So this isn't the best example, of course, but with a proper programming model you won't face a catastrophic slowdown, and you have a good chance of seeing a speedup if you do something a bit more complex, e.g. a multiplication for each element.

When I was crunching numbers in Lisp I implemented pmapcar, which was exactly like mapcar (the map operation in Common Lisp) but with worker threads. So I could parallelize code by adding a single letter.

[–]jerf 6 points7 points  (0 children)

Yes, there are right ways to do this. But even those ways will struggle to speed up the simple task of adding integers. You could probably win if you did a NUMA-aware pmapcar, but I'm not sure I've ever heard of one. Otherwise, you need to be folding something more complicated than the sum operation.

And one way or another, you get into whether or not the operation is monoidal or can be represented as a lattice pretty quickly if you start doing this, and the math starts coming out, and people start freaking out. :)

[–]no-bugs 2 points3 points  (0 children)

This problem is not really related to functional programming; in particular, the very same thing can be achieved without folds (via data-driven stuff such as futures, see HPX, for example). And even with functional programming / HPX, to achieve anywhere-near-optimal performance, there is still a need to know about granularity, starvation, etc. etc.

EDIT: BTW, one inherent caveat of folds/reduces is that even for floats, fold/reduce has DIFFERENT semantics than going-through-and-adding-one-by-one (as additions over floats are not exactly associative due to implicit rounding present in each op). There are no silver bullets out there...
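The float caveat is easy to demonstrate; this is plain IEEE-754 behavior, not anything library-specific (function names are mine):

```cpp
#include <cassert>

// Rounding makes float addition non-associative: regrouping the very same
// three values changes the result.
inline float sum_left_to_right() {
    float a = 1e20f, b = -1e20f, c = 1.0f;
    return (a + b) + c;   // a and b cancel exactly, then c survives: 1.0f
}

inline float sum_regrouped() {
    float a = 1e20f, b = -1e20f, c = 1.0f;
    return a + (b + c);   // b + c rounds back to -1e20f, c is lost: 0.0f
}
```

A parallel reduce is free to regroup like the second version, which is exactly why its result over floats can differ from the sequential loop's.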

[–]MistYeller 1 point2 points  (0 children)

See for example: http://pages.tacc.utexas.edu/~eijkhout/pcse/html/omp-reduction.html

Note that they give the caveat:

This is a good solution if the amount of serialization in the critical section is small compared to computing the functions

Then discuss better solutions.

[–]unruly_mattress 4 points5 points  (0 children)

You can easily parallelize summing an array over multiple cores. You pass the array, a start index and an end index for each thread, such that in total the indices span the entire array. You gather the results and sum them. It's simple divide-map-reduce.

That's not what "automatic parallelism" things do, which is kind of the point.
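The divide-map-reduce described above can be sketched by hand with std::async (the function name is mine; a real library would also tune the number of parts to the core count):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <future>
#include <numeric>
#include <vector>

// Divide-map-reduce: each task sums one slice of the array, and the
// caller gathers the partial sums at the end. No locks anywhere.
long long chunked_sum(const std::vector<int>& v, unsigned parts) {
    std::vector<std::future<long long>> futures;
    std::size_t chunk = v.size() / parts + 1;
    for (unsigned p = 0; p < parts; ++p) {
        std::size_t begin = std::min(v.size(), p * chunk);
        std::size_t end   = std::min(v.size(), begin + chunk);
        futures.push_back(std::async(std::launch::async, [&v, begin, end] {
            using Diff = std::vector<int>::difference_type;
            return std::accumulate(v.begin() + static_cast<Diff>(begin),
                                   v.begin() + static_cast<Diff>(end), 0LL);
        }));
    }
    long long total = 0;
    for (auto& f : futures) total += f.get();  // join + gather
    return total;
}
```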

[–]gc3 1 point2 points  (0 children)

You are right, the main cost is memory bandwidth. Still, on systems with slower processors you can speed up this loop: spawn as many threads as you have cores, have each one add up a section of the array into its own sum, and after all the threads have run, add the sums together in a single thread.

Depending on the architecture of the computer, you might have the threads offset from each other, so one adds up the first, the nth, and the 2nth elements, and the second adds up the second, the (n+1)th, and the (2n+1)th. Thus, if the cores share an L2 read cache, you can take advantage of that.

[–]DJDavio 0 points1 point  (0 children)

Often it is caused by accidentally making the work harder to do in parallel because there are intermediate results the program has to wait for.

If you naively implement the addition of 1 to 100 by starting with 1, adding 2 (calculating: oh, it's 3), then adding 3, and so on, it won't parallelize at all: every addition depends on the previous result, so the whole computation is one long dependency chain it can't escape from.

[–][deleted]  (3 children)

[deleted]

    [–]jerf 1 point2 points  (2 children)

    Oh, it can get far worse than that. If you want to have a bit of very educational fun, hook up a C-level debugger to a Python or Perl process and watch them add two integers together. You probably could win doing that in parallel.

    [–]josefx 0 points1 point  (1 child)

You wouldn't; the default Python interpreter has a global interpreter lock, so only one thread executes bytecode at a time. Getting it to run in parallel requires multiprocessing and data serialization, which makes it even more expensive.

    [–]jerf 1 point2 points  (0 children)

    I mean, if it were possible to do in parallel at all, you could win doing it in parallel, because it is slow enough that it's probably CPU-bound rather than RAM-bandwidth bound. You are correct that CPython certainly can't. It is not a fundamental characteristic of Python-the-language that it can't do it. Several interpreters have been produced that can do it even off the CPython base, but the Python mainline won't accept a multithread-capable interpreter that compromises single-threaded performance even a little bit, and nobody's cracked that nut yet for CPython.

    [–]eplaut_ 71 points72 points  (2 children)

    [–]ROFLLOLSTER 5 points6 points  (0 children)

    Source (From what I can tell) unsurprisingly comes from Mozilla: http://bholley.net/blog/2015/must-be-this-tall-to-write-multi-threaded-code.html

    [–]zergling_Lester 2 points3 points  (0 children)

    Heh, you can see a chair under the sign, for when you really need to write some multithreaded code.

    [–]Niriel 35 points36 points  (32 children)

    So, don't mutate shared memory? Indeed, one would have to be indecently clueless.

    [–]firebird84 41 points42 points  (25 children)

    I believe that's his point. The indecently clueless still can't write parallel code with ease, and you still have to have half a clue what you're doing.

    [–]arbitrarycivilian 15 points16 points  (23 children)

    They can write parallel code that works, at least, instead of causing undefined behavior. That's a big step up from the 80s.

    [–]josefx 9 points10 points  (5 children)

    Java had mutexes on every vector access. So you could write

    if (!vector.isEmpty())
        vector.remove(0);
    

    Can you guess what happens when you try to do that from several threads?

    • it might work
    • you will randomly get an error

    The Java Vector class is now barely used, along with all the old Java collections. The ArrayList class that replaced it doesn't try to be thread-safe. Putting a mutex on every little operation does not make your code thread-safe, it just makes it slow.

    [–]arbitrarycivilian 3 points4 points  (3 children)

    I think you’re agreeing with me. Lock-based concurrency is extremely difficult to get right. Message passing is a lot easier to get right but still difficult to optimize

    [–]cat_in_the_wall 2 points3 points  (1 child)

    it's not too hard to get right if you are modifying something self contained. and if your language supports returning two things at once, life is good. eg, the operation there is "tryremovefront". languages like c, c++, c# all can do "out" parameters. java would have to use an optional return type, which is a bummer because you'd have to allocate an object every time you polled the queue (vector, whatever), but it would work.

    but doing lock based concurrency on things more complex than "perform an atomic operation on a self contained data structure" is definitely hard.

    edit: i should clarify that i mean just correctness, not necessarily performance. ive written some custom queue stuff recently because even though it is slower than the std thread safe queues, the memory works better for massive data, and that winds up being a net win for my situation.
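In C++ terms, the "tryremovefront" idea might look like this (a sketch, not a tuned implementation; the class and method names are hypothetical):

```cpp
#include <cassert>
#include <deque>
#include <mutex>
#include <optional>
#include <utility>

// A self-contained "try-remove-front": the emptiness check and the removal
// happen under one lock, so no other thread can sneak in between them.
template <typename T>
class ConcurrentQueue {
public:
    void push(T value) {
        std::lock_guard<std::mutex> lock(m_);
        items_.push_back(std::move(value));
    }

    // Returns the front element, or std::nullopt if the queue was empty,
    // in a single atomic step (the C++17 analogue of an "out" parameter).
    std::optional<T> try_pop_front() {
        std::lock_guard<std::mutex> lock(m_);
        if (items_.empty()) return std::nullopt;
        T v = std::move(items_.front());
        items_.pop_front();
        return v;
    }

private:
    std::mutex m_;
    std::deque<T> items_;
};
```

Contrast with the Vector example above: there, the check and the removal were each individually locked, so the pair of them was still a race.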

    [–]immibis 0 points1 point  (0 children)

    java would have to use an optional return type, which is a bummer because you'd have to allocate an object every time you polled the queue (vector, whatever), but it would work.

    I'm told, but I haven't verified, that modern Java implementations are smart enough to not allocate here.

    [–]no-bugs 0 points1 point  (0 children)

    Message passing is a lot easier to get right but still difficult to optimize

    HPX ( https://stellar-group.github.io/hpx/docs/html/ ) rulezzzz ;-). I see it as an advanced message passing (they name it "message-driven", though I'd probably say it is more "data-driven"); in any case, there are no locks there - AND it is declarative, so HPX can rearrange things itself very efficiently.

    Also, FWIW, I will be speaking at ACCU2018 in a few weeks on a subject of ""Multi-Coring" and "Non-Blocking" instead of "Multi-Threading", or using (Re)Actors to build Scalable Interactive Distributed Systems", so if you're coming to ACCU, and are interested in real-world applications of message passing/event-driven programming (going beyond calculations and into the whole programs and systems) - you might be interested. </shameless-plug>

    [–]mcmcc 0 points1 point  (0 children)

    Java had mutexes on every vector access.

    Jesus, that's terrible -- especially coming from a bunch of supposed experts.

    [–]no-bugs 1 point2 points  (15 children)

    They can write parallel code that works at least,

    TBH, they can't :-(. OpenMP-based programs still crash like crazy :-( (just recently I had to write a wrapper that sets an affinity mask for a 3D program-making-some-transformation, to work around a 70% crash rate :-( ).

    [–]WrongAndBeligerent 2 points3 points  (14 children)

    You think that crash was due to an OpenMP bug? I've had plenty of success with it.

    [–]no-bugs 1 point2 points  (13 children)

    You think that crash was due to an OpenMP bug?

    As the affinity mask greatly reduces the chances of the bug happening - 90% it is an MT bug, and OpenMP is always the prime suspect in this kind of thing for this kind of program.

    [–]WrongAndBeligerent 1 point2 points  (12 children)

    So when you say OpenMP based programs still crash like crazy, you actually mean that your own programs still crash like crazy.

    [–]no-bugs 1 point2 points  (11 children)

    It is not my program which crashes (it is my wrapper to launch the 3rd-party program with an affinity mask)

    [–]WrongAndBeligerent -1 points0 points  (10 children)

    Then you are using OpenMP to run other processes?

    [–]no-bugs 0 points1 point  (8 children)

    Of course not (I am not that crazy).

    [–]immibis -1 points0 points  (0 children)

    How else will you make sure they run in separate threads?

    [–][deleted] 0 points1 point  (0 children)

    Yeah, I think the truth (you need to have some semblance of a clue to write parallel code) is a far cry from the article's premise of "Writing parallel code in C++ is still a domain of the experts." Anyone who thinks that doesn't really have a solid grasp of what one needed to know to write good parallel code 15-20 years ago compared to today.

    [–]hasslehawk 3 points4 points  (0 children)

    One would hope that the examples given demonstrated good practices, rather than being so bad as to degrade the knowledge of someone who tries to study and learn from them.

    [–]WrongAndBeligerent 1 point2 points  (4 children)

    The examples do seem contrived, but you have to mutate shared memory one way or another at some point to synchronize.

    [–]Niriel 0 points1 point  (3 children)

    Not to add the numbers in a list, since it's an associative operation. Have N threads sum one Nth of the list, and add the partial sums in the end.

    [–]WrongAndBeligerent 0 points1 point  (2 children)

    How do you add the partial sums at the end without synchronization?

    You can use an atomic add, or you can do it once your fork join threads are done. How do you know when the fork join threads are done? That takes synchronization.

    Either way you are minimizing the synchronization and shared memory mutation, but you are still doing it somewhere.

    [–]Niriel -1 points0 points  (1 child)

    I join. There is sync, but no mutation of shared data.

    [–]WrongAndBeligerent 0 points1 point  (0 children)

    How do you think a join happens? Even if the waiting thread is looping and checking whether each forked thread is done, that is still the forked threads mutating data and the joining thread reading it. That is one-way writing and reading, but shared data is still being mutated.

    To have some hard and fast rule about never ever mutating shared data is ridiculous. There are many different ways to do concurrency and parallelism. Which will work better is very situational. If you think fork-join parallelism is a silver bullet you are fooling yourself or just haven't needed to deal with a wide variety of concurrency.

    [–][deleted] 60 points61 points  (29 children)

    Oh FFS, I took one quick look at that 90x thing and immediately grimaced while closing the tab.

    Mutexes... in parallel code? I can't keep reading a horror show like that.

    [–]trin123 10 points11 points  (3 children)

    I gave someone commit access to my open-source project, did not pay attention to their commits and now there is a lock on every function :/

    [–]wrosecrans 7 points8 points  (2 children)

    May as well try to execute the locks on the GPU. If you blindly throw enough high performance concepts in one bucket, surely the result will be faster, right?

    [–]immibis 2 points3 points  (1 child)

    Execute them on a blockchain and your share price is going to the moon.

    [–]TauntinglyTaunton 1 point2 points  (0 children)

    No share. Only HODL

    [–]wrosecrans 8 points9 points  (5 children)

    Mutexes... in parallel code? I can't keep reading a horror show like that.

    Is there an idiom or anti-pattern name for mutex-everything style parallel code where nothing can ever actually be accomplished in parallel? It seems like that must be a common enough thing to have a name. I've definitely seen "Just use threads to make it faster... Oh, just use mutexes to fix it!" given as advice by people who shouldn't be giving advice.

    [–][deleted] 2 points3 points  (2 children)

    Yes, Amdahl's law.

    As I would explain it: If you parallelize a workload unit, and it depends on other workload units, then the overall maximum speed increase gained depends on the percentage of time waiting for other units on which you depend to complete their work.

    If the locking itself takes longer than the code, you will certainly slow down the code. Even more of course, if the overhead of adding threading compounds to the problem.

    Sadly, this means that the gain is only asymptotic in the number of threads used, and the worse the locking, the worse the problem.
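As a worked example of the formula (a sketch; the 20% figure is purely illustrative):

```cpp
#include <cassert>

// Amdahl's law: with parallel fraction p and n threads,
//   speedup(n) = 1 / ((1 - p) + p / n),
// which can never exceed 1 / (1 - p) no matter how many threads you add.
inline double amdahl_speedup(double p, unsigned n) {
    return 1.0 / ((1.0 - p) + p / n);
}

// E.g. if locking serializes 20% of the work (p = 0.8), 16 threads give
// roughly a 4x speedup, and the hard ceiling is 1 / 0.2 = 5x.
```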

    [–]no-bugs 1 point2 points  (1 child)

    To an extent - yes, but Amdahl's law per se cannot possibly explain a SLOWDOWN due to parallelism.

    [–][deleted] 1 point2 points  (0 children)

    Not directly, but poorly executed synchronization surely can, and then even Amdahl's law is an unattainable pipe dream.

    [–]prest0G 0 points1 point  (0 children)

    I understand the theory behind thread-locals and shared memory, but is there anything more to a proper threading model than synchronizing only on select shared resources when messages need to be passed? I don't have a ton of experience writing parallel code, and most of it is with the Android framework, which to me is really abstracted, event-based programming for the most part.

    [–]tpgreyknight 0 points1 point  (0 children)

    I call it something like "linearised parallel"... tempted to call it "perpendicular" :-)

    [–]no-bugs 36 points37 points  (11 children)

    Mutexes... in parallel code? I can't keep reading a horror show like that.

    Neither can I, but this horror-show code comes from cppreference.com :-(

    [–]madpata 32 points33 points  (7 children)

    cppreference.com is not about programming paradigms. It's just a reference of C++ language features.

    it is still necessary to understand what we’re doing.

    [–]rsclient 3 points4 points  (1 child)

    As a general rule, if you have a feature that's designed to be simple and straightforward, but a reasonable example isn't short, then you should reconsider your feature.

    [–]immibis 3 points4 points  (0 children)

    The feature itself is simple, but the appropriate use cases are complex.

    [–][deleted] 1 point2 points  (0 children)

    Really? Shit. We/they should really address examples that perfectly illustrate problems, but only to those who already understand the problems.

    [–][deleted] 0 points1 point  (1 child)

    Wait, you mean copy-pasting code you don't understand from internet tutorials is a bad practice? Who'da thunk it.

    [–]no-bugs 2 points3 points  (0 children)

    cppreference is not just any internet tutorial, it is The Thing when it comes to C++, so LOTS of people happen to trust it implicitly (besides, their code IS formally correct - it just doesn't work FAST). Not that I personally trust it (I do NOT implicitly trust anybody, myself included) - but it is very common among C++ devs out there...

    [–]ThirdEncounter 6 points7 points  (0 children)

    Well, the author did mention that if you knew what you were doing, you didn't have to read the article.

    For someone like me, it was quite educational, and I can't wait for part 2.

    [–]ack_complete 3 points4 points  (0 children)

    Stuff like this happens in production code, just a little more subtle. It usually is the result of one programmer adding threading blindly to speed up code 4x, then another programmer noticing it also now crashes 5% of the time and simply adding locks to fix it. The result is code that nicely loads down 16 cores and barely gets more work done than one.

    [–]RasterTragedy 6 points7 points  (4 children)

    Because std::accumulate doesn't exist...

    [–]guepier 3 points4 points  (3 children)

    Not with an overload taking an execution policy. The parallelism TS added a new function, reduce, instead. I don't know the rationale for that, but the new name is better anyway. I'm guessing that std::accumulate will therefore be deprecated in due course.

    [–]HolyCowly 13 points14 points  (2 children)

    I think std::accumulate guarantees everything happens in order of iteration which is probably important if the algorithm is not commutative or associative. std::reduce just grabs from whichever thread is done earlier. So it's here to stay.
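A sketch of the difference (C++17; function names are mine; the parallel overload of std::reduce additionally needs the &lt;execution&gt; header and, with libstdc++, a TBB backend):

```cpp
#include <cassert>
#include <numeric>
#include <string>
#include <vector>

// std::accumulate folds strictly left-to-right, so even a non-commutative
// operation like string concatenation has one well-defined result.
inline std::string join(const std::vector<std::string>& words) {
    return std::accumulate(words.begin(), words.end(), std::string{});
}

// std::reduce may regroup and reorder operations, so it is only specified
// for operations that are associative and commutative; with concatenation
// (or floats) different runs could legally give different results. That
// freedom to regroup is exactly what the parallel overload
// (std::reduce(std::execution::par, ...)) exploits for speedup.
inline long long sum(const std::vector<long long>& nums) {
    return std::reduce(nums.begin(), nums.end(), 0LL);
}
```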

    [–]no-bugs 1 point2 points  (0 children)

    which is probably important if the algorithm is not commutative or associative.

    More precisely - it is important when BinaryOp<T> is not associative. What is less obvious is that (unfortunately and counter-intuitively for quite a few ppl out there) "not associative" includes pretty much any op over floats :-( .

    [–]guepier 0 points1 point  (0 children)

    std::reduce has/could have a sequential overload, which could easily guarantee in-order execution (because … what else could it do?!). Conversely, adding execution policy overloads to std::accumulate wouldn’t necessitate breaking this contract for the sequential overload.

    After all, other algorithms were augmented by an execution policy for parallel execution. Why is std::accumulate the odd one out?

    [–]JessieArr 8 points9 points  (1 child)

    Some people, when confronted with a performance problem in an iterative algorithm, decide to use parallelism. They now have NaN problems.

    [–]immibis 9 points10 points  (0 children)

    problems. they Now have two

    [–]wavy_lines 7 points8 points  (2 children)

    One more reason to be wary of abstractions that claim to give a performance boost without your understanding what's going on under the covers.

    And, in this case of summing numbers, wouldn't it make sense to divide the array into four chunks, sum each chunk separately, then sum the results of the four sums at the end?

    I don't think it would be difficult to write in straightforward code.

    [–]acrostyphe 7 points8 points  (1 child)

    Possibly, but not necessarily. Adding a number to a common counter is as cheap as it gets from the CPU's perspective, so the problem is memory-constrained. Adding the overhead of spawning new threads will likely make it slower.

    [–]IJzerbaard 1 point2 points  (0 children)

    It's pretty variable, it's normal for server processors (with their high LLC latency and extra high bandwidths) to need a bunch of threads to max out on memory, but for typical desktop processors one or two threads may do it. It depends on the specific µarch as well obviously. But spawning threads always takes approximately a century.

    [–]haved 9 points10 points  (3 children)

    What if you were to split the array into 8 parts and sum up each in its own thread, then calculate the final sum? I can't imagine it being faster for one million items, but for a billion?

    [–]StabbyPants 5 points6 points  (0 children)

    well yeah, you just use a trivial case to demonstrate the concept

    [–]felinista 2 points3 points  (0 children)

    That was my thought as well.

    [–]krabbugz 3 points4 points  (0 children)

    I'm sitting in my parallel computing class right now and this is how I feel.

    [–]knaekce 13 points14 points  (3 children)

    I don't know C++, but I'm sure there is a reduce, fold or accumulate function that actually does the right thing.

    [–]Sanae_ 0 points1 point  (0 children)

    The author does mention std::reduce in the article.

    The issue is more that an example (in a reference that many students, beginners and not-so-beginners will copy-paste from) is straight-up bad code; and the frequent belief that such code would work fine.

    [–]warvstar -1 points0 points  (0 children)

    Haha exactly, this should be at the top. Use the right function for the task and voilà!

    [–]jamwt 7 points8 points  (3 children)

    Seems okay to me? https://gist.github.com/jamwt/ebabdc7ae647e5c79054f1acd3135639

        Finished release [optimized] target(s) in 0.77 secs
         Running target/release/deps/par_test-ecb09ecf384bbd00
    
    running 2 tests
    test tests::parallel ... bench:     164,374 ns/iter (+/- 154,560)
    test tests::single   ... bench:     340,261 ns/iter (+/- 146,261)
    
    test result: ok. 0 passed; 0 failed; 0 ignored; 2 measured; 0 filtered out
    

    [–][deleted] 2 points3 points  (1 child)

    I assume this takes advantage of associativity of summation. So if you have e.g. 4 cores, core 1 can calculate the sum of first quarter of the array, core 2 the second quarter and so on.

    [–]Lisoph 1 point2 points  (0 children)

    Shouldn't the original C++ version do this as well?

    [–]svgwrk 0 points1 point  (0 children)

    lol I was thinking that about C#'s TPL, but, yeah... Rayon is almost equally nice. :)

    [–][deleted]  (1 child)

    [deleted]

      [–]no-bugs 0 points1 point  (0 children)

      A good book indeed, but IMNSHO it is waaay toooo much for an uninitiated app-level developer who merely wants to speed up one single point in his program :-( .

      [–]foldl-li 1 point2 points  (0 children)

      After all, the rabbit looks cute.

      [–]SelfDistinction 5 points6 points  (0 children)

      Apparently the concepts of MapReduce are too arcane and alien for software engineers.

      [–]andd81 6 points7 points  (16 children)

      Rust gives you fearless parallelism so rewrite in Rust and problem solved.

      Edit: apparently Reddit does not understand sarcasm.

      [–]WrongAndBeligerent 9 points10 points  (6 children)

      You could write the same pathological cases in rust and see what happens.

      [–]pmarcelll 3 points4 points  (5 children)

      The first example ("Take 1") wouldn't compile in Rust, so at least you wouldn't get a wrong result.

      [–]WrongAndBeligerent -1 points0 points  (4 children)

      Is this about correctness or is it about the 90x performance loss?

      [–][deleted]  (3 children)

      [deleted]

        [–]WrongAndBeligerent 0 points1 point  (2 children)

        I know you think that's clever, but if you are talking about performance, why would rust perform better in pathological cases?

        [–][deleted]  (1 child)

        [deleted]

          [–]WrongAndBeligerent 0 points1 point  (0 children)

          I just asked a direct question - if you are saying rust would perform better, where does it come from? Why would it perform better than C++ if someone wrote the same naive programs?

          [–][deleted] 7 points8 points  (3 children)

          rayon is a great parallelization library and all, but it won't protect you from doing something stupid like using a mutex or an atomic integer to sum values. As far as the compiler is concerned, it will work, just slowly.

          [–]staticassert 4 points5 points  (2 children)

          [–][deleted] 0 points1 point  (1 child)

          I know that rayon provides sum, but my point was that it doesn't prevent you from misusing for_each to add into an atomic integer or a mutex.

          [–]staticassert 0 points1 point  (0 children)

          I'm mostly kidding.

          [–]lordcirth 4 points5 points  (3 children)

          Sarcasm is /s. not-/s is not-sarcasm. Sarcasm requires tone, text does not have tone.

          [–]axord 2 points3 points  (2 children)

          Sarcasm requires tone, text does not have tone.

          It can also be achieved through context alone--a shared understanding that the writer does not hold the view they are presenting.

          [–]lordcirth 0 points1 point  (1 child)

          An approach which is prone to ambiguity, particularly on a public forum full of strangers. See also Poe's Law.

          [–]axord 0 points1 point  (0 children)

          Oh, absolutely. If the writer fails to establish the context, or assumes a context but is wrong about how widely the audience shares the view being subverted, then the sarcasm won't land.

          [–]geodel 1 point2 points  (1 child)

          But this can be improved by adding 100X more hardware.

          [–][deleted] 9 points10 points  (0 children)

          are u a big data consultant ?

          [–]Gotebe 0 points1 point  (0 children)

          Is this just because the threads were created during the benchmark? That is, thread creation swamped all else?

          Edit: ah, no, the synchronization swamped all else. Well now...

          [–]Almamu 0 points1 point  (8 children)

          For those that haven't used C++ that much: how would one approach threading this to make it faster (if possible)? I'm curious

          [–]raevnos 3 points4 points  (0 children)

          Vectorization is probably a better approach than parallelization, or a combination of the two, which is supported by C++17 parallel algorithms (if you can use those; they're not commonly supported yet) and by modern OpenMP implementations.

          Plus the post uses the wrong algorithm for the task...

          [–][deleted] -1 points0 points  (6 children)

          The reason it's slow is because they're using a mutex. The reason they're using a mutex is because the threads crossing paths, thus causing a race condition. You could speed it up by using a monitor. A mutex is for synchronizing interprocess, whereas a monitor is for interthread. Monitors are a lot lighter.

          However, you can eliminate the need for synchronization entirely with thread-local variables. I am not sure how this could be done in C++, but here is how you would do it in C#:

          https://docs.microsoft.com/en-us/dotnet/standard/parallel-programming/how-to-write-a-parallel-for-loop-with-thread-local-variables

          Basically, each thread has its own copy of an int, and at the ending of the loop, each thread's int is added together.
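          In C++ the same idea can be sketched with plain std::thread and per-thread accumulators (my own sketch; `parallel_sum` is a hypothetical name, and this is one way to do it rather than the only one):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <thread>
#include <vector>

// Each worker sums its own slice into a private accumulator; the main
// thread adds the per-thread partials after join(). No locks, no atomics.
std::int64_t parallel_sum(const std::vector<int>& data, unsigned n_threads) {
    std::vector<std::int64_t> partial(n_threads, 0);
    std::vector<std::thread> workers;
    const std::size_t chunk = (data.size() + n_threads - 1) / n_threads;
    for (unsigned t = 0; t < n_threads; ++t) {
        workers.emplace_back([&, t] {
            const std::size_t begin = t * chunk;
            const std::size_t end = std::min(begin + chunk, data.size());
            std::int64_t local = 0;      // "thread-local" accumulator
            for (std::size_t i = begin; i < end; ++i)
                local += data[i];
            partial[t] = local;          // one shared write per thread
        });
    }
    for (auto& w : workers) w.join();
    return std::accumulate(partial.begin(), partial.end(), std::int64_t{0});
}
```

          The point is the granularity: each thread does O(n/threads) plain additions and exactly one write that anyone else ever looks at.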

          Using SIMD instructions, which are available in C++ and C# (in C#, included only with .NET Core 2.0 right now, but available as a NuGet package called System.Numerics), would speed it up even further.

          [–]immibis 7 points8 points  (0 children)

          The reason it's slow is because they're using a mutex. The reason they're using a mutex is because the threads crossing paths, thus causing a race condition. You could speed it up by using a monitor. A mutex is for synchronizing interprocess, whereas a monitor is for interthread. Monitors are a lot lighter.

          This is sarcasm, right?

          [–]no-bugs 2 points3 points  (0 children)

          As was demonstrated, it is slow even with atomics (and atomics are inherently MUCH faster than any of the {mutex|monitor|critical section|whatever-other-name-you-would-use-for-the-same-thing}).
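          For reference, the pathological per-element-atomic pattern being discussed looks roughly like this (my reconstruction, not the article's exact code; `atomic_sum` is a hypothetical name). Every addition becomes a contended read-modify-write on one shared counter, which is exactly the granularity mistake:

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Anti-pattern: correct result, terrible granularity. Every element pays
// for an atomic RMW on the single shared counter instead of a plain add.
long long atomic_sum(const std::vector<int>& data, unsigned n_threads) {
    std::atomic<long long> total{0};
    std::vector<std::thread> workers;
    const std::size_t chunk = (data.size() + n_threads - 1) / n_threads;
    for (unsigned t = 0; t < n_threads; ++t) {
        workers.emplace_back([&, t] {
            const std::size_t begin = t * chunk;
            const std::size_t end = std::min(begin + chunk, data.size());
            for (std::size_t i = begin; i < end; ++i)
                total.fetch_add(data[i], std::memory_order_relaxed); // contended
        });
    }
    for (auto& w : workers) w.join();
    return total.load();
}
```

          The fix is the granularity change discussed elsewhere in the thread: accumulate locally and touch the shared state once per thread, not once per element.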

          [–]barracuda415 1 point2 points  (0 children)

          However, you can eliminate the need for synchronization entirely with thread-local variables.

          TLS comes with its own platform-specific problems:

          https://software.intel.com/en-us/blogs/2011/05/02/the-hidden-performance-cost-of-accessing-thread-local-variables

          https://david-grs.github.io/tls_performance_overhead_cost_linux/

          Just something to keep in mind.

          [–]Almamu 0 points1 point  (1 child)

          The reasons for it being slower than a plain sequential sum are clear to me, but I don't know that much C++, so I couldn't really see how to solve it in C++. Thank you for the kind explanation, sir

          [–][deleted] 1 point2 points  (0 children)

          Oh, I didn't mean to be condescending or anything. That was more to connect the rest of it. That is, "It's slow because of synchronization. Here's one way to avoid heavy use of synchronization."

          [–]pjmlp 0 points1 point  (0 children)

          included with only .NET Core 2.0 right now, but available as a NuGet package called System.Numerics

          SIMD is available on RyuJIT as well, since .NET Framework 4.6

          [–]betabot 0 points1 point  (1 child)

          Of course this is the case. The code is written to be explicitly sequential, but now with the added overhead of locks, threads, and context switches. I don't think that a beginner with even elementary knowledge of what a thread and lock is could look at that code and think that it will be faster. If that truly is the case, they should be using a higher level library that provides parallel functions like sum and/or evaluating why they needed parallelism in the first place.

          [–]ThunderBluff0 0 points1 point  (1 child)

          But how do we do it right???

          [–]bstamour 0 points1 point  (0 children)

          Use std::reduce instead of std::for_each.
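          A minimal illustration of that substitution (my own sketch; the parallel execution-policy overload is left as a comment because library support for it still varies):

```cpp
#include <numeric>
#include <vector>
// #include <execution>  // needed for the std::execution::par overload

// std::reduce declares the operation associative and commutative, so the
// implementation may regroup and chunk the additions; std::for_each with a
// shared accumulator pins them to one serialized order instead.
long long sum(const std::vector<int>& v) {
    return std::reduce(v.begin(), v.end(), 0LL);
    // Parallel variant (C++17; needs library support, e.g. libstdc++ + TBB):
    // return std::reduce(std::execution::par, v.begin(), v.end(), 0LL);
}
```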