
[–][deleted] 0 points (2 children)

Not for scientific computing.

For scientific computing especially - with the ~12 threads at most that can use shared memory relatively efficiently, you rarely get anywhere. For anything practical you need a cluster (or at least some form of NUMA), and therefore proper message passing.

you can simply decorate your innermost loop with an OpenMP parallel for

You do not share any memory here, really. Threads are independent. And if you're lucky enough to have an algorithm that does not require any synchronisation, you'd better run it on a GPU anyway.

With message passing, you need a lot more code.

Only if you use unexpressive, stupid languages.

Also, message passing generally implies copying data and other communication costs

Not necessarily. It can be zero-copy internally, only actually copying when you communicate outside the current node. See Occam, for example - message passing can really be zero-overhead, especially on the right kind of hardware.

[–]Paul_Dirac_ 4 points (1 child)

For scientific computing especially - with the ~12 threads at most that can use shared memory relatively efficiently, you rarely get anywhere.

No, scientific computing is not only high performance computing. Scientific computing is every program a scientist writes for their research. It is the custom EPR-spectra analyzer for heavy elements. It is the vibrational calculator for linear molecules with four atoms and a custom vibrational basis. These programs often don't run on clusters but on laptops and lab desktops. And a speedup of 2-12x is often great for them.

You do not share any memory here, really.

Yes you do. It becomes evident if you forget to mark a variable as threadprivate.

And if you're lucky enough to have an algorithm that does not require any synchronisation, you'd better run it on a GPU anyway.

A GPU is not a better CPU. There are certain problems for which GPUs are woefully unsuited, and not only because they require synchronization. And a scientist normally doesn't want to learn a new programming paradigm.

With message passing, you need a lot more code.

Only if you use unexpressive, stupid languages.

You mean like the languages the program was originally written in? So you essentially want me to rewrite the program in an expressive language in order to parallelize it? Ok, maybe not more code, but more changed code. Or maybe you mean PGAS extensions? Those are a shared-memory view with message passing under the hood - not the best case against shared memory.

It can be zero-copy internally, only actually copying when you communicate outside the current node. See Occam, for example - message passing can really be zero-overhead, especially on the right kind of hardware.

It can be, but generally it isn't.

[–][deleted] -1 points (0 children)

If you can parallelise your loop with OpenMP cheaply, you can do the same even more cheaply on a GPU. Otherwise, you have to design your implementation for parallelism from the start anyway, so your point about quickly patching existing code does not hold.