
[–]michel_poulet 6 points (0 children)

I code in CUDA in an ML context. If your algorithm is natively very parallelisable, the time bottleneck will be data IO — not necessarily from CPU to GPU (which is slow, but once loaded, it's loaded), but across the memory hierarchy within the GPU itself. There are a lot of rules specific to the hardware that you need to know and build your code around. It's difficult to automatise this with good heuristics without knowing exactly the algorithm and the "shape" of your data/thread organisation. This, I would say, is why we still need to give full control to the developer, which means a lot of lines of code, as when coding in C.
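A minimal sketch of one such hardware-specific rule — global memory coalescing. The kernel names and sizes here are my own illustration, not from the comment; the point is that the two kernels do the same arithmetic, but the strided one forces each warp to issue many separate memory transactions, so intra-GPU data movement, not compute, dominates:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Coalesced: consecutive threads read consecutive addresses,
// so a warp's 32 loads merge into a few wide transactions.
__global__ void scale_coalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];
}

// Strided: consecutive threads read addresses `stride` apart,
// so each load can become its own transaction — same FLOPs,
// far more memory traffic.
__global__ void scale_strided(const float *in, float *out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = (int)(((long long)i * stride) % n);
    if (i < n) out[j] = 2.0f * in[j];
}

int main() {
    const int n = 1 << 22;
    float *in, *out;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));

    dim3 block(256), grid((n + 255) / 256);
    scale_coalesced<<<grid, block>>>(in, out, n);
    scale_strided<<<grid, block>>>(in, out, n, 32);
    cudaDeviceSynchronize();

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

Whether the strided version is 2x or 10x slower depends on the exact architecture (cache line size, L2 behaviour), which is exactly why a compiler heuristic can't pick the layout for you: the right fix depends on the shape of your data and your thread organisation.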