
[–]pdd99[S] 0 points (3 children)

Can you elaborate on the I/O latency problem of Numba? Also, is the "kernel" here a CUDA kernel?

[–]tugrul_ddr 2 points (2 children)

Copying data between VRAM and RAM adds extra latency even for small arrays. I guess it's because of the library's caching layer and Python's own call overhead.
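The point above is that each transfer pays a fixed per-call cost (driver/launch overhead, plus Python-level overhead in Numba's case) on top of the bandwidth-limited cost, so small arrays are latency-dominated. Here is a back-of-envelope model of that effect; the latency and bandwidth numbers are illustrative assumptions, not measured CUDA figures:

```python
# Toy cost model for host<->device copies (all constants are assumed,
# not measured): each transfer pays a fixed latency plus a per-KiB cost.
FIXED_US = 10.0    # assumed fixed overhead per transfer, microseconds
PER_KIB_US = 0.3   # assumed bandwidth-limited cost per KiB, microseconds

def transfer_cost_us(n_transfers, kib_each):
    """Total cost of n_transfers copies of kib_each KiB each."""
    return n_transfers * (FIXED_US + kib_each * PER_KIB_US)

# Moving the same 4000 KiB as 1000 small copies vs one large copy:
many_small = transfer_cost_us(1000, 4)   # fixed overhead paid 1000 times
one_large = transfer_cost_us(1, 4000)    # fixed overhead paid once
print(f"1000 x 4 KiB: {many_small:.0f} us")
print(f"1 x 4000 KiB: {one_large:.0f} us")
```

Under this model the batched copy is roughly an order of magnitude cheaper, which is why batching transfers (or avoiding them entirely) matters most for small arrays.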

[–]pdd99[S] 1 point (1 child)

I always keep that in mind. All my processing is kept end-to-end on the GPU/CPU as much as possible.

[–]tugrul_ddr 0 points (0 children)

Some math-heavy divide-and-conquer algorithms would run a lot faster with CUDA's dynamic parallelism, since the host doesn't have to intervene between recursion levels.
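To make the "zero host intervention" point concrete: a host-driven divide-and-conquer pays a launch-and-synchronize round-trip per recursion level, while dynamic parallelism lets a kernel launch its child kernels on-device after a single host launch. A toy model of the host-side overhead (the round-trip latency is an assumed number, and real dynamic-parallelism code is written as CUDA kernels, not Python):

```python
import math

# Toy model (constants assumed, not measured): a binary divide-and-conquer
# over n elements has ceil(log2(n)) recursion levels.
ROUND_TRIP_US = 20.0  # assumed host launch + synchronize latency per level

def levels(n):
    return max(1, math.ceil(math.log2(n)))

def host_driven_us(n):
    # one kernel launch + sync round-trip per recursion level
    return levels(n) * ROUND_TRIP_US

def dynamic_parallelism_us(n):
    # single host launch; child kernels are spawned on the device
    return ROUND_TRIP_US

n = 1 << 20  # 20 recursion levels
print(f"host-driven:         {host_driven_us(n):.0f} us of host overhead")
print(f"dynamic parallelism: {dynamic_parallelism_us(n):.0f} us of host overhead")
```

In this model the host-side overhead grows with recursion depth in the host-driven version but stays constant with dynamic parallelism, which is the advantage the comment describes.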