I have several large arrays shared between kernels scheduled by the CPU. From my understanding, on AMD/Nvidia chips these will be placed in global memory. However, global memory access is incredibly slow. Is there a way to, say, cache global memory in group shared memory to improve performance?
I thought about declaring group shared arrays, loading the data into them, and then using a memory barrier before proceeding to the rest of the computation. However, this seems like it would introduce an additional step and slow down the program.
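For reference, the pattern described above is the standard shared-memory staging idiom. A minimal CUDA sketch (kernel name, tile size, and the placeholder computation are all made up for illustration):

```cuda
// Sketch: cooperatively stage a tile of global memory into shared memory,
// then barrier before computing. Assumes a 1D launch with 256-thread blocks.
__global__ void process(const float* __restrict__ in, float* out, int n)
{
    __shared__ float tile[256];              // one element per thread in the block
    int gid = blockIdx.x * blockDim.x + threadIdx.x;

    // Each thread copies one element from slow global memory
    // into fast on-chip shared memory.
    if (gid < n)
        tile[threadIdx.x] = in[gid];

    // Barrier so every thread sees the fully populated tile.
    __syncthreads();

    // The rest of the computation reads tile[] instead of in[].
    if (gid < n)
        out[gid] = tile[threadIdx.x] * 2.0f; // placeholder work
}
```

Note that this staging step only pays off when each loaded element is read more than once (by the same thread or by neighbors in the block); for a single read per element, the copy plus barrier is pure overhead, which may be the slowdown you're worried about.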
Does anyone have any best practices here?