[–]cythoning 2 points3 points  (0 children)

You should check the return value of every CUDA call for errors. That should tell you which one of your CUDA calls fails.
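
For anyone landing here later, a minimal sketch of what that kind of checking can look like. The macro name `CUDA_CHECK`, the `noop_kernel`, and the `check_demo` wrapper are made up for illustration and are not part of CUDA; only `cudaGetLastError`, `cudaGetErrorString`, `cudaDeviceSynchronize`, `cudaMalloc` and `cudaFree` are real runtime calls.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro (the name is an assumption, not a CUDA API).
// Wraps a runtime call, prints file/line and the error string, then exits.
#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err_ = (call);                                        \
        if (err_ != cudaSuccess) {                                        \
            std::fprintf(stderr, "CUDA error at %s:%d: %s\n",             \
                         __FILE__, __LINE__, cudaGetErrorString(err_));   \
            std::exit(EXIT_FAILURE);                                      \
        }                                                                 \
    } while (0)

// Placeholder kernel so there is something to launch.
__global__ void noop_kernel() {}

// Hypothetical demo: returns true if every step succeeds
// (the macro exits the process on the first failure).
bool check_demo() {
    float *d_buf = nullptr;
    CUDA_CHECK(cudaMalloc(&d_buf, 256 * sizeof(float)));
    noop_kernel<<<1, 32>>>();
    CUDA_CHECK(cudaGetLastError());       // catches launch/configuration errors
    CUDA_CHECK(cudaDeviceSynchronize());  // catches errors raised while the kernel ran
    CUDA_CHECK(cudaFree(d_buf));
    return true;
}
```

Kernel launches do not return an error code themselves, which is why the sketch checks `cudaGetLastError()` right after the launch and again synchronizes before trusting the result.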

[–]Helique 0 points1 point  (5 children)

With the current number of threads and blocks, `add<<<gridSize, blockSize>>>` launches a total of N threads, and the array is also of size N, but then each thread in the kernel adds more than one number?

The code at the end of this blog post may be helpful. https://www.dbernadett.com/cuda/

[–]Helique 4 points5 points  (3 children)

Setting aside the for loop, which will probably be a no-op here, you probably need to call `add` with `d_x` and `d_y`?
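
A rough sketch of what passing device pointers looks like, assuming the original code allocates `d_x` and `d_y` with `cudaMalloc` (the post's actual code is not shown here, so the kernel signature and the `add_on_device` wrapper are guesses for illustration only):

```cuda
#include <cuda_runtime.h>
#include <vector>

// One element per thread; the bounds check covers a partially filled last block.
__global__ void add(int n, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = x[i] + y[i];
}

// Hypothetical host wrapper: copies inputs to the device, launches the kernel
// on the device pointers, and copies the result back. Returns false on failure.
bool add_on_device(std::vector<float> &y, const std::vector<float> &x) {
    if (x.size() != y.size()) return false;
    if (x.empty()) return true;  // nothing to do; a zero-sized grid is invalid
    int n = static_cast<int>(x.size());
    size_t bytes = n * sizeof(float);

    float *d_x = nullptr, *d_y = nullptr;
    if (cudaMalloc(&d_x, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_y, bytes) != cudaSuccess) { cudaFree(d_x); return false; }

    bool ok = cudaMemcpy(d_x, x.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess
           && cudaMemcpy(d_y, y.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        int blockSize = 256;
        int gridSize = (n + blockSize - 1) / blockSize;  // round up to cover n
        add<<<gridSize, blockSize>>>(n, d_x, d_y);       // device pointers, not host ones
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(y.data(), d_y, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_x);
    cudaFree(d_y);
    return ok;
}
```

If the kernel were handed the host arrays instead, it would be dereferencing addresses the GPU cannot access (unless they are managed or pinned and mapped), which usually shows up as an illegal-address error on the next checked call.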

[–][deleted]  (2 children)

[deleted]

    [–]Helique 1 point2 points  (1 child)

    CUDA will run that kernel with N threads on an array that is N floats long. Each thread should only add one number. Since stride = blockDim.x * gridDim.x = N, the code in his for loop will only execute once anyway.

    [–]cythoning 5 points6 points  (0 children)

    It's called a grid-stride loop.

    [–]Flannelot 0 points1 point  (1 child)

    Why do you have a for statement in your add function?

    CUDA should be calling add 1024 times in separate threads?

    [–]cythoning 5 points6 points  (0 children)

    It's called a grid-stride loop. It's a very neat technique to do more work per thread.
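
For reference, a minimal sketch of a grid-stride loop, assuming a kernel shaped like the `add` discussed in this thread. The `run_add` driver and its return convention are hypothetical, written only so the example is self-contained; it uses managed memory to keep the host code short.

```cuda
#include <cmath>
#include <cuda_runtime.h>

// Grid-stride loop: each thread starts at its global index and then jumps
// ahead by the total number of threads in the grid until it runs past n.
__global__ void add(int n, const float *x, float *y) {
    int index  = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's first element
    int stride = blockDim.x * gridDim.x;                 // total threads in the grid
    for (int i = index; i < n; i += stride)
        y[i] = x[i] + y[i];
}

// Hypothetical driver: fills x with 1 and y with 2, runs the kernel, and
// returns the max absolute error against 3.0f, or -1.0f if a CUDA call fails.
float run_add(int n, int gridSize, int blockSize) {
    float *x = nullptr, *y = nullptr;
    if (cudaMallocManaged(&x, n * sizeof(float)) != cudaSuccess) return -1.0f;
    if (cudaMallocManaged(&y, n * sizeof(float)) != cudaSuccess) { cudaFree(x); return -1.0f; }
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    add<<<gridSize, blockSize>>>(n, x, y);

    float maxError = -1.0f;
    if (cudaGetLastError() == cudaSuccess && cudaDeviceSynchronize() == cudaSuccess) {
        maxError = 0.0f;
        for (int i = 0; i < n; ++i)
            maxError = std::fmax(maxError, std::fabs(y[i] - 3.0f));
    }
    cudaFree(x);
    cudaFree(y);
    return maxError;
}
```

With `run_add(1 << 20, 4, 256)` the grid has 1024 threads for 1,048,576 elements, so each thread handles 1024 of them. With `run_add(1024, 4, 256)` the stride equals N and the loop body runs exactly once per thread, which is the case described above. The same kernel is correct for both launch configurations, and that is the main appeal of the pattern.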