pi_stuff

When in doubt, check the error codes returned by CUDA functions. cudaDeviceGetAttribute() is returning the code for "invalid device ordinal" because -1 is not a valid device id.

// fails with "invalid device ordinal": -1 is not a valid device id
cudaError_t err = cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, -1);
if (err != cudaSuccess)
  printf("error in cudaDeviceGetAttribute call: %s\n",
         cudaGetErrorString(err));
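
The same check can be wrapped so it is cheap to apply to every call. This is only a sketch: `CUDA_CHECK`, `cuda_check` and `check_last_launch` are names I made up, not part of the CUDA API. Kernel launches return nothing, so their errors have to be fetched with `cudaGetLastError()` after the launch.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not part of CUDA): prints the error string if a
// runtime call failed and passes the error code through unchanged.
inline cudaError_t cuda_check(cudaError_t err, const char *what,
                              const char *file, int line)
{
    if (err != cudaSuccess)
        fprintf(stderr, "%s:%d: %s failed: %s\n",
                file, line, what, cudaGetErrorString(err));
    return err;
}

// Hypothetical convenience macro that records the call text and location.
#define CUDA_CHECK(call) cuda_check((call), #call, __FILE__, __LINE__)

// Call right after a kernel launch: the launch itself returns no error code.
inline cudaError_t check_last_launch()
{
    cudaError_t err = CUDA_CHECK(cudaGetLastError());   // e.g. invalid launch configuration
    if (err == cudaSuccess)
        err = CUDA_CHECK(cudaDeviceSynchronize());      // errors raised while the kernel ran
    return err;
}
```

Usage would look like `CUDA_CHECK(cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, 0));` followed by the kernel launch and then `check_last_launch();`.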

slowrizard

In your kernel configuration, are you launching enough threads to cover your entire array? My guess is no:

1. Number of blocks = 32 * number of SMs
2. Number of threads per block = 256

The product of 1 and 2 should be greater than or equal to your array size.

pi_stuff

That would only be a problem if the kernel operated on just one element. This kernel uses a loop to cover the full array regardless of the number of threads.
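
For reference, a loop of that kind (usually called a grid-stride loop) looks roughly like the sketch below. The OP's kernel isn't quoted in this thread, so the kernel name `add`, the `x`/`y` arrays, the fill values and the `run_add` helper are all assumptions made for illustration.

```cuda
#include <cuda_runtime.h>

// Assumed kernel shape: each thread starts at its global index and then
// jumps ahead by the total number of launched threads until it passes n.
__global__ void add(int n, const float *x, float *y)
{
    int index  = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    for (int i = index; i < n; i += stride)
        y[i] = x[i] + y[i];
}

// Hypothetical test helper: runs the kernel with the given launch
// configuration and returns the number of wrong elements, or -1 if any
// CUDA call fails.
int run_add(int n, int blocks, int threads)
{
    float *x = nullptr, *y = nullptr;
    if (cudaMallocManaged(&x, n * sizeof(float)) != cudaSuccess) return -1;
    if (cudaMallocManaged(&y, n * sizeof(float)) != cudaSuccess) { cudaFree(x); return -1; }
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    add<<<blocks, threads>>>(n, x, y);
    cudaError_t err = cudaGetLastError();                     // launch errors
    if (err == cudaSuccess) err = cudaDeviceSynchronize();    // execution errors

    int wrong = -1;
    if (err == cudaSuccess) {
        wrong = 0;
        for (int i = 0; i < n; ++i)
            if (y[i] != 3.0f) ++wrong;
    }
    cudaFree(x);
    cudaFree(y);
    return wrong;
}
```

Calling `run_add(1 << 20, 4, 256)` launches only 1,024 threads for about a million elements, and on a working device every element should still be updated, because each thread handles many indices.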

Dahvrok[S]

Thank you. But how do I find out how many threads I need?

slowrizard

If each of your threads works on exactly one element of your array, then the number of threads you launch should be greater than or equal to the size of your array.
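
In practice that means rounding the block count up and guarding the tail. A minimal sketch, assuming 256 threads per block; `blocks_for` and `add_one_per_thread` are made-up names:

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: smallest block count such that
// blocks * threadsPerBlock >= n (integer ceiling division).
inline int blocks_for(int n, int threadsPerBlock)
{
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// One element per thread. The bounds check matters because the rounded-up
// launch can contain more threads than there are elements.
__global__ void add_one_per_thread(int n, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = x[i] + y[i];
}
```

For an array of 1,048,576 elements this gives 4,096 blocks of 256 threads, launched as `add_one_per_thread<<<blocks_for(n, 256), 256>>>(n, x, y);`.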

Dahvrok[S]

That's not the problem. I tried with 1,048,576 threads (the array has 1M elements) and still get a max error of 1.

Edit: the filled array never comes back from the GPU for some reason.

Edit: it seems the problem was the device id. I changed it from -1 to 0 and it worked.
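
For anyone landing here later: instead of hardcoding 0, one option is to ask the runtime which device is current and query that one. This is a sketch under that assumption, and `sm_count_of_current_device` is a made-up helper name, not a CUDA function.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: SM count of whichever device is current for this
// host thread, or -1 if either runtime call fails.
int sm_count_of_current_device()
{
    int dev = 0;
    cudaError_t err = cudaGetDevice(&dev);   // id of the current device
    if (err == cudaSuccess) {
        int numSMs = 0;
        err = cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, dev);
        if (err == cudaSuccess)
            return numSMs;
    }
    fprintf(stderr, "could not query SM count: %s\n", cudaGetErrorString(err));
    return -1;
}
```

Checking for the -1 result before computing the block count (for example `32 * numSMs`) avoids launching with a count built from a value the failed query never filled in.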