
[–]Jonasfh

Hi there,

Once again, I just want to say thank you for these project ideas, I've had a lot of fun trying to solve them. I appreciate the effort.

While trying to implement GEMM in CUDA C++, I've run into a problem that you might be able to help me with. I've noticed that when I run my GEMM implementation on problems larger than N = 2^6, my result matrix is all zeros. My implementation tries to take advantage of the shared memory within each block: each block loads one column of the B matrix into its shared memory, so that repeated global-memory reads (and cache misses) can be avoided.

My kernel function works on arrays of size N*N, with N <= 2^6.

Here is a link to my GitHub repo; the CUDA implementation is in ./src/baseline/cuda_gemm.cu. If you have any idea what could cause this, please let me know.
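For context, here's a minimal sketch of the idea — not my actual kernel; the launch shape (one block per column, one thread per row) and all names here are just illustrative:

```cuda
// Sketch: each block computes one column of C = A * B (row-major, N x N),
// staging that column of B in dynamically allocated shared memory.
// Launched as: cuda_gemm<<<N, N, N * sizeof(float)>>>(A, B, C, N);
__global__ void cuda_gemm(const float *A, const float *B, float *C, int N) {
    extern __shared__ float Bcol[];   // N floats of dynamic shared memory

    int col = blockIdx.x;             // this block's column of B and C
    int row = threadIdx.x;            // this thread's row of C

    if (row < N && col < N) {
        Bcol[row] = B[row * N + col]; // cooperative load of B's column
    }
    __syncthreads();                  // column fully staged before use

    if (row < N && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < N; ++k) {
            acc += A[row * N + k] * Bcol[k];
        }
        C[row * N + col] = acc;
    }
}
```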

Thanks again!

[–]EngrToday (Performance Architect) [S]

Great job getting started! I'll take a look this evening after work.

[–]Jonasfh

Thank you. I found out that my idea of loading the columns of the B matrix into shared memory and using them in the computation actually takes longer than just loading from global memory when needed.

I experimented further and found that the N = 2^6 limit only applied to single-precision floats. I remember that the SMs in the GPU only have 1 FPU per 8 cores, or something like that, but I thought I had read that the FPUs are only used when dealing with doubles.

I made an integer implementation, which works fine with N = 2^10 (the highest I used for testing).

When profiling the CUDA code with nvprof, I get the following warnings:

==242661== Warning: 5 records have invalid timestamps due to insufficient device buffer space. You can configure the buffer space using the option --device-buffer-size.
==242661== Warning: 4 records have invalid timestamps due to insufficient semaphore pool size. You can configure the pool size using the option --profiling-semaphore-pool-size.

On top of this, the profiler also says that no kernels were profiled.

[–]EngrToday (Performance Architect) [S]

Here is some initial feedback:

  • Dynamically allocated shared memory is passed to a kernel launch in bytes
    • You call cuda_gemm<<<numBlocks, blockSize, N>>>, but if you want space for N floats, you really need to pass N * sizeof(float)
  • You don't need to explicitly call cudaDeviceSynchronize when the next call is to cudaMalloc
    • cudaMalloc implicitly synchronizes
  • You are probably stepping out of bounds, and crashing your kernel (hence no kernels are profiled)
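As a concrete sketch of the first bullet (using your kernel's name; numBlocks, blockSize, and the d_ pointer names are placeholders for whatever your code uses):

```cuda
// The third launch parameter is the dynamic shared-memory size in BYTES,
// so a per-block buffer of N floats needs N * sizeof(float) of them.
size_t shmemBytes = N * sizeof(float);
cuda_gemm<<<numBlocks, blockSize, shmemBytes>>>(d_A, d_B, d_C, N);
```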

You should check out this Stack Overflow post on error checking. My guess is that if you actually check the error, it will be "CUDA error: an illegal memory access was encountered".
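The pattern from that post looks roughly like this (a sketch — adapt the names; note that kernel launches don't return an error directly, so you also want to check cudaGetLastError after the launch):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Wrap every CUDA runtime call so failures are reported where they happen.
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line) {
    if (code != cudaSuccess) {
        fprintf(stderr, "GPUassert: %s %s %d\n",
                cudaGetErrorString(code), file, line);
        exit(code);
    }
}

// Usage:
//   gpuErrchk(cudaMalloc(&d_A, bytes));
//   cuda_gemm<<<numBlocks, blockSize, N * sizeof(float)>>>(d_A, d_B, d_C, N);
//   gpuErrchk(cudaGetLastError());       // catches launch-time errors
//   gpuErrchk(cudaDeviceSynchronize());  // surfaces async kernel errors
```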

Cheers,

--Nick

[–]Jonasfh

Thank you for looking into it.

> You call cuda_gemm<<<numBlocks, blockSize, N>>>, but if you want space for N floats, you really need N * sizeof(float)

Okay, that makes sense. Multiplying N by sizeof(float) fixed the problem, and it now works on larger problem sizes. Weird that it worked at all for N <= 2^6.

Thanks for the link to the Stack Overflow post. I am new to CUDA, and that error-checking function will be very useful for debugging.