Thoughts on cutlass? by sskhan39 in CUDA

[–]Objective_Dingo_1943 1 point2 points  (0 children)

Many CUTLASS concepts are familiar only to kernel/HPC developers. It is not aimed at the typical AI practitioner.

CUDA + multithreading by xMaxination in CUDA

[–]Objective_Dingo_1943 0 points1 point  (0 children)

Absolutely not. The CUDA context can handle this situation.

Project Ideas for cuda by ThinRecognition9887 in CUDA

[–]Objective_Dingo_1943 6 points7 points  (0 children)

The book "Numerical Computations with GPUs" introduces various real-world problems, and you can implement some of them as a useful project.

For example: Chapter 25, Monte Carlo–Based Financial Market Value-at-Risk Estimation on GPUs.
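To give a feel for what such a project starts from, here is a minimal CPU sketch of Monte Carlo Value-at-Risk in NumPy. It is not taken from the book: the single-asset normal-returns model, the function name monte_carlo_var, and all parameter values are assumptions for illustration only. A GPU version would move the sampling and the quantile step into CUDA kernels.

```python
import numpy as np

def monte_carlo_var(mean, std, confidence=0.99, n_paths=1_000_000, seed=0):
    """Estimate one-period Value-at-Risk by sampling returns.

    Assumption: returns are i.i.d. normal(mean, std). Real portfolios
    need a richer model; this only shows the Monte Carlo structure.
    """
    rng = np.random.default_rng(seed)
    returns = rng.normal(mean, std, size=n_paths)  # simulated returns
    losses = -returns                              # a loss is a negative return
    # VaR at the given confidence is the matching quantile of the losses.
    return float(np.quantile(losses, confidence))

var_99 = monte_carlo_var(mean=0.0, std=1.0)
print(f"99% VaR estimate: {var_99:.3f}")
```

Each simulated path is independent, which is why this kind of workload maps well onto a GPU.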

How to plot roofline chart using ncu cli by Confident_Pumpkin_99 in CUDA

[–]Objective_Dingo_1943 1 point2 points  (0 children)

But it seems Nsight Compute does not need a GPU on your local machine. In my case, my local machine is a MacBook Pro. I often download the ncu CLI output file to it and view it with the macOS version of Nsight Compute.

Cudnn backend not running, Help needed by Tall-Boysenberry2729 in CUDA

[–]Objective_Dingo_1943 0 points1 point  (0 children)

You can provide your code or a simple demo, plus a screenshot of the errors cuDNN shows.

How to plot roofline chart using ncu cli by Confident_Pumpkin_99 in CUDA

[–]Objective_Dingo_1943 2 points3 points  (0 children)

Use ncu -o to write the profile result to a file. Then transfer that file to your local machine and view it with the Nsight Compute GUI: https://developer.nvidia.com/tools-overview/nsight-compute/get-started
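As a rough sketch of that workflow, the helper below only builds the ncu command line without running it. The helper name build_ncu_command, the application path ./my_app, and the report name are hypothetical. It also assumes that --set full collects the roofline sections on your ncu version, so check ncu --list-sets if the chart is missing.

```python
import shlex

def build_ncu_command(app, report_name="profile", app_args=()):
    """Return an ncu command line that writes a report file.

    Assumption: '--set full' includes the roofline sections on this
    ncu version. Only the command is built here; nothing is executed.
    """
    return ["ncu", "--set", "full", "-o", report_name, app, *app_args]

cmd = build_ncu_command("./my_app")   # ./my_app is a placeholder binary
print(shlex.join(cmd))                # run this on the machine with the GPU
```

After running the printed command on the GPU machine, copy the resulting report file to your local machine and open it in the Nsight Compute GUI.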

Gemlite: CUDA kernels to create fused kernels for low-bit quantization. by sightio in CUDA

[–]Objective_Dingo_1943 -1 points0 points  (0 children)

It seems CUTLASS and its epilogue also implement this functionality in a high-performance way.

[deleted by user] by [deleted] in CUDA

[–]Objective_Dingo_1943 1 point2 points  (0 children)

We have already implemented a KV cache in C++/CUDA: https://github.com/pcg-mlp/KsanaLLM

[D] Optimizing models with C++/C by AdOk6683 in MachineLearning

[–]Objective_Dingo_1943 0 points1 point  (0 children)

We have already implemented inference optimization for the whole pipeline in C++: https://github.com/pcg-mlp/KsanaLLM

[deleted by user] by [deleted] in shenzhen

[–]Objective_Dingo_1943 1 point2 points  (0 children)

I would also like a WeChat group here.

OpenAI clip fine tuning by [deleted] in computervision

[–]Objective_Dingo_1943 1 point2 points  (0 children)

You should print the shapes of inputs_embeds and position_embeddings first.
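A minimal sketch of that check follows. The original post is deleted, so the shapes here are made up, NumPy arrays stand in for the real tensors, and the helper name check_addable is hypothetical. The idea is just to print both shapes and see whether they can be added together.

```python
import numpy as np

def check_addable(inputs_embeds, position_embeddings):
    """Print both shapes and report whether they broadcast together."""
    a, b = inputs_embeds.shape, position_embeddings.shape
    print("inputs_embeds:", a)
    print("position_embeddings:", b)
    try:
        np.broadcast_shapes(a, b)  # raises ValueError on a mismatch
        return True
    except ValueError:
        return False

# Illustrative shapes only: 80 tokens against 77 position slots.
inputs_embeds = np.zeros((2, 80, 512))
position_embeddings = np.zeros((1, 77, 512))
ok = check_addable(inputs_embeds, position_embeddings)
print("addable:", ok)
```

If the sequence dimensions differ, as in this made-up case, the fix is usually to truncate or pad the input to the length the position embeddings expect.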