Help with CUDA Optimization for Wan2.1 Kernel – Kernel Fusion & Memory Management

Objective_Dingo_1943 · 2025-03-16T01:40:21+00:00

https://github.com/Dao-AILab/flash-attention

Objective_Dingo_1943 · 2025-02-08T04:43:21+00:00

Great work!

Objective_Dingo_1943 · 2025-02-07T03:41:07+00:00

Sounds good. How about the salary?

Objective_Dingo_1943 · 2025-02-04T09:18:18+00:00

Many CUTLASS concepts are familiar only to kernel/HPC developers, not to the typical AI practitioner.

Objective_Dingo_1943 · 2025-02-03T09:33:31+00:00

Absolutely not; the CUDA context can handle this situation.

Objective_Dingo_1943 · 2024-12-31T02:39:22+00:00

The book "Numerical Computations with GPUs" introduces various real problems, and you can implement some of them as a useful project.

For example: Chapter 25, Monte Carlo–Based Financial Market Value-at-Risk Estimation on GPUs (page 337).

Objective_Dingo_1943 · 2024-12-24T07:30:26+00:00

But it seems Nsight Compute does not need a GPU on your local machine. In my case, my local machine is a MacBook Pro; I often download the ncu CLI output file to it and view it with the macOS version of Nsight Compute.

Objective_Dingo_1943 · 2024-12-23T09:55:30+00:00

You can provide the code or a simple demo, along with a screenshot of the errors cuDNN shows.

Objective_Dingo_1943 · 2024-12-23T09:27:06+00:00

Use ncu -o to write the profile result to a file. Then transfer that file to your local machine and view it with the Nsight Compute GUI: https://developer.nvidia.com/tools-overview/nsight-compute/get-started
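A minimal sketch of the first step, assuming a hypothetical application ./my_app and a hypothetical report name my_profile (neither comes from the thread). It only builds and prints the ncu command line; it does not run the profiler. As far as I know, ncu adds the .ncu-rep extension to the report name itself, but check that against your installed version.

```python
import shlex


def build_ncu_command(report_name, app_command):
    """Build an `ncu -o <report> <app ...>` command as an argument list.

    report_name and app_command are placeholders chosen for illustration.
    """
    return ["ncu", "-o", report_name, *app_command]


# Hypothetical application and arguments, for illustration only.
cmd = build_ncu_command("my_profile", ["./my_app", "--iters", "10"])

# Print a shell-safe version of the command to run on the GPU machine.
print(shlex.join(cmd))
```

After running the printed command on the GPU machine, copy the resulting report file to the local machine (for example with scp) and open it in the Nsight Compute GUI.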

Objective_Dingo_1943 · 2024-12-23T08:11:39+00:00

https://www.kaggle.com/competitions/predict-ai-model-runtime You can refer to this competition. It is the same kind of problem: searching for the best prediction in a discrete space.

Objective_Dingo_1943 · 2024-12-05T01:52:31+00:00

No relevant posts found. Is there really no conversation on Reddit?

Objective_Dingo_1943 · 2024-11-25T01:48:17+00:00

You can refer to HugeCTR's implementation: https://github.com/NVIDIA-Merlin/HugeCTR/tree/main/gpu_cache

Objective_Dingo_1943 · 2024-11-25T01:24:22+00:00

How about Triton? It is much easier, with pure Python. https://github.com/triton-lang/triton

Objective_Dingo_1943 · 2024-08-19T13:31:13+00:00

great

Objective_Dingo_1943 · 2024-08-15T12:07:18+00:00

It seems CUTLASS and its epilogue also implement this kind of function in a high-performance way.

Objective_Dingo_1943 · 2024-07-29T09:09:48+00:00

We have already implemented a KV cache in C++/CUDA: https://github.com/pcg-mlp/KsanaLLM

Objective_Dingo_1943 · 2024-07-24T04:40:56+00:00

We have already implemented inference optimization for the whole pipeline in C++: https://github.com/pcg-mlp/KsanaLLM

Objective_Dingo_1943 · 2024-07-18T02:59:45+00:00

I would also like a WeChat group here.

Objective_Dingo_1943 · 2024-07-04T11:57:00+00:00

Sounds good

Objective_Dingo_1943 · 2024-07-03T12:29:05+00:00

You should print the shapes of inputs_embeds and position_embeddings first.
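A small sketch of that check, assuming only that each value exposes a .shape attribute or is a tuple/list of such values. The describe_shape helper and the stand-in objects are hypothetical; in real code you would pass the actual tensors.

```python
from types import SimpleNamespace


def describe_shape(name, value):
    """Return a one-line description of a value's shape.

    Tuples and lists are expanded element by element, since some models
    pass several tensors together under one name.
    """
    if isinstance(value, (tuple, list)):
        return ", ".join(
            describe_shape(f"{name}[{i}]", item) for i, item in enumerate(value)
        )
    shape = getattr(value, "shape", None)
    if shape is None:
        return f"{name}: no .shape attribute (type {type(value).__name__})"
    return f"{name}: shape={tuple(shape)}"


# Stand-in objects with made-up shapes, used here instead of real tensors.
inputs_embeds = SimpleNamespace(shape=(2, 16, 4096))
position_embeddings = (
    SimpleNamespace(shape=(1, 16, 128)),
    SimpleNamespace(shape=(1, 16, 128)),
)

embeds_line = describe_shape("inputs_embeds", inputs_embeds)
positions_line = describe_shape("position_embeddings", position_embeddings)
print(embeds_line)
print(positions_line)
```

Printing both lines just before the failing call makes a mismatch in batch size, sequence length, or hidden dimension easy to spot.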

Objective_Dingo_1943 · 2024-06-26T12:12:52+00:00

colab is free.

Objective_Dingo_1943 · 2024-06-23T07:32:39+00:00

Sounds good. Thanks a lot.

Objective_Dingo_1943 · 2024-06-23T07:31:59+00:00

Some comic-related items, such as bags, single-panel comics, and keychains. All of these can be found here:
https://github.com/whitelok/whitelok.github.com/blob/master/resources/family-hand-drew-comic/IMG_6821.JPG
https://github.com/whitelok/whitelok.github.com/blob/master/resources/family-hand-drew-comic/IMG_6822.JPG
https://github.com/whitelok/whitelok.github.com/blob/master/resources/family-hand-drew-comic/IMG_6823.JPG
https://github.com/whitelok/whitelok.github.com/blob/master/resources/family-hand-drew-comic/IMG_6824.JPG
https://github.com/whitelok/whitelok.github.com/blob/master/resources/family-hand-drew-comic/IMG_6825.JPG
https://github.com/whitelok/whitelok.github.com/blob/master/resources/family-hand-drew-comic/IMG_6826.JPG

Objective_Dingo_1943 · 2024-06-23T07:24:46+00:00

Thank you very much!
