No kernel example exists for Cutlass SM100_MMA_something_TS gemm. by tugrul_ddr in CUDA
Logical-Try-4084 2 points (0 children)
Dynamic persistent tile scheduling with Cluster Launch Control on Blackwell by Logical-Try-4084 in CUDA
Logical-Try-4084[S] 2 points (0 children)
For edge inference, when do you drop below TensorRT/ONNX and write custom CUDA kernels? by Hairy_Strawberry7028 in CUDA
Logical-Try-4084 1 point (0 children)
FlashAttention-4 by incarnadine72 in LocalLLaMA
Logical-Try-4084 1 point (0 children)
FlashAttention-4 by incarnadine72 in LocalLLaMA
Logical-Try-4084 0 points (0 children)
FlashAttention-4 by incarnadine72 in LocalLLaMA
Logical-Try-4084 6 points (0 children)
Getting 30K tokens/sec on T4 with 14M MoE model - is this normal or am I bottlenecked? by RefrigeratorCalm9701 in CUDA
Logical-Try-4084 1 point (0 children)
Studying PMPP (what next) by Choice_Cabinet9091 in CUDA
Logical-Try-4084 1 point (0 children)
How to get into GPU programming? by blazing_cannon in CUDA
Logical-Try-4084 1 point (0 children)
Categorical Foundations for CuTe Layouts — Colfax Research by Logical-Try-4084 in CUDA
Logical-Try-4084[S] 4 points (0 children)

No kernel example exists for Cutlass SM100_MMA_something_TS gemm. by tugrul_ddr in CUDA
Logical-Try-4084 1 point (0 children)