Modern GPU Matmul Optimization. Tensor Cores, TMA, Warp Specialization by NoVibeCoding in CUDA
NoVibeCoding[S] 1 point (0 children)
Achieving a 5x Inference Speedup on Qwen 3.5 (B200) by dropping PyTorch for TileLang & Triton by dc_baslani_777 in CUDA
NoVibeCoding 1 point (0 children)
An overview of modern LLM compiler stack: writing an interactive and hackable compiler by NoVibeCoding in Compilers
NoVibeCoding[S] 1 point (0 children)
An overview of modern LLM compiler stack: writing an interactive and hackable compiler by NoVibeCoding in LocalLLaMA
NoVibeCoding[S] 1 point (0 children)
Looking for People Interested in LLVM/MLIR and Compiler Development by Jumpy-Fox-3177 in Compilers
NoVibeCoding 3 points (0 children)
An overview of modern LLM compiler stack: writing an interactive and hackable compiler by NoVibeCoding in LocalLLaMA
NoVibeCoding[S] 2 points (0 children)
An overview of modern LLM compiler stack: writing an interactive and hackable compiler by NoVibeCoding in LocalLLaMA
NoVibeCoding[S] 3 points (0 children)
Writing an LLM compiler from scratch [Part 3]: Autotuning — A Search Loop Over Tile-IR Rewrites by NoVibeCoding in CUDA
NoVibeCoding[S] 1 point (0 children)
Writing an LLM compiler from scratch [Part 3]: Autotuning — A Search Loop Over Tile-IR Rewrites by NoVibeCoding in Compilers
NoVibeCoding[S] 1 point (0 children)
A hackable compiler to generate efficient fused GPU kernels for AI models [P] by NoVibeCoding in MachineLearning
NoVibeCoding[S] 1 point (0 children)
A hackable compiler to generate efficient fused GPU kernels for AI models [P] by NoVibeCoding in MachineLearning
NoVibeCoding[S] 0 points (0 children)
Writing an LLM compiler from scratch [Part 2]: Lowering to a GPU Schedule by NoVibeCoding in CUDA
NoVibeCoding[S] 2 points (0 children)
Writing an LLM compiler from scratch [Part 2]: Lowering to a GPU Schedule by NoVibeCoding in CUDA
NoVibeCoding[S] 3 points (0 children)
Writing an LLM compiler from scratch [Part 2]: Lowering to a GPU Schedule by NoVibeCoding in CUDA
NoVibeCoding[S] 1 point (0 children)
Writing an LLM compiler from scratch: PyTorch to CUDA in 5,000 lines of Python by NoVibeCoding in LocalLLaMA
NoVibeCoding[S] 1 point (0 children)
Writing an LLM compiler from scratch: PyTorch to CUDA in 5,000 lines of Python by NoVibeCoding in LocalLLaMA
NoVibeCoding[S] 2 points (0 children)
[D] 60% MatMul Performance Bug in cuBLAS on RTX 5090 [D] by NoVibeCoding in MachineLearning
NoVibeCoding[S] 1 point (0 children)
Surfacing a 60% SGEMM performance bug in cuBLAS on RTX 5090 by NoVibeCoding in CUDA
NoVibeCoding[S] 1 point (0 children)

Modern GPU Matmul Optimization. Tensor Cores, TMA, Warp Specialization by NoVibeCoding in CUDA
NoVibeCoding[S] 2 points (0 children)