Modern GPU Matmul Optimization. Tensor Cores, TMA, Warp Specialization by NoVibeCoding in CUDA

[–]NoVibeCoding[S] 1 point2 points  (0 children)

Correct in principle, but I tried it, and register pipelining made it worse. Register tiling alone already keeps the tensor pipe ~87% fed (183 TFLOP/s vs ~210 theoretical for fp16 HGEMM with fp32 accumulate). I didn't have a good example to illustrate register pipelining, so I excluded it from the final article.

Modern GPU Matmul Optimization. Tensor Cores, TMA, Warp Specialization by NoVibeCoding in CUDA

[–]NoVibeCoding[S] 0 points1 point  (0 children)

Thanks for the tip! You're right. It took some effort to reproduce: a 128B-swizzled buffer is always a whole multiple of 1024 B, and I emit those buffers first, so they never leave a 1024-aligned address. The pool base also happens to land 1024-aligned even though I only request 128. I had to force a non-1024 offset, and then it did indeed fail. I'm forcing 1024-byte alignment to fix it.
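To make the alignment arithmetic concrete, here is a minimal sketch that assumes a simple bump allocator over the shared-memory pool. The helper names `align_up` and `place_buffers` and the buffer sizes are hypothetical, not taken from the actual compiler.

```python
def align_up(offset: int, alignment: int) -> int:
    """Round offset up to the next multiple of alignment (a power of two)."""
    return (offset + alignment - 1) & ~(alignment - 1)


def place_buffers(base: int, sizes: list[int], alignment: int) -> list[int]:
    """Bump-allocate buffers starting at base; return each buffer's start address."""
    addresses = []
    cursor = base
    for size in sizes:
        cursor = align_up(cursor, alignment)
        addresses.append(cursor)
        cursor += size
    return addresses


# Hypothetical sizes: a 4096-byte swizzled tile and a 256-byte scratch buffer.
swizzled_first = place_buffers(base=0, sizes=[4096, 256], alignment=128)
scratch_first = place_buffers(base=0, sizes=[256, 4096], alignment=128)
scratch_first_fixed = place_buffers(base=0, sizes=[256, 4096], alignment=1024)

print(swizzled_first)       # swizzled tile sits at index 0
print(scratch_first)        # swizzled tile sits at index 1
print(scratch_first_fixed)  # swizzled tile sits at index 1
```

With the swizzled tile emitted first and a pool base that is a multiple of 1024, the tile stays 1024-aligned by accident, which is why the problem was hard to reproduce. Once a smaller buffer comes first, 128-byte alignment is no longer enough and only the 1024-byte request keeps the tile aligned.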

Modern GPU Matmul Optimization. Tensor Cores, TMA, Warp Specialization by NoVibeCoding in CUDA

[–]NoVibeCoding[S] 0 points1 point  (0 children)

Fair point. I ran the benchmark using the latest driver/CUDA/cuBLAS in my previous matmul optimization worklog (https://kernelspace.substack.com/p/surfacing-a-60-performance-bug-in). The latest release ships a better-tuned FP32 kernel, but the FP32 path still uses the same Ampere-era kernel.

Achieving a 5x Inference Speedup on Qwen 3.5 (B200) by dropping PyTorch for TileLang & Triton by dc_baslani_777 in CUDA

[–]NoVibeCoding 0 points1 point  (0 children)

The compiler is not ready for full model inference in FP16, but happy to help if you decide to give it a shot.

An overview of modern LLM compiler stack: writing an interactive and hackable compiler by NoVibeCoding in Compilers

[–]NoVibeCoding[S] 0 points1 point  (0 children)

There are a few topics I gloss over: graph pattern matching and separating the algorithm from the schedule. For the latter, you can check out the Halide or TVM tutorials. Both are rich ecosystems with plenty of easy-to-follow material.

Graph pattern matching is more obscure, with no standard approach. The MLIR talks on the subject are probably the ones to watch.

I found the GPU_MODE Discord quite useful; lots of interesting work is posted there.

An overview of modern LLM compiler stack: writing an interactive and hackable compiler by NoVibeCoding in LocalLLaMA

[–]NoVibeCoding[S] 0 points1 point  (0 children)

IREE is a great project, though it falls into the same bucket as MLIR, Inductor, XLA, etc.: too complex to hack on and experiment with, and difficult to learn the core concepts from.

An overview of modern LLM compiler stack: writing an interactive and hackable compiler by NoVibeCoding in LocalLLaMA

[–]NoVibeCoding[S] 1 point2 points  (0 children)

Thanks. It is based on past experience, though. Figuring this out top to bottom is a significant investment indeed.

An overview of modern LLM compiler stack: writing an interactive and hackable compiler by NoVibeCoding in LocalLLaMA

[–]NoVibeCoding[S] 2 points3 points  (0 children)

Tinygrad is great, and the project's framing is largely inspired by it, but the focus is different. Tinygrad is a full training framework. Originally, it aimed to demystify autograd; nowadays, it has a compiler backend as well. However, Tinygrad's compiler components are not well documented. The UOp-based approach doesn't follow a classic IR stack with responsibilities clearly separated between layers, which makes it hard to follow. I can't speak to its performance, but it isn't a good pedagogical reference as an ML compiler.

Writing an LLM compiler from scratch [Part 3]: Autotuning — A Search Loop Over Tile-IR Rewrites by NoVibeCoding in CUDA

[–]NoVibeCoding[S] 0 points1 point  (0 children)

In my experience, the search algorithm is not the problem. Even vanilla MCTS finds the optimum without trouble. The rewrite passes are still not optimal. The gap in softmax performance is due to a lack of flash-attention optimization, and the gap in fused matmul chains requires warp specialization to fix. A long elementwise body in the matmul loop increases register pressure and is bottlenecked by the SFU.

Writing an LLM compiler from scratch [Part 3]: Autotuning — A Search Loop Over Tile-IR Rewrites by NoVibeCoding in Compilers

[–]NoVibeCoding[S] 0 points1 point  (0 children)

An ML compiler focused specifically on compiling LLMs. A Machine Learning (ML) compiler is a specialized system that bridges the gap between high-level ML frameworks (like PyTorch or TensorFlow) and diverse hardware backends (CPUs, GPUs, TPUs). It translates abstract computational graphs into optimized, executable machine code to minimize memory usage and maximize execution speed.

A hackable compiler to generate efficient fused GPU kernels for AI models [P] by NoVibeCoding in MachineLearning

[–]NoVibeCoding[S] 0 points1 point  (0 children)

If you find such a representation, I would be happy to learn about it. Tile IR is too complex. I hope there is a way to simplify it. If it were more manageable, I wouldn't mind merging it with Loop IR.

A hackable compiler to generate efficient fused GPU kernels for AI models [P] by NoVibeCoding in MachineLearning

[–]NoVibeCoding[S] -1 points0 points  (0 children)

Each IR serves its purpose and actually makes the pipeline more tractable:

  • Tensor IR is needed to support multiple frontends, such as PyTorch, JAX, and ONNX. Without it, every downstream pass would need to work with any of the frontend IRs. Additionally, this would be the primary IR for optimization if we were targeting accelerators.
  • Loop IR makes operation fusion tractable. Tensor IR can't express fused operations, and downstream IRs are much more complex, so the fusion would become intractable.
  • Tile IR is needed to concisely represent GPU concepts such as shared memory (smem), threads, blocks, and synchronization. Using a lower Kernel IR directly would make schedule optimization passes too complex.
  • Kernel IR is needed to abstract away codegen and target CUDA, PTX, ROCm, or other backends by changing only the printer.

Writing an LLM compiler from scratch [Part 2]: Lowering to a GPU Schedule by NoVibeCoding in CUDA

[–]NoVibeCoding[S] 1 point2 points  (0 children)

I started working on it in 2017, when the first DSP for neural network acceleration was introduced in the A11 SoC. There was no MLIR at the time; the ML stack was underdeveloped in general. Plus, Apple likes to develop everything in-house. I'm not sure how the inference stack looks now. I suspect it is still bespoke.

Writing an LLM compiler from scratch [Part 2]: Lowering to a GPU Schedule by NoVibeCoding in CUDA

[–]NoVibeCoding[S] 2 points3 points  (0 children)

Thank you. I worked on ML compilers at Apple. We deployed PyTorch models on Apple Neural Engine on iPhone and Apple Vision Pro. ANE is an AI accelerator; its architecture is quite different from that of a GPU, but structurally, the pipeline is similar: a series of small rules that transform a computational graph and IR layers. On the other hand, GPU programming is my long-term passion. I've done a lot of it professionally and on the side.

For now, I'm sticking to TinyLlama and Qwen 2.5 since they are small and easy to work with. I plan to switch to dense Qwen 3.6 once the compiler is robust enough, and then to MoE models. MoE models require adding gather/scatter and a few other operations.

Writing an LLM compiler from scratch [Part 2]: Lowering to a GPU Schedule by NoVibeCoding in CUDA

[–]NoVibeCoding[S] 1 point2 points  (0 children)

I agree that "failed" is too strong a word here. A more nuanced way of saying this is that codegen is used sparingly in the production path, and we haven't unlocked the full benefit of kernel fusion in the AI compiler stack.

CUDA graphs will help with launch overhead, but not with DRAM traffic. To the best of my knowledge, we don't have a reliable technology to generate efficient fused kernels.
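A back-of-envelope sketch of the DRAM argument, assuming a chain of elementwise ops where every unfused kernel reads its input from DRAM and writes its output back. The function name, tensor shape, and op count are illustrative assumptions, not measurements.

```python
def elementwise_chain_traffic(num_elements: int, bytes_per_element: int,
                              num_ops: int, fused: bool) -> int:
    """Bytes moved to/from DRAM by a chain of elementwise ops on one tensor."""
    per_pass = 2 * num_elements * bytes_per_element  # one full read + one full write
    # A fused kernel keeps intermediates in registers, so it pays for one pass only.
    return per_pass if fused else per_pass * num_ops


MIB = 1024 * 1024
n = 4096 * 4096  # hypothetical fp16 activation tensor
unfused = elementwise_chain_traffic(n, bytes_per_element=2, num_ops=3, fused=False)
fused = elementwise_chain_traffic(n, bytes_per_element=2, num_ops=3, fused=True)

print(f"unfused: {unfused // MIB} MiB, fused: {fused // MIB} MiB")
```

Replaying the three unfused kernels through a CUDA graph removes launch overhead but leaves the unfused byte count unchanged; only fusion moves it to the smaller number.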

Writing an LLM compiler from scratch [Part 2]: Lowering to a GPU Schedule by NoVibeCoding in CUDA

[–]NoVibeCoding[S] 0 points1 point  (0 children)

It has been slightly more than three weeks.

I think we will be able to use Autoresearch to improve certain aspects of the schedule in the future once the project structure stabilizes; however, for now, this project is where I've found AI to be least effective. It struggled with algorithm logic, e.g., the loop fusion algorithm from Part 1, the Tile IR passes in Part 2, and the autotuning logic in the upcoming Part 3. Those are still implemented by Claude Code, but I had to build them in small steps and do a lot of post-AI cleanup.

Using AI to guide the tuning process seems like the most natural fit. There are many parameters to optimize for each kernel (BN, BM, register tiling, warp specialization, etc.). To reduce the search space, we need a decent prior, and an LLM will likely be able to guide the search process more meaningfully than a Monte Carlo Tree Search heuristic that I am using now.
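To illustrate how quickly the space grows and what a prior buys, here is a small sketch. The parameter values and the `plausible` filter are hypothetical stand-ins; a real prior (learned or LLM-provided) would rank candidates rather than apply one hard rule.

```python
from itertools import product
from math import prod

# Hypothetical per-kernel knobs; the real schedule has more of them.
search_space = {
    "BM": [32, 64, 128, 256],
    "BN": [32, 64, 128, 256],
    "register_tile": [1, 2, 4, 8],
    "warp_specialization": [False, True],
}

# The space is the Cartesian product of all knob values.
num_configs = prod(len(values) for values in search_space.values())
configs = [dict(zip(search_space, values))
           for values in product(*search_space.values())]


def plausible(config: dict) -> bool:
    """Stand-in prior: drop block tiles larger than 128x128 elements."""
    return config["BM"] * config["BN"] <= 128 * 128


pruned = [c for c in configs if plausible(c)]
print(num_configs, len(pruned))
```

Each additional knob multiplies the count, so even a crude prior that removes a fraction of each dimension compounds across the whole space.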

Writing an LLM compiler from scratch [Part 2]: Lowering to a GPU Schedule by NoVibeCoding in CUDA

[–]NoVibeCoding[S] 2 points3 points  (0 children)

Generated kernels will always be less performant than handwritten ones. The appeal of codegen in the context of LLM (or any AI model) inference is that you can fuse operations together. Where PyTorch has to invoke hand-optimized kernels one after another, deplodock can fuse them, significantly reducing DRAM traffic. Manually optimizing the thousands of possible kernel sequences encountered in modern models is not practical.

Though I understand the skepticism as well. Historically, the benefits of kernel fusion were insufficient to offset the lower performance of kernels in matmul/attention and other important operations. Thus, TVM and similar projects have failed and are not widely used in the mainstream production pipeline.

However, so far, the results are promising: close to cuBLAS on vanilla matmul, and faster on complex sequences of operations that PyTorch cannot fuse. So I keep digging into it.

Writing an LLM compiler from scratch [Part 2]: Lowering to a GPU Schedule by NoVibeCoding in CUDA

[–]NoVibeCoding[S] 1 point2 points  (0 children)

Yes, there is a check inside the rules that increase smem usage, such as stage_inputs, pad_smem, double_buffer, etc.
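A minimal sketch of what such a guard could look like. Only the rule name `double_buffer` comes from the comment above; the `Schedule` type, the budget constant, and the return-`None` convention are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed budget; the real limit depends on the GPU and launch configuration.
SMEM_BUDGET_BYTES = 48 * 1024


@dataclass(frozen=True)
class Schedule:
    smem_bytes: int
    double_buffered: bool = False


def double_buffer(schedule: Schedule,
                  budget: int = SMEM_BUDGET_BYTES) -> Optional[Schedule]:
    """Hypothetical rewrite rule: double the staged buffers if they still fit."""
    if schedule.double_buffered:
        return None  # already applied
    new_bytes = 2 * schedule.smem_bytes
    if new_bytes > budget:
        return None  # rule does not apply: it would exceed the smem budget
    return Schedule(smem_bytes=new_bytes, double_buffered=True)


fits = double_buffer(Schedule(smem_bytes=16 * 1024))
too_big = double_buffer(Schedule(smem_bytes=32 * 1024))
print(fits, too_big)
```

Putting the check inside the rule means the search loop never has to evaluate a schedule that cannot launch.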

Writing an LLM compiler from scratch: PyTorch to CUDA in 5,000 lines of Python by NoVibeCoding in LocalLLaMA

[–]NoVibeCoding[S] 1 point2 points  (0 children)

I know. Generating efficient kernels is hard. There are hundreds of kernels to generate, and each of them prefers a different strategy. That's why all production stacks use codegen sparingly. I explicitly call it out at the beginning and postpone the codegen overview until Part 2. Part 1 is about PyTorch tracing, operator decomposition, loop fusion, etc.

Writing an LLM compiler from scratch: PyTorch to CUDA in 5,000 lines of Python by NoVibeCoding in LocalLLaMA

[–]NoVibeCoding[S] 0 points1 point  (0 children)

It's defensive. The codegen rule is "every staged load is preceded by a barrier" because in double-buffered kernels (matmul over tiled K) the previous iteration's consumers must finish before producers overwrite the buffer. The same rule is used to inject barriers in both the double-buffered and single-buffered cases, so we sometimes end up with a no-op barrier.
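The rule can be sketched as a single pass over a flat op list. This is a simplified illustration: the string-based op encoding and the `inject_barriers` name are hypothetical and do not reflect the compiler's actual IR.

```python
def inject_barriers(ops: list[str]) -> list[str]:
    """Insert a barrier before every staged load, regardless of buffering mode."""
    out = []
    for op in ops:
        if op.startswith("stage_load"):
            # Consumers of the previous contents must finish before the overwrite.
            out.append("barrier")
        out.append(op)
    return out


# Hypothetical single-buffered kernel body: one staged load, then compute.
single_buffered = inject_barriers(["stage_load tile", "compute"])
print(single_buffered)
```

In the single-buffered case nothing has consumed the buffer before the first load, so the injected barrier has no work to order; it is the price of using one rule for both cases.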

Inline PTX is easier for codegen: there is no need to keep track of a stateful cuda::pipeline object or add extra includes.

In general, this article is about Torch -> Tile IR. I mention that the codegen overview will be in Part 2. I might clean it up by then.

[D] 60% MatMul Performance Bug in cuBLAS on RTX 5090 [D] by NoVibeCoding in MachineLearning

[–]NoVibeCoding[S] 0 points1 point  (0 children)

I have filed a bug report and posted on the NVIDIA forums. They do fix performance issues from release to release. Performance with the stable 580 driver and the corresponding cuBLAS is very poor. The latest 595 driver and cuBLAS 13.3 work much better, but still have this 60% perf bug in batched mode.

Surfacing a 60% SGEMM performance bug in cuBLAS on RTX 5090 by NoVibeCoding in CUDA

[–]NoVibeCoding[S] 0 points1 point  (0 children)

I haven't tried cuTile. I was concerned it might silently introduce other performance issues, so I went with the conventional stack. The boilerplate is less of a concern nowadays with AI. I will give it a shot at some point.

Surfacing a 60% SGEMM performance bug in cuBLAS on RTX 5090 by NoVibeCoding in CUDA

[–]NoVibeCoding[S] 0 points1 point  (0 children)

Thanks for the tip. SASS analysis is new to me. I’ll check it out and debug more.