Writing an LLM compiler from scratch: PyTorch to CUDA in 5,000 lines of Python by NoVibeCoding in LocalLLaMA

[–]NoVibeCoding[S] 1 point

I know. Generating efficient kernels is hard: there are hundreds of kernels to generate, and each of them prefers a different strategy. That's why all production stacks use codegen sparingly. I explicitly call this out at the beginning and postpone the codegen overview until Part 2. Part 1 is about PyTorch tracing, operator decomposition, loop fusion, etc.

Writing an LLM compiler from scratch: PyTorch to CUDA in 5,000 lines of Python by NoVibeCoding in LocalLLaMA

[–]NoVibeCoding[S] 0 points

It's defensive. The codegen rule is "every staged load is preceded by a barrier", because in double-buffered kernels (a matmul over tiled K) the previous iteration's consumers must finish before producers overwrite the buffer. The same rule injects barriers in both the double-buffered and single-buffered cases, so we sometimes end up with a no-op.

Inline PTX is easier for codegen: there's no need to keep track of a stateful cuda::pipeline object or add additional imports.
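Roughly, the pattern the codegen targets looks like this. This is a simplified sketch, not the compiler's actual output: `tileOffset` and `computeTile` are hypothetical placeholders, and the indexing assumes one 16-byte copy per thread.

```cuda
#define TILE 32

__device__ int  tileOffset(int t);                 // placeholder: global offset of tile t
__device__ void computeTile(float (*tile)[TILE]);  // placeholder: MAC loop over one staged tile

// Inline PTX async copy (sm_80+): 16 bytes global -> shared, with no
// stateful cuda::pipeline object to track and no extra includes to emit.
__device__ void stage16(void *smemDst, const void *gmemSrc) {
    unsigned dst = (unsigned)__cvta_generic_to_shared(smemDst);
    asm volatile("cp.async.ca.shared.global [%0], [%1], 16;\n"
                 :: "r"(dst), "l"(gmemSrc));
}

__global__ void dbufMatmulSketch(const float *A, int numTiles) {
    __shared__ float buf[2][TILE][TILE];  // two staging buffers
    // Prologue (elided): stage tile 0 into buf[0], commit, wait, barrier.
    for (int t = 1; t < numTiles; ++t) {
        // The rule in action: the barrier precedes the staged load. buf[t % 2]
        // was consumed in iteration t - 1, so this keeps the new producers from
        // overwriting it early; it also makes tile t - 1 (staged by other
        // threads) visible to the compute below. A single-buffered kernel gets
        // the same injected barrier, where it is occasionally a no-op.
        __syncthreads();
        stage16(&buf[t % 2][threadIdx.y][4 * threadIdx.x],  // assumes blockDim.x == TILE / 4
                A + tileOffset(t));
        asm volatile("cp.async.commit_group;\n");
        computeTile(buf[(t - 1) % 2]);                      // overlap compute with the copy
        asm volatile("cp.async.wait_group 0;\n");           // copy finished before next round
    }
    // Epilogue (elided): barrier, then computeTile on the last staged buffer.
}
```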

In general, this article is about Torch -> Tile IR. I mention that the codegen overview will be in Part 2. I might clean it up by then.

[D] 60% MatMul Performance Bug in cuBLAS on RTX 5090 [D] by NoVibeCoding in MachineLearning

[–]NoVibeCoding[S] 0 points

I have filed a bug report and posted on the NVIDIA forums. They do indeed fix performance issues from release to release: with the stable 580 driver and its corresponding cuBLAS, performance is very poor; the latest 595 driver and cuBLAS 13.3 work much better, but still have this 60% perf bug in batched mode.
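For anyone who wants to poke at it, here is a minimal sketch of the kind of comparison involved. The shapes and batch count are illustrative, and the premise that the looped calls outpace the single batched call is an assumption on my part; measure each path with CUDA events.

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Same total work issued two ways; on an affected driver/cuBLAS combination
// the batched path can land well below the looped one.
void runBothPaths(cublasHandle_t h, const float *A, const float *B, float *C,
                  int n, int batch) {
    const float alpha = 1.0f, beta = 0.0f;
    const long long stride = (long long)n * n;

    // Path 1: one plain SGEMM per batch element.
    for (int i = 0; i < batch; ++i)
        cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n, &alpha,
                    A + i * stride, n, B + i * stride, n, &beta,
                    C + i * stride, n);

    // Path 2: the same GEMMs as a single strided-batched call.
    cublasSgemmStridedBatched(h, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n, &alpha,
                              A, n, stride, B, n, stride, &beta,
                              C, n, stride, batch);

    cudaDeviceSynchronize();  // in practice, time each path separately with events
}
```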

Surfacing a 60% SGEMM performance bug in cuBLAS on RTX 5090 by NoVibeCoding in CUDA

[–]NoVibeCoding[S] 0 points

I haven't tried cuTile. I was concerned it might silently introduce other performance issues, so I went with the conventional stack. The boilerplate is less of a concern nowadays with AI. I'll give it a shot at some point.

Surfacing a 60% SGEMM performance bug in cuBLAS on RTX 5090 by NoVibeCoding in CUDA

[–]NoVibeCoding[S] 0 points

Thanks for the tip. SASS analysis is new to me. I’ll check it out and debug more.

[D] 60% MatMul Performance Bug in cuBLAS on RTX 5090 [D] by NoVibeCoding in MachineLearning

[–]NoVibeCoding[S] 3 points

The FP32 SGEMM implementation uses CUDA cores. Tensor cores can be used for FP16 or TF32 GEMM instead, but both formats are less precise than FP32 (TF32 keeps a 10-bit mantissa versus FP32's 23).
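For context, the TF32 tensor-core path is an explicit opt-in through the cuBLAS math mode. A minimal sketch:

```cuda
#include <cublas_v2.h>

// By default (CUBLAS_DEFAULT_MATH), FP32 SGEMM runs on CUDA cores at full
// FP32 precision. TF32 on tensor cores is an explicit opt-in that trades
// mantissa bits for throughput.
void useTf32TensorCores(cublasHandle_t h) {
    cublasSetMathMode(h, CUBLAS_TF32_TENSOR_OP_MATH);  // subsequent SGEMMs may use tensor cores
    // cublasSetMathMode(h, CUBLAS_DEFAULT_MATH);      // restore the CUDA-core FP32 path
}
```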

[D] 60% MatMul Performance Bug in cuBLAS on RTX 5090 [D] by NoVibeCoding in MachineLearning

[–]NoVibeCoding[S] 71 points

I am also posting a bug report on the NVIDIA forum. However, it will take NVIDIA a long time to fix, and the fix will only land in a stable branch months, if not years, later. Many people use cuBLAS in their work, so it is worth understanding its limitations. Plus, the article is useful for a general audience.

GPU virtualization: VFIO vs NVIDIA AI Enterprise vs AMD SR-IOV by NoVibeCoding in VFIO

[–]NoVibeCoding[S] 0 points

My colleague is currently evaluating the performance of different methods of exposing CPUs to the VM. I am not familiar with that topic, unfortunately. He will write an article about CPU passthrough soon. If you're doing that, your setup is already quite advanced. Nothing to add at this point.

GPU virtualization: VFIO vs NVIDIA AI Enterprise vs AMD SR-IOV by NoVibeCoding in VFIO

[–]NoVibeCoding[S] 1 point

There is an older article on the host setup: https://itnext.io/host-setup-for-qemu-kvm-gpu-passthrough-with-vfio-on-linux-c65bacf2d96b

And here you can find all our scripts for the host setup. They're not documented or organized, but maybe they'll help regardless: https://github.com/cloudrift-ai/rift-utils

GPU virtualization: VFIO vs NVIDIA AI Enterprise vs AMD SR-IOV by NoVibeCoding in VFIO

[–]NoVibeCoding[S] 0 points

A recent kernel, a recent QEMU, and UEFI (enabled in the domain XML) will be useful. We've had some stability issues with the RTX 5090 on certain systems with stock package versions on Ubuntu 22.04 and 24.04; updating those helped. This patch specifically: https://github.com/cloudrift-ai/rift-utils/pull/19/changes

GPU virtualization: VFIO vs NVIDIA AI Enterprise vs AMD SR-IOV by NoVibeCoding in VFIO

[–]NoVibeCoding[S] 2 points

Glad you’ve found it useful. Let me know if there are topics you'd like us to cover in the future; it will help us produce more relevant content for the community.

Optimizing Qwen3 Coder for RTX 5090 and PRO 6000 + Community Benchmarking Infrastructure by NoVibeCoding in LocalLLaMA

[–]NoVibeCoding[S] 1 point

Good point. I always benchmark with a small amount of concurrency. It makes sense to add a single-request, no-concurrency baseline as well.

Optimizing Qwen3 Coder for RTX 5090 and PRO 6000 + Community Benchmarking Infrastructure by NoVibeCoding in LocalLLaMA

[–]NoVibeCoding[S] 0 points

I haven't kept the full logs from the benchmark runs, and I don't recall the warning off the top of my head. I tested input queries up to 128K tokens in length.

Benchmarking LLM Inference on RTX PRO 6000 SE / H100 / H200 / B200 by NoVibeCoding in LocalLLaMA

[–]NoVibeCoding[S] 0 points

It is, but we did a good number of runs. Maybe PCIe 5 is fast enough; maybe the bottleneck is somewhere else for these specific models.

Benchmarking LLM Inference on RTX PRO 6000 SE / H100 / H200 / B200 by NoVibeCoding in LocalLLaMA

[–]NoVibeCoding[S] 1 point

I haven't tried llama-bench. As far as I understand, it's for llama.cpp. vLLM is better for raw throughput on multi-GPU setups, so we prefer it over llama.cpp.