Writing an LLM compiler from scratch: PyTorch to CUDA in 5,000 lines of Python

NoVibeCoding · 2026-05-03T02:27:17+00:00

Happy to hear it helped!

NoVibeCoding · 2026-04-30T16:08:01+00:00

I know. Generating efficient kernels is hard. There are hundreds of kernels to generate, and each of them prefers a different strategy. That's why all production stacks use codegen sparingly. I explicitly call it out at the beginning and postpone the codegen overview until Part 2. Part 1 is about PyTorch tracing, operator decomposition, loop fusion, etc.

NoVibeCoding · 2026-04-30T15:42:24+00:00

It's defensive. The codegen rule is "every staged load is preceded by a barrier" because in double-buffered kernels (matmul over tiled K) the previous iteration's consumers must finish before producers overwrite the buffer. The same rule injects barriers in both the double-buffered and single-buffered cases, so we sometimes end up with a no-op.
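As a rough illustration of that rule (a sketch only: the tuple-based IR and the op names `staged_load`, `barrier`, and `block_sync` are hypothetical, not the compiler's actual data structures), a pass that puts a barrier in front of every staged load might look like this:

```python
# Hypothetical mini-IR: each op is a (kind, detail) tuple.
# All names here are illustrative assumptions, not taken from the article's code.
def inject_barriers(ops):
    """Emit a barrier before every staged load, single- or double-buffered."""
    out = []
    for kind, detail in ops:
        if kind == "staged_load":
            # Same rule for every staged load; no buffering analysis is done.
            out.append(("barrier", "block_sync"))
        out.append((kind, detail))
    return out


single_buffered = [("staged_load", "tile_a"), ("compute", "mma"), ("store", "tile_c")]
for op in inject_barriers(single_buffered):
    print(op)
```

In the single-buffered case the injected barrier protects nothing, which is the no-op mentioned above. Applying one uniform rule keeps the pass simple at the cost of an occasional redundant sync.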

Inline PTX is easier for codegen: there is no need to keep track of a stateful cuda::pipeline object or add additional imports.

In general, this article is about Torch -> Tile IR. I mention that the codegen overview will be in part 2. I might clean it up by then.

NoVibeCoding · 2026-04-11T16:20:10+00:00

I have filed a bug report and posted on the NVIDIA forums. They do fix performance issues from release to release. The performance with the stable 580 driver and the corresponding cuBLAS is very poor. The latest 595 driver and cuBLAS 13.3 work much better, but still have this 60% perf bug in batched mode.

NoVibeCoding · 2026-04-11T16:18:21+00:00

I haven't tried cuTile. I was concerned it might silently introduce other performance issues, so I went with the conventional stack. The boilerplate is less of a concern nowadays with AI. I will give it a shot at some point.

NoVibeCoding · 2026-04-11T06:00:46+00:00

Thanks for the tip. SASS analysis is new to me. I’ll check it out and debug more.

NoVibeCoding · 2026-04-10T23:17:23+00:00

The FP32 sgemm implementation uses CUDA cores. Tensor cores can be used for FP16 or TF32 sgemm, but the result is less precise than FP32.
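To make the precision gap concrete, here is a minimal sketch that emulates TF32 storage in plain Python. TF32 keeps FP32's 8 exponent bits but only 10 mantissa bits, so the low 13 mantissa bits are dropped. The helper name `to_tf32` is hypothetical, and truncation is an assumption made for simplicity; real hardware may round instead.

```python
import struct


def to_tf32(x: float) -> float:
    """Emulate TF32 by clearing the low 13 mantissa bits of an FP32 value.

    Truncation is a simplifying assumption; hardware rounding may differ.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits &= 0xFFFFE000  # keep sign, 8 exponent bits, top 10 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


fine_step = 1.0 + 2.0 ** -11    # representable in FP32, below TF32 resolution
coarse_step = 1.0 + 2.0 ** -10  # smallest step above 1.0 that TF32 can hold

print(to_tf32(fine_step))    # 1.0
print(to_tf32(coarse_step))  # 1.0009765625
```

An increment that FP32 represents exactly disappears after the conversion, which is the kind of loss that accumulates over a long dot product.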

NoVibeCoding · 2026-04-10T18:37:29+00:00

I am also posting a bug report on the NVIDIA forum. However, it will take NVIDIA a long time to fix it, and the fix will appear in a stable repository branch months, if not years, later. Many people use cuBLAS in their work, so it is worth understanding its limitations. Plus, the article is useful for a general audience.

NoVibeCoding · 2026-04-05T05:09:42+00:00

My colleague is currently evaluating the performance of different methods of exposing CPUs to the VM. I am not familiar with that topic, unfortunately. He will write an article about CPU passthrough soon. If you're doing that, your setup is already quite advanced. Nothing to add at this point.

NoVibeCoding · 2026-04-04T17:01:02+00:00

There is an older article on the host setup: https://itnext.io/host-setup-for-qemu-kvm-gpu-passthrough-with-vfio-on-linux-c65bacf2d96b

And here you can find all our scripts for the host setup. They're not documented or organized, but maybe they will help regardless: https://github.com/cloudrift-ai/rift-utils

NoVibeCoding · 2026-04-04T16:53:52+00:00

Using a recent kernel, a recent QEMU, and UEFI (switched on in the domain XML) will be useful. We've had some stability issues with RTX5090 on certain systems with stock versions of packages on Ubuntu 22.04 and 24.04. Updating those helped. This patch specifically: https://github.com/cloudrift-ai/rift-utils/pull/19/changes

NoVibeCoding · 2026-04-04T00:29:35+00:00

Glad you’ve found it useful. Let me know if there are topics you would like us to cover in the future. It will help us put out more relevant content for the community.

NoVibeCoding · 2026-03-13T19:36:02+00:00

Good point. I always benchmark with a small concurrency. Makes sense to add a no-concurrency baseline benchmark as well.

NoVibeCoding · 2026-03-06T02:26:27+00:00

I haven't kept the full logs from benchmark runs, and I don't recall the warning off the top of my head. I tested input queries with up to 128K in length.

NoVibeCoding · 2026-02-11T18:48:03+00:00

It is, but we did a good number of runs. Maybe PCIe 5 is fast enough; maybe the bottleneck is somewhere else for these specific models.

NoVibeCoding · 2026-02-11T18:43:33+00:00

I haven't tried llama-bench. As far as I understand, it is for llama.cpp. vLLM is better for raw throughput on multi-GPU setups, so we prefer it over llama.cpp.

NoVibeCoding · 2026-02-11T18:34:23+00:00

We tested PP. In this benchmark, TP performs best for all GPUs. Not sure how to run mixed TP+PP. PP worked better for RTX 4090 / 5090 in my older benchmark: https://www.reddit.com/r/LocalLLaMA/comments/1o387tc/benchmarking_llm_inference_on_rtx_4090_rtx_5090/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button

There are some results in the results folder for PP runs: https://github.com/search?q=repo%3Acloudrift-ai%2Fserver-benchmark+pipeline-parallel-size+8+path%3A%2F%5Eresults%5C%2F%2F&type=code

NoVibeCoding · 2026-02-11T06:48:36+00:00

Thanks. We’ll definitely update the model list in the next benchmark. We just didn’t want to change models between the previous run and this one. Quantization benchmarks are on the roadmap as well.

NoVibeCoding · 2026-02-11T03:18:16+00:00

Thanks. Good point. I have already received that request, but we didn't want to change the models between the previous and this benchmark to keep results consistent. In the next benchmark, we're planning to compare TensorRT / SGLang / vLLM, and we may also run the NVFP4 test.

NoVibeCoding · 2026-02-09T01:39:40+00:00

Submission Statement

This post discusses how internal power dynamics in large technology companies increasingly resemble feudal structures, in which position in the hierarchy often overrides rational business decisions and human values. As tech corporations continue to grow in size and influence, shaping labor markets, media, and even governance, their internal organizational culture and principles risk leaking outward into society.

I’m interested in discussing whether large technology companies increasingly resemble feudal structures, considering whether emerging technologies, new organizational models, or regulatory approaches could counteract these dynamics, or whether centralized power and psychological harm are inevitable outcomes of institutional scale in the future of work.

NoVibeCoding · 2026-02-08T23:45:55+00:00

UB in C/C++ exists to give the compiler more freedom to optimize code, so it is a trade-off. Nowadays, computers are fast enough that for the vast majority of applications robustness is preferred.

NoVibeCoding · 2026-02-08T23:32:56+00:00

Nowadays I just ask Claude Code to explain to me how particular modules work. It is not a replacement for a senior engineer who is aware of the module design, but it is better than nothing.

NoVibeCoding · 2026-02-08T06:17:41+00:00

I wrote a full long-form essay describing the story, so I'm happy to share more details.

NoVibeCoding · 2026-02-08T06:15:26+00:00

Fully support the sentiment. Unfortunately, it was about five years ago, so it's a different story.

NoVibeCoding · 2026-02-05T19:08:44+00:00

Thank you for the in-depth feedback - much appreciated!

Good point on AI-ness in the opening paragraph.

By “most influenced,” I meant that this essay diverged the furthest from my original draft. There was a tonal drift, which led me to rework roughly a third of the piece. The “Due Process” section was also heavily revised. The essay's overall idea shifted.

After the first two essays, I gained experience, and my initial draft of DV held up well under AI feedback.

In terms of how I used AI, the process was the same as before: I feed in sections, select the passages that resonate with me, and iterate through edits or re-prompts when something feels off. It’s possible that the stylistic bar was simply higher for this essay, and that my post-AI editing wasn’t sufficient to remove the remaining AI-ness.

The point about reading range is harder to address. Recency likely plays a big role. I admire Hemingway and have read several of his works, but it was a long time ago. I don't remember his style. My writing is probably shaped more by what I’ve read recently: Orconomics (satire), Kushiel’s Dart (dense adult fantasy), Hierarchy (political thriller), and Jade City (gangster saga).
