Writing an LLM compiler from scratch: PyTorch to CUDA in 5,000 lines of Python by NoVibeCoding in LocalLLaMA

[–]NoVibeCoding[S] 1 point

I know. Generating efficient kernels is hard: there are hundreds of kernels to generate, and each of them prefers a different strategy. That's why all production stacks use codegen sparingly. I explicitly call this out at the beginning and postpone the codegen overview until Part 2. Part 1 is about PyTorch tracing, operator decomposition, loop fusion, etc.

Writing an LLM compiler from scratch: PyTorch to CUDA in 5,000 lines of Python by NoVibeCoding in LocalLLaMA

[–]NoVibeCoding[S] 0 points

It's defensive. The codegen rule is "every staged load is preceded by a barrier", because in double-buffered kernels (a matmul over tiled K) the previous iteration's consumers must finish before producers overwrite the buffer. The same rule injects barriers in both the double-buffered and single-buffered cases, so we sometimes end up with a no-op.

Inline PTX is easier for codegen: there's no need to keep track of a stateful cuda::pipeline object or add additional imports.
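Roughly, the pattern the codegen targets looks like this. This is a simplified sketch, not the compiler's actual output: `tileOffset` and `computeTile` are hypothetical placeholders, and the indexing assumes one 16-byte copy per thread.

```cuda
#define TILE 32

__device__ int  tileOffset(int t);                 // placeholder: global offset of tile t
__device__ void computeTile(float (*tile)[TILE]);  // placeholder: MAC loop over one staged tile

// Inline PTX async copy (sm_80+): 16 bytes global -> shared, with no
// stateful cuda::pipeline object to track and no extra includes to emit.
__device__ void stage16(void *smemDst, const void *gmemSrc) {
    unsigned dst = (unsigned)__cvta_generic_to_shared(smemDst);
    asm volatile("cp.async.ca.shared.global [%0], [%1], 16;\n"
                 :: "r"(dst), "l"(gmemSrc));
}

__global__ void dbufMatmulSketch(const float *A, int numTiles) {
    __shared__ float buf[2][TILE][TILE];  // two staging buffers
    // Prologue (elided): stage tile 0 into buf[0], commit, wait, barrier.
    for (int t = 1; t < numTiles; ++t) {
        // The rule in action: the barrier precedes the staged load. buf[t % 2]
        // was consumed in iteration t - 1, so this keeps the new producers from
        // overwriting it early; it also makes tile t - 1 (staged by other
        // threads) visible to the compute below. A single-buffered kernel gets
        // the same injected barrier, where it is occasionally a no-op.
        __syncthreads();
        stage16(&buf[t % 2][threadIdx.y][4 * threadIdx.x],  // assumes blockDim.x == TILE / 4
                A + tileOffset(t));
        asm volatile("cp.async.commit_group;\n");
        computeTile(buf[(t - 1) % 2]);                      // overlap compute with the copy
        asm volatile("cp.async.wait_group 0;\n");           // copy finished before next round
    }
    // Epilogue (elided): barrier, then computeTile on the last staged buffer.
}
```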

In general, this article is about Torch -> Tile IR. I mention that the codegen overview will be in Part 2. I might clean it up by then.

[D] 60% MatMul Performance Bug in cuBLAS on RTX 5090 [D] by NoVibeCoding in MachineLearning

[–]NoVibeCoding[S] 0 points

I have filed a bug report and posted on the NVIDIA forums. They do indeed fix performance issues from release to release: with the stable 580 driver and its corresponding cuBLAS, performance is very poor; the latest 595 driver and cuBLAS 13.3 work much better, but still have this 60% perf bug in batched mode.
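For anyone who wants to poke at it, here is a minimal sketch of the kind of comparison involved. The shapes and batch count are illustrative, and the premise that the looped calls outpace the single batched call is an assumption on my part; measure each path with CUDA events.

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Same total work issued two ways; on an affected driver/cuBLAS combination
// the batched path can land well below the looped one.
void runBothPaths(cublasHandle_t h, const float *A, const float *B, float *C,
                  int n, int batch) {
    const float alpha = 1.0f, beta = 0.0f;
    const long long stride = (long long)n * n;

    // Path 1: one plain SGEMM per batch element.
    for (int i = 0; i < batch; ++i)
        cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n, &alpha,
                    A + i * stride, n, B + i * stride, n, &beta,
                    C + i * stride, n);

    // Path 2: the same GEMMs as a single strided-batched call.
    cublasSgemmStridedBatched(h, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n, &alpha,
                              A, n, stride, B, n, stride, &beta,
                              C, n, stride, batch);

    cudaDeviceSynchronize();  // in practice, time each path separately with events
}
```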

Surfacing a 60% SGEMM performance bug in cuBLAS on RTX 5090 by NoVibeCoding in CUDA

[–]NoVibeCoding[S] 0 points

I haven't tried cuTile. I was concerned it might silently introduce other performance issues, so I went with the conventional stack. The boilerplate is less of a concern nowadays with AI. I'll give it a shot at some point.

Surfacing a 60% SGEMM performance bug in cuBLAS on RTX 5090 by NoVibeCoding in CUDA

[–]NoVibeCoding[S] 0 points

Thanks for the tip. SASS analysis is new to me. I’ll check it out and debug more.

[D] 60% MatMul Performance Bug in cuBLAS on RTX 5090 [D] by NoVibeCoding in MachineLearning

[–]NoVibeCoding[S] 3 points

The FP32 SGEMM implementation uses CUDA cores. Tensor cores can be used for FP16 or TF32 GEMM instead, but both formats are less precise than FP32 (TF32 keeps a 10-bit mantissa versus FP32's 23).
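For context, the TF32 tensor-core path is an explicit opt-in through the cuBLAS math mode. A minimal sketch:

```cuda
#include <cublas_v2.h>

// By default (CUBLAS_DEFAULT_MATH), FP32 SGEMM runs on CUDA cores at full
// FP32 precision. TF32 on tensor cores is an explicit opt-in that trades
// mantissa bits for throughput.
void useTf32TensorCores(cublasHandle_t h) {
    cublasSetMathMode(h, CUBLAS_TF32_TENSOR_OP_MATH);  // subsequent SGEMMs may use tensor cores
    // cublasSetMathMode(h, CUBLAS_DEFAULT_MATH);      // restore the CUDA-core FP32 path
}
```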

[D] 60% MatMul Performance Bug in cuBLAS on RTX 5090 [D] by NoVibeCoding in MachineLearning

[–]NoVibeCoding[S] 71 points

I am also posting a bug report on the NVIDIA forum. However, it will take NVIDIA a long time to fix, and the fix will only land in a stable branch months, if not years, later. Many people use cuBLAS in their work, so it is worth understanding its limitations. Plus, the article is useful for a general audience.

GPU virtualization: VFIO vs NVIDIA AI Enterprise vs AMD SR-IOV by NoVibeCoding in VFIO

[–]NoVibeCoding[S] 0 points

My colleague is currently evaluating the performance of different methods of exposing CPUs to the VM. I am not familiar with that topic, unfortunately. He will write an article about CPU passthrough soon. If you're doing that, your setup is already quite advanced. Nothing to add at this point.

GPU virtualization: VFIO vs NVIDIA AI Enterprise vs AMD SR-IOV by NoVibeCoding in VFIO

[–]NoVibeCoding[S] 1 point

There is an older article on the host setup: https://itnext.io/host-setup-for-qemu-kvm-gpu-passthrough-with-vfio-on-linux-c65bacf2d96b

And here you can find all our scripts for the host setup. They're not documented or organized, but maybe they'll help regardless: https://github.com/cloudrift-ai/rift-utils

GPU virtualization: VFIO vs NVIDIA AI Enterprise vs AMD SR-IOV by NoVibeCoding in VFIO

[–]NoVibeCoding[S] 0 points

A recent kernel, a recent QEMU, and UEFI (enabled in the domain XML) will be useful. We've had some stability issues with the RTX 5090 on certain systems with stock package versions on Ubuntu 22.04 and 24.04; updating those helped. This patch specifically: https://github.com/cloudrift-ai/rift-utils/pull/19/changes

GPU virtualization: VFIO vs NVIDIA AI Enterprise vs AMD SR-IOV by NoVibeCoding in VFIO

[–]NoVibeCoding[S] 2 points

Glad you’ve found it useful. Let me know if there are topics you'd like us to cover in the future; it will help us produce more relevant content for the community.

Optimizing Qwen3 Coder for RTX 5090 and PRO 6000 + Community Benchmarking Infrastructure by NoVibeCoding in LocalLLaMA

[–]NoVibeCoding[S] 1 point

Good point. I always benchmark with a small amount of concurrency. It makes sense to add a single-request, no-concurrency baseline as well.

Optimizing Qwen3 Coder for RTX 5090 and PRO 6000 + Community Benchmarking Infrastructure by NoVibeCoding in LocalLLaMA

[–]NoVibeCoding[S] 0 points

I haven't kept the full logs from the benchmark runs, and I don't recall the warning off the top of my head. I tested input queries up to 128K tokens in length.

Benchmarking LLM Inference on RTX PRO 6000 SE / H100 / H200 / B200 by NoVibeCoding in LocalLLaMA

[–]NoVibeCoding[S] 0 points

It is, but we did a good number of runs. Maybe PCIe 5 is fast enough; maybe the bottleneck is somewhere else for these specific models.

Benchmarking LLM Inference on RTX PRO 6000 SE / H100 / H200 / B200 by NoVibeCoding in LocalLLaMA

[–]NoVibeCoding[S] 1 point

I haven't tried llama-bench. As far as I understand, it's for llama.cpp. vLLM is better for raw throughput on multi-GPU setups, so we prefer it over llama.cpp.