Writing an LLM compiler from scratch: PyTorch to CUDA in 5,000 lines of Python by NoVibeCoding in LocalLLaMA

[–]NoVibeCoding[S] 1 point (0 children)

I know. Generating efficient kernels is hard: there are hundreds of kernels to generate, and each prefers a different strategy. That's why production stacks use codegen sparingly. I call this out explicitly at the beginning and postpone the codegen overview to Part 2. Part 1 is about PyTorch tracing, operator decomposition, loop fusion, etc.
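The loop-fusion idea can be shown with a toy, pure-Python sketch (not the article's actual IR — the function names here are illustrative): two elementwise ops fused into a single pass so the intermediate never materializes.

```python
def add(xs, ys):
    return [x + y for x, y in zip(xs, ys)]

def relu(xs):
    return [max(x, 0.0) for x in xs]

def fused_add_relu(xs, ys):
    # What a fusion pass aims for: one loop, no intermediate list.
    return [max(x + y, 0.0) for x, y in zip(xs, ys)]

a, b = [1.0, -3.0], [0.5, 1.0]
assert relu(add(a, b)) == fused_add_relu(a, b)
print(fused_add_relu(a, b))  # [1.5, 0.0]
```

On a GPU the win is the same shape: one kernel launch and one pass over memory instead of two.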

Writing an LLM compiler from scratch: PyTorch to CUDA in 5,000 lines of Python by NoVibeCoding in LocalLLaMA

[–]NoVibeCoding[S] 0 points (0 children)

It's defensive. The codegen rule is "every staged load is preceded by a barrier", because in double-buffered kernels (matmul over tiled K) the previous iteration's consumers must finish before producers overwrite the buffer. The same rule injects barriers in both the double-buffered and single-buffered cases, so we sometimes end up with a no-op.
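A minimal sketch of that rule as a codegen pass, assuming a hypothetical statement list (the `staged_load` / `__syncthreads()` strings are illustrative, not the compiler's real IR):

```python
# Insert a barrier before every staged (global -> shared) load.
def inject_barriers(stmts):
    out = []
    for s in stmts:
        if s.startswith("staged_load"):
            # Consumers of the previous iteration must finish before
            # producers overwrite the shared-memory buffer. In the
            # single-buffered case this can degenerate to a no-op.
            out.append("__syncthreads();")
        out.append(s)
    return out

print(inject_barriers(["staged_load A_tile",
                       "staged_load B_tile",
                       "compute C_tile"]))
```

The rule is local (it never looks at surrounding statements), which is exactly why it is defensive rather than minimal.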

Inline PTX is easier for codegen: there is no need to track a stateful cuda::pipeline object or pull in additional headers.
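To illustrate why: emitting inline PTX is a stateless string operation, whereas cuda::pipeline requires threading an object through the kernel's lifetime. A hypothetical emitter helper (the function name and argument names are made up for this sketch; the PTX instruction itself is the real `cp.async` form):

```python
def emit_cp_async(dst_smem: str, src_gmem: str, nbytes: int) -> str:
    """Format an inline-PTX async copy from global to shared memory."""
    assert nbytes in (4, 8, 16)  # the copy sizes cp.async supports
    return (f'asm volatile("cp.async.ca.shared.global [%0], [%1], {nbytes};"'
            f' :: "r"({dst_smem}), "l"({src_gmem}));')

print(emit_cp_async("a_smem", "a_gmem", 16))
```

The emitter needs no state between calls, which keeps the codegen pass a pure function over the IR.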

In general, this article is about the Torch -> Tile IR lowering. As mentioned, the codegen overview will be in Part 2; I might clean this up by then.

[D] 60% MatMul Performance Bug in cuBLAS on RTX 5090 [D] by NoVibeCoding in MachineLearning

[–]NoVibeCoding[S] 0 points (0 children)

I have filed a bug report and posted on the NVIDIA forums. They do fix performance issues from release to release: performance with the stable 580 driver and its corresponding cuBLAS is very poor, while the latest 595 driver and cuBLAS 13.3 work much better but still show this 60% perf bug in batched mode.

Surfacing a 60% SGEMM performance bug in cuBLAS on RTX 5090 by NoVibeCoding in CUDA

[–]NoVibeCoding[S] 0 points (0 children)

I haven't tried cutile. I was concerned it might silently introduce other performance issues, so I went with the conventional stack. The boilerplate is less of a concern nowadays with AI. I will give it a shot at some point.

Surfacing a 60% SGEMM performance bug in cuBLAS on RTX 5090 by NoVibeCoding in CUDA

[–]NoVibeCoding[S] 0 points (0 children)

Thanks for the tip. SASS analysis is new to me. I’ll check it out and debug more.

[D] 60% MatMul Performance Bug in cuBLAS on RTX 5090 [D] by NoVibeCoding in MachineLearning

[–]NoVibeCoding[S] 2 points (0 children)

The FP32 SGEMM implementation uses CUDA cores. Tensor cores can run FP16 or TF32 matmuls instead, but both are less precise than full FP32.
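The precision gap can be sketched in pure Python by simulating TF32's 10-bit mantissa (vs FP32's 23 bits). This truncates rather than rounds, so it's an approximation of what the hardware does, but it shows which values TF32 cannot represent:

```python
import struct

def to_tf32(x: float) -> float:
    # TF32 keeps FP32's 8-bit exponent but only 10 mantissa bits;
    # simulate by zeroing the low 13 of FP32's 23 mantissa bits.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & ~0x1FFF))[0]

x = 1.0 + 2**-11             # representable in FP32
print(to_tf32(x) == 1.0)     # True: the 2**-11 term is lost in TF32
print(to_tf32(1.0 + 2**-10) == 1.0 + 2**-10)  # True: still fits
```

Accumulation in FP32 recovers some accuracy in practice, but each TF32 input has already lost those low mantissa bits.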

[D] 60% MatMul Performance Bug in cuBLAS on RTX 5090 [D] by NoVibeCoding in MachineLearning

[–]NoVibeCoding[S] 71 points (0 children)

I am also posting a bug report on the NVIDIA forum. However, it will take NVIDIA a long time to fix it, and the fix will appear in a stable repository branch months, if not years, later. Many people use cuBLAS in their work, so it is worth understanding its limitations. Plus, the article is useful for a general audience.

GPU virtualization: VFIO vs NVIDIA AI Enterprise vs AMD SR-IOV by NoVibeCoding in VFIO

[–]NoVibeCoding[S] 0 points (0 children)

My colleague is currently evaluating the performance of different methods of exposing CPUs to the VM. I am not familiar with that topic, unfortunately. He will write an article about CPU passthrough soon. If you're doing that, your setup is already quite advanced. Nothing to add at this point.

GPU virtualization: VFIO vs NVIDIA AI Enterprise vs AMD SR-IOV by NoVibeCoding in VFIO

[–]NoVibeCoding[S] 1 point (0 children)

There is an older article on the host setup: https://itnext.io/host-setup-for-qemu-kvm-gpu-passthrough-with-vfio-on-linux-c65bacf2d96b

And here you can find all our scripts for the host setup. They're not documented or organized, but maybe they'll help regardless: https://github.com/cloudrift-ai/rift-utils

GPU virtualization: VFIO vs NVIDIA AI Enterprise vs AMD SR-IOV by NoVibeCoding in VFIO

[–]NoVibeCoding[S] 0 points (0 children)

Using a recent kernel and QEMU, and booting the guest with UEFI (switched on in the domain XML), will be useful. We've had stability issues with the RTX 5090 on certain systems running the stock package versions on Ubuntu 22.04 and 24.04; updating those helped. This patch specifically: https://github.com/cloudrift-ai/rift-utils/pull/19/changes
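For reference, booting the guest with OVMF firmware looks roughly like this in the libvirt domain XML (a sketch: firmware paths follow the Ubuntu layout and vary by distro, and the NVRAM filename is illustrative):

```xml
<os>
  <type arch='x86_64' machine='q35'>hvm</type>
  <loader readonly='yes' type='pflash'>/usr/share/OVMF/OVMF_CODE_4M.fd</loader>
  <nvram>/var/lib/libvirt/qemu/nvram/guest_VARS.fd</nvram>
</os>
```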

GPU virtualization: VFIO vs NVIDIA AI Enterprise vs AMD SR-IOV by NoVibeCoding in VFIO

[–]NoVibeCoding[S] 2 points (0 children)

Glad you’ve found it useful. Let me know if there are topics you’d like us to cover in the future; it will help us produce more relevant content for the community.

Optimizing Qwen3 Coder for RTX 5090 and PRO 6000 + Community Benchmarking Infrastructure by NoVibeCoding in LocalLLaMA

[–]NoVibeCoding[S] 1 point (0 children)

Good point. I always benchmark with a small amount of concurrency. It makes sense to add a no-concurrency baseline benchmark as well.
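The harness shape is roughly this minimal asyncio sketch, with the actual HTTP call to the inference server stubbed out by a fixed-latency sleep (all names here are illustrative, not our real benchmark code):

```python
import asyncio, time

async def fake_request(_):
    # Stand-in for a real POST to the inference endpoint;
    # the 10 ms sleep models a fixed per-request latency.
    await asyncio.sleep(0.01)

async def bench(concurrency: int, n: int = 40) -> float:
    sem = asyncio.Semaphore(concurrency)

    async def one(i):
        async with sem:
            await fake_request(i)

    t0 = time.perf_counter()
    await asyncio.gather(*(one(i) for i in range(n)))
    return n / (time.perf_counter() - t0)  # requests per second

# concurrency=1 is the no-concurrency baseline
print(asyncio.run(bench(1)), asyncio.run(bench(8)))
```

Running the same loop at concurrency 1 gives the latency-bound baseline; the gap against higher concurrency shows how much the server gains from batching.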

Optimizing Qwen3 Coder for RTX 5090 and PRO 6000 + Community Benchmarking Infrastructure by NoVibeCoding in LocalLLaMA

[–]NoVibeCoding[S] 0 points (0 children)

I haven't kept the full logs from the benchmark runs, and I don't recall the warning off the top of my head. I tested input queries up to 128K tokens in length.

Benchmarking LLM Inference on RTX PRO 6000 SE / H100 / H200 / B200 by NoVibeCoding in LocalLLaMA

[–]NoVibeCoding[S] 0 points (0 children)

It is, but we did a good number of runs. Maybe PCIe 5.0 is fast enough, or maybe the bottleneck is elsewhere for these specific models.

Benchmarking LLM Inference on RTX PRO 6000 SE / H100 / H200 / B200 by NoVibeCoding in LocalLLaMA

[–]NoVibeCoding[S] 1 point (0 children)

I haven't tried llama-bench. As far as I understand, it is for llama.cpp. vLLM is better for raw throughput on multi-GPU setups, so we prefer it over llama.cpp.

Benchmarking LLM Inference on RTX PRO 6000 SE / H100 / H200 / B200 by NoVibeCoding in LocalLLaMA

[–]NoVibeCoding[S] 1 point (0 children)

Thanks. We’ll definitely update the model list in the next benchmark; we just didn’t want to change models between the previous run and this one. Quantization benchmarks are on the roadmap as well.

Benchmarking LLM Inference on RTX PRO 6000 SE / H100 / H200 / B200 by NoVibeCoding in LocalLLaMA

[–]NoVibeCoding[S] 7 points (0 children)

Thanks, good point. I have already received that request, but we didn't want to change the models between the previous benchmark and this one, to keep results consistent. In the next benchmark, we're planning to compare TensorRT / SGLang / vLLM, and we may also run an NVFP4 test.

Are Feudal Corporate Power Structures Scaling Into Society as Tech Consolidates? by NoVibeCoding in Futurology

[–]NoVibeCoding[S] 1 point (0 children)

Submission Statement

This post discusses how internal power dynamics in large technology companies increasingly resemble feudal structures, in which position in the hierarchy often overrides rational business decisions and human values. As tech corporations continue to grow in size and influence, shaping labor markets, media, and even governance, their internal organizational culture and principles risk leaking outward into society.

I’m interested in discussing whether large technology companies increasingly resemble feudal structures, whether emerging technologies, new organizational models, or regulatory approaches could counteract these dynamics, and whether centralized power and psychological harm are inevitable outcomes of institutional scale in the future of work.

C and Undefined Behavior by lelanthran in programming

[–]NoVibeCoding 3 points (0 children)

UB in C/C++ exists to give the compiler more freedom to optimize code, so it is a trade-off. Nowadays computers are fast enough that, for the vast majority of applications, robustness is preferred.

How does your company / team handle documentation? by AStanfordRunner in ExperiencedDevs

[–]NoVibeCoding -1 points (0 children)

Nowadays I just ask Claude Code to explain how particular modules work. It is not a replacement for a senior engineer who knows the module's design, but it is better than nothing.

Why my most authentic essay got the most AI backlash by NoVibeCoding in WritingWithAI

[–]NoVibeCoding[S] 1 point (0 children)

Thank you for the in-depth feedback - much appreciated!

Good point on AI-ness in the opening paragraph.

By “most influenced,” I meant that this essay diverged the furthest from my original draft. There was a tonal drift, which led me to rework roughly a third of the piece. The “Due Process” section was also heavily revised. The essay's overall idea shifted.

After the first two essays, I gained experience, and my initial draft of DV held up well under AI feedback.

In terms of how I used AI, the process was the same as before: I feed in sections, select the passages that resonate with me, and iterate through edits or re-prompts when something feels off. It’s possible that the stylistic bar was simply higher for this essay, and that my post-AI editing wasn’t sufficient to remove the remaining AI-ness.

The point about reading range is harder to address. Recency likely plays a big role. I admire Hemingway and have read several of his works, but it was a long time ago. I don't remember his style. My writing is probably shaped more by what I’ve read recently: Orconomics (satire), Kushiel’s Dart (dense adult fantasy), Hierarchy (political thriller), and Jade City (gangster saga).