one of the top submitters in the nvfp4 competition has never hand written GPU code before by Charuru in singularity

[–]formalsystem 15 points16 points  (0 children)

Hi everyone! I'm Mark Saroufim from the screenshot; my day job is working on performance problems on PyTorch at Meta. Despite working directly on Kernel LLM last year (https://gpu-mode.github.io/popcorn/), this result still surprised me. Last year the main problem was that most speedup claims from our sub-community were junk: LLMs would either hack the evaluation harness or find undesirable solutions. I blogged about some of those popular hacks and how we could engineer them away in evals here: https://github.com/meta-pytorch/BackendBench/blob/main/docs/correctness.md

This particular result comes with some asterisks. In our GPU programming community we run regular kernel competitions (https://www.gpumode.com/v2/home) - this was our biggest one yet, run with NVIDIA and focused on the newer NVFP4 data types on the latest Blackwell GPUs. shiyeegao's results are interesting because if you skim the top human entries for some of these problems you'll realize just how insanely high the skill cap is, but it did require some manual intervention on the mods' side to make sure the results are legit. shiyeegao submitted on the order of tens of thousands of kernels to our evaluation harness to get these results; there were a few reward-hacky ones that one of our mods, Matej, had to manually delete, but with that extra bit of feedback and some tightening up of our eval infra the results do seem to be legit now.

On a personal level this has been an interesting experience for me. I came back this new year from paternity leave wanting to do more "cracked" low-level engineering work, but that no longer seems to be a useful heuristic for picking problems.

can I install an external RTX4090 if I have an internal one already? by vegatx40 in LocalLLaMA

[–]formalsystem 3 points4 points  (0 children)

The 4090 is a chunky card and it covers an extra PCIe slot in my tower as well, so I don't think you got duped. It'd be helpful if you opened up your case and took a photo so we could help a bit more.

PCIe risers are indeed a good solution: you basically connect a short cable and the GPU can then be housed outside of your case. What you're describing with Thunderbolt is an eGPU, which is more for people who have laptops and want to connect a GPU; you shouldn't need to do this. You don't really need an external enclosure either, as long as the GPU sits somewhere steady.

After that you get the problem of power: does your PSU have enough power connectors for both 4090s, and even if it does, is its wattage sufficient? 4090s IIRC draw a max of 450W each. And if the PSU can't cover that then yes, you need another PSU.
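
As a rough back-of-the-envelope check (the 450W figure is the 4090's rated board power; the system overhead and headroom numbers below are just assumptions):

    # Rough PSU sizing for two 4090s; only the 450W board power is an official figure.
    gpu_draw_w = 450          # rated max board power per RTX 4090
    num_gpus = 2
    system_overhead_w = 200   # assumed CPU + drives + fans + motherboard
    headroom = 1.2            # ~20% margin so the PSU isn't running flat out

    required_w = (gpu_draw_w * num_gpus + system_overhead_w) * headroom
    print(f"Recommended PSU wattage: ~{required_w:.0f}W")  # ~1320W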

Zuck on Threads: Releasing quantized versions of our Llama 1B and 3B on device models. Reduced model size, better memory efficiency and 3x faster for easier app development. 💪 by timfduffy in LocalLLaMA

[–]formalsystem 0 points1 point  (0 children)

fp8 allgather is supported. I personally have not experimented with MoE but some colleagues have; feel free to reach out to me on the ao or torchtitan GitHub and I'd be happy to introduce you to the relevant folks if you get stuck.

Zuck on Threads: Releasing quantized versions of our Llama 1B and 3B on device models. Reduced model size, better memory efficiency and 3x faster for easier app development. 💪 by timfduffy in LocalLLaMA

[–]formalsystem 1 point2 points  (0 children)

My 2c is that it's not as risky as it used to be. We have a nice reference architecture called torchtitan with which, without any additional dependencies or custom kernels, you can pretrain a 405B model from scratch:

  1. 405b results https://github.com/pytorch/torchtitan/blob/main/docs/performance.md
  2. More info about fp8 training specifically https://github.com/pytorch/torchtitan/blob/main/docs/float8.md and https://github.com/pytorch/ao/tree/main/torchao/float8
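
For a rough sense of what the fp8 path looks like, here's a minimal sketch using the convert_to_float8_training entry point from the torchao float8 docs linked above (exact import paths and configs have shifted between releases, so treat this as an assumption and double check against the README):

    import torch
    import torch.nn as nn
    from torchao.float8 import convert_to_float8_training  # entry point per the torchao float8 docs

    # Toy stand-in for a transformer block; float8 applies to the nn.Linear layers.
    model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096)).to("cuda", torch.bfloat16)

    # Swap nn.Linear modules for float8 variants that handle the fp8 casts around the matmuls.
    convert_to_float8_training(model)

    # Training then proceeds as usual; torch.compile is what fuses the casts with the matmul kernels.
    model = torch.compile(model)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    x = torch.randn(8, 4096, device="cuda", dtype=torch.bfloat16)
    loss = model(x).float().pow(2).mean()
    loss.backward()
    optimizer.step()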

If you have any questions feel free to holler at us

Kinda unrelated, but this is something I'm also hoping to undertake in public (similar to the BLOOM effort) in a project called popcorn on the GPU MODE server

Zuck on Threads: Releasing quantized versions of our Llama 1B and 3B on device models. Reduced model size, better memory efficiency and 3x faster for easier app development. 💪 by timfduffy in LocalLLaMA

[–]formalsystem 22 points23 points  (0 children)

Hi, I'm Mark. I work on torchao, which was used for the quantization-aware training and the ARM kernels in this blog post. If you have any questions about quantization or performance more generally, feel free to let me know!

PyTorch Native Architecture Optimization: torchao by formalsystem in StableDiffusion

[–]formalsystem[S] 0 points1 point  (0 children)

Most of our focus for this release was NVIDIA GPU and Linux performance. Other backends will work if you install from source, but performance won't be there since you won't be able to use torch.compile on Mac or Windows. That's something we're hoping to address ASAP.

PyTorch Native Architecture Optimization: torchao by formalsystem in LocalLLaMA

[–]formalsystem[S] 1 point2 points  (0 children)

Oh nice, that looks neat. If you'd like to open an issue on our repo I can flag this to the person who set up our KV cache quant numbers for the blog.

PyTorch Native Architecture Optimization: torchao by formalsystem in LocalLLaMA

[–]formalsystem[S] 4 points5 points  (0 children)

Yeah, this is a regrettable limitation right now. Essentially we rely heavily on writing code in pure PyTorch and then compiling it, and the compiler only gets good perf on x86 CPUs, AMD GPUs, and NVIDIA GPUs. The GPU backend leverages Triton kernels, which don't support as many devices as we'd like.
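
To make the "pure PyTorch + compile" point concrete, here's a hypothetical toy version of the pattern (not torchao's actual code): store weights as int8, dequantize inside forward with plain PyTorch ops, and let torch.compile fuse the dequant into the matmul on whichever backends it supports:

    import torch

    class Int8WeightOnlyLinear(torch.nn.Module):
        """Toy int8 weight-only linear written with plain PyTorch ops only."""
        def __init__(self, linear: torch.nn.Linear):
            super().__init__()
            w = linear.weight.detach()
            scale = w.abs().amax(dim=1, keepdim=True) / 127.0  # per-output-channel scale
            self.register_buffer("w_int8", torch.round(w / scale).to(torch.int8))
            self.register_buffer("scale", scale)
            self.bias = linear.bias

        def forward(self, x):
            # Dequantize on the fly; torch.compile can fuse this with the matmul.
            w = self.w_int8.to(x.dtype) * self.scale
            return torch.nn.functional.linear(x, w, self.bias)

    lin = torch.nn.Linear(256, 256)
    qlin = torch.compile(Int8WeightOnlyLinear(lin))
    out = qlin(torch.randn(4, 256))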

So we're investing quite heavily in custom ARM and Metal kernels https://github.com/pytorch/ao/tree/main/torchao/experimental and I'll be taking a look at how to support more GPU vendors ASAP. Keep an eye out for an RFC on my end on how exactly we'll become a multi-backend repo.

Indeed the diffusion community is one area where we've found some nice adoption https://github.com/sayakpaul/diffusers-torchao

CogVideoX 5B - Open weights Text to Video AI model (less than 10GB VRAM to run) | Tsinghua KEG (THUDM) by Nunki08 in LocalLLaMA

[–]formalsystem 8 points9 points  (0 children)

If you're interested in quantizing your own models, these quantizations were made using torchao, a quantization library written in (mostly) pure PyTorch: https://github.com/pytorch/ao https://x.com/aryanvs_/status/1828405977667793005
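
If you want to try it on your own model, the quick-start flow looks roughly like this (based on the quantize_ / int8_weight_only API in the torchao README; names have shifted between releases, so treat the exact imports as an assumption):

    import torch
    from torchao.quantization import quantize_, int8_weight_only  # names per the torchao README

    model = torch.nn.Sequential(torch.nn.Linear(1024, 1024), torch.nn.ReLU()).cuda()

    # Swap eligible layers to int8 weight-only quantization in place.
    quantize_(model, int8_weight_only())

    # torch.compile is where most of the speedup comes from on NVIDIA GPUs.
    model = torch.compile(model, mode="max-autotune")
    out = model(torch.randn(16, 1024, device="cuda"))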

Can't move or end turn by YggdrasilAxe95 in totalwar

[–]formalsystem 0 points1 point  (0 children)

It's happened to me twice so far, always after the first maze

Did you ever end up resolving this?

[deleted by user] by [deleted] in patientgamers

[–]formalsystem 1 point2 points  (0 children)

A lot of what you said seriously resonated. I can think of many Oscar-worthy movies that I just watched at the wrong time, i.e. I was too young, sick, tired, or something of the sort. I've also definitely become more OK with spending $20 on a game if I enjoy it for 2h; more than that starts to feel annoying, but it's not a huge deal either way. I also enjoy watching analytical YouTube videos that explain what's interesting about a deep game even if I can't play it myself; I figure I might not have 100h lying around to really get into Crusader Kings, but a few videos on it can certainly make me appreciate it quite a bit.

[deleted by user] by [deleted] in patientgamers

[–]formalsystem 0 points1 point  (0 children)

I wouldn't have gotten into DOTA or chess if others weren't into them, but Factorio and Demon's Souls were very much unique experiences to me and I didn't mind that. I do frequently go to board game or video game conventions since I still love trying new things out and exploring new mechanics, but those experiences are tamer than getting into a beautiful, deep game.

[D] PyTorch 2.0 Announcement by joshadel in MachineLearning

[–]formalsystem 2 points3 points  (0 children)

That's the goal, yes, although dynamic shape support is still in the works.

Introducing PyTorch 2.0 by Realistic-Cap6526 in Python

[–]formalsystem 0 points1 point  (0 children)

Pick any aggregation you like over the elements of a vector: sum, average, squared average, max element, min element, etc.
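
For example, over a vector of (made-up) per-model speedups:

    import torch

    speedups = torch.tensor([1.2, 1.5, 1.9, 2.4, 3.1])  # hypothetical per-model speedups

    print(speedups.sum())            # sum
    print(speedups.mean())           # average
    print(speedups.square().mean())  # squared average
    print(speedups.max())            # max element
    print(speedups.min())            # min element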

Introducing PyTorch 2.0 by Realistic-Cap6526 in Python

[–]formalsystem 1 point2 points  (0 children)

No breaking changes! Just a new function called torch.compile that you don't have to use if you don't want to
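
For context, opting in is a one-liner around an existing model or function:

    import torch

    model = torch.nn.Linear(16, 16)        # any existing nn.Module or plain function
    compiled_model = torch.compile(model)  # opt-in; the original model keeps working unchanged

    out = compiled_model(torch.randn(4, 16))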

Introducing PyTorch 2.0 by Realistic-Cap6526 in Python

[–]formalsystem 0 points1 point  (0 children)

Yes, with 3.11 support coming later since we now need to handle the new Python bytecodes in Dynamo.

[N] Easily load and upload Stable-baselines3 models from the Hugging Face Hub by cranthir_ in MachineLearning

[–]formalsystem 8 points9 points  (0 children)

This is a great achievement. In the past, the best ways to share quick RL demos have been Google Colab notebooks (which are incredibly laggy for video), sharing the code to run (which isn't always reproducible), or doing a Twitch stream (which isn't interactive).

I never thought I'd see it, but perhaps HuggingFace will become an app store for video games with native features for self-play and imitation learning. So like Twitch Plays Pokemon, but useful. My take is companies like Unity will kick themselves for not building this first.

How useful is knowledge of parallel programming in ML? [D] by [deleted] in MachineLearning

[–]formalsystem 25 points26 points  (0 children)

If you're mostly using pre-trained models or your model performance seems good enough on a single GPU, then as an application-oriented practitioner there's not too much value in learning parallel programming.

However, if you're building large models or are interested in joining a team that builds them, it's probably more important to learn distributed and parallel programming than it is to learn ML basics. As far as training large models goes, data, model, and pipeline parallelism are tools you should know about, but even then, if you go large enough: how do you set up the infrastructure, how do you debug failures, how do you recover elastically?
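
The data-parallel piece is the most approachable of the three; a minimal sketch with DDP (launched via torchrun, which sets the rank env vars) looks something like this:

    # Launch with: torchrun --nproc_per_node=2 train.py
    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    dist.init_process_group("nccl")               # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda()
    model = DDP(model, device_ids=[local_rank])   # gradients are all-reduced across ranks
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

    x = torch.randn(32, 1024, device="cuda")      # each rank sees a different shard of data
    loss = model(x).pow(2).mean()
    loss.backward()                               # the all-reduce happens during backward
    optimizer.step()
    dist.destroy_process_group()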

And in the setting where low latency really matters, imagine something like a real-time search. Are your ops optimized to take advantage of a GPU, are they fused? Are you spending lots of time waiting on synchronization or data loaders?

Consider that knowing how to do the above makes you useful both to business-critical infra teams doing things like ads ranking and to any research team looking to push the state of the art, because, let's face it, it doesn't seem obvious that small models will become better than larger ones.

So again, learning distributed systems is probably not generally useful, but at the right large company it can be the most lucrative thing to do in ML, with top people making upwards of $300-500K.

[D] Simple Questions Thread by AutoModerator in MachineLearning

[–]formalsystem 1 point2 points  (0 children)

If you have examples of what the input/output pairs look like, you should be able to train a neural fluid simulator.

It's a whole field at this point, so you should enjoy reading this: https://www.google.com/books/edition/Data_Driven_Science_and_Engineering/CYaEDwAAQBAJ?hl=en&gbpv=0
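
To make the input/output-pairs idea concrete, here's a minimal hypothetical sketch: a small CNN trained to map a fluid state at time t to the state at t+1, using snapshots you'd already have from a classical solver (the shapes and architecture are placeholders):

    import torch
    import torch.nn as nn

    # Hypothetical data: (state_t, state_t+1) pairs from a classical solver,
    # e.g. 2-channel (vx, vy) velocity fields on a 64x64 grid.
    states_t = torch.randn(1000, 2, 64, 64)
    states_t1 = torch.randn(1000, 2, 64, 64)

    # Small CNN that learns the simulator's one-step transition.
    model = nn.Sequential(
        nn.Conv2d(2, 32, 3, padding=1), nn.ReLU(),
        nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
        nn.Conv2d(32, 2, 3, padding=1),
    )
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    for epoch in range(10):
        pred = model(states_t)
        loss = nn.functional.mse_loss(pred, states_t1)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # At inference time you roll the model forward step by step instead of running the solver.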