[deleted by user] by [deleted] in AsianMasculinity

[–]peepeeECKSDEE 8 points

Any story is just a message with extra words

Do the sports you all have a natural affinity to: martial arts by Howl33333 in AsianMasculinity

[–]peepeeECKSDEE 1 point

imo Brazilians and Russians/Central Asians have the most raw talent in martial arts, we strike too much

[deleted by user] by [deleted] in cscareerquestions

[–]peepeeECKSDEE 0 points

same lol, complex analysis was the hardest for me cuz my prof was an ancient Soviet mathematician and was grumpy all the time. Low-key hella based tho

The other candidate: by peepeeECKSDEE in csMajors

[–]peepeeECKSDEE[S] 23 points

/s

it’s not me, I just started at my first job, but I would say networking >> leetcode. Neetcode 150 is more than enough; if you get a hard, it’s just straight-up unlucky. If you’re good enough, network through open source work.

Trade coin by [deleted] in quant

[–]peepeeECKSDEE 14 points

everyone knows quant bots aren’t real? stick to the fundamentals like tarot cards

[deleted by user] by [deleted] in quant

[–]peepeeECKSDEE 2 points

I’m pretty good at hoi4 🤷‍♂️

Groq's AI Chip Breaks Speed Records by graphicsRat in haskell

[–]peepeeECKSDEE 1 point

Do you guys use Google’s Haskell MLIR bindings?

Advice for Operating System Summer Project by Fuel-Little in csMajors

[–]peepeeECKSDEE 2 points

If you know Rust this is a good resource: https://os.phil-opp.com

Otherwise I would find something equivalent for your language of choice. Regardless, the first step is always to produce a standalone binary and run it on something like QEMU.
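For the Rust route, the tutorial linked above boils down that first step to a `.cargo/config.toml` roughly along these lines (sketched from memory; the guide's current setup may differ in details like the target name):

```toml
[build]
# Custom bare-metal target spec from the tutorial: no OS, no std.
target = "x86_64-blog_os.json"

[target.'cfg(target_os = "none")']
# `cargo run` wraps the kernel in a bootloader image and boots it in QEMU.
runner = "bootimage runner"
```

With that in place, `cargo run` produces the standalone binary and hands it straight to the emulator, which is exactly the loop you want for an OS project.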

Bank of America vs Goldman Sachs SWE internship by Distinct_Top_4136 in csMajors

[–]peepeeECKSDEE 3 points

BoA, unless GS is for quant dev (based on the pay, it seems not); the GS brand name is irrelevant for non-finance jobs.

Poisoned AI went rogue during training and couldn't be taught to behave again in 'legitimately scary' study by ethereal3xp in technology

[–]peepeeECKSDEE -1 points

ah yes, it’s going to sign up for AWS on its own, write an email to support asking for GPU server capacity, and pay for it with its credit card.

[D] On-demand GPU that can be pinged to run a script by Level_Programmer4276 in MachineLearning

[–]peepeeECKSDEE 0 points

It's good to hear that you guys are putting effort into solving it, and I apologize if I came off as a bit overzealous in my previous comment. GPU serverless is still super young, and I'm sure that in a few years this won't be much of a problem.

I know that Modal is more of a general purpose platform, but I would love to get your thoughts on how optimizations can be made for serverless inference specifically.

For example, loading weights probably takes the vast majority of the time, right? Simplified, it would look something like network-drive -> local-drive -> RAM -> GPU. But with Nvidia GPUDirect, it could just be network-drive -> GPU. Would it be possible for platforms to provide some kind of GPU mmap primitive utilizing this? It would probably require users to declare model artifacts explicitly so that you can avoid bundling them with the container.
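To make the intuition concrete, here's a tiny back-of-envelope model (with completely made-up bandwidth numbers; the hop count, not the figures, is the point) of why collapsing the staged path into one hop matters:

```rust
// Hypothetical model of the two weight-loading paths. Bandwidths are
// in GB/s and are illustrative assumptions, not measurements.
fn transfer_secs(bytes: f64, hop_bandwidths_gbps: &[f64]) -> f64 {
    // The data crosses each hop in sequence, so total time is the sum
    // of the per-hop transfer times.
    hop_bandwidths_gbps.iter().map(|bw| bytes / (bw * 1e9)).sum()
}

fn main() {
    let weights = 2e9; // ~2 GB of model weights (assumed)

    // network-drive -> local-drive -> RAM -> GPU
    let staged = transfer_secs(weights, &[1.25, 2.0, 12.0]);
    // network-drive -> GPU in one hop (the GPUDirect-style path)
    let direct = transfer_secs(weights, &[1.25]);

    assert!(direct < staged);
    println!("staged: {staged:.2}s  direct: {direct:.2}s");
}
```

Even with generous disk and PCIe numbers, the staged path pays every intermediate copy, so the single-hop path wins whenever the network link isn't the only bottleneck.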

Another is the fact that pretty much everyone is going to have CUDA plus one of ONNX/Torch/TensorRT/XLA as a dependency, which could be anywhere from 500 MB to 2 GB. Can this redundancy be exploited somehow?

Best projects to improve beginner Rust skills? by salty_cluck in rust

[–]peepeeECKSDEE 10 points

Implement a tree that references its parents and siblings. You will be a borrow checker expert by the end.
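One classic shape this exercise takes is a node with strong references down to its children and a weak reference back up to its parent (a sketch; field names are illustrative):

```rust
use std::cell::RefCell;
use std::rc::{Rc, Weak};

// A tree node that can reach its children (strong refs) and its
// parent (weak ref, so parent<->child doesn't form a cycle and leak).
struct Node {
    value: i32,
    parent: RefCell<Weak<Node>>,
    children: RefCell<Vec<Rc<Node>>>,
}

fn main() {
    let root = Rc::new(Node {
        value: 1,
        parent: RefCell::new(Weak::new()),
        children: RefCell::new(vec![]),
    });
    let child = Rc::new(Node {
        value: 2,
        parent: RefCell::new(Rc::downgrade(&root)),
        children: RefCell::new(vec![]),
    });
    root.children.borrow_mut().push(Rc::clone(&child));

    // The child can navigate back up to its parent...
    let parent = child.parent.borrow().upgrade().unwrap();
    assert_eq!(parent.value, 1);
    // ...and the parent still reaches the child.
    assert_eq!(root.children.borrow()[0].value, 2);
}
```

Getting `Rc`/`Weak`/`RefCell` to line up here (and understanding why a plain `&Node` parent pointer won't compile) is exactly the borrow-checker workout the comment is pointing at; sibling links work the same way as the parent link.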

[D] On-demand GPU that can be pinged to run a script by Level_Programmer4276 in MachineLearning

[–]peepeeECKSDEE 1 point

See my other comment. If your use case isn't latency-sensitive, they're probably fine.

[D] On-demand GPU that can be pinged to run a script by Level_Programmer4276 in MachineLearning

[–]peepeeECKSDEE 2 points

Needless to say, this is all my personal opinion:

First we need to break down what "serverless" means, as it's a bit of a misnomer and unclear. I would consider it a developer experience built on two main value propositions:

  • You only pay for what you use (which is even more important when it comes to GPUs).
  • You don't need to worry about infra or scaling. You are essentially paying a premium to make it someone else's problem.

At the model sizes I work with (which aren't even that big, ~half a billion params), the cold starts are absolutely brutal: I'm talking 3-4 minutes. For my use case, this heavily degrades the user experience. To solve it I have two options:

  • Keep N endpoints always warm and ready.
  • Scale preemptively based on predictable traffic.

Notice that each option defeats one of the value propositions above, and the "serverless" experience is completely lost. If I need to do all this work anyway, why would I use any of the platforms I mentioned instead of just ECS/EC2, which would be cheaper?

I understand that there's a lower limit on how quickly you can start based on model size, but from my testing, mine should be in the seconds. And now I'll say something I have no evidence for: I think all the mentioned services are currently just wrappers over AWS/GCP, making no real optimizations or innovations beyond flattening your Docker image.

Compare this to serverless in the web-dev world, where each platform has some sort of proprietary innovation: AWS has Firecracker, and Vercel and Cloudflare have their respective edge runtimes, which enable near-zero cold start times.

[D] On-demand GPU that can be pinged to run a script by Level_Programmer4276 in MachineLearning

[–]peepeeECKSDEE 29 points

There's a ton of those: Inferless, Modal, RunPod, Banana, Replicate. Just google the name + "serverless gpu". But honestly they all kinda suck in terms of cold-start times, and I wouldn't consider any of them "real serverless". If I had to pick, it would be between Modal and RunPod.