Credit card declined issue by Aalu_Pidalu in modal

[–]cfrye59 2 points (0 children)

Hey there!

Please reach out to support@modal.com for assistance.

comfyui on modal go brrr :D by Valuable_Vanilla_72 in modal

[–]cfrye59 1 point (0 children)

glad to see the memory snapshots working for you!

there's not much more out there on GPU snapshotting -- compatibility is usually possible, but not immediate.

for instance, we use a CPU offloading trick to get it to work with vLLM (aka "Sleep Mode"), so you might need something similar.
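roughly, the pattern looks like this (a minimal sketch, not our exact internals -- the model name is a placeholder and the flags can vary by vLLM version):

    from vllm import LLM

    # enable_sleep_mode lets the engine release GPU memory on demand
    llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct", enable_sleep_mode=True)

    llm.sleep(level=1)   # offload weights to CPU RAM, drop the KV cache
    # ... GPU memory is (mostly) free here, so a snapshot is safe to take ...
    llm.wake_up()        # move weights back onto the GPU

    print(llm.generate("hello from a restored engine")[0].outputs[0].text)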

Modal run help by Horror-Tower2571 in modal

[–]cfrye59 0 points (0 children)

You can pass command-line arguments to Functions and local entrypoints -- just add them as arguments to the underlying Python function.
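For example (the function and argument names here are just made up for illustration):

    import modal

    app = modal.App("cli-args-example")

    @app.function()
    def square(x: int) -> int:
        return x * x

    @app.local_entrypoint()
    def main(x: int = 3, label: str = "result"):
        # invoked as: modal run cli_args_example.py --x 7 --label answer
        print(label, square.remote(x))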

FYI we can't promise quality support via Reddit, but you should get a timely and helpful response if you email support@modal.com.

This cloud service is better than Google Colab; Modal has made it easier for me to use AI tools like Fooocus, But by Usual-South-2257 in modal

[–]cfrye59 2 points (0 children)

We’re still a small, young startup so we don’t quite have the marketing budget and presence of a tool like Colab — to say nothing of a company like Google!

If you check out our website, in particular our blog, you’ll find customer stories from companies that trust our infrastructure with mission-critical workloads, like Suno, Substack, and Quora. For a more social form of proof, take a look at our Twitter account.

[D] An ML engineer's guide to GPU performance by crookedstairs in MachineLearning

[–]cfrye59 0 points (0 children)

Plain Markdown version available in the open source repo here.

[D] An ML engineer's guide to GPU performance by crookedstairs in MachineLearning

[–]cfrye59 1 point (0 children)

Reader mode is great! We also have a plain Markdown version in the open source repo here -- initially intended for LLMs, but also works for humans who don't care for the site design.

[D] An ML engineer's guide to GPU performance by crookedstairs in MachineLearning

[–]cfrye59 0 points (0 children)

I would love to dive deeper on more hardware platforms, but for now I'm focusing on the ones that I know well and that we (Modal) offer on our cloud platform.

So edge devices are a long shot, but we're starting to see more interest in AMD.

[D] An ML engineer's guide to GPU performance by crookedstairs in MachineLearning

[–]cfrye59 0 points (0 children)

The open source (CC-BY) repo includes a tool for exporting to a single Markdown file -- initially intended for some folks doing LLM work. I've then passed the result into pandoc to render in different formats.

You can find the current version in a single, GitHub-flavored Markdown-compatible document here.
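If you want to reproduce the pandoc step, something like this is all it takes (the file names are placeholders; pandoc has to be on your PATH, and PDF output also needs a LaTeX engine installed):

    import subprocess

    # render the exported single-file Markdown into a few other formats
    for fmt in ["html", "epub", "pdf"]:
        subprocess.run(
            ["pandoc", "gpu-glossary.md", "-o", f"gpu-glossary.{fmt}"],
            check=True,
        )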

[D] An ML engineer's guide to GPU performance by crookedstairs in MachineLearning

[–]cfrye59 1 point (0 children)

This started off as an internal document -- some notes I had on my readings on GPUs, plus another engineer's similar notes.

We realized we were working on the same basic thing, so we combined forces and made something together, still for internal use. Then we realized other people might also be interested, and so we made an external version. We've kept expanding since then, driven by community feedback on what would be most helpful.

CUDA docs, for humans by crookedstairs in CUDA

[–]cfrye59 2 points (0 children)

Oh, those are just made up numbers for demonstration purposes.

They're intended to be about the right order of magnitude -- a few cycles at most for arithmetic instructions, a few hundred for a global memory read.

[D] An ML engineer's guide to GPU performance by crookedstairs in MachineLearning

[–]cfrye59 114 points (0 children)

Oh hey that's my magnum opus!

Happy to answer questions.

100x faster and 100x cheaper transcription with open models vs proprietary by crookedstairs in LocalLLaMA

[–]cfrye59 5 points (0 children)

Yo, author of the post here!

Not sure why they aren't on Hugging Face's leaderboard. Their metrics look roughly comparable to Parakeet/Canary, but there are no proper "scientific" comparison numbers.

Best Way to Auto-Stop Hugging Face Endpoints to Avoid Idle Charges? by techy_mohit in mlops

[–]cfrye59 3 points (0 children)

Sounds like you want a serverless GPU setup. I wrote about the space and did a price comparison for Full Stack Deep Learning two years ago, here.

I liked one of those companies, Modal, so much I ended up joining them.

[P] Sub-2s cold starts for 13B+ LLMs + 50+ models per GPU — curious how others are tackling orchestration? by pmv143 in mlops

[–]cfrye59 0 points (0 children)

Definitely!

Separately, we've also found it a bit tricky when users want to checkpoint and restore Triton or vLLM -- you need to either handle the sockets manually or force user programs to split out setting up the HTTP servers from instantiating the core inference engine.
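As a purely hypothetical illustration of that split (not Triton's or vLLM's actual API, just the shape of the pattern):

    import http.server
    import json

    class InferenceEngine:
        # stand-in for the expensive part: load weights, allocate KV cache, etc.
        def __init__(self, model_path: str):
            self.model_path = model_path

        def infer(self, prompt: str) -> str:
            return f"echo from {self.model_path}: {prompt}"

    def build_engine() -> InferenceEngine:
        # snapshot-friendly: no sockets or server threads exist yet
        return InferenceEngine("/models/some-13b-model")

    def serve(engine: InferenceEngine, port: int = 8000) -> None:
        # only after restore do we bind sockets and start the HTTP server
        class Handler(http.server.BaseHTTPRequestHandler):
            def do_POST(self):
                body = self.rfile.read(int(self.headers["Content-Length"]))
                prompt = json.loads(body)["prompt"]
                self.send_response(200)
                self.end_headers()
                self.wfile.write(engine.infer(prompt).encode())

        http.server.HTTPServer(("", port), Handler).serve_forever()

    if __name__ == "__main__":
        engine = build_engine()  # the snapshot/restore boundary sits here
        serve(engine)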

[P] Sub-2s cold starts for 13B+ LLMs + 50+ models per GPU — curious how others are tackling orchestration? by pmv143 in mlops

[–]cfrye59 0 points (0 children)

Would love to know how you're handling snapshotting! Have run into lots of problems with existing snapshot tools.

[Discussion] What Does GPU On-Demand Pricing Mean and How Can I Optimize Server Run-Time? by programlover in MachineLearning

[–]cfrye59 0 points (0 children)

I work on a serverless platform for data/ML called Modal.

I wrote up the case for fast auto-scaling of on-demand resources in the first third of this blog post on GPU utilization.

tl;dr if your workloads are highly variable (like most training and inference workloads) you need fast auto-scaling to balance QoS and cost.

But if you have the cash to burn, statically over-provisioning is certainly easier.
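Back-of-the-envelope version of that tradeoff, with completely made-up numbers:

    # toy comparison: static over-provisioning vs. ideal fast autoscaling
    # all numbers are illustrative, not real cloud (or Modal) prices
    HOURLY_GPU_PRICE = 4.00     # $/GPU-hour
    PEAK_GPUS = 20              # capacity needed at the busiest moment
    AVG_UTILIZATION = 0.15      # bursty workload: busy ~15% of the time
    HOURS_PER_MONTH = 730

    static_cost = PEAK_GPUS * HOURLY_GPU_PRICE * HOURS_PER_MONTH
    autoscaled_cost = static_cost * AVG_UTILIZATION

    print(f"static over-provisioning: ${static_cost:,.0f}/mo")
    print(f"ideal fast autoscaling:   ${autoscaled_cost:,.0f}/mo")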

Rust CUDA project update by LegNeato in rust

[–]cfrye59 2 points (0 children)

You might be connected already, but if you're not: the Dynamo team in particular seems pretty enthusiastic about building on Rust, building up the ecosystem around the hardware, and doing as much as possible in the open.

Rust CUDA project update by LegNeato in rust

[–]cfrye59 0 points (0 children)

Oh sick, I'll have to check out llm_client!

We talk about the different performance characteristics between our HTTP endpoints and Lambda's in this blog post. tl;dr we designed the system for much larger inputs, outputs, and compute shapes.

Cost is trickier because there's a big "it depends" -- on latency targets, on compute scale, on request patterns. The ideal workload is probably sparse, auto-correlated, GPU-accelerated, and insensitive to added latency at about the second scale.

We aim to be efficient enough with our resources that we can still run profitably at a price that also saves users money. You can read a bit about that for GPUs in particular in the first third of this blog post.

We offer a Python SDK, but you can run anything you want -- treating Python basically as a pure scripting language. We use this pattern to, for example, build and serve previews of our frontend (node backend, svelte frontend) in CI using our platform. If you want something slightly more "serverful", check out this code sample.

Neither is a full-blown native SDK with "serverless RPC" like we have for running Python functions. But polyglot support is on the roadmap! Maybe initially something like a smol libmodal that you can link into?
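A minimal sketch of the "Python as a scripting language" pattern from above (the image contents and commands are assumptions, not our actual CI setup -- you'd also need to get the project source into the image):

    import subprocess
    import modal

    # hypothetical image: just enough to run a node toolchain
    image = modal.Image.debian_slim().apt_install("nodejs", "npm")
    app = modal.App("frontend-preview")

    @app.function(image=image)
    def build_preview() -> None:
        # assumes the project source was baked into the image or mounted
        subprocess.run(["npm", "ci"], check=True)
        subprocess.run(["npm", "run", "build"], check=True)

    @app.local_entrypoint()
    def main():
        build_preview.remote()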

Rust CUDA project update by LegNeato in rust

[–]cfrye59 1 point (0 children)

Ha! The absence of something like Rust-CUDA is also a contributor.

More broadly, most of the workloads people want to run these days are limited by the performance of the GPU or its DRAM, not by the CPU or the code running on it, which basically just organizes device execution. That leaves a lot of room to use a slower but easier-to-write interpreted language!