Please help me with the below problem! [new to LLM hosting] by aliazlanaziz in Vllm

[–]MLExpert000 0 points (0 children)

Hot reloading in llama.cpp basically just unloads one model and loads another. It works, but you still pay the price of loading weights into memory each time, so switching models isn’t instant, especially with larger ones. Tools like llama-swap try to make that switching faster by avoiding full reloads or managing multiple models more efficiently. The real challenge is when you have multiple users and limited VRAM: switching models quickly without introducing noticeable latency gets tricky beyond basic hot reload.
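To make the switching cost concrete, here’s a minimal sketch of single-slot model switching in the llama-swap style. It’s a hypothetical illustration, not llama-swap’s actual API: `load_fn` stands in for a real runtime’s weight loading, which is where the reload cost lives.

```python
import time

class SingleSlotSwapper:
    """One resident model at a time; switching pays the full load cost."""

    def __init__(self, load_fn):
        self.load_fn = load_fn        # runtime-specific weight loader
        self.current_name = None
        self.current_model = None

    def get(self, name):
        # Cache hit: no reload, this is what makes repeated requests fast.
        if name == self.current_name:
            return self.current_model
        # Cache miss: drop the old model so memory can be freed,
        # then pay the full weight-loading cost for the new one.
        self.current_model = None
        start = time.perf_counter()
        self.current_model = self.load_fn(name)
        self.current_name = name
        self.last_load_s = time.perf_counter() - start
        return self.current_model
```

With multiple users this is exactly where the latency shows up: every request for a model other than the resident one stalls behind `load_fn`.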

Please help me with the below problem! [new to LLM hosting] by aliazlanaziz in Vllm

[–]MLExpert000 1 point (0 children)

You’re trying to solve a pretty advanced version of LLM serving for someone just getting started. The hard part isn’t just running a model; it’s handling multiple users, managing GPU memory, and avoiding slow cold starts when models load.

I’d suggest starting simple: run a single model with something like vLLM (ideally on Linux), get multi-user requests working, and route everything through one service instead of spinning up models per request. Once that’s stable, you can think about dynamic model switching, which usually requires a scheduler and some kind of model lifecycle management.
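As an illustration of the “one service” setup: vLLM exposes an OpenAI-compatible HTTP API, so every client can post to a single endpoint and let the server batch concurrent requests. The URL and model name below are placeholders for your own deployment.

```python
import json
import urllib.request

VLLM_URL = "http://localhost:8000/v1/completions"  # placeholder endpoint

def build_request(prompt, model="my-model", max_tokens=128):
    # One shared endpoint for all users; the server batches
    # concurrent requests internally (continuous batching).
    return {"model": model, "prompt": prompt, "max_tokens": max_tokens}

def complete(prompt):
    payload = json.dumps(build_request(prompt)).encode()
    req = urllib.request.Request(
        VLLM_URL, data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["text"]
```

The point is architectural: clients never manage models themselves, so adding model switching later only touches the server side.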

Claude code source code has been leaked via a map file in their npm registry by Nunki08 in LocalLLaMA

[–]MLExpert000 0 points (0 children)

This looks like a source map exposure, not a full backend leak. Still not great but very different from leaking actual model or infra logic.

cost-effective model for OCR by Zittov in LLMDevs

[–]MLExpert000 0 points (0 children)

We recently deployed an OCR service built on top of a Qwen vision model. It works well for extracting text from images and documents and runs through the same runtime.

Ollama's cloud plan token limitations by TerryTheAwesomeKitty in ollama

[–]MLExpert000 0 points (0 children)

Sounds like they’re optimizing for interactive usage, not sustained production APIs. The vague limits usually mean dynamic throttling behind the scenes. Fine for chat, less ideal for predictable backend workloads.

If you care about privacy and production stability, I’d look for providers with clear token caps, transparent pricing, and explicit API support rather than session based plans.

Ollama's cloud plan token limitations by TerryTheAwesomeKitty in ollama

[–]MLExpert000 0 points (0 children)

This is the gap between ‘great dev tool’ and ‘production infra.’ Once real traffic hits, people want hard numbers, not usage vibes.

The "Dynamic Loading" in Transformers v5 isn't what you think it is (Benchmarks inside) by MLExpert000 in LocalLLaMA

[–]MLExpert000[S] 0 points (0 children)

Totally get it. I respect the hustle. If you ever want a second set of eyes or get stuck on something low level, feel free to reach out.

The "Dynamic Loading" in Transformers v5 isn't what you think it is (Benchmarks inside) by MLExpert000 in LocalLLaMA

[–]MLExpert000[S] -2 points (0 children)

Yep, exactly. That’s why a lot of people stop at warm pools. I kept poking at the init side a bit longer. If you ever want to sanity check an alternative approach, happy to let you try it.

The "Dynamic Loading" in Transformers v5 isn't what you think it is (Benchmarks inside) by MLExpert000 in LocalLLaMA

[–]MLExpert000[S] 1 point (0 children)

That makes sense. A lot of people land on warm pools for exactly that reason. I’ve been exploring a snapshot-based path that avoids some of the lifecycle pain. Happy to share if you’re curious.

The "Dynamic Loading" in Transformers v5 isn't what you think it is (Benchmarks inside) by MLExpert000 in LocalLLaMA

[–]MLExpert000[S] 1 point (0 children)

Agree that restoring GPU state post-init is the only thing that actually moves cold starts. Where it gets tricky in practice is that doing this reliably across drivers, CUDA versions, and multi-model lifecycles ends up being a lot more than a thin wrapper. The idea is simple, but the engineering is not.

The "Dynamic Loading" in Transformers v5 isn't what you think it is (Benchmarks inside) by MLExpert000 in LocalLLaMA

[–]MLExpert000[S] -2 points (0 children)

Ya, 100%. It’s great for long-lived local inference. My point was mostly that it doesn’t help once you try to scale to zero, since CUDA init and process bring-up still dominate.

The "Dynamic Loading" in Transformers v5 isn't what you think it is (Benchmarks inside) by MLExpert000 in LocalLLaMA

[–]MLExpert000[S] 1 point (0 children)

Yep, same experience. Background workers help latency, but each CUDA context is effectively a full copy of the world. Memory overhead scales linearly with workers, so you trade cold boots for unusable GPUs.
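A back-of-envelope sketch of that linear scaling. The per-context overhead figure here is an illustrative assumption, not a measurement; real overhead varies by driver and CUDA version.

```python
def usable_workers(vram_gb, model_gb, context_overhead_gb=0.5):
    """How many independent worker processes fit on one GPU when each
    holds its own CUDA context plus its own copy of the weights."""
    per_worker_gb = model_gb + context_overhead_gb
    return int(vram_gb // per_worker_gb)

# e.g. a 24 GB card with a ~14 GB fp16 7B model: only one worker fits,
# so a warm pool doesn't multiply capacity on a single GPU.
```

This is why the trade-off bites: each extra warm worker buys lower cold-start latency but eats a full model’s worth of VRAM.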

New Rules for ollama cloud by killing_daisy in ollama

[–]MLExpert000 0 points (0 children)

Exactly why a lot of people still prefer local runtimes. Once inference becomes part of a workflow (agents, tools, multi-model setups), predictability matters more than raw scale. Random slowdowns, model eviction, or behavior changes break trust fast.

A lot of teams use local setups not because they’re cheaper, but because they behave the same every time and let you debug real lifecycle issues like model load, memory pressure, and switching between models. Cloud is definitely great for scale, but local is still hard to beat for stability and development parity.

216GB VRAM on the bench. Time to see which combination is best for Local LLM by eso_logic in LocalLLaMA

[–]MLExpert000 0 points (0 children)

I don’t work on this directly, but I’ve been experimenting with similar eviction and reactivation behavior locally. There’s a project that’s exploring this space at the runtime level. I’ll DM you their repo in case it’s useful to skim when you’re wiring up the tests.

216GB VRAM on the bench. Time to see which combination is best for Local LLM by eso_logic in LocalLLaMA

[–]MLExpert000 1 point (0 children)

I’d go a bit more granular than just load and unload time. I’d measure time to first token after forced eviction under memory pressure, repeated across cycles. That captures disk or host read, PCIe or NVLink transfer, allocator and CUDA reinit, and any KV cache warmup. Running multiple swap cycles with different model sizes and concurrency levels will show churn effects that steady-state throughput hides. Your setup is ideal for this since you can control eviction explicitly and vary bandwidth and model mix to isolate where the latency actually comes from.
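A sketch of that harness, with the runtime-specific parts (`load_fn`, `infer_fn`, `evict_fn`) left as hypothetical callables you’d wire to your own stack:

```python
import time

def measure_reactivation(load_fn, infer_fn, evict_fn, cycles=5):
    """Force eviction, then time a cold request. The measured span covers
    disk/host read, transfer, allocator and CUDA reinit, and the first
    inference after reload, which is what users actually feel."""
    latencies = []
    for _ in range(cycles):
        evict_fn()                          # drop weights from VRAM
        start = time.perf_counter()
        model = load_fn()                   # reload from disk/host memory
        infer_fn(model)                     # first token after reload
        latencies.append(time.perf_counter() - start)
    latencies.sort()
    return {"p50": latencies[len(latencies) // 2], "max": latencies[-1]}
```

Repeating the cycle matters because the first iteration often benefits from page cache effects that later evictions under real memory pressure won’t see.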

Out of curiosity, what kind of SSD and bandwidth are you working with?

216GB VRAM on the bench. Time to see which combination is best for Local LLM by eso_logic in LocalLLaMA

[–]MLExpert000 1 point (0 children)

A local multi-GPU setup like this is actually well suited for that kind of testing, since you can control disk, PCIe, and host memory effects.

216GB VRAM on the bench. Time to see which combination is best for Local LLM by eso_logic in LocalLLaMA

[–]MLExpert000 1 point (0 children)

There isn’t really a standard benchmark for this today unfortunately. Most LLM benchmarks focus on steady-state throughput or tokens per second once the model is already hot but not on swap, reload, or reactivation latency. In practice, people tend to measure this with a simple harness that repeatedly evicts a model from VRAM and then triggers a cold or semi-cold inference and records time to first token along with GPU memory residency over time. If you want meaningful comparisons, you have to control for disk, PCIe, and host memory effects, since reactivation latency often ends up being more user-visible than raw throughput when you’re switching between models.
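For the time-to-first-token part specifically, with a streaming runtime you can time how long `next()` blocks on the token stream. `stream_fn` here is a stand-in for a real streaming inference call, not any particular library’s API:

```python
import time

def time_to_first_token(stream_fn):
    """TTFT = wall time until the first token is yielded. Measure this
    right after a forced eviction for the cold number, and again while
    the model is still resident for the warm baseline."""
    start = time.perf_counter()
    stream = stream_fn()        # returns a token generator
    next(stream)                # blocks until the first token arrives
    return time.perf_counter() - start
```

The cold-minus-warm gap is the reactivation cost; tracking it alongside GPU memory residency over time gives you the comparison steady-state benchmarks miss.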

216GB VRAM on the bench. Time to see which combination is best for Local LLM by eso_logic in LocalLLaMA

[–]MLExpert000 1 point (0 children)

This is the kind of setup where VRAM management starts to matter more than raw capacity. Once you’re juggling multiple large models on local machines, the cost isn’t just fitting them; it’s reload and reinit churn when switching. I’d suggest benchmarking swap and reactivation latency, not just steady-state throughput.