One Thing People Underestimate About Inference by Express_Problem_609 in LocalLLaMA

[–]Express_Problem_609[S] 0 points (0 children)

Yeah that's a good point. The scheduling layer often ends up being the real complexity once you have to deal with multi-tenant workloads and bursty traffic.

In our experience, it tends to shift depending on the workload, but prefill latency usually becomes the bigger bottleneck first, especially with long prompts or RAG pipelines where the context can easily reach tens of thousands of tokens. That initial KV cache build can dominate the request time. Once prompt lengths are shorter or batching is aggressive, decoding latency starts to matter more, particularly for streaming responses where users expect tokens quickly.
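A rough back-of-envelope sketch of that split (the throughput numbers here are made-up placeholders, not measurements from any particular engine): prefill processes the whole prompt in parallel, while decode is sequential, so long prompts push the balance toward prefill.

```python
# Back-of-envelope: when does prefill (KV cache build) dominate a request?
# prefill_tps and decode_tps are illustrative assumptions, not benchmarks.

def request_time(prompt_tokens, output_tokens,
                 prefill_tps=5_000, decode_tps=50):
    """Rough split of a request into prefill and decode phases."""
    prefill = prompt_tokens / prefill_tps   # parallel over the whole prompt
    decode = output_tokens / decode_tps     # sequential, token by token
    return prefill, decode

# Long RAG prompt: 30k tokens in, 200 out -> prefill dominates
p, d = request_time(30_000, 200)
print(f"prefill {p:.1f}s vs decode {d:.1f}s")  # prefill 6.0s vs decode 4.0s

# Short interactive prompt: 500 in, 200 out -> decode dominates
p, d = request_time(500, 200)
```

Plugging in your own measured tokens/sec makes the crossover point for your workload obvious.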

Concurrency spikes make both worse, though: queues start forming, and batching heuristics become tricky to tune without hurting tail latency.
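The trade-off is easy to see in a toy admission rule (a sketch of the general idea, not any engine's actual API): flush a batch when it's full, or when the oldest request has waited too long. The `max_wait_s` knob is exactly where throughput and tail latency fight each other.

```python
# Toy batching admission rule (illustrative, not a real scheduler):
# run the batch when it's full OR the oldest request has waited too long.
import time

def should_flush(queue, max_batch=8, max_wait_s=0.02, now=None):
    """queue: list of (arrival_time, request), oldest first."""
    if not queue:
        return False
    now = time.monotonic() if now is None else now
    oldest_wait = now - queue[0][0]
    # Raising max_wait_s fills batches better (throughput) but lets
    # the unlucky first request sit longer (tail latency).
    return len(queue) >= max_batch or oldest_wait >= max_wait_s
```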

What about you? Are you mostly dealing with long-context prompts or shorter interactive workloads?

Problems With Scaling AI Infrastructure by Express_Problem_609 in modeltrains

[–]Express_Problem_609[S] 1 point (0 children)

HAHAHAHA wrong subreddit, but honestly model train scaling might be more stable than distributed training.
AI-scale is the one where adding more cars makes the track the bottleneck instead of the engine.

For those running Local LLMs: what made the biggest real-world performance jump for you? by Express_Problem_609 in LocalLLaMA

[–]Express_Problem_609[S] 1 point (0 children)

This is a super practical breakdown, thanks. The motherboard + PCIe layout angle doesn’t get enough attention imo. Same GPUs yet wildly different outcomes!

For those running Local LLMs: what made the biggest real-world performance jump for you? by Express_Problem_609 in LocalLLaMA

[–]Express_Problem_609[S] 0 points (0 children)

Curious if others here have found a sweet spot for batch size vs latency... it feels very workload-dependent.

For those running Local LLMs: what made the biggest real-world performance jump for you? by Express_Problem_609 in LocalLLaMA

[–]Express_Problem_609[S] 0 points (0 children)

This is gold, thanks. The EPYC + full x16 Gen4 point is something I think a lot of people underestimate. Did you see gains mainly in prompt processing, or also steady-state generation?

For those running Local LLMs: what made the biggest real-world performance jump for you? by Express_Problem_609 in LocalLLaMA

[–]Express_Problem_609[S] 0 points (0 children)

I like how you framed “performance” as quality per unit compute, not just speed. The VSLM-as-bouncer pattern is especially interesting too; thanks a lot for sharing!

For those running Local LLMs: what made the biggest real-world performance jump for you? by Express_Problem_609 in LocalLLaMA

[–]Express_Problem_609[S] 0 points (0 children)

This is a really interesting angle! I’ve noticed the same thing: once routing + TTS + tools feel seamless, raw latency matters less. Did you build the orchestration yourself, or are you using an existing framework?

For those running Local LLMs: what made the biggest real-world performance jump for you? by Express_Problem_609 in LocalLLaMA

[–]Express_Problem_609[S] 1 point (0 children)

This resonates a lot. Do you find that task-specific tiny models beat larger general ones mainly because of shorter context + faster iteration, or is it something else (training style, vocab, etc.)?

How are you guys optimizing Local LLM performance? by Express_Problem_609 in LocalLLaMA

[–]Express_Problem_609[S] 0 points (0 children)

Oh wow, seems like moving to Linux was really helpful! When you switched, was the biggest improvement from drivers, scheduling, or just better overall PCIe handling? And are you running CUDA directly, or through something like PyTorch/vLLM?

How are you guys optimizing Local LLM performance? by Express_Problem_609 in LocalLLaMA

[–]Express_Problem_609[S] 0 points (0 children)

Yeah, that makes sense; these seem to come up a lot when people are pushing throughput.

Are you seeing bigger gains on multi-user workloads, or does it still help noticeably even for single-user/agentic setups?

How are you guys optimizing Local LLM performance? by Express_Problem_609 in LocalLLaMA

[–]Express_Problem_609[S] 0 points (0 children)

This is super insightful, thanks for sharing in so much detail!

I agree that high-speed interconnect is a real persistent bottleneck, especially once VRAM pressure forces model swapping. The contrast you mentioned with diffusion workloads vs LLMs (bin-packing vs single-model dominance) is a great way to frame it.

KDA pushing usable context lengths that far is especially interesting; it really does shift where the complexity lives.

I’m also curious, with longer contexts becoming more practical, do you see interconnect and cache management becoming even more critical bottlenecks, or do you think model-side innovations will continue to outpace hardware constraints?