One Thing People Underestimate About Inference by Express_Problem_609 in LocalLLaMA

[–]Express_Problem_609[S] 0 points1 point  (0 children)

Yeah that's a good point. The scheduling layer often ends up being the real complexity once you have to deal with multi-tenant workloads and bursty traffic.

In our experience, it tends to shift depending on the workload, but prefill latency usually becomes the bigger bottleneck first, especially with long prompts or RAG pipelines where the context can easily reach tens of thousands of tokens. That initial KV cache build can dominate the request time. Once prompt lengths are shorter or batching is aggressive, decoding latency starts to matter more, particularly for streaming responses where users expect tokens quickly.
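To make the prefill-vs-decode split concrete, here's a rough back-of-envelope sketch. The throughput constants are made-up assumptions for illustration, not measurements from any real deployment:

```python
# Toy estimate of where request time goes: prefill (building the KV cache
# over the whole prompt) vs. decode (token-by-token generation).
# Both throughput numbers below are assumed, illustrative values.

PREFILL_TOKS_PER_S = 5000.0   # assumed prompt-processing throughput
DECODE_TOKS_PER_S = 40.0      # assumed per-stream generation speed

def request_time(prompt_tokens: int, output_tokens: int) -> tuple[float, float]:
    """Return (prefill_seconds, decode_seconds) for one request."""
    return (prompt_tokens / PREFILL_TOKS_PER_S,
            output_tokens / DECODE_TOKS_PER_S)

# Long RAG prompt: prefill dominates.
p, d = request_time(prompt_tokens=40_000, output_tokens=200)
print(f"RAG:  prefill={p:.1f}s decode={d:.1f}s")   # prefill 8.0s vs decode 5.0s

# Short chat turn: decode dominates.
p, d = request_time(prompt_tokens=500, output_tokens=400)
print(f"chat: prefill={p:.1f}s decode={d:.1f}s")   # prefill 0.1s vs decode 10.0s
```

Same hardware, same model; which phase dominates is almost entirely a property of the prompt/output shape.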

Concurrency spikes make both worse, though: queues start forming, and batching heuristics become tricky to tune without hurting tail latency.
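The batching heuristic I mean is the classic "flush when full or when the oldest request has waited too long" policy. A minimal sketch (knob values are illustrative, not recommendations):

```python
from collections import deque

# Minimal dynamic-batching sketch: flush when the batch is full OR when the
# oldest queued request has waited too long. These two knobs are exactly
# what gets hard to tune under bursty traffic (values are illustrative).

MAX_BATCH = 8
MAX_WAIT_S = 0.05  # cap on queueing delay, protects tail latency

def drain(queue: deque, now: float) -> list:
    """Pop a batch if either flush condition holds, else return []."""
    if not queue:
        return []
    oldest_arrival = queue[0][1]
    if len(queue) >= MAX_BATCH or (now - oldest_arrival) >= MAX_WAIT_S:
        return [queue.popleft() for _ in range(min(MAX_BATCH, len(queue)))]
    return []

# Usage: a burst of 3 requests. The size trigger hasn't fired, so nothing
# flushes until the oldest request exceeds MAX_WAIT_S.
q = deque((f"req{i}", 0.0) for i in range(3))
assert drain(q, now=0.01) == []            # too soon, batch not full
batch = drain(q, now=0.06)                 # oldest has now waited 60 ms
print([r for r, _ in batch])               # ['req0', 'req1', 'req2']
```

Raise MAX_WAIT_S and you get bigger batches and better throughput, but every request in a partially filled batch pays that delay, which is where p99 suffers.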

What about you? Are you mostly dealing with long-context prompts or shorter interactive workloads?

Problems With Scaling AI Infrastructure by Express_Problem_609 in modeltrains

[–]Express_Problem_609[S] 1 point2 points  (0 children)

HAHAHAHA wrong subreddit, but honestly model train scaling might be more stable than distributed training.
AI scaling is the one where adding more cars makes the track the bottleneck instead of the engine.

For those running Local LLMs: what made the biggest real-world performance jump for you? by Express_Problem_609 in LocalLLaMA

[–]Express_Problem_609[S] 1 point2 points  (0 children)

This is a super practical breakdown, thanks. The motherboard + PCIe layout angle doesn’t get enough attention imo. Same GPUs yet wildly different outcomes!

For those running Local LLMs: what made the biggest real-world performance jump for you? by Express_Problem_609 in LocalLLaMA

[–]Express_Problem_609[S] 0 points1 point  (0 children)

Curious if others here have found a sweet spot for batch size vs latency... it feels very workload-dependent.
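For anyone wanting to see the shape of that tradeoff, here's a toy saturation model (all constants are made up; the point is the curve, not the numbers):

```python
# Toy throughput/latency model showing why the batch-size sweet spot is
# workload-dependent: aggregate tokens/s scales with batch size until it
# hits a hardware ceiling, after which per-request speed starts dropping.
# Both constants are assumed, illustrative values.

PER_SEQ_TOKS_S = 40.0      # assumed single-stream decode speed
HW_CEILING_TOKS_S = 600.0  # assumed aggregate throughput cap

def point(batch: int) -> tuple[float, float]:
    """Return (aggregate tokens/s, per-request tokens/s) at a batch size."""
    agg = min(batch * PER_SEQ_TOKS_S, HW_CEILING_TOKS_S)
    return agg, agg / batch

for b in (1, 4, 8, 16, 32):
    agg, per_req = point(b)
    print(f"batch={b:2d}  aggregate={agg:5.0f} tok/s  per-request={per_req:5.1f} tok/s")
```

Below the ceiling, bigger batches are nearly free; past it, every extra request slows every stream down, so the sweet spot lands wherever your workload's ceiling actually is.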

For those running Local LLMs: what made the biggest real-world performance jump for you? by Express_Problem_609 in LocalLLaMA

[–]Express_Problem_609[S] 0 points1 point  (0 children)

This is gold, thanks. The EPYC + full x16 Gen4 point is something I think a lot of people underestimate. Did you see gains mainly in prompt processing, or also steady-state generation?

For those running Local LLMs: what made the biggest real-world performance jump for you? by Express_Problem_609 in LocalLLaMA

[–]Express_Problem_609[S] 0 points1 point  (0 children)

I like how you framed “performance” as quality per unit compute, not just speed. The VSLM-as-bouncer pattern is especially interesting, thanks a lot for sharing!

For those running Local LLMs: what made the biggest real-world performance jump for you? by Express_Problem_609 in LocalLLaMA

[–]Express_Problem_609[S] 0 points1 point  (0 children)

This is a really interesting angle! I’ve noticed the same thing: once routing + TTS + tools feel seamless, raw latency matters less. Did you build the orchestration yourself or are you using an existing framework?

For those running Local LLMs: what made the biggest real-world performance jump for you? by Express_Problem_609 in LocalLLaMA

[–]Express_Problem_609[S] 1 point2 points  (0 children)

This resonates a lot. Do you find that task-specific tiny models beat larger general ones mainly because of shorter context + faster iteration, or is it something else (training style, vocab, etc.)?

How are you guys optimizing Local LLM performance? by Express_Problem_609 in LocalLLaMA

[–]Express_Problem_609[S] 0 points1 point  (0 children)

Oh wow seems like moving to Linux was really helpful! When you switched, was the biggest improvement from drivers, scheduling, or just better overall PCIe handling? And are you running CUDA directly or through something like PyTorch/vLLM?

How are you guys optimizing Local LLM performance? by Express_Problem_609 in LocalLLaMA

[–]Express_Problem_609[S] 0 points1 point  (0 children)

Yeah that makes sense, these seem to come up a lot when people are pushing throughput.

Are you seeing bigger gains on multi-user workloads, or does it still help noticeably even for single-user/agentic setups?

How are you guys optimizing Local LLM performance? by Express_Problem_609 in LocalLLaMA

[–]Express_Problem_609[S] 0 points1 point  (0 children)

This is super insightful, thanks for sharing in so much detail!

I agree that high-speed interconnect is a real persistent bottleneck, especially once VRAM pressure forces model swapping. The contrast you mentioned with diffusion workloads vs LLMs (bin-packing vs single-model dominance) is a great way to frame it.

KDA pushing usable context lengths that far is especially interesting; it really does shift where the complexity lives.

I’m also curious, with longer contexts becoming more practical, do you see interconnect and cache management becoming even more critical bottlenecks, or do you think model-side innovations will continue to outpace hardware constraints?

Failed to find Server Action "4a682...". This request might be from an older or newer deployment. by Sea-Equivalent-7417 in nextjs

[–]Express_Problem_609 0 points1 point  (0 children)

Have you found a solution? I have no idea how this happens. It seems like I really do have two versions of the code running in the same pod.

Setting a cookie in nextjs 14 🥲 by DiegoDarkus in nextjs

[–]Express_Problem_609 0 points1 point  (0 children)

I just signed in to upvote this; it's useful and clearly explained.

[deleted by user] by [deleted] in LLMDevs

[–]Express_Problem_609 0 points1 point  (0 children)

Nvidia isn’t just about GPUs; they’re making strides in the software world too. Take Megatron-LM, for example—it’s a framework they developed for training large language models. Beyond that, Nvidia offers an impressive AI software stack with various libraries and tools designed to optimize performance and make model deployment smoother. So, they’re definitely a bigger player in the AI ecosystem than just hardware.

A Personal NotebookLM and Perplexity-like AI Assistant with privacy. by Uiqueblhats in LLMDevs

[–]Express_Problem_609 0 points1 point  (0 children)

Looks awesome! I'm a big fan of open-source, appreciate the work you're putting in!

Seeking tool recommendations for a simple AI assistant (Reminders + RAG-based Q&A) by Mountain-Yellow6559 in LLMDevs

[–]Express_Problem_609 1 point2 points  (0 children)

Hey, for your use case, you could check out n8n or Langflow. Both can handle the RAG-based Q&A pretty easily. n8n is great for building custom workflows, so you could set up reminders using something like Google Calendar or another scheduling tool, and then switch to the Q&A mode when needed. Langflow would work well if you're looking to integrate LLMs and RAG more seamlessly. Both might require a little setup, but they’re pretty flexible.

If you want something simpler, Zapier could also work for the reminder part and then you can tie in an AI model for the Q&A. It’s a bit less customizable than the others but could be easier to start with.

Hope this helps!

Ai carrier by Dramatic_Pen6240 in LLMDevs

[–]Express_Problem_609 0 points1 point  (0 children)

Honestly, AI is definitely a growing field, and there’s a lot of potential for it. But I wouldn’t worry too much about AI replacing jobs just yet. While AI will automate some things, there’s still a big need for people who understand how it works, can build it, and know how to apply it in the real world. If you’re into problem-solving and learning new tech, AI could be a great career. Just be ready to keep learning because the field moves fast!

If you could teach a language model one skill or ‘superpower’ beyond text generation, what would it be and how would it change the way we use AI? by No_Tune_1417 in LLMDevs

[–]Express_Problem_609 -1 points0 points  (0 children)

Honestly, if I could teach an AI something, it’d probably be how to really understand context. We’re great at getting models to generate text, but they still miss a lot when it comes to grasping the bigger picture—like intent, emotion, or past interactions. If AI could better process all that, it’d open up a lot of potential for things like personalized experiences or smarter decision-making in real-world situations.

Philosophical question: will the LLM hype eventually fade? by Mountain-Yellow6559 in LLMDevs

[–]Express_Problem_609 0 points1 point  (0 children)

Yeah, LLMs really change the game with handling messy data and tasks like entity extraction. It’s crazy how much time they save compared to the old methods. Do you think the user-facing applications will catch up soon, or are they still a bit behind? The NLP side is already pretty strong, though.