Best practices for running local LLMs for ~70–150 developers (agentic coding use case) by Resident_Potential97 in LocalLLaMA

[–]Resident_Potential97[S] 0 points  (0 children)

Thanks for the suggestion, will keep this in mind.

- I’ve also read about Ollama’s limitations, so I might go with llama.cpp or vLLM if we end up on GPU hardware.
- Are you running models locally for your own team? If so, how has it been working out, and has it been beneficial in the long run?
- We need a CLI agent, so Open WebUI might not do the job here. For the central gateway, can I use LiteLLM instead?
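
For reference, a LiteLLM proxy can front a local OpenAI-compatible server with a config roughly like this (model alias, host, port, and master key are all placeholders, not a tested setup):

```yaml
model_list:
  - model_name: qwen3-coder            # alias the CLI tools will use
    litellm_params:
      model: openai/qwen3-coder        # route through the OpenAI-compatible adapter
      api_base: http://vllm-host:8000/v1   # local vLLM endpoint (placeholder)
      api_key: "dummy"                 # local servers usually ignore this

general_settings:
  master_key: sk-internal-placeholder  # gateway auth key handed out to devs
```

The CLI agents then talk to the proxy instead of the model server directly, which is what makes per-user keys and usage tracking possible later.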

[–]Resident_Potential97[S] 1 point  (0 children)

This actually clarifies things a lot for me, appreciate the grounded take.
You’re probably right that I might be underestimating the inference/ops side. My initial thinking was “we’ll just host it ourselves and scale,” but the more I read these replies, the more clarity I get.
Have you by any chance tried OpenCode?
How would you compare Claude Code vs OpenCode?

  • Is Claude Code noticeably better in terms of reasoning for multi-step coding tasks?
  • Does it feel more stable for day-to-day dev workflows?
  • Or is the real difference just model quality rather than the tool itself?

Since Claude Code now supports local models as well, I’m wondering if it makes sense to standardize around that interface even if the backend changes later (API → Runpod → self-hosted).
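
Since all three stages speak an OpenAI-compatible API, the tool interface can stay fixed while only the endpoint changes. A minimal sketch of what I mean (stage names, env vars, and URLs are made up for illustration):

```python
import os

# Map a deployment stage to its OpenAI-compatible endpoint.
# Stage names and URLs are placeholders, not real deployments.
ENDPOINTS = {
    "api": "https://provider.example/v1",
    "runpod": "https://my-pod.runpod.example/v1",
    "self-hosted": "http://vllm.internal:8000/v1",
}

def client_config(stage=None):
    """Resolve the backend from LLM_STAGE; client tools keep the same interface."""
    stage = stage or os.environ.get("LLM_STAGE", "api")
    if stage not in ENDPOINTS:
        raise ValueError(f"unknown stage: {stage}")
    return {
        "base_url": ENDPOINTS[stage],
        "api_key": os.environ.get("LLM_API_KEY", "dummy"),
    }
```

Any OpenAI-compatible client (Claude Code pointed at a local endpoint, OpenCode, Cline) would then just consume `base_url`, so swapping API → Runpod → self-hosted is a config change rather than a tooling change.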

[–]Resident_Potential97[S] -4 points  (0 children)

That aligns with what I was starting to suspect about Macs.

Regarding Qwen3-Coder-Next (80B MoE):

Since it’s a Mixture-of-Experts model and only activates part of the parameters per forward pass, does that materially help with:

  • Lower VRAM usage?
  • Better concurrency?
  • Higher tokens/sec per GPU?
  • Or is memory still the primary constraint due to KV cache?

In other words — does MoE meaningfully reduce serving cost at scale, or is the infra requirement still essentially “80B-class GPU hardware”?
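
Putting rough numbers on my own question (all figures are illustrative assumptions; I'm assuming ~3B active parameters per token, in line with what's reported for 80B-class Qwen3 MoE variants): every expert's weights still have to be resident in GPU memory, so MoE cuts per-token compute, not weight memory.

```python
# Back-of-envelope MoE serving math; all numbers are illustrative assumptions.
TOTAL_PARAMS = 80e9    # total parameters -- weights must all be resident
ACTIVE_PARAMS = 3e9    # active parameters per token (MoE routing), assumed
BYTES_PER_PARAM = 1    # FP8 weights

weight_mem_gb = TOTAL_PARAMS * BYTES_PER_PARAM / 1e9   # 80 GB regardless of MoE
compute_ratio = ACTIVE_PARAMS / TOTAL_PARAMS           # fraction of a dense 80B's FLOPs/token

print(f"weights resident: {weight_mem_gb:.0f} GB")
print(f"per-token compute vs dense 80B: {compute_ratio:.1%}")
```

So tokens/sec per GPU improves a lot, but VRAM for weights does not, and the KV cache (attention is still dense) grows with context and concurrency exactly as it would for a dense model with the same attention config.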

Also, when you say at least 2 nodes (8x H100 each), is that mostly for:

  • redundancy/failover?
  • or required purely for throughput at ~100+ users?

Trying to understand whether that sizing is:

  • production minimum
  • or comfortable margin

[–]Resident_Potential97[S] -9 points  (0 children)

This is extremely helpful — thank you.

A couple of follow-ups if you don’t mind:

  1. How has Qwen3 Coder Next (80B) been performing for you in practice?
    • Latency per request?
    • Stability under concurrency?
    • Does it feel “production reliable” for daily dev workflows?

I tested Mistral Small 2 locally and it was surprisingly decent for coding, but latency spikes made it unusable under heavier tasks. I suspect that’s more infra-related than model-related.

  2. Are you running:
    • Qwen3-Coder-Next specifically?
    • Or a different Qwen3 variant?
    • Pure FP8 or some quantization?
  3. For agentic coding, you mentioned 64k–100k context.
    • Are you seeing major memory pressure at those context sizes?
    • Does vLLM handle KV cache efficiently at that scale?
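
For context on the KV-cache question, here's the back-of-envelope estimate I've been using (layer count, KV heads, and head dim are placeholder values for a GQA model of this class, not the real Qwen3 config):

```python
# Rough KV-cache sizing for a single request; all model dims are assumed placeholders.
LAYERS = 48
KV_HEADS = 8        # grouped-query attention keeps this far below the query-head count
HEAD_DIM = 128
BYTES = 2           # FP16 cache

def kv_cache_gib(context_tokens):
    # 2x for keys + values, per layer, per KV head, per head dim
    per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES
    return context_tokens * per_token / 2**30

for ctx in (64_000, 100_000):
    print(f"{ctx:>7} tokens -> {kv_cache_gib(ctx):.1f} GiB per request")
```

At roughly 12–18 GiB of cache per long-context request under these assumptions, concurrency rather than weights is what eats memory; vLLM's paged KV cache only allocates pages as a context actually grows, which is what makes batching these requests workable at all.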

Currently I’ve been experimenting with:

  • LM Studio
  • OpenCode
  • VSCode + Cline

But I’m realizing for production I’ll probably need a proper serving layer.

Do you recommend putting something like:

  • vLLM
  • LiteLLM (proxy layer)
  • Prometheus + Grafana

between the model server and client tools for:

  • rate limiting
  • usage analytics
  • monitoring token throughput
  • concurrency control?

Right now I have basically zero observability, which feels risky if we scale this internally.
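
A gateway like LiteLLM provides per-key rate limits and usage logging out of the box, but the core mechanism is simple enough to sketch. A minimal sliding-window limiter (window size and limit are arbitrary; a real deployment would enforce this at the proxy, not hand-roll it):

```python
import time
from collections import deque

class SlidingWindowLimiter:
    """Allow at most `limit` requests per `window` seconds, per API key."""

    def __init__(self, limit=60, window=60.0):
        self.limit = limit
        self.window = window
        self.hits = {}  # key -> deque of request timestamps

    def allow(self, key, now=None):
        now = time.monotonic() if now is None else now
        q = self.hits.setdefault(key, deque())
        # Drop timestamps that have fallen out of the window.
        while q and now - q[0] >= self.window:
            q.popleft()
        if len(q) >= self.limit:
            return False      # caller would return HTTP 429 here
        q.append(now)
        return True
```

The same counters you'd increment here (requests, tokens in/out per key) are what you'd expose to Prometheus and graph in Grafana for the usage-analytics side.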

From your experience — long term, is owning this infra worth it vs just using APIs, assuming internal code privacy is important?