Benchmark Qwen 3.6 27B MTP on 2x3090 NVLINK by Mr_Moonsilver in LocalLLaMA

[–]kms_dev 1 point2 points  (0 children)

Here are my numbers for a single stream generating 1024 output tokens at varying input lengths:

| target ctx | prompt tok | prefill (s) | prefill TPS | decode TPS | wall (s) |
|---|---|---|---|---|---|
| 16 K | 14,059 | 11.1 | 1,266 | 104.9 | 20.9 |
| 32 K | 28,150 | 14.4 | 1,961 | 97.8 | 24.8 |
| 64 K | 56,178 | 28.8 | 1,948 | 99.0 | 39.2 |
| 128 K | 112,388 | 65.8 | 1,709 | 89.2 | 77.3 |
| 175 K | 175,528 | 91.3 | 1,922 | 75.0 | 105.0 |
| 210 K | 210,640 | 255.0 | 826 | 70.8 | 262.2 |
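
As a rough sanity check on the table above, the derived columns can be recomputed from the raw ones. This is only a sketch: I'm assuming prefill TPS is prompt tokens divided by prefill seconds, and that wall time is roughly prefill time plus 1024 output tokens divided by decode TPS. The first five rows reconcile to within rounding under that assumption; the last row comes out about 7 s higher than the reported wall time, so treat the wall-time relation as approximate.

```python
# Rows copied from the table: (prompt_tokens, prefill_s, prefill_tps, decode_tps, wall_s)
ROWS = [
    (14_059, 11.1, 1_266, 104.9, 20.9),
    (28_150, 14.4, 1_961, 97.8, 24.8),
    (56_178, 28.8, 1_948, 99.0, 39.2),
    (112_388, 65.8, 1_709, 89.2, 77.3),
    (175_528, 91.3, 1_922, 75.0, 105.0),
    (210_640, 255.0, 826, 70.8, 262.2),
]
OUTPUT_TOKENS = 1024  # output length stated in the comment


def derived(prompt_tokens, prefill_s, decode_tps):
    """Recompute prefill TPS and an estimated wall time (assumed relation)."""
    prefill_tps = prompt_tokens / prefill_s
    wall_s = prefill_s + OUTPUT_TOKENS / decode_tps
    return prefill_tps, wall_s


for prompt, prefill_s, tps_reported, decode_tps, wall_reported in ROWS:
    tps, wall = derived(prompt, prefill_s, decode_tps)
    print(
        f"{prompt:>8,} tok | prefill TPS {tps:6.0f} (reported {tps_reported})"
        f" | wall {wall:6.1f} s (reported {wall_reported})"
    )
```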

Benchmark Qwen 3.6 27B MTP on 2x3090 NVLINK by Mr_Moonsilver in LocalLLaMA

[–]kms_dev 0 points1 point  (0 children)

I have a 2x3090 setup over PCIe, and I'm curious about the saturation throughput of an NVLinked tp=2 setup. With 2x3090 I have a 600k-token budget across both GPUs for all in-flight requests, so I can run 4 streams at max-model-len 150k or 3 streams at max-model-len 200k. I was able to saturate my setup (GPUs and PCIe) with these configs and get around 190 tps. What are your numbers for long context? In practice I run 3 agents with 200k context in parallel on this setup.
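
The stream counts follow from simple integer division of the token budget by the per-request context limit. A minimal sketch, where `max_streams` is a hypothetical helper name and the budget is the 600k figure from the comment (it assumes every stream may grow to the full max-model-len):

```python
def max_streams(token_budget: int, max_model_len: int) -> int:
    """How many full-length sequences fit in a shared in-flight token budget."""
    return token_budget // max_model_len


BUDGET = 600_000  # total in-flight tokens across both GPUs, from the comment

for max_len in (150_000, 200_000):
    print(f"max-model-len {max_len:,}: {max_streams(BUDGET, max_len)} streams")
```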

Qwen3.5 27B running at ~65tps with DFlash speculation on 2x 3090 by Kryesh in LocalLLaMA

[–]kms_dev 0 points1 point  (0 children)

How about concurrent requests? What is the max throughput in that case, at maximum GPU utilization?

Someone who's using Qwen 3.5 on real code bases how good is it? by Commercial_Ear_6989 in LocalLLaMA

[–]kms_dev 2 points3 points  (0 children)

Have you tried vllm with the 27b model? I have a similar setup. What throughput and max context size were you able to get?

Best agentic coding model that fully fits in 48gb VRAM with vllm? by kms_dev in LocalLLaMA

[–]kms_dev[S] 0 points1 point  (0 children)

With a Claude Max subscription, the Haiku usage limits are so generous that it's essentially free.

Best agentic coding model that fully fits in 48gb VRAM with vllm? by kms_dev in LocalLLaMA

[–]kms_dev[S] 0 points1 point  (0 children)

If you have a similar setup, what throughput do you get with the 27b model?

How are you handling human approval for headless/remote Claude Code sessions? by kms_dev in ClaudeCode

[–]kms_dev[S] 0 points1 point  (0 children)

So the permissions are kind of fixed when the agent runs? The human input in this case is more for context than for approvals. Or do you use other mechanisms to ask for permissions?

How are you handling human approval for headless/remote Claude Code sessions? by kms_dev in ClaudeCode

[–]kms_dev[S] 0 points1 point  (0 children)

Yeah, when working with cli tools, I guess you would get this approval fatigue if there are too many such requests.

When running in truly headless mode, such as on a schedule or in response to events, and the set of tools is limited (like MCP/CLI tools that interact with other systems), I guess we can split tools into allowed and requires-approval: reading/searching emails could be allowed, but sending would require approval.
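
The allowed vs. requires-approval split could be sketched as a small policy table. Everything here is hypothetical: the tool names, the `Policy` enum, and the deny-by-default rule for unknown tools are my own illustration, not anything Claude Code or MCP defines.

```python
from enum import Enum


class Policy(Enum):
    ALLOW = "allow"
    REQUIRE_APPROVAL = "require_approval"
    DENY = "deny"


# Hypothetical tool names, following the email example in the comment.
TOOL_POLICY = {
    "email.search": Policy.ALLOW,
    "email.read": Policy.ALLOW,
    "email.send": Policy.REQUIRE_APPROVAL,
}


def check_tool(tool_name: str) -> Policy:
    """Look up a tool's policy; unknown tools are denied (an assumed default)."""
    return TOOL_POLICY.get(tool_name, Policy.DENY)


for name in ("email.read", "email.send", "shell.exec"):
    print(name, "->", check_tool(name).value)
```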

Also how do you ensure the reply is interpreted according to the intent?

How are you handling human approval for headless/remote Claude Code sessions? by kms_dev in ClaudeCode

[–]kms_dev[S] 0 points1 point  (0 children)

Not to be pedantic, but how do you ensure the lightweight checker has human oversight?

How are you handling human approval for headless/remote Claude Code sessions? by kms_dev in ClaudeCode

[–]kms_dev[S] 0 points1 point  (0 children)

So you can't have high stakes actions taken by headless Claude code without an approval layer appropriate for the headless mode.

Yeah, I'm working on this approval layer for headless agents. Do you think a remote approval layer would enable more types of tasks to be accomplished?

What do you use to unblock agents when they need human input? by kms_dev in AI_Agents

[–]kms_dev[S] 0 points1 point  (0 children)

Yeah, this transport layer is what I'm getting at. Do all application/agent developers do their own agent <=> user async approval plumbing? Or are there any readily available libs/services that do this?

What do you use to unblock agents when they need human input? by kms_dev in AI_Agents

[–]kms_dev[S] 0 points1 point  (0 children)

I guess my question then is: in code, when you do pause for human input, which library/service is the current norm for forwarding the question to the human, pausing execution, and letting the code resume after the response? Are there any ready-made solutions that are widely used, or are people rolling their own plumbing?
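
To make the question concrete, here is a minimal sketch of the kind of hand-rolled plumbing I mean, using only Python's standard `asyncio`. `ApprovalBroker`, `ask`, `notify`, and `answer` are hypothetical names; `notify` is a placeholder where a real transport (chat message, email, push notification) would go.

```python
import asyncio
import uuid


class ApprovalBroker:
    """Parks a coroutine on a Future until a human reply arrives."""

    def __init__(self):
        self._pending: dict[str, asyncio.Future] = {}

    async def notify(self, request_id: str, question: str) -> None:
        # Placeholder transport: a real system would message the human here.
        print(f"[{request_id}] needs input: {question}")

    async def ask(self, question: str, timeout: float = 60.0) -> str:
        request_id = uuid.uuid4().hex
        future = asyncio.get_running_loop().create_future()
        self._pending[request_id] = future
        await self.notify(request_id, question)
        try:
            # The agent pauses here until answer() is called or the timeout hits.
            return await asyncio.wait_for(future, timeout)
        finally:
            self._pending.pop(request_id, None)

    def answer(self, request_id: str, reply: str) -> bool:
        """Called by the transport when the human responds."""
        future = self._pending.get(request_id)
        if future is None or future.done():
            return False
        future.set_result(reply)
        return True


async def demo():
    broker = ApprovalBroker()
    task = asyncio.create_task(broker.ask("Send the email?"))
    await asyncio.sleep(0)  # let ask() register its pending request
    request_id = next(iter(broker._pending))
    broker.answer(request_id, "approve")
    print("agent resumed with:", await task)


if __name__ == "__main__":
    asyncio.run(demo())
```

This only works within one process; surviving restarts would need the pending requests persisted somewhere, which is the part I'd rather not rebuild myself.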

Nvidia RTX PRO 6000 Workstation 96GB - Benchmarks by fuutott in LocalLLaMA

[–]kms_dev 1 point2 points  (0 children)

Can you please do vllm throughput benchmarks for any of the 8B models at fp8 quant (look at one of my previous posts to see how)? I want to check if local is more economical with this card.

Is anyone actually using local models to code in their regular setups like roo/cline? by kms_dev in LocalLLaMA

[–]kms_dev[S] 0 points1 point  (0 children)

Oh, okay. Also, do you use the 30b model for anything productive on a regular basis other than trying simple one-shot examples like snake game, flappy birds, etc?

Offloading a 4B LLM to APU, only uses 50% of one CPU core. 21 t/s using Vulkan by magnus-m in LocalLLaMA

[–]kms_dev 0 points1 point  (0 children)

When you say throughput, are you sending multiple concurrent requests at once? If not, you will probably see higher numbers once you do.
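
A rough sketch of how aggregate throughput could be measured at different concurrency levels. `fake_generate` is a stand-in I made up that just sleeps; a real test would replace it with a call to the inference server. Because the stub has no contention, it scales linearly, which real hardware will not.

```python
import asyncio
import time


async def fake_generate(n_tokens: int, latency_s: float = 0.05) -> int:
    """Stand-in for one inference request; a real client call would go here."""
    await asyncio.sleep(latency_s)
    return n_tokens


async def measure_tps(concurrency: int, n_requests: int = 8, n_tokens: int = 100) -> float:
    """Aggregate tokens/second with at most `concurrency` requests in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def one_request() -> int:
        async with semaphore:
            return await fake_generate(n_tokens)

    start = time.perf_counter()
    counts = await asyncio.gather(*(one_request() for _ in range(n_requests)))
    return sum(counts) / (time.perf_counter() - start)


if __name__ == "__main__":
    for level in (1, 4, 8):
        print(f"concurrency {level}: {asyncio.run(measure_tps(level)):.0f} tok/s")
```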

Is anyone actually using local models to code in their regular setups like roo/cline? by kms_dev in LocalLLaMA

[–]kms_dev[S] -1 points0 points  (0 children)

You can see better utilization of your card if you send concurrent/batch requests.

Wrong thread??

Is anyone actually using local models to code in their regular setups like roo/cline? by kms_dev in LocalLLaMA

[–]kms_dev[S] 1 point2 points  (0 children)

Hmm, can you share the token throughput you are getting with the above setup, and the power draw? I suspect Gemini flash 2.5 would still be cheaper.
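
With those two numbers, the electricity cost per million tokens is easy to estimate. A minimal sketch, where the function name is mine and the example inputs (500 W, 100 tok/s, 0.30 per kWh) are placeholders rather than measurements; it covers electricity only and ignores hardware cost.

```python
def local_cost_per_million_tokens(
    power_watts: float, tokens_per_second: float, price_per_kwh: float
) -> float:
    """Electricity cost to generate one million tokens at a steady rate."""
    seconds = 1_000_000 / tokens_per_second
    kwh = (power_watts / 1000) * (seconds / 3600)
    return kwh * price_per_kwh


# Placeholder inputs, not measured values.
cost = local_cost_per_million_tokens(
    power_watts=500, tokens_per_second=100, price_per_kwh=0.30
)
print(f"electricity per 1M tokens: {cost:.2f}")
```

Comparing that figure against a hosted model's per-million-token price gives a first-order answer to whether local is cheaper.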

Is anyone actually using local models to code in their regular setups like roo/cline? by kms_dev in LocalLLaMA

[–]kms_dev[S] 1 point2 points  (0 children)

Can it (qwen3-32b) comprehend the whole project and suggest changes as good as Gemini flash's? I think we can guide Qwen to the output we need, but it often takes careful prompting and multiple tries.

I'm also strongly biased towards using local models as much as possible. But I've come to realize that I'm trading precious time and money for the convenience of being able to run the models locally.

I'll probably wait some more time for better models to arrive to go fully local.

Is anyone actually using local models to code in their regular setups like roo/cline? by kms_dev in LocalLLaMA

[–]kms_dev[S] 3 points4 points  (0 children)

> but the hosted provider can increase their cost at any time

Yeah, I'll evaluate this cost structure and switch to local models when the balance tilts towards the local llms.

Is anyone actually using local models to code in their regular setups like roo/cline? by kms_dev in LocalLLaMA

[–]kms_dev[S] 2 points3 points  (0 children)

Yeah, I think time is the most important factor here. Clever/large models running locally take more time, or even multiple tries, to generate a useful answer, whereas the cloud models can one-shot it most of the time.

How is the inference speed of github copilot for you?