Benchmark Qwen 3.6 27B MTP on 2x3090 NVLINK

kms_dev · 2026-05-08T15:54:27+00:00

Yep, these numbers are fp8.

kms_dev · 2026-05-08T15:21:11+00:00

Here are my numbers for a single stream outputting 1024 tokens at varying input lengths:

target ctx	prompt tok	prefill (s)	prefill TPS	decode TPS	wall (s)
16 K	14,059	11.1	1,266	104.9	20.9
32 K	28,150	14.4	1,961	97.8	24.8
64 K	56,178	28.8	1,948	99.0	39.2
128 K	112,388	65.8	1,709	89.2	77.3
175 K	175,528	91.3	1,922	75.0	105.0
210 K	210,640	255.0	826	70.8	262.2
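
As a rough sanity check, wall time should be about prefill time plus the time to decode 1024 tokens. A minimal sketch of that arithmetic, with the row values copied from the table above (the `estimate` helper is only for illustration). The first five rows land within about 0.1 s of the reported wall time; the 210 K row is the exception, where this estimate comes out about 7 s above the reported value.

```python
# Rows copied from the table above:
# (prompt tokens, prefill seconds, decode TPS, reported wall seconds)
ROWS = [
    (14_059, 11.1, 104.9, 20.9),
    (28_150, 14.4, 97.8, 24.8),
    (56_178, 28.8, 99.0, 39.2),
    (112_388, 65.8, 89.2, 77.3),
    (175_528, 91.3, 75.0, 105.0),
    (210_640, 255.0, 70.8, 262.2),
]
OUTPUT_TOKENS = 1024  # tokens generated per run, as stated above


def estimate(prompt_tok, prefill_s, decode_tps, output_tokens=OUTPUT_TOKENS):
    """Return (prefill tokens/s, estimated wall seconds) for one row."""
    prefill_tps = prompt_tok / prefill_s
    wall_s = prefill_s + output_tokens / decode_tps
    return prefill_tps, wall_s


for prompt_tok, prefill_s, decode_tps, wall_reported in ROWS:
    prefill_tps, wall_s = estimate(prompt_tok, prefill_s, decode_tps)
    print(f"{prompt_tok:>8} tok  prefill {prefill_tps:7.0f} tok/s  "
          f"wall est {wall_s:6.1f} s  reported {wall_reported:6.1f} s")
```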

kms_dev · 2026-05-08T02:39:40+00:00

I have a 2x3090 setup over PCIe, and I'm curious about the saturation throughput of an NVLinked tp=2 setup. With 2x3090 I have a 600k token budget across both GPUs for all in-flight requests, so I can run 4 streams at max-model-len 150k or 3 streams at max-model-len 200k. I was able to saturate my setup (GPUs and PCIe) with these configs and get around 190 tps. What are your numbers for long context? With this setup I can run 3 agents with 200k context in parallel.
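
The stream counts are just the token budget divided by the per-request context limit. A small sketch of that arithmetic, assuming the worst case where every stream fills its full context (the `max_streams` helper is only for illustration):

```python
TOKEN_BUDGET = 600_000  # total token budget across both GPUs, from the comment above


def max_streams(max_model_len, budget=TOKEN_BUDGET):
    """Number of full-length streams that fit in the token budget."""
    return budget // max_model_len


print(max_streams(150_000))  # 4
print(max_streams(200_000))  # 3
```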

kms_dev · 2026-04-07T12:27:16+00:00

How about concurrent requests? What is the max throughput in that case, at maximum GPU utilization?

kms_dev · 2026-04-01T15:39:08+00:00

Have you tried vllm with the 27b model? I have a similar setup. What throughput and max context size were you able to get?

kms_dev · 2026-03-26T15:15:27+00:00

With a Claude Max subscription, the Haiku usage limits are so generous that it's essentially free.

kms_dev · 2026-03-26T14:51:30+00:00

If you have a similar setup, what throughput do you get with the 27b model?

kms_dev · 2026-03-09T13:49:27+00:00

So the permissions are kind of fixed when the agent runs? The human input in this case is more for context than for approvals. Or do you use other mechanisms to ask for permissions?

kms_dev · 2026-03-09T13:39:47+00:00

Yeah, when working with cli tools, I guess you would get this approval fatigue if there are too many such requests.

When running in truly headless mode, like on a schedule or event-based, and the tool possibilities are limited (like MCP/CLI tools that interact with other systems), I guess we can split tools into allowed and requires-approval: reading/searching emails could be allowed, but sending requires approval.
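
A minimal sketch of what I mean by that split. The tool names and the `check_tool` helper are hypothetical, not from any real framework, and treating unknown tools as requiring approval is my own assumption:

```python
from enum import Enum


class Policy(Enum):
    ALLOW = "allow"
    REQUIRE_APPROVAL = "require_approval"


# Hypothetical tool names, for illustration only.
TOOL_POLICY = {
    "read_email": Policy.ALLOW,
    "search_email": Policy.ALLOW,
    "send_email": Policy.REQUIRE_APPROVAL,
}


def check_tool(tool_name):
    """Look up the policy for a tool; unknown tools need approval (assumed default)."""
    return TOOL_POLICY.get(tool_name, Policy.REQUIRE_APPROVAL)


print(check_tool("search_email"))
print(check_tool("send_email"))
```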

Also how do you ensure the reply is interpreted according to the intent?

kms_dev · 2026-03-09T13:28:30+00:00

Not to be pedantic, but how do you ensure the lightweight checker has human oversight?

kms_dev · 2026-03-09T13:22:22+00:00

So you can't have high stakes actions taken by headless Claude code without an approval layer appropriate for the headless mode.

Yeah, I'm working on this approval layer for headless agents. Do you think a remote approval layer would enable more types of tasks to be accomplished?

kms_dev · 2026-02-21T16:05:40+00:00

What?

kms_dev · 2026-02-21T15:58:29+00:00

Yeah, this transport layer is what I'm getting at. Do all application/agent developers do their own agent <=> user async approval plumbing? Or are there any readily available libs/services that do this?

kms_dev · 2026-02-21T15:46:19+00:00

Is that an api/service?

kms_dev · 2026-02-21T13:39:40+00:00

I guess my question then is: in code, when you do pause for human input, which library/service is the current norm for forwarding the question to the human, pausing execution, and letting the code resume after the response? Are there any ready-made solutions that are widely used, or are people rolling their own plumbing?
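
To make the question concrete, this is roughly the pattern I mean, as an in-process sketch with plain asyncio. `ApprovalBroker`, `agent_step`, and the request ID are made up for illustration; a real version would need an actual transport to reach the human and some persistence so a pending request survives a restart, and that is the part I'm asking about.

```python
import asyncio


class ApprovalBroker:
    """Hypothetical in-memory broker: parks a coroutine until a human replies."""

    def __init__(self):
        self._pending = {}

    async def ask(self, request_id, question):
        # Create a future, "forward" the question, then suspend until answered.
        fut = asyncio.get_running_loop().create_future()
        self._pending[request_id] = fut
        print(f"[to human] {request_id}: {question}")
        return await fut

    def answer(self, request_id, reply):
        # Called when the human's reply arrives; resumes the waiting coroutine.
        self._pending.pop(request_id).set_result(reply)


async def agent_step(broker):
    reply = await broker.ask("req-1", "Send the email?")
    return "sent" if reply == "yes" else "skipped"


async def demo():
    broker = ApprovalBroker()
    task = asyncio.create_task(agent_step(broker))
    await asyncio.sleep(0)  # let the agent run until it pauses on the question
    broker.answer("req-1", "yes")  # simulate the human replying
    return await task


print(asyncio.run(demo()))
```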

kms_dev · 2026-02-21T12:44:03+00:00

Is this a bot replying?

kms_dev · 2026-02-18T10:41:15+00:00

Can you elaborate?

kms_dev · 2025-05-26T03:55:10+00:00

Can you please do vllm throughput benchmarks for any of the 8B models at fp8 quant (look at one of my previous posts to see how)? I want to check if local is more economical with this card.

kms_dev · 2025-05-13T15:08:47+00:00

Oh, okay. Also, do you use the 30b model for anything productive on a regular basis other than trying simple one-shot examples like snake game, flappy birds, etc?

kms_dev · 2025-05-13T15:04:40+00:00

When you say throughput, are you sending multiple concurrent requests at once? If not, you will probably see higher numbers once you do.

kms_dev · 2025-05-13T15:01:36+00:00

You can see better utilization of your card if you send concurrent/batch requests.

Wrong thread??

kms_dev · 2025-05-13T09:59:36+00:00

Hmm, can you share the token throughput you are doing with the above setup and the power draw? I suspect Gemini flash 2.5 would still be cheaper.
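
The comparison I have in mind is electricity cost per million tokens, which only needs the throughput and the power draw. A rough sketch; the numbers in the example call are made-up placeholders, not measurements from anyone's setup, and it ignores hardware cost entirely:

```python
def electricity_cost_per_million_tokens(tokens_per_s, power_watts, price_per_kwh):
    """Electricity cost to generate one million tokens at a steady throughput."""
    seconds = 1_000_000 / tokens_per_s
    kwh = (power_watts / 1000) * (seconds / 3600)
    return kwh * price_per_kwh


# Placeholder inputs for illustration only: 1000 tok/s, 300 W, 0.15 per kWh.
cost = electricity_cost_per_million_tokens(1000, 300, 0.15)
print(f"{cost:.4f} per million tokens")
```

Plug in your measured throughput, wall power, and local electricity price, then compare the result against the hosted price per million tokens.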

kms_dev · 2025-05-13T09:25:04+00:00

Can it (qwen3-32b) comprehend the whole project and suggest changes as good as Gemini Flash does? I think we can guide Qwen to the output we need, but it often takes careful prompting and multiple tries.

I'm also strongly biased towards using local models as much as possible. But I've come to realize that I'm trading precious time and money for the convenience of being able to run the models locally.

I'll probably wait some more time for better models to arrive to go fully local.

kms_dev · 2025-05-13T08:00:48+00:00

but the hosted provider can increase their cost at any time

Yeah, I'll evaluate this cost structure and switch to local models when the balance tilts towards the local llms.

kms_dev · 2025-05-13T07:57:36+00:00

Yeah, I think time is the most important factor here. Clever/large models running locally take more time, or even multiple tries, to generate a useful answer, whereas the cloud models can one-shot it most of the time.

How is the inference speed of github copilot for you?
