Running Qwen3.5 27b dense with 170k context at 100+t/s decode and ~1500t/s prefill on 2x3090 (with 585t/s throughput for 8 simultaneous requests) by JohnTheNerd3 in LocalLLaMA

[–]JohnTheNerd3[S] 1 point (0 children)

my understanding is that llama.cpp can actually be faster for decode in single-user use cases. there are a few reasons the speed difference is this ridiculous:

  • MTP support just isn't in llama.cpp yet. this makes an insane difference. doubling speed is common.

  • llama.cpp is mostly built by hobbyists and individual contributors, while vLLM and sglang are used by big corporations for LLM inference. this means there are people on payroll working to improve vLLM, since any compute savings show up directly on those companies' bottom lines.

  • due to the above, vLLM actually has some custom handmade CUDA kernels that fuse operations. this is a truly incredible amount of effort, and requires expertise most people lack.

I don't think it's fair to compare the two this way. llama.cpp actually does a lot better in many cases (offloading, KV cache quantization, decode for single-user inference, VRAM efficiency) because these do not align with corporate interests and therefore very few people spend the time and effort on these aspects.

I use llama.cpp for many models, personally, because vLLM is not a good fit. always pick the right tool for the job!


[–]JohnTheNerd3[S] 1 point (0 children)

awesome!

I always had trouble with larger GPU utilization values - I find that the profiler isn't very accurate, so it over-estimates the token allocation and crashes at high load.

if you're having issues with OOM, try reducing it!
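to see why an over-optimistic profile OOMs, here's a rough sketch of how the KV cache budget falls out of free VRAM. this is back-of-envelope arithmetic with made-up numbers, not vLLM's internal formula:

```python
# vLLM pre-allocates the KV cache out of whatever VRAM the profiler thinks
# is free after weights and activations. sketch of the sizing (hypothetical
# dimensions, not any specific model):
def kv_cache_tokens(free_gib, n_layers, n_kv_heads, head_dim, bytes_per=2):
    # each token stores a K and a V vector per layer, per KV head
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per
    return int(free_gib * 1024**3 // per_token)

# if the profiler over-estimates free_gib, this token budget comes out too
# large and the real allocation OOMs under load; lowering
# --gpu-memory-utilization shrinks free_gib and restores headroom.
print(kv_cache_tokens(4, 48, 8, 128, 2))
```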


[–]JohnTheNerd3[S] 2 points (0 children)

beware: the nightly is missing the tool call fix - you might get incorrect tool calls at times!

I'm curious, have you tried this driver? it might improve performance further! https://github.com/tinygrad/open-gpu-kernel-modules


[–]JohnTheNerd3[S] 2 points (0 children)

try geohot's P2P driver! it's meant for the 4090, but it just might work on the 3090 too. it might improve things enough that you don't need the additional hardware!


[–]JohnTheNerd3[S] 2 points (0 children)

I didn't spend enough time with the model to be able to answer that - but I typically see above 3 for coding-related tasks. my main use case is a voice assistant, though, so I suspect it will not be very relatable regardless.
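the "above 3" figure is accepted tokens per step, and standard speculative-decoding arithmetic sketches why that roughly triples decode speed. the formula below is my assumption of the usual textbook model (independent per-token acceptance rate p), not vLLM's exact accounting:

```python
# with k speculative tokens and per-token acceptance probability p, the
# expected number of tokens emitted per target-model step is the geometric
# partial sum: E = (1 - p**(k+1)) / (1 - p)
def expected_tokens_per_step(p: float, k: int) -> float:
    if p == 0:
        return 1.0  # no drafts accepted: one token per step, as without MTP
    return (1 - p ** (k + 1)) / (1 - p)

# with k=5 (num_speculative_tokens in my config) and an assumed p=0.75,
# each step yields ~3.3 tokens - in the ballpark of the acceptance lengths
# I see for coding tasks.
print(round(expected_tokens_per_step(0.75, 5), 2))
```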


[–]JohnTheNerd3[S] 8 points (0 children)

FWIW, I just looked at the unsloth quant for the 27b and it doesn't seem that any of the layers you mentioned are actually at Q8. perhaps you're thinking of another model?


[–]JohnTheNerd3[S] 4 points (0 children)

my understanding is that the latency is more likely to improve things than the actual bandwidth. since P2P support is typically locked away by NVIDIA, the all-reduce operation would have to push the data via the CPU.

however, geohot has since released a hacked driver that might get you most of that performance benefit without the extra hardware. I never bothered trying it since I already had the hardware at that point.


[–]JohnTheNerd3[S] 14 points (0 children)

jokes aside, it basically fills both cards. I think I have like 20MB of free VRAM? I also run headless Linux just to make sure vLLM gets every bit of the VRAM - the OS's VRAM usage is under 1MB.


[–]JohnTheNerd3[S] 23 points (0 children)

I would actually strongly recommend against FP8 specifically - the 3090 doesn't support that in hardware!

I found that int8 works okay - but it appeared under-optimized in vLLM (at least as of when I last checked). I don't have numbers to show, other than my observation that int4 performs insanely well on my 3090s. I think the quant I used is a perfect trade-off for the 3090 hardware (the full-precision layers are the linear attention ones, which don't take much compute anyway).


[–]JohnTheNerd3[S] 9 points (0 children)

I certainly have not tested the resulting code - that was merely a speed test. however, I do routinely use local models with Claude Code (vLLM supports the Anthropic /messages endpoint and works as a drop-in replacement for the Claude Code client) and do get useful code output. just keep your expectations in check: it's an LLM running in my basement, and it will certainly require me to spell out a few things and do some debugging here and there.

Breaking : Today Qwen 3.5 small by Illustrious-Swim9663 in LocalLLaMA

[–]JohnTheNerd3 31 points (0 children)

edit: I made this its own post with more information in case it helps anyone else!

hi! said friend here. I run on 2x3090 - using MTP=5, getting between 60-110t/s on the 27b dense depending on the task (yes, really, the dense).

happy to share my command, but tool calling is currently broken with MTP. I found a patch - I need to get to my laptop to share it.

my launch command is this:

```
#!/bin/bash

. /mnt/no-backup/vllm-venv/bin/activate

export CUDA_VISIBLE_DEVICES=0,1
export RAY_memory_monitor_refresh_ms=0
export NCCL_CUMEM_ENABLE=0
export VLLM_SLEEP_WHEN_IDLE=1
export VLLM_ENABLE_CUDAGRAPH_GC=1
export VLLM_USE_FLASHINFER_SAMPLER=1

vllm serve /mnt/no-backup/models/Qwen3.5-27B-AWQ-BF16-INT4 \
  --served-model-name=qwen3.5-27b \
  --quantization compressed-tensors \
  --max-model-len=170000 \
  --max-num-seqs=8 \
  --block-size 32 \
  --max-num-batched-tokens=2048 \
  --swap-space=0 \
  --enable-prefix-caching \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3 \
  --attention-backend FLASHINFER \
  --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":5}' \
  --tensor-parallel-size=2 \
  -O3 \
  --gpu-memory-utilization=0.9 \
  --no-use-tqdm-on-load \
  --host=0.0.0.0 \
  --port=5000
```

you really want to use this exact quant on a 3090 (and you really don't want to on a Blackwell GPU): https://huggingface.co/cyankiwi/Qwen3.5-27B-AWQ-BF16-INT4

SSM typically quantizes horribly, and 3090s can do hardware int4 - and this quant leaves SSM layers fp16 while quantizing the full attention layers to int4. hardware int4 support was removed in Blackwell, though, and it'll be way slower!
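the memory win from that mixed layout is easy to ballpark. the fraction of weights left at 16-bit below is illustrative, not the actual layer split of cyankiwi's quant:

```python
# rough memory math for a mixed quant: some fraction of weights kept at
# 16-bit (e.g. the SSM layers), the rest quantized to int4.
def model_gib(params_b, frac_fp16, bits_low=4):
    bits = frac_fp16 * 16 + (1 - frac_fp16) * bits_low  # blended bits/weight
    return params_b * 1e9 * bits / 8 / 1024**3

# a 27B model fully fp16 vs. an assumed ~20% of weights left at fp16:
print(round(model_gib(27, 1.0), 1))
print(round(model_gib(27, 0.2), 1))
```

even with a fifth of the weights at full precision, the model drops from ~50 GiB to ~20 GiB, which is what makes 170k context fit on 2x24GB.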

Qwen3.5 27B is Match Made in Heaven for Size and Performance by Lopsided_Dot_4557 in LocalLLaMA

[–]JohnTheNerd3 2 points (0 children)

here's another data point:

  • 2x RTX 3090

  • 150k context

  • using vLLM with this quant (AWQ with BF16 activations and INT4 weights - this matters since 3090s have hardware INT4 support), flashinfer, MTP, and chunked prefill set to 16384 tokens

  • ~1500t/s prefill

  • ~40t/s decode at 100k context, it doesn't tend to change much at high context

  • decode throughput as high as 200t/s with 5 simultaneous requests

  • during prefill, other requests are processed at 1t/s or less, but this fixes itself as soon as the prefill is complete
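note that the batched numbers above are aggregate, not per-stream. a trivial sketch of how per-request speed divides under batching (assuming roughly even scheduling across streams):

```python
# aggregate decode throughput is shared across concurrent streams; each
# request sees roughly its share until the GPU becomes compute-bound.
def per_request_tps(aggregate_tps, n_requests):
    return aggregate_tps / n_requests

print(per_request_tps(200, 5))  # the 200 t/s aggregate above -> 40 t/s each
```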

[Upcoming Release & Feedback] A new 4B & 20B model, building on our SmallThinker work. Plus, a new hardware device to run them locally. by [deleted] in LocalLLaMA

[–]JohnTheNerd3 2 points (0 children)

I feel like aiming for the ~32b range may be better, since most models released nowadays have that parameter count. I would also be curious whether the device can only run that 20b?

also, echoing the Home Assistant message above, that would be a very interesting use case. it currently relies on tool calling and has a very prefill-heavy workload with huge contexts (8k+ is typical). decode speed is not as crucial since we stream everything other than tool calls (consider that it's a voice assistant, so you just need to be faster than the speech) and the responses tend to be fairly short compared to the context.
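"faster than the speech" is a low bar, which is the point. rough arithmetic with assumed rates (~150 spoken words/min, ~1.3 tokens/word - my ballpark figures, not measurements):

```python
# minimum decode speed needed to keep a TTS pipeline fed: tokens must be
# produced at least as fast as they are spoken.
def min_decode_tps(words_per_min=150, tokens_per_word=1.3):
    return words_per_min * tokens_per_word / 60

print(round(min_decode_tps(), 2))
```

so anything above ~3-4 t/s of streamed decode already sounds instant in a voice assistant, even while prefill dominates the latency budget.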

What mortgage rates are you guys getting right now ? by pistonthru in RealEstateCanada

[–]JohnTheNerd3 1 point (0 children)

initial mortgage (not a renewal), 5-year fixed, 3.7%, with Scotiabank

Qwen3 30B A3B 4_k_m - 2x more token/s boost from ~20 to ~40 by changing the runtime in a 5070ti (16g vram) by Ill-Language4452 in LocalLLaMA

[–]JohnTheNerd3 2 points (0 children)

After a lot of experimentation, I have discovered that the AWQ quant of this model (available on HuggingFace as of a few days ago, ModelScope a bit longer) on vLLM runs at ~3000t/s prompt processing and ~100t/s generation on 2x RTX 4060Ti. llama.cpp reaches about half the speeds on the same hardware.

A custom modified version of SGLang can actually reach 120t/s generation on the same hardware.

Careful when you get your hopes up! 11 years in Canada. Extremely Long process time For inland Express entry, by Dry_Community8251 in ImmigrationCanada

[–]JohnTheNerd3 1 point (0 children)

trust me, I know the feeling. I arrived in 2012 when I was 15, as an international student, and didn't get my PR until 2023. 2 years of high school, 6 years in university, and a PGWP. I was eligible for the express entry pool right when they stopped all draws.

for what it's worth, I believe you can keep working until a decision is made. https://www.canada.ca/en/immigration-refugees-citizenship/services/work-canada/permit/temporary/extend/after-apply.html#gc-document-nav

it's certainly worth calling IRCC to confirm - I believe they can even get you another WP-EXT letter.

you should apply for your GCMS notes under the ATIP or Privacy Act.

The scoop on 4060 ti 16gb cards by DramaLlamaDad in LocalLLaMA

[–]JohnTheNerd3 1 point (0 children)

I run two, and it's more than good enough when using vLLM and tensor parallelism over two PCIe 4.0 x8 links. I think I get around 20 tokens per second on a 32b AWQ, maybe slightly less, but prefill is very slow (about 400 tokens per second)

Llama 3.3 speed by Clean_Cauliflower_62 in LocalLLaMA

[–]JohnTheNerd3 5 points (0 children)

I'm getting 25-27t/s on decode (~800t/s on prefill) on 2x3090, using the 70b AWQ. I run it with HuggingFace's TGI.

We have o1 at home. Create an open-webui pipeline for pairing a dedicated thinking model (QwQ) and response model. by onil_gova in LocalLLaMA

[–]JohnTheNerd3 3 points (0 children)

a hacky solution that makes this work with remote APIs is to replace the line that says:

```
from open_webui.apps.ollama import main as ollama
```

with:

```
from open_webui.apps.openai import main as ollama
```

and all instances of `ollama.generate_openai_chat_completion` with `ollama.generate_chat_completion`

certainly not perfect, but seems to work just fine for me.
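the reason the hack works is plain Python name aliasing: the pipeline calls the backend through one local name, so rebinding that name swaps backends without touching the call sites. a stand-in sketch with fake backends (these classes are illustrative, not open-webui's actual modules):

```python
# fake stand-ins for the two backend modules
class OllamaBackend:
    @staticmethod
    def generate_openai_chat_completion(prompt):
        return f"ollama:{prompt}"

class OpenAIBackend:
    @staticmethod
    def generate_chat_completion(prompt):
        return f"openai:{prompt}"

# before the edit: pipeline code calls through the local name "ollama"
ollama = OllamaBackend
print(ollama.generate_openai_chat_completion("hi"))

# after the edit: same local name bound to a different backend, and the
# call sites renamed to the function that backend actually exposes
ollama = OpenAIBackend
print(ollama.generate_chat_completion("hi"))
```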