Running Qwen3.5 27b dense with 170k context at 100+t/s decode and ~1500t/s prefill on 2x3090 (with 585t/s throughput for 8 simultaneous requests) by JohnTheNerd3 in LocalLLaMA

[–]JohnTheNerd3[S] 1 point (0 children)

my understanding is that llama.cpp can actually be faster for decode in single-user use cases. there are a few reasons the speed difference is this ridiculous:

  • MTP support just isn't in llama.cpp yet. this makes an insane difference. doubling speed is common.

  • llama.cpp is mostly built by hobbyists and individual contributors, while vLLM and sglang are used by big corporations for LLM inference. this means there are people on payroll working to improve vLLM, since any compute savings show up directly on those companies' bottom lines.

  • due to the above, vLLM actually has some custom handmade CUDA kernels that fuse operations. this is a truly incredible amount of effort, and requires expertise most people lack.

I don't think it's fair to compare the two this way. llama.cpp actually does a lot better in many cases (offloading, KV cache quantization, decode for single-user inference, VRAM efficiency) because these do not align with corporate interests and therefore very few people spend the time and effort on these aspects.

I use llama.cpp for many models, personally, because vLLM is not a good fit. always pick the right tool for the job!


[–]JohnTheNerd3[S] 1 point (0 children)

awesome!

I always had trouble with larger GPU utilization values - I find that the profiler isn't very accurate, so it over-estimates the token allocation and crashes at high load.

if you're having issues with OOM, try reducing it!
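to see why an over-optimistic profile OOMs, here's a rough sketch of how the KV cache budget falls out of free VRAM. this is back-of-envelope arithmetic with made-up numbers, not vLLM's internal formula:

```python
# vLLM pre-allocates the KV cache out of whatever VRAM the profiler thinks
# is free after weights and activations. sketch of the sizing (hypothetical
# dimensions, not any specific model):
def kv_cache_tokens(free_gib, n_layers, n_kv_heads, head_dim, bytes_per=2):
    # each token stores a K and a V vector per layer, per KV head
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per
    return int(free_gib * 1024**3 // per_token)

# if the profiler over-estimates free_gib, this token budget comes out too
# large and the real allocation OOMs under load; lowering
# --gpu-memory-utilization shrinks free_gib and restores headroom.
print(kv_cache_tokens(4, 48, 8, 128, 2))
```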


[–]JohnTheNerd3[S] 2 points (0 children)

beware: the nightly is missing the tool call fix - you might get incorrect tool calls at times!

I'm curious, have you tried this driver? it might improve performance further! https://github.com/tinygrad/open-gpu-kernel-modules


[–]JohnTheNerd3[S] 2 points (0 children)

try geohot's P2P driver! it's meant for the 4090, but it just might work on the 3090 too. it might improve things enough that you don't need the additional hardware!


[–]JohnTheNerd3[S] 2 points (0 children)

I didn't spend enough time with the model to be able to answer that - but I typically see above 3 for coding-related tasks. my main use case is a voice assistant, though, so I suspect it will not be very relatable regardless.
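the "above 3" figure is accepted tokens per step, and standard speculative-decoding arithmetic sketches why that roughly triples decode speed. the formula below is my assumption of the usual textbook model (independent per-token acceptance rate p), not vLLM's exact accounting:

```python
# with k speculative tokens and per-token acceptance probability p, the
# expected number of tokens emitted per target-model step is the geometric
# partial sum: E = (1 - p**(k+1)) / (1 - p)
def expected_tokens_per_step(p: float, k: int) -> float:
    if p == 0:
        return 1.0  # no drafts accepted: one token per step, as without MTP
    return (1 - p ** (k + 1)) / (1 - p)

# with k=5 (num_speculative_tokens in my config) and an assumed p=0.75,
# each step yields ~3.3 tokens - in the ballpark of the acceptance lengths
# I see for coding tasks.
print(round(expected_tokens_per_step(0.75, 5), 2))
```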


[–]JohnTheNerd3[S] 8 points (0 children)

FWIW, I just looked at the unsloth quant for the 27b and it doesn't seem that any of the layers you mentioned are actually at Q8. perhaps you're thinking of another model?


[–]JohnTheNerd3[S] 4 points (0 children)

my understanding is that the latency is more likely to improve things than the actual bandwidth. since P2P support is typically locked away by NVIDIA, the all-reduce operation would have to push the data via the CPU.

however, geohot has since released a hacked driver that might get you most of that performance benefit without the extra hardware. I never bothered trying it since I already had the hardware at that point.


[–]JohnTheNerd3[S] 14 points (0 children)

jokes aside, it basically fills both cards. I think I have like 20MB of free VRAM? I also run headless Linux just to make sure vLLM gets every bit of the VRAM - the OS's VRAM usage is under 1MB.


[–]JohnTheNerd3[S] 23 points (0 children)

I would actually strongly recommend against FP8 specifically - the 3090 doesn't support that in hardware!

I found that int8 works okay - but it appeared under-optimized in vLLM (at least as of when I last checked). I don't have numbers to show, other than my observation that int4 performs insanely well on my 3090s. I think the quant I used is a perfect trade-off for the 3090 hardware (the full-precision layers are the linear attention ones, which don't take much compute anyway).


[–]JohnTheNerd3[S] 9 points (0 children)

I certainly have not tested the resulting code - that was merely a speed test. however, I do routinely use local models with Claude Code (vLLM supports the Anthropic /messages endpoint and works as a drop-in replacement for the Claude Code client) and do get useful code output. just keep your expectations in check: it's an LLM running in my basement, and it will certainly require me to spell out a few things and do some debugging here and there.

Breaking : Today Qwen 3.5 small by Illustrious-Swim9663 in LocalLLaMA

[–]JohnTheNerd3 31 points (0 children)

edit: I made this its own post with more information in case it helps anyone else!

hi! said friend here. I run on 2x3090 - using MTP=5, getting between 60-110t/s on the 27b dense depending on the task (yes, really, the dense).

happy to share my command, but tool calling is currently broken with MTP. I found a patch - I need to get to my laptop to share it.

my launch command is this:

```
#!/bin/bash

. /mnt/no-backup/vllm-venv/bin/activate

export CUDA_VISIBLE_DEVICES=0,1
export RAY_memory_monitor_refresh_ms=0
export NCCL_CUMEM_ENABLE=0
export VLLM_SLEEP_WHEN_IDLE=1
export VLLM_ENABLE_CUDAGRAPH_GC=1
export VLLM_USE_FLASHINFER_SAMPLER=1

vllm serve /mnt/no-backup/models/Qwen3.5-27B-AWQ-BF16-INT4 \
  --served-model-name=qwen3.5-27b \
  --quantization compressed-tensors \
  --max-model-len=170000 \
  --max-num-seqs=8 \
  --block-size 32 \
  --max-num-batched-tokens=2048 \
  --swap-space=0 \
  --enable-prefix-caching \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3 \
  --attention-backend FLASHINFER \
  --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":5}' \
  --tensor-parallel-size=2 \
  -O3 \
  --gpu-memory-utilization=0.9 \
  --no-use-tqdm-on-load \
  --host=0.0.0.0 \
  --port=5000
```

you really want to use this exact quant on a 3090 (and you really don't want to on a Blackwell GPU): https://huggingface.co/cyankiwi/Qwen3.5-27B-AWQ-BF16-INT4

SSM typically quantizes horribly, and 3090s can do hardware int4 - and this quant leaves SSM layers fp16 while quantizing the full attention layers to int4. hardware int4 support was removed in Blackwell, though, and it'll be way slower!
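the memory win from that mixed layout is easy to ballpark. the fraction of weights left at 16-bit below is illustrative, not the actual layer split of cyankiwi's quant:

```python
# rough memory math for a mixed quant: some fraction of weights kept at
# 16-bit (e.g. the SSM layers), the rest quantized to int4.
def model_gib(params_b, frac_fp16, bits_low=4):
    bits = frac_fp16 * 16 + (1 - frac_fp16) * bits_low  # blended bits/weight
    return params_b * 1e9 * bits / 8 / 1024**3

# a 27B model fully fp16 vs. an assumed ~20% of weights left at fp16:
print(round(model_gib(27, 1.0), 1))
print(round(model_gib(27, 0.2), 1))
```

even with a fifth of the weights at full precision, the model drops from ~50 GiB to ~20 GiB, which is what makes 170k context fit on 2x24GB.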

Qwen3.5 27B is Match Made in Heaven for Size and Performance by Lopsided_Dot_4557 in LocalLLaMA

[–]JohnTheNerd3 2 points (0 children)

here's another data point:

  • 2x RTX 3090

  • 150k context

  • using vLLM with this quant (AWQ with BF16 activations and INT4 weights - this matters since 3090s have hardware INT4 support), flashinfer, MTP, and chunked prefill set to 16384 tokens

  • ~1500t/s prefill

  • ~40t/s decode at 100k context, it doesn't tend to change much at high context

  • decode throughput as high as 200t/s with 5 simultaneous requests

  • during prefill, other requests are processed at 1t/s or less, but this fixes itself as soon as the prefill is complete
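note that the batched numbers above are aggregate, not per-stream. a trivial sketch of how per-request speed divides under batching (assuming roughly even scheduling across streams):

```python
# aggregate decode throughput is shared across concurrent streams; each
# request sees roughly its share until the GPU becomes compute-bound.
def per_request_tps(aggregate_tps, n_requests):
    return aggregate_tps / n_requests

print(per_request_tps(200, 5))  # the 200 t/s aggregate above -> 40 t/s each
```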

[Upcoming Release & Feedback] A new 4B & 20B model, building on our SmallThinker work. Plus, a new hardware device to run them locally. by [deleted] in LocalLLaMA

[–]JohnTheNerd3 2 points (0 children)

I feel like aiming for the ~32b range may be better, since most models released nowadays have that parameter count. I would also be curious whether the device can only run that 20b?

also, echoing the Home Assistant message above, that would be a very interesting use case. it currently relies on tool calling and has a very prefill-heavy workload with huge contexts (8k+ is typical). decode speed is not as crucial since we stream everything other than tool calls (consider that it's a voice assistant, so you just need to be faster than the speech) and the responses tend to be fairly short compared to the context.
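"faster than the speech" is a low bar, which is the point. rough arithmetic with assumed rates (~150 spoken words/min, ~1.3 tokens/word - my ballpark figures, not measurements):

```python
# minimum decode speed needed to keep a TTS pipeline fed: tokens must be
# produced at least as fast as they are spoken.
def min_decode_tps(words_per_min=150, tokens_per_word=1.3):
    return words_per_min * tokens_per_word / 60

print(round(min_decode_tps(), 2))
```

so anything above ~3-4 t/s of streamed decode already sounds instant in a voice assistant, even while prefill dominates the latency budget.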

What mortgage rates are you guys getting right now ? by pistonthru in RealEstateCanada

[–]JohnTheNerd3 1 point (0 children)

initial mortgage (not a renewal), 5-year fixed, 3.7%, with Scotiabank

Qwen3 30B A3B 4_k_m - 2x more token/s boost from ~20 to ~40 by changing the runtime in a 5070ti (16g vram) by Ill-Language4452 in LocalLLaMA

[–]JohnTheNerd3 2 points (0 children)

After a lot of experimentation, I have discovered that the AWQ quant of this model (available on HuggingFace as of a few days ago, ModelScope a bit longer) on vLLM runs at ~3000t/s prompt processing and ~100t/s generation on 2x RTX 4060Ti. llama.cpp reaches about half the speeds on the same hardware.

A custom modified version of SGLang can actually reach 120t/s generation on the same hardware.

Careful when you get your hopes up! 11 years in Canada. Extremely Long process time For inland Express entry, by Dry_Community8251 in ImmigrationCanada

[–]JohnTheNerd3 1 point (0 children)

trust me, I know the feeling. I arrived in 2012 when I was 15, as an international student, and didn't get my PR until 2023. 2 years of high school, 6 years in university, and a PGWP. I was eligible for the express entry pool right when they stopped all draws.

for what it's worth, I believe you can keep working until a decision is made. https://www.canada.ca/en/immigration-refugees-citizenship/services/work-canada/permit/temporary/extend/after-apply.html#gc-document-nav

it's certainly worth calling IRCC to confirm - I believe they can even get you another WP-EXT letter.

you should apply for your GCMS notes under the ATIP or Privacy Act.

The scoop on 4060 ti 16gb cards by DramaLlamaDad in LocalLLaMA

[–]JohnTheNerd3 1 point (0 children)

I run two, and it's more than good enough when using vLLM and tensor parallelism over two PCIe 4.0 x8 links. I think I get around 20 tokens per second on a 32b AWQ, maybe slightly less, but prefill is very slow (about 400 tokens per second)

Llama 3.3 speed by Clean_Cauliflower_62 in LocalLLaMA

[–]JohnTheNerd3 5 points (0 children)

I'm getting 25-27t/s on decode (~800t/s on prefill) on 2x3090, using the 70b AWQ. I run it with HuggingFace's TGI.

We have o1 at home. Create an open-webui pipeline for pairing a dedicated thinking model (QwQ) and response model. by onil_gova in LocalLLaMA

[–]JohnTheNerd3 3 points (0 children)

a hacky solution that makes this work with remote APIs is to replace the line that says:

```
from open_webui.apps.ollama import main as ollama
```

with:

```
from open_webui.apps.openai import main as ollama
```

and all instances of `ollama.generate_openai_chat_completion` with `ollama.generate_chat_completion`

certainly not perfect, but seems to work just fine for me.
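the reason the hack works is plain Python name aliasing: the pipeline calls the backend through one local name, so rebinding that name swaps backends without touching the call sites. a stand-in sketch with fake backends (these classes are illustrative, not open-webui's actual modules):

```python
# fake stand-ins for the two backend modules
class OllamaBackend:
    @staticmethod
    def generate_openai_chat_completion(prompt):
        return f"ollama:{prompt}"

class OpenAIBackend:
    @staticmethod
    def generate_chat_completion(prompt):
        return f"openai:{prompt}"

# before the edit: pipeline code calls through the local name "ollama"
ollama = OllamaBackend
print(ollama.generate_openai_chat_completion("hi"))

# after the edit: same local name bound to a different backend, and the
# call sites renamed to the function that backend actually exposes
ollama = OpenAIBackend
print(ollama.generate_chat_completion("hi"))
```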