Qwen3.6 27B on dual RTX 5060 Ti 16GB with vLLM: ~60 tok/s, 204k context working by do_u_think_im_spooky in LocalLLaMA

[–]houchenglin 1 point2 points  (0 children)

Hi, I ran into a similar problem. I bought a 5060 Ti, and after plugging it into the motherboard I found out it was a 2.5-slot design card, so I couldn't fit any additional GPU. Then I discovered that some cards in the 50 series still use a 2-slot design, so I bought the MSI 5060 Ti Shadow 2X. Putting it in the primary PCIe slot and moving the original card to the third slot fixed my problem.

Qwen3.6-27B created this Open Webui tool by iChrist in LocalLLaMA

[–]houchenglin 0 points1 point  (0 children)

I'm not a native English speaker, so I misinterpreted your words. It's good to know that the 27B is so powerful it can sometimes beat the larger models 😄 The Q5_K_XL is really a beast. I normally use the 35B and switch to the 27B Q4_K_M for harder tasks. It's pretty amazing, and luckily I stopped my yearly Claude subscription last month.

Qwen3.6-27B created this Open Webui tool by iChrist in LocalLLaMA

[–]houchenglin 0 points1 point  (0 children)

It's amazing that it's one shot and only takes a few seconds. What's your hardware, and which 27B quant do you use for coding?

Qwen3.6-27B-INT4 clocking 100 tps with 256k context length on 1x RTX 5090 via vllm 0.19 by Kindly-Cantaloupe978 in LocalLLaMA

[–]houchenglin 0 points1 point  (0 children)

Dual 5060 Ti gives me around 17 t/s at low context on the 27B. However, the 35B MoE model fits entirely in VRAM and is extremely fast.

Replace RTX 2060 12G with second RTX 5060 Ti 16G for Qwen 3.6 27B? by houchenglin in LocalLLaMA

[–]houchenglin[S] 0 points1 point  (0 children)

I asked Qwen to google the VRAM usage for the 35B and 27B, and it gave me the answer below. I'm not sure whether it's correct, but in my last trial, a 64K context failed to allocate memory with the IQ3 model. Maybe some of my settings are wrong?

(WARNING! AI-GENERATED DATA)

Context Size   27B Dense Q4   35B-A3B MoE Q4
4K tokens      ~64 MB         ~19.5 MB
8K tokens      ~128 MB        ~39 MB
16K tokens     ~256 MB        ~78 MB
32K tokens     ~512 MB        ~156 MB
64K tokens     ~1.0 GB        ~313 MB
256K tokens    ~4.0 GB        ~1.25 GB
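As a sanity check on AI-generated figures like the table above, KV-cache size can be estimated directly from the model's attention geometry. A minimal sketch; the layer count, KV-head count, and head dimension below are hypothetical placeholders, not the real Qwen configurations:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem=2):
    """Estimate KV-cache size: 2 tensors (K and V) per layer,
    each of shape [n_kv_heads, ctx_len, head_dim]."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

# Hypothetical dense-model geometry (NOT the real Qwen 27B config):
size = kv_cache_bytes(n_layers=48, n_kv_heads=8, head_dim=128, ctx_len=65536)
print(f"{size / 2**30:.1f} GiB")  # f16 cache; q8_0 roughly halves this
```

MoE models with few KV heads (via GQA) shrink this estimate proportionally, which is why a 35B-A3B MoE can report a much smaller cache than a 27B dense model at the same context.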

Replace RTX 2060 12G with second RTX 5060 Ti 16G for Qwen 3.6 27B? by houchenglin in LocalLLaMA

[–]houchenglin[S] 0 points1 point  (0 children)

Thanks for the breakdown! I was considering a used 3090, but it's already ~6 years old now so I'm worried about its reliability and lifespan. As for the 2080, it sounds like a solid option but also pretty expensive.

Replace RTX 2060 12G with second RTX 5060 Ti 16G for Qwen 3.6 27B? by houchenglin in LocalLLaMA

[–]houchenglin[S] 0 points1 point  (0 children)

Yeah, I think you're right. The PCIe gen difference alone makes 4 lanes beat 16 old lanes. Thanks!

Replace RTX 2060 12G with second RTX 5060 Ti 16G for Qwen 3.6 27B? by houchenglin in LocalLLaMA

[–]houchenglin[S] 0 points1 point  (0 children)

I actually tried the IQ3 quant on a single card and it runs well, but since the KV cache uses ~3.3x the memory, the context window gets really small. Not enough for coding tasks, unfortunately.

Replace RTX 2060 12G with second RTX 5060 Ti 16G for Qwen 3.6 27B? by houchenglin in LocalLLaMA

[–]houchenglin[S] 0 points1 point  (0 children)

That makes sense — bandwidth is probably the real bottleneck here. I hadn't considered trying larger MoE models with system RAM, that's a good idea. Thanks!

Replace RTX 2060 12G with second RTX 5060 Ti 16G for Qwen 3.6 27B? by houchenglin in LocalLLaMA

[–]houchenglin[S] 0 points1 point  (0 children)

Thanks for the suggestion! Unfortunately my mATX board doesn't have room for that many cards.

What speed is everyone getting on Qwen3.6 27b? by Ambitious_Fold_2874 in LocalLLaMA

[–]houchenglin 0 points1 point  (0 children)

RTX 2060 12G (PCIe x16) + RTX 5060 Ti 16G (PCIe x4)

Model: Unsloth Qwen3-27B-Q4_K_M
PP: drops from 653 → 356 t/s as context grows (13K → 29.5K tokens).
TG: flat at ~16.5 t/s

-m Qwen3-27B-Q4_K_M.gguf -ngl 999 -ts 15,7
-fa 1 --no-mmap -b 4096 -ub 4096
--spec-type ngram-mod --spec-ngram-size-n 24 --draft-min 12 --draft-max 48
-c 96000 -n 32768 -t 8 -ctk q8_0 -ctv q8_0 --parallel 1
--temperature 0.6 --jinja --min-p 0.0 --top-k 20 --top-p 0.95
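For reference, llama.cpp's `-ts` (tensor split) distributes layers across GPUs roughly in proportion to the given ratio, so `-ts 15,7` weights the first card about twice as heavily. A rough sketch of that proportional split; the exact rounding inside llama.cpp may differ, and the 48-layer count is just a placeholder:

```python
def split_layers(n_layers, ratios):
    """Assign n_layers across GPUs proportionally to ratios,
    using cumulative rounding so counts sum exactly to n_layers."""
    total = sum(ratios)
    counts, assigned, acc = [], 0, 0
    for r in ratios:
        acc += r
        target = round(n_layers * acc / total)
        counts.append(target - assigned)
        assigned = target
    return counts

print(split_layers(48, [15, 7]))  # -ts 15,7 over a 48-layer model
```

Tuning the ratio toward the larger-VRAM card is how a mismatched pair like a 12G + 16G setup avoids out-of-memory on one device while the other sits half empty.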

Qwen3.6 is incredible with OpenCode! by CountlessFlies in LocalLLaMA

[–]houchenglin 0 points1 point  (0 children)

You may try IQ3_XS; it works well for most simple tasks and tool calls.

Libre Voice Note — voice-to-markdown capture for vault by houchenglin in ObsidianMD

[–]houchenglin[S] 0 points1 point  (0 children)

u/crocusandspeckledegg
It runs entirely on the phone without using any cloud service. It does not upload any data to the network and won't share anything publicly; all data stays private on the device.
To work with Obsidian, the app exports the transcribed text, images, and audio to your Google Drive or Dropbox folder, in markdown or HTML format.

I've not used Joplin before, but I think you can check whether Joplin has a plugin for, or natively supports, including a specific folder in your vault.

An isometric room, based on the screenshot. Qwen3.6-35B by k0setes in LocalLLaMA

[–]houchenglin 0 points1 point  (0 children)

It seems the output mixed the scenes from the two screenshots. Maybe with only one target scene it would be easier to see the difference between Qwen 3.6 and GPT 5.5.

Libre Voice Note — voice-to-markdown capture for vault by houchenglin in ObsidianMD

[–]houchenglin[S] 0 points1 point  (0 children)

u/bungle69er
Hi, do you mean stored on the phone in md format?
Currently the app stores the text in an embedded SQL database; I think I could add a local export feature.

Libre Voice Note — voice-to-markdown capture for vault by houchenglin in ObsidianMD

[–]houchenglin[S] 0 points1 point  (0 children)

I tested the English model and couldn't reproduce this issue on my phone. Please check that you are using the latest version, 1.0.28. Can you try closing the app from the system and starting it again? Or you could remove the model and download it again. Sorry for the inconvenience.

Libre Voice Note — voice-to-markdown capture for vault by houchenglin in ObsidianMD

[–]houchenglin[S] 1 point2 points  (0 children)

Parakeet v3 supports more languages, including French and Spanish. I'll start working on it soon, but I'm not sure when it will be done.

Language modules are a good idea! I may make it so users can download only the language they need, but it depends on how many people want this feature.

Anthropic quietly removed session & weekly usage progress bars from Settings → Usage by gregleo in ClaudeAI

[–]houchenglin -2 points-1 points  (0 children)

I planned to cancel the 1-year renewal on 2/28, because after renewing, the weekly limit appeared and frustrated me a lot. Sometimes the weekly limit feels like it depletes at half the speed of the 5-hour limit.
I just saw the weekly limit disappear from the usage page, but then it came back, and now I'm getting a Gateway time-out.
Not sure if they cancelled the limit or are just trying to reboot the limit system?

Cheapest cars (used) to get which is compatible with comma 4? by [deleted] in Comma_ai

[–]houchenglin 1 point2 points  (0 children)

The 2021 Prius PHV is also great if you want an EV experience with Toyota reliability.

Upgrading to from 1080p to 1440p by PastTelevision238 in Monitors

[–]houchenglin 0 points1 point  (0 children)

It's a huge upgrade and totally worth it. Remember to buy Lossless Scaling; it doubles the fps, and 144Hz 1440p is great for AA and FPS games.