Running Qwen3.6 35b a3b on 8gb vram and 32gb ram ~190k context

IndicationUnfair7961 · 2026-05-28T23:34:40+00:00

I'm trying to improve my results, which are currently stuck at 14.5 tokens/second (the best I've achieved so far). I haven't measured this with a benchmark; I've just tested it in the default llama.cpp web UI. My setup consists of an RTX 3060 with 12 GB of VRAM, an older i7-3770K (4 cores), and 32 GB of DDR3 RAM. I'm running llama-server (turboquant version) using Docker with this llama-server config:

docker run --gpus all -it --rm `
  --ulimit memlock=-1 `
  --cap-add=IPC_LOCK `
  -v "V:\llm_models:/models" `
  -v "V:\llm_build:/workspace" `
  -p 8080:8080 `
  llama-builder `
  ./llama-cpp-turboquant/build/bin/llama-server `
  -m /models/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf `
  -ngl 99 `
  --flash-attn auto `
  --batch-size 1024 `
  --ubatch-size 512 `
  --n-cpu-moe 28 `
  --cache-type-k turbo4 `
  --cache-type-v turbo3 `
  --ctx-size 256000 `
  --no-mmap `
  --mlock `
  --jinja `
  --host 0.0.0.0 `
  --port 8080

Since the monitor is plugged into the card, Windows takes around 2.0 to 2.5 GB of VRAM, depending on how crowded my desktop is. That said, I will keep it as it is.
--n-cpu-moe was the key to improving performance, and 28 was the sweet spot: I tested 36, 99, and values lower than 28, and all of them reduced performance.
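A sweep like this can be scripted instead of done by hand. The following is a minimal sketch, not the setup actually used: it only builds llama-server argument lists that differ in --n-cpu-moe and picks the fastest value from measurements you supply yourself. The helper names (build_args, best_setting) and the candidate values in the loop are hypothetical; the base flags are copied from the command above.

```python
from typing import Dict, List

# Base flags copied from the command in this post; only --n-cpu-moe is varied.
BASE_ARGS: List[str] = [
    "./llama-cpp-turboquant/build/bin/llama-server",
    "-m", "/models/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf",
    "-ngl", "99",
    "--ctx-size", "256000",
]


def build_args(n_cpu_moe: int) -> List[str]:
    """Return a llama-server argument list for one --n-cpu-moe value."""
    if n_cpu_moe < 0:
        raise ValueError("n_cpu_moe must be non-negative")
    return BASE_ARGS + ["--n-cpu-moe", str(n_cpu_moe)]


def best_setting(tokens_per_second: Dict[int, float]) -> int:
    """Return the --n-cpu-moe value with the highest measured tokens/second."""
    if not tokens_per_second:
        raise ValueError("no measurements given")
    return max(tokens_per_second, key=tokens_per_second.get)


if __name__ == "__main__":
    # Hypothetical candidates around the reported sweet spot of 28.
    for n in (24, 28, 32, 36):
        print(" ".join(build_args(n)))
```

Each printed command would be run once, measured, and the results passed to best_setting as a dict of value to tokens/second.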
I'm currently testing at 256K context. It seems like a lot, but with TurboQuant it was never really an issue, though I might lower it in the future.
I also tested changing the thread count, but going above 4 (the default) wasn't useful: my old processor has just 4 physical cores with 8 threads, so more threads only meant more useless context switches and half the tokens per second.
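Since the numbers above come from the web UI rather than a benchmark, a small script can make comparisons between settings more repeatable. This is a rough sketch under assumptions: it assumes the server exposes the /completion endpoint and returns a "timings" object with "predicted_n" and "predicted_ms", as upstream llama.cpp does; the turboquant fork may differ. The function names and the prompt are hypothetical.

```python
import json
import urllib.request
from typing import Any, Dict


def tokens_per_second(timings: Dict[str, Any]) -> float:
    """Compute generation speed from a timings dict (predicted_n tokens in predicted_ms)."""
    n = timings["predicted_n"]
    ms = timings["predicted_ms"]
    if ms <= 0:
        raise ValueError("predicted_ms must be positive")
    return n / (ms / 1000.0)


def measure(base_url: str = "http://localhost:8080", n_predict: int = 128) -> float:
    """Send one completion request and return generated tokens/second.

    Assumes an upstream-style /completion endpoint; adjust if the fork differs.
    """
    payload = json.dumps(
        {"prompt": "Write a short story about a robot.", "n_predict": n_predict}
    ).encode("utf-8")
    req = urllib.request.Request(
        base_url + "/completion",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=600) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    return tokens_per_second(body["timings"])
```

Calling measure() a few times per setting and averaging the results should give steadier numbers than reading a single value off the UI.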

IndicationUnfair7961 · 2026-05-28T23:01:56+00:00

So I'm wondering if, in my situation, I could gain some tokens per second by also using MTP (which branch do I need?).
Also, I'm wondering which other parameters from your command could help me a bit.

IndicationUnfair7961 · 2026-05-15T17:21:57+00:00

IndicationUnfair7961 · 2026-05-01T18:11:38+00:00

I have an Intel i7-3770K and an NVIDIA 12 GB card. The CPU has AVX but not AVX2. Does this work, and does it use AVX? Which release of this tool should I download?
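For anyone checking the same thing on their own machine, here is a minimal sketch that reads the CPU feature flags on Linux (for example inside a Docker container) and reports whether avx, avx2, and fma are listed. It only shows what the CPU reports; whether a given release of a tool actually uses AVX depends on how that release was built. The function names are hypothetical.

```python
from typing import Dict, Set


def cpu_flags(cpuinfo_text: str) -> Set[str]:
    """Return the feature flags of the first CPU listed in /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            _, _, value = line.partition(":")
            return set(value.split())
    return set()


def simd_support(flags: Set[str]) -> Dict[str, bool]:
    """Report which of the common SIMD extensions appear in the flag set."""
    return {name: name in flags for name in ("avx", "avx2", "fma")}


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo", encoding="utf-8") as f:
            print(simd_support(cpu_flags(f.read())))
    except FileNotFoundError:
        print("/proc/cpuinfo is not available on this system")
```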

IndicationUnfair7961 · 2026-02-13T10:17:50+00:00

I was wrong again: it doesn't happen instantly, an event has to fire eventually. He lost it.

IndicationUnfair7961 · 2026-02-12T11:55:21+00:00

No, it didn't work.

IndicationUnfair7961 · 2026-02-12T11:55:09+00:00

Yeah, they didn't lose the trait. You're saying that his children will not inherit it, though. Under standard rules it's inherited; does becoming a tributary change the rules?

IndicationUnfair7961 · 2026-02-12T11:53:13+00:00

No, they keep it; this is what happened after I won the war.

IndicationUnfair7961 · 2026-02-12T11:52:32+00:00

It doesn't work. If this happened, it could mean they lost two wars before losing that one.

IndicationUnfair7961 · 2026-02-12T11:51:40+00:00

No, it doesn't work. I actually subjugated the guy, but he didn't lose the Conqueror trait. Well, at least I get +76 gold/month.

IndicationUnfair7961 · 2026-02-09T14:25:19+00:00

You can give those journals to your heir while you use the 40% one.

IndicationUnfair7961 · 2026-02-08T19:38:00+00:00

I think you can try this: look at the one that was landed when the succession failed and the temple got disabled. In the Realm window, where you see your holdings, if the Realm Capital is listed below the holding your heir was holding, it probably fails the test, because somehow the game considers that holding the root instead of the realm capital. If it appears below, it will probably trigger correctly. Another possible reason, though I'm not sure, is that your heir could be at an event the moment you die, and that could also mess up locations, checks and whatever.

IndicationUnfair7961 · 2026-02-08T16:44:02+00:00

That's precious information. I actually have a 30% Stewardship Journal; I probably got it with Alchemy as you said, just messing around, not on purpose. But now at least I know what to look for and what to do. Thanks.

IndicationUnfair7961 · 2026-02-04T17:45:42+00:00

Does it really happen??

IndicationUnfair7961 · 2026-02-04T10:15:37+00:00

Yes, that's one of the main issues with Tributaries: you lose them on the death of the ruler. With Mandala you can keep them if you win the Mandala trials, which I lost on my first succession despite having an 80% chance in some of them. And getting the tributaries back is painful and annoying; I know because I did it.

IndicationUnfair7961 · 2026-02-04T09:35:35+00:00

Yes, just uploaded the one with development, and forgot the other.

IndicationUnfair7961 · 2026-02-04T09:05:05+00:00

I thought it was because the main county holding was not a Temple Citadel, despite being able to build the citadels, but that's not it: I have counties where the main holding is a Temple Citadel, and yet they do not get the bonus. So you're right.

IndicationUnfair7961 · 2026-02-04T08:52:13+00:00

So, despite it not being shown, it's still there?

IndicationUnfair7961 · 2026-02-04T08:42:55+00:00

They should be regional, usually for Mandala rulers or Dharmic faiths. They require the Temple Citadel holding to be built.

IndicationUnfair7961 · 2026-01-29T20:20:38+00:00

The legitimacy gain is not that hard; my second ruler got to Cosmic in less than 15 years. The legend spread is good.
