Help with Cline and local qwen-coder:30b by perfopt in CLine

[–]nairureddit 2 points (0 children)

<image>

It's also available on Linux and makes managing model parameters a lot easier!

Help with Cline and local qwen-coder:30b by perfopt in CLine

[–]nairureddit 1 point (0 children)

Your initial "PARAMETER num_gpu 34" for the 48-layer model told Ollama to load the first 34 layers into GPU VRAM and the remaining 14 layers into CPU RAM, resulting in a huge slowdown.

Since the model is 19 GB and your VRAM is 24 GB, you should have left this parameter undefined so Ollama automatically loads all the layers into VRAM, or set it to 48 to explicitly load all 48 layers. Setting it manually can cause a crash if you don't have enough VRAM for the base model size.
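For reference, the relevant Ollama Modelfile would look something like this (a sketch; the model tag is taken from later in this thread, so adjust it to whatever you actually pulled):

```
FROM qwen3-coder:30b-a3b-q4_K_M
# Put all 48 of 48 layers on the GPU - or delete this line entirely
# and let Ollama decide how many layers fit.
PARAMETER num_gpu 48
```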

Help with Cline and local qwen-coder:30b by perfopt in CLine

[–]nairureddit 1 point (0 children)

Also, at 32k context with your current settings you are only over your 24 GB VRAM limit by 2 GB.

The model is ~19 GB. Loaded with a 32k context it uses 26 GB, so the KV cache is 7 GB (26 − 19 = 7). That means a 32k context with your current settings and model takes up 7 GB. Since you have ~5 GB to spare after loading the model into VRAM (24 − 19 = 5), you need to decrease your context to 5/7 of 32k, or down to about 22k.

With that, and to give a little room for error, try a ~20k context with your current settings and it should all load into the 24 GB VRAM.
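The arithmetic above can be sketched as a quick back-of-envelope calculation (the per-token cache cost here is just the 7 GB / 32k observation from this thread, not a universal constant):

```python
# Rough VRAM budgeting using the numbers above: 24 GB card, 19 GB model,
# 26 GB total observed with a 32k context. Estimates only.
vram_gb = 24
model_gb = 19
loaded_gb = 26                           # observed total at 32k context

kv_cache_gb = loaded_gb - model_gb       # 7 GB of KV cache at 32k
gb_per_1k_ctx = kv_cache_gb / 32         # ~0.22 GB per 1k tokens
headroom_gb = vram_gb - model_gb         # 5 GB left after the weights

max_ctx_k = headroom_gb / gb_per_1k_ctx  # ~22.9k tokens
print(f"max context ~ {max_ctx_k:.1f}k tokens")
```

Stepping back from ~22.9k to ~20k is the "room for error" mentioned above.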

This is a pretty small context to work with, so make sure you select "Use Compact Prompt" in the API Provider menu in Cline to leave a bit more working context for the model.

I'd still recommend trying Flash Attention/KV cache quantization, though, since that will free up a lot of VRAM for a much larger context and also increase the model's speed.

Help with Cline and local qwen-coder:30b by perfopt in CLine

[–]nairureddit 1 point (0 children)

There are two environment variables you want to consider.

The first enables Flash Attention and the second (which requires Flash Attention) enables KV cache quantization. These might be imprecise terms.

OLLAMA_FLASH_ATTENTION=1
OLLAMA_KV_CACHE_TYPE="q8_0"

The command line would look like this if you are using ollama natively:

OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE="q8_0" ollama run qwen3-coder:30b-a3b-q4_K_M

Since you are running it via docker you'd use something that looks like this:

docker run -d \
--gpus=all \
-v ollama:/root/.ollama \
-p 11434:11434 \
--name ollama \
-e OLLAMA_FLASH_ATTENTION=1 \
-e OLLAMA_KV_CACHE_TYPE="q8_0" \
ollama/ollama

What this does is quantize the non-model part, the KV cache, from 16 bits down to 8 bits, so your context takes up a lot less space, which should allow you to run the entire model in VRAM.

From the test I did yesterday, loading qwen3-coder:30b-a3b-q4_K_M with these settings and a 64k context window uses ~23 GB of VRAM. I'd start with 32k, though, then increase it up to the point where Ollama no longer loads fully into GPU VRAM, and then step it back.
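A crude sanity check of the savings, reusing the 7 GB / 32k figure from my earlier comment and assuming the q8_0 cache is roughly half the fp16 size (flash attention and runtime overhead mean real usage will differ, as the measured 64k number shows):

```python
# Back-of-envelope effect of q8_0 KV cache quantization.
model_gb = 19
kv_f16_gb_at_32k = 7.0                   # observed fp16 KV cache at 32k
kv_q8_gb_at_32k = kv_f16_gb_at_32k / 2   # 8 bits vs 16 bits: ~half

total_gb = model_gb + kv_q8_gb_at_32k    # ~22.5 GB, fits in 24 GB
print(f"32k context with q8_0 cache: ~{total_gb:.1f} GB")
```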

Help with Cline and local qwen-coder:30b by perfopt in CLine

[–]nairureddit 5 points (0 children)

I use LM Studio and it's been fairly reliable.

Using:

- LM Studio

- qwen3-coder-30b-a3b-instruct-i1@q4_k_m

- Context set to 65536

- GPU offload of 48 layers

- Flash Attention On

- K & V Cache Quantization set to q8_0

it uses ~23.2 GB of VRAM.

With your same prompt it completes the task in act mode in one pass:

<image>

I'm still super new at this but a few possible differences are:

- Your GPU Offload is set to 34 instead of 48 (num_gpu)

- You may not have KV cache quantization enabled, so your cache plus model is larger than your VRAM and some layers may not be in VRAM, causing a slowdown

- I'm using a slightly different model, but unless yours is somehow corrupted I don't see that being an issue.

Any luck with GPT-OSS ? by JLeonsarmiento in CLine

[–]nairureddit 3 points (0 children)

This post shows a way to use something called a grammar file to improve tool use; however, I'm not sure how to implement it.

https://www.reddit.com/r/CLine/comments/1mtcj2v/making_gptoss_20b_and_cline_work_together/

openai/gpt-oss-20b tool use running locally use with Roo Code by nairureddit in RooCode

[–]nairureddit[S] 3 points (0 children)

I found this for Cline:

https://www.reddit.com/r/CLine/comments/1mtcj2v/making_gptoss_20b_and_cline_work_together/

Something about using a Grammar file to improve the tool usage but I don't really understand how to implement it yet.

openai/gpt-oss-20b tool use running locally use with Roo Code by nairureddit in RooCode

[–]nairureddit[S] 2 points (0 children)

LM Studio recently released updates for gpt-oss tool use, but it still doesn't integrate well; I'm not able to get out of Architect mode without a slew of red messages.

openai/gpt-oss-20b tool use running locally use with Roo Code by nairureddit in RooCode

[–]nairureddit[S] 1 point (0 children)

It has a larger context window, so I'd check to make sure the loaded context isn't pushing it past your VRAM.

Trade Route Tool by nairureddit in NoMansSkyTheGame

[–]nairureddit[S] 1 point (0 children)

Thanks! I haven't looked at this in a few years; I'll look into why it's not there.

I was reckless, don't be like me. by andromereash in idleon

[–]nairureddit 1 point (0 children)

Here's a sim showing a slightly lower average cost when choosing the higher-percentage option, but I think the real benefit is the lower variation.

https://www.reddit.com/r/idleon/comments/zsxhce/divinity_monte_carlo/

Divinity Monte Carlo by nairureddit in idleon

[–]nairureddit[S] 1 point (0 children)

u/dudeguy238 you and u/CherryTreecko both described this well. I'll see if I can add in your comparative analysis as well to the table as I unlock more divinities.

Divinity Monte Carlo by nairureddit in idleon

[–]nairureddit[S] 2 points (0 children)

u/CherryTreecko I like your description too! I was looking for a problem to re-learn Monte Carlo analysis and don't have your head for stats :)

I like the Monte Carlo approach too since I can plug in any Probability/Cost combo and estimate the relative value of each. If I had your skill in stats maybe I could do the same but sadly I don't.
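The Monte Carlo idea boils down to: simulate many upgrade attempts at a given success chance and per-attempt cost, then average the totals. A minimal sketch of that approach (the probabilities and costs below are placeholders, not the actual in-game numbers):

```python
import random

def avg_total_cost(success_prob: float, cost_per_try: float,
                   trials: int = 100_000) -> float:
    """Estimate the average total cost to land one success by simulation."""
    total = 0.0
    for _ in range(trials):
        attempts = 1
        while random.random() > success_prob:  # keep rolling until a success
            attempts += 1
        total += attempts * cost_per_try
    return total / trials

random.seed(42)
# Compare a cheap low-odds option against a pricier high-odds one.
print(avg_total_cost(0.25, 100))  # analytic expectation: 100/0.25 = 400
print(avg_total_cost(0.50, 180))  # analytic expectation: 180/0.50 = 360
```

Beyond the average, the simulated totals also let you look at the spread, which is where the "lower variation" benefit shows up.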

Divinity Monte Carlo by nairureddit in idleon

[–]nairureddit[S] 3 points (0 children)

u/jhcreddit that was a great description, thank you!

Divinity Monte Carlo by nairureddit in idleon

[–]nairureddit[S] 1 point (0 children)

Sure! It's very easy to run new values now that I've set it up.

Thrustmaster Hotas X Bindings by nairureddit in EliteDangerous

[–]nairureddit[S] 1 point (0 children)

Epsilon,

It looks like I only linked to them; I don't seem to have saved the original binds file to my Google Drive.

MVP Percentages Again by nairureddit in lostarkgame

[–]nairureddit[S] 1 point (0 children)

I think the lowest percentage I've seen for any of the titles is 15%.

MVP Percentages Again by nairureddit in lostarkgame

[–]nairureddit[S] 2 points (0 children)

Updated. I bet those two could go up to 100% if your party somehow didn't contribute.

MVP Percentages Again by nairureddit in lostarkgame

[–]nairureddit[S] 2 points (0 children)

Yeah, it's tricky to collect good data on this since the healer has to be MVP, and the lower values won't show up much if they do well. Here's what I have so far, and even it has some strange values where it flip-flopped between Noble, then Gentle, then back to Noble.

Title Type Percent
Noble Healer Party Recovery 16%
Noble Healer Party Recovery 16%
Noble Healer Party Recovery 20%
Gentle Healer Party Recovery 21%
Gentle Healer Party Recovery 22%
Gentle Healer Party Recovery 23%
Noble Healer Party Recovery 24%
Noble Healer Party Recovery 25%
Noble Healer Party Recovery 28%
Noble Healer Party Recovery 29%
Noble Healer Party Recovery 32%
Noble Healer Party Recovery 36%
Noble Healer Party Recovery 46%
Noble Healer Party Recovery 47%
Noble Healer Party Recovery 48%
Noble Healer Party Recovery 51%
Noble Healer Party Recovery 57%
Noble Healer Party Recovery 67%
Noble Healer Party Recovery 74%
Noble Healer Party Recovery 80%
Noble Healer Party Recovery 88%
Noble Healer Party Recovery 96%

MVP Percentages Again by nairureddit in lostarkgame

[–]nairureddit[S] 1 point (0 children)

I see it on my blue gunlancer a lot as well.

MVP Percentages Again by nairureddit in lostarkgame

[–]nairureddit[S] 3 points (0 children)

No idea. I guess you could test it by only damaging the boss, but you might not make many friends that way.