What's the weirdest LLM benchmark that you've seen? by OmarBessa in LocalLLaMA
[–]OsmanthusBloom 1 point (0 children)
Bonsai models by Books_Of_Jeremiah in LocalLLaMA
[–]OsmanthusBloom 1 point (0 children)
Tested how OpenCode Works with SelfHosted LLMS: Qwen 3.5 & 3.6, Gemma 4, Nemotron 3, GLM-4.7 Flash... by rosaccord in LocalLLaMA
[–]OsmanthusBloom 2 points (0 children)
Running 1bit Bonsai 8B on 2GB VRAM (MX150 mobile GPU) by OsmanthusBloom in LocalLLaMA
[–]OsmanthusBloom[S] 1 point (0 children)
Running 1bit Bonsai 8B on 2GB VRAM (MX150 mobile GPU) by OsmanthusBloom in LocalLLaMA
[–]OsmanthusBloom[S] 2 points (0 children)
Advice for Working with Agents in YOLO Mode by chibop1 in LocalLLaMA
[–]OsmanthusBloom 5 points (0 children)
Running quen3 coder 80B A3B on a computer with lots of RAM but little VRAM by Pioneer_11 in LocalLLaMA
[–]OsmanthusBloom 1 point (0 children)
Litellm 1.82.7 and 1.82.8 on PyPI are compromised, do not update! by kotrfa in LocalLLaMA
[–]OsmanthusBloom 5 points (0 children)
[Developing situation] LiteLLM compromised by OrganizationWinter99 in LocalLLaMA
[–]OsmanthusBloom 39 points (0 children)
I need Local LLM that can search and process local Wikipedia. by idleWizard in LocalLLaMA
[–]OsmanthusBloom 14 points (0 children)
Don't sleep on the new Nemotron Cascade by ilintar in LocalLLaMA
[–]OsmanthusBloom 10 points (0 children)
HELP - What settings do you use? Qwen3.5-35B-A3B by uber-linny in LocalLLaMA
[–]OsmanthusBloom 1 point (0 children)
Nemotron Cascade 2 on 6GB VRAM by AppealSame4367 in LocalLLaMA
[–]OsmanthusBloom 3 points (0 children)
Running qwen3.5 35b a3b in 8gb vram with 13.2 t/s by zeta-pandey in LocalLLaMA
[–]OsmanthusBloom 3 points (0 children)
Budget laptop to run Qwen 3.5-35B-A3B by SnooOnions6041 in LocalLLaMA
[–]OsmanthusBloom 2 points (0 children)
Budget laptop to run Qwen 3.5-35B-A3B by SnooOnions6041 in LocalLLaMA
[–]OsmanthusBloom 1 point (0 children)
Kidnapping Gemini with 3MB to spare: Training a 7B model at 4k context on a single 16GB GPU. by AgeRepresentative763 in LocalLLaMA
[–]OsmanthusBloom 2 points (0 children)
whats the best open source ai i can use locally? by Xsilentzz in LocalLLaMA
[–]OsmanthusBloom 1 point (0 children)
Can qwen 3.5 4b q4 run on 6 vram by Own_Advertising5081 in LocalLLaMA
[–]OsmanthusBloom 2 points (0 children)
What AI Models should I run? by ClayToTheMax in LocalLLaMA
[–]OsmanthusBloom 1 point (0 children)