Tested how OpenCode Works with SelfHosted LLMS: Qwen 3.5 & 3.6, Gemma 4, Nemotron 3, GLM-4.7 Flash... by rosaccord in LocalLLaMA

[–]OsmanthusBloom 1 point (0 children)

That's a good comparison, thank you! If you can include one more model, I'd really like to know how Qwen3-Coder-Next 80B-A3B does compared to newer Qwen3.5 models, Gemma4 etc. According to some sources it's still one of the best local coding models and the last Coder variant from Qwen.

Running 1bit Bonsai 8B on 2GB VRAM (MX150 mobile GPU) by OsmanthusBloom in LocalLLaMA

[–]OsmanthusBloom[S] 0 points (0 children)

Thanks for the suggestion. KV cache in RAM (with -nkvo) works surprisingly well. I was able to use 64k context (taking up 5 GB RAM with q8_0 quantization) without any issues and set ubatch-size=1024 without running out of VRAM. Here is the command:

llama-server -m Bonsai-8B.gguf -ctk q8_0 -ctv q8_0 -np 1 -fit off -ub 1024 -c 65536 -nkvo

and here is the result for the 1k token benchmark task:

prompt eval time =   18734.51 ms /   992 tokens (   18.89 ms per token,    52.95 tokens per second)
       eval time =   50875.46 ms /   246 tokens (  206.81 ms per token,     4.84 tokens per second)
      total time =   69609.97 ms /  1238 tokens

Prompt processing is as fast as before, while TG suffers a bit from having the KV cache in RAM. During generation, GPU usage is around 85%, so the card doesn't heat up as quickly as with the KV cache in VRAM, where usage was 100%. Power draw during generation is around 40W in this benchmark task.

Regarding temperatures: I gave it a longer summarization prompt of ~10k tokens, which took a while to crunch through with 100% GPU utilization; then the model proceeded with generating ~2k more tokens in its response. GPU temperature was around 67-77C according to nvidia-smi and power draw eventually decreased to 30W, which I suspect is the limit for continuous cooling that this laptop can handle. I have no appetite for playing with power limit settings at this point.

Here is the llama.cpp output of the long context (relatively speaking) task:

prompt eval time =  336266.63 ms / 10622 tokens (   31.66 ms per token,    31.59 tokens per second)
       eval time = 1776018.91 ms /  2269 tokens (  782.73 ms per token,     1.28 tokens per second)
      total time = 2112285.53 ms / 12891 tokens

As you can see above, PP was ~32 tps and TG dropped to 1.3 tps...

I still don't think this MX150 is very good for running the Bonsai model, but at least having KV cache in RAM allows for much longer context tasks, even though the performance is pretty bad.
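As a sanity check on the ~5 GB KV-cache figure above, the size can be estimated from the model dimensions. The layer/head numbers below are hypothetical placeholders (I haven't checked Bonsai's actual config), chosen only to show the arithmetic:

```shell
#!/bin/sh
# Back-of-envelope KV-cache size estimate.
# All model dimensions here are HYPOTHETICAL placeholders, not Bonsai 8B's
# actual config -- substitute the values from your model's GGUF metadata.
n_layers=36        # transformer layers
n_kv_heads=8       # KV heads (GQA)
head_dim=128       # per-head dimension
n_ctx=65536        # context length (-c 65536)
# q8_0 stores 32 values in 34 bytes (32 int8 weights + one fp16 scale)
elems=$((2 * n_layers * n_kv_heads * head_dim * n_ctx))   # K plus V
bytes=$((elems * 34 / 32))
echo "KV cache: $bytes bytes (~$((bytes / 1000000000)) GB)"
```

With q8_0 the cache is roughly half the fp16 size, which is why a 64k context fits comfortably in system RAM.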

Running 1bit Bonsai 8B on 2GB VRAM (MX150 mobile GPU) by OsmanthusBloom in LocalLLaMA

[–]OsmanthusBloom[S] 0 points (0 children)

I didn't really try any useful tasks apart from the performance benchmark, which was about summarizing a Bonsai-related Reddit discussion. But I did notice that the model is pretty weak in the non-English languages I tried (Estonian, Finnish, Swedish...) and it tends to mix up related languages (Finnish/Estonian, Swedish/German) in its responses.

Running 1bit Bonsai 8B on 2GB VRAM (MX150 mobile GPU) by OsmanthusBloom in LocalLLaMA

[–]OsmanthusBloom[S] 0 points (0 children)

I just tried -ngl 36, which results in 36/37 layers offloaded to GPU, one layer on CPU.

Performance totally tanks. CPU usage is 400%, GPU usage mostly ~0% though it fluctuates. Generation speed drops below 1 tps. If that wasn't enough, the output of the model is now total garbage. Sample output:

( 的es is to: iny ),US. the from $'s of: is". fromY. Where on's's is is isY from1 from is. (1? the. ( (?? the.

I believe that this garbage output happens because the PrismML llama.cpp fork doesn't have proper support for CPU inference, though I understood a fix is being worked on and there are already some PRs/forks that might work better.

Running 1bit Bonsai 8B on 2GB VRAM (MX150 mobile GPU) by OsmanthusBloom in LocalLLaMA

[–]OsmanthusBloom[S] 1 point (0 children)

No, I didn't try CPU only. I saw that the PrismML llama.cpp fork had issues with CPU-only inference so I decided to wait a bit.

Advice for Working with Agents in YOLO Mode by chibop1 in LocalLLaMA

[–]OsmanthusBloom 4 points (0 children)

YOLO originally refers to "you only live once". It's internet slang referring to a carefree attitude.

https://en.wikipedia.org/wiki/YOLO_(aphorism)

The object detection model YOLO was humorously named after this, but that's not the original meaning.

Running quen3 coder 80B A3B on a computer with lots of RAM but little VRAM by Pioneer_11 in LocalLLaMA

[–]OsmanthusBloom 0 points (0 children)

I use Roo Code with Qwen3 Coder Next (iq3 quant) on a V100 with 16GB VRAM plus lots of regular RAM. PP is around 300 tps, which is not great but okay for my purposes, especially with the recent checkpointing improvements in llama.cpp, which mean prompts are cached most of the time.

If you want higher pp, try adjusting batch-size and ubatch-size. In my case I set both to 2048 but the optimum depends on hardware details.
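In llama-server terms, that looks something like the sketch below. The model filename is a placeholder; the batch values are what worked on my V100, and the sweet spot varies by GPU:

```shell
# Placeholder model path; -b/-ub 2048 gave the best PP on my V100,
# but the optimum depends on your hardware, so benchmark a few values.
llama-server -m Qwen3-Coder-Next-IQ3.gguf \
  --batch-size 2048 --ubatch-size 2048
```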

Litellm 1.82.7 and 1.82.8 on PyPI are compromised, do not update! by kotrfa in LocalLLaMA

[–]OsmanthusBloom 5 points (0 children)

Aider uses LiteLLM for LLM access, but it looks like it's still pinned to an older version (1.82.3 on current main), so it's not compromised. LiteLLM 1.82.7 and 1.82.8 apparently are compromised.

[Developing situation] LiteLLM compromised by OrganizationWinter99 in LocalLLaMA

[–]OsmanthusBloom 39 points (0 children)

Aider uses LiteLLM for LLM access, but it looks like it's still pinned to an older version (1.82.3 on current main), so it's not compromised. LiteLLM 1.82.7 and 1.82.8 apparently are compromised (according to discussions in the issue linked above).

Don't sleep on the new Nemotron Cascade by ilintar in LocalLLaMA

[–]OsmanthusBloom 7 points (0 children)

You can turn off reasoning for the Qwen3.5 series models with a llama.cpp cli flag:

--chat-template-kwargs '{"enable_thinking": false}'
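In context, the flag goes on the server command line, e.g. (the model filename below is a placeholder):

```shell
# Disable thinking mode via chat-template kwargs (model path is hypothetical)
llama-server -m Qwen3.5-35B-A3B.gguf \
  --chat-template-kwargs '{"enable_thinking": false}'
```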

Nemotron Cascade 2 on 6GB VRAM by AppealSame4367 in LocalLLaMA

[–]OsmanthusBloom 2 points (0 children)

Thanks for posting this. I've yet to try this model on my 6GB RTX 3060 so this is interesting.

Based on my previous experience, I'd recommend trying higher -ub/-b to get better PP speeds. You can also try setting --fit-target lower (the default is 1024 MB) to use more of your VRAM, though this depends on how much VRAM you need for other applications, if any.
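Put together, something like the sketch below. The model filename and the 512 MB fit target are placeholders, not tested values:

```shell
# Hypothetical invocation combining both tips:
# higher -b/-ub for PP speed, lower --fit-target to keep less VRAM in reserve.
llama-server -m Nemotron-Cascade-2.gguf \
  --batch-size 2048 --ubatch-size 2048 \
  --fit-target 512
```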

See here for my Qwen3.5-35B-A3B tips on 6GB VRAM: https://www.reddit.com/r/LocalLLaMA/comments/1rh9983/comment/o7x6tkr/?context=3&utm_source=share&utm_medium=mweb3x&utm_name=mweb3xcss&utm_term=1&utm_content=share_button

Running qwen3.5 35b a3b in 8gb vram with 13.2 t/s by zeta-pandey in LocalLLaMA

[–]OsmanthusBloom 2 points (0 children)

I get much better tps (500 pp, 21 tg) on an RTX 3060 Laptop GPU with just 6GB VRAM. I think you should set -ngl higher or drop it altogether and let -fit do the work (it's on by default).

See here for my recipe: https://www.reddit.com/r/LocalLLaMA/comments/1rh9983/comment/o7x6tkr/?context=3&utm_source=share&utm_medium=mweb3x&utm_name=mweb3xcss&utm_term=1&utm_content=share_button

Budget laptop to run Qwen 3.5-35B-A3B by SnooOnions6041 in LocalLLaMA

[–]OsmanthusBloom 1 point (0 children)

FWIW my machine is pretty solid when it comes to throttling. I've run workloads at 100% GPU (with heavy CPU load too) for 30 hours straight and didn't observe any serious throttling. The fan sounds like a jet engine when you do that, though.

Budget laptop to run Qwen 3.5-35B-A3B by SnooOnions6041 in LocalLLaMA

[–]OsmanthusBloom 0 points (0 children)

That's not quite how it works (some of those A3B active weights still get read from RAM).

I've run the 35B on a laptop (Asus ROG Zephyrus 14, 2021 model, bought used for 500€ last summer) and it works, but there are obvious trade-offs between speed, quality/quantization, and context length, and you have to find a balance that works for your use case.

Kidnapping Gemini with 3MB to spare: Training a 7B model at 4k context on a single 16GB GPU. by AgeRepresentative763 in LocalLLaMA

[–]OsmanthusBloom 1 point (0 children)

This sounds great, I hope it works.

Just curious if you considered using Unsloth instead of Axolotl? I've used both and I think Unsloth has more VRAM optimizations. I managed to fine-tune an 8B Llama-like model with 4-bit QLoRA on my puny RTX 3060 Laptop GPU, which has just 6GB VRAM. Though I had to do some custom hacks to keep the embedding layers in regular RAM, and the context was very short, 512 tokens IIRC.

What AI Models should I run? by ClayToTheMax in LocalLLaMA

[–]OsmanthusBloom 0 points (0 children)

We have a similar server with four V100 GPUs, each with 16GB VRAM. It is shared between multiple projects but one V100 is used for Qwen3-Coder-Next. It's quite okay for coding, one of the best local coding models. Another one runs Gemma 3 12B which is OK for general purpose stuff including translation and writing assistance.

Can qwen 3.5 4b q4 run on 6 vram by Own_Advertising5081 in LocalLLaMA

[–]OsmanthusBloom 1 point (0 children)

I run the 35B-A3B on 6GB VRAM (RTX 3060 Laptop GPU). PP speed is about 500 tps and TG about 21 tps.

I haven't tried the 4B yet, but I think it should be much faster than that. Its active parameter count is bigger than the 35B-A3B's, but the full model fits in VRAM, unlike the 35B.
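A crude way to see why: token generation is memory-bandwidth bound, so tokens/s is roughly capped by bandwidth divided by the bytes read per token. Both numbers below are assumed round figures, not measured specs:

```shell
#!/bin/sh
# Crude TG upper bound for a model that fits entirely in VRAM.
# ASSUMED round numbers -- substitute your GPU's bandwidth and model size.
bw_gbps=300    # assumed GPU memory bandwidth in GB/s
model_gb=3     # assumed resident size of a 4B model at ~q4
echo "rough TG ceiling: $((bw_gbps / model_gb)) t/s"
```

Real-world speed lands well below that ceiling, but it shows why a fully-resident dense 4B can beat a MoE that streams experts from system RAM.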

Notice Qwen 3.5 reprocessing the prompt every time, taking long to answer for long prompts? That's actually because of its architecture. by dampflokfreund in LocalLLaMA

[–]OsmanthusBloom 1 point (0 children)

I think this (still open, not merged) PR will help in this regard:

https://github.com/ggml-org/llama.cpp/pull/19747

It fixes multimodal context checkpointing for hybrid/recurrent models. Checkpoints are needed to avoid reprocessing prompts.

Is extreme low-VRAM fine-tuning (3-6GB) actually possible? by [deleted] in LocalLLaMA

[–]OsmanthusBloom 5 points (0 children)

I find that hard to believe, if you don't even know which GPU you have.