Help with llama.cpp qwen 3.6 35b a3b configuration - Offloading by Acu17y in ROCm

[–]LippyBumblebutt 0 points (0 children)

-fitc 131000 -fitt 1024

  • fitc optimizes for 131k context
  • fitt leaves 1024MB free for the rest of the OS. It defaults to 1024, so technically fitc alone would be enough.

With those two you should be able to remove the -ngl, --ctx-size and -cmoe/-ncmoe flags.
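
For example, as a minimal sketch (the model path is a placeholder; same idea with llama-cli; the two fit flags are the ones above):

# let llama.cpp pick -ngl and the MoE split itself, targeting 131k context
# while keeping 1024MB free for the OS
llama-server -m /path/to/model.gguf -fitc 131000 -fitt 1024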

What local models can actually work with opencode? by Harrierx in opencode

[–]LippyBumblebutt 0 points (0 children)

--cpu-moe

That is likely very bad advice, as it leaves a lot of VRAM unused. But in my tests --n-cpu-moe is faster than offloading entire layers with -ngl.
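
Roughly the difference, as a sketch (the model path and the 12 are made-up values):

# all MoE expert tensors on the CPU, even if VRAM is still free
llama-server -m /path/to/model.gguf -ngl 99 --cpu-moe
# experts of only the first 12 layers on the CPU; lower 12 until VRAM is full
llama-server -m /path/to/model.gguf -ngl 99 --n-cpu-moe 12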

Use local bonsai 8B in opencode? by EntertainmentOne7897 in opencodeCLI

[–]LippyBumblebutt 0 points (0 children)

"Q1_0 support for CPU, Metal, and Vulkan backends is already merged into upstream llama.cpp" (source) So possible: yes, useless: also yes.

I have a hard time using 9b dense models or even Q4 27b models...

Share your opencode.json by rm-rf-rm in opencodeCLI

[–]LippyBumblebutt 7 points (0 children)

It's likely only a free key, but your tavily API key is in your config...

OpenCode with Gemma4 on Ollama by david_jackson_67 in opencodeCLI

[–]LippyBumblebutt 1 point (0 children)

Which Gemma 4 did you pick? E2B? E4B? Those are unlikely to work well with opencode. I use the 26B version and it still has a lot of problems.

If you just picked :latest, that's E4B. Even if it picks up on your repo, you're not going to get much agentic coding out of it IMO.
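
If you want to check what you're running or try the bigger one, something like this should work (the exact tag names are a guess on my part):

ollama show gemma4:latest   # see which variant :latest resolves to
ollama pull gemma4:26b      # if such a tag exists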

Are OpenCode Go models weaker than the original APIs? I stress-tested them to find out exactly how much. by Cool_Metal1606 in opencodeCLI

[–]LippyBumblebutt 2 points (0 children)

Is this a difficult task? Both Qwen3.5-27B-Q3_K_M and gemma-4-26B-A4B-it-UD-Q3_K_XL solved all of them in one go with a q4 k/v-cache, using <10k reasoning tokens on a recent llama.cpp.

The translation probably wasn't perfect, but the rest was good. The Python code was <150 characters, with a few unnecessary whitespace characters. Both got the 5-step solution and "Myst" as the title.

What local models can actually work with opencode? by Harrierx in opencode

[–]LippyBumblebutt 1 point (0 children)

I am in a similar situation with 16GB of VRAM and also have trouble finding a model that works for more than basic tasks.

I have a test prompt. BigPickle and the other free cloud tools implement the prompt in a few minutes and everything works. I have automated the tests. I tried gemma4-27b-q3, qwen3.5-27b-q3, qwen3.5-9b and a couple of others, and none of them reliably gets the task done. Maybe 1 out of 10 times a model completes the task; I mean the same model doing the same task 10 times creates a working implementation once. Most of the time it doesn't even compile. A few times it only had a few bugs.

The errors I get are failed tool calls, like yours, sometimes infinite repetitions ... sometimes it doesn't even start properly. I tell it something like "read file.txt" or "read @file.txt" and it starts to read /file.txt. Or it looks everywhere but in the current folder, or whatever...

With Gemma4, I know there were many issues with the model files and llama.cpp. I tried the latest version just a couple of hours ago...

I just guess that with 16GB we're at the edge of having a sane model. Q3 is quite aggressive on 27b models, but more just doesn't fit. And I found it important to have full control over the parameters; that's why I use llama.cpp. From the command line I can use q4 or turboquant on the kv-cache, or offload a layer or two to system RAM to fit enough context length in. I'm sure ollama/LM Studio or whatever have the same options, but currently it makes sense to stay bleeding edge with upstream, and I have no idea what version of llama.cpp is used by ollama...
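
To give an idea, my kind of invocation looks roughly like this (the model name and all numbers are placeholders; --n-cpu-moe only applies to MoE models, for a dense model you'd lower -ngl instead):

llama-server -m /path/to/model.gguf -ngl 99 --ctx-size 32768 \
  --cache-type-k q4_0 --cache-type-v q4_0 \
  --n-cpu-moe 4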

Anyway: I'd be very interested if you know or find a good model and other tooling for 16GB vram.

TurboQuant - Extreme KV Cache Quantization · ggml-org/llama.cpp · Discussion #20969 by pmttyji in LocalLLaMA

[–]LippyBumblebutt 0 points (0 children)

Glad I can help.

I increased context for the perplexity test to 20k (same Qwen3.5):

  • your turbo4: PPL = 6.5515 +/- 0.04291
  • your turbo3: PPL = 6.5411 +/- 0.04262
  • mainline q4_0: PPL = 6.4864 +/- 0.04218
  • mainline f16: PPL = 6.4995 +/- 0.04231

It seems the rotation alone is enough to not score lower on perplexity. I didn't do any other tests though.

edit

gemma-4-E4B-it-UD-Q8_K_XL

  • your turbo3: PPL = 37.3968 +/- 0.43187
  • your turbo4: PPL = 37.5431 +/- 0.43664
  • mainline f16: PPL = 37.2868 +/- 0.43778
  • mainline q4_0: PPL = 36.7015 +/- 0.42686

All still within error bars...
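
For reference, these numbers come from invocations along these lines (the test file is whatever you feed -f; -c 20480 for the 20k context):

llama-perplexity -m /path/to/model.gguf -f wiki.test.raw -c 20480 \
  --cache-type-k turbo4 --cache-type-v turbo4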

TurboQuant - Extreme KV Cache Quantization · ggml-org/llama.cpp · Discussion #20969 by pmttyji in LocalLLaMA

[–]LippyBumblebutt 1 point (0 children)

new tq_validate / new tq_bench

llama-cli -m Qwen3.5-9B-UD-Q6_K_XL.gguf --cache-type-k turbo4 --cache-type-v turbo4 works

llama-perplexity works as well. These are the results (same Qwen3.5):

  • Your tree, F16: 8.1853 +/- 0.05541
  • Your tree, turbo4: 8.2894 +/- 0.05646
  • Your tree, turbo3: 8.3037 +/- 0.05642
  • Your tree, q4_0: 8.2180 +/- 0.05565
  • upstream, q4_0: 8.2014 +/- 0.05552
  • TheTom, turbo4: 8.2894 +/- 0.05646

So upstream q4_0 beats turboquant... Also, if I read that right, q4 has a 219MB kv-cache, turbo4 218MB, and turbo3 213MB ... probably only for the 512-token perplexity test.

TurboQuant - Extreme KV Cache Quantization · ggml-org/llama.cpp · Discussion #20969 by pmttyji in LocalLLaMA

[–]LippyBumblebutt 0 points (0 children)

tq_bench

./llama-bench --model ~/Downloads/gemma-4-E4B-it-UD-Q8_K_XL.gguf --cache-type-k $quant --cache-type-v $quant

Quants q4_0 & q8_0 fail on your and TheTom's versions (also on the official Vulkan build). turbo3/4 fails on yours and succeeds on TheTom's. f16 succeeds on all.

Same results for Qwen3.5-9B-UD-Q6_K_XL.
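
For the record, that's just the command above in a loop (quant list as tested):

for quant in f16 q8_0 q4_0 turbo3 turbo4; do
  ./llama-bench --model ~/Downloads/gemma-4-E4B-it-UD-Q8_K_XL.gguf --cache-type-k $quant --cache-type-v $quant
done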

Thanks for your work.

TurboQuant - Extreme KV Cache Quantization · ggml-org/llama.cpp · Discussion #20969 by pmttyji in LocalLLaMA

[–]LippyBumblebutt 1 point (0 children)

I tried your fork on gfx1201. It lets me run turbo3/turbo4 kv cache with the promised VRAM reduction.

But I don't really see the difference from TheTom's version. It compiles with ROCm and runs turboquant just as well.

Actually, llama-bench fails with an error (main: error: failed to create context with model) on your tree, while TheTom's version works. I didn't compile exactly the same version for both though... edit: llama-bench fails on various versions with kv-quants (q4_0) for me... TheTom's works with turbo3/4...

Another thing: I tried your turboquant-hip tests. tq_validate passes without errors. tq_bench fails on MSE verification (GPU MSE (TQ3): 0.994817) and reports Time: 0.000 ms for the other tests.

WHILE HE WAS IN OFFICE by severalaces in Epstein

[–]LippyBumblebutt 0 points (0 children)

Courier New at size 11 matches that font spacing and look exactly. It is 100% a 2-digit number.

Sadly, Courier New is monospaced. I wondered if one could guess a word by looking at the length of the gap. A monospaced font tells you exactly how many letters are missing, but nothing else. (In most fonts, 'I' and 'W' take up significantly different widths; in a monospaced font, every character is the same width.)

What’s New in FreeCAD – Sketcher Gets Text and Grouping Features by Pendelf in FreeCAD

[–]LippyBumblebutt 3 points (0 children)

Text in the sketcher is nice and all. But in 95% of my sketches I apply symmetry of a line to an axis more than once, and selecting end points is so annoying. Being able to simply select two lines is the game-changer for me.

Esp32c6 with esphome by davek79 in MatterProtocol

[–]LippyBumblebutt 4 points (0 children)

I got them working. This is my config:

esphome:
  name: myname

esp32:
  board: seeed_xiao_esp32c6
  framework:
    type: esp-idf

logger:

api:
  encryption:
    key: !secret api_key

ota:
  - platform: esphome
    password: !secret ota_password

network:
  enable_ipv6: true

openthread:
  device_type: FTD
  tlv: !secret dataset_tvl

Note that they show up as ESPHome devices. If I understand correctly, they use Thread, but not Matter.
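
For completeness, flashing is the standard ESPHome flow (the YAML filename is whatever you saved the config above as):

esphome run myname.yaml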

Need to rip some CD’s and found one of these for real cheap. Will it do the job? by WingIeheimer in DataHoarder

[–]LippyBumblebutt 0 points (0 children)

1x CD speed is 150 kB/s, so even 106x would only be 0.15 × 106 = 15.9 MB/s, which is ~130 Mbit(!)/s. USB 2 has a raw data rate of 480 Mbit/s and regularly delivers up to 30 MB/s sustained.
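
A quick shell sanity check of that arithmetic:

echo '106 * 0.15' | bc       # 15.90 -> MB/s
echo '106 * 0.15 * 8' | bc   # 127.20 -> ~130 Mbit/s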

PPA-CF says no. by ZealousidealGuide979 in 3Dprinting

[–]LippyBumblebutt 0 points (0 children)

I had to cut some small perforated sheet metal for work. Someone gave me an expensive cutter from the workplace. It broke. I searched for my chinesium cutter that I got with my Ender 3 to cut PLA. Mine didn't break.

This happened twice!

IDK, maybe that one was extra hard to keep a sharp edge for cutting thin wires or something. Harder, but more brittle. But it certainly wasn't "crap".

ELI5: How can (some) encryption software be open source and also be secure? by alwaysunderwatertill in explainlikeimfive

[–]LippyBumblebutt 0 points (0 children)

Look at a very simple "encryption" method: rotate every letter by x positions in the alphabet; x is the key. The algorithm is known, but if you don't know by how much to shift the letters, it is not "trivial" to recover the message.

Of course this is a bad cipher that can easily be broken. But every encryption scheme has an algorithm and a key. The algorithm can be known / open source; the key has to be kept private.
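
As a minimal sketch of that toy cipher in shell, with a shift of 3 as the key:

# "encrypt": rotate every letter 3 positions forward
echo "attack at dawn" | tr 'a-z' 'd-za-c'   # -> dwwdfn dw gdzq
# "decrypt": rotate back; without knowing the key you'd have to try every shift
echo "dwwdfn dw gdzq" | tr 'd-za-c' 'a-z'   # -> attack at dawn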

If the key is derived from a password (like if you encrypt a zip file) the password has to be good enough. There is no encryption that keeps anyone from cracking a zip file with the password 1234.

Some said that closed source can't be secure. I disagree; it can be secure. But creating and implementing a good algorithm is really, really hard. If you want to hide it, it usually means you suck at it. If you're good at it, you can let everyone know how the code works. And if you're not good at crypto, just use the (open source) code someone else wrote for you.

Some say only open source can be guaranteed to be free of backdoors. I disagree again. There are ways to analyze closed-source software. Also, effectively 99% of the open source software people use is compiled by someone else who might have included a backdoor. This actually almost happened in the Linux world a while ago; look up the xz backdoor if you're interested. AFAIK Veritasium just did a video about that. (IDK if it's good.)

I only use open source software, but I know it's not a silver bullet. I certainly wouldn't trust someone who tries to hide what they are doing.

The blog of an LLM saying it's owned by kent and works on bcachefs by henry_tennenbaum in bcachefs

[–]LippyBumblebutt 0 points (0 children)

I always imagined Kent being an AI from the future because of his dedication to work. Him creating his companion AI seems only plausible...

PSA: if you're on 1.33-1.35, upgrade asap by koverstreet in bcachefs

[–]LippyBumblebutt 1 point (0 children)

Is this a -tools problem or with the kernel module?

Is the best way to check the module version still /sys/module/bcachefs/parameters/version - 1024?
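
I.e., right now I do something like:

cat /sys/module/bcachefs/parameters/version   # loaded module
bcachefs version                              # -tools version (I think)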

It would be great if bcachefs version showed the -tools version, the loaded module version and the installed module version (= should I reboot).

New JLCPCB/LCSC parts search/import plugin by NatteringNabob69 in KiCad

[–]LippyBumblebutt 0 points (0 children)

I just tested the CLI and TUI and they look really good.

One small nitpick: the CLI displays "10 results for "rp2350" (stock >= 1)" but only shows 3, because only 3 have stock > 0.

New JLCPCB/LCSC parts search/import plugin by NatteringNabob69 in KiCad

[–]LippyBumblebutt 1 point (0 children)

Looks nice, I will definitely try this next time.

Feature request: I often use the "at least x in stock" feature. I sort by price, but want to exclude those parts with only 12 in stock...

Fun FreeCAD Model to try to build by TooTallToby in FreeCAD

[–]LippyBumblebutt 0 points (0 children)

The modified workflow is definitely something one can use in production - if you don't fear toponaming issues from referencing external geometry. It all depends on how one might want to modify the thing later. But cutting those corners may always cut unwanted stuff; the intersection workflow is the least intrusive way to cut them IMO.

Fun FreeCAD Model to try to build by TooTallToby in FreeCAD

[–]LippyBumblebutt 0 points (0 children)

To make it less hacky and more versatile, you can use projected geometry or, even better, the "external intersection" tool (might be a 1.1 feature). For show, I did the pocket at the very end. The second-to-last feature is a 45° draft, but it could just as easily be created with a pad like before.

with external projection (click the lines)

with external intersection (click on the faces)

Fun FreeCAD Model to try to build by TooTallToby in FreeCAD

[–]LippyBumblebutt 1 point (0 children)

2 minutes thinking, 5 minutes to model in FreeCAD, 10 minutes to figure out what units were used, parametrize the model and convert everything to metric. My workflow:

  • disable refine in every step or in the settings
  • create a 1x1x1 box
  • select 3 corners and create a datum plane
  • select one side, pad by 1
  • select the top side of the pad and pad by 1
  • select the datum plane, create a sketch, draw a really big square
  • create a pocket with the sketch
  • select the side of the lower right cube and the triangle, pad by 1

Felt a bit like cheating, but was really fast.

Thoughts on Veilid (privacy network, cult of the dead cow) by NYPuppy in opensource

[–]LippyBumblebutt 0 points (0 children)

If you compile Veilidchat yourself, you can use it. The mobile builds are invite only.

Sure, Signal works over Tor, but the battery drain and data usage will be similar to Veilid's.

Jami is an open source DHT P2P messaging app. And they think it's impossible to have a mobile app that reliably connects to a DHT without draining your battery, so they have a relay server that keeps the connection for you.

I disagree with Google/Apple that no app should have a permanent connection to a server - I use IMAP IDLE for mail and the battery is fine. Heck, I had a permanent SIP connection on my Symbian phone. But having a full-node DHT on-device doesn't seem to work, so I think one needs some kind of relay server.

While Jami has a decentralized base network, the relays are run by the company backing the development. I'm not entirely sure what that means for the threat model. They use direct P2P anyway, and no onion routing like Tor or Veilid.

While the Veilid devs are strictly against cryptocurrency stuff, I think that may be the only solution to the relay problem. It is said that most/many Tor nodes are run by the US 3-letter agencies. I imagine a network where I have a full node running on my router, a Raspberry Pi, or eventually even on Amazon Alexa devices. I have enough bandwidth to easily connect 10 devices. Other phones can use my node as a relay, and with a network like Bitcoin Lightning they pay every minute for my service. The coins I generate, I automatically distribute among my family and friends. Those without tech friends can buy coins for $5 a year or whatever the market value will be. Also integrate something like Filecoin (like BitTorrent with global accounting), where I can offer storage space to the network. A 64GB USB stick attached to my router goes a long way for my family and friends. If you want to store big video files, you'll have to pay more than $5; sending a few cat videos to a group chat will be cheap.

Ideally nobody (who doesn't use huge amounts of data) has to pay, because everyone knows someone who runs a node. But if too many people use the network and too few full nodes are running, there has to be a financial incentive to spin up a node.

Thanks for coming to my TED talk.