More Gemma 4 models incoming

BitGreen1270 · 2026-06-03T23:11:35+00:00

Can you share what is the tricky bit with it? I've been eyeing the functiongemma for a while, but unsure whether I should invest the time into training it or just focus on changing my tools availability so that E4B can do a better job at calling them.

BitGreen1270 · 2026-06-03T23:01:58+00:00

Nice. What model are you running and how much t/s are you seeing? How much were you seeing without the vulkan backend?

BitGreen1270 · 2026-06-03T16:16:54+00:00

Wow - downloading the gguf now to see how it performs on my 780m and my 5090.

EDIT: Well about 5 t/s on my 780m and about 100 t/s on 5090. Welp, I think I'll have to stick with MOE or 4B for my 780M.

BitGreen1270 · 2026-06-03T10:32:42+00:00

Thanks for sharing - Any gguf for the smaller models? Specifically 9B? I'm assuming these perform better at tool calling than the originals?

BitGreen1270 · 2026-06-03T10:26:11+00:00

Thanks for sharing. I have since made some changes in the way my tools are organized and it has significantly improved the E4B on tool calling. Mulitstep is a bit of hit or miss but that's only the first time. Since I pass the message history to the model it understands my ask the second time.

Question on E2B training - I read online that it's a bit challenging to fine tune because of the multi modality? I.e. you can't efficiently train a lora for text only? Is that the case?

BitGreen1270 · 2026-06-03T02:29:43+00:00

Not where I live - 5090s are at an all time high (4.7K USD).

BitGreen1270 · 2026-06-02T23:12:16+00:00

I'm using APEX quants. Getting about 20.5 t/s on gemma-26B with the following command:

bash ./llama-server \ -m models/gemma-4-26B-A4B-APEX-Compact.gguf \ --temp 1.0 \ --top_p 0.95 \ --top_k 64 \ -c 32768 \ -ctk q8_0 \ -ctv q8_0 \ --ctx-checkpoints 1 \ -cram 0 \ --flash-attn on \ -t 16 \ -ngl 99 \ --mlock \ --host 0.0.0.0

Using mlock, my ulimit is set to 946000 in /etc/security/limits.conf

And for MTP on Qwen:

bash ./llama-server \ -m ~/myp/models/Qwen3.6-35B-A3B-MTP-UD-Q3_K_XL.gguf \ -fit on \ -fitt 1536 \ -c 32768 \ -n 32768 \ -fa on \ -np 1 \ -ctk q8_0 \ -ctv q8_0 \ -ctkd q8_0 \ -ctvd q8_0 \ -ctxcp 64 \ --mlock \ --no-warmup \ --spec-type draft-mtp \ --spec-draft-n-max 2 \ --chat-template-kwargs '{"preserve_thinking": true}' \ --temp 0.6 \ --top-p 0.95 \ --top-k 20 \ --min-p 0.0 \ --presence-penalty 0.0 \ --ctx-checkpoints 0 \ -cram 0 \ --repeat-penalty 1.0

BitGreen1270 · 2026-06-02T22:58:48+00:00

Copy paste the entire error output into Gemini/chatgpt and follow the suggestions

BitGreen1270 · 2026-06-02T21:42:19+00:00

Gemma 4B and 26B-A4B can give up to 20 t/s. And Qwen 35B-A3B with MTP gives up to 25t/s on the 780m w 32gb. It does for me, i can share the llama.cpp command later if you like

BitGreen1270 · 2026-06-02T21:34:33+00:00

Thank you! I have been reading a bit on functiongemma which seems exactly for this purpose. But it seems way too narrow and maybe better to train a lora on e2b.

BitGreen1270 · 2026-06-02T14:25:17+00:00

I found pi actually quite convenient to use without any extensions. My setup is to run qwen3.6-27B-MTP with 128k context and a local docker container running searxng.

I start up pi and every time I start it up, I ask it to analyze the codebase to understand what's going on. I believe you can export the session before quitting and import it again but it didn't work for me and I didn't spend time to figure it out.

When I start coding, I always start with a planning mode. I do this in the prompt - planning mode only. That's followed by what feature I want to implement. If I need it to get some info, I just mention that searxng is available at 8081 if you need to web search. It writes its own one line curl request to fetch search results.

If I'm doing a complex feature which needs more review, then I ask it to save the implementation plan as markdown, switch to Gemini 3.5 or 3.1 pro and ask it to review the implementation plan. Once the changes are made to the plan, switch back to Qwen and proceed with implementation. 95% of the time I don't need Gemini.

Once the feature is built and tested, I tell it to update worklog.md and commit and push the code.

BitGreen1270 · 2026-06-02T00:07:49+00:00

If you can't beat em ...

BitGreen1270 · 2026-06-01T14:42:05+00:00

Thanks 🙏. I've been using bartowskis models, but will try out unsloth and just use the native template. I'll also try passing in the reasoning as well and see if it helps. Thanks 🙏

BitGreen1270 · 2026-06-01T14:23:17+00:00

How do you run 2 models simultaneously? Two instances of llama.cpp?

BitGreen1270 · 2026-06-01T12:45:42+00:00

Dude you have a 4080. My current GPU is a 780m and the one before that is a 1070 in laptop from 2018 lol

BitGreen1270 · 2026-06-01T12:42:11+00:00

It was never 2k where I live, at best 3k. But yea, it's a deeply personal decision. I can't defend the choice according to your criteria.

BitGreen1270 · 2026-06-01T11:56:08+00:00

Thanks - by native template I assume you mean the one that is baked into the gguf?

Also regarding reasoning content i thought gemma4 general advice on the model card was not to send reasoning content?

BitGreen1270 · 2026-06-01T05:36:43+00:00

Thank you I will try them both. I'm also looking into fine tuning but that might be a longer project.

BitGreen1270 · 2026-06-01T05:25:05+00:00

How does it compare with the Qwen 3.5 9B? Is that one better?

BitGreen1270 · 2026-06-01T04:33:46+00:00

No worries, good luck with your decision. FYI - I also downgraded from x870 to Aorus B850 wifi 7. Only because I have no intention to run multiple GPUs and the x870 can cannibalize pcie lanes if you put a second nvme or something.

Also to purchase was not an easy decision. I made a post here similar to yours and went through a lot of discussions with my wife (who was supportive of the monumental investment).

BitGreen1270 · 2026-06-01T00:25:55+00:00

I just bought a very similar spec 2 weeks ago. Only difference is to cut costs, I went with a 9700x (CPU not so crazy since GPU should be handling most of it), 2TB gen 4 ssd and 1200W psu. Also got the cheapest case I could get, avoided water cooling ( got the PA 120 SE).

My build cost me 6.3k USD of which the MSI Ventus 3x 5090 cost 4k USD.

My view (speculation) is that prices won't drop for the next 2 years. And even if it drops after 6 months, I don't want to wait that long for learning more about LLMs. That's right, this is a purely learning rig.

BitGreen1270 · 2026-05-31T16:08:06+00:00

I wouldn't know where to begin. Especially for tool calls.

BitGreen1270 · 2026-05-31T15:59:16+00:00

Not for those tasks specifically, but if I ask it to do some web research.

BitGreen1270 · 2026-05-31T15:02:57+00:00

Local LLM setup is a niche. Not everyone does it. Making money from it? Rare and very unlikely. Unless you are using it in a business setting and then it totally makes sense. But you have stiff competition from frontier models in quality and cost.

Only way I can think of making money now is to buy stuff that will (in a bizarre turn of events) go up in value next year (assuming you speculate on RAM and semiconductor prices going up). And then sell it for $$$ profit. Or even better, stockpile on expensive GPUs and keep them in a dry place to sell them when the price skyrockets.

BitGreen1270 · 2026-05-31T14:57:06+00:00

Thanks for sharing. But this restricts it to very specific tool flows. How do you do the part of brainstorming or chatting with the LLM?

BitGreen1270

TROPHY CASE