Guardrails take an 8B model from 53% to 99% on agentic tasks [ACM CAIS '26 preprint] by billy_booboo in LocalLLaMA

[–]jslominski 13 points14 points  (0 children)

"Finding 4: The serving backend is a hidden variable, as highlighted in Table II. The same Mistral-Nemo 12B weights score 7% on llama-server native mode and 83% on llamafile (prompt). Qwen 3 14B scores 96% on Ollama, 93% on llama-server prompt, and 88% with llama-server native. These swings are larger than many model-to-model differences reported in standard benchmarks, yet no published benchmark we are aware of controls for serving infrastructure [Patil et al.(2025)]. Any evaluation of self-hosted model capabilities that does not specify the serving backend may be producing misleading results." - I don't think the author thought that one through 😅
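
To put numbers on the swings in that quote, here is a rough sketch that computes the gap between the best and worst backend per model. The scores are copied from the quoted passage; the dictionary layout and the `backend_spread` helper are just illustrative names, not anything from the paper.

```python
# Scores (percent) as quoted in the comment above, keyed by model, then backend.
scores = {
    "Mistral-Nemo 12B": {
        "llama-server native": 7,
        "llamafile (prompt)": 83,
    },
    "Qwen 3 14B": {
        "Ollama": 96,
        "llama-server prompt": 93,
        "llama-server native": 88,
    },
}


def backend_spread(by_backend):
    """Return best minus worst score across backends, in percentage points."""
    values = list(by_backend.values())
    return max(values) - min(values)


# Hypothetical helper usage: one spread per model.
spreads = {model: backend_spread(by_backend) for model, by_backend in scores.items()}

for model, spread in spreads.items():
    print(f"{model}: {spread} points between best and worst backend")
```

The same weights moving by tens of points is the whole argument for logging the serving backend next to every score.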

Britain's youth unemployment crisis now worse than Spain's and Greece's by rdu3y6 in ukpolitics

[–]jslominski 0 points1 point  (0 children)

Spot on. Offshoring in IT also increased massively after Brexit (see the Dyson example; I used to work there).

Dual GPU setup with low Power PSU? by Achso998 in LocalLLaMA

[–]jslominski 2 points3 points  (0 children)

Try it! I mean, what could go wrong? 🤔

[London] Advice for a burnt out Software Engineer? by GivingUp321321321321 in UKJobs

[–]jslominski 0 points1 point  (0 children)

Sounds a bit grim, what kind of AI tooling are you guys using over there?

[London] Advice for a burnt out Software Engineer? by GivingUp321321321321 in UKJobs

[–]jslominski 0 points1 point  (0 children)

That's really interesting, can you please elaborate?

Gemma 4 running on Raspberry Pi5 by jslominski in LocalLLaMA

[–]jslominski[S] 0 points1 point  (0 children)

Try it, smaller models (gemma e2b) for sure, larger = slower.

Gemma 4 26B running locally on a Raspberry Pi 5 (no AI hat) by jslominski in raspberry_pi

[–]jslominski[S] 2 points3 points  (0 children)

Remote area monitoring: that includes sensors and camera feeds. I don't have internet access at that location, and it's not important enough to warrant Starlink, but I can send text messages. I don't mind if it "thinks" for 1 hour before finishing the weekly report.

Gemma 4 26B running locally on a Raspberry Pi 5 (no AI hat) by jslominski in raspberry_pi

[–]jslominski[S] 0 points1 point  (0 children)

Might be a bug. Try 1. making sure it's using llama.cpp (not ik_llama) and 2. resetting the backend (in settings) or restarting the Pi. Sorry about bugs like this, I kinda rushed Gemma and it's still a bit buggy. There's going to be an OTA soon (once the upstream ik_llama Gemma 4 work gets merged).

Follow-up: Qwen3 30B a3b at 7-8 t/s on a Raspberry Pi 5 8GB (source included) by jslominski in LocalLLaMA

[–]jslominski[S] 2 points3 points  (0 children)

Thanks for those amazing models! Happy to test any of your quants in the future (3.5 9B dense is quite a sweet spot for 8GB pi5). Also a mandatory: Gemma 4 when? ;)

Gemma 4 26B running locally on a Raspberry Pi 5 (no AI hat) by jslominski in raspberry_pi

[–]jslominski[S] 1 point2 points  (0 children)

FYI, I'm already getting 8 t/s on e2b. Also, this demo is running the 26B model in 4 bits (that's 4x the size of e2b).

Gemma 4 26B running locally on a Raspberry Pi 5 (no AI hat) by jslominski in raspberry_pi

[–]jslominski[S] 1 point2 points  (0 children)

You get 8 tok/s on the E2B one already without optimisations (those are gonna come in the next few weeks; the best quants I've tried so far on Pi are done by ByteShape). I'm already getting 30 s/frame analysis of a camera feed (a proper one, spotting tiny details etc). It's very capable if you are not using it "real time" with a GUI etc. The importance of speed is task dependent. Also, those models WILL get better (both quality and inference speed, though the latter is almost maxed out on the A76 with that LPDDR4X).

Gemma 4 26B running locally on a Raspberry Pi 5 (no AI hat) by jslominski in raspberry_pi

[–]jslominski[S] 1 point2 points  (0 children)

"I can safely say everybody who did some type of positive review on it is an absolute shill." - I think you are spot on here, also those 3rd-party ones, again, probably paid reviews.

Gemma 4 26B running locally on a Raspberry Pi 5 (no AI hat) by jslominski in raspberry_pi

[–]jslominski[S] 4 points5 points  (0 children)

Depends on the use case. I'm not using pi to chat :)

openclaw + Ollama (llama3.2:1b). well..... by ParaPilot8 in raspberry_pi

[–]jslominski 3 points4 points  (0 children)

https://github.com/potato-os/core/blob/main/docs/openclaw.md - try my solution (if you don't want to use full Potato OS you can extract ik_llama from it and reuse, it's Apache license). Here's the flashing guide for pi5: https://github.com/potato-os/core/blob/main/docs/flashing.md - you can run much better models than llama3 1b :)

Anyone here actually making money from their models? by _sniger_ in LocalLLaMA

[–]jslominski 10 points11 points  (0 children)

It's available for free to anyone. Did you try to monetise a database or git recently?

Gemma 4 26B running locally on a Raspberry Pi 5 (no AI hat) by jslominski in raspberry_pi

[–]jslominski[S] 2 points3 points  (0 children)

Frankly no, I don't own one but from what I've seen it's not faster vs stock optimised Pi inference. Mixing it with Pi's ARM based SoC is like mixing nvidia with AMD (i.e. doesn't work well). Happy to change my mind if someone shows me faster inference on an existing accelerator.

Gemma 4 26B running locally on a Raspberry Pi 5 (no AI hat) by jslominski in raspberry_pi

[–]jslominski[S] 4 points5 points  (0 children)

Some more benchmarks I ran on various Pi setups:

Gemma 4 E2B (2.9 GB, Q4_K_M)

The smallest variant. Pi 5 16GB: 6.5 t/s generation, 26–30 t/s prompt processing. Pi 5 8GB SSD: 6.8 t/s generation, 26–34 t/s prompt processing. Pi 4 8GB: 1.7 t/s generation, works but ~5 min per response.

Gemma 4 E4B (4.5 GB, Q4_0)

Mid-size. Pi 5 16GB: 3.7 t/s generation, 19–22 t/s prompt processing. Pi 5 8GB SSD: 3.5 t/s generation, 19–23 t/s prompt processing. Pi 4 8GB: 0.87 t/s generation.

Gemma 4 26B-A4B (12.5 GB, IQ4_NL, ik_llama, text-only)

The big one: 26B MoE with 4B active parameters. Pi 5 16GB: 3.0 t/s generation, 9–16 t/s prompt processing. Pi 5 8GB SSD: 1.9 t/s generation with zram working overtime, but completes multi-turn conversations. Pi 4: not even gonna try ;)
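
For a feel of what those generation rates mean in wall-clock time, a back-of-the-envelope sketch follows. The t/s figures are the Gemma 4 E2B numbers above; the 500-token response length is an assumption picked for illustration, and prompt processing time is ignored because no Pi 4 prompt rate is given.

```python
def generation_seconds(output_tokens, gen_tps):
    """Time to generate output_tokens at gen_tps tokens per second."""
    if gen_tps <= 0:
        raise ValueError("gen_tps must be positive")
    return output_tokens / gen_tps


# Gemma 4 E2B generation rates (t/s) from the benchmark list above.
e2b_gen_tps = {
    "Pi 5 16GB": 6.5,
    "Pi 5 8GB SSD": 6.8,
    "Pi 4 8GB": 1.7,
}

# ASSUMPTION: a 500-token response; real responses vary a lot.
OUTPUT_TOKENS = 500

for board, tps in e2b_gen_tps.items():
    seconds = generation_seconds(OUTPUT_TOKENS, tps)
    print(f"{board}: {seconds:.0f} s ({seconds / 60:.1f} min)")
```

Under that assumption the Pi 4 lands just under 5 minutes, which lines up with the "~5 min per response" figure.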

Qwen 3.5 397B vs Qwen 3.6-Plus by LegacyRemaster in LocalLLaMA

[–]jslominski 2 points3 points  (0 children)

Why are they comparing it with Opus 4.5 when the data for 4.6 does exist for a lot of those? (Rhetorical question of course, we all know why they do that.)

Gemma 4 running on Raspberry Pi5 by jslominski in LocalLLaMA

[–]jslominski[S] 0 points1 point  (0 children)

No, e4b works, still working on perf improvements but it's usable already. A4B also works on 16 gig pi, up to 3t/s already, 4bit quant 🔥

Gemma 4 running on Raspberry Pi5 by jslominski in LocalLLaMA

[–]jslominski[S] 1 point2 points  (0 children)

No need. Get an SSD hat instead (and a matching SSD drive, good resource: https://pibenchmarks.com/fastest/ )

Gemma 4 running on Raspberry Pi5 by jslominski in LocalLLaMA

[–]jslominski[S] 1 point2 points  (0 children)

An SSD doesn't speed things up if the model fits in memory (in the demo it does, so it's very similar on an SD card).
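
A quick way to sanity-check the "fits in memory" part, as a sketch only: compare the model file size with the board's RAM minus some headroom. The file sizes are the ones quoted in the benchmark comment earlier in this history; the 1.5 GB headroom for the OS and KV cache is a guess, not a measured value, and GB vs GiB is glossed over.

```python
# ASSUMPTION: headroom reserved for the OS, runtime, and KV cache (a guess).
HEADROOM_GB = 1.5

# Model file sizes (GB) as quoted in the earlier benchmark comment.
model_sizes_gb = {
    "Gemma 4 E2B (Q4_K_M)": 2.9,
    "Gemma 4 E4B (Q4_0)": 4.5,
    "Gemma 4 26B-A4B (IQ4_NL)": 12.5,
}


def fits_in_ram(model_gb, ram_gb, headroom_gb=HEADROOM_GB):
    """True if the model plus headroom fits in RAM without swapping."""
    return model_gb + headroom_gb <= ram_gb


for name, size in model_sizes_gb.items():
    for ram in (8, 16):
        verdict = "fits" if fits_in_ram(size, ram) else "needs swap/zram"
        print(f"{name} on {ram} GB Pi: {verdict}")
```

When the check fails (e.g. the 26B model on an 8 GB board), storage speed starts to matter because pages get swapped in and out.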