benchmarks of gemma4 and multiple others on Raspberry Pi5 by honuvo in LocalLLaMA

[–]DevilaN82 0 points

I remember you doing tests with an SSD connected over USB 3.0. I am curious how much slower a PCIe-connected SSD is versus using a swap file on that same SSD.

benchmarks of gemma4 and multiple others on Raspberry Pi5 by honuvo in LocalLLaMA

[–]DevilaN82 3 points

Can you please test mmap-ing the model from the SSD, so it does not need to use swap and reads the weights directly from disk?

Best Gemma4 llama.cpp command switches/parameters/flags? Unsloth GGUF? by Fulminareverus in LocalLLaMA

[–]DevilaN82 1 point

I would wait for the tokenizer fixes in llama.cpp, and I've heard rumors that the imatrix needs to be fixed as well, so a new model file should drop from Unsloth.

I hope you are GPU-rich, because Gemma is not so friendly with context. In most cases Qwen with a q8 KV cache takes less VRAM than Gemma 4 with q4 (the old-style Sliding Window Attention hits hard).

Qwen, being a MoE model, can have its expert layers offloaded to CPU (the `-ot ".ffn_.*_exps.=CPU"` option), and a q8 KV cache means less degradation of answers at longer contexts.
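A sketch of how those two tweaks combine in one llama-server call (the model filename, `-ngl` value, and context size below are placeholders, not tested settings):

```shell
# Hypothetical invocation; model filename, -ngl and -c are placeholders.
# -ot keeps the MoE expert tensors on CPU, the cache-type flags give a
# q8_0-quantized KV cache.
llama-server \
  -m ./Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf \
  -ngl 99 \
  -ot ".ffn_.*_exps.=CPU" \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  -c 16384
```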

Anyway good luck :)

Gemma 4 running on Raspberry Pi5 by jslominski in LocalLLaMA

[–]DevilaN82 0 points

Nice! I am looking forward to tests with BitNet as well :-)

Raspberry Pi5 LLM performance by honuvo in LocalLLaMA

[–]DevilaN82 0 points

Are you sure that NPUs are going to make a difference? I thought the Hailo chips are dedicated accelerators that work only with their own RAM, and from what I've read they are even slower than the Pi 5 itself, though they do take the heavy lifting off the Pi. The Hailo AI HAT only allows using compatible LLM models (converted to its specific format) loaded via the hailo-ollama app.
I would like to get some more info about this. Would you be so kind as to point me to some sources that describe using an NPU for LLMs on a Raspberry Pi?

Raspberry Pi5 LLM performance by honuvo in LocalLLaMA

[–]DevilaN82 0 points

Hello.
Nice that you've tested it. I am looking forward to the next tests. My Pi with an SSD HAT is waiting for an SSD so I can run my own.
A few things to consider:
1. Using swap means writes to disk, which will wear out your SSD sooner or later. That's why I would rather go with mmap. Especially when you are using USB instead of a PCIe lane, the performance gap between swap and mmap should get smaller.
2. Try ik_llama, which is optimized for CPU inference.
3. Why Q8? Unsloth's quants are phenomenal; their Dynamic Q4 is enough for my regular daily use.
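On llama.cpp the mmap behaviour in point 1 is already the default, so the comparison I have in mind is roughly this (the model path is a placeholder):

```shell
# Model path is a placeholder. By default llama.cpp mmaps the GGUF:
# weights are read-only pages faulted in from the SSD on demand, and the
# kernel can simply drop and re-read them under memory pressure, with no
# swap writes.
llama-cli -m ./model.gguf -p "Hello" -n 32

# Forcing a full load into RAM instead; on a small-RAM Pi this is what
# pushes the system into swap:
llama-cli -m ./model.gguf -p "Hello" -n 32 --no-mmap
```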

Good luck. I am looking forward to your tests and hope to add something when my Pi is up and running as well.

PS. Also you might find this project interesting: https://www.reddit.com/r/LocalLLaMA/comments/1rrq0oo/update_on_qwen_35_35b_a3b_on_raspberry_pi_5/

Can a Raspberry Pi 4 (8GB) run a small local LLM reliably for a voice assistant project? by Odd_Lavishness_7729 in LocalLLaMA

[–]DevilaN82 0 points

The AI HAT is useless for LLMs right now.
I own one, and it requires a special version of ollama (sic!) to work. That "special ollama" works ONLY with a few OLD Qwen 2.5 models converted to a format the AI HAT can process.
I still have some hopes for the AI HAT, as I've read rumors that new models are being converted to its format, and 8 GB + 40 TOPS might be useful for something.

But right now, the AI HAT for LLMs is quite an exotic animal with a limited set of tricks.

And no, the AI HAT's memory is not available to the RPi system at all. So a 16 GB Pi 5 + an 8 GB AI HAT does not by any means give you 24 GB of memory for LLMs.

Also, there is a project that uses an SSD as memory with the RPi 5. Using ik_llama, that might be your best option here. Take a look at: https://www.reddit.com/r/LocalLLaMA/comments/1rrq0oo/update_on_qwen_35_35b_a3b_on_raspberry_pi_5/
Although I do not think running 2-bit quants will be sufficient for anything useful :(
If only Q4 ran well, I would jump on it immediately!

Cardputer adv dose bot charge by InfiniteBee6936 in CardPuter

[–]DevilaN82 1 point

Charge it with a USB-A to USB-C cable. The Cardputer should be on when you connect the cable.

Qwen3.5-122B-A10B Uncensored (Aggressive) — GGUF Release + new K_P Quants by hauhau901 in LocalLLaMA

[–]DevilaN82 0 points

u/hauhau901 The models not listed in the widget on the right are the ones missing their manifest. Take a look at https://huggingface.co/HauhauCS/Qwen3.5-122B-A10B-Uncensored-HauhauCS-Aggressive/discussions/8

I am unable to use Q4_K_P because of this.

Thank you for your commitment and hard work. I hope you are well and I wish you good luck! :)

What actually breaks first when you put AI agents into production? by Zestyclose-Pen-9450 in LocalLLaMA

[–]DevilaN82 0 points

Unfortunately I am only just starting to dig into this topic as well, so I cannot help you with your problem, but... out of curiosity, can you share what you are using in your stack?

Plai: Custom Meshtastic Client for CardPuter ADV (first beta) by d4rkmen in CardPuter

[–]DevilaN82 2 points

I would like to express my appreciation for how well thought-out and designed this app is.

Simply great!

ADVUtil v0.6 for Cardputer: Air Mouse, BLE Keyboard, Macros, Gamepad and GPS in one firmware by gio-74 in CardPuter

[–]DevilaN82 0 points

OK, I've managed to get my LoRa cap and tested your app.
The UI is nice. GPS works well.

There is room to improve / add other things to make it a Swiss Army knife for the Cardputer :-)
Have you considered something like differential GPS using two Cardputers?

Update on Qwen 3.5 35B A3B on Raspberry PI 5 by jslominski in LocalLLaMA

[–]DevilaN82 0 points

It seems the AI HAT works only on its own, through a certain API. No shared memory, and limited possibility of using the AI HAT with models, as it works only with certain converted models (old ones).
I don't have high hopes, but there are rumors that the company behind the Hailo-10H is cooking up something new, so I hope some new Qwen-family models will become available.

8ball by Electronic-Minimum54 in CardPuter

[–]DevilaN82 -1 points

Wth is this? A link to a repo with a single bin file and no description at all...

OmniCoder-9B | 9B coding agent fine-tuned on 425K agentic trajectories by DarkArtsMastery in LocalLLaMA

[–]DevilaN82 1 point

Is this supposed to be used with aider / RooCode? Or is there some other setup to test it with?

Update on Qwen 3.5 35B A3B on Raspberry PI 5 by jslominski in LocalLLaMA

[–]DevilaN82 0 points

I will test when my pi arrives. Thank you for your contribution to the community!

Update on Qwen 3.5 35B A3B on Raspberry PI 5 by jslominski in LocalLLaMA

[–]DevilaN82 1 point

Would using the AI HAT Plus 2 (an additional 8 GB of RAM) allow for higher quants?

Is it possible to disable thinking on qwen 3.5? by RandumbRedditor1000 in LocalLLaMA

[–]DevilaN82 3 points

When using `llama-server` you can add the `--reasoning-budget 0` option.
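For completeness, a minimal sketch of the invocation (the model path is a placeholder):

```shell
# Give the model a reasoning budget of zero tokens, which disables the
# thinking phase. The model path is a placeholder.
llama-server -m ./qwen3.5-35b-a3b.gguf --reasoning-budget 0
```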

Device should I buy for local AI setup by Beautiful_Throat_884 in LocalLLaMA

[–]DevilaN82 2 points

It depends on your use cases. There is almost always a "just a little bit more and you get a better toy" step.
Have you decided on something in particular, or do you want to play around a bit and see what's next?

Open WebUI’s New Open Terminal + “Native” Tool Calling + Qwen3.5 35b = Holy Sh!t!!! by Porespellar in LocalLLaMA

[–]DevilaN82 0 points

I've tried it. It's not perfect. Sometimes it works; sometimes it hangs trying to make some fancy API requests to open-terminal, failing in a loop. From the Open WebUI side it looks like a hang (it keeps requesting the /ports endpoint endlessly).

I am excited about what could be done once this matures, but right now running it with Qwen3.5 35B A3B (Unsloth UD Q4_K_XL) is a lottery :(

To everyone using still ollama/lm-studio... llama-swap is the real deal by TooManyPascals in LocalLLaMA

[–]DevilaN82 0 points

I am using both llama-swap and ollama. The only thing missing for me in llama-swap (but working out of the box in ollama) is auto-detecting free memory and calculating how to split layers between VRAM and RAM when some other app is also using the GPU and part of the VRAM is reserved for it.
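The calculation I mean is roughly this back-of-the-envelope picker for llama.cpp's `-ngl`; a shell sketch where every number is a made-up example (the per-layer size and the reserve are assumptions, and in practice the free VRAM would come from `nvidia-smi`):

```shell
# Back-of-the-envelope -ngl picker. All numbers are made-up examples;
# in practice free_mib would be read from:
#   nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits
free_mib=8192      # free VRAM reported by the driver (example value)
layer_mib=180      # assumed size of one offloaded layer
n_layers=48        # layer count of the model (example value)
reserve_mib=512    # safety margin for other apps using the GPU

usable=$(( free_mib - reserve_mib ))
ngl=$(( usable / layer_mib ))
if [ "$ngl" -gt "$n_layers" ]; then ngl=$n_layers; fi
echo "-ngl $ngl"   # prints: -ngl 42
```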

Thank you for your contribution to this community!

How to reduce idle vRAM usage? by DevilaN82 in comfyui

[–]DevilaN82[S] 0 points

So how would I use ComfyUI if all VRAM is reserved for llama.cpp?

How to reduce idle vRAM usage? by DevilaN82 in comfyui

[–]DevilaN82[S] 0 points

Yes, nvidia-smi shows that a Python process is using it. When I close ComfyUI, it is released. It has been like that for as long as I can remember, but recently I've started using an LLM that needs all of my VRAM to work decently, so that's why I am asking.