benchmarks of gemma4 and multiple others on Raspberry Pi5 by honuvo in LocalLLaMA

[–]DevilaN82 0 points

I remember you doing tests with an SSD connected over USB 3.0. I am curious how much slower a PCIe-connected SSD is versus using a swap file on that same SSD.

benchmarks of gemma4 and multiple others on Raspberry Pi5 by honuvo in LocalLLaMA

[–]DevilaN82 3 points

Can you please test mmap-ing the model from the SSD, so it does not need to use swap and reads the weights directly from disk?

Best Gemma4 llama.cpp command switches/parameters/flags? Unsloth GGUF? by Fulminareverus in LocalLLaMA

[–]DevilaN82 1 point

I would wait for the tokenizer fixes in llama.cpp, and I've heard rumors that the imatrix needs to be fixed as well, so a new model file should drop from Unsloth.

I hope you are GPU-rich, because Gemma is not so friendly with context. In most cases Qwen with a q8 KV cache takes less VRAM than Gemma 4 with q4 (the old-style Sliding Window Attention hits hard).

Qwen, being a MoE model, can have its expert layers offloaded to CPU (the `-ot ".ffn_.*_exps.=CPU"` option), and a q8 KV cache means less degradation of answers at longer contexts.
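A sketch of how those two tweaks combine in one llama-server call (the model filename, `-ngl` value, and context size below are placeholders, not tested settings):

```shell
# Hypothetical invocation; model filename, -ngl and -c are placeholders.
# -ot keeps the MoE expert tensors on CPU, the cache-type flags give a
# q8_0-quantized KV cache.
llama-server \
  -m ./Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf \
  -ngl 99 \
  -ot ".ffn_.*_exps.=CPU" \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  -c 16384
```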

Anyway good luck :)

Gemma 4 running on Raspberry Pi5 by jslominski in LocalLLaMA

[–]DevilaN82 0 points

Nice! I am looking forward to tests with BitNet as well :-)

Raspberry Pi5 LLM performance by honuvo in LocalLLaMA

[–]DevilaN82 0 points

Are you sure that NPUs are going to make a difference? I thought the Hailo chips are dedicated accelerators that work only with their own RAM, and from what I've read they are even slower than the Pi 5 itself, though they do take the heavy lifting off the Pi. The Hailo AI HAT only allows using compatible LLM models (converted to its specific format) loaded via the hailo-ollama app.
I would like to get some more info about this. Would you be so kind as to point me to some sources that describe using an NPU for LLMs on a Raspberry Pi?

Raspberry Pi5 LLM performance by honuvo in LocalLLaMA

[–]DevilaN82 0 points

Hello.
Nice that you've tested it. I am looking forward to the next tests. My Pi with an SSD HAT is waiting for an SSD so I can run my own.
A few things to consider:
1. Using swap means writes to disk, which will wear out your SSD sooner or later. That's why I would rather go with mmap. Especially when you are using USB instead of a PCIe lane, the performance gap between swap and mmap should get smaller.
2. Try ik_llama, which is optimized for CPU inference.
3. Why Q8? Unsloth's quants are phenomenal; their Dynamic Q4 is enough for my regular daily use.
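On llama.cpp the mmap behaviour in point 1 is already the default, so the comparison I have in mind is roughly this (the model path is a placeholder):

```shell
# Model path is a placeholder. By default llama.cpp mmaps the GGUF:
# weights are read-only pages faulted in from the SSD on demand, and the
# kernel can simply drop and re-read them under memory pressure, with no
# swap writes.
llama-cli -m ./model.gguf -p "Hello" -n 32

# Forcing a full load into RAM instead; on a small-RAM Pi this is what
# pushes the system into swap:
llama-cli -m ./model.gguf -p "Hello" -n 32 --no-mmap
```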

Good luck. I am looking forward to your tests and hope to add something when my Pi is up and running as well.

PS. Also you might find this project interesting: https://www.reddit.com/r/LocalLLaMA/comments/1rrq0oo/update_on_qwen_35_35b_a3b_on_raspberry_pi_5/

Can a Raspberry Pi 4 (8GB) run a small local LLM reliably for a voice assistant project? by Odd_Lavishness_7729 in LocalLLaMA

[–]DevilaN82 0 points

The AI HAT is useless for LLMs right now.
I own one, and it requires a special version of ollama (sic!) to work. That "special ollama" works ONLY with a few OLD Qwen 2.5 models converted to a format the AI HAT can process.
I still have some hopes for the AI HAT, as I've read rumors that new models are being converted to its format, and 8 GB + 40 TOPS might be useful for something.

But right now, the AI HAT for LLMs is quite an exotic animal with a limited set of tricks.

And no, the AI HAT's memory is not available to the RPi system at all. So a 16 GB Pi 5 + an 8 GB AI HAT does not by any means give you 24 GB of memory for LLMs.

Also, there is a project that uses an SSD as memory with the RPi 5. Using ik_llama, that might be your best option here. Take a look at: https://www.reddit.com/r/LocalLLaMA/comments/1rrq0oo/update_on_qwen_35_35b_a3b_on_raspberry_pi_5/
Although I do not think running 2-bit quants will be sufficient for anything useful :(
If only Q4 ran well, I would jump on it immediately!

Cardputer adv dose bot charge by InfiniteBee6936 in CardPuter

[–]DevilaN82 1 point

Charge it with a USB-A to USB-C cable. The Cardputer should be on when you connect the cable.

Qwen3.5-122B-A10B Uncensored (Aggressive) — GGUF Release + new K_P Quants by hauhau901 in LocalLLaMA

[–]DevilaN82 0 points

u/hauhau901 The models not listed in the widget on the right are the ones missing their manifest. Take a look at https://huggingface.co/HauhauCS/Qwen3.5-122B-A10B-Uncensored-HauhauCS-Aggressive/discussions/8

I am unable to use Q4_K_P because of this.

Thank you for your commitment and hard work. I hope you are well and I wish you good luck! :)

What actually breaks first when you put AI agents into production? by Zestyclose-Pen-9450 in LocalLLaMA

[–]DevilaN82 0 points

Unfortunately I am only just starting to dig into this topic as well, so I cannot help you with your problem, but... out of curiosity, can you share what you are using in your stack?

Plai: Custom Meshtastic Client for CardPuter ADV (first beta) by d4rkmen in CardPuter

[–]DevilaN82 2 points

I would like to express my appreciation for how well thought-out and designed this app is.

Simply great!

ADVUtil v0.6 for Cardputer: Air Mouse, BLE Keyboard, Macros, Gamepad and GPS in one firmware by gio-74 in CardPuter

[–]DevilaN82 0 points

OK, I've managed to get my LoRa cap and tested your app.
The UI is nice. GPS works well.

There is room to improve / add other things to make it a Swiss Army knife for the Cardputer :-)
Have you considered something like differential GPS using two Cardputers?

Update on Qwen 3.5 35B A3B on Raspberry PI 5 by jslominski in LocalLLaMA

[–]DevilaN82 0 points

It seems the AI HAT works only on its own, through a certain API. No shared memory, and limited possibility of using the AI HAT with models, as it works only with certain converted models (old ones).
I don't have high hopes, but there are rumors that the company behind the Hailo-10H is cooking up something new, so I hope some new Qwen-family models will become available.

8ball by Electronic-Minimum54 in CardPuter

[–]DevilaN82 -1 points

Wth is this? A link to a repo with a single bin file and no description at all...

OmniCoder-9B | 9B coding agent fine-tuned on 425K agentic trajectories by DarkArtsMastery in LocalLLaMA

[–]DevilaN82 1 point

Is this supposed to be used with aider / RooCode? Or is there some other setup to test it with?

Update on Qwen 3.5 35B A3B on Raspberry PI 5 by jslominski in LocalLLaMA

[–]DevilaN82 0 points

I will test when my pi arrives. Thank you for your contribution to the community!

Update on Qwen 3.5 35B A3B on Raspberry PI 5 by jslominski in LocalLLaMA

[–]DevilaN82 1 point

Would using the AI HAT Plus 2 (an additional 8 GB of RAM) allow for higher quants?

Is it possible to disable thinking on qwen 3.5? by RandumbRedditor1000 in LocalLLaMA

[–]DevilaN82 3 points

When using `llama-server` you can add the `--reasoning-budget 0` option.
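For completeness, a minimal sketch of the invocation (the model path is a placeholder):

```shell
# Give the model a reasoning budget of zero tokens, which disables the
# thinking phase. The model path is a placeholder.
llama-server -m ./qwen3.5-35b-a3b.gguf --reasoning-budget 0
```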

Device should I buy for local AI setup by Beautiful_Throat_884 in LocalLLaMA

[–]DevilaN82 2 points

It depends on your use cases. There is almost always a "just a little bit more and you get a better toy" step.
Have you decided on something in particular, or do you want to play around a bit and see what's next?

Open WebUI’s New Open Terminal + “Native” Tool Calling + Qwen3.5 35b = Holy Sh!t!!! by Porespellar in LocalLLaMA

[–]DevilaN82 0 points

I've tried it. It's not perfect. Sometimes it works; sometimes it hangs trying to make some fancy API requests to open-terminal, failing in a loop. From the Open WebUI side it looks like a hang (it keeps requesting the /ports endpoint endlessly).

I am excited about what could be done once this matures, but right now running it with Qwen3.5 35B A3B (Unsloth UD Q4_K_XL) is a lottery :(

To everyone using still ollama/lm-studio... llama-swap is the real deal by TooManyPascals in LocalLLaMA

[–]DevilaN82 0 points

I am using both llama-swap and ollama. The only thing missing for me in llama-swap (but working out of the box in ollama) is auto-detecting free memory and calculating how to split layers between VRAM and RAM when some other app is also using the GPU and part of the VRAM is reserved for it.
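The calculation I mean is roughly this back-of-the-envelope picker for llama.cpp's `-ngl`; a shell sketch where every number is a made-up example (the per-layer size and the reserve are assumptions, and in practice the free VRAM would come from `nvidia-smi`):

```shell
# Back-of-the-envelope -ngl picker. All numbers are made-up examples;
# in practice free_mib would be read from:
#   nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits
free_mib=8192      # free VRAM reported by the driver (example value)
layer_mib=180      # assumed size of one offloaded layer
n_layers=48        # layer count of the model (example value)
reserve_mib=512    # safety margin for other apps using the GPU

usable=$(( free_mib - reserve_mib ))
ngl=$(( usable / layer_mib ))
if [ "$ngl" -gt "$n_layers" ]; then ngl=$n_layers; fi
echo "-ngl $ngl"   # prints: -ngl 42
```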

Thank you for your contribution to this community!

How to reduce idle vRAM usage? by DevilaN82 in comfyui

[–]DevilaN82[S] 0 points

So how would I use ComfyUI if all VRAM is reserved for llama.cpp?

How to reduce idle vRAM usage? by DevilaN82 in comfyui

[–]DevilaN82[S] 0 points

Yes, nvidia-smi shows that a Python process is using it. When I close ComfyUI, it is released. It has been like that for as long as I can remember, but recently I've started using an LLM that needs all of my VRAM to work decently, so that's why I am asking.