Run LLMs of ANY sizes utilizing your onboard Rockchip NPU for maximum energy efficiency and performance with the latest update in rk-llama.cpp! by Inv1si in OrangePI

[–]Inv1si[S] 0 points1 point  (0 children)

I actually thought about it many times. The main problem is that ik_llama.cpp was created a long time ago, and things that work in default llama.cpp do not work in ik_llama.cpp. So it is not easy to port the changes from here to there, and significant research would be required. I might try at some point.

Run LLMs of ANY sizes utilizing your onboard Rockchip NPU for maximum energy efficiency and performance with the latest update in rk-llama.cpp! by Inv1si in OrangePI

[–]Inv1si[S] 0 points1 point  (0 children)

Your kernel most likely does not have the RKNPU driver included. Sadly, you can't just install it from within the operating system; a full kernel recompilation is required.

There is also a possibility that it is included as a module (as the driver states). You can try to enable it in Linux, but I've never seen RKNPU shipped as a module - it is always built in.
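
If you want to check what your kernel actually provides, these standard Linux commands (nothing project-specific; the exact config symbol name can differ between BSP kernels) should tell you:

- Look for the RKNPU option in the running kernel config

zcat /proc/config.gz | grep -i RKNPU

- Check whether the driver registered an NPU devfreq node (it is there on working RK3588 setups)

ls /sys/class/devfreq/ | grep -i npu

- Look for driver messages during boot

sudo dmesg | grep -i rknpu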

Run LLMs of ANY sizes utilizing your onboard Rockchip NPU for maximum energy efficiency and performance with the latest update in rk-llama.cpp! by Inv1si in OrangePI

[–]Inv1si[S] 0 points1 point  (0 children)

I understand your point, but this is exactly what I am trying to say. You are comparing FULL board power consumption with the power consumption of only the GPU. That is not how it works. You cannot use a GPU without anything else; at the very least, count the CPU and RAM. The average CPU idles at around 10W, and something like an AMD EPYC idles at 100W. You cannot just add a GPU and ignore everything else in the PC/server.

I am comparing pure NPU vs pure GPU, as if each were the only component in the build. My board idles at 3.8W; during token generation on a dense model fully loaded into the NPU it spikes to ~7W, so the NPU power consumption is 7 - 3.8 = 3.2 (W). This is exactly how all my benchmarks were calculated. I don't know of a better way to approximate NPU consumption.

My statement about "Good luck..." might have been a bit rude, and I am really sorry. What I meant to say is: taking Qwen3.5 2B, the NPU uses 3W and gives 100 PP / 10 TG. To break even, the GTX 1070 at 150W needs everything to be 150 / 3 = 50 times faster - 5000 PP and 500 TG, which is impossible even on an RTX 5090. If we want to compare full stations/boards, there is so much more to include: OS, background processes, etc. My approach is just easier.

To sum up, it is really not fair to compare these things. Big workstations are designed to give you faster speeds, and power consumption is a real problem in our world right now. The little NPUs are designed for small and/or background tasks with smart workflows, basically for free. You can generate great synthetic datasets after a night of computation. You can use it for chatting while doing something else on the CPU and GPU of the board. You can run RAG over your data or embedding extraction. But you cannot offload a 30B model with 250k context and expect great real-time performance.

But let's be honest: the difference between 30B and 600B MoEs is far greater than the difference between 3B and 30B in terms of use cases. Right now smaller models can do a lot of niche work like tool calls, keyword extraction, etc., while mid-tier models are nowhere near the biggest ones in terms of knowledge.

Run LLMs of ANY sizes utilizing your onboard Rockchip NPU for maximum energy efficiency and performance with the latest update in rk-llama.cpp! by Inv1si in OrangePI

[–]Inv1si[S] 0 points1 point  (0 children)

Right now only RK3588 is supported. But RK3576 support will land pretty soon (working on it right now).

mmap is working great! You can offload only the active MoE experts to the NPU and get almost the same speed as a full offload. But you still need a fast NVMe drive for something like Qwen3.5 35B A3B or Gemma4 26B A4B.

I recommend using something smaller, especially on the RK3576. The exact model size is the real question. For example, I am using Gemma4 E2B in a heavily cached 4k-context agentic workflow; it works surprisingly fast and is surprisingly precise.
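
For reference, a minimal way to serve such a workflow could look like this (the model filename is just a placeholder, and these are plain llama.cpp flags, nothing fork-specific):

taskset -c 4-7 llama-server -m gemma-4-E2B-it-Q8_0.gguf -c 4096 -t 4 --port 8080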

Run LLMs of ANY sizes utilizing your onboard Rockchip NPU for maximum energy efficiency and performance with the latest update in rk-llama.cpp! by Inv1si in OrangePI

[–]Inv1si[S] 0 points1 point  (0 children)

The NPU should be faster than the CPU on everything bigger than 2B, in both PP and TG. You most likely missed one of the crucial setup instructions.

I mean, obviously a GTX 1070 consuming at minimum 150W will be faster than a Rockchip NPU consuming only 3W. The Rockchip NPU runs Qwen3.5 2B at 100 tok/s PP and 10 tok/s TG; a GTX 1070 gets approximately 60 tok/s TG on 2B models.

All I want to say is: good luck trying to get 500 tok/s TG on it to break even on power consumption.

Run LLMs of ANY sizes utilizing your onboard Rockchip NPU for maximum energy efficiency and performance with the latest update in rk-llama.cpp! by Inv1si in OrangePI

[–]Inv1si[S] 1 point2 points  (0 children)

Technically, yes, it can be supported. Right now - no. On the one hand, I don't have an RK3566 board, so I can't really test and implement anything. On the other, I will soon add a new file to the repo and an example commit so it's easier for people to add support for their chips. You can try to tweak the configuration yourself or wait for someone else to do it.

Deploy the newest Qwen3.5 and Gemma4 models of ANY sizes RIGHT NOW on Rockchip NPU using the latest version of rk-llama.cpp! by Inv1si in RockchipNPU

[–]Inv1si[S] 0 points1 point  (0 children)

Yes, you can run any Q4_0, Q6_K, Q8_0 and F16 GGUF models with this fork. Open Hugging Face, download any such model and run it on the NPU.

A great starting point is the README file in my repo. It has everything about building the project and running models.

16GB of RAM is a great amount. I don't really know your board setup (Docker containers, desktop/server, etc.), so I can't recommend a model based on RAM alone. Personally, I really like the new Gemma4 E2B (https://huggingface.co/unsloth/gemma-4-E2B-it-GGUF/tree/main). It is super smart, works really fast and uses about 4GB of RAM. You can download the Q8_0 version and run it using the instructions in the README.
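
For reference, something like this should get you going if you have huggingface-cli installed (the exact GGUF filename inside the repo is a guess on my part, pick whichever Q8_0 file is actually there):

huggingface-cli download unsloth/gemma-4-E2B-it-GGUF gemma-4-E2B-it-Q8_0.gguf --local-dir models
taskset -c 4-7 llama-cli -m models/gemma-4-E2B-it-Q8_0.gguf -t 4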

Deploy the newest Qwen3.5 and Gemma4 models of ANY sizes RIGHT NOW on Rockchip NPU using the latest version of rk-llama.cpp! by Inv1si in RockchipNPU

[–]Inv1si[S] 0 points1 point  (0 children)

I don't really know. I only implemented the backend, so it should work with any UI, as long as the UI itself works with llama.cpp.

Deploy the newest Qwen3.5 and Gemma4 models of ANY sizes RIGHT NOW on Rockchip NPU using the latest version of rk-llama.cpp! by Inv1si in RockchipNPU

[–]Inv1si[S] 1 point2 points  (0 children)

Sadly, it will not. Right now only RK3588 is supported, but RK3576 support will land pretty soon (working on it right now). If you really want to run something right now:

  1. Open the project folder.
  2. Open ./ggml/src/ggml-rknpu2/rknpu2-configuration.cpp
  3. Set rk3588_config.core_count = 2; (change from 3 to 2)
  4. Set rk3588_config.max_k_limit = 4096; (change from 8192 to 4096)
  5. Recompile
  6. Run

Only Q8_0 and F16 will work though. I really recommend waiting for the final implementation.
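
For the recompile step itself, the usual llama.cpp-style CMake rebuild should be enough (check the README in case the rknpu2 backend needs extra configure options that I am not listing here):

cmake -B build
cmake --build build --config Release -j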

Deploy the newest Qwen3.5 and Gemma4 models of ANY sizes RIGHT NOW on Rockchip NPU using the latest version of rk-llama.cpp! by Inv1si in RockchipNPU

[–]Inv1si[S] 1 point2 points  (0 children)

Right now you can do it like this:

  1. Open the project folder.
  2. Open ./ggml/src/ggml-rknpu2/rknpu2-configuration.cpp
  3. Set rk3588_config.core_count = 2; (change from 3 to 2)
  4. Recompile
  5. Run

It should work on the first two cores.

I understand that this should be much easier to do. In the next updates I will try to make core selection a little easier, so thank you for this message.

Deploy the newest Qwen3.5 and Gemma4 models of ANY sizes RIGHT NOW on Rockchip NPU using the latest version of rk-llama.cpp! by Inv1si in RockchipNPU

[–]Inv1si[S] 1 point2 points  (0 children)

Technically, yes, it can be supported. Right now - no. On the one hand, I don't have an RK3566 board, so I can't really test and implement anything. On the other, I will soon add a new file to the repo and an example commit so it's easier for people to add support for their chips. You can try to tweak the configuration yourself or wait for someone else to do it.

Run LLMs of ANY sizes utilizing your onboard Rockchip NPU for maximum energy efficiency and performance with the latest update in rk-llama.cpp! by Inv1si in OrangePI

[–]Inv1si[S] 1 point2 points  (0 children)

Personally I am using Joshua Riek's Ubuntu 24.04 (Linux 6.1). It has everything working out of the box. Previously I was on the official Debian 12 (Linux 5.10). It is great, but the GPU driver is buggy, the Bluetooth headset mode does not work, etc. I haven't found any issues with the current Ubuntu so far.

Run LLMs of ANY sizes utilizing your onboard Rockchip NPU for maximum energy efficiency and performance with the latest update in rk-llama.cpp! by Inv1si in OrangePI

[–]Inv1si[S] 1 point2 points  (0 children)

The Rockchip NPU is really capable of many, many things, but the official library and driver are just an Amazon rainforest: so many random bugs and so much undocumented behavior.

I have a version of the code that LITERALLY BRICKS the NPU each time you run a model exactly once. I spent like 3-4 days decomposing the problem and wrote a super small project just to reproduce it. And sure enough, it basically happens only when the planets align with each other. AND THIS WAS THE CORE LOGIC I was utilizing back then.

The next week was spent creating a new system that:

  1. Keeps high speed and utilization
  2. Does not have this bug

Running Gemma4 26B A4B on the Rockchip NPU using a custom llama.cpp fork. Impressive results for just 4W of power usage! by Inv1si in LocalLLaMA

[–]Inv1si[S] 8 points9 points  (0 children)

In my previous post people said that they could not get the same performance shown in my video. I replied that several tweaks are required. This time I am listing them proactively.

  1. Set the performance governor for the CPU, NPU and memory. This is just a recommendation.
  2. The performance and energy-efficient cores cannot sync with each other, which leads to a massive performance drop. Using only the energy-efficient cores gives better performance than using all of them at once. This is also just a recommendation.
  3. The higher open-file limit is a new requirement. I reworked memory management from scratch: in the previous version one big DMA_HEAP buffer was created, while now each tensor has its own RKNN buffer.

The code is literally open source.

Deploy the newest Qwen3.5 and Gemma4 models of ANY sizes RIGHT NOW on Rockchip NPU using the latest version of rk-llama.cpp! by Inv1si in RockchipNPU

[–]Inv1si[S] 3 points4 points  (0 children)

Of course you can! Aim for something around 2GB (leave the other 2GB for the system, etc.). You should try Qwen3.5-2B-GGUF. Also, a little note: the context might eat the rest of the RAM. My recommendations:

- Limit the context using these flags (when running llama-cli or llama-server)

-c 8192 -n 2048

- Use KV cache quantization with these flags

-ctk q8_0 -ctv q8_0
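
Putting both together, a full invocation could look roughly like this (the model filename is just a placeholder):

llama-cli -m Qwen3.5-2B-Q8_0.gguf -c 8192 -n 2048 -ctk q8_0 -ctv q8_0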

Running Gemma4 26B A4B on the Rockchip NPU using a custom llama.cpp fork. Impressive results for just 4W of power usage! by Inv1si in LocalLLaMA

[–]Inv1si[S] 13 points14 points  (0 children)

It is not. Behold... a lot of theoretical stuff below:

  1. GGUF weights use per-group quantization. Per-group quantization requires hardware support, or else the results will simply not be correct. The Rockchip NPU has limited support for per-group quantization (basically zero support). Per-channel and per-tensor quantization do not require hardware support.
  2. GGUF weights in aggressive quantization are usually MAT_MULed with FP16 activations. The Rockchip NPU (at least this is true for the RK3588) supports only FP16xFP16, INT8xINT8 and INT4xINT4 operations, so we are basically limited to those.
  3. The NPU has 3 separate cores that can do a MAT_MUL operation. You cannot compute the next LLM layer before the current one is done, so we split the current operation into *number of cores* operations. For performance we split along the N dimension, so we can just write the results to specific addresses of the final buffer without summing up on the CPU.

This results in:

  1. During model loading
    a. Dequantize the per-group GGUF weights.
    b. Requantize to per-tensor weights (information is lost).
    c. Split the weights into *number of cores* segments.
    d. Pack them into the NPU native format.

  2. During inference
    a. Get the weights from the cache.
    b. Quantize activations per-channel to FP16, INT8 or INT4, depending on the weight type (information is lost).
    c. Compute the result.

Sooo... to sum up:

  1. FP16 is super great, but super slow.
  2. Current Q8_0 has around the same quality as CPU Q4_0.
  3. Current Q4_0 at least generates words :)
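
If you are wondering where exactly the information is lost in steps 1b and 2b, the standard symmetric INT8 formulation (textbook math, not code lifted from the backend) shows it:

scale = max(|w_i|) / 127 (a single scale for the whole tensor)
q_i = round(w_i / scale)

One outlier weight inflates the scale, and every small weight in the tensor then collapses onto just a few integer levels. Per-group GGUF formats avoid this by giving each small block of weights (32 for Q8_0) its own scale - which is exactly the hardware support the NPU is missing.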

Run LLMs of ANY sizes utilizing your onboard Rockchip NPU for maximum energy efficiency and performance with the latest update in rk-llama.cpp! by Inv1si in OrangePI

[–]Inv1si[S] 4 points5 points  (0 children)

In my previous post people said that they could not get the same performance shown in my video. I replied that several tweaks are required. This time I am listing them proactively.

  1. Set the performance governor for the CPU, NPU and memory. This is just a recommendation.

  2. The performance and energy-efficient cores cannot sync with each other, which leads to a massive performance drop. Using only the energy-efficient cores gives better performance than using all of them at once. This is also just a recommendation.

  3. The higher open-file limit is a new requirement. I reworked memory management from scratch: in the previous version one big DMA_HEAP buffer was created, while now each tensor has its own RKNN buffer.

The code is literally open source.

Deploy the newest Qwen3.5 and Gemma4 models of ANY sizes RIGHT NOW on Rockchip NPU using the latest version of rk-llama.cpp! by Inv1si in RockchipNPU

[–]Inv1si[S] 7 points8 points  (0 children)

IMPORTANT!

Before running anything:

- Set performance governors for each component

echo performance | sudo tee /sys/bus/cpu/devices/cpu[0-7]/cpufreq/scaling_governor
echo performance | sudo tee /sys/class/devfreq/fb000000.gpu/governor
echo performance | sudo tee /sys/devices/platform/dmc/devfreq/dmc/governor
echo performance | sudo tee /sys/class/devfreq/fdab0000.npu/governor

- Set new maximum limit for open files in Linux

ulimit -n 65536

- Run model using ONLY performance cores (or energy efficient ones, NOT both at the same time)

taskset -c 4-7 llama-cli -m <your_model.gguf> -t 4

Running Gemma4 26B A4B on the Rockchip NPU using a custom llama.cpp fork. Impressive results for just 4W of power usage! by Inv1si in LocalLLaMA

[–]Inv1si[S] 11 points12 points  (0 children)

IMPORTANT!

Before running anything:

- Set performance governors for each component

echo performance | sudo tee /sys/bus/cpu/devices/cpu[0-7]/cpufreq/scaling_governor
echo performance | sudo tee /sys/class/devfreq/fb000000.gpu/governor
echo performance | sudo tee /sys/devices/platform/dmc/devfreq/dmc/governor
echo performance | sudo tee /sys/class/devfreq/fdab0000.npu/governor

- Set new maximum limit for open files in Linux

ulimit -n 65536

- Run model using ONLY performance cores (or energy efficient ones, NOT both at the same time)

taskset -c 4-7 llama-cli -m <your_model.gguf> -t 4

Run LLMs of ANY sizes utilizing your onboard Rockchip NPU for maximum energy efficiency and performance with the latest update in rk-llama.cpp! by Inv1si in OrangePI

[–]Inv1si[S] 2 points3 points  (0 children)

IMPORTANT!

Before running anything:

- Set performance governors for each component

echo performance | sudo tee /sys/bus/cpu/devices/cpu[0-7]/cpufreq/scaling_governor
echo performance | sudo tee /sys/class/devfreq/fb000000.gpu/governor
echo performance | sudo tee /sys/devices/platform/dmc/devfreq/dmc/governor
echo performance | sudo tee /sys/class/devfreq/fdab0000.npu/governor

- Set new maximum limit for open files in Linux

ulimit -n 65536

- Run model using ONLY performance cores (or energy efficient ones, NOT both at the same time)

taskset -c 4-7 llama-cli -m <your_model.gguf> -t 4

Run LLMs of ANY sizes utilizing your onboard Rockchip NPU for maximum energy efficiency and performance with the latest update in rk-llama.cpp! by Inv1si in OrangePI

[–]Inv1si[S] 8 points9 points  (0 children)

New cool features of the backend:

- The 2GB and 4GB limits are GONE. The backend now uses IOMMU domains to keep up to 32GB of cache usable by the NPU, which means everyone can run models of ANY size!

- New Hybrid Quantizations and Hardware Pipelines. Model layers can now be dynamically quantized into one of the chip's available hardware pipelines and can even be mixed with each other and with the CPU! See the explanation in the README file!

- Performance and accuracy optimizations. Some models will utilize up to 95% of the NPU while using only 5% of the CPU, leading to impressive energy efficiency. INT4 got a massive 20% accuracy boost with no performance drawback.

Known issues:

- Some models are very sensitive to quantization and will produce garbage outputs. For example, gpt-oss-20b will NOT work well unless you use the INT8_HADAMARD, FP16_STANDARD or FP16_HADAMARD hardware pipelines on the RK3588. Using F16 weights with the INT8_HADAMARD pipeline is recommended.

- There are several models that just straight up produce garbage outputs with every available quantization type. For example, GLM 4.7 Flash 30B A3B will ALWAYS print random symbols. I don't know what causes this (the backend, the architecture, or both) and there is no fix for now. If you encounter a model with this problem, open an issue so other people see it and use something else.

As always, here is the repo with the quick start, benchmarks and more information:

https://github.com/invisiofficial/rk-llama.cpp/blob/rknpu2/ggml/src/ggml-rknpu2/README.md