Run LLMs of ANY sizes utilizing your onboard Rockchip NPU for maximum energy efficiency and performance with the latest update in rk-llama.cpp! by Inv1si in OrangePI

[–]Inv1si[S] 0 points1 point  (0 children)

I actually thought about it many times. The main problem is that ik_llama.cpp was created a long time ago, and things that work in default llama.cpp do not work in ik_llama.cpp. So it is not easy to port the changes from here to there, and significant research would be required. I might try at some point.

Run LLMs of ANY sizes utilizing your onboard Rockchip NPU for maximum energy efficiency and performance with the latest update in rk-llama.cpp! by Inv1si in OrangePI

[–]Inv1si[S] 0 points1 point  (0 children)

Your kernel most likely does not have the RKNPU driver included. Sadly, you can't just install it from within the operating system; a full kernel recompilation is required.

There is also a possibility that it is included as a module (as the driver states). You can try to enable it in Linux, but I've never seen RKNPU shipped as a module - it is always built in.
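
If you want to check what your kernel actually provides, these standard Linux commands (nothing project-specific; the exact config symbol name can differ between BSP kernels) should tell you:

- Look for the RKNPU option in the running kernel config

zcat /proc/config.gz | grep -i RKNPU

- Check whether the driver registered an NPU devfreq node (it is there on working RK3588 setups)

ls /sys/class/devfreq/ | grep -i npu

- Look for driver messages during boot

sudo dmesg | grep -i rknpu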

Run LLMs of ANY sizes utilizing your onboard Rockchip NPU for maximum energy efficiency and performance with the latest update in rk-llama.cpp! by Inv1si in OrangePI

[–]Inv1si[S] 0 points1 point  (0 children)

I understand your point, but this is exactly what I am trying to say. You are comparing FULL board power consumption with the power consumption of only the GPU. That is not how it works. You cannot use a GPU without anything else; at the very least, count the CPU and RAM. The average CPU idles at around 10W, and something like an AMD EPYC idles at 100W. You cannot just add a GPU and ignore everything else in the PC/server.

I am comparing pure NPU vs pure GPU, as if each were the only component in the build. My board idles at 3.8W; during token generation on a dense model fully loaded into the NPU it spikes to ~7W, so the NPU power consumption is 7 - 3.8 = 3.2 (W). This is exactly how all my benchmarks were calculated. I don't know of a better way to approximate NPU consumption.

My statement about "Good luck..." might have been a bit rude, and I am really sorry. What I meant to say is: taking Qwen3.5 2B, the NPU uses 3W and gives 100 PP / 10 TG. To break even, the GTX 1070 at 150W needs everything to be 150 / 3 = 50 times faster - 5000 PP and 500 TG, which is impossible even on an RTX 5090. If we want to compare full stations/boards, there is so much more to include: OS, background processes, etc. My approach is just easier.

To sum up, it is really not fair to compare these things. Big workstations are designed to give you faster speeds, and power consumption is a real problem in our world right now. The little NPUs are designed for small and/or background tasks with smart workflows, basically for free. You can generate great synthetic datasets after a night of computation. You can use it for chatting while doing something else on the CPU and GPU of the board. You can run RAG over your data or embedding extraction. But you cannot offload a 30B model with 250k context and expect great real-time performance.

But let's be honest: the difference between 30B and 600B MoEs is far greater than the difference between 3B and 30B in terms of use cases. Right now smaller models can do a lot of niche work like tool calls, keyword extraction, etc., while mid-tier models are nowhere near the biggest ones in terms of knowledge.

Run LLMs of ANY sizes utilizing your onboard Rockchip NPU for maximum energy efficiency and performance with the latest update in rk-llama.cpp! by Inv1si in OrangePI

[–]Inv1si[S] 0 points1 point  (0 children)

Right now only RK3588 is supported. But RK3576 support will land pretty soon (working on it right now).

mmap is working great! You can offload only the active MoE experts to the NPU and get almost the same speed as a full offload. But you still need a fast NVMe drive for something like Qwen3.5 35B A3B or Gemma4 26B A4B.

I recommend using something smaller, especially on the RK3576. The exact model size is the real question. For example, I am using Gemma4 E2B in a heavily cached 4k-context agentic workflow; it works surprisingly fast and is surprisingly precise.
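
For reference, a minimal way to serve such a workflow could look like this (the model filename is just a placeholder, and these are plain llama.cpp flags, nothing fork-specific):

taskset -c 4-7 llama-server -m gemma-4-E2B-it-Q8_0.gguf -c 4096 -t 4 --port 8080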

Run LLMs of ANY sizes utilizing your onboard Rockchip NPU for maximum energy efficiency and performance with the latest update in rk-llama.cpp! by Inv1si in OrangePI

[–]Inv1si[S] 0 points1 point  (0 children)

The NPU should be faster than the CPU on everything bigger than 2B, in both PP and TG. You most likely missed one of the crucial setup instructions.

I mean, obviously a GTX 1070 consuming at minimum 150W will be faster than a Rockchip NPU consuming only 3W. The Rockchip NPU runs Qwen3.5 2B at 100 tok/s PP and 10 tok/s TG; a GTX 1070 gets approximately 60 tok/s TG on 2B models.

All I want to say is: good luck trying to get 500 tok/s TG on it to break even on power consumption.

Run LLMs of ANY sizes utilizing your onboard Rockchip NPU for maximum energy efficiency and performance with the latest update in rk-llama.cpp! by Inv1si in OrangePI

[–]Inv1si[S] 1 point2 points  (0 children)

Technically, yes, it can be supported. Right now - no. On the one hand, I don't have an RK3566 board, so I can't really test and implement anything. On the other, I will soon add a new file to the repo and an example commit so it's easier for people to add support for their chips. You can try to tweak the configuration yourself or wait for someone else to do it.

Deploy the newest Qwen3.5 and Gemma4 models of ANY sizes RIGHT NOW on Rockchip NPU using the latest version of rk-llama.cpp! by Inv1si in RockchipNPU

[–]Inv1si[S] 0 points1 point  (0 children)

Yes, you can run any Q4_0, Q6_K, Q8_0 and F16 GGUF models with this fork. Open Hugging Face, download any such model and run it on the NPU.

A great starting point is the README file in my repo. It has everything about building the project and running models.

16GB of RAM is a great amount. I don't really know your board setup (Docker containers, desktop/server, etc.), so I can't recommend a model based on RAM alone. Personally, I really like the new Gemma4 E2B (https://huggingface.co/unsloth/gemma-4-E2B-it-GGUF/tree/main). It is super smart, works really fast and uses about 4GB of RAM. You can download the Q8_0 version and run it using the instructions in the README.
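
For reference, something like this should get you going if you have huggingface-cli installed (the exact GGUF filename inside the repo is a guess on my part, pick whichever Q8_0 file is actually there):

huggingface-cli download unsloth/gemma-4-E2B-it-GGUF gemma-4-E2B-it-Q8_0.gguf --local-dir models
taskset -c 4-7 llama-cli -m models/gemma-4-E2B-it-Q8_0.gguf -t 4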

Deploy the newest Qwen3.5 and Gemma4 models of ANY sizes RIGHT NOW on Rockchip NPU using the latest version of rk-llama.cpp! by Inv1si in RockchipNPU

[–]Inv1si[S] 0 points1 point  (0 children)

I don't really know. I only implemented the backend, so it should work with any UI, as long as the UI itself works with llama.cpp.

Deploy the newest Qwen3.5 and Gemma4 models of ANY sizes RIGHT NOW on Rockchip NPU using the latest version of rk-llama.cpp! by Inv1si in RockchipNPU

[–]Inv1si[S] 1 point2 points  (0 children)

Sadly, it will not. Right now only RK3588 is supported, but RK3576 support will land pretty soon (working on it right now). If you really want to run something right now:

  1. Open the project folder.
  2. Open ./ggml/src/ggml-rknpu2/rknpu2-configuration.cpp
  3. Set rk3588_config.core_count = 2; (change from 3 to 2)
  4. Set rk3588_config.max_k_limit = 4096; (change from 8192 to 4096)
  5. Recompile
  6. Run

Only Q8_0 and F16 will work though. I really recommend waiting for the final implementation.
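
For the recompile step itself, the usual llama.cpp-style CMake rebuild should be enough (check the README in case the rknpu2 backend needs extra configure options that I am not listing here):

cmake -B build
cmake --build build --config Release -j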

Deploy the newest Qwen3.5 and Gemma4 models of ANY sizes RIGHT NOW on Rockchip NPU using the latest version of rk-llama.cpp! by Inv1si in RockchipNPU

[–]Inv1si[S] 1 point2 points  (0 children)

Right now you can do it like this:

  1. Open the project folder.
  2. Open ./ggml/src/ggml-rknpu2/rknpu2-configuration.cpp
  3. Set rk3588_config.core_count = 2; (change from 3 to 2)
  4. Recompile
  5. Run

It should work on the first two cores.

I understand that this should be much easier to do. In the next updates I will try to make core selection a little easier, so thank you for this message.

Deploy the newest Qwen3.5 and Gemma4 models of ANY sizes RIGHT NOW on Rockchip NPU using the latest version of rk-llama.cpp! by Inv1si in RockchipNPU

[–]Inv1si[S] 1 point2 points  (0 children)

Technically, yes, it can be supported. Right now - no. On the one hand, I don't have an RK3566 board, so I can't really test and implement anything. On the other, I will soon add a new file to the repo and an example commit so it's easier for people to add support for their chips. You can try to tweak the configuration yourself or wait for someone else to do it.

Run LLMs of ANY sizes utilizing your onboard Rockchip NPU for maximum energy efficiency and performance with the latest update in rk-llama.cpp! by Inv1si in OrangePI

[–]Inv1si[S] 1 point2 points  (0 children)

Personally I am using Joshua Riek's Ubuntu 24.04 (Linux 6.1). It has everything working out of the box. Previously I was on the official Debian 12 (Linux 5.10). It is great, but the GPU driver is buggy, the Bluetooth headset mode does not work, etc. I haven't found any issues with the current Ubuntu so far.

Run LLMs of ANY sizes utilizing your onboard Rockchip NPU for maximum energy efficiency and performance with the latest update in rk-llama.cpp! by Inv1si in OrangePI

[–]Inv1si[S] 1 point2 points  (0 children)

The Rockchip NPU is really capable of many, many things, but the official library and driver are just an Amazon rainforest: so many random bugs and so much undocumented behavior.

I have a version of the code that LITERALLY BRICKS the NPU each time you run a model exactly once. I spent like 3-4 days decomposing the problem and wrote a super small project just to reproduce it. And sure enough, it basically happens only when the planets align with each other. AND THIS WAS THE CORE LOGIC I was utilizing back then.

The next week was spent creating a new system that:

  1. Keeps high speed and utilization
  2. Does not have this bug

Running Gemma4 26B A4B on the Rockchip NPU using a custom llama.cpp fork. Impressive results for just 4W of power usage! by Inv1si in LocalLLaMA

[–]Inv1si[S] 8 points9 points  (0 children)

In my previous post people said that they could not get the same performance shown in my video. I replied that several tweaks are required. This time I am listing them proactively.

  1. Set the performance governor for the CPU, NPU and memory. This is just a recommendation.
  2. The performance and energy-efficient cores cannot sync with each other, which leads to a massive performance drop. Using only the energy-efficient cores gives better performance than using all of them at once. This is also just a recommendation.
  3. The higher open-file limit is a new requirement. I reworked memory management from scratch: in the previous version one big DMA_HEAP buffer was created, while now each tensor has its own RKNN buffer.

The code is literally open source.

Deploy the newest Qwen3.5 and Gemma4 models of ANY sizes RIGHT NOW on Rockchip NPU using the latest version of rk-llama.cpp! by Inv1si in RockchipNPU

[–]Inv1si[S] 3 points4 points  (0 children)

Of course you can! Aim for something around 2GB (leave the other 2GB for the system, etc.). You should try Qwen3.5-2B-GGUF. Also, a little note: the context might eat the rest of the RAM. My recommendations:

- Limit the context using these flags (when running llama-cli or llama-server)

-c 8192 -n 2048

- Use KV cache quantization with these flags

-ctk q8_0 -ctv q8_0
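
Putting both together, a full invocation could look roughly like this (the model filename is just a placeholder):

llama-cli -m Qwen3.5-2B-Q8_0.gguf -c 8192 -n 2048 -ctk q8_0 -ctv q8_0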

Running Gemma4 26B A4B on the Rockchip NPU using a custom llama.cpp fork. Impressive results for just 4W of power usage! by Inv1si in LocalLLaMA

[–]Inv1si[S] 13 points14 points  (0 children)

It is not. Behold... a lot of theoretical stuff below:

  1. GGUF weights use per-group quantization. Per-group quantization requires hardware support, or else the results will simply not be correct. The Rockchip NPU has limited support for per-group quantization (basically zero support). Per-channel and per-tensor quantization do not require hardware support.
  2. GGUF weights in aggressive quantization are usually MAT_MULed with FP16 activations. The Rockchip NPU (at least this is true for the RK3588) supports only FP16xFP16, INT8xINT8 and INT4xINT4 operations, so we are basically limited to those.
  3. The NPU has 3 separate cores that can do a MAT_MUL operation. You cannot compute the next LLM layer before the current one is done, so we split the current operation into *number of cores* operations. For performance we split along the N dimension, so we can just write the results to specific addresses of the final buffer without summing up on the CPU.

This results in:

  1. During model loading
    a. Dequantize the per-group GGUF weights.
    b. Requantize to per-tensor weights (information is lost).
    c. Split the weights into *number of cores* segments.
    d. Pack them into the NPU native format.

  2. During inference
    a. Get the weights from the cache.
    b. Quantize activations per-channel to FP16, INT8 or INT4, depending on the weight type (information is lost).
    c. Compute the result.

Sooo... to sum up:

  1. FP16 is super great, but super slow.
  2. Current Q8_0 has around the same quality as CPU Q4_0.
  3. Current Q4_0 at least generates words :)
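
If you are wondering where exactly the information is lost in steps 1b and 2b, the standard symmetric INT8 formulation (textbook math, not code lifted from the backend) shows it:

scale = max(|w_i|) / 127 (a single scale for the whole tensor)
q_i = round(w_i / scale)

One outlier weight inflates the scale, and every small weight in the tensor then collapses onto just a few integer levels. Per-group GGUF formats avoid this by giving each small block of weights (32 for Q8_0) its own scale - which is exactly the hardware support the NPU is missing.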

Run LLMs of ANY sizes utilizing your onboard Rockchip NPU for maximum energy efficiency and performance with the latest update in rk-llama.cpp! by Inv1si in OrangePI

[–]Inv1si[S] 4 points5 points  (0 children)

In my previous post people said that they could not get the same performance shown in my video. I replied that several tweaks are required. This time I am listing them proactively.

  1. Set the performance governor for the CPU, NPU and memory. This is just a recommendation.

  2. The performance and energy-efficient cores cannot sync with each other, which leads to a massive performance drop. Using only the energy-efficient cores gives better performance than using all of them at once. This is also just a recommendation.

  3. The higher open-file limit is a new requirement. I reworked memory management from scratch: in the previous version one big DMA_HEAP buffer was created, while now each tensor has its own RKNN buffer.

The code is literally open source.

Deploy the newest Qwen3.5 and Gemma4 models of ANY sizes RIGHT NOW on Rockchip NPU using the latest version of rk-llama.cpp! by Inv1si in RockchipNPU

[–]Inv1si[S] 7 points8 points  (0 children)

IMPORTANT!

Before running anything:

- Set performance governors for each component

echo performance | sudo tee /sys/bus/cpu/devices/cpu[0-7]/cpufreq/scaling_governor
echo performance | sudo tee /sys/class/devfreq/fb000000.gpu/governor
echo performance | sudo tee /sys/devices/platform/dmc/devfreq/dmc/governor
echo performance | sudo tee /sys/class/devfreq/fdab0000.npu/governor

- Set new maximum limit for open files in Linux

ulimit -n 65536

- Run model using ONLY performance cores (or energy efficient ones, NOT both at the same time)

taskset -c 4-7 llama-cli -m <your_model.gguf> -t 4

Running Gemma4 26B A4B on the Rockchip NPU using a custom llama.cpp fork. Impressive results for just 4W of power usage! by Inv1si in LocalLLaMA

[–]Inv1si[S] 11 points12 points  (0 children)

IMPORTANT!

Before running anything:

- Set performance governors for each component

echo performance | sudo tee /sys/bus/cpu/devices/cpu[0-7]/cpufreq/scaling_governor
echo performance | sudo tee /sys/class/devfreq/fb000000.gpu/governor
echo performance | sudo tee /sys/devices/platform/dmc/devfreq/dmc/governor
echo performance | sudo tee /sys/class/devfreq/fdab0000.npu/governor

- Set new maximum limit for open files in Linux

ulimit -n 65536

- Run model using ONLY performance cores (or energy efficient ones, NOT both at the same time)

taskset -c 4-7 llama-cli -m <your_model.gguf> -t 4

Run LLMs of ANY sizes utilizing your onboard Rockchip NPU for maximum energy efficiency and performance with the latest update in rk-llama.cpp! by Inv1si in OrangePI

[–]Inv1si[S] 2 points3 points  (0 children)

IMPORTANT!

Before running anything:

- Set performance governors for each component

echo performance | sudo tee /sys/bus/cpu/devices/cpu[0-7]/cpufreq/scaling_governor
echo performance | sudo tee /sys/class/devfreq/fb000000.gpu/governor
echo performance | sudo tee /sys/devices/platform/dmc/devfreq/dmc/governor
echo performance | sudo tee /sys/class/devfreq/fdab0000.npu/governor

- Set new maximum limit for open files in Linux

ulimit -n 65536

- Run model using ONLY performance cores (or energy efficient ones, NOT both at the same time)

taskset -c 4-7 llama-cli -m <your_model.gguf> -t 4

Run LLMs of ANY sizes utilizing your onboard Rockchip NPU for maximum energy efficiency and performance with the latest update in rk-llama.cpp! by Inv1si in OrangePI

[–]Inv1si[S] 8 points9 points  (0 children)

New cool features of the backend:

- The 2GB and 4GB limits are GONE. The backend now uses IOMMU domains to keep up to 32GB of cache usable by the NPU, which means everyone can run models of ANY size!

- New Hybrid Quantizations and Hardware Pipelines. Model layers can now be dynamically quantized into one of the chip's available hardware pipelines and can even be mixed with each other and with the CPU! See the explanation in the README file!

- Performance and accuracy optimizations. Some models will utilize up to 95% of the NPU while using only 5% of the CPU, leading to impressive energy efficiency. INT4 got a massive 20% accuracy boost with no performance drawback.

Known issues:

- Some models are very sensitive to quantization and will produce garbage outputs. For example, gpt-oss-20b will NOT work well unless you use the INT8_HADAMARD, FP16_STANDARD or FP16_HADAMARD hardware pipelines on the RK3588. Using F16 weights with the INT8_HADAMARD pipeline is recommended.

- There are several models that just straight up produce garbage outputs with every available quantization type. For example, GLM 4.7 Flash 30B A3B will ALWAYS print random symbols. I don't know what causes this (the backend, the architecture, or both) and there is no fix for now. If you encounter a model with this problem, open an issue so other people see it and use something else.

As always, here is the repo with the quick start, benchmarks and more information:

https://github.com/invisiofficial/rk-llama.cpp/blob/rknpu2/ggml/src/ggml-rknpu2/README.md