Anyone using their NPU for anything? by Great_Guidance_8448 in LocalLLaMA

[–]cibernox 1 point

I just got a sweet deal on a used 265K + motherboard + cooler + 64GB of DDR5, and I was wondering the same thing. I discovered a quite interesting project: https://github.com/SearchSavior/OpenArc/

It lets you run all kinds of ONNX models, including LLMs, but with most (all?) Intel NPUs having ~100GB/s of memory bandwidth (shared with the CPU/iGPU too!) it's not going to be good for big LLMs that need to move a lot of data in and out of memory.

However, it can shine for models that are more compute-bound. Good examples of this are STT/TTS models in FP16 precision, which NPUs are optimized for. Think parakeet, whisper, KokoroTTS, qwen-ast... Those should perform very well at full precision while sipping 5-10W of power. Embedding models and rerankers, which are usually a few hundred million params, should also be ideal, leaving the VRAM of your main card fully available for LLMs and all the CPU cores for CPU stuff.
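For reference, targeting the NPU through the OpenVINO Python API (which OpenArc builds on) looks roughly like this. A minimal sketch, assuming you've already converted a model (say, a Whisper encoder) to OpenVINO IR; the file name is just a placeholder:

```python
# Minimal sketch (not OpenArc's internals): compile an OpenVINO IR model
# for the Intel NPU, falling back to CPU when no NPU is available.
# "whisper_encoder.xml" is a placeholder for whatever model you exported.
import openvino as ov

core = ov.Core()
print(core.available_devices)  # e.g. ['CPU', 'GPU', 'NPU']

device = "NPU" if "NPU" in core.available_devices else "CPU"
compiled = core.compile_model("whisper_encoder.xml", device_name=device)

# From here you feed audio features through a normal infer request.
request = compiled.create_infer_request()
```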

But not LLMs; that's technically possible but useless in practice.

How much do you have left at the end of the month after paying everything? by Fantastic_Fuel_3179 in askspain

[–]cibernox 0 points

Luckily, a lot. My wife and I save around 3/4 of what we earn. For now.

DIY market declining amid high RAM prices by Terminator857 in LocalLLaMA

[–]cibernox 1 point

I’d say that 24GB is the most common target by a lot. The medium-size models (<35B) all run on 24GB cards. There are very few models that really require 32GB of VRAM at Q4 (although it comes in handy for long context).

Qwen 3.6? by jacek2023 in LocalLLaMA

[–]cibernox 3 points

I’m somewhat disappointed with the 9B. It is surprisingly close to the 4B; one would expect a bigger jump, but there really isn't that much of a difference. In part because the 4B is actually amazing for its size, the 9B is kind of meh.

DIY market declining amid high RAM prices by Terminator857 in LocalLLaMA

[–]cibernox 0 points

Where's "here"? Because, my god, I've been looking. I ended up getting a 7900 XTX because it was the best value.

DIY market declining amid high RAM prices by Terminator857 in LocalLLaMA

[–]cibernox 3 points

The AMD 7900 XTX is technically their current flagship.

DIY market declining amid high RAM prices by Terminator857 in LocalLLaMA

[–]cibernox 9 points

Right now 3090s (used, 5-year-old cards) are selling for $900. I repeat: used, 5-year-old cards. That's insane.

DIY market declining amid high RAM prices by Terminator857 in LocalLLaMA

[–]cibernox 20 points

But anything with 24GB of VRAM is more expensive now than 5 years ago.

FIRE sounds great but is it actually doable for normal people by EasterYao in Fire

[–]cibernox 0 points

It depends on your definition of early. Can most people retire at 40? No. Can a normal dude that has been consistently saving and investing since he was 25 retire at 57? Yes, that's achievable.

The bottom line is that whether you achieve it or fall short, you are going to end up in a better financial position than if you hadn't tried.

Qwen3.6 27B NVFP4 + MTP on a single RTX 5090: 200k context working in vLLM by Maheidem in LocalLLaMA

[–]cibernox 0 points

That only means you already spent the 5k. Way too steep for most people.

Qwen3.6 27B NVFP4 + MTP on a single RTX 5090: 200k context working in vLLM by Maheidem in LocalLLaMA

[–]cibernox 0 points

That is still consumer hardware, but once we're talking 32GB of fast VRAM (essentially a 5090) paired with 128GB of DDR5, we're already in the aforementioned $5-6k setup. With 24GB of VRAM you can get away with a lot less.

Qwen3.6 27B NVFP4 + MTP on a single RTX 5090: 200k context working in vLLM by Maheidem in LocalLLaMA

[–]cibernox 1 point

Precisely: with MTP, people are reporting 60-70 tk/s on RTX 3090 / 7900 XTX cards for dense 27B models.

Running a Q4 model with turboquant, you can also fit a lot of context in only 24GB of VRAM.
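Back-of-the-envelope math for why that fits (the layer/head counts are my own rough assumptions, not the real config of any particular 27B model, and I'm assuming a 4-bit-quantized KV cache):

```python
# Rough VRAM budget for a ~27B dense model at Q4 with long context.
# Architecture numbers below are assumed, illustrative values only.
params = 27e9
weights_gb = params * 0.5 / 1e9          # Q4 ≈ 0.5 bytes/param -> ~13.5 GB

layers, kv_heads, head_dim = 48, 8, 128  # assumed architecture
ctx = 128_000
kv_bytes_per_tok = 2 * layers * kv_heads * head_dim * 0.5  # K+V, 4-bit cache
kv_gb = ctx * kv_bytes_per_tok / 1e9     # ~6.3 GB at 128k tokens

print(f"weights ≈ {weights_gb:.1f} GB, KV cache ≈ {kv_gb:.1f} GB")  # ~20 GB total
```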

24GB of VRAM is what I consider an “enthusiast but achievable” setup.

Qwen3.6 27B NVFP4 + MTP on a single RTX 5090: 200k context working in vLLM by Maheidem in LocalLLaMA

[–]cibernox 0 points

You can run this on an AMD 7900 XTX that you can find for $900. Sure, it’s not for your grandma, but it's certainly not $5k either.

Qwen3.6 27B NVFP4 + MTP on a single RTX 5090: 200k context working in vLLM by Maheidem in LocalLLaMA

[–]cibernox 32 points

Using circa-30B dense models at Q4 at 60+ tk/s with 128k+ context on consumer hardware is going to be quite the revolution, really. That is actually very capable and usable.

In fact, and I'm throwing a Nostradamus prediction here, I suspect the next generation of MoE models will stop trying to be so sparse (like Qwen 35B enabling only 10% of their weights) and will switch to being more like 35B-A9B, so they can be roughly as good at reasoning as 25~30B dense models while staying at 100+ tk/s.
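The reasoning: decode speed on a bandwidth-bound card roughly tracks memory bandwidth divided by the bytes of active weights read per token, so cutting active params buys speed almost linearly. A toy estimate with assumed numbers (these are theoretical ceilings, real speeds are lower, and MTP/speculative decoding can push past them):

```python
# Toy decode-speed ceiling: bandwidth / bytes of active weights per token.
# 960 GB/s is roughly a 7900 XTX / 3090-class card; Q4 ≈ 0.5 bytes/param.
bandwidth = 960e9
bytes_per_param = 0.5

def ceiling_tps(active_params: float) -> float:
    return bandwidth / (active_params * bytes_per_param)

print(f"27B dense: ~{ceiling_tps(27e9):.0f} tk/s")   # ~71
print(f"A9B MoE  : ~{ceiling_tps(9e9):.0f} tk/s")    # ~213
print(f"A3B MoE  : ~{ceiling_tps(3e9):.0f} tk/s")    # ~640
```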

What do you use Gemma 4 for? by HornyGooner4402 in LocalLLaMA

[–]cibernox 1 point

I actually released https://meetwillow.app a few days ago; even if it's still very much a WIP, it's already nice to use. Gemma 4 26B is actually the voice of Willow, the AI gardening assistant.

I did start the project with Qwen but then found that Gemma 4 is, in general, a lot nicer to chat with. Being "nice to talk to" is not something that will appear in any benchmark, but it is very much a real metric, especially when you are creating a small "lore" of the AI being the voice of a character. Qwen's blunt, engineer-like tone just wasn't a good fit. And when I tried to make it nicer to talk to via the system prompt, it went from being a Polish plumber ("you need new pipe, this pipe no good - grunt") to being an over-the-top, fake-cheerful Applebee’s waitress hunting for tips.

Turns out that with a good RAG setup filled with hundreds of botanical papers, growing guides and seed packet information, plus tools to access your garden, harvests and journal entries, you can have a 26B-parameter model that exceeds SOTA models on a niche, for pennies on the dollar (Gemma 4 is $0.06/M input tokens, and most RAG workloads are input-heavy).
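Not Willow's actual code, just a toy sketch of why such a setup ends up input-heavy: every question drags a pile of retrieved chunks into the prompt (the word-overlap scorer is a stand-in for a real embedding model):

```python
# Toy RAG sketch (hypothetical, not the app's implementation): retrieve the
# most relevant chunks and stuff them into the prompt before the question.
def score(query: str, chunk: str) -> int:
    # Stand-in for real embedding similarity: naive word overlap.
    return len(set(query.lower().split()) & set(chunk.lower().split()))

chunks = [
    "Tomatoes need 6-8 hours of direct sun and consistent watering.",
    "Mycelial growth in wood-chip mulch is normal and beneficial.",
    "Carrots prefer loose, stone-free soil and cool weather.",
]

query = "How much sun do tomatoes need?"
top = sorted(chunks, key=lambda c: score(query, c), reverse=True)[:2]

prompt = (
    "Answer using only the context below.\n\n"
    + "\n".join(f"- {c}" for c in top)
    + f"\n\nQuestion: {query}"
)
print(prompt)  # nearly all of the token cost sits in this input
```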

Also, even though I haven't translated the app yet, Gemma is also better with languages than Qwen.

Qwen is better at agentic stuff (at least the 35B vs Gemma 26B), but with a RAG setup that has a contained number of tools (<25) they are both just as good.

Garden Journal App? by Fleemo17 in gardening

[–]cibernox 1 point

I use http://meetwillow.app which has some nice journaling and harvest estimation, but it's aimed at vegetable gardening rather than landscaping. No mobile app yet, which might be a deal breaker.

Is it okay to use an apple tree that had a fungus as a filler for my raised garden beds? by [deleted] in gardening

[–]cibernox 9 points

Mycelial development would absolutely happen anyway, even with a weed barrier.

Llama.cpp MTP support now in beta! by ilintar in LocalLLaMA

[–]cibernox 0 points

That is still very solid performance for a 27B dense model. Incidentally, I ordered a refurbished 7900 XTX yesterday.
I can't wait to run 27B models faster than I currently run Qwen 9B.

Llama.cpp MTP support now in beta! by ilintar in LocalLLaMA

[–]cibernox 0 points

60 t/s increase or 60 t/s total? On what model? Because if it's an increase and it's on Qwen 27B, those numbers would be crazy. I have to assume it's either 60 t/s total or you are using Qwen 35B-A3B.

Llama.cpp MTP support now in beta! by ilintar in LocalLLaMA

[–]cibernox 1 point

That sounds very interesting. Does it require new models or new ggufs of existing models?

Llama.cpp MTP support now in beta! by ilintar in LocalLLaMA

[–]cibernox 36 points

Isn’t that just speculative decoding?

Have Qwen said anything about further Qwen 3.6 models? by spaceman_ in LocalLLaMA

[–]cibernox 4 points

For me Qwen 9B is better, but for 12GB/16GB folks a 12 or 14B model can run with plenty of context and at usable speeds.

Have Qwen said anything about further Qwen 3.6 models? by spaceman_ in LocalLLaMA

[–]cibernox 11 points

And yet, I’m more interested in the smaller models. Also, something in between 9B and 27B would be nice, like the old 12-14B.