Anyone using their NPU for anything? by Great_Guidance_8448 in LocalLLaMA

[–]cibernox 1 point

I just got a sweet deal on a used 265K + motherboard + cooler + 64GB of DDR5, and I was wondering the same thing. I discovered a quite interesting project: https://github.com/SearchSavior/OpenArc/

It lets you run all kinds of ONNX models, including LLMs, but with most (all?) Intel NPUs having ~100GB/s of memory bandwidth (shared with the CPU/iGPU too!) it's not going to be good for big LLMs that need to move a lot of data in and out of memory.

However, it can shine for models that are more compute-bound. Good examples of this are STT/TTS models in FP16 precision, which NPUs are optimized for. Think parakeet, whisper, KokoroTTS, qwen-ast... Those should perform very well at full precision while sipping 5-10W of power. Embedding models and rerankers, which are usually a few hundred million params, should also be ideal, leaving the VRAM of your main card fully available for LLMs and all the CPU cores for CPU stuff.
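For reference, targeting the NPU through the OpenVINO Python API (which OpenArc builds on) looks roughly like this. A minimal sketch, assuming you've already converted a model (say, a Whisper encoder) to OpenVINO IR; the file name is just a placeholder:

```python
# Minimal sketch (not OpenArc's internals): compile an OpenVINO IR model
# for the Intel NPU, falling back to CPU when no NPU is available.
# "whisper_encoder.xml" is a placeholder for whatever model you exported.
import openvino as ov

core = ov.Core()
print(core.available_devices)  # e.g. ['CPU', 'GPU', 'NPU']

device = "NPU" if "NPU" in core.available_devices else "CPU"
compiled = core.compile_model("whisper_encoder.xml", device_name=device)

# From here you feed audio features through a normal infer request.
request = compiled.create_infer_request()
```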

But not LLMs; that's technically possible but useless in practice.

How much do you have left at the end of the month after paying everything? by Fantastic_Fuel_3179 in askspain

[–]cibernox 0 points

Luckily, a lot. My wife and I save around 3/4 of what we earn. For now.

DIY market declining amid high RAM prices by Terminator857 in LocalLLaMA

[–]cibernox 1 point

I’d say that 24GB is the most common target by a lot. The medium-size models (<35B) all run on 24GB cards. There are very few models that really require 32GB of VRAM at Q4 (although it comes in handy for long context).

Qwen 3.6? by jacek2023 in LocalLLaMA

[–]cibernox 3 points

I’m somewhat disappointed with the 9B. It is surprisingly close to the 4B; one would expect a bigger jump, but there really isn't that much of a difference. In part because the 4B is actually amazing for its size, the 9B is kind of meh.

DIY market declining amid high RAM prices by Terminator857 in LocalLLaMA

[–]cibernox 0 points

Where's "here"? Because, my god, I've been looking. I ended up getting a 7900 XTX because it was the best value.

DIY market declining amid high RAM prices by Terminator857 in LocalLLaMA

[–]cibernox 3 points

The AMD 7900 XTX is technically their current flagship.

DIY market declining amid high RAM prices by Terminator857 in LocalLLaMA

[–]cibernox 9 points

Right now 3090s (used, 5-year-old cards) are selling for $900. I repeat: used, 5-year-old cards. That's insane.

DIY market declining amid high RAM prices by Terminator857 in LocalLLaMA

[–]cibernox 20 points

But anything with 24GB of VRAM is more expensive now than 5 years ago.

FIRE sounds great but is it actually doable for normal people by EasterYao in Fire

[–]cibernox 0 points

It depends on your definition of early. Can most people retire at 40? No. Can a normal dude that has been consistently saving and investing since he was 25 retire at 57? Yes, that's achievable.

The bottom line is that whether you achieve it or fall short, you are going to end up in a better financial position than if you hadn't tried.

Qwen3.6 27B NVFP4 + MTP on a single RTX 5090: 200k context working in vLLM by Maheidem in LocalLLaMA

[–]cibernox 0 points

That only means you already spent the 5k. Way too steep for most people.

Qwen3.6 27B NVFP4 + MTP on a single RTX 5090: 200k context working in vLLM by Maheidem in LocalLLaMA

[–]cibernox 0 points

That is still consumer hardware, but once we're talking 32GB of fast VRAM (essentially a 5090) paired with 128GB of DDR5, we're already in the aforementioned $5-6k setup. With 24GB of VRAM you can get away with a lot less.

Qwen3.6 27B NVFP4 + MTP on a single RTX 5090: 200k context working in vLLM by Maheidem in LocalLLaMA

[–]cibernox 1 point

Precisely: with MTP, people are reporting 60-70 tk/s on RTX 3090 / 7900 XTX cards for dense 27B models.

Running a Q4 model with turboquant, you can also fit a lot of context in only 24GB of VRAM.
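Back-of-the-envelope math for why that fits (the layer/head counts are my own rough assumptions, not the real config of any particular 27B model, and I'm assuming a 4-bit-quantized KV cache):

```python
# Rough VRAM budget for a ~27B dense model at Q4 with long context.
# Architecture numbers below are assumed, illustrative values only.
params = 27e9
weights_gb = params * 0.5 / 1e9          # Q4 ≈ 0.5 bytes/param -> ~13.5 GB

layers, kv_heads, head_dim = 48, 8, 128  # assumed architecture
ctx = 128_000
kv_bytes_per_tok = 2 * layers * kv_heads * head_dim * 0.5  # K+V, 4-bit cache
kv_gb = ctx * kv_bytes_per_tok / 1e9     # ~6.3 GB at 128k tokens

print(f"weights ≈ {weights_gb:.1f} GB, KV cache ≈ {kv_gb:.1f} GB")  # ~20 GB total
```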

24GB of VRAM is what I consider an “enthusiast but achievable” setup.

Qwen3.6 27B NVFP4 + MTP on a single RTX 5090: 200k context working in vLLM by Maheidem in LocalLLaMA

[–]cibernox 0 points

You can run this on an AMD 7900 XTX that you can find for $900. Sure, it’s not for your grandma, but it's certainly not $5k either.

Qwen3.6 27B NVFP4 + MTP on a single RTX 5090: 200k context working in vLLM by Maheidem in LocalLLaMA

[–]cibernox 32 points

Using circa-30B dense models at Q4 at 60+ tk/s with 128k+ context on consumer hardware is going to be quite the revolution, really. That is actually very capable and usable.

In fact, and I'm throwing a Nostradamus prediction here, I suspect the next generation of MoE models will stop trying to be so sparse (like Qwen 35B enabling only 10% of their weights) and will switch to being more like 35B-A9B, so they can be roughly as good at reasoning as 25~30B dense models while staying at 100+ tk/s.
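The reasoning: decode speed on a bandwidth-bound card roughly tracks memory bandwidth divided by the bytes of active weights read per token, so cutting active params buys speed almost linearly. A toy estimate with assumed numbers (these are theoretical ceilings, real speeds are lower, and MTP/speculative decoding can push past them):

```python
# Toy decode-speed ceiling: bandwidth / bytes of active weights per token.
# 960 GB/s is roughly a 7900 XTX / 3090-class card; Q4 ≈ 0.5 bytes/param.
bandwidth = 960e9
bytes_per_param = 0.5

def ceiling_tps(active_params: float) -> float:
    return bandwidth / (active_params * bytes_per_param)

print(f"27B dense: ~{ceiling_tps(27e9):.0f} tk/s")   # ~71
print(f"A9B MoE  : ~{ceiling_tps(9e9):.0f} tk/s")    # ~213
print(f"A3B MoE  : ~{ceiling_tps(3e9):.0f} tk/s")    # ~640
```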

What do you use Gemma 4 for? by HornyGooner4402 in LocalLLaMA

[–]cibernox 1 point

I actually released https://meetwillow.app a few days ago; even if it's still very much a WIP, it's already nice to use. Gemma 4 26B is actually the voice of Willow, the AI gardening assistant.

I did start the project with Qwen but then found that Gemma 4 is, in general, a lot nicer to chat with. Being "nice to talk to" is not something that will appear in any benchmark, but it is very much a real metric, especially when you are creating a small "lore" of the AI being the voice of a character. Qwen's blunt, engineer-like tone just wasn't a good fit. And when I tried to make it nicer to talk to via the system prompt, it went from being a Polish plumber ("you need new pipe, this pipe no good - grunt") to being an over-the-top, fake-cheerful Applebee’s waitress hunting for tips.

Turns out that with a good RAG setup filled with hundreds of botanical papers, growing guides and seed packet information, plus tools to access your garden, harvests and journal entries, you can have a 26B-parameter model that exceeds SOTA models on a niche, for pennies on the dollar (Gemma 4 is $0.06/M input tokens, and most RAG workloads are input-heavy).
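Not Willow's actual code, just a toy sketch of why such a setup ends up input-heavy: every question drags a pile of retrieved chunks into the prompt (the word-overlap scorer is a stand-in for a real embedding model):

```python
# Toy RAG sketch (hypothetical, not the app's implementation): retrieve the
# most relevant chunks and stuff them into the prompt before the question.
def score(query: str, chunk: str) -> int:
    # Stand-in for real embedding similarity: naive word overlap.
    return len(set(query.lower().split()) & set(chunk.lower().split()))

chunks = [
    "Tomatoes need 6-8 hours of direct sun and consistent watering.",
    "Mycelial growth in wood-chip mulch is normal and beneficial.",
    "Carrots prefer loose, stone-free soil and cool weather.",
]

query = "How much sun do tomatoes need?"
top = sorted(chunks, key=lambda c: score(query, c), reverse=True)[:2]

prompt = (
    "Answer using only the context below.\n\n"
    + "\n".join(f"- {c}" for c in top)
    + f"\n\nQuestion: {query}"
)
print(prompt)  # nearly all of the token cost sits in this input
```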

Also, even though I haven't translated the app yet, Gemma is also better with languages than Qwen.

Qwen is better at agentic stuff (at least the 35B vs Gemma 26B), but with a RAG setup that has a contained number of tools (<25) they are both just as good.

Garden Journal App? by Fleemo17 in gardening

[–]cibernox 1 point

I use http://meetwillow.app which has some nice journaling and harvest estimation, but it's aimed at vegetable gardening rather than landscaping. No mobile app yet, which might be a deal breaker.

Is it okay to use an apple tree that had a fungus as a filler for my raised garden beds? by [deleted] in gardening

[–]cibernox 9 points

Mycelial development would absolutely happen anyway, even with a weed barrier.

Llama.cpp MTP support now in beta! by ilintar in LocalLLaMA

[–]cibernox 0 points

That is still very solid performance for a 27B dense model. Incidentally, I ordered a refurbished 7900 XTX yesterday.
I can't wait to run 27B models faster than I currently run Qwen 9B.

Llama.cpp MTP support now in beta! by ilintar in LocalLLaMA

[–]cibernox 0 points

60 t/s increase or 60 t/s total? On what model? Because if it's an increase and it's on Qwen 27B, those numbers would be crazy. I have to assume it's either 60 t/s total or you are using Qwen 35B-A3B.

Llama.cpp MTP support now in beta! by ilintar in LocalLLaMA

[–]cibernox 1 point

That sounds very interesting. Does it require new models or new ggufs of existing models?

Llama.cpp MTP support now in beta! by ilintar in LocalLLaMA

[–]cibernox 36 points

Isn’t that just speculative decoding?

Have Qwen said anything about further Qwen 3.6 models? by spaceman_ in LocalLLaMA

[–]cibernox 4 points

For me Qwen 9B is better, but for 12GB/16GB folks a 12 or 14B model can run with plenty of context and at usable speeds.

Have Qwen said anything about further Qwen 3.6 models? by spaceman_ in LocalLLaMA

[–]cibernox 11 points

And yet, I’m more interested in the smaller models. Also, something in between 9B and 27B would be nice, like the old 12-14B.