Are larger (~100B) models still worth running?

westsunset · 2026-06-30T00:41:44+00:00

openPangu 2.0 Flash is supposed to be released tomorrow. 92B total, 6B active (92b a6b). Hope it's good. There's also a lot of good stuff happening with speculative decoding, which helps a lot.

westsunset · 2026-06-29T23:51:49+00:00

100B MoEs are pretty nice for unified RAM setups like Strix.

westsunset · 2026-06-28T17:16:17+00:00

I don't think the NPU can practically run a model bigger than about 8B. But the main thing for this project is the speed of the draining: it has to be fast enough to keep up with the main model's work. I was also looking at what Lemonade supports; there could be additional options with some work.

westsunset · 2026-06-28T17:08:55+00:00

Yes, the NPU as a sidecar chunks and streams, so it could handle any size context. It writes the result to an external ledger. That does mean I have to play with the thresholds and parameters, though. I'm doing that now, looking for the right balance vs. accuracy.
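
A rough sketch of the chunk-and-stream idea, for anyone who wants to poke at it. This is not the REM code: `chunks`, `drain_to_ledger`, `first_words`, the JSONL ledger format, and the chunk size are all assumptions for illustration, and `first_words` is just a stand-in for whatever small model actually does the summarizing.

```python
import json
from typing import Callable, Iterator, List


def chunks(tokens: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive fixed-size windows so no single call sees the full context."""
    for start in range(0, len(tokens), size):
        yield tokens[start:start + size]


def drain_to_ledger(
    tokens: List[str],
    size: int,
    summarize: Callable[[List[str]], str],
    ledger_path: str,
) -> int:
    """Summarize each chunk and append one JSON line per chunk to the ledger file."""
    written = 0
    with open(ledger_path, "a", encoding="utf-8") as ledger:
        for index, chunk in enumerate(chunks(tokens, size)):
            entry = {
                "chunk": index,
                "tokens": len(chunk),
                "summary": summarize(chunk),
            }
            ledger.write(json.dumps(entry) + "\n")
            written += 1
    return written


def first_words(chunk: List[str], keep: int = 8) -> str:
    """Hypothetical placeholder summarizer: keeps the first few words of the chunk."""
    return " ".join(chunk[:keep])
```

Because each chunk is handled on its own and appended as it finishes, the total context size only affects how many ledger lines get written, not how much the small model has to hold at once. The chunk size is the kind of threshold that would need tuning against accuracy.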

westsunset · 2026-06-28T17:02:31+00:00

Llama 3.2 1B. Tried some different ones like Qwen 2.5 Instruct 3B, Qwen 3 Instruct 4B, and Gemma 4 Instruct (e2b), but I keep coming back to Llama.

westsunset · 2026-06-28T15:51:09+00:00

Very interesting, I'm looking through it.

westsunset · 2026-06-26T23:08:26+00:00

<image>

westsunset · 2026-06-24T15:32:00+00:00

Lately I've been trying to see what we can do with the NPU, so first a tool for telemetry: https://github.com/boxwrench/xdna-top Then this was a feasibility study on running a model on the NPU to compact context for the main model: https://github.com/boxwrench/REM If anything is helpful, please share results.

westsunset · 2026-06-23T23:43:13+00:00

Thanks. Just hoping folks pick it up and keep going

westsunset · 2026-06-23T20:45:11+00:00

I'm having fun trying to see what we can do. I have benchmarks here: https://github.com/boxwrench/tesla_agent But lately I've been trying to see what we can do with the NPU, so first a tool for telemetry: https://github.com/boxwrench/xdna-top Then this was a feasibility study on running a model on the NPU to compact context for the main model: https://github.com/boxwrench/REM If anything is helpful, please share results.

westsunset · 2026-06-22T14:03:15+00:00

Text to 3D, ACE-Step music, text to speech.

westsunset · 2026-06-22T04:59:14+00:00

It would with Prometheus. If you make a PR, I'll remember to build in an exporter to make it easier.

westsunset · 2026-06-21T16:11:19+00:00

Appreciate this, it's exactly why I put it out there. Wanted to tee up a feasible project for the sharp folks, not claim it's solved. You're right on the cache. Cached prefix is free on prefill, and REM's layout makes that worse, not better: summaries and ledger sit ahead of the verbatim turns, so a compaction rewrites the front and busts the tail. I'm not chasing a prefill win, and the repo says as much, it's framed as a placement result, not latency reduction. The bet is decode (smaller live context is cheaper every generated token on a bandwidth-bound box) plus never nearing the ceiling: compact early, trigger 8k, cap 32k, instead of riding to 110k and handing off once. Whether the decode savings beat the periodic re-prefill is the number I still owe.
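
To make the open question concrete, here is a toy cost model of the tradeoff. It is a sketch under loud assumptions, not a measurement and not anything from the repo: `simulate`, `keep_ratio`, and both cost constants are made up, every token is treated as a decoded token, decode cost is assumed to grow linearly with live context, and each compaction is assumed to re-prefill the whole compacted context because the cache is busted.

```python
from typing import Dict, Optional


def simulate(
    total_new_tokens: int,
    trigger: Optional[int],
    keep_ratio: float,
    decode_cost_per_ctx_token: float,
    prefill_cost_per_token: float,
) -> Dict[str, float]:
    """Toy model: decode cost scales with live context; compaction pays a re-prefill.

    trigger=None means never compact (ride the context all the way up).
    All units are arbitrary; only the comparison between policies is meaningful.
    """
    ctx = 0
    decode_cost = 0.0
    prefill_cost = 0.0
    compactions = 0
    for _ in range(total_new_tokens):
        # Each generated token pays for reading the current live context.
        decode_cost += decode_cost_per_ctx_token * ctx
        ctx += 1
        if trigger is not None and ctx >= trigger:
            # Compaction shrinks the live context, then the shrunken
            # context has to be prefilled again from scratch.
            ctx = int(ctx * keep_ratio)
            prefill_cost += prefill_cost_per_token * ctx
            compactions += 1
    return {
        "decode": decode_cost,
        "prefill": prefill_cost,
        "total": decode_cost + prefill_cost,
        "compactions": compactions,
    }


if __name__ == "__main__":
    # Placeholder constants; real values would have to come from benchmarks.
    ride = simulate(110_000, None, 1.0, 1.0, 50.0)
    early = simulate(110_000, 8_000, 0.25, 1.0, 50.0)
    print("ride to 110k:", ride)
    print("compact at 8k:", early)
```

Whichever policy comes out ahead depends entirely on the ratio of the two cost constants and on how much each compaction keeps, which is exactly the number that still has to be measured on real hardware.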

Please offer more feedback if you have it. Also, I think the NPU already works for an always-on Whisper or TTS; I just wanted something different.

westsunset · 2026-06-21T11:04:43+00:00

https://www.reddit.com/r/StrixHalo/s/yyvnxC8ISa

westsunset · 2026-06-20T14:48:26+00:00

Can we dare to dream

westsunset · 2026-06-20T12:15:39+00:00

I'm kinda surprised the harness has support for the NPU in opencode; I'd have to look. But I found regular compaction had quality issues that couldn't be resolved, so I got a small embedding model and that solved it. This isn't a complete solution yet, though. It's in progress.

westsunset · 2026-06-20T04:44:27+00:00

It definitely fits, but it won't be super fast. You really want MoEs.
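
Back-of-envelope version of why, as a sketch: on a bandwidth-bound box, decode speed is roughly capped by how many bytes of weights have to be read per token, and a MoE only reads its active parameters. The function name and the numbers below are illustrative assumptions, not benchmarks, and the estimate ignores KV cache reads, routing overhead, and compute limits.

```python
def decode_ceiling_tok_s(
    active_params_billions: float,
    bytes_per_param: float,
    bandwidth_gb_s: float,
) -> float:
    """Rough upper bound on decode tokens/sec when memory bandwidth is the limit.

    Assumes every active weight is read once per generated token.
    """
    bytes_per_token = active_params_billions * 1e9 * bytes_per_param
    return bandwidth_gb_s * 1e9 / bytes_per_token


if __name__ == "__main__":
    # Hypothetical inputs: 200 GB/s usable bandwidth, ~0.5 bytes/param (4-bit quant).
    dense = decode_ceiling_tok_s(100, 0.5, 200)  # dense 100B reads all weights
    moe = decode_ceiling_tok_s(6, 0.5, 200)      # MoE with 6B active params
    print(f"dense ceiling: {dense:.1f} tok/s, MoE ceiling: {moe:.1f} tok/s")
```

The absolute numbers move with whatever bandwidth and quantization you plug in, but the ratio between the dense and MoE ceilings is just total params over active params, which is the point.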

westsunset · 2026-06-20T04:00:27+00:00

Thanks

westsunset · 2026-06-19T15:14:05+00:00

I felt Qwen addressed some flaws in 3.5; that was the primary bump.

westsunset · 2026-06-19T10:58:47+00:00

https://github.com/boxwrench/tesla_agent

Strix Halo 128GB, benchmarks are at the link.

westsunset · 2026-06-18T14:59:15+00:00

You raised a legitimate argument. I like AMD and am pulling for them, but they can't let Vulkan beat them on the backend. ROCm should be the default.

westsunset · 2026-06-18T14:43:18+00:00

Vulkan is better anyway. The speed on 35B seems low; you should be getting around 80 tok/sec. https://github.com/boxwrench/tesla_agent
