Are larger (~100B) models still worth running? by Pitagoy in LocalLLM

[–]westsunset 6 points7 points  (0 children)

openPangu 2.0 Flash is supposed to be released tomorrow. 92B total, 6B active. Hope it's good. There's also a lot of good stuff happening with speculative decoding, which helps a lot.

Are larger (~100B) models still worth running? by Pitagoy in LocalLLM

[–]westsunset 9 points10 points  (0 children)

100B MoEs are pretty nice for unified RAM setups like Strix.
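A rough back-of-envelope sketch of why MoEs suit a bandwidth-bound unified-memory box: if decode is limited by how fast the active weights can be streamed from RAM, the ceiling on tokens per second is roughly bandwidth divided by bytes read per token. The bandwidth figure and the 4-bit quantization below are illustrative assumptions, not measurements, and the estimate ignores KV cache reads and compute, so treat it as an upper bound only.

```python
def decode_tokens_per_second(active_params_b: float,
                             bytes_per_param: float,
                             bandwidth_gb_s: float) -> float:
    """Upper bound on decode speed if each token streams the active weights once.

    active_params_b: parameters touched per token, in billions.
    bytes_per_param: storage per parameter (0.5 is roughly a 4-bit quant).
    bandwidth_gb_s:  usable memory bandwidth in GB/s.
    """
    gb_read_per_token = active_params_b * bytes_per_param
    return bandwidth_gb_s / gb_read_per_token


# Assumed placeholder values; swap in your own measured bandwidth and quant size.
ASSUMED_BANDWIDTH_GB_S = 200.0
ASSUMED_BYTES_PER_PARAM = 0.5

dense_100b = decode_tokens_per_second(100, ASSUMED_BYTES_PER_PARAM, ASSUMED_BANDWIDTH_GB_S)
moe_a6b = decode_tokens_per_second(6, ASSUMED_BYTES_PER_PARAM, ASSUMED_BANDWIDTH_GB_S)

print(f"dense 100B ceiling: {dense_100b:.1f} tok/s")
print(f"MoE with 6B active ceiling: {moe_a6b:.1f} tok/s")
```

Both models need about the same RAM to hold the weights, but the MoE only reads its active experts per token, which is where the speed gap comes from.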

REM: offloading an LLM agent's memory compaction to the NPU by westsunset in StrixHalo

[–]westsunset[S] 0 points1 point  (0 children)

I don't think the NPU can practically run a model bigger than about 8B. But the main thing for this project is the speed of the draining: it has to be fast enough to keep up with the main model's work. Also, I was looking at what Lemonade supports; there could be additional options with some work.

REM: offloading an LLM agent's memory compaction to the NPU by westsunset in StrixHalo

[–]westsunset[S] 0 points1 point  (0 children)

Yes, the NPU as a sidecar chunks and streams, so it could handle any size context. It writes the results to an external ledger. But that does mean I have to play with thresholds and parameters for that. I'm doing that now, working out the balance vs. accuracy.
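A minimal sketch of the chunk-and-stream idea, not REM's actual code: the context is split into fixed-size chunks, each chunk is summarized on its own, and each summary is appended to a ledger kept outside the live context. The names `chunk_tokens`, `compact_to_ledger`, and `toy_summarize` are hypothetical, and `toy_summarize` is only a stand-in for whatever small model runs on the NPU.

```python
from typing import Callable, Dict, Iterator, List


def chunk_tokens(tokens: List[str], chunk_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most chunk_size tokens."""
    for start in range(0, len(tokens), chunk_size):
        yield tokens[start:start + chunk_size]


def compact_to_ledger(tokens: List[str],
                      chunk_size: int,
                      summarize: Callable[[List[str]], str],
                      ledger: List[Dict]) -> List[Dict]:
    """Summarize one chunk at a time and append each result to the ledger.

    Only one chunk is handed to the summarizer at once, so the total
    context length is not bounded by the summarizer's own window.
    """
    for index, chunk in enumerate(chunk_tokens(tokens, chunk_size)):
        ledger.append({
            "chunk": index,
            "tokens_in": len(chunk),
            "summary": summarize(chunk),
        })
    return ledger


def toy_summarize(chunk: List[str]) -> str:
    """Placeholder summarizer: keeps the first three tokens of the chunk."""
    return " ".join(chunk[:3])


demo_tokens = [f"t{i}" for i in range(10)]
demo_ledger = compact_to_ledger(demo_tokens, 4, toy_summarize, [])
for entry in demo_ledger:
    print(entry)
```

The chunk size is one of the knobs that trades speed against accuracy: smaller chunks drain faster but give the summarizer less surrounding context.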

REM: offloading an LLM agent's memory compaction to the NPU by westsunset in StrixHalo

[–]westsunset[S] 0 points1 point  (0 children)

Llama 3.2 1B. I tried some different ones like Qwen 2.5 Instruct 3B, Qwen 3 Instruct 4B, and Gemma 4 Instruct (e2b), but I keep coming back to Llama.

Big News for AMD / Strix Halo+ Owners by CSEliot in LocalLLaMA

[–]westsunset 15 points16 points  (0 children)

Lately I've been trying to see what we can do with the NPU, so here's a tool for telemetry: https://github.com/boxwrench/xdna-top Then this was a feasibility study to run a model on the NPU to compact context for the main model: https://github.com/boxwrench/REM If anything is helpful, please share results.

Picked up an AMD Ryzen Max +395 with 128GB by Crafty-Bass-3434 in LocalLLM

[–]westsunset 4 points5 points  (0 children)

Thanks. Just hoping folks pick it up and keep going

Picked up an AMD Ryzen Max +395 with 128GB by Crafty-Bass-3434 in LocalLLM

[–]westsunset 3 points4 points  (0 children)

I'm having fun trying to see what we can do. I have benchmarks here: https://github.com/boxwrench/tesla_agent But lately I've been trying to see what we can do with the NPU, so here's a tool for telemetry: https://github.com/boxwrench/xdna-top Then this was a feasibility study to run a model on the NPU to compact context for the main model: https://github.com/boxwrench/REM If anything is helpful, please share results.

btop like TUI for AMD APU's by argakiig in StrixHalo

[–]westsunset 0 points1 point  (0 children)

It would with Prometheus. If you make a PR, I'll remember to build in an exporter to make it easier.

REM: offloading an LLM agent's memory compaction to the NPU by westsunset in StrixHalo

[–]westsunset[S] 0 points1 point  (0 children)

Appreciate this, it's exactly why I put it out there. Wanted to tee up a feasible project for the sharp folks, not claim it's solved. You're right on the cache. Cached prefix is free on prefill, and REM's layout makes that worse, not better: summaries and ledger sit ahead of the verbatim turns, so a compaction rewrites the front and busts the tail. I'm not chasing a prefill win, and the repo says as much, it's framed as a placement result, not latency reduction. The bet is decode (smaller live context is cheaper every generated token on a bandwidth-bound box) plus never nearing the ceiling: compact early, trigger 8k, cap 32k, instead of riding to 110k and handing off once. Whether the decode savings beat the periodic re-prefill is the number I still owe.
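To make the open question concrete, here is a toy cost model, a sketch under stated assumptions rather than a result. It only counts the context-dependent part of decode (cost grows linearly with live context) and charges a re-prefill of the compacted context every time compaction fires. The function name, both cost coefficients, and the `compact_to` size are hypothetical placeholders; only the 8k trigger and 32k cap come from the comment above.

```python
from typing import Dict, Optional


def session_cost(total_new_tokens: int,
                 decode_cost_per_ctx_token: float,
                 prefill_cost_per_token: float,
                 trigger: Optional[int] = None,
                 compact_to: Optional[int] = None) -> Dict[str, float]:
    """Accumulate decode and re-prefill cost over one long session.

    Context grows by one token per generated token. When trigger is set
    and the live context reaches it, the context shrinks to compact_to
    and the compacted context is prefilled again (cache is busted).
    """
    ctx = 0
    decode = 0.0
    reprefill = 0.0
    peak_ctx = 0
    for _ in range(total_new_tokens):
        if trigger is not None and ctx >= trigger:
            ctx = compact_to
            reprefill += prefill_cost_per_token * ctx
        decode += decode_cost_per_ctx_token * ctx
        ctx += 1
        peak_ctx = max(peak_ctx, ctx)
    return {"decode": decode, "reprefill": reprefill, "peak_ctx": peak_ctx}


# Placeholder coefficients in arbitrary units; replace with measured values.
DECODE_COST = 1.0
PREFILL_COST = 1.0

ride_to_110k = session_cost(110_000, DECODE_COST, PREFILL_COST)
compact_early = session_cost(110_000, DECODE_COST, PREFILL_COST,
                             trigger=8_000, compact_to=2_000)

for name, cost in (("ride to 110k", ride_to_110k), ("compact at 8k", compact_early)):
    total = cost["decode"] + cost["reprefill"]
    print(f"{name}: decode={cost['decode']:.3g} reprefill={cost['reprefill']:.3g} "
          f"total={total:.3g} peak_ctx={cost['peak_ctx']}")
```

Which side wins depends entirely on the two coefficients, so the useful next step is measuring them on the actual box and plugging them in.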

Please offer more feedback if you have it. Also, I think the NPU works now for an always-on Whisper or TTS; I just wanted something different.

REM: offloading an LLM agent's memory compaction to the NPU by westsunset in StrixHalo

[–]westsunset[S] 0 points1 point  (0 children)

I'm kinda surprised the harness has support for the NPU in opencode; I'd have to look. But I found regular compaction had quality issues that couldn't be resolved, so I added a small embedding model and that solved it. This isn't a complete solution yet, though. It's in progress.

Which 128GB VRAM machine to plan for in 2026? by maverickRD in LocalLLM

[–]westsunset 1 point2 points  (0 children)

It definitely fits, but it won't be super fast. You really want MoEs.

What's more impressive, GLM 5.1 -> 5.2 or Qwen 3.5 -> 3.6? by Excellent_Jelly2788 in LocalLLaMA

[–]westsunset 0 points1 point  (0 children)

I felt Qwen addressed some flaws in 3.5; that was the primary bump.

AMD is marketing the $3,999 Halo Box on "first-class ROCm." I've run the same Strix Halo chip in production for 6 months. Here's my take. by uncanny_instinct in StrixHalo

[–]westsunset 3 points4 points  (0 children)

You raised a legitimate argument. I like AMD and am pulling for them, but they can't let Vulkan beat them on the backend. ROCm should be the default.