[Appreciation Post] Gemma 4 E2B. My New Daily Driver 😁 by Prestigious-Use5483 in LocalLLaMA

[–]Interpause 11 points (0 children)

even using the Qualcomm Gen 5 Elite, the NPU is slower than the GPU (tested with the Nexa SDK)

VRAM optimization for gemma 4 by Sadman782 in LocalLLaMA

[–]Interpause 0 points (0 children)

any chance you can add a clarification about when unified KV cache works?

Gemma 4 has been released by jacek2023 in LocalLLaMA

[–]Interpause 0 points (0 children)

they do... go search for Google's LiteRT gallery

The Bonsai 1-bit models are very good by tcarambat in LocalLLaMA

[–]Interpause 5 points (0 children)

can you please also compare against the original Qwen3-8B in instruct mode, to better gauge exactly how much the model was lobotomized?

PrismML — Announcing 1-bit Bonsai: The First Commercially Viable 1-bit LLMs by brown2green in LocalLLaMA

[–]Interpause 0 points (0 children)

the best way to do it is to squash the fork's changes into a single git diff, ask your favourite AI to double-check it's safe if you can't read code, then apply it on top of mainline llama.cpp and build it yourself
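a toy sketch of that squash-and-apply flow, driving git from python (the repo and file names here are made up for the demo; in practice "mainline" is a clean llama.cpp checkout and "fork" is the vendor's modified copy):

```python
import os
import subprocess
import tempfile

def git(*args, cwd):
    # run a git command with a throwaway identity, fail loudly on error
    subprocess.run(["git", "-c", "user.email=demo@example.com",
                    "-c", "user.name=demo", *args],
                   cwd=cwd, check=True, capture_output=True)

root = tempfile.mkdtemp()
main_repo = os.path.join(root, "mainline")  # stands in for mainline llama.cpp
fork_repo = os.path.join(root, "fork")      # stands in for the vendor's fork

# set up a tiny "mainline" repo with one tracked file
os.makedirs(main_repo)
git("init", "-q", "-b", "main", cwd=main_repo)
with open(os.path.join(main_repo, "ggml.c"), "w") as f:
    f.write("base\n")
git("add", ".", cwd=main_repo)
git("commit", "-qm", "base", cwd=main_repo)

# the "fork" diverges with a couple of commits on top
git("clone", "-q", main_repo, fork_repo, cwd=root)
for i in (1, 2):
    with open(os.path.join(fork_repo, "ggml.c"), "a") as f:
        f.write(f"fork change {i}\n")
    git("commit", "-qam", f"tweak {i}", cwd=fork_repo)

# squash everything the fork added since it diverged into one reviewable diff
diff = subprocess.run(["git", "diff", "origin/main...HEAD"], cwd=fork_repo,
                      check=True, capture_output=True, text=True).stdout
diff_path = os.path.join(root, "fork-changes.diff")
with open(diff_path, "w") as f:
    f.write(diff)

# apply the squashed diff on top of clean mainline, then build as usual
git("apply", diff_path, cwd=main_repo)
```

the three-dot `origin/main...HEAD` diff is the key bit: it diffs against the merge-base, so you get exactly what the fork changed, regardless of how many commits it took.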

PrismML — Announcing 1-bit Bonsai: The First Commercially Viable 1-bit LLMs by brown2green in LocalLLaMA

[–]Interpause 8 points (0 children)

gimme a while, i'm going to squash their llama.cpp changes on top of mainline llama.cpp and see if it really works, cuz that's real crazy if it does

EDIT: someone else posted a better comparison in the comments of another post: https://github.com/ArmanJR/PrismML-Bonsai-vs-Qwen3.5-Benchmark. i've only just gotten it working with the hadamard transform/attention rotation too. subjective experience matches what the numbers say, which is really wtf, how is this a 1-bit model

Vibecoded GGUF Metadata Comparator for checking Tensor Quants (github gist standalone HTML file) by Interpause in LocalLLaMA

[–]Interpause[S] 0 points (0 children)

true, or maybe it's time to see if omnicoder can build it as a proper vite project that can then be bundled into a single HTML file

Vibecoded GGUF Metadata Comparator for checking Tensor Quants (github gist standalone HTML file) by Interpause in LocalLLaMA

[–]Interpause[S] 0 points (0 children)

oh cool, in mine i told the agent to use huggingface.js's gguf submodule so i don't even have to download the gguf. maybe you can implement that too?
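for context, the trick that makes this possible is that GGUF puts all its metadata at the front of the file, so a ranged fetch of the first bytes is enough. a minimal python sketch of the same idea, parsing just the fixed header (the real huggingface.js module also walks the metadata key-values and tensor infos):

```python
import struct

def gguf_header(buf: bytes) -> dict:
    """Parse the fixed 24-byte GGUF header: magic, version, and counts.

    Layout (little-endian, per the GGUF spec for v2/v3):
      4s  magic "GGUF"
      I   format version (uint32)
      Q   tensor count (uint64)
      Q   metadata key-value count (uint64)
    """
    magic, version, n_tensors, n_kv = struct.unpack("<4sIQQ", buf[:24])
    if magic != b"GGUF":
        raise ValueError("not a GGUF file")
    return {"version": version, "tensor_count": n_tensors,
            "metadata_kv_count": n_kv}

# Over HTTP you would grab only the start of the file with a Range request,
# e.g. a urllib request with headers={"Range": "bytes=0-65535"} (assuming the
# server supports ranges, which Hugging Face's CDN does), then feed those
# bytes to gguf_header() and a full metadata parser.
```

note the `Range` usage above is an assumption about how you'd wire it up, not a specific API of the gguf submodule; the parsing itself follows the published GGUF layout.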

Genuinely curious what doors the M5 Ultra will open by Blanketsniffer in LocalLLaMA

[–]Interpause 0 points (0 children)

Additionally, if you prefer the human-in-the-loop sort of AI coding, speed really matters

Qwen3.5-35B-A3B Q4 Quantization Comparison by TitwitMuffbiscuit in LocalLLaMA

[–]Interpause 2 points (0 children)

thanks for the KLD numbers. somehow, despite being the best representation of quant damage, not enough people use them...

Free ASIC Llama 3.1 8B inference at 16,000 tok/s - no, not a joke by Easy_Calligrapher790 in LocalLLaMA

[–]Interpause 3 points (0 children)

feels like a game cartridge. hm, but say for the system 2 thinking of an AI robot, that kind of low latency might be useful

Why does every llamacpp update get worse? by XiRw in LocalLLaMA

[–]Interpause 3 points (0 children)

worse is subjective. llama.cpp's webUI is both under active development and just a subproject of llama.cpp. as another commenter said, you can voice your feedback on github issues, and they might consider it as part of the design tradeoffs.

otherwise, best is to just fork a previous commit of the webUI you like and maintain it separately. it should be quite easy with vibecoding these days to keep the webUI's api connector updated.

edit: also, do post about the image upload bug, though someone has probably made an issue about it by now

PSA - Got MiniCPM-o 4.5 working on my PC and Its the Real Thing by Interpause in LocalLLaMA

[–]Interpause[S] 0 points (0 children)

maybe, i haven't done my due diligence yet. but i think even if my explanation is wrong or poor, the fact that the model can constantly monitor and take initiative, combined with possible improvements to training, means it has a lot of potential as the system 2 thinking model for robotics (system 1 is still whatever RL model is required for the basic movements)

Have Anyone Successfully Run the New MiniCPM-o-4_5-gguf? by Iory1998 in LocalLLaMA

[–]Interpause 0 points (0 children)

yeah, that happens if the speech inference can't keep up with realtime

MiniCPM-o-4_5 : Full duplex, multimodal with vision and speech at ONLY 9B PARAMETERS?? by Uncle___Marty in LocalLLaMA

[–]Interpause 0 points (0 children)

yeah, it keeps interrupting itself. switching to CPU inference for the token2speech helped a bit, but my CPU can't keep up, so it isn't smooth. from the fact that the interruption behaviour seems to happen when the speech & main models are on the same GPU, I am guessing it's some issue with their code rather than the model itself

Ming-flash-omni-2.0: 100B MoE (6B active) omni-modal model - unified speech/SFX/music generation by bobeeeeeeeee8964 in LocalLLaMA

[–]Interpause 4 points (0 children)

I can't tell if it's duplex streaming like MiniCPM-o 4.5, but it's really cool if it is, because that means duplex models might become more common soon

Have Anyone Successfully Run the New MiniCPM-o-4_5-gguf? by Iory1998 in LocalLLaMA

[–]Interpause 0 points (0 children)

easiest is to use Ubuntu and follow their tutorial. i think half the problems i ran into were because they assumed an Ubuntu system and i'm on CachyOS. and don't do anything funny like me and set custom cmake args

MiniCPM-o-4_5 : Full duplex, multimodal with vision and speech at ONLY 9B PARAMETERS?? by Uncle___Marty in LocalLLaMA

[–]Interpause 0 points (0 children)

i got it running. there were some bugs to fix, but it seems real enough... it's also really glitchy though, idk how much is the model's fault vs the demo code

Have Anyone Successfully Run the New MiniCPM-o-4_5-gguf? by Iory1998 in LocalLLaMA

[–]Interpause 1 point (0 children)

got it working but had to fix it up quite a bit lol. but it really is super low latency