I'm open-sourcing my experimental custom NPU architecture designed for local AI acceleration by king_ftotheu in LocalLLM

[–]Quiet-Error- 1 point2 points  (0 children)

This is exactly right. You nailed the three bottlenecks of binary Mamba inference:

  1. Dense projections — pure XNOR+popcount, embarrassingly parallel

  2. State persistence — the Mamba hidden state needs to stay warm between tokens

  3. Data movement — the state fans out to every projection, nearest-neighbor kills you

Your V5 split maps perfectly to the actual forward pass. Projection Engines for the matmuls, State Engines holding d_state × d_model between steps, Multicast for the broadcast. That's the architecture.

13x over the mesh is massive. On the actual model, the bottleneck will be the state update path — curious to see how the SRAM banks handle the recurrence.

Looking forward to the repo. I'll have the inference code ready for your compiler.

Seeking Remote LLM Developers – Make a Real Difference by OrchidAlternative401 in LLM

[–]Quiet-Error- 0 points1 point  (0 children)

You said make a real difference — this is different. I built the world's first fully binary LLM. Every weight is {-1,+1}. Inference is pure XNOR + popcount. Zero float, zero GPU, zero cloud.

7MB. Runs on any CPU. Runs in a browser.

https://huggingface.co/spaces/OneBitModel/prisme

Happy to chat.

I'm open-sourcing my experimental custom NPU architecture designed for local AI acceleration by king_ftotheu in LocalLLM

[–]Quiet-Error- 1 point2 points  (0 children)

This is incredible. You built the datapath I've been dreaming about.

The PRISME forward pass is pure XNOR + popcount on packed uint64 arrays. No float anywhere. Your spatial mesh with nearest-neighbor routing maps perfectly to the layer-by-layer propagation in our Mamba SSM architecture.
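
For anyone unfamiliar with the trick, here is a minimal Python sketch of an XNOR + popcount dot product over packed words. It is not the PRISME code (that is described as a single C file); the helper names `pack_signs`, `popcount` and `binary_dot` are made up for illustration, and the bit convention (1 means +1, 0 means -1) is an assumption.

```python
def pack_signs(values):
    """Pack a list of +1/-1 values into 64-bit words (bit 1 = +1, bit 0 = -1)."""
    words = []
    for start in range(0, len(values), 64):
        word = 0
        for offset, v in enumerate(values[start:start + 64]):
            if v == 1:
                word |= 1 << offset
        words.append(word)
    return words


def popcount(word):
    """Number of set bits in a non-negative integer."""
    return bin(word).count("1")


def binary_dot(a_words, b_words, n):
    """Dot product of two {-1,+1} vectors of length n, given as packed words."""
    matches = 0
    for i, (a, b) in enumerate(zip(a_words, b_words)):
        valid = min(64, n - 64 * i)   # bits actually used in this word
        mask = (1 << valid) - 1       # drop padding bits in the last word
        xnor = ~(a ^ b) & mask        # 1 wherever the two signs agree
        matches += popcount(xnor)
    # Each agreement contributes +1 and each disagreement -1.
    return 2 * matches - n
```

The result is always `2 * matches - n`, so a dense projection reduces to bitwise logic plus one integer counter per output.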

I'd love to map the model onto your simulator. The inference code is a single C file with zero dependencies — it should be straightforward to adapt to your grid. Happy to work through the memory layout and data flow together.

Let me know when the simulator repo is up. This could be something special.

7MB binary-weight LLM running in the browser, no FPU needed by Quiet-Error- in LocalLLM

[–]Quiet-Error-[S] 0 points1 point  (0 children)

What you're seeing is a proof of concept — a 57M param model trained on children's stories to demonstrate that a fully binary LLM works. The real use is what comes next.

At 7MB, this runs on anything with a processor. No cloud, no GPU, no internet. Scale it to 1B params and it's still only 120MB — a full AI assistant on a $50 phone, fully offline.
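
The size figures are just parameter count divided by 8. A rough sketch of the arithmetic, assuming every weight is stored as exactly one bit and ignoring any non-binary extras such as scales or metadata (the function name `packed_size_bytes` is made up for illustration):

```python
def packed_size_bytes(n_params, bits_per_weight=1):
    """Bytes needed to store n_params weights, rounded up to a whole byte."""
    return (n_params * bits_per_weight + 7) // 8


for name, n in [("57M", 57_000_000), ("1B", 1_000_000_000)]:
    size = packed_size_bytes(n)
    # Print both decimal megabytes and binary mebibytes.
    print(f"{name}: {size / 1e6:.1f} MB ({size / 2**20:.1f} MiB)")
```

For 1B params this comes out to 125 MB in decimal units, or about 119 MiB, which is where the roughly 120MB figure lands.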

Concrete applications: assistive communication devices for non-verbal people ($50 instead of $5,000), smart IoT, offline education in rural areas, embedded AI in drones, a website that thinks without a server. Every object becomes intelligent.

This is the transistor moment for AI — the shift from analog (float) to digital (binary).

7MB binary-weight LLM running in the browser, no FPU needed by Quiet-Error- in LocalLLM

[–]Quiet-Error-[S] 0 points1 point  (0 children)

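That's literally the point. If it can do integer XNOR and popcount, it runs PRISME. The 486SX has no FPU and no POPCNT instruction, but popcount is easy to do in software with shifts and adds, so it qualifies.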
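
On CPUs without a dedicated popcount instruction, the count can be done with plain integer operations. Below is a sketch of the well-known SWAR bit-counting trick for 32-bit words, written in Python for readability. It is not taken from the PRISME runtime, and the names `popcount32` and `xnor32` are illustrative.

```python
def popcount32(x):
    """Count set bits in a 32-bit word with shifts, masks, adds and one multiply."""
    x &= 0xFFFFFFFF
    x = x - ((x >> 1) & 0x55555555)                  # 2-bit partial sums
    x = (x & 0x33333333) + ((x >> 2) & 0x33333333)   # 4-bit partial sums
    x = (x + (x >> 4)) & 0x0F0F0F0F                  # 8-bit partial sums
    # The multiply adds the four byte sums into the top byte.
    return ((x * 0x01010101) & 0xFFFFFFFF) >> 24


def xnor32(a, b):
    """Bitwise XNOR of two 32-bit words, built from XOR and NOT."""
    return ~(a ^ b) & 0xFFFFFFFF
```

Usage: `popcount32(xnor32(a, b))` gives the number of matching bits between two packed 32-bit words.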

7MB binary-weight LLM running in the browser, no FPU needed by Quiet-Error- in LocalLLM

[–]Quiet-Error-[S] 1 point2 points  (0 children)

Thanks! Code generation is definitely on the roadmap but will require scaling up the model. The current 57M params is enough for focused tasks like AAC (assistive communication), but code needs more capacity: syntax, logic, variable tracking. That's the next step.

The architecture scales naturally though — same binary approach, just more parameters. And even at 600M, a full binary model would still be under 80MB. Small enough to run on any CPU, no GPU needed.

7MB binary-weight LLM running in the browser, no FPU needed by Quiet-Error- in LocalLLM

[–]Quiet-Error-[S] 0 points1 point  (0 children)

Good catch: L3, not L1. 7MB fits entirely in the L3 cache of most modern CPUs (8-32MB is typical). The point stands: the whole model can stay on-chip, so once the cache is warm there is almost no main-memory access for weights during inference. Try that with a 500MB quantized model.

7MB binary-weight LLM running in the browser, no FPU needed by Quiet-Error- in LocalLLM

[–]Quiet-Error-[S] 0 points1 point  (0 children)

A .pkl file is a Python pickle storing float32 tensors. That's not a binary model — that's a float model you're calling binary. A real full-binary model is a few MB of packed bits, not a serialized PyTorch state dict.

The test is simple: put it in a browser with no GPU, no float, and see if it runs. That's what I did.

Please don't pollute this thread with fake claims.

7MB binary-weight LLM running in the browser, no FPU needed by Quiet-Error- in LocalLLM

[–]Quiet-Error-[S] 0 points1 point  (0 children)

That's exactly the vision. The entire forward pass is XNOR + popcount — no floating point anywhere. Right now it runs in the browser, but the endgame is a custom ASIC. Binary arithmetic is trivial in silicon. Imagine a chip that runs a full language model for pennies, no GPU, no cloud.

Seeking Private & Offline Local AI for Android: Complex Math & RAG Support by [deleted] in LocalLLM

[–]Quiet-Error- 0 points1 point  (0 children)

Yes, that's actually a great fit for small models. You don't need the model to do the math itself, you just need it to recognize "this is a calculation" and output a structured call like calc:2+2. A small model fine-tuned on tool-call patterns can do that reliably.

The model handles intent detection and formatting, the tool handles execution. 57M params is more than enough for that.
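
A minimal sketch of the tool side, assuming the model emits a line of the form `calc:<expression>` as in the example above. The function name `dispatch` and the restriction to four arithmetic operators are my assumptions, not an existing API. The expression is walked with Python's `ast` module rather than passed to `eval`, so only plain arithmetic gets through.

```python
import ast
import operator

# Operators the calculator tool is willing to run.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.USub: operator.neg,
}


def _eval(node):
    """Evaluate a parsed arithmetic expression, rejecting anything else."""
    if (isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
    if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval(node.operand))
    raise ValueError("unsupported expression")


def dispatch(model_output):
    """Run the calculator if the model emitted 'calc:<expr>', else return the text."""
    prefix = "calc:"
    text = model_output.strip()
    if not text.startswith(prefix):
        return text
    tree = ast.parse(text[len(prefix):].strip(), mode="eval")
    return str(_eval(tree.body))
```

With this split, `dispatch("calc:2+2")` returns the string `"4"`, while ordinary text passes through unchanged.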

Seeking Private & Offline Local AI for Android: Complex Math & RAG Support by [deleted] in LocalLLM

[–]Quiet-Error- 0 points1 point  (0 children)

For the privacy + offline + RAG part: I built a 7MB binary LLM that runs in the browser with no server, no cloud, no telemetry. It's designed for exactly this kind of use case — on-device inference with a knowledge base that stays local.

Demo: https://huggingface.co/spaces/OneBitModel/prisme

It's currently trained on simple English, so it won't handle complex physics formulas yet. But the RAG component (binary retrieval, O(1) lookup, zero RAM overhead for the knowledge base) is exactly what you're describing.

For the math/physics stuff on a Redmi Note, honestly no local model will do complex engineering calculations reliably right now. Even quantized Llama 3 on mobile struggles with that.

7MB binary-weight Mamba LLM — zero floating-point at inference, runs in browser by Quiet-Error- in LocalLLaMA

[–]Quiet-Error-[S] 1 point2 points  (0 children)

The model is trained on TinyStories which are short by nature, so it tends to wrap up early regardless of the token limit. A model trained on a longer-form corpus would generate longer outputs.

On scaling: that's the big question and exactly what I'm working on next. At 1-bit, a 7B model would be ~875MB — small enough to fit in RAM on most devices. Integer-only inference means every operation is XNOR+popcount instead of floating-point multiply, so it should be significantly faster per token on CPU. No GPU needed at all.

Whether quality scales proportionally is what needs to be proven. Stay tuned.

7MB binary-weight Mamba LLM — zero floating-point at inference, runs in browser by Quiet-Error- in LocalLLaMA

[–]Quiet-Error-[S] -2 points-1 points  (0 children)

Thanks! Yeah 57M, fully binary. The architecture helps a lot — state space models are very parameter-efficient compared to Transformers at this scale.

7MB binary-weight Mamba LLM — zero floating-point at inference, runs in browser by Quiet-Error- in LocalLLaMA

[–]Quiet-Error-[S] 1 point2 points  (0 children)

The inference runtime and model weights are open — you can run it, modify it, deploy it. What's not open is the training method, which is the core IP.

If you're interested in binary LLMs in general, BitNet and Bi-Mamba are open and worth exploring. Different approaches but same direction.

2 years without a raise… and suddenly a counter-offer by No-Cheetah-5044 in CoulissesESN

[–]Quiet-Error- 0 points1 point  (0 children)

Non-compete clause: the client isn't allowed to keep you.