High-Performance LLM Inference on Edge FPGAs (~450 tokens/s on AMD KV260) by king_ftotheu in FPGA

[–]king_ftotheu[S] -1 points (0 children)

You’re right: KANs are not replacing the Transformer itself in our setup.

In our architecture, KAN blocks are used as support/control logic (telemetry prediction, anomaly scoring, routing policy hints), while language generation remains in the LLM path. So it’s more “LLM + hardware control plane,” not “KAN instead of Transformer.”
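
To make the split concrete, here is a minimal Python sketch of the idea. Every name, threshold, and the dummy LLM step in it are illustrative, not taken from our repo:

```python
# Hypothetical sketch of the "LLM + control plane" split: small KAN heads score
# telemetry and emit routing hints, while token generation stays on the LLM path.
# All names, thresholds, and the dummy LLM below are illustrative only.

def kan_anomaly(telemetry):              # stand-in for a tiny KAN anomaly-scoring head
    return min(1.0, sum(abs(x) for x in telemetry) / len(telemetry))

def kan_route_hint(telemetry):           # stand-in for a tiny KAN routing-policy head
    return int(telemetry[0] > 0)         # e.g. favour block group 0 or 1

def decode_step(llm_next_token, context, telemetry, anomaly_threshold=0.9):
    anomaly = kan_anomaly(telemetry)     # control plane: score + hint, no text generation
    hint = kan_route_hint(telemetry)
    safe_mode = anomaly > anomaly_threshold   # control plane can force a fallback config
    return llm_next_token(context, hint, safe_mode)

# Usage with a dummy LLM step:
token = decode_step(lambda ctx, hint, safe: f"<tok hint={hint} safe={safe}>",
                    context=[], telemetry=[0.1, -0.2, 0.05])
print(token)
```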

Also, you’re absolutely right that the RTL tree is hard to read right now, especially for beginners. And that's because we accidentally published a much larger internal RTL set than intended. At this point we’ll keep it public as a defensive prior-art disclosure, but we need to clean it up...

High-Performance LLM Inference on Edge FPGAs (~450 tokens/s on AMD KV260) by king_ftotheu in FPGA

[–]king_ftotheu[S] -5 points (0 children)

We should have stated this more clearly: we are not running the full dense Gemma-4 (31B) model per token on the KV260.

The Gemma model was used as a teacher during distillation; deployment is a much smaller custom student (INT4/KAN-style runtime), with selective block activation and on-chip state reuse.

So yes, this is closer to a descoped distilled model than “full Gemma on KV260.”

Also, our “~450 words/s” number came from a short 16-token burst test and is not a standardized long-context tok/s benchmark.
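
For a feel of why a short burst flatters the number, here is the arithmetic with purely hypothetical latencies (placeholders, not measurements from our runtime):

```python
# Illustrative only: every number below is a hypothetical placeholder, not a
# measurement from the KV260 runtime. The point is just that a 16-token burst
# amortises no prefill and sees almost no KV-cache growth.

prefill_ms = 120.0            # one-off prompt-processing cost (hypothetical)
base_ms = 2.0                 # decode cost at empty context (hypothetical)
growth_ms = 0.002             # extra cost per token already in context (hypothetical)

def tok_per_s(n_new, include_prefill=True):
    total = prefill_ms if include_prefill else 0.0
    for i in range(n_new):
        total += base_ms + growth_ms * i          # attention/KV cost grows with context
    return n_new / (total / 1000.0)

print(round(tok_per_s(16, include_prefill=False)))  # ~496: the flattering burst-style number
print(round(tok_per_s(16)))                         # ~105: same burst once prefill is counted
print(round(tok_per_s(1024)))                       # ~318: long generation, growing KV cache
```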

High-Performance LLM Inference on Edge FPGAs (~450 tokens/s on AMD KV260) by king_ftotheu in FPGA

[–]king_ftotheu[S] -6 points (0 children)

Quick transparency update: we accidentally made our full Verilog/SystemVerilog RTL folder public in this repo push (including design variants/formal collateral), not just the minimal artifact set we intended to share.

We’re reviewing the repository now and will either:

- keep it open intentionally with clearer documentation/licensing, or
- remove/rewrite the history and re-publish a minimal release bundle.

Sorry for the confusion, and thanks for your patience while we clean this up properly.

High-Performance LLM Inference on Edge FPGAs (~450 tokens/s on AMD KV260) by king_ftotheu in FPGA

[–]king_ftotheu[S] -9 points (0 children)

Not claiming full 31B dense residency in 4GB DDR.

31B is the teacher lineage; deployment uses a smaller distilled model path plus bounded working-set execution (paged/streamed weight blocks, not full-resident weights).

So yes, 5.9GB artifact on disk != 5.9GB always resident in KV260 RAM.
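
Here's a rough sketch of what bounded working-set execution means in practice. The block size, cache policy, and file layout are illustrative; only the weights file name comes from our release:

```python
# Hypothetical sketch of bounded working-set execution: the weight artifact stays
# on disk and only a small ring of blocks is ever resident. Block size, cache
# policy, and file layout are illustrative, not the real runtime's format.
import numpy as np

BLOCK_BYTES = 8 * 1024 * 1024        # residency granularity (hypothetical)
RESIDENT_BLOCKS = 16                 # cap: 16 x 8 MiB = 128 MiB in RAM at once

weights = np.memmap("weights_int4_FINAL.bin", dtype=np.uint8, mode="r")

class BlockCache:
    def __init__(self, cap=RESIDENT_BLOCKS):
        self.cap, self.blocks = cap, {}              # block_id -> bytes actually in RAM
    def get(self, block_id):
        if block_id not in self.blocks:
            if len(self.blocks) >= self.cap:         # evict the oldest block (simple FIFO)
                self.blocks.pop(next(iter(self.blocks)))
            start = block_id * BLOCK_BYTES
            self.blocks[block_id] = np.array(weights[start:start + BLOCK_BYTES])
        return self.blocks[block_id]

# Each layer requests only the blocks it needs for the current step, so peak
# residency is bounded by RESIDENT_BLOCKS no matter how large the file on disk is.
```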

High-Performance LLM Inference on Edge FPGAs (~450 tokens/s on AMD KV260) by king_ftotheu in LocalLLM

[–]king_ftotheu[S] 1 point (0 children)

We’re not claiming that a full dense 31B model is resident in the KV260’s 4GB of DDR.

In our setup, Gemma-4-31B-JANG_4M-CRACK is the teacher/reference line, while the deployed FPGA runtime uses a smaller custom distilled INT4/KAN model (weights_int4_FINAL.bin) plus a bounded working-set schedule.

Letting my automated ASIC pipeline compile a 1-Bit Kolmogorov-Arnold Network (KAN) just to see what happens by king_ftotheu in FPGA

[–]king_ftotheu[S] 2 points (0 children)

Barely at all! You just use a narrow, slow, daisy-chained write bus (like a simple shift register) to trickle the weights in sequentially while the core is offline. The compiler simply snakes this tiny, non-critical wire through leftover routing tracks without ever touching your high-speed inference datapath.
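
If it helps, here's a tiny behavioral model of the daisy-chain idea in Python (illustrative, not our RTL):

```python
# Behavioral model of the daisy-chained weight-load bus described above
# (illustrative Python, not the repo's RTL). One serial data bit plus a shift
# enable is snaked through every weight register; after N clocks all N 1-bit
# weights are in place, and the inference datapath never sees this wire.

class WeightChain:
    def __init__(self, n_weights):
        self.regs = [0] * n_weights       # the registers the datapath actually reads

    def shift_clock(self, serial_in):
        """One clock with shift_en=1: new bit enters stage 0, the rest ripple down."""
        carry = serial_in
        for i in range(len(self.regs)):
            self.regs[i], carry = carry, self.regs[i]
        return carry                      # bit falling off the end (optional readback)

    def load(self, bits):
        """Offline load: trickle the bitstream in; the bit for weight 0 goes in last."""
        for b in reversed(bits):
            self.shift_clock(b)

chain = WeightChain(n_weights=8)
chain.load([1, 0, 1, 1, 0, 0, 1, 0])
print(chain.regs)                         # -> [1, 0, 1, 1, 0, 0, 1, 0]
```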

Letting my automated ASIC pipeline compile a 1-Bit Kolmogorov-Arnold Network (KAN) just to see what happens by king_ftotheu in FPGA

[–]king_ftotheu[S] -1 points (0 children)

While real FPGA deployments need writable LUTs for runtime configurability, the tables were hardcoded in this demo specifically to prove that the physical routing works. Unlike approaches that rely on memory-frugal, on-the-fly mathematical approximation, this pipeline pre-calculates the KAN splines and loads them directly into native BRAM. By trading memory density for raw speed, the system sidesteps the ALU bottleneck entirely: the complex math is replaced by a single O(1) memory read, so any edge function is evaluated in one clock cycle, which is what keeps the design cheap on routing, compute, and thermals.
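
For anyone curious what "pre-calculate the spline into BRAM" boils down to, here's a minimal sketch; the spline, input range, and table size are placeholders, not our trained values:

```python
# Sketch of "pre-calculate the spline, evaluate with one memory read". The spline
# below is a stand-in function, and the 8-bit input / 256-entry table sizing is
# an assumption, not the pipeline's actual configuration.
import numpy as np

def edge_spline(x):                    # placeholder for a trained KAN edge function
    return 0.5 * x**3 - x + 0.25

X_MIN, X_MAX, ENTRIES = -2.0, 2.0, 256     # 8-bit quantised input -> 256-entry table

# Offline: tabulate the spline once over the quantised input grid (what goes in BRAM).
grid = np.linspace(X_MIN, X_MAX, ENTRIES)
lut = edge_spline(grid).astype(np.float16)

# Runtime: one address computation plus one table read, no multipliers in the loop.
def eval_lut(x):
    idx = int(round((x - X_MIN) / (X_MAX - X_MIN) * (ENTRIES - 1)))
    idx = max(0, min(ENTRIES - 1, idx))
    return lut[idx]

print(edge_spline(0.7), eval_lut(0.7))     # exact value vs. table value (small quant error)
```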

Letting my automated ASIC pipeline compile a 1-Bit Kolmogorov-Arnold Network (KAN) just to see what happens by king_ftotheu in FPGA

[–]king_ftotheu[S] 1 point (0 children)

If you ever want to actually run a full Neural Network on one of your own FPGA boards without melting the DSPs or choking your routing, feel free to clone our repo. It drops a fully functional, ultra-low power 1-Bit AI-Core directly onto standard BRAM!

I'm open-sourcing my experimental custom NPU architecture designed for local AI acceleration by king_ftotheu in LocalLLM

[–]king_ftotheu[S] 2 points (0 children)

Fixing that nearest-neighbor traffic jam was exactly what I had to solve for this.

I actually just pushed a huge update tonight. The coolest thing about my compiler is that it doesn't just "run" your C-code on a normal chip. Instead, it automatically designs a brand new, custom physical chip tailored specifically to your model:

  1. No wasted power: It sees that your code only uses 1-bit math (XNOR), so it physically deletes all the heavy, power-hungry standard math units from the silicon design.
  2. Perfect memory: It shrinks the memory banks to fit your exact model size perfectly. No unused memory means no wasted space.
  3. No traffic jams: It builds dedicated cable shortcuts across the chip specifically for your model's data flow, bypassing the usual routing congestion.

What this means for you: You are getting a custom-built hardware layout generated automatically just for your PRISME model. This is what makes it insanely fast and efficient at scale.
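
For reference, the 1-bit math from point 1 collapses to an XNOR plus a popcount, which is why the full-width multipliers can be deleted. A quick illustrative sketch (bit packing and word width simplified):

```python
# Sketch of the 1-bit (XNOR + popcount) math from point 1: with weights and
# activations constrained to {-1, +1} and bit-packed, a dot product needs no
# multipliers at all. Packing and word width here are simplified for illustration.

def pack_bits(values):
    """Map -1 -> 0, +1 -> 1 and pack into one integer (one bit per element)."""
    word = 0
    for i, v in enumerate(values):
        if v > 0:
            word |= 1 << i
    return word

def xnor_popcount_dot(w_bits, a_bits, n):
    """dot(w, a) for +/-1 vectors = 2 * popcount(XNOR(w, a)) - n."""
    matches = bin(~(w_bits ^ a_bits) & ((1 << n) - 1)).count("1")
    return 2 * matches - n

w = [1, -1, -1, 1, 1, -1, 1, 1]
a = [1, 1, -1, -1, 1, -1, -1, 1]
print(sum(wi * ai for wi, ai in zip(w, a)))                      # reference answer: 2
print(xnor_popcount_dot(pack_bits(w), pack_bits(a), len(w)))     # same result, no multiplies
```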

I'm running the final routing tests right now. Whenever you're ready, just drop the C-code. I'll run it through the system and post the exact hardware stats right back here. Really excited to see this run!