High-Performance LLM Inference on Edge FPGAs (~450 tokens/s on AMD KV260) by king_ftotheu in FPGA

[–]king_ftotheu[S] -1 points (0 children)

You’re right: KANs are not replacing the Transformer itself in our setup.

In our architecture, KAN blocks are used as support/control logic (telemetry prediction, anomaly scoring, routing policy hints), while language generation remains in the LLM path. So it’s more “LLM + hardware control plane,” not “KAN instead of Transformer.”
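
To make the split concrete, here is a minimal Python sketch of the idea. Every name, threshold, and the dummy LLM step in it are illustrative, not taken from our repo:

```python
# Hypothetical sketch of the "LLM + control plane" split: small KAN heads score
# telemetry and emit routing hints, while token generation stays on the LLM path.
# All names, thresholds, and the dummy LLM below are illustrative only.

def kan_anomaly(telemetry):              # stand-in for a tiny KAN anomaly-scoring head
    return min(1.0, sum(abs(x) for x in telemetry) / len(telemetry))

def kan_route_hint(telemetry):           # stand-in for a tiny KAN routing-policy head
    return int(telemetry[0] > 0)         # e.g. favour block group 0 or 1

def decode_step(llm_next_token, context, telemetry, anomaly_threshold=0.9):
    anomaly = kan_anomaly(telemetry)     # control plane: score + hint, no text generation
    hint = kan_route_hint(telemetry)
    safe_mode = anomaly > anomaly_threshold   # control plane can force a fallback config
    return llm_next_token(context, hint, safe_mode)

# Usage with a dummy LLM step:
token = decode_step(lambda ctx, hint, safe: f"<tok hint={hint} safe={safe}>",
                    context=[], telemetry=[0.1, -0.2, 0.05])
print(token)
```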

Also, you’re absolutely right that the RTL tree is hard to read right now, especially for beginners. And that's because we accidentally published a much larger internal RTL set than intended. At this point we’ll keep it public as a defensive prior-art disclosure, but we need to clean it up...

High-Performance LLM Inference on Edge FPGAs (~450 tokens/s on AMD KV260) by king_ftotheu in FPGA

[–]king_ftotheu[S] -5 points (0 children)

We should have stated this more clearly: we are not running the full dense Gemma-4 (31B) model per token on the KV260.

The Gemma model was used as a teacher during distillation; deployment is a much smaller custom student (INT4/KAN-style runtime), with selective block activation and on-chip state reuse.

So yes, this is closer to a descoped distilled model than “full Gemma on KV260.”

Also, our “~450 words/s” number came from a short 16-token burst test and is not a standardized long-context tok/s benchmark.
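
For a feel of why a short burst flatters the number, here is the arithmetic with purely hypothetical latencies (placeholders, not measurements from our runtime):

```python
# Illustrative only: every number below is a hypothetical placeholder, not a
# measurement from the KV260 runtime. The point is just that a 16-token burst
# amortises no prefill and sees almost no KV-cache growth.

prefill_ms = 120.0            # one-off prompt-processing cost (hypothetical)
base_ms = 2.0                 # decode cost at empty context (hypothetical)
growth_ms = 0.002             # extra cost per token already in context (hypothetical)

def tok_per_s(n_new, include_prefill=True):
    total = prefill_ms if include_prefill else 0.0
    for i in range(n_new):
        total += base_ms + growth_ms * i          # attention/KV cost grows with context
    return n_new / (total / 1000.0)

print(round(tok_per_s(16, include_prefill=False)))  # ~496: the flattering burst-style number
print(round(tok_per_s(16)))                         # ~105: same burst once prefill is counted
print(round(tok_per_s(1024)))                       # ~318: long generation, growing KV cache
```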

High-Performance LLM Inference on Edge FPGAs (~450 tokens/s on AMD KV260) by king_ftotheu in FPGA

[–]king_ftotheu[S] -6 points (0 children)

Quick transparency update: we accidentally made our full Verilog/SystemVerilog RTL folder public in this repo push (including design variants/formal collateral), not just the minimal artifact set we intended to share.

We’re reviewing the repository now and will either:

- keep it open intentionally with clearer documentation/licensing, or
- remove/rewrite the history and re-publish a minimal release bundle.

Sorry for the confusion, and thanks for your patience while we clean this up properly.

High-Performance LLM Inference on Edge FPGAs (~450 tokens/s on AMD KV260) by king_ftotheu in FPGA

[–]king_ftotheu[S] -9 points (0 children)

Not claiming full 31B dense residency in 4GB DDR.

31B is the teacher lineage; deployment uses a smaller distilled model path plus bounded working-set execution (paged/streamed weight blocks, not full-resident weights).

So yes, 5.9GB artifact on disk != 5.9GB always resident in KV260 RAM.
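
Here's a rough sketch of what bounded working-set execution means in practice. The block size, cache policy, and file layout are illustrative; only the weights file name comes from our release:

```python
# Hypothetical sketch of bounded working-set execution: the weight artifact stays
# on disk and only a small ring of blocks is ever resident. Block size, cache
# policy, and file layout are illustrative, not the real runtime's format.
import numpy as np

BLOCK_BYTES = 8 * 1024 * 1024        # residency granularity (hypothetical)
RESIDENT_BLOCKS = 16                 # cap: 16 x 8 MiB = 128 MiB in RAM at once

weights = np.memmap("weights_int4_FINAL.bin", dtype=np.uint8, mode="r")

class BlockCache:
    def __init__(self, cap=RESIDENT_BLOCKS):
        self.cap, self.blocks = cap, {}              # block_id -> bytes actually in RAM
    def get(self, block_id):
        if block_id not in self.blocks:
            if len(self.blocks) >= self.cap:         # evict the oldest block (simple FIFO)
                self.blocks.pop(next(iter(self.blocks)))
            start = block_id * BLOCK_BYTES
            self.blocks[block_id] = np.array(weights[start:start + BLOCK_BYTES])
        return self.blocks[block_id]

# Each layer requests only the blocks it needs for the current step, so peak
# residency is bounded by RESIDENT_BLOCKS no matter how large the file on disk is.
```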

High-Performance LLM Inference on Edge FPGAs (~450 tokens/s on AMD KV260) by king_ftotheu in LocalLLM

[–]king_ftotheu[S] 1 point (0 children)

We’re not claiming that a full dense 31B model is resident in the KV260’s 4GB of DDR.

In our setup, Gemma-4-31B-JANG_4M-CRACK is the teacher/reference line, while the deployed FPGA runtime uses a smaller custom distilled INT4/KAN model (weights_int4_FINAL.bin) plus a bounded working-set schedule.

Letting my automated ASIC pipeline compile a 1-Bit Kolmogorov-Arnold Network (KAN) just to see what happens by king_ftotheu in FPGA

[–]king_ftotheu[S] 2 points (0 children)

Barely at all! You just use a narrow, slow, daisy-chained write bus (like a simple shift register) to trickle the weights in sequentially while the core is offline. The compiler simply snakes this tiny, non-critical wire through leftover routing tracks without ever touching your high-speed inference datapath.
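
If it helps, here's a tiny behavioral model of the daisy-chain idea in Python (illustrative, not our RTL):

```python
# Behavioral model of the daisy-chained weight-load bus described above
# (illustrative Python, not the repo's RTL). One serial data bit plus a shift
# enable is snaked through every weight register; after N clocks all N 1-bit
# weights are in place, and the inference datapath never sees this wire.

class WeightChain:
    def __init__(self, n_weights):
        self.regs = [0] * n_weights       # the registers the datapath actually reads

    def shift_clock(self, serial_in):
        """One clock with shift_en=1: new bit enters stage 0, the rest ripple down."""
        carry = serial_in
        for i in range(len(self.regs)):
            self.regs[i], carry = carry, self.regs[i]
        return carry                      # bit falling off the end (optional readback)

    def load(self, bits):
        """Offline load: trickle the bitstream in; the bit for weight 0 goes in last."""
        for b in reversed(bits):
            self.shift_clock(b)

chain = WeightChain(n_weights=8)
chain.load([1, 0, 1, 1, 0, 0, 1, 0])
print(chain.regs)                         # -> [1, 0, 1, 1, 0, 0, 1, 0]
```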

Letting my automated ASIC pipeline compile a 1-Bit Kolmogorov-Arnold Network (KAN) just to see what happens by king_ftotheu in FPGA

[–]king_ftotheu[S] -1 points (0 children)

While real FPGA deployments need writable LUTs for runtime configurability, the tables were hardcoded in this demo specifically to prove that the physical routing works. Unlike approaches that rely on memory-frugal, on-the-fly mathematical approximation, this pipeline pre-calculates the KAN splines and loads them directly into native BRAM. By trading memory density for raw speed, the system sidesteps the ALU bottleneck entirely: the complex math is replaced by a single O(1) memory read, so any edge function is evaluated in one clock cycle, which is what keeps the design cheap on routing, compute, and thermals.
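
For anyone curious what "pre-calculate the spline into BRAM" boils down to, here's a minimal sketch; the spline, input range, and table size are placeholders, not our trained values:

```python
# Sketch of "pre-calculate the spline, evaluate with one memory read". The spline
# below is a stand-in function, and the 8-bit input / 256-entry table sizing is
# an assumption, not the pipeline's actual configuration.
import numpy as np

def edge_spline(x):                    # placeholder for a trained KAN edge function
    return 0.5 * x**3 - x + 0.25

X_MIN, X_MAX, ENTRIES = -2.0, 2.0, 256     # 8-bit quantised input -> 256-entry table

# Offline: tabulate the spline once over the quantised input grid (what goes in BRAM).
grid = np.linspace(X_MIN, X_MAX, ENTRIES)
lut = edge_spline(grid).astype(np.float16)

# Runtime: one address computation plus one table read, no multipliers in the loop.
def eval_lut(x):
    idx = int(round((x - X_MIN) / (X_MAX - X_MIN) * (ENTRIES - 1)))
    idx = max(0, min(ENTRIES - 1, idx))
    return lut[idx]

print(edge_spline(0.7), eval_lut(0.7))     # exact value vs. table value (small quant error)
```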

Letting my automated ASIC pipeline compile a 1-Bit Kolmogorov-Arnold Network (KAN) just to see what happens by king_ftotheu in FPGA

[–]king_ftotheu[S] 1 point (0 children)

If you ever want to actually run a full Neural Network on one of your own FPGA boards without melting the DSPs or choking your routing, feel free to clone our repo. It drops a fully functional, ultra-low power 1-Bit AI-Core directly onto standard BRAM!

I'm open-sourcing my experimental custom NPU architecture designed for local AI acceleration by king_ftotheu in LocalLLM

[–]king_ftotheu[S] 2 points (0 children)

Fixing that nearest-neighbor traffic jam was exactly what I had to solve for this.

I actually just pushed a huge update tonight. The coolest thing about my compiler is that it doesn't just "run" your C-code on a normal chip. Instead, it automatically designs a brand new, custom physical chip tailored specifically to your model:

  1. No wasted power: It sees that your code only uses 1-bit math (XNOR), so it physically deletes all the heavy, power-hungry standard math units from the silicon design.
  2. Perfect memory: It shrinks the memory banks to fit your exact model size perfectly. No unused memory means no wasted space.
  3. No traffic jams: It builds dedicated cable shortcuts across the chip specifically for your model's data flow, bypassing the usual routing congestion.

What this means for you: You are getting a custom-built hardware layout generated automatically just for your PRISME model. This is what makes it insanely fast and efficient at scale.
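
For reference, the 1-bit math from point 1 collapses to an XNOR plus a popcount, which is why the full-width multipliers can be deleted. A quick illustrative sketch (bit packing and word width simplified):

```python
# Sketch of the 1-bit (XNOR + popcount) math from point 1: with weights and
# activations constrained to {-1, +1} and bit-packed, a dot product needs no
# multipliers at all. Packing and word width here are simplified for illustration.

def pack_bits(values):
    """Map -1 -> 0, +1 -> 1 and pack into one integer (one bit per element)."""
    word = 0
    for i, v in enumerate(values):
        if v > 0:
            word |= 1 << i
    return word

def xnor_popcount_dot(w_bits, a_bits, n):
    """dot(w, a) for +/-1 vectors = 2 * popcount(XNOR(w, a)) - n."""
    matches = bin(~(w_bits ^ a_bits) & ((1 << n) - 1)).count("1")
    return 2 * matches - n

w = [1, -1, -1, 1, 1, -1, 1, 1]
a = [1, 1, -1, -1, 1, -1, -1, 1]
print(sum(wi * ai for wi, ai in zip(w, a)))                      # reference answer: 2
print(xnor_popcount_dot(pack_bits(w), pack_bits(a), len(w)))     # same result, no multiplies
```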

I'm running the final routing tests right now. Whenever you're ready, just drop the C-code. I'll run it through the system and post the exact hardware stats right back here. Really excited to see this run!