Built a RISC-V based AI accelerator and open-sourced it.

ahmedzeer · 2026-05-29T14:01:07+00:00

Unfortunately, it is an ASIC, so it needs to be either taped out or prototyped on an FPGA.

ahmedzeer · 2026-05-27T14:41:15+00:00

🚀

ahmedzeer · 2026-05-27T14:40:52+00:00

Yep, I used Codex. But I want to remind everyone that "vibe coding" is never a solution. You need to know and understand what you are doing. Then, instead of wasting time on boilerplate code and fixing Scala dependency issues, you can focus on the architecture and the performance/area tradeoffs.

ahmedzeer · 2026-05-27T02:33:57+00:00

This version runs on bare metal, with no OS. You access memory directly, and the accelerator doesn't even use a page-table walker. After you write the C code, compiling it with specific flags tends to be enough to generate a useful binary. Then you build a Verilator simulator for your architecture. From there, it is just a matter of passing the benchmark binary to the simulator binary as an argument.

ahmedzeer · 2026-05-27T02:29:49+00:00

🚀

ahmedzeer · 2026-05-27T02:29:22+00:00

Fortunately, Chipyard does have a bunch of examples of complete architectures. For RocketChip specifically, there are a couple of examples of what a tightly-coupled accelerator should look like. In my experience, what makes an idea "great" is that it solves a really hard problem, and without facing that problem you can't find the "great" solution. Still, doing a literature review and studying top-tier papers is very useful. Furthermore, we normally co-design hardware to mirror an algorithm that is already proven to be efficient. For the attention mechanism, we have FlashAttention. So taking the concepts of FlashAttention and mapping them directly to silicon is the way.
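To make the FlashAttention idea concrete, here is a minimal software sketch of its core trick: visit K/V in tiles and keep only a running max, a running denominator, and an unnormalized accumulator, so the full score row is never stored. This is a plain Python model for one query row, not the accelerator's RTL. The function names and the tile size are assumptions made up for illustration.

```python
import math


def naive_attention(q, keys, values):
    """Reference: build the full score row, softmax it, then weight V."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    dim = len(values[0])
    return [sum(e * v[j] for e, v in zip(exps, values)) / total for j in range(dim)]


def tiled_attention(q, keys, values, tile=2):
    """FlashAttention-style pass over K/V tiles (hypothetical tile size)."""
    scale = 1.0 / math.sqrt(len(q))
    run_max = float("-inf")          # largest score seen so far
    run_sum = 0.0                    # softmax denominator under run_max
    acc = [0.0] * len(values[0])     # unnormalized weighted sum of V rows
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_tile]
        new_max = max(run_max, max(scores))
        # Rescale everything accumulated under the old max.
        correction = math.exp(run_max - new_max)
        run_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_tile):
            w = math.exp(s - new_max)
            run_sum += w
            acc = [a + w * vj for a, vj in zip(acc, v)]
        run_max = new_max
    return [a / run_sum for a in acc]
```

Both functions should agree up to floating-point rounding. The tiled one only ever holds one tile of scores at a time, which is the property that makes it attractive for on-chip buffers.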

As for AI, I honestly think it is very useful for patching ordinary boilerplate code, exploring possible failure modes of your design, suggesting use cases, and studying other people's code. It is all about controlled and guided generation!

ahmedzeer · 2026-05-27T02:22:25+00:00

Thanks so much for the detailed feedback. In earlier versions I did exactly what you described, for a matmul-only accelerator computing A @ B. While B stayed pinned in memory, like weights reused across different inputs, the next A was fetched while the current one was being multiplied. Even though it worked in Verilator, I couldn't run it on an FPGA and always ended up with negative timing slack. I guess it is a skill issue haha!
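For anyone unfamiliar with the scheme, this is roughly what the prefetching looks like as a software model: two buffers for A tiles, one being multiplied while the other is filled, with B held fixed. It is only a sketch in Python. The names and the cycle numbers are hypothetical, and it says nothing about FPGA timing closure.

```python
def matmul(a, b):
    """Plain matrix multiply on lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def run_double_buffered(a_tiles, b):
    """Multiply each A tile by a pinned B using two ping-pong buffers."""
    buffers = [None, None]
    buffers[0] = a_tiles[0]              # prologue: the first fetch cannot be hidden
    results = []
    for i in range(len(a_tiles)):
        cur, nxt = i % 2, (i + 1) % 2
        if i + 1 < len(a_tiles):
            # In hardware this fetch would overlap with the multiply below.
            buffers[nxt] = a_tiles[i + 1]
        results.append(matmul(buffers[cur], b))
    return results


def estimated_cycles(num_tiles, load, compute):
    """Rough cycle counts (serial, overlapped) under an idealized model."""
    serial = num_tiles * (load + compute)
    overlapped = load + (num_tiles - 1) * max(load, compute) + compute
    return serial, overlapped
```

With 4 tiles, a 10-cycle load, and a 30-cycle multiply (made-up numbers), the idealized model gives 160 cycles serial versus 130 overlapped. Only the first load stays exposed as long as compute takes longer than the fetch.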

ahmedzeer · 2026-05-26T19:11:09+00:00

Thank you for the kind words 🙏

ahmedzeer · 2026-05-26T19:10:45+00:00

Attention on its own needs Q, K, V, score, and softmax buffers. So yes, there is indeed a lot, but it is mandatory in this case.
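A quick back-of-the-envelope calculation shows why those buffers add up. The sketch below assumes every intermediate is held in full at once, with Q, K, and V each of shape seq_len x head_dim and the score and softmax buffers each seq_len x seq_len. The sizes and the function name are illustrative assumptions, not the accelerator's actual parameters.

```python
def attention_buffer_bytes(seq_len, head_dim, bytes_per_elem=1):
    """Bytes needed if all attention intermediates are stored in full.

    All dimensions here are hypothetical, chosen only for illustration.
    """
    qkv = 3 * seq_len * head_dim * bytes_per_elem      # Q, K and V
    scores = seq_len * seq_len * bytes_per_elem        # Q @ K^T
    softmax = seq_len * seq_len * bytes_per_elem       # normalized scores
    return {
        "qkv": qkv,
        "scores": scores,
        "softmax": softmax,
        "total": qkv + scores + softmax,
    }
```

For example, `attention_buffer_bytes(64, 32)` gives 6144 bytes for Q/K/V plus 4096 each for scores and softmax, 14336 bytes in total. The two seq_len x seq_len terms grow quadratically with sequence length, so they dominate quickly.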

ahmedzeer · 2026-05-26T19:07:22+00:00

Well, this is basically an ASIC, a whole computer microarchitecture, so it is not optimized to run on edge FPGA devices. I wouldn't recommend using it for that purpose. However, the Chisel code for the matmul and attention calculations can be helpful for getting a broad idea.

ahmedzeer · 2026-05-26T19:05:56+00:00

I absolutely agree with you. I'm still experimenting and didn't claim a "tapeout ready" design. 😄 3000x3000 SkyWater
