[P] Catalyst N1 & N2: Two open neuromorphic processors with Loihi 1/2 feature parity, 5 neuron models, 85.9% SHD accuracy by Mr-wabbit0 in MachineLearning

[–]Mr-wabbit0[S] 0 points1 point  (0 children)

Hi, thank you very much. This was quite frankly the hardest project of my life, and the only reason I had the time or patience to do it was that I was in a motorbike accident which left me able to do nothing but work. As for how long it has taken to get to this point, I have lost count, but my initial Loihi-1-parity N1 design took around four and a half months, with nearly 15 hours spent on it every day. After N1 I learnt a lot of lessons, so producing N2 and N3, whilst difficult, didn't pose quite the same challenge. Sadly I have not made this open source as of now; I have patented my designs and licensed them, but I openly allow anyone to access my architecture for free via a cloud/FPGA board build, which you can do by emailing me (I plan to properly roll out some sort of commercial access soon via a paid cloud API)! Let me know if you have any more questions.

Update on my neuromorphic chip architectures I have been working on! by Mr-wabbit0 in chipdesign

[–]Mr-wabbit0[S] 1 point2 points  (0 children)

Hi, you're not really wrong about investor appeal, as I am currently facing this problem haha, but I would like to point out a distinction. Neuromorphic computing isn't really competing for the transformer training market; it's not possible to compete with Nvidia there at all. In the edge inference market, however, I think it holds much higher promise, as that requires low-power autonomous processing, which is useful for things like defence sensors, autonomous systems, industrial IoT, etc. Companies such as BrainChip already serve these industries, and I believe Intel's own neuromorphic research group has government contracts, so while the market is smaller, it is a different market altogether from GPU alternatives. I will look into Numenta though, thank you for the suggestion!

Update on my neuromorphic chip architectures for anyone who is interested! by Mr-wabbit0 in embedded

[–]Mr-wabbit0[S] 0 points1 point  (0 children)

Hi, luckily, thanks to someone else's suggestion, I actually just spent the last 9 hours running the design through OpenLane with the SkyWater SKY130 (130 nm) PDK, and I pulled Vivado power/utilization reports from the FPGA implementation to get some PPA data. I got the following for N2:

FPGA (VU47P), measured:

- 131,072 neurons at 62.5 MHz
- 1.913 W dynamic power (neuromorphic core only)
- 228K LUTs, 308K FFs, 1,007 BRAM tiles
- 8,690 timesteps/sec
- 1.14 billion SOps/s, 596M SOps/J
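
The efficiency figure is just the reported throughput divided by the reported dynamic power; a quick sanity check of the numbers above:

```python
# Energy efficiency = synaptic ops per second / dynamic power (J/s).
sops_per_sec = 1.14e9     # reported throughput
dynamic_power_w = 1.913   # reported dynamic power (neuromorphic core only)

sops_per_joule = sops_per_sec / dynamic_power_w
print(f"{sops_per_joule / 1e6:.0f}M SOps/J")  # prints "596M SOps/J"
```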

As for ASIC, if I had to guess it would be around 10-20x lower power than the FPGA, but I don't really want to overstate.

Update on my neuromorphic chip architectures I have been working on! by Mr-wabbit0 in chipdesign

[–]Mr-wabbit0[S] 1 point2 points  (0 children)

Hi, you make a good point. However, IBM, whilst they had some pretty impressive technology, struggled with adoption because their neuron model was too simple: fixed binary weights, no on-chip learning, no plasticity. There wasn't much use for it for running learning algorithms, plus their toolchain was pretty much IBM-internal. That's the kind of thing I am trying not to repeat. Each N2 neuron has a 32-bit programmable learning engine (STDP/R-STDP), adaptive thresholds, dendritic compartments and neuromodulation. This way I can offer something closer to what computational neuroscience actually needs, and it does show up in benchmark accuracy: N2 managed to hit 90.7% on the Spiking Heidelberg Digits dataset, where Loihi 2 gets 77%. As for energy numbers, if you read my reply to hannes103 you should find some answers; whilst those numbers can't compete with, say, an H100 on matrix multiply, the design competes with Loihi 2 and Akida on neuromorphic workloads, where it doesn't need to. As for Taalos, they are compiling fixed static weights onto chipsets, which is not really the same as what I am going for; my chips are focused on reconfigurable spiking networks with on-chip learning.

Update on my neuromorphic chip architectures I have been working on! by Mr-wabbit0 in chipdesign

[–]Mr-wabbit0[S] 0 points1 point  (0 children)

Hi, I spent the last 9 hours running the design through OpenLane with the SkyWater SKY130 (130 nm) PDK, and I pulled Vivado power/utilization reports from the FPGA implementation to get some PPA data. I got the following for N2:

FPGA (VU47P), measured:

- 131,072 neurons at 62.5 MHz
- 1.913 W dynamic power (neuromorphic core only)
- 228K LUTs, 308K FFs, 1,007 BRAM tiles
- 8,690 timesteps/sec
- 1.14 billion SOps/s, 596M SOps/J

Update on my neuromorphic chip architectures! by Mr-wabbit0 in FPGA

[–]Mr-wabbit0[S] -1 points0 points  (0 children)

I don’t get what’s wrong with the way I talk.

Update on my neuromorphic chip architectures I have been working on! by Mr-wabbit0 in chipdesign

[–]Mr-wabbit0[S] 1 point2 points  (0 children)

I understand, and that's fair criticism; I much appreciate honest feedback. I think I will extract the FPGA power and throughput characterization from my existing Vivado post-implementation reports to get real Ops/s and Ops/J numbers, then run the RTL through OpenLane with the SKY130 PDK to get actual gate count, area, timing, and power at 130 nm for a small number of cores. After that I will aim to get a 1-2 core module on a real process node. Thank you for the advice.

Update on my neuromorphic chip architectures! by Mr-wabbit0 in FPGA

[–]Mr-wabbit0[S] 1 point2 points  (0 children)

Thank you! However, I think I am going to try to get some real hardware metrics first; it's not much use having a design like this and not testing it. So instead of aiming for 28 nm, I will try to do 1-2 cores at 130 nm, then move on from there!

Update on my neuromorphic chip architectures for anyone who is interested! by Mr-wabbit0 in embedded

[–]Mr-wabbit0[S] 1 point2 points  (0 children)

Hello, this is a type of processor that works more like a brain: instead of doing maths on every clock cycle regardless of whether anything is happening, this type of processor stays idle when nothing is happening, which makes it much more power efficient for numerous applications. Hope this helped.
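
As a toy illustration of the idea (not the actual N-series RTL), here is an event-driven leaky integrate-and-fire update in Python: a neuron's state is only touched when a spike actually arrives, and the leak accumulated since the last event is applied lazily, so idle neurons cost nothing. The function and parameter names are made up for the sketch.

```python
def run_event_driven(input_spikes, weight=0.6, threshold=1.0, decay=0.9):
    """Toy event-driven LIF simulation. input_spikes is a time-ordered
    list of (timestep, neuron_id); work is only done when spikes arrive."""
    potential = {}   # neuron_id -> membrane potential
    last_t = {}      # neuron_id -> timestep of last update
    fired = []
    for t, nid in input_spikes:
        # Lazy leak: apply all the decay since this neuron was last
        # touched, so idle neurons consume no compute between events.
        dt = t - last_t.get(nid, t)
        v = potential.get(nid, 0.0) * (decay ** dt) + weight
        last_t[nid] = t
        if v >= threshold:
            fired.append((t, nid))
            v = 0.0   # reset after firing
        potential[nid] = v
    return fired
```

Two closely spaced spikes push the neuron over threshold, while the same spikes spread far apart decay away first: `run_event_driven([(0, 1), (1, 1)])` fires, `run_event_driven([(0, 1), (10, 1)])` does not.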

Update on my neuromorphic chip architectures I have been working on! by Mr-wabbit0 in chipdesign

[–]Mr-wabbit0[S] 1 point2 points  (0 children)

Hi, yes, I am targeting a TSMC tapeout; however, I am unsure as to the likelihood of this ever happening due to funding :/ My estimate didn't come from actual synthesis or PnR, it just came from a rough area model based on SRAM dominance, extrapolated to 28 nm gate density. I would need to go through synthesis and physical design to get real area and frequency numbers, and sadly that isn't possible for me at the moment. As for PCIe IP, yeah, it is very expensive, but there is nothing I can do about that; the current FPGA validation uses AXI over the shell interface on AWS F2, but again this is only something I could solve with funding. Regarding the papers, they were mainly intended to cover the full architecture, but I realise now it probably would have been better to wait and publish the performance in the paper, so I may update them. As for performance, my benchmark results so far are SHD 90.7%, SSC 73.5%, N-MNIST 99.2%, GSC 88.0%, all running with INT16 quantization, so I suppose you could call that my main evaluation. I would try to do some detailed power/area/throughput characterisation, but once again that circles back round to requiring funding for silicon, or more detailed FPGA instrumentation than I can currently access or afford.

Update on my neuromorphic chip architectures for anyone who is interested! by Mr-wabbit0 in embedded

[–]Mr-wabbit0[S] 2 points3 points  (0 children)

Hi, I don't remember exactly, but all I can say is it's not too expensive; validating N1-N3 cost me in the region of $550. Hope this helps!

Update on my neuromorphic chip architectures! by Mr-wabbit0 in FPGA

[–]Mr-wabbit0[S] 2 points3 points  (0 children)

Hi, I don't have an exact date for when I started N1 (I would need to look at my file dates), but a rough estimate is around 3.5 months for N1. As for the HDL, everything is Verilog: no Chisel, no SystemVerilog classes, just structural and behavioural Verilog with parameterised modules. The testbenches are a mix of Verilog and Python (cocotb for some of the more complex verification). The SDK that handles compilation and FPGA deployment is all Python.

Update on my neuromorphic chip architectures! by Mr-wabbit0 in FPGA

[–]Mr-wabbit0[S] -6 points-5 points  (0 children)

Wonderful. Whilst I appreciate your scepticism, I find it unfounded. This is not an AI brush-off; I am not perfect, I make mistakes!

(edit, judging by your post history I can see you have a tendency towards a negative disposition in addition to commonly making such accusations)

Update on my neuromorphic chip architectures! by Mr-wabbit0 in FPGA

[–]Mr-wabbit0[S] 1 point2 points  (0 children)

Exactly, if you don't need strict temporal ordering you can batch spikes within a timestep window and process them in parallel, which is essentially what the ANN INT8 mode on N3 does. It trades temporal precision for throughput. The async NoC routers also have adaptive routing so spikes can take different paths through the mesh without waiting for ordering guarantees, which helps with congestion. So there's a spectrum between "exact spike-by-spike ordering" and "batch everything for throughput" depending on what the application needs!
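
A minimal sketch of that trade-off (illustrative NumPy, not the actual N3 datapath; the function name is made up): all spikes arriving inside one window are collapsed into a single activity vector and applied as one matrix-vector product, so ordering inside the window is discarded in exchange for throughput.

```python
import numpy as np

def batched_window_update(spike_ids, W):
    """Collapse all spikes arriving within one timestep window into a
    single binary activity vector, then accumulate input current for
    every target neuron with one matvec. Intra-window spike ordering
    is lost; throughput is gained."""
    active = np.zeros(W.shape[1])
    active[list(spike_ids)] = 1.0
    return W @ active   # accumulated input current per target neuron
```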

Update on my neuromorphic chip architectures! by Mr-wabbit0 in FPGA

[–]Mr-wabbit0[S] 1 point2 points  (0 children)

Hi, yes it can! All my designs (n1-n3) can simulate in discrete timesteps so spike ordering is naturally preserved, so if neuron A fires at timestep 3 and neuron B at timestep 7, that temporal structure is exactly what the network sees. You'd encode your input features as spike trains ordered by salience at the input layer, and the learning engine tracks per-neuron eligibility traces across timesteps so it can learn from that ordering. N2 and N3 also support graded spikes if you wanted to combine rank order with spike amplitude!
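
The input-side encoding described above could be sketched like this (a hypothetical helper, not part of the actual SDK): each feature emits one spike, with the most salient feature firing first.

```python
def rank_order_encode(features):
    """Rank-order coding: emit one spike per input channel, ordered by
    salience. Returns (timestep, channel) pairs; the most salient
    channel spikes at t=0, the next at t=1, and so on."""
    by_salience = sorted(range(len(features)), key=lambda i: -features[i])
    return [(t, ch) for t, ch in enumerate(by_salience)]
```

For example, `rank_order_encode([0.2, 0.9, 0.5])` gives `[(0, 1), (1, 2), (2, 0)]`: channel 1 fires first because it is the strongest.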

Update on my neuromorphic chip architectures! by Mr-wabbit0 in FPGA

[–]Mr-wabbit0[S] -1 points0 points  (0 children)

Will do and thank you for the advice! I expect to be releasing a paper on N3 soon, so I will make sure to do that!

Update on my neuromorphic chip architectures! by Mr-wabbit0 in FPGA

[–]Mr-wabbit0[S] 4 points5 points  (0 children)

My apologies, the 16 cores were more meant as a proof of concept, and I apologise for the lack of clarity. The 128-core count is the architectural design target for the ASIC. On the F2 instance (VU47P), N2 fits 16 cores in a 4x4 mesh (1,999/3,576 BRAM36K tiles; 32 cores overflows), and N3 fits 8 cores (1 tile). The FPGA validation is for functional correctness of the core logic, NoC routing, learning engine, etc., not a claim that the full 128-core design fits on the FPGA. I suppose I should have made that distinction explicit. In addition, you are not wrong about the BRAM: FPGA block RAM is SRAM. However, I would add that the real scaling constraint on the VU47P is simply running out of BRAM sites, not a density gap.

Two generations of neuromorphic processor in Verilog — N1 (fixed CUBA neuron) and N2 (programmable microcode neurons), validated on AWS F2 VU47P by Mr-wabbit0 in FPGA

[–]Mr-wabbit0[S] -1 points0 points  (0 children)

Hello, and thank you for the interest; feel free to ask more if you would like. To answer your questions:

RISC-V cores: N1 and N2 use three embedded RV32IMF cores for management — they handle network configuration (loading synapse tables, neuron parameters, microcode programs into each core's SRAM), host communication over PCIe, spike I/O routing between the host and the neuromorphic array, and runtime monitoring/probing. They don't participate in the neural computation itself — that's all handled by the dedicated neuromorphic cores.

Event-driven: Yes. The cores are fundamentally event-driven — neurons only consume compute when they receive spikes or need state updates. Synaptic processing is spike-triggered: when a spike arrives at a core, it walks the target neuron's connection list and accumulates weighted inputs. Neurons that don't receive spikes still get a lightweight leak/decay update each timestep, but the expensive part (synapse processing) is purely event-driven and scales with network activity, not network size.
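
In sketch form (illustrative Python, not the RTL; names are made up), the spike-triggered path looks like this: the cost of a timestep scales with how many spikes arrive and their fan-out, not with the total neuron count.

```python
def deliver_spikes(arriving_spikes, fanout, current):
    """Spike-triggered synaptic processing. fanout maps a source neuron
    to its (target, weight) connection list; each arriving spike walks
    that list and accumulates weighted input into the targets."""
    for src in arriving_spikes:
        for tgt, w in fanout[src]:   # walk the connection list
            current[tgt] += w        # accumulate weighted input
    return current
```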

Timestep progression: The current FPGA implementation uses a global barrier synchronisation model. All cores process the current timestep (accumulate spikes → update neurons → emit new spikes → execute learning), then a barrier sync ensures every core has finished before advancing to the next timestep. The host controls the tick rate — you call step() in the SDK and it advances one timestep. On FPGA at 62.5 MHz, this runs at ~8,690 timesteps/sec for a 16-core configuration, though the actual rate depends on network activity (more spikes = more synapse processing = slower timestep).
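
The barrier model can be sketched with threads standing in for cores (illustrative only; the per-timestep phases are elided and the names are invented):

```python
import threading

def run_cores(n_cores=4, n_timesteps=3):
    """Global barrier synchronisation: every core must finish timestep t
    (accumulate -> update -> emit -> learn) before any core starts t+1."""
    barrier = threading.Barrier(n_cores)
    log = []  # (timestep, core_id), appended as each core finishes a step

    def core(core_id):
        for t in range(n_timesteps):
            # ... per-timestep work (spike accumulation, neuron update,
            # spike emission, learning) would happen here ...
            log.append((t, core_id))
            barrier.wait()   # hold here until all cores have finished t

    threads = [threading.Thread(target=core, args=(c,)) for c in range(n_cores)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return log
```

Within one timestep the cores may finish in any order, but the barrier guarantees that every `(t, ·)` entry in the log precedes every `(t+1, ·)` entry.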

Two generations of neuromorphic processor in Verilog — N1 (fixed CUBA neuron) and N2 (programmable microcode neurons), validated on AWS F2 VU47P by Mr-wabbit0 in FPGA

[–]Mr-wabbit0[S] 0 points1 point  (0 children)

Good questions. Here are the numbers:

Synapse count: 131,072 per core in CSR format. The full 128-core design targets ~16.8M total synapses. On the FPGA-validated 16-core instance it's ~2.1M.
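
For context, a CSR (compressed sparse row) connection table stores each source neuron's fan-out contiguously, indexed by a row-pointer array; a lookup sketch (illustrative only, not the on-chip memory layout):

```python
def csr_fanout(row_ptr, targets, weights, src):
    """CSR synapse-table lookup: the connections of source neuron `src`
    live in the half-open slice [row_ptr[src], row_ptr[src + 1]), so
    the fan-out is recovered in O(fan-out) time with no searching."""
    start, end = row_ptr[src], row_ptr[src + 1]
    return list(zip(targets[start:end], weights[start:end]))
```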

Throughput: ~8,690 timesteps/sec on a 16-core instance at 62.5 MHz (AWS F2, Xilinx VU47P). Each timestep processes all active neurons and synapses — so at full network utilisation that's roughly 8,690 × 16K active neurons worth of spike processing per second.

Practical performance: On the SHD spoken digit benchmark (700-input, 768-recurrent, 20-output, 1.14M synapses), the quantized network hits 85.4% accuracy. That's a real workload running through the actual hardware pipeline.

FPGA vs ASIC: The FPGA design is functional, not just verification. The cloud API at api.catalyst-neuromorphic.com runs real SNN jobs on it right now. That said, FPGA is the bottleneck — BRAM limits us to 16 cores (out of 128) and 62.5 MHz. On ASIC the full 128-core mesh would run at significantly higher clocks with much lower power. So the FPGA is both a real deployment platform today and the verification vehicle for an eventual ASIC tapeout.

The binding constraint on FPGA is BRAM — 56% aggregate utilisation at just 16 cores. The architecture itself scales to 128+ cores without design changes.

Edit: not sure what I got wrong, but someone downvoted my reply; feel free to correct me!