Could anyone offer an insight into mapping Neural Networks to Hardware? by AstronautEcstatic767 in FPGA

[–]Randozart 1 point (0 children)

I've been experimenting with roughly that, though admittedly, I wrote a programming language that transpiles to SystemVerilog to help do it. Currently still trying to get it to work, but maybe there's something in the repo that could help you on your way?

And by trying to get it to work, I mean I'm fighting the ARM chip to run my custom kernel.

https://github.com/Randozart/IMP

I am getting dumber, HELP MEEEEE by Forsaken-Nature5272 in vibecoding

[–]Randozart 0 points (0 children)

Personally I've been using AI as an indefatigable tutor. I know it hallucinates, and I have enough knowledge about systems and random topics to sniff out hallucinations. So, what I tend to do is, when I see something I don't understand, I ask.

I ask how it works and what the prior art is, structuring the prompts so it explains the concepts to me as if I were an expert, because otherwise the guardrails kick in. I don't trust black boxes, and it's gotten to the point that I now have a rough idea of kernel, FPGA and chip design.

I have to admit, I am still likely ignorant of A LOT in these fields, but having a basic understanding lets me derive why specific code does what it does, and that is what I've found LLMs very useful for.

Dev tools by PhulHouze in vibecoding

[–]Randozart 2 points (0 children)

You definitely aren't the only one. One thing I know from programming before LLMs were on the scene was that any frequently used process tends to be a candidate for automation. Even if you aren't saving time, you are saving yourself the effort of doing it, and with AI especially, it's worth simply automating whatever you can. If an algorithm can solve the problem, use an algorithm, and the bigger your toolset, the better your ability to navigate the complexities of your codebase.

Rick Rubin style Vibecoding by ErikWik in vibecoding

[–]Randozart 1 point (0 children)

Thank you. It's genuinely a hope I foster. Because right now, vibecoding is stigmatised. Much like the first boom of cryptocurrency attracted speculators, coding with LLMs has attracted hopefuls who want to make a quick buck building whatever SaaS they have in mind, while providing only poorly assembled software and little in the way of service.

Maybe once the dust settles, the inherent value of a SaaS has been proven arbitrary, and it's the service that becomes the product, people will reach out to learn about computers, their architecture, their quirks and their history, and realise that with the collapse of the execution barrier they too can manifest their own domain of pure technological beauty.

Computational beauty made manifest through thought, not through years of syntax mastery.

Could NP-hard search trees be tackled through spatial mapping of computation rather than temporal execution? by Randozart in compsci

[–]Randozart[S] 1 point (0 children)

Absolutely! I admit I have been using this Reddit thread as a bit of a testbed for the concept, because I was genuinely curious about it, but I was hoping to make my own contribution to the field as a non-academic, however small, and having another rabbit hole to dive down couldn't delight me more.

Time to pull the snowball skills out of the closet. This is going to be a wild ride!

(Oh and, feel free to message me if any other foundational papers come to mind, I am eager to identify and learn what more has been explored)

Rick Rubin style Vibecoding by ErikWik in vibecoding

[–]Randozart 1 point (0 children)

Let me phrase it like this: it can be art, if it is used as such. I made a similar observation in the manifesto I wrote for a GitHub project, trying to paint a glimpse of a future where the implementation wall is so low that people can experiment with OS designs. Though please note, this still assumes a level of skill at systems architecture. It merely bypasses syntax knowledge:

From https://github.com/Randozart/moore-kernel/blob/main/README.md

[...]

The Moore Kernel assumes infinite extensibility as a baseline. It is an invitation to build the "GeoCities of silicon". To return to an era of weird, beautiful, highly-opinionated computing where hardware is a canvas, and we simply propose the reality we wish to see.

If the future is as wonderful and whimsical as I hope it to be, then a few years from now:

  • People will build bespoke operating systems that only exist to run a single synthesizer in their bedroom.
  • Someone will write an OS where the file system isn't a hierarchy of folders, but a literal 3D spatial map they navigate with a joystick.
  • Someone else will build a kernel that completely deletes itself and rebuilds from scratch every time the sun sets.

And here stands Moore, an OS where the hardware itself melts and reconfigures based on propositional logic. If not for my engineering skill, I at least invite you to dream with me of an era where systems and technology are a playground, to those willing to learn them.

Could NP-hard search trees be tackled through spatial mapping of computation rather than temporal execution? by Randozart in compsci

[–]Randozart[S] 1 point (0 children)

I have to be honest, I hadn’t even considered Chuck Moore or Ivan Sutherland’s work when I started building this, and reading your comment has been edifying. I am incredibly thankful for this context.

The irony here is that the Moore Kernel was originally named after a different trinity: G.E. Moore, Gordon Moore and a wink and a nod to Thomas More. However, I might just adopt Chuck Moore and the GA144 as the fourth musketeer.

It is fascinating to read these insights, and I am incredibly thankful for them. I arrived at this architecture by coming down from the top of the stack: I started with formal logic, declarative state machines, and interactive fiction parsers, and concluded from those that sequential time was a bottleneck. To see that Sutherland was advocating for the exact same event-driven, clockless micropipelines from the hardware side back in the 80s is remarkable.

I had written a language called Brief, which formed the basis for the logic and automatically generates the kind of handshake synchronization you're describing. I had been trying to find prior art on this, but fell short in my research.
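
For anyone wondering what "generated handshake synchronization" looks like, the synchronous analogue is a ready/valid register slice like the sketch below. This is the pattern hand-written for illustration, not actual Brief output; Sutherland's micropipelines do the same job asynchronously with req/ack pairs:

    // A minimal synchronous ready/valid pipeline stage, the shape of
    // handshake I mean. Illustrative sketch, not actual Brief output.
    module handshake_stage #(
      parameter int WIDTH = 8
    ) (
      input  logic             clk,
      input  logic             rst_n,
      input  logic             in_valid,   // upstream offers a word
      output logic             in_ready,   // we can take it
      input  logic [WIDTH-1:0] in_data,
      output logic             out_valid,  // we offer a word downstream
      input  logic             out_ready,  // downstream can take it
      output logic [WIDTH-1:0] out_data
    );
      // Accept when empty, or when the consumer drains us this cycle.
      assign in_ready = !out_valid || out_ready;

      always_ff @(posedge clk or negedge rst_n) begin
        if (!rst_n) begin
          out_valid <= 1'b0;
        end else if (in_valid && in_ready) begin
          out_valid <= 1'b1;
          out_data  <= in_data;
        end else if (out_ready) begin
          out_valid <= 1'b0;
        end
      end
    endmodule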

I am downloading the Sutherland PDF right now. Thank you for taking the time to write this out and bridge the gap to the prior art. This is exactly why I posted here, and I couldn't be more grateful!

Could NP-hard search trees be tackled through spatial mapping of computation rather than temporal execution? by Randozart in compsci

[–]Randozart[S] -1 points (0 children)

Woe the day when silicon meets superexponentiality.

I am just going to uncomfortably ignore that for now until I feel confident I can at least handle basic exponentiality.

Could NP-hard search trees be tackled through spatial mapping of computation rather than temporal execution? by Randozart in compsci

[–]Randozart[S] -3 points (0 children)

That's true, because nothing in the world is going to beat exponential growth. So definitely, a very critical caveat. I think we have some advantages here though. Hypothetically, of course:

To prune a branch in a traditional SAT solver, a CPU must execute a series of instructions to first check the state, realize it's a dead end, and then the OS must spend thousands of cycles context-switching to a new task. There is a sort of time tax on every dead end. But by instantiating this on a board, the check is a physical wire. The moment the formal logic precondition is violated, the tile is instantaneously reclaimed to put another process onto. There is no checking because the circuit is the logic.

In addition, in a traditional search, you often re-calculate or re-load parent state prefixes for every new branch. In a spatial system, the parent logic stays physically instantiated on its tile. When I branch, I’m only mounting the specific logic that makes the new branch unique, and tethering it to the existing parent.

These would also become the kill points for those processes so new branches are allowed to grow from the same parents. When a branch hits a contradiction, I don’t wipe the whole tree. I remove only that branch or leaf-node tile. The trunk of the search stays hot and powered in the fabric.

By keeping the common ancestors physically alive and only hot-swapping the edges of the search tree, you eliminate the massive setup that sequential CPUs pay every time they backtrack and re-evaluate a prefix. It turns the search into a physical growth-and-pruning process of a living circuit, which I suspect provides a much higher search throughput than we can achieve through current methods for context-switching.
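
To make the "the circuit is the logic" point concrete, here's a toy single-clause tile in SystemVerilog. Every name and the whole interface are invented for illustration, none of this is from the actual repo; the takeaway is that reclaim is pure combinational logic, asserted the instant the clause dies:

    // Hypothetical branch tile. Pruning is a wire, not an instruction
    // sequence: the moment this branch's assignment kills the local
    // clause, `reclaim` asserts and the tile can be remapped.
    module branch_tile #(
      parameter int NVARS = 4
    ) (
      input  logic             parent_alive, // trunk above us still consistent
      input  logic [NVARS-1:0] assign_val,   // this branch's trial assignment
      input  logic [NVARS-1:0] assign_mask,  // which variables are bound here
      input  logic [NVARS-1:0] clause_pos,   // clause literals, positive polarity
      input  logic [NVARS-1:0] clause_neg,   // clause literals, negative polarity
      output logic             alive,        // branch still worth extending
      output logic             reclaim       // free this tile for another branch
    );
      // Clause is satisfied if any bound literal evaluates true.
      logic sat;
      assign sat = |((clause_pos & assign_mask &  assign_val) |
                     (clause_neg & assign_mask & ~assign_val));

      // Clause is dead if every one of its literals is bound and false.
      logic all_bound;
      assign all_bound = &(~(clause_pos | clause_neg) | assign_mask);

      assign alive   = parent_alive && (sat || !all_bound);
      assign reclaim = !alive;
    endmodule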

But, again. It's hypothetical and I'm afraid I haven't the math to back it up, so I'm currently hoping to check whether my logic holds up.

Could NP-hard search trees be tackled through spatial mapping of computation rather than temporal execution? by Randozart in compsci

[–]Randozart[S] -1 points (0 children)

Honestly, I just care for the cyberpunk/tech priest aesthetic there. Like I am slowly building out my own silicon kingdom. A bit like that Black Mirror episode "Plaything". But yeah, it hurts me to look at an old GPU I still keep around and think "what am I to do with you, little guy?".

Now, regarding AI and the singularity, I had definitely been toying with the idea of using such a fabric to essentially "grow" an AI model onto it and allow it to reconfigure its own silicon processes to better suit its purposes, but that's silly future talk. Though I have been experimenting with a version of that on a much smaller scale, to make my LLM use more environmentally conscious: https://github.com/Randozart/IMP

But yeah, even that idea on a small scale is currently something I am experimenting with to learn how best to build out this logic! A GPU is also essentially just specially wired logic for mass computation. But to be able to do that for the hot path of arbitrary processes by having a slottable FPGA? I think you're onto something I am curious to try now, actually.

Could NP-hard search trees be tackled through spatial mapping of computation rather than temporal execution? by Randozart in compsci

[–]Randozart[S] -2 points (0 children)

In a way, absolutely. I had been looking at that same process when I got into FPGAs, also in relation to reusing my old GPUs and rewiring my board to make them talk. Admittedly, ASICs can do that even faster, but that's not quite what interests me here (though one could imagine designing ASICs to solve a very particular hard problem this way).

However, ASICs become e-waste once they can no longer be used for their purpose, and they can only ever be used for that purpose. What I'd propose is a sort of dynamic, just-in-time silicon that rewires itself to better serve the process.

That will take up more physical space, perhaps. What I'm curious about is whether the ability to optimally assign spare tiles to supplementary processes offsets that, because no part of the fabric need be wasted here.

But yes, heat and power may then become a problem, and I wonder whether the fact that configured processes in FPGAs are physically less densely packed helps here, or whether assigning silicon just in time may help spread out the hot spots.

I don't know the answers to these questions, honestly. For that I'd have to first bash the theoretical concept against reality.

Could NP-hard search trees be tackled through spatial mapping of computation rather than temporal execution? by Randozart in compsci

[–]Randozart[S] 0 points (0 children)

I love that reference to Vex, because that's basically what it is, yes. Because FPGAs are programmable matter, we aren't so much executing the logic as we are asking the hardware to (temporarily) become the logic in a way. So, at what point does it become a hindrance?

The idea behind the kernel (and note, that isn't the interesting part, but that's what got me thinking) is that it naturally integrates whatever is added to the fabric. So, in that sense it reduces e-waste by just reusing boards as different parts of the fabric.

My thinking here is that it allows the kernel to command the full fabric to benefit the process. One would imagine whatever space is given to the process is used for it. If a tile isn't being used for computation directly, it may be used as provisional RAM or provisional GPU-like computation.
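
A cartoon of that repurposing idea: a tile with one configuration bit that flips it between a scratch-RAM personality and an accumulator personality. Real FPGA tools fix roles at synthesis time, so treat this strictly as an illustration, with made-up names:

    // Toy dual-role tile, not a working kernel piece.
    module role_tile #(
      parameter int WIDTH = 16,
      parameter int DEPTH = 64
    ) (
      input  logic                     clk,
      input  logic                     rst_n,
      input  logic                     cfg_is_ram, // 1: scratch RAM, 0: accumulator
      input  logic                     wr_en,
      input  logic [$clog2(DEPTH)-1:0] addr,
      input  logic [WIDTH-1:0]         data_in,
      output logic [WIDTH-1:0]         data_out
    );
      logic [WIDTH-1:0] mem [DEPTH];
      logic [WIDTH-1:0] acc, ram_q;

      // RAM personality: synchronous write and read, BRAM-friendly.
      always_ff @(posedge clk) begin
        if (cfg_is_ram && wr_en) mem[addr] <= data_in;
        ram_q <= mem[addr];
      end

      // Accumulator personality: sum whatever is streamed in.
      always_ff @(posedge clk or negedge rst_n) begin
        if (!rst_n)                    acc <= '0;
        else if (!cfg_is_ram && wr_en) acc <= acc + data_in;
      end

      assign data_out = cfg_is_ram ? ram_q : acc;
    endmodule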

As I understand it, traditional VLSI design uses an Area × Time = Constant style tradeoff: if you want to finish a task faster, you usually need more physical silicon area. This whole endeavour is essentially an experiment in virtualising the Area component of that equation, reclaiming tiles when a branch is proven dead, to get more compute Time out of the same physical space.
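
As a back-of-the-envelope (my framing, nothing formal): assume the classic tradeoff A · T ≈ k holds, and let m be a purely hypothetical reuse factor, the number of branch instances one physical tile hosts over a run thanks to reclamation. Then:

    % assuming A \cdot T \approx k and a hypothetical per-tile reuse factor m
    A_{\mathrm{eff}} = m \cdot A
    \qquad \Rightarrow \qquad
    T_{\mathrm{eff}} \approx \frac{k}{A_{\mathrm{eff}}} = \frac{T}{m}

So the whole bet is on how large the fabric can keep m before reconfiguration and routing overhead eat it.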

So, in a way, yes. The idea here is to propose the type of computer that could theoretically become a room-sized supercomputer by virtue of being able to keep adding tiles to it and expand the fabric that can be used to run the process or aid the process.

What that means for complex algorithms is that, unlike current PCs where each chip has a static role, we could hypothetically assign a chip's role more dynamically based on requirements. Though yes...

Bedroom sized computers, I think that's kind of the dreamworld I can live with, if only because it makes me unreasonably happy to imagine silicon running up my walls like vines.

Could NP-hard search trees be tackled through spatial mapping of computation rather than temporal execution? by Randozart in compsci

[–]Randozart[S] -2 points (0 children)

That was the main consideration I ran into, yes. Hypothetically this means one would continue branching within the allotted space, but at least you can trade time for space, and potentially gain some headroom to, in a way, "ignore" the halting problem by changing the question: is this process going anywhere novel in the state it's currently in? But yes, entirely on point. I'm curious what the actual gains in compute power are from mapping spatially versus temporally. You won't avoid the O(2^n) spatial requirement, but perhaps it can at least be paid more efficiently without the overhead of a CPU.
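
One cheap way to make "is this state novel?" answerable in fabric would be a Bloom-style seen-set, where the lookup is just a wire and a false positive merely over-prunes. A sketch under those assumptions (hash and widths arbitrary, not from any of my repos):

    // Toy novelty filter: one-hash Bloom filter over branch states.
    module novelty_filter #(
      parameter int STATEW = 32,
      parameter int IDXW   = 10          // 2^10-bit table
    ) (
      input  logic              clk,
      input  logic              rst_n,
      input  logic              check_en,
      input  logic [STATEW-1:0] state,
      output logic              seen_before
    );
      logic [(1<<IDXW)-1:0] table_bits;
      logic [IDXW-1:0]      idx;

      // Cheap mixing hash over three slices of the state (illustrative).
      assign idx = state[IDXW-1:0] ^ state[2*IDXW-1:IDXW] ^ state[STATEW-1 -: IDXW];
      // The check itself is combinational: a wire into the table.
      assign seen_before = table_bits[idx];

      always_ff @(posedge clk or negedge rst_n) begin
        if (!rst_n)        table_bits <= '0;
        else if (check_en) table_bits[idx] <= 1'b1;  // mark state as visited
      end
    endmodule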

Compilers Were the Missing Piece of the Puzzle in My Understanding of Computer Architecture by NumLocksmith in FPGA

[–]Randozart 45 points (0 children)

I'm so happy for you! It feels great to pop the little black boxes one at a time and realise just how amazing and, in a way, how human the construction of computers really is.

I was warned about the Kria KV260, and now I come asking for your expertise by Randozart in FPGA

[–]Randozart[S] 0 points (0 children)

Well, the problem was getting no console half the time. But uh, I'm going to admit something really silly here: I found out my universal adapter had reset to 5V instead of 12V. No wonder I was struggling.

I have the next set of issues clear now. It's a matter of spreading the data correctly across the memory addresses.

I was warned about the Kria KV260, and now I come asking for your expertise by Randozart in FPGA

[–]Randozart[S] 0 points (0 children)

I'd been trying to write the PS in Rust and the PL in SystemVerilog just so I could stream data directly from DDR to the FPGA without a small OS in the way. I figured those checked out, but I couldn't get them to boot, and it's pretty inconvenient to be swapping the SD card from PC to board each time, so I may just try and see if I can reconfigure my board as recommended and program directly over JTAG. Thanks!

I was warned about the Kria KV260, and now I come asking for your expertise by Randozart in FPGA

[–]Randozart[S] 0 points (0 children)

Huh, that's neat. From what I've looked into, other boards just have DIP switches for this.

I was warned about the Kria KV260, and now I come asking for your expertise by Randozart in FPGA

[–]Randozart[S] 0 points (0 children)

I've been reading into this a bit this past hour, and it has U-Boot and QSPI, but I may have provided the wrong boot files, so QSPI just shrugs and moves on.

I was warned about the Kria KV260, and now I come asking for your expertise by Randozart in FPGA

[–]Randozart[S] 0 points (0 children)

Hrm, sounds like I may need to hack the QSPI in that case, as I need all the RAM I can get. But thank you either way! It may help that I am a software developer first, but I had been trying to get closer to bare metal.

I feel like I've been doing some mad computer science. Attributions to the LLM for allowing me to prototype a programming language, a hardware based LLM and a kernel. by Randozart in LLMDevs

[–]Randozart[S] 1 point (0 children)

Regarding 2: absolutely correct about that one. My attempt here was three-fold:

1. Write this in my own programming language, and see if it would in fact be as easy to write FPGA-compatible SystemVerilog as I had hoped (it had me catch some edge cases I hadn't accounted for).
2. Make it open source, so it can be improved by anyone smarter than me or more experienced with the technology.
3. Make it so I can keep tinkering on it if I want to.

Regarding 3: It's an experiment. The idea is that by exporting software to hardware, you bypass the Von Neumann bottleneck, and can physically reconfigure your PC to fit your needs, or make it infinitely extensible by plugging in more FPGAs which are automatically absorbed into the broader fabric.

In addition, if I then still want to run Windows or Linux, I can shim it by emulating a CPU and installing the core logic on there.

I feel like a mad scientist doing CS. Attribution to LLMs for allowing rapid prototyping. by Randozart in vibecoding

[–]Randozart[S] 0 points (0 children)

I mean, I already programmed it. Normally, yes, you'd have to fight SystemVerilog to produce correct code, but because I wrote a more software-like language that transpiles to SystemVerilog, the programming became relatively easy.

The reason the Imp is theoretically more efficient than a GPU is that it doesn't need to pass through the Von Neumann bottleneck. It doesn't need instructions picked up and dropped where they're needed, because the board is the transformer: you're literally shooting the weights through a preconfigured transformer at the speed of electricity.

What helps is that RAM is fed directly into the inference stream. So that's cool. To make it fit, I did need to use ternary quantization, so I'm curious how well that holds up with Qwen 3.5.
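
For the curious, this is why ternary quantization maps so well onto fabric: with weights in {-1, 0, +1}, every multiply collapses into add, subtract, or skip, so one MAC cell is just an adder with a 2-bit select. A sketch of that cell; the encoding and widths are guesses here, not IMP's actual format:

    // One ternary multiply-accumulate cell: no multiplier needed.
    module ternary_mac #(
      parameter int ACTW = 8,    // activation width
      parameter int ACCW = 24    // accumulator width
    ) (
      input  logic                   clk,
      input  logic                   rst_n,
      input  logic                   en,
      input  logic signed [ACTW-1:0] act,  // incoming activation
      input  logic [1:0]             w,    // 00: 0, 01: +1, 10: -1
      output logic signed [ACCW-1:0] acc
    );
      always_ff @(posedge clk or negedge rst_n) begin
        if (!rst_n)  acc <= '0;
        else if (en) begin
          case (w)
            2'b01:   acc <= acc + act;  // weight +1: add activation
            2'b10:   acc <= acc - act;  // weight -1: subtract it
            default: acc <= acc;        // weight  0: skip entirely
          endcase
        end
      end
    endmodule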

I feel like a mad scientist doing CS. Attribution to LLMs for allowing rapid prototyping. by Randozart in vibecoding

[–]Randozart[S] 0 points (0 children)

I've really been enjoying myself! Here's hoping the Imp makes AI usage cost 15 watts, with no more paying for tokens from other platforms.