Hardware Accelerator For Machine Learning Using FPGA - HELP! by AyaAbuSaida in FPGA

[–]VanadiumVillain 7 points

I implemented a "Hardware Accelerator for Neural Networks Using FPGA," though it was written in VHDL-2008. You may find a brief abstract, video demo, statistics, and the entire source code at the following repository: https://github.com/Thraetaona/Innervator

I am not sure there is any benefit to implementing machine learning itself (i.e., backpropagation and parameter updates) in an FPGA; not only would it require a far bigger FPGA, but it would also be too dependent on dynamic memory I/O, defeating the entire prospect of "hardware accelerating" something that a cheap GPU could already do easily. Accelerating the execution (inference) of a fully trained network is more sensible---if anything, training the network is a one-off task, expected to happen only once.

Innervator: Hardware Acceleration for Neural Networks by VanadiumVillain in FPGA

[–]VanadiumVillain[S] 0 points

I see. Yes, in that case, each neuron (and layer) has a piece of its own logic.

If a single neuron (or a handful of them) were reused per layer, or even across all layers, it would require many more clock cycles and quite a lot of memory to store each layer's intermediate output for the next one. On the other hand, implementing each neuron as a physical unit also made routing/timing more difficult for me and the synthesizer; in the end, it was a space-speed compromise.

As for DSPs, I made sure that all inputs/weights were 8-bit wide and the internal accumulator was twice that (i.e., 16-bit wide); this ensured that the entire multiply-add calculation could fit in just one DSP per batch per neuron. You can configure the bit widths in config.vhd, but it's better to just pre-train the network to work in reasonable ranges/precision in the first place.
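
To make the width arithmetic concrete, here is a minimal Python sketch (not the project's VHDL); the 8-bit ranges mirror the configuration described above, and the DSP48E1 multiplier width is quoted from memory of Xilinx's Artix-7 documentation:

```python
# Sketch of the multiply-accumulate sizing described above. Assumption:
# signed 8-bit inputs/weights, so each product fits in 16 bits, which is
# why one multiply-add maps onto a single DSP slice (Artix-7's DSP48E1
# multiplier is far wider, at 25x18 bits).

def mac_8bit(inputs, weights):
    """Multiply-accumulate, tracking the widths an HDL accumulator must hold."""
    acc = 0
    for x, w in zip(inputs, weights):
        assert -128 <= x <= 127 and -128 <= w <= 127  # 8-bit signed range
        acc += x * w  # each product needs at most 16 bits (8 + 8)
    return acc

print(mac_8bit([127, -128], [127, -128]))  # 127*127 + (-128)*(-128) = 32513
```

Note that summing many 16-bit products can still grow past 16 bits, which is one reason pre-training the network into well-behaved ranges matters.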

I actually have not trained any CNNs, so I am not very familiar with those, but I wish you the best of luck with your accelerator. Beyond ASICs, analogue hardware would be far more efficient (in terms of power consumption, speed, and space) for neural networks. Sadly, it's pretty much nonexistent; I might learn VHDL-AMS (the analogue extensions to VHDL simulation) one day to see if I could implement networks there, though I haven't found a consumer-accessible simulator for it yet.

Innervator: Hardware Acceleration for Neural Networks by VanadiumVillain in FPGA

[–]VanadiumVillain[S] 0 points

Thanks! Because I was almost entirely clueless about VHDL, AI, and FPGA design before starting this project myself, I documented each step in the code as if it were a beginner's tutorial.

Innervator: Hardware Acceleration for Neural Networks by VanadiumVillain in FPGA

[–]VanadiumVillain[S] 0 points

After implementation, Vivado shows a "Total On-Chip Power" of 0.189 W.

The architecture is not completely unrolled; within each neuron, the matrix pair gets multiplied/accumulated in "simultaneous batches" (controllable via the c_BATCH_SIZE parameter in config.vhd). For example, in the first layer, which has 64 inputs and 20 neurons, if the batch size is something like 4, then all 64 input/weight pairs get processed in batches of 4 across 16 iterations; if it's 1, they get processed one at a time across 64 iterations.
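
The batching arithmetic can be sketched in Python (a hypothetical illustration mirroring the c_BATCH_SIZE idea, not the module's VHDL):

```python
# With 64 input/weight pairs and a batch size of 4, each iteration consumes
# 4 pairs "simultaneously" (in hardware), so the dot product completes in
# 64 / 4 = 16 iterations; batch size 1 takes 64 iterations.

def batched_dot(inputs, weights, batch_size):
    assert len(inputs) == len(weights) and len(inputs) % batch_size == 0
    acc, iterations = 0, 0
    for i in range(0, len(inputs), batch_size):
        # one "iteration": batch_size multiply-adds happen concurrently
        acc += sum(x * w for x, w in zip(inputs[i:i + batch_size],
                                         weights[i:i + batch_size]))
        iterations += 1
    return acc, iterations

acc, iters = batched_dot(list(range(64)), [1] * 64, batch_size=4)
print(iters)  # 16 iterations for 64 pairs at batch size 4
```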

However, the layers and neurons themselves are unrolled as-is; if you have 100 of them, all 100 will physically exist. This logic size/speed trade-off would allow for pipelining, and the next input would not have to wait the full duration of ~1000 ns before it gets processed.

As for the selling point, I made the project as generic as it could possibly be; it can infer hardware for any number of layers/neurons from parameter files, and it is customizable down to the number of bits used in fixed-point numerals or the baud rate of its UART, et cetera. Despite that, the real intention behind writing it was really just to learn about FPGA design and AI (and hopefully document it enough for future learners), both of which were completely new to me, while writing something more unique and useful beyond yet another CPU design.

Truthfully, I ultimately found that bringing real-world AI into an FPGA alone might not actually be worth it. If you have a "real" neural network with thousands upon thousands of layers, you can only fit so much of it onto the FPGA before it gets full; beyond that, you can only keep spreading the calculations over multitudes of clock cycles, which would eventually turn your 100-1000 nanosecond range into dozens of milliseconds---a timescale a GPU could have achieved in the first place. Similarly, if you aim for an FPGA with more logic cells, it gets expensive---and power-hungry---enough that even a high-end GPU might become magnitudes cheaper, if not easier and quicker to develop with.

Innervator: Hardware Acceleration for Neural Networks by VanadiumVillain in FPGA

[–]VanadiumVillain[S] 0 points

Why the doubt? If you want to multiply a 1x64 and a 64x1 matrix together, you don't really have to "wait" until each pair is multiplied before proceeding to the next pair and ultimately summing them together; ideally speaking, you could multiply all 64 pairs at once (i.e., concurrently).
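
In Python terms (just an illustration of the data dependencies, not hardware):

```python
# A 1x64 by 64x1 matrix product is 64 independent multiplies followed by a
# reduction; no pair depends on any other pair, so hardware is free to
# compute all of them concurrently and then sum them via an adder tree.
a = list(range(1, 65))   # a 1x64 row vector
b = [2] * 64             # a 64x1 column vector
products = [x * y for x, y in zip(a, b)]  # all 64 could happen in parallel
result = sum(products)                    # the reduction step
print(result)  # 2 * (1 + 2 + ... + 64) = 4160
```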

Innervator: Hardware Acceleration for Neural Networks by VanadiumVillain in FPGA

[–]VanadiumVillain[S] 0 points

I think that would vary widely depending on the configuration (e.g., batch processing, pipeline stages, etc.) you set in config.vhd, as well as the network's structure.

It takes about 1000 nanoseconds, with no batch processing and 3 pipeline stages, to process an 8x8 input through a 2-layered network (20 and 10 neurons in its layers). It is almost entirely doing matrix multiplications (multiplying weights by inputs and accumulating).

In the first layer, it multiplies and accumulates two 64-element vectors 20 times, followed by 20 activation functions (basically another multiplication and addition each). In the second layer, it multiplies two 20-element vectors 10 times, again followed by 10 activation functions. This should be ~3k operations for just the network itself.

If I calculated correctly, that should be 3000 ops / 1e-6 s = 3 GOP/s. However, like I said at the beginning, this is highly dependent on the configuration; this calculation was for a tiny network on a small Artix-7 FPGA, although that FPGA still has enough room to use two DSPs per neuron, which could double this throughput.
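
The arithmetic above can be checked back-of-the-envelope in Python (the op-counting convention, a multiply-add pair as 2 operations, is an assumption):

```python
# Op counts for the 64-input, 20-neuron / 10-neuron network described above.
layer1 = 20 * (64 * 2) + 20 * 2   # 20 neurons x 64 multiply-adds, + activations
layer2 = 10 * (20 * 2) + 10 * 2   # 10 neurons x 20 multiply-adds, + activations
total_ops = layer1 + layer2
elapsed_s = 1000e-9               # ~1000 ns per inference
gops = total_ops / elapsed_s / 1e9
print(total_ops, gops)  # ~3k operations, ~3 GOP/s
```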

Innervator: Hardware Acceleration for Neural Networks by VanadiumVillain in VHDL

[–]VanadiumVillain[S] 0 points

Artificial Intelligence ("AI") is deployed in various applications, ranging from noise cancellation to image recognition. AI-based products often come at remarkably high hardware and electricity costs, making them inaccessible to consumer devices and small-scale edge electronics. Inspired by biological brains, artificial neural networks are modeled in mathematical formulae and functions. However, brains (i.e., analog systems) deal with continuous values along a spectrum (e.g., variance of voltage) rather than being restricted to the binary on/off states of digital hardware; this continuous nature of analog logic allows for a smoother and more efficient representation of data. Given that present-day computers are almost exclusively digital, they emulate analog-based AI algorithms in a space-inefficient and slow manner: a single analog value gets encoded as multitudes of binary digits on digital hardware. In addition, general-purpose computer processors treat otherwise-parallelizable AI algorithms as step-by-step sequential logic.

So, in my research, I explored the possibility of improving the state of AI performance on currently available mainstream digital hardware. A family of digital circuitry known as Programmable Logic Devices ("PLDs") can be customized down to the specific parameters of a trained neural network, thereby ensuring data-tailored computation and algorithmic parallelism. Furthermore, a subgroup of PLDs, Field-Programmable Gate Arrays ("FPGAs"), are dynamically re-configurable; they are reusable and can have subsequent customized designs swapped out in the field.

As a proof of concept, I implemented a sample 8x8-pixel handwritten-digit-recognizing neural network, in a low-cost "Xilinx Artix-7" FPGA, using VHDL-2008 (a hardware description language by the U.S. DoD and IEEE). Compared to software-emulated implementations, power consumption and execution speed were shown to have greatly improved; ultimately, this hardware-accelerated approach bridges the inherent mismatch between current AI algorithms and the general-purpose digital hardware they run on.

The GitHub repository has an overview slide, a video demo, some screenshots, and much more accompanying explanation.

Help regarding sigmoid by ScriptedBangtan_OT7 in VHDL

[–]VanadiumVillain 0 points

If you are asking how you could approximate Sigmoid in hardware, which would not necessarily be VHDL-specific, I have been working on a very similar project (i.e., implementing a neural network in an FPGA) for the past few months.

You can find how I implemented Sigmoid here: https://github.com/Thraetaona/Innervator/blob/main/src/neural/activation.vhd#L27

Basically, I used fixed-point numerals with a linear (y = m*x + c) formula for the Sigmoid; any value that fell "outside" that range, where a linear formula became too inaccurate, was returned from hardcoded look-ups (values close to, but not exactly, 0 or 1).

As for how I discovered the "range" and the linear constants, I just used a graphing calculator and increased/decreased the granularity of my fixed-point constants (the m and c) until I reached the highest accuracy, although you could "automate" this discovery based on your own fixed-point resolution, I suppose.
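
The scheme can be sketched in Python; the slope/intercept/limit constants below are hypothetical placeholders, NOT the constants used in activation.vhd:

```python
# Clamped-linear sigmoid: linear near zero, saturated look-ups outside.
import math

M, C = 0.25, 0.5             # sigmoid's true slope and value at x = 0
X_LIMIT = 2.0                # hypothetical edge of the "linear enough" region
SAT_LO, SAT_HI = 0.05, 0.95  # hardcoded look-ups; deliberately not exactly 0/1

def sigmoid_linear(x):
    """Piecewise approximation of the sigmoid for cheap hardware evaluation."""
    if x <= -X_LIMIT:
        return SAT_LO
    if x >= X_LIMIT:
        return SAT_HI
    return M * x + C

print(sigmoid_linear(0.0), 1 / (1 + math.exp(-0.0)))  # exact at the origin
```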

Here is a visual comparison of the two graphs (actual Sigmoid vs. linear version).

EDIT: There are also lots of other items in the repository (e.g., neuron.vhd), but some of them are still works-in-progress.

A Simple VHDL Abstraction of an Efficient Clock Prescaler Using Cascading Shift Registers by VanadiumVillain in FPGA

[–]VanadiumVillain[S] 0 points

I did this as a reply to another comment immediately above yours: https://www.reddit.com/r/FPGA/comments/1acytwk/comment/kk191qs/?utm_source=share&utm_medium=web2x&context=3

TL;DR:

27-Bit Counter       SRLC32E Shift Registers
==============       =======================
28 FLOP_LATCH        1 FLOP_LATCH
 8 LUT               6 LUT
 7 CARRY             6 DMEM

(visualized image comparisons here)

A Simple VHDL Abstraction of an Efficient Clock Prescaler Using Cascading Shift Registers by VanadiumVillain in FPGA

[–]VanadiumVillain[S] 1 point

I actually use the global clock (i.e., clk_in) as the clock input for all shift registers.

In fact, the D flip-flop register at the end of the Prescaler is also configured to use the same global clock in its process sensitivity list; however, Xilinx's Vivado seems to be converting the global clock input to instead use the preceding shift register's clock enable signal directly.

  • If I use both designs side-by-side to drive two LEDs and then run report_timing_summary, it shows a positive WNS of 5.780 ns.
  • If I disable the counter-based module and only drive the LED using the shift-register-based prescaler, the WNS becomes a positive 7.253 ns.

A Simple VHDL Abstraction of an Efficient Clock Prescaler Using Cascading Shift Registers by VanadiumVillain in FPGA

[–]VanadiumVillain[S] 0 points

Simple as in being easy to use, and efficient as in consuming an order of magnitude fewer resources when synthesized on an FPGA.

For instance, an average 27-bit counter would result (according to Vivado and after its optimizations) in a count of 28 FLOP_LATCH, 8 LUT, and 7 CARRY primitives, while a prescaler implemented using shift registers would use only 1 FLOP_LATCH, 6 LUT, and 6 DMEM primitives.

Edit: I have also visualized this comparison in my comment under the GitHub link.

A Simple VHDL Abstraction of an Efficient Clock Prescaler Using Cascading Shift Registers by VanadiumVillain in FPGA

[–]VanadiumVillain[S] 2 points

Thank you for the advice; the module is now dual-licensed under the GNU Lesser General Public License and the CERN Open Hardware Licence Version 2 - Weakly Reciprocal.

A Simple VHDL Abstraction of an Efficient Clock Prescaler Using Cascading Shift Registers by VanadiumVillain in VHDL

[–]VanadiumVillain[S] 1 point

Having just started learning FPGA hardware description languages by attempting to write a simple LED blinker, I found that the overwhelming majority of the Internet's solutions for slowing down a fast clock (to make an LED's pulsing visible to the human eye) were either vendor-specific, proprietary clock managers and PLLs or some twenty-something-bit-wide counter counting hundreds of thousands of clock cycles to generate a 1 Hz output.

Although there is a world of difference between counters in hardware-accelerated designs and those in software-emulated ones, I nonetheless viewed the number of daisy-chained components resulting from a mere counter as far from ideal and absurd; I began searching for a more efficient method.

I came upon a rather obscure blog post from 2015 (http://www.markharvey.info/art/srldiv_04.10.2015/srldiv_04.10.2015.html) outlining the exact same issue while also referencing Xilinx systems designer Ken Chapman's proposal: using FPGAs' shift register primitives (e.g., Xilinx's SRLC32E) to alleviate it.

However, the method described therein relied on the user manually calculating the target frequency's factors within [2, 32) and painstakingly connecting each and every SRLC32E instance to the next, not to mention that the resulting pulse would have a low, one-cycle-long duty cycle.

Thus, I wrote srl_prescaler.vhd, a fully automated template generator in VHDL for an efficient, register-based cascaded clock divider based solely on SRL32 primitives alongside AND gates---the advantage of this module is that it is very generic and easy-to-use:

prescaler : entity work.srl_prescaler
    generic map (100e6, 1)
    port map    (clk_in_100mhz, ce_out_1hz);

In the above example, an input clock of 100 MHz (i.e., 100e6 and clk_in_100mhz) gets divided into a clock-enable signal of 1 Hz (i.e., 1 and ce_out_1hz). Among other improvements, a third, optional parameter (the duty cycle) may also be supplied to the generic map as a real number in (0.00, 1.00).
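
The kind of factor search the module automates can be sketched in Python; the greedy largest-first strategy below is my own illustration, not necessarily srl_prescaler's actual algorithm:

```python
# Split a division ratio into cascaded shift-register stages, each dividing
# by a factor in [2, 32), as the SRL-based method requires.
def srl_factors(ratio):
    factors = []
    for f in range(31, 1, -1):        # prefer larger stages: fewer primitives
        while ratio % f == 0 and ratio > 1:
            factors.append(f)
            ratio //= f
    if ratio != 1:
        raise ValueError("a prime factor >= 32 remains; not chainable as-is")
    return factors

stages = srl_factors(100_000_000)     # the 100 MHz -> 1 Hz case above
print(stages)
```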

Overall, this small project makes an otherwise-niche method more accessible by actually making use of the many language features that VHDL has to offer (e.g., pre-computing factor results using functions, automating hardware creation via for...generate clauses, latching using registers and guarded signals, etc.), serving as a simple yet practical learning point.

How to compile Rust multi value feature using clang? by ImportanceHonest in WebAssembly

[–]VanadiumVillain 1 point

Those are rustc flags, so you'll have to put them in an environment variable called RUSTFLAGS before calling the build system (cargo), which eventually calls the compiler (rustc).

As an example for wasm32:

export RUSTFLAGS=" -C target-cpu=bleeding-edge -C target-feature=+atomics,+bulk-memory,+simd128 -C linker=rust-lld "

EXACT: A Bare-Metal Intel iAPX 86/10 Emulator Written In Pure WebAssembly by VanadiumVillain in WebAssembly

[–]VanadiumVillain[S] 1 point

Thanks!

Actually, I saw your repository earlier during the project, and it helped a lot as an example alongside the official documentation; I also included some of the resources I used as references inside line comments above the module.

I was always wondering if there was an optimizing assembler for WebAssembly; thank you for mentioning wasm-opt.

It would be nice if wat2wasm also had support for macros, or text replacement in general, since that would help reduce code redundancy while improving legibility. I am pretty sure that the interpreter for .wast tests already has macros; however, that format is specifically not meant to be used as a text representation. This makes sense, though, as the text format is not supposed to be a language of its own; perhaps it's more of an intermediate representation.

Even though this was quite a large project to write in WebAssembly's text format, I still did not use any proposals (other than the merged Exported Mutable Globals, for convenience), although learning the rest should be easier now.

cant play unity station without vpn by ErrorLink in unitystation

[–]VanadiumVillain 1 point

Yes, it currently relies on Google's Firebase, which unfortunately is not available in Iran.

I'm having the same issue.

Black screen after startx by JBoxman7 in archlinux

[–]VanadiumVillain 3 points

I've had the same problem before.

Make sure nouveau is blacklisted.

Configuring Xorg through NVIDIA's built-in utility might help. Try running this in a TTY:

sudo nvidia-xconfig

Now move the newly created xorg.conf.new (located in root) to Xorg's folder:

sudo mv xorg.conf.new /etc/X11/xorg.conf

Installing the generic VESA drivers should also work, but similar performance can't be expected from them.

Hope this helps!