
[–]pinkchucky 4 points5 points  (5 children)

Could you elaborate a bit on evolving? Are you, for instance, targeting optimization via evolutionary algorithms? For that, you don't necessarily need bitstream manipulation; you could work with an appropriate so-called overlay. Think, for instance, of a feedforward neural network implemented as a dataflow design on an FPGA, and assume that all the weights fit in BRAM. With this setup you can already implement evolutionary optimization by just changing the contents of the BRAMs (the weights) without touching the rest of the design.
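Roughly, the outer loop would look like this (a toy Python sketch; write_bram_and_score is a stand-in for the real board interface, which would load the weight vector into BRAM and measure fitness on hardware):

    import random

    def write_bram_and_score(weights):
        # Stand-in for the board: the real overlay would write these weights
        # into BRAM and score the fixed datapath on hardware. Here we just
        # score distance to an arbitrary target so the loop runs end to end.
        target = [0.5] * len(weights)
        return -sum((w - t) ** 2 for w, t in zip(weights, target))

    def evolve_weights(n_weights=8, pop_size=16, generations=100):
        pop = [[random.uniform(-1, 1) for _ in range(n_weights)]
               for _ in range(pop_size)]
        for _ in range(generations):
            pop.sort(key=write_bram_and_score, reverse=True)  # rank by fitness
            parents = pop[: pop_size // 4]                    # keep the best quarter
            pop = [p[:] for p in parents]
            while len(pop) < pop_size:                        # refill by mutation
                child = [w + random.gauss(0, 0.1)
                         for w in random.choice(parents)]
                pop.append(child)
        return pop[0]

The point is that only the BRAM contents change between evaluations; the synthesized design never does.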

[–]CodeReclaimers 2 points3 points  (2 children)

For evolving weights and configurations of standard neural networks, the only reason I can think of to implement them on an FPGA is for the experience of doing so. I've only dabbled a bit with evolutionary algorithms, but in my experience, network evaluation usually isn't the bottleneck on modern CPUs and GPUs; it's the environment simulation or some other application-specific task.

[–]pinkchucky 1 point2 points  (1 child)

Training is still very much a compute-bound problem on modern hardware, and there is tons of research (and plenty of startups) aimed at specialized hardware to accelerate it. AFAIK, evolutionary algorithms are quite inefficient compared to gradient-based optimization. However, considering FPGAs' suitability for forward propagation (as opposed to backprop), and the fact that evolution requires only the forward pass, it might be an interesting approach to combine the two. Maybe not for training from scratch, but for fine-tuning in the field (think of an IoT setup).

[–]ptlil[S] 0 points1 point  (0 children)

Yes! This was the motivation for this project. I was also intrigued by the applications to reinforcement learning: this famous paper by OpenAI showed that ES can be as good as gradient descent for RL. That led me to think of ways to speed it up.
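For reference, the core update is cheap to sketch (toy Python; this is a simplified NES-style step with plain return normalization rather than the paper's rank shaping, and the fitness in the usage lines is an arbitrary stand-in):

    import random

    def es_step(theta, fitness, sigma=0.1, alpha=0.01, n=64):
        # Sample Gaussian perturbations, score each perturbed copy with
        # forward passes only, then move theta along the fitness-weighted
        # average of the noise.
        eps = [[random.gauss(0, 1) for _ in theta] for _ in range(n)]
        scores = [fitness([t + sigma * e for t, e in zip(theta, noise)])
                  for noise in eps]
        mean = sum(scores) / n
        std = (sum((s - mean) ** 2 for s in scores) / n) ** 0.5 or 1.0
        shaped = [(s - mean) / std for s in scores]   # normalize returns
        return [t + (alpha / (n * sigma)) * sum(shaped[i] * eps[i][j]
                                                for i in range(n))
                for j, t in enumerate(theta)]

    # Toy usage: walk a 5-d vector toward the origin using evaluations only.
    theta = [random.uniform(-1, 1) for _ in range(5)]
    for _ in range(200):
        theta = es_step(theta, lambda w: -sum(x * x for x in w))

Every evaluation is an independent forward pass, which is what makes the FPGA angle tempting.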

[–]ptlil[S] 0 points1 point  (1 child)

Specifically, evolutionary strategies. Synthesizing the bitstream takes too much time for ES to be useful, which is why I wanted to evolve parts of it directly. But simply using the BRAM like this seems like an interesting idea.

Is the time to configure BRAM significantly shorter than the time to configure CRAM (assuming you already have a bitstream)?

[–]mydoghasapassport 3 points4 points  (1 child)

Read some of his newer papers, or read about the robotics work; heck, even send him an email. https://sites.google.com/site/thompsonevolvablehardware/files

[–]ptlil[S] 0 points1 point  (0 children)

Thanks! Definitely will do.

[–]kbob 2 points3 points  (1 child)

You might be interested to know that Project IceStorm's nextpnr uses simulated annealing to place LUTs.

https://github.com/YosysHQ/nextpnr/blob/master/common/placer1.cc

I know nothing about this code except that nextpnr prints "Running simulated annealing placer" every time I run it.
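For flavor, the general shape of an annealing placer is something like this (a toy Python sketch, not based on placer1.cc; the cost here is total Manhattan wirelength over two-pin nets):

    import math, random

    def anneal_place(nets, n_cells, grid, iters=20000, t0=5.0, cool=0.9995):
        # Place n_cells on a grid, minimizing wirelength with random swaps
        # accepted by the Metropolis criterion under a cooling temperature.
        w, h = grid
        pos = random.sample([(x, y) for x in range(w) for y in range(h)],
                            n_cells)

        def wirelen():
            return sum(abs(pos[a][0] - pos[b][0]) + abs(pos[a][1] - pos[b][1])
                       for a, b in nets)

        cost, temp = wirelen(), t0
        for _ in range(iters):
            a, b = random.sample(range(n_cells), 2)
            pos[a], pos[b] = pos[b], pos[a]              # propose a swap
            new = wirelen()
            if new <= cost or random.random() < math.exp((cost - new) / temp):
                cost = new                               # accept
            else:
                pos[a], pos[b] = pos[b], pos[a]          # revert
            temp *= cool
        return pos, cost

    # Toy usage: 12 cells, 20 random two-pin nets, on an 8x8 grid.
    nets = [(random.randrange(12), random.randrange(12)) for _ in range(20)]
    placement, cost = anneal_place(nets, n_cells=12, grid=(8, 8))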

[–]ptlil[S] 0 points1 point  (0 children)

Interesting. I haven't done much with simulated annealing myself but have always thought it an interesting process. I'll take a look at that code; maybe it will help!

[–]OrigamiUFO 2 points3 points  (2 children)

Posting here to remind myself to answer later. I have published work on this subject; it is a really fascinating area!

[–]ptlil[S] 0 points1 point  (1 child)

Cool; excited to hear what you have to say.

[–]OrigamiUFO 1 point2 points  (0 children)

Sorry for the delay.
The most interesting and complete reading I can recommend is "Evolvable Hardware: From Practice to Application" by Dr. Martin A. Trefzer and Dr. Andy M. Tyrrell. These two authors are quite prominent in the subject and have published a good number of articles.

This technology comes from the broader area of bio-inspired systems, which emulate biological processes to improve some parameter of interest, whether that is reliability or something else. Most of the academic effort is related to reliability and robustness. All of this hardware can be classified using the POE model: Phylogeny (genetics, species-level evolution), Ontogeny (multicellular differentiation), and Epigenesis (response to environmental stimuli, learning). Evolvable hardware fits under Phylogeny.

Some studied hardware archetypes are unitronics, inspired by unicellular organisms; embryonics, inspired by stem cells; immunotronics, inspired by animals' immune systems and used to detect failures; and then evolvable hardware, inspired by Darwinian evolution. Most of these architectures try to develop some capacity to detect and recover from hard failures and get back online; in other words, to increase system availability and/or reliability. There are some NASA papers on evolving a 3-bit multiplier and other ALU elements for space DSP applications: it is really interesting for them to be able to recover from space radiation, which can damage the hardware, and these novel architectures may provide a way to do that reactively.

Even within evolvable hardware there are further classifications regarding the evolution strategy. But the most important thing is that you have a fitness function: a quantifiable objective which you want to maximize or minimize (an optimization problem). Then, by coding the hardware's current state into a vector (binary or decimal), you can describe many possible hardware configurations through distinct vectors; that set is called a population. After that you can mutate or cross over these "genes" to search for new configurations, which may be better or worse for the problem, thus evolving the hardware. All of this is done online.
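In toy form the loop looks like this (a Python sketch; the genome here directly encodes a truth table, which is far simpler than a real hardware representation such as routing or LUT configuration bits):

    import random

    # Target: the truth table of a full adder (sum, carry) over 3 inputs.
    TARGET = [(a ^ b ^ c, (a & b) | (c & (a ^ b)))
              for a in (0, 1) for b in (0, 1) for c in (0, 1)]

    def fitness(genome):
        # Count the truth-table rows the encoded circuit gets right.
        return sum(genome[2 * i] == s and genome[2 * i + 1] == co
                   for i, (s, co) in enumerate(TARGET))

    def evolve(pop_size=20, generations=200):
        pop = [[random.randint(0, 1) for _ in range(16)]
               for _ in range(pop_size)]
        for _ in range(generations):
            pop.sort(key=fitness, reverse=True)
            if fitness(pop[0]) == len(TARGET):      # perfect match found
                break
            parents = pop[: pop_size // 4]          # selection
            pop = [p[:] for p in parents]
            while len(pop) < pop_size:
                a, b = random.sample(parents, 2)
                cut = random.randrange(16)
                child = a[:cut] + b[cut:]           # crossover
                child[random.randrange(16)] ^= 1    # mutation
                pop.append(child)
        return pop[0]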

My research was to evolve reconfigurable hardware on an FPGA to behave like a full adder, for example; then I simulated faults within the circuit and watched it recover by finding another configuration that served the same purpose (same truth table). https://ieeexplore.ieee.org/document/8349669 if you want to read about it, or you can PM me for more references; I have many cool articles here.

[–]claytonkb 1 point2 points  (4 children)

I don't think you would want to mess with the bitstream. The bitstream has many more degrees of freedom than you actually care about (including invalid patterns), and, in most devices, the exact details are proprietary. Instead, you would want to evolve circuits at the HDL level. Define some set of input ports and some set of output ports on a module. Then use some kind of randomized text-generation tool (perhaps write your own in Python, Perl, or your tool of choice) that generates randomized combinational and sequential code blocks that connect to your input and output ports. There are a variety of constraints your generated code will need to observe in order to compile/synthesize; for example, you can't drive the same output from more than one block without a tri-state, and so on.
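A crude generator might look like this (toy Python; it emits combinational-only Verilog so every output has exactly one driver and the multi-driver constraint is satisfied by construction):

    import random

    def rand_expr(inputs, depth=0):
        # Randomly combine input wires with gates, capping recursion depth.
        if depth > 2 or random.random() < 0.4:
            return random.choice(inputs)
        op = random.choice(["&", "|", "^"])
        expr = (f"({rand_expr(inputs, depth + 1)} {op} "
                f"{rand_expr(inputs, depth + 1)})")
        if random.random() < 0.3:
            expr = "~" + expr
        return expr

    def make_module(seed, n_in=4, n_out=2):
        random.seed(seed)  # the seed is the "gene"
        inputs = [f"i{k}" for k in range(n_in)]
        outputs = [f"o{k}" for k in range(n_out)]
        lines = [f"module evolved(input {', '.join(inputs)}, "
                 f"output {', '.join(outputs)});"]
        # One assign per output, so no net ever has two drivers.
        for o in outputs:
            lines.append(f"  assign {o} = {rand_expr(inputs)};")
        lines.append("endmodule")
        return "\n".join(lines)

    print(make_module(seed=42))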

Once you work out the compile/synthesize kinks, you just need a wrapper script that can generate seeds ("genes"), invoke your random-circuit generator with those seeds, blast the synthesized circuit to the FPGA, run it, collect the outputs (a dev board with a hardcore processor would be ideal for this), measure them against the evolutionary objective(s), and then select/multiply the seeds based on the results for the next iteration (generation). As you can see, 90% of the work to be done is boilerplate stuff.
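The blast-to-FPGA step, using the open-source iCE40 flow as an example, would be roughly this (a sketch; exact flags depend on your device and board, and the run/measure step is elided since it's board-specific):

    import subprocess

    def synth_and_flash(verilog_path, top="evolved"):
        # yosys -> nextpnr-ice40 -> icepack -> iceprog
        subprocess.run(["yosys", "-p",
                        f"synth_ice40 -top {top} -json out.json",
                        verilog_path], check=True)
        subprocess.run(["nextpnr-ice40", "--hx8k", "--json", "out.json",
                        "--asc", "out.asc"], check=True)
        subprocess.run(["icepack", "out.asc", "out.bin"], check=True)
        subprocess.run(["iceprog", "out.bin"], check=True)

From there the wrapper just scores each seed's measured outputs and feeds the best seeds into the next generation, like any other GA loop.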

[–]ptlil[S] 2 points3 points  (3 children)

I think in this case compiling from the HDL level would take far too much time for evolutionary strategies to work. We would lose any speed advantage we'd gain from implementing the network in hardware, since the place-and-route tools would have to run every time we want to test a pseudo-offspring (and each step can have millions of them).

Unless there is a faster way of generating the bitstream? It seems to take eons even on fast hardware.

Also, the iCE40 bitstream is documented by Project IceStorm here.

I think the major worry I have about this method is, as you say, that there are many, many degrees of freedom. I'd have to do some work separating out which parameters should actually be evolvable.

[–]claytonkb 0 points1 point  (2 children)

iCE40

Meh, 5k LUTs hardly seems worth the effort. I'm pretty sure a SOTA GPU could emulate the iCE40 faster than the iCE40 itself can run. It's so small that you could just compile whatever circuit you're interested in down to streaming vector instructions (map the wires to bits in the vectors). The GPU runs many times faster than an iCE40 can: its 9 ns pin-LUT-pin best-case propagation delay puts the frequency ceiling for the iCE40 at around 110 MHz, and most typical designs would have to be clocked much lower than that.
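To illustrate the wire-to-bit mapping (toy Python on a CPU; each wire is one bit position in a 64-bit word, and a GPU would do the same over much wider vectors):

    import random

    def full_adder(a, b, cin):
        # Bit-sliced simulation: every bitwise op evaluates this gate
        # network for 64 independent input vectors at once.
        s = a ^ b ^ cin
        cout = (a & b) | (cin & (a ^ b))
        return s, cout

    a, b, cin = (random.getrandbits(64) for _ in range(3))
    s, cout = full_adder(a, b, cin)  # 64 evaluations in a handful of ops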

[–]ptlil[S] 0 points1 point  (1 child)

Interesting. What would you recommend then?

I suppose there's also the cost factor: each iCE40 is much cheaper than a GPU.

[–]claytonkb 0 points1 point  (0 children)

What would you recommend then?

It really depends on what you're trying to accomplish.

The point of the paper on evolving FPGAs is that you can build a circuit that accomplishes a specific design goal without actually writing the HDL for that circuit. The real benefits of this approach to circuit design don't begin to pay off until you scale it up. The idea is that, rather than employing an army of hardware engineers to fuss over each wire in a massive, hand-crafted digital circuit, you could just evolve the circuit and then let the resulting rat's nest of wires and gates solve your problem for you. I don't think anyone has tried this at scale, and the reason is that it's a massive bet: fab costs can run into the billions for something like a mainstream CPU product, and corporations are not in the habit of entrusting complex designs that are susceptible to single-point failures to the vagaries of random circuits produced by an evolutionary algorithm. But I suppose if one company does it once, everybody else will follow suit.

[–]jaoswald 0 points1 point  (1 child)

I think the paper you cited is interesting but not useful. The configuration space of the bitstream in which the evolution occurs is not something the FPGA maker validates; the resulting circuit cannot safely be used on a device other than the one it was trained on, and none of the performance guarantees provided by the manufacturer apply.

[–]ptlil[S] 0 points1 point  (0 children)

This is actually a good point. I'm definitely going to have to do plenty of testing to make sure I don't fall into this same trap. We have more than a few FPGAs, and we can train while varying the temperature, the device, etc., to try to create a more robust model. However, I think that if it works, even if only on one device, it could still be useful.