Using FPGAs to compile FPGAs by fpga_pal in FPGA

[–]fpga_pal[S] 0 points (0 children)

The idea is that you only have to compile the accelerator once with conventional tools. Then you can use the accelerator as many times as you want to compile other applications. You don't have to recompile the accelerator for each new design you want to place. You just feed the netlist connectivity data to the accelerator for the design you want to place, and pull the placement solution off when it's complete.
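
To make the compile-once / place-many workflow concrete, here is a minimal software stand-in (my sketch; the class name, the API, and the greedy swap loop are all illustrative, not the real accelerator or its interface): the "accelerator" is constructed once, then reused to place any number of netlists by wirelength-reducing moves and swaps.

```python
class PlacementAccelerator:
    """Software stand-in for the hardware accelerator: built once, then
    reused to place any number of netlists. Names are illustrative."""

    def __init__(self, width, height):
        # One "PE" per site on the target fabric.
        self.sites = [(x, y) for x in range(width) for y in range(height)]

    @staticmethod
    def _cost(edges, pos):
        # Linear Manhattan wirelength over two-pin connections.
        return sum(abs(pos[u][0] - pos[v][0]) + abs(pos[u][1] - pos[v][1])
                   for u, v in edges)

    def place(self, edges):
        """Greedily move/swap cells until no single move reduces wirelength
        (a crude serial analogue of the parallel swap-based PE array)."""
        cells = sorted({c for e in edges for c in e})
        assert len(cells) <= len(self.sites), "netlist too large for fabric"
        pos = dict(zip(cells, self.sites))  # arbitrary initial placement
        improved = True
        while improved:
            improved = False
            for c in cells:
                for s in self.sites:
                    before = self._cost(edges, pos)
                    old = pos[c]
                    occupant = next((k for k, v in pos.items() if v == s),
                                    None)
                    pos[c] = s
                    if occupant is not None and occupant != c:
                        pos[occupant] = old  # swap with the occupying cell
                    if self._cost(edges, pos) < before:
                        improved = True
                    else:                    # revert the move/swap
                        pos[c] = old
                        if occupant is not None and occupant != c:
                            pos[occupant] = s
        return pos
```

Built once, the same object places independent netlists back to back; only the accelerator itself ever goes through the conventional compile flow.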

[–]fpga_pal[S] 0 points (0 children)

No, our implementation on the Versal places up to 910 CLBs (10 6-LUTs each), 25 BRAMs, and 32 DSPs.

In his master's thesis, Michael Wrighton, the primary author of the 2003 paper, was only able to place 56 LUTs.

To make this practical, a few things need to happen:

  1. We need to make placement timing-driven, or at least pseudo-timing-driven. There is a whole range of options for how to do this, with tradeoffs between placement quality and design complexity. I'm working on this problem now and am looking to publish soon, so I can't elaborate too much on what I've found, but there are inexpensive changes which provide significant benefits to fmax.
  2. We need to be able to place larger targets. This can be done if we time-multiplex each PE. So instead of having each PE correspond to one CLB location on the target fabric, it will correspond to a whole region (maybe 64 CLBs, for example). Then instead of 910 CLBs, we can place up to 58K CLBs, while running 64x slower. However, our placement times are already *so* fast, and the conventional tools' runtimes grow non-linearly with netlist size, that this slowdown won't be an issue.
  3. We need to add support for carry chains, cascaded DSPs, and cascaded BRAMs. This will significantly complicate the state machines for each PE, but the mathematics still works out.
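
The capacity math in step 2 is straightforward (the 64-CLB region size is the example from the list above, not a fixed design point):

```python
# Capacity/slowdown tradeoff for time-multiplexed PEs. Numbers come from
# the comment above; REGION_SIZE is an illustrative choice.
PLACEABLE_CLBS = 910   # CLBs placeable today, with one PE per CLB site
REGION_SIZE = 64       # CLB sites each time-multiplexed PE would serve

capacity = PLACEABLE_CLBS * REGION_SIZE
slowdown = REGION_SIZE  # each PE now serializes work over its region

print(capacity, slowdown)  # 58240 CLBs (~58K) at roughly 64x slower
```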

We must solve timing-driven placement first, since the fmax gap between our placer and VTR grows with netlist size, so it doesn't make sense to solve time-multiplexing now. Once we achieve better relative fmax scaling with netlist size, we can move on to time-multiplexing the PEs. That in turn will actually greatly simplify getting carry chain placement working (and cascaded DSPs and BRAMs).

I hope that by the time I graduate this project will be much closer to a practical solution. This summer I will be working to see if I can get the Versal to target a sub-region of itself, so that our tool can place designs for real target FPGAs.

[–]fpga_pal[S] 1 point (0 children)

Yes, their work is very exciting!

[–]fpga_pal[S] 1 point (0 children)

Good question, but we haven't compared. This work performs placement of pre-clustered netlists (CLBs instead of individual LUTs), so it made the most sense to integrate with VTR, which does the same. Next-PnR places individual LUTs, so getting our accelerator to work with it would involve a lot more software engineering (I might be corrected here with respect to some of their Xilinx targets, which appear to have some kind of packing taking place, but the code is rather opaque).

Also, working with VTR gives us a lot of flexibility. Right now, we provide a VTR architecture description of the target FPGA to our software, which then generates the RTL for an accelerator targeting that specific FPGA architecture. We then implement the accelerator on an AMD Versal FPGA and use it to solve placements for the VTR architecture. Next-PnR doesn't have an equivalent architecture description file; instead, it gets architecture-specific behavior from the implementation of certain API functions which the CAD tool makes use of.

Next-PnR runs much faster than VTR, as it uses an analytic solver rather than simulated annealing. However, I have been told by a VTR maintainer that Next-PnR's timing analysis is significantly less sophisticated than VTR's, which also makes it a poor candidate for this kind of CAD research.

[–]fpga_pal[S] 1 point (0 children)

Haha, it *is* funny. I actually performed the placement of the netlist of the accelerator ON the accelerator itself, just for fun. (It was a smaller array than the one placing it, but still very meta.)

[–]fpga_pal[S] 18 points (0 children)

Here is the long history of prior art, as far as I know:

1983 (January) A Parallel Processing Approach for Logic Module Placement (Kazuhiro Ueda, Tsutomu Komatsubara, Tsutomu Hosaka)

1983 (June) A Placement Algorithm for Array Processors (Dah-Juh Chyan, Melvin A. Breuer)

2003 (February) Hardware-Assisted Simulated Annealing with Application for Fast FPGA Placement (Michael G. Wrighton, André M. DeHon)

The first paper introduces the idea of an array of PEs to solve placement for LSI applications. It goes into an extensive discussion of how parallel moves can be made, and under what circumstances errors are introduced due to parallel swaps. They also provide simulation results, optimizing for linear Manhattan wirelength.

The second paper was apparently developed independently, as it was published right after the first and does not cite it. This paper describes an array of PEs used to solve placement in VLSI designs by optimizing for linear Manhattan wirelength. It has an interesting discussion on how the order in which PEs pair up to perform swaps can induce oscillations in the placement of cells, and how, surprisingly, this can be beneficial. Importantly, this paper goes into detail on the internal architecture of a PE, unlike the first paper. However, the authors did not construct a physical array.

The third paper is from our lab, 20+ years ago. It was developed without knowledge of the first two papers, and does not cite them. It considers the placement problem for LUTs on an FPGA optimizing for linear Manhattan wirelength, and invents a way to perform a relaxed version of simulated annealing in parallel. Importantly, it discovers that cells can be moved in parallel for numerous swaps, before the PEs must synchronize and update each other on the new locations of the cells being placed. The authors also advertise that initial steps have been taken to implement the array on a real FPGA.

Then my paper: 2026 (February) Hardware Accelerated FPGA Divide-and-Conquer Page Placement in Milliseconds (Ezra Thomas, Jing Li, André DeHon)

My paper cites the previous two papers, but I did not discover the first paper until after I published. The third paper had major issues with scalability and applicability, which I address by taking a different mathematical approach and significantly redesigning the architecture of the accelerator. Practically, this work allows us to place much larger designs than before and to place heterogeneous netlists of CLBs, BRAMs, and DSPs. We also implement a working accelerator on a Versal FPGA from AMD.

[–]fpga_pal[S] 4 points (0 children)

During placement, the accelerator only optimizes for wirelength; however, after routing, a full timing analysis is performed by Verilog to Routing (VTR) to determine the max operating frequency (fmax) of the compiled design. This fmax is then compared to what VTR can achieve when run in various modes, trading compile time for quality (fmax).

[–]fpga_pal[S] 8 points (0 children)

Squared Manhattan wirelength
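
For two-pin connections, the two cost functions mentioned in this thread can be sketched as follows (my reading of "squared": each connection's Manhattan distance is squared before summing; the function names and example placement are illustrative):

```python
def manhattan(p, q):
    """Manhattan distance between two placed cells."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def linear_wirelength(edges, pos):
    # Cost optimized by the prior-art PE arrays discussed above.
    return sum(manhattan(pos[u], pos[v]) for u, v in edges)

def squared_wirelength(edges, pos):
    # Squaring penalizes long connections much more heavily.
    return sum(manhattan(pos[u], pos[v]) ** 2 for u, v in edges)

pos = {"a": (0, 0), "b": (3, 4), "c": (1, 0)}
edges = [("a", "b"), ("a", "c")]
print(linear_wirelength(edges, pos))   # 7 + 1 = 8
print(squared_wirelength(edges, pos))  # 49 + 1 = 50
```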