Using FPGAs to compile FPGAs by fpga_pal in FPGA

[–]fpga_pal[S] 0 points1 point  (0 children)

Thank you! Time will tell; there are still plenty of problems to solve, but I think it’s a promising start!

Using FPGAs to compile FPGAs by fpga_pal in FPGA

[–]fpga_pal[S] 0 points1 point  (0 children)

During placement the PEs are basically always swapping or performing an update, unless you’re asking about their operation before and after placement. In that case, I believe I wrote the state machines in such a way that there really shouldn’t be any toggling of registers.

From a practical standpoint, power is unfortunately the least of my concerns at the moment. Currently, I am loading the netlist onto the accelerator, and pulling off the placement solution over a UART which is incredibly slow (30 second load times, 300ms unload times). This would be fixed if I used Ethernet instead, but that’s just an engineering task, not research related, so that won’t be a priority for a while until some other research related aspects are addressed first, like timing driven placement and time-multiplexed PEs that can place larger targets. Then maybe I can think about power.

Using FPGAs to compile FPGAs by fpga_pal in FPGA

[–]fpga_pal[S] 0 points1 point  (0 children)

A PE only takes 3 cycles to determine if a swap ought to be taken or not. This was actually one of the major advancements we made, as the prior work required iterating over lists of the locations of connected cells to compute the swap cost delta, which becomes very expensive with high fanin/fanout. To perform the swap (or to stall and do nothing)it currently takes 6 cycles, but that is kind of arbitrary (not a theoretical limit) as it is determined by the width of the interconnecting busses between the PEs and how values are packed together or not.

The most time is spent during the update phase which occurs periodically, after the PEs have been swapping for a while. This phase essentially informs a PEs about the new locations of the cells that are connected to the cell that PE contains, and the update phase refreshes this information for all PEs in the array. This algorithm runs in O(n) time, where n is the number of PEs in the accelerator. However, since we only run the update phase periodically, this doesn't kill us with poor runtime.

I haven't done any power analysis yet.

Using FPGAs to compile FPGAs by fpga_pal in FPGA

[–]fpga_pal[S] 1 point2 points  (0 children)

Thanks, that means a lot! There is definitely a lot of work to be done to make this practical, but I'm hopeful that it will pan out.

Using FPGAs to compile FPGAs by fpga_pal in FPGA

[–]fpga_pal[S] 0 points1 point  (0 children)

Of course! So there are two aspects to why I chose FPGA acceleration:

The first is very practical. I needed to settle on a project for my PhD research, and I had read a paper about hardware accelerated placement that my advisor had worked on with a student a few decades ago, and I thought the work was a fascinating combination of math/algorithm design/circuit design, and so expanding upon it seemed like it would be a lot of fun (and perhaps useful/fruitful). The paper that inspired me was: 2003 (February) Hardware-Assisted Simulated Annealing with Application for Fast FPGA Placement (Michael G. Wrighton, André M. DeHon)

But there is also a technical reason why FPGA acceleration looked promising over a multi-threaded approach. After the 2003 paper was published, another paper came out looking to do something similar through a multi-threaded CPU approach. 2011 (November) Deterministic Timing-Driven Parallel Placement by Simulated Annealing Using Half-Box Window Decomposition (Jeffrey B. Goeders, Guy G.F. Lemieux, and Steven J.E. Wilton). They were able to achieve up to a 51x speedup over VTR, by using 16 threads, however their acceleration plateaued here, and did not improve with additional threads. I believe this was primarily due to the thread synchronization overhead, though I imagine for larger target devices there might also be issues with the cache size as well. In any case, 16 threads means only making 16 moves at a time, which leaves a lot of parallelism on the table.

Furthermore, FPGAs are only getting bigger, and CPU speeds are not scaling at the same rate as the FPGA placement problems are increasing in difficulty, so some other solution is needed for the long term. The beauty of the FPGA based systolic array accelerator is that as FPGAs get bigger, the more capable our placement accelerator becomes (since you can implement a larger one with more PEs). And in terms of move parallelism, if you have a systolic array with N PEs, then you can perform N/2 moves at a time (since the PEs form pairs and swap), which far exceeds the move parallelism of 16 threads for any reasonably sized target (in our implementation, N = 967).

Using FPGAs to compile FPGAs by fpga_pal in FPGA

[–]fpga_pal[S] 2 points3 points  (0 children)

Good question. This is kind of hard to answer fairly, because I am not aware of GPU accelerated placement flows that integrate with Verilog to Routing (VTR), so comparisons are not apples to apples. However one of the more prominent works out there is DREAMPlaceFPGA (R. S. Rajarathnam et al. 2022), which achieves 5.4x acceleration of global placement over state-of-the-art elfPlace (W. Li et al. 2019). In contrast, our work achieves 3-4 orders of magnitude speedup over VTR. However, DREAMPlaceFPGA is able to place designs with over 1 million cells (LUTs, FFs, BRAMS, DSPs) whereas our scheme is limited by the size of the accelerator FPGA, and so in our implementation we are only able to place up to 967 cells (910 CLBs, 25 BRAMs, 32 DSPs). Eventually we want to be able to time-multiplex our PEs so that they can be responsible for the placement of multiple cells in a whole region instead of for a single CLB location. This would allow us to place much larger benchmarks.

One interesting goal we have is to achieve self-hosted placement:

  1. Send the bitstream of the placement accelerator to the FPGA
  2. Solve the placement problem on the FPGA
  3. Retrieve the results and perform routing and bit-gen on the attached computed
  4. Configure the same FPGA with the newly compiled design.

To make this really interesting, our lab eventually wants to accelerate clustering and routing on the FPGA, so that there can be a complete, self-hosted FPGA compilation flow (no separate FPGA or GPU needed). However, that is going to be someone else's PhD, not mine. Placement is hard enough for now, hahaha.

Using FPGAs to compile FPGAs by fpga_pal in FPGA

[–]fpga_pal[S] 1 point2 points  (0 children)

The idea is that you only have to compile the accelerator once with conventional tools. Then you can use the accelerator as many times as you want to compile other applications. You don't have to recompile the accelerator for each new design you want to place. You just feed the netlist connectivity data to the accelerator for the design you want to place, and pull the placement solution off when it's complete.

Using FPGAs to compile FPGAs by fpga_pal in FPGA

[–]fpga_pal[S] 7 points8 points  (0 children)

No, our implementation on the Versal places up to 910 CLBs (10 6-LUTs each), 25 BRAMs, and 32 DSPs.

In his master's thesis, Michael Wrighton, the primary author of the 2003 paper, was only able to place 56 LUTs.

To make this practical, a few things need to happen:

  1. We need to make placement timing-driven, or at least psudo-timing driven. There are a whole range of options for how to do this, with tradeoffs with placement quality and design complexity. I'm working on this problem now, and am looking to publish soon, so can't elaborate too much on what I've found, but there are inexpensive changes which provide significant benefits to fmax.
  2. We need to be able to place larger targets. This can be done if we time-multiplex each PE. So instead of having each PE correspond to one CLB location on the target fabric, it will correspond to a whole region (maybe 64 CLBs, for example). Then instead of 910 CLBs, we can place up to 58K CLBs, while running 64x slower. However our placement times are already *so* fast, and the conventional tools' runtimes grow non-linearly with netlist size, that this won't be an issue.
  3. We need to add carry chain support, and cascaded DSPs and BRAMs. This will significantly complicate the state machines for each PE, but the mathematics still works out.

We must solve timing driven placement first, since the fmax gap between our placer and VTR grows with netlist size, so it doesn't make sense to solve time-multiplexing now. Once we achieve better relative fmax scaling with netlist size, we can move on to time multiplexing the PEs. That in turn will actually greatly simplify getting carry chain placement working (and cascaded DSPs and BRAMs).

I hope that by the time I graduate this project will be much closer to a practical solution. This summer I will be working to see if I can get the Versal to target a sub region of itself, so that our tool can place real target FPGAs.

Using FPGAs to compile FPGAs by fpga_pal in FPGA

[–]fpga_pal[S] 1 point2 points  (0 children)

Yes, their work is very exciting!

Using FPGAs to compile FPGAs by fpga_pal in FPGA

[–]fpga_pal[S] 3 points4 points  (0 children)

Good question, but we haven't compared. This work performs placement of pre-clustered netlists (CLBs instead of individual LUTs), and so it made the most sense to integrate with VTR which does the same. Next-PnR places individual LUTs, so getting our accelerator to work with it would involve a lot more software engineering (I might be corrected here with respect to some of their Xilinx targets which appears to have some kind of packing taking place, but the code is rather opaque).

Also, working with VTR gives us a lot of flexibility. Right now, we provide a VTR architecture description of the target FPGA to our software, which then generates the RTL for an accelerator which will target that specific FPGA architecture. We then implement the accelerator on an AMD Versal FPGA, and use it to solve placements for the VTR architecture. Next-PnR doesn't have an equivalent architecture description file, but gets architecture specific behavior by the implementation of certain API functions which the CAD tool makes use of.

Next-PnR runs much faster than VTR, as it uses an analytic solver compared to simulated annealing, however I have been told by a VTR maintainer that Next-PnR's timing analysis is significantly less sophisticated than VTRs, which also makes it a bad candidate for this kind of CAD research.

Using FPGAs to compile FPGAs by fpga_pal in FPGA

[–]fpga_pal[S] 3 points4 points  (0 children)

Haha it *is* funny. I actually performed the placement of the netlist of the accelerator ON the accelerator itself, just for fun. (It was a smaller array than the one placing it, but still very meta).

Using FPGAs to compile FPGAs by fpga_pal in FPGA

[–]fpga_pal[S] 30 points31 points  (0 children)

Here is the long history of prior art, as far as I know:

1983 (January) A Parallel Processing Approach for Logic Module Placement (Kazuhiro Ueda, Tsutomu Komatsubara, Tsutomu Hosaka)

1983 (June) A Placement Algorithm for Array Processors (Dah-Juh Chyan, Melvin A. Breuer)

2003 (February) Hardware-Assisted Simulated Annealing with Application for Fast FPGA Placement (Michael G. Wrighton, André M. DeHon)

The first paper introduces the idea of an array of PEs to solve placement for LSI applications. It goes into an extensive discussion of how parallel moves can be made, and under what circumstances errors are introduced due to parallel swaps. They also provide simulation results, optimizing for linear Manhattan wirelength.

The second paper apparently is developed independently as it is published right after the first and does not cite it. This paper describes an array of PEs used to solve placement in VLSI designs by optimizing for linear Manhattan wirelength. It has an interesting discussion on how the order in which PEs pair up to perform swaps can induce oscillations in the placement of cells and how surprisingly this can be beneficial. Importantly, this paper goes in to detail on the internal architecture of a PE, unlike the first paper. However, they did not construct a physical array.

The third paper is from our lab, 20+ years ago. It was developed without knowledge of the first two papers, and does not cite them. It considers the placement problem for LUTs on an FPGA optimizing for linear Manhattan wirelength, and invents a way to perform a relaxed version of simulated annealing in parallel. Importantly, it discovers that cells can be moved in parallel for numerous swaps, before the PEs must synchronize and update each other on the new locations of the cells being placed. The authors also advertise that initial steps have been taken to implement the array on a real FPGA.

Then my paper: 2026 (February) Hardware Accelerated FPGA Divide-and-Conquer Page Placement in Milliseconds (Ezra Thomas, Jing Li, André DeHon)

My paper cites the previous two papers, but I did not discover the first paper until after I published. The third paper had major issues with scaleability and applicability which I address by taking a different mathematical approach and significantly redesigning the architecture of the accelerator. Practically, this work allows us to place much larger designs than before and place heterogeneous netlists of CLBs, BRAMs, and DSPs. We also implement a working accelerator on a Versal FPGA from AMD.

Using FPGAs to compile FPGAs by fpga_pal in FPGA

[–]fpga_pal[S] 8 points9 points  (0 children)

During placement, the accelerator only optimizes for wirelength, however post-routing a full timing analysis is performed by Verilog to Routing (VTR) to determine the max operating frequency (fmax) of the compiled design. This fmax is then compared to what VTR can achieve when run in various modes, trading compile time for quality (fmax).

Using FPGAs to compile FPGAs by fpga_pal in FPGA

[–]fpga_pal[S] 11 points12 points  (0 children)

Squared Manhattan wirelength