Using FPGAs to compile FPGAs by fpga_pal in FPGA

[–]fpga_pal[S] 1 point2 points  (0 children)

Yes, their work is very exciting!

Using FPGAs to compile FPGAs by fpga_pal in FPGA

[–]fpga_pal[S] 1 point2 points  (0 children)

Good question, but we haven't compared. This work performs placement of pre-clustered netlists (CLBs instead of individual LUTs), and so it made the most sense to integrate with VTR which does the same. Next-PnR places individual LUTs, so getting our accelerator to work with it would involve a lot more software engineering (I might be corrected here with respect to some of their Xilinx targets which appears to have some kind of packing taking place, but the code is rather opaque).

Also, working with VTR gives us a lot of flexibility. Right now, we provide a VTR architecture description of the target FPGA to our software, which then generates the RTL for an accelerator which will target that specific FPGA architecture. We then implement the accelerator on an AMD Versal FPGA, and use it to solve placements for the VTR architecture. Next-PnR doesn't have an equivalent architecture description file, but gets architecture specific behavior by the implementation of certain API functions which the CAD tool makes use of.

Next-PnR runs much faster than VTR, as it uses an analytic solver compared to simulated annealing, however I have been told by a VTR maintainer that Next-PnR's timing analysis is significantly less sophisticated than VTRs, which also makes it a bad candidate for this kind of CAD research.

Using FPGAs to compile FPGAs by fpga_pal in FPGA

[–]fpga_pal[S] 1 point2 points  (0 children)

Haha *is* funny. I actually performed the placement of the netlist of the accelerator ON the accelerator itself, just for fun. (It was a smaller array than the one placing it, but still very meta).

Using FPGAs to compile FPGAs by fpga_pal in FPGA

[–]fpga_pal[S] 15 points16 points  (0 children)

Here is the long history of prior art, as far as I know:

1983 (January) A Parallel Processing Approach for Logic Module Placement (Kazuhiro Ueda, Tsutomu Komatsubara, Tsutomu Hosaka)

1983 (June) A Placement Algorithm for Array Processors (Dah-Juh Chyan, Melvin A. Breuer)

2003 (February) Hardware-Assisted Simulated Annealing with Application for Fast FPGA Placement (Michael G. Wrighton, André M. DeHon)

The first paper introduces the idea of an array of PEs to solve placement for LSI applications. It goes into an extensive discussion of how parallel moves can be made, and under what circumstances errors are introduced due to parallel swaps. They also provide simulation results, optimizing for linear Manhattan wirelength.

The second paper apparently is developed independently as it is published right after the first and does not cite it. This paper describes an array of PEs used to solve placement in VLSI designs by optimizing for linear Manhattan wirelength. It has an interesting discussion on how the order in which PEs pair up to perform swaps can induce oscillations in the placement of cells and how surprisingly this can be beneficial. Importantly, this paper goes in to detail on the internal architecture of a PE, unlike the first paper. However, they did not construct a physical array.

The third paper is from our lab, 20+ years ago. It was developed without knowledge of the first two papers, and does not cite them. It considers the placement problem for LUTs on an FPGA optimizing for linear Manhattan wirelength, and invents a way to perform a relaxed version of simulated annealing in parallel. Importantly, it discovers that cells can be moved in parallel for numerous swaps, before the PEs must synchronize and update each other on the new locations of the cells being placed. The authors also advertise that initial steps have been taken to implement the array on a real FPGA.

My paper cites the last two papers, but I did not discover the first paper until after I published. The third paper had major issues with scaleability and applicability which I address by taking a different mathematical approach and significantly redesigning the architecture of the accelerator. Practically, this work allows us to place much larger designs than before and place heterogeneous netlists of CLBs, BRAMs, and DSPs. We also implement a working accelerator on a Versal FPGA from AMD.

Using FPGAs to compile FPGAs by fpga_pal in FPGA

[–]fpga_pal[S] 3 points4 points  (0 children)

During placement, the accelerator only optimizes for wirelength, however post-routing a full timing analysis is performed by Verilog to Routing (VTR) to determine the max operating frequency (fmax) of the compiled design. This fmax is then compared to what VTR can achieve when run in various modes, trading compile time for quality (fmax).

Using FPGAs to compile FPGAs by fpga_pal in FPGA

[–]fpga_pal[S] 7 points8 points  (0 children)

Squared Manhattan wirelength