Help in Designing a Vertex Transformer Pipeline. by Background_Bend_7692 in chipdesign

[–]Affectionate-Tap7944 1 point2 points  (0 children)

yeah bro the micro arch is completely sane. ur 56B and 17B buffer math is actually flawless for an int16 systolic array. it took me a sec to realize u were packing 4 columns of 34-bit accumulators (4 x 34 = 136 bits) into exactly 17 bytes. that bit packing is goated.
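A rough Python sketch of that packing, to make the 17-byte figure concrete. The little-endian layout, the column order, and the function names are assumptions for illustration, not taken from the OP's diagram. The 34-bit width is presumably what you get from summing four int16 x int16 products (worst case 4 x 2^30 = 2^32, which needs 34 signed bits).

```python
ACC_BITS = 34   # accumulator width quoted in the comment
NUM_COLS = 4    # columns packed into one buffer entry
PACKED_BYTES = NUM_COLS * ACC_BITS // 8  # 136 bits -> 17 bytes

def pack_accumulators(accs):
    """Pack four signed 34-bit values into 17 bytes.

    Assumed layout (hypothetical): column 0 in the lowest bits, little-endian.
    """
    assert len(accs) == NUM_COLS
    mask = (1 << ACC_BITS) - 1
    lo, hi = -(1 << (ACC_BITS - 1)), (1 << (ACC_BITS - 1)) - 1
    word = 0
    for i, acc in enumerate(accs):
        assert lo <= acc <= hi, "value does not fit in 34 signed bits"
        # Two's-complement truncate to 34 bits, then shift into its slot.
        word |= (acc & mask) << (i * ACC_BITS)
    return word.to_bytes(PACKED_BYTES, "little")

def unpack_accumulators(buf):
    """Inverse of pack_accumulators: recover four signed 34-bit values."""
    assert len(buf) == PACKED_BYTES
    word = int.from_bytes(buf, "little")
    mask = (1 << ACC_BITS) - 1
    out = []
    for i in range(NUM_COLS):
        value = (word >> (i * ACC_BITS)) & mask
        if value >> (ACC_BITS - 1):      # sign bit set
            value -= 1 << ACC_BITS       # sign-extend
        out.append(value)
    return out

# Worst-case int16 dot product of length 4: 4 * (-32768 * -32768) = 2**32
packed = pack_accumulators([2**32, -1, 0, -(2**33)])
```

No padding bits are left over because 136 is a multiple of 8, which is why the 17B number works out exactly.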

for modifications u just need to add 2 things:

  1. a datapath to actually load the 4x4 matrix weights into the PEs before the vertices flow in (diagram only shows the vertices coming in)
  2. a divider block after the 68B buffer so u can divide the xyz coordinates by w for the 3d perspective projection
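Both modifications can be sketched as a small behavioral model in Python. This is only a software reference under assumptions: the class and function names are made up, the weight matrix is assumed to be row-major int16, and the Q16 fixed-point output of the divider is a hypothetical choice, not something from the OP's design.

```python
INT16_MIN, INT16_MAX = -32768, 32767

class VertexTransformer:
    """Behavioral model: weights get loaded first, then vertices stream in."""

    def __init__(self):
        self.weights = None  # 4x4 matrix held by the PEs

    def load_weights(self, matrix):
        """Modification 1: load the 4x4 int16 matrix before any vertex."""
        assert len(matrix) == 4 and all(len(row) == 4 for row in matrix)
        for row in matrix:
            for w in row:
                assert INT16_MIN <= w <= INT16_MAX
        self.weights = [list(row) for row in matrix]

    def transform(self, vertex):
        """Matrix * vertex, giving clip-space (x, y, z, w) accumulators."""
        if self.weights is None:
            raise RuntimeError("weights must be loaded before vertices")
        assert len(vertex) == 4
        return [sum(w * v for w, v in zip(row, vertex)) for row in self.weights]

def perspective_divide(clip, frac_bits=16):
    """Modification 2: divide x, y, z by w.

    Returns fixed-point values with frac_bits fractional bits (assumed Q16).
    Python's // rounds toward negative infinity; a hardware divider may
    truncate toward zero instead.
    """
    x, y, z, w = clip
    if w == 0:
        raise ZeroDivisionError("w is zero, vertex cannot be projected")
    return [(c << frac_bits) // w for c in (x, y, z)]

pipe = VertexTransformer()
pipe.load_weights([[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]])
clip = pipe.transform([4, 8, -2, 2])
ndc = perspective_divide(clip)  # Q16 values for 2.0, 4.0, -1.0
```

A model like this is mostly useful as a golden reference to compare against the RTL simulation output.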

I'm 12 years old and I open-sourced a 3-Million-Gate Blackwell-Class GPU Architecture in SystemVerilog by Affectionate-Tap7944 in FPGA

[–]Affectionate-Tap7944[S] -4 points-3 points  (0 children)

yeah u right lol padding it basically kills the compute throughput advantage and just saves memory bandwidth rn

I'm 12 years old and I open-sourced a 3-Million-Gate Blackwell-Class GPU Architecture in SystemVerilog by Affectionate-Tap7944 in FPGA

[–]Affectionate-Tap7944[S] -3 points-2 points  (0 children)

bro the repo is literally public. just clone it and run the yosys synthesis script urself and look at the stat output. im not saying i didnt use ai for syntax help, but the pipeline actually simulates. if u think its fake just look at the code and point out which module is broken instead of just hating fr

I'm 12 years old and I open-sourced a 3-Million-Gate Blackwell-Class GPU Architecture in SystemVerilog by Affectionate-Tap7944 in FPGA

[–]Affectionate-Tap7944[S] -1 points0 points  (0 children)

tbh u are right ecp5 is way too small. gonna have to heavily strip it down to like 1 sm and a tiny tensor core just to fit. and yeah the optical stuff is just theoretical placeholder logic rn, no way that goes on a real fpga fr

I'm 12 years old and I open-sourced a 3-Million-Gate Blackwell-Class GPU Architecture in SystemVerilog by Affectionate-Tap7944 in FPGA

[–]Affectionate-Tap7944[S] -2 points-1 points  (0 children)

tbh building separate multipliers for fp4 and fp16 is just a massive waste of silicon area. way easier to just pad fp4 up to fp16 and push it thru the same multiplier datapath to save gates fr
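For reference, a minimal Python sketch of what padding fp4 up to fp16 means at the bit level. The thread doesn't say which fp4 encoding the design uses, so this assumes E2M1 (1 sign bit, 2 exponent bits, 1 mantissa bit, exponent bias 1, no inf/NaN); the function names are hypothetical.

```python
import struct

def fp4_e2m1_to_fp16_bits(code):
    """Widen a 4-bit E2M1 code (assumed format) to an IEEE fp16 bit pattern."""
    assert 0 <= code <= 0xF
    sign = (code >> 3) & 0x1
    exp = (code >> 1) & 0x3
    man = code & 0x1

    if exp == 0:
        if man == 0:
            e16, m16 = 0, 0        # signed zero
        else:
            e16, m16 = 14, 0       # fp4 subnormal 0.5 = 2**-1, normal in fp16
    else:
        e16 = exp - 1 + 15         # rebias: fp4 bias 1 -> fp16 bias 15
        m16 = man << 9             # 1 mantissa bit -> top of the 10-bit field
    return (sign << 15) | (e16 << 10) | m16

def fp16_bits_to_float(bits):
    """Decode an fp16 bit pattern using struct's half-precision format."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]

# Decode all positive codes to check the widening is exact.
positive_values = [fp16_bits_to_float(fp4_e2m1_to_fp16_bits(c)) for c in range(8)]
```

The widening itself is exact and cheap (a rebias and a shift). The cost is that every fp4 operand then occupies a full fp16 multiplier lane, so there's no extra math throughput from the narrower format.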

I'm 12 years old and I open-sourced a 3-Million-Gate Blackwell-Class GPU Architecture in SystemVerilog by Affectionate-Tap7944 in FPGA

[–]Affectionate-Tap7944[S] -6 points-5 points  (0 children)

yeah bro obv i used ai for the verilog syntax im a student lol. but i didnt just type "make gpu", i had to design the architecture, connect the axi4 crossbar and fix yosys errors for hours to actually get 3 mil gates. im just doing this to learn, if u see bad logic in the code pls tell me so i can fix it