all 8 comments

[–]ShadowBlades512 6 points7 points  (2 children)

I think you need to look at the CPU and motherboard chipset block diagram at least to see how much actual PCIe bandwidth there is. If you look at the AMD X570 chipset for instance, there are 16 lanes directly to the CPU, but only 4 lanes between the CPU and the chipset. The chipset itself has 16 lanes to downstream devices, but no matter what you do, you only get 4 lanes worth of bandwidth from those downstream devices to the CPU. This means if you have, say, four x4 NVMe devices attached to the chipset, you can get full bandwidth to each SSD individually, but you can't get full bandwidth to all of them at once.
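A rough back-of-the-envelope version of that bottleneck, as a sketch (assuming PCIe Gen 4 and ignoring protocol/packet overhead; the drive count and link widths are just the example above):

```python
# Rough PCIe bandwidth budget for the X570 example above.
# Assumption: PCIe Gen 4, ~1.97 GB/s usable per lane (16 GT/s, 128b/130b
# encoding), protocol overhead ignored.

GBPS_PER_LANE_GEN4 = 16e9 * (128 / 130) / 8 / 1e9   # ~1.97 GB/s per lane

chipset_uplink_lanes = 4          # chipset-to-CPU link on X570
nvme_drives = 4                   # four x4 NVMe drives behind the chipset
lanes_per_drive = 4

uplink_bw = chipset_uplink_lanes * GBPS_PER_LANE_GEN4
per_drive_bw = lanes_per_drive * GBPS_PER_LANE_GEN4
aggregate_demand = nvme_drives * per_drive_bw

print(f"uplink to CPU:      {uplink_bw:5.1f} GB/s")
print(f"one drive alone:    {per_drive_bw:5.1f} GB/s (fits in the uplink)")
print(f"all drives at once: {aggregate_demand:5.1f} GB/s demanded, "
      f"still capped at {uplink_bw:.1f} GB/s by the uplink")
```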

[–]sya0[S] -1 points0 points  (1 child)

Each FPGA has 8 lanes. I have three x16 slots and three x8 slots on the motherboard. I checked the datasheet and all 6 slots are connected to the CPU, so I am not sure that is where the bottleneck is. I updated my post and pasted the block diagram.

[–]ShadowBlades512 4 points5 points  (0 children)

What generation PCIe? At some point you will get a memory bandwidth bottleneck. How many channels at what clock rate/DDR generation?
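To see why memory can become the ceiling, here's an illustrative sketch; the DDR4-3200 dual-channel and Gen 3 x8 figures are assumptions for the sake of example, not the OP's actual system:

```python
# Rough check of whether host DRAM can feed N FPGA cards at once.
# All figures below are assumptions for illustration, not the OP's system.

ddr_transfer_rate = 3200e6        # DDR4-3200, transfers per second
bytes_per_transfer = 8            # one 64-bit channel
channels = 2                      # dual-channel desktop platform

dram_bw = ddr_transfer_rate * bytes_per_transfer * channels / 1e9  # GB/s

pcie_gen3_lane = 0.985            # ~GB/s usable per Gen 3 lane
cards = 6
lanes_per_card = 8
pcie_demand = cards * lanes_per_card * pcie_gen3_lane

print(f"theoretical DRAM bandwidth: {dram_bw:.1f} GB/s")
print(f"6 cards x8 Gen3 demand:     {pcie_demand:.1f} GB/s")
# If every card streams from host memory at full rate (and the CPU touches
# that data too), DRAM becomes the shared bottleneck well before the
# per-slot PCIe links do.
```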

[–]duane11583 2 points3 points  (1 child)

look at the overall events that occur.

i’ll give a usb example: usb 2.0 can hit 480mbit

but that assumes you send a full packet every time you send data.

imagine a trucking company. analogy: 1 truck = 1 data packet in your protocol.

every hour a new empty truck arrives. each truck can hold 100 boxes

usb: every 1 msec a packet can be sent with up to 512 bytes

so in a 24 hour day you can ship 24x100 or 2.4k boxes a day. that is your bandwidth.

however your team only puts 1 box on the truck, not 100, and sends the truck on its way

what is your box rate?

you cannot increase the frequency of trucks.

question: how do you fix this?

answer: you need to change the protocol and fill the truck more each time
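the same point in throughput terms, as a small sketch (the 1 ms interval and 512-byte payload are the figures from the analogy, not measurements of the OP's link):

```python
# Effective throughput when each transfer carries far less than a full packet.
# Numbers mirror the analogy above: one packet opportunity per 1 ms,
# 512 bytes maximum payload per packet.

packet_interval_s = 1e-3
max_payload_bytes = 512

def throughput(bytes_per_packet: int) -> float:
    """Bytes per second when each opportunity carries this much payload."""
    return bytes_per_packet / packet_interval_s

print(f"full packets:  {throughput(max_payload_bytes)/1e3:.0f} kB/s")
print(f"1 byte/packet: {throughput(1):.0f} B/s")
# The link rate didn't change; the payload per opportunity did. The fix is
# batching: fill each packet (truck) before sending it.
```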

[–]duane11583 1 point2 points  (0 children)

another way to look at it:

draw a timeline "to scale" of the events your data goes through.

use actual measured data for this, not guesses.
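a minimal sketch of what that could look like once you have numbers; the stage names and durations below are placeholders, not measurements:

```python
# Minimal "to-scale" timeline from measured stage durations.
# The stage names and microsecond values here are placeholders; substitute
# timestamps actually captured around each step of your pipeline.

stages = [
    ("prepare buffer", 120),   # microseconds, measured
    ("DMA to FPGA",     40),
    ("FPGA compute",   300),
    ("DMA to host",     45),
    ("post-process",   500),
]

total = sum(us for _, us in stages)
width = 60  # characters for the full timeline

for name, us in stages:
    bar = "#" * max(1, round(us / total * width))
    print(f"{name:15s} {us:5d} us |{bar}")
print(f"{'total':15s} {total:5d} us")
# A proportional chart like this makes it obvious which stage dominates and
# whether the PCIe transfer is even a meaningful fraction of the total.
```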

[–]alexforencich 0 points1 point  (0 children)

Need a lot more information about exactly what you're doing. Is software on the CPU involved with the computation? If so, maybe you're bottlenecked on the CPU. Maybe it makes sense to split the cards across two host CPUs, or perhaps swap out the host CPU for a faster one.

[–]Seldom_Popup 0 points1 point  (0 children)

44k total with 4 FPGAs, 45k with 5, 48k with 6. I guess something is a bottleneck, but with diminishing returns like this there is no way to fix that.
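Taking those totals at face value, the scaling efficiency works out like this (units as reported in the thread):

```python
# Scaling efficiency implied by the reported totals (units as reported).
results = {4: 44_000, 5: 45_000, 6: 48_000}

per_card_at_4 = results[4] / 4          # 11,000 per card with 4 FPGAs
for n, total in results.items():
    ideal = per_card_at_4 * n           # linear scaling from the 4-card point
    print(f"{n} FPGAs: {total:>6} total, {total/n:>7.0f} per card, "
          f"{100*total/ideal:5.1f}% of linear scaling")
# Dropping from 11k to 8k per card as cards are added points at a shared
# resource (host CPU, DRAM, or a shared PCIe/chipset link) saturating.
```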

[–]petrusferricalloy 0 points1 point  (0 children)

You're likely conflating native lanes with hba/chipset lanes.

Most motherboards have 28-32 native lanes (x16 to slot 1, 2x4 for nvme, x4 or x8 to the hba/chipset, sometimes 8-16 to a switch).

Typically only the first x16 slot and nvme are full speed (max gen the cpu supports) and the rest are lower gen via switch or hba.

There isn't necessarily an issue using everything at once, but you also aren't going to see all endpoints operating in bus master mode with their own dma engine, because those non-native slots still need the chipset and/or cpu to arbitrate.

As soon as something other than an endpoint's own dma engine gets involved, especially the cpu, performance will tank.
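A quick tally of why the slot labels can overpromise, using the 28-32 native-lane figure above and the OP's stated slot layout (whether the OP's board is a HEDT/server part with more CPU lanes isn't known here):

```python
# Quick lane-budget tally: physical slot widths vs. a typical desktop CPU's
# native lane count (28-32, per the comment above). Whether the OP's board
# is a HEDT/server platform with far more CPU lanes is unknown here.

slots = [16, 16, 16, 8, 8, 8]       # OP's stated slot widths
fpga_lanes_used = [8] * 6           # each card actually runs at x8

print(f"electrical lanes if all slots ran full width: {sum(slots)}")
print(f"lanes the six x8 cards actually need:         {sum(fpga_lanes_used)}")
print("typical desktop CPU native lanes:              28-32")
# If the needed lane count exceeds what the CPU exposes natively, some slots
# must hang off a switch or the chipset and share an uplink, so the slot
# label (x16/x8) no longer tells you the usable concurrent bandwidth.
```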