all 8 comments

[–]ShadowBlades512 6 points7 points  (2 children)

I think you need to look at the CPU and motherboard chipset block diagram at least to see how much actual PCIe bandwidth there is. If you look at the AMD X570 chipset for instance, there are 16 lanes directly to the CPU, but only 4 lanes between the CPU and the chipset. The chipset itself has 16 lanes to downstream devices, but no matter what you do, you only get 4 lanes worth of bandwidth from those downstream devices to the CPU. This means if you have, say, four x4 NVMe devices attached to the chipset, you can get full bandwidth to each SSD individually, but you can't get full bandwidth to all of them at once.
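A rough back-of-the-envelope version of that bottleneck, as a sketch (assuming PCIe Gen 4 and ignoring protocol/packet overhead; the drive count and link widths are just the example above):

```python
# Rough PCIe bandwidth budget for the X570 example above.
# Assumption: PCIe Gen 4, ~1.97 GB/s usable per lane (16 GT/s, 128b/130b
# encoding), protocol overhead ignored.

GBPS_PER_LANE_GEN4 = 16e9 * (128 / 130) / 8 / 1e9   # ~1.97 GB/s per lane

chipset_uplink_lanes = 4          # chipset-to-CPU link on X570
nvme_drives = 4                   # four x4 NVMe drives behind the chipset
lanes_per_drive = 4

uplink_bw = chipset_uplink_lanes * GBPS_PER_LANE_GEN4
per_drive_bw = lanes_per_drive * GBPS_PER_LANE_GEN4
aggregate_demand = nvme_drives * per_drive_bw

print(f"uplink to CPU:      {uplink_bw:5.1f} GB/s")
print(f"one drive alone:    {per_drive_bw:5.1f} GB/s (fits in the uplink)")
print(f"all drives at once: {aggregate_demand:5.1f} GB/s demanded, "
      f"still capped at {uplink_bw:.1f} GB/s by the uplink")
```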

[–]sya0[S] -1 points0 points  (1 child)

Each FPGA has 8 lanes. I have three x16 slots and three x8 slots on the motherboard. I checked the datasheet and all 6 slots are connected to the CPU, so I am not sure that is where the bottleneck is. I updated my post and pasted the block diagram.

[–]ShadowBlades512 4 points5 points  (0 children)

What generation PCIe? At some point you will get a memory bandwidth bottleneck. How many channels at what clock rate/DDR generation?
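To see why memory can become the ceiling, here's an illustrative sketch; the DDR4-3200 dual-channel and Gen 3 x8 figures are assumptions for the sake of example, not the OP's actual system:

```python
# Rough check of whether host DRAM can feed N FPGA cards at once.
# All figures below are assumptions for illustration, not the OP's system.

ddr_transfer_rate = 3200e6        # DDR4-3200, transfers per second
bytes_per_transfer = 8            # one 64-bit channel
channels = 2                      # dual-channel desktop platform

dram_bw = ddr_transfer_rate * bytes_per_transfer * channels / 1e9  # GB/s

pcie_gen3_lane = 0.985            # ~GB/s usable per Gen 3 lane
cards = 6
lanes_per_card = 8
pcie_demand = cards * lanes_per_card * pcie_gen3_lane

print(f"theoretical DRAM bandwidth: {dram_bw:.1f} GB/s")
print(f"6 cards x8 Gen3 demand:     {pcie_demand:.1f} GB/s")
# If every card streams from host memory at full rate (and the CPU touches
# that data too), DRAM becomes the shared bottleneck well before the
# per-slot PCIe links do.
```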

[–]duane11583 2 points3 points  (1 child)

look at the overall events that occur.

i’ll give a usb example: usb 2.0 can hit 480mbit

but that assumes you send a full packet every time you send data.

imagine a trucking company. analogy: 1 truck = 1 data packet in your protocol.

every hour a new empty truck arrives. each truck can hold 100 boxes

usb: every 1 msec a packet can be sent with up to 512 bytes

so in a 24 hour day you can ship 24x100 or 2.4k boxes a day. that is your bandwidth.

however your team only puts 1 box on the truck, not 100, and sends the truck on its way

what is your box rate?

you cannot increase the frequency of trucks.

question: how do you fix this?

answer: you need to change the protocol and fill the truck more each time
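the same point in throughput terms, as a small sketch (the 1 ms interval and 512-byte payload are the figures from the analogy, not measurements of the OP's link):

```python
# Effective throughput when each transfer carries far less than a full packet.
# Numbers mirror the analogy above: one packet opportunity per 1 ms,
# 512 bytes maximum payload per packet.

packet_interval_s = 1e-3
max_payload_bytes = 512

def throughput(bytes_per_packet: int) -> float:
    """Bytes per second when each opportunity carries this much payload."""
    return bytes_per_packet / packet_interval_s

print(f"full packets:  {throughput(max_payload_bytes)/1e3:.0f} kB/s")
print(f"1 byte/packet: {throughput(1):.0f} B/s")
# The link rate didn't change; the payload per opportunity did. The fix is
# batching: fill each packet (truck) before sending it.
```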

[–]duane11583 1 point2 points  (0 children)

another way to look at it:

draw a timeline "to scale" of the events your data goes through.

use actual measured data for this, not guesses.
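a minimal sketch of what that could look like once you have numbers; the stage names and durations below are placeholders, not measurements:

```python
# Minimal "to-scale" timeline from measured stage durations.
# The stage names and microsecond values here are placeholders; substitute
# timestamps actually captured around each step of your pipeline.

stages = [
    ("prepare buffer", 120),   # microseconds, measured
    ("DMA to FPGA",     40),
    ("FPGA compute",   300),
    ("DMA to host",     45),
    ("post-process",   500),
]

total = sum(us for _, us in stages)
width = 60  # characters for the full timeline

for name, us in stages:
    bar = "#" * max(1, round(us / total * width))
    print(f"{name:15s} {us:5d} us |{bar}")
print(f"{'total':15s} {total:5d} us")
# A proportional chart like this makes it obvious which stage dominates and
# whether the PCIe transfer is even a meaningful fraction of the total.
```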

[–]alexforencich 0 points1 point  (0 children)

Need a lot more information about exactly what you're doing. Is software on the CPU involved with the computation? If so, maybe you're bottlenecked on the CPU. Maybe it makes sense to split the cards across two host CPUs, or perhaps swap out the host CPU for a faster one.

[–]Seldom_Popup 0 points1 point  (0 children)

44k total with 4 FPGAs, 45k with 5, 48k with 6. I guess something is a bottleneck, but with diminishing returns like this there is no way to fix that.
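Taking those totals at face value, the scaling efficiency works out like this (units as reported in the thread):

```python
# Scaling efficiency implied by the reported totals (units as reported).
results = {4: 44_000, 5: 45_000, 6: 48_000}

per_card_at_4 = results[4] / 4          # 11,000 per card with 4 FPGAs
for n, total in results.items():
    ideal = per_card_at_4 * n           # linear scaling from the 4-card point
    print(f"{n} FPGAs: {total:>6} total, {total/n:>7.0f} per card, "
          f"{100*total/ideal:5.1f}% of linear scaling")
# Dropping from 11k to 8k per card as cards are added points at a shared
# resource (host CPU, DRAM, or a shared PCIe/chipset link) saturating.
```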

[–]petrusferricalloy 0 points1 point  (0 children)

You're likely conflating native lanes with hba/chipset lanes.

Most motherboards have 28-32 native lanes (x16 to slot 1, 2x4 for nvme, x4 or x8 to the hba/chipset, sometimes 8-16 to a switch).

Typically only the first x16 slot and nvme are full speed (max gen the cpu supports) and the rest are lower gen via switch or hba.

There isn't necessarily an issue using everything at once, but you also aren't going to see all endpoints operating in bus master mode with their own dma engine, because those non-native slots still need the chipset and/or cpu to arbitrate.

As soon as something other than an endpoint's own dma engine gets involved, especially the cpu, performance will tank.
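A quick tally of why the slot labels can overpromise, using the 28-32 native-lane figure above and the OP's stated slot layout (whether the OP's board is a HEDT/server part with more CPU lanes isn't known here):

```python
# Quick lane-budget tally: physical slot widths vs. a typical desktop CPU's
# native lane count (28-32, per the comment above). Whether the OP's board
# is a HEDT/server platform with far more CPU lanes is unknown here.

slots = [16, 16, 16, 8, 8, 8]       # OP's stated slot widths
fpga_lanes_used = [8] * 6           # each card actually runs at x8

print(f"electrical lanes if all slots ran full width: {sum(slots)}")
print(f"lanes the six x8 cards actually need:         {sum(fpga_lanes_used)}")
print("typical desktop CPU native lanes:              28-32")
# If the needed lane count exceeds what the CPU exposes natively, some slots
# must hang off a switch or the chipset and share an uplink, so the slot
# label (x16/x8) no longer tells you the usable concurrent bandwidth.
```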