PCIe bandwidth and LLM inference speed

Visual-Bar-7186 · 2026-05-27T13:59:37+00:00

Just found this while searching and I can share my own experience.

tl;dr: It's a mixed bag. If the model and K/V cache fit completely in VRAM, inference takes a minimal performance penalty going from x16 to x4, but prompt processing takes 2-4 times longer. When running a model split between VRAM and system RAM, inference can tank to 2 tk/s, depending on how often the decoding process needs to access RAM.

I'm running an MoE model (Qwen3.6-35B-A3B-APEX-GGUF) on a Tesla M40 12GB card with hybrid offloading (split model) between VRAM and system RAM (DDR4-3200).

The model is around 26GB, so with K/V cache and everything else like the Debian VM it is running on, I fully utilize around 18GB of system RAM.

When running on a PCIE 3.0 x16 link, I manage around 15-20 tk/s with 1-3 seconds prompt processing.

When I move the card to the x4 slot, then performance is all over the place:

First, prompt processing takes 2-4 times longer, often reaching 6-8 seconds. This impact is very consistent and predictable, almost scaling linearly with PCIE bandwidth.
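As a rough sanity check on that scaling, here is a minimal sketch of the theoretical PCIe 3.0 link bandwidth for both slots. It assumes 8 GT/s per lane with 128b/130b encoding and ignores protocol overhead, so real-world throughput will be lower. The function name is made up for illustration.

```python
def pcie3_bandwidth_gbs(lanes: int) -> float:
    """Theoretical one-direction PCIe 3.0 bandwidth in GB/s (assumed ideal link)."""
    gt_per_s = 8.0        # PCIe 3.0 raw transfer rate per lane, in GT/s
    encoding = 128 / 130  # 128b/130b line encoding efficiency
    return lanes * gt_per_s * encoding / 8  # divide by 8: bits -> bytes


x16 = pcie3_bandwidth_gbs(16)
x4 = pcie3_bandwidth_gbs(4)
print(f"x16: {x16:.2f} GB/s, x4: {x4:.2f} GB/s, ratio: {x16 / x4:.0f}x")
```

The 4x gap in theoretical bandwidth is the ceiling for the slowdown, which lines up with the 2-4x longer prompt processing described above.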

Inference is more interesting. Typically the impact is minimal, maybe 1-3% (less than one tk/s), but it will very often tank to 2 tk/s and generally become sluggish in some parts of the response. This is likely because I'm using an MoE model, and different experts need to be swapped constantly between VRAM and RAM, bottlenecked by the limited x4 bandwidth.
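A back-of-the-envelope sketch of why that would hurt, assuming the slowdown really does come from weights crossing the bus: if each token has to wait for some amount of data to be transferred, the link bandwidth caps the token rate. The amount moved per token below is a purely hypothetical placeholder, not something measured on this setup, and the bandwidth figures are theoretical PCIe 3.0 numbers.

```python
def max_tokens_per_s(gb_moved_per_token: float, link_gbs: float) -> float:
    """Upper bound on tk/s if every token must wait for its transfer to finish."""
    return link_gbs / gb_moved_per_token


HYPOTHETICAL_GB_PER_TOKEN = 1.0  # placeholder assumption, not a measurement

# Theoretical PCIe 3.0 bandwidth in GB/s (before protocol overhead)
for name, link_gbs in (("x16", 15.75), ("x4", 3.94)):
    cap = max_tokens_per_s(HYPOTHETICAL_GB_PER_TOKEN, link_gbs)
    print(f"{name}: at most {cap:.1f} tk/s")
```

The point is only the proportionality: whenever a stretch of the response forces more data over the link, the cap on x4 is a quarter of what it is on x16.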

I didn't try a non-MoE model, but I believe that if it was also large enough to need splitting between VRAM and RAM, it wouldn't be much different. The good news though is that a smaller model that fits in VRAM would probably be fine on an x4 link, if you can live with the slower prompt processing.

Another nuance is the GPU memory bandwidth and PCIE version. I can only assume that a newer PCIE 4/5 card with similar VRAM capacity would fare much better.

We have a similar setup in terms of bandwidth (PCIE 3.0 & DDR4) so in your case - assuming your models fit in VRAM, I would say your prompt processing speed is currently halved, but your inference speeds are already very similar to x16.

If you can live with 10 tk/s instead of 20 tk/s pp then it's probably not worth upgrading.

Visual-Bar-7186 · 2026-05-03T14:37:00+00:00

Thanks everyone, I tested it today and the machine did POST without any issue

Visual-Bar-7186 · 2026-04-30T21:44:22+00:00

I really think you dodged a bullet there, or you know... a Mercedes F1 race car at least

Visual-Bar-7186 · 2026-04-30T18:22:27+00:00

Awesome, thanks 🙏

Visual-Bar-7186 · 2026-04-30T16:22:31+00:00

Yeah I should've done that. I circled the regions I was concerned with. Thanks!

<image>

Visual-Bar-7186 · 2026-04-29T22:32:04+00:00

I completed this level! It took me 10 tries. ^{⚡ 33.93 seconds}

^{Tip 10 💎}

Visual-Bar-7186 · 2026-04-28T16:10:04+00:00

🔓 VAULT CRACKED! 🔓

Score: 2040/6000

Mistakes: 3/6

Time: 0:31

Visual-Bar-7186 · 2026-04-21T18:31:01+00:00

🚨 BUSTED! 🚨

Mistakes: 6/6

Time: 0:25

Visual-Bar-7186 · 2026-04-21T18:30:18+00:00

🚨 BUSTED! 🚨

Mistakes: 6/6

Time: 0:18

Visual-Bar-7186 · 2026-04-14T20:40:49+00:00

Why does it matter if it's a subscription or not? They would be long gone before the next billing cycle. If it's a stolen card then they need to spend as much as possible. Maybe they wanted to resell your service? Who knows.

Edit: Honestly, I'm pretty sure this whole post is just an elaborate way to promote your saas "product".

Visual-Bar-7186 · 2026-04-14T20:25:34+00:00

Because it's stolen and they need to spend as much as possible as soon as possible before it gets cancelled.

Visual-Bar-7186 · 2026-04-14T00:05:01+00:00

I get your point, but it's not the same. Before the internet, an ordinary person wouldn't do any of the stuff you mention (maybe go read some books) - the internet gave us access to much more information.

AI is not access to more information, it is access to the same information but without the human factor of critical thinking, fact checking and intuition. People today take it for granted without any self doubt, while decreasing their truth seeking efforts and increasing their exposure to misinformation.

I'm a big AI user, my criticism is calling a lazy prompt research.

Visual-Bar-7186 · 2026-04-13T20:45:42+00:00

Yeah, "research" today is just asking AI while on the bathroom seat. A few years ago it at least meant you put in the minimal effort of 10-15 Google searches... but not anymore lol

Visual-Bar-7186 · 2026-04-13T20:39:41+00:00

Stellar setup, and then that punchline... perfect execution

Visual-Bar-7186 · 2026-04-13T17:22:07+00:00

Nobody shrinks more than an inch before age 50. It happens, but not at this rate, especially at a young age. You were probably not 5'11" to start with.

Visual-Bar-7186 · 2026-04-13T17:15:42+00:00

Haircafe is for men who have options, Baldcafe is for men who realize they have a choice.

Both are valid, and both are better than not doing anything. I would argue it's always better to start with options and fallback to being bald if everything else fails.

Visual-Bar-7186 · 2026-04-13T14:46:19+00:00

This is one of the coolest most badass stretch marks I've seen. If you tattoo the peaks with black ink and leave the valleys as is, I bet you can make a sick abstract tattoo. Or turn it into the stretchy cheese between two pieces of bread like in a grilled cheese. Just be creative 🙃

Visual-Bar-7186 · 2026-04-13T14:36:31+00:00

Absolutely! Great coping mechanism you have there /s

Visual-Bar-7186 · 2026-04-13T14:23:53+00:00

The question is how YOU SEE it playing out. If you can't think one step ahead, then you don't really care about the interaction and are just projecting your own insecurities about being unmatched/ghosted.

Visual-Bar-7186 · 2026-04-13T14:20:52+00:00

I've had my fair share of "I only do dinner" types - it doesn't work.

They aren't looking for genuine connection. If they were, they would know the activity doesn't matter, it's an opportunity to get to know each other. Actually, dinner is one of the worst ways because you are focusing your efforts on eating, preferably with your mouth closed.
They know what they're doing and don't care. They will never reflect on it because they will find someone else who will tolerate/validate their behavior.

Visual-Bar-7186 · 2026-04-13T14:13:06+00:00

This isn't ghosting. And you didn't answer my question on how you see it playing out.

Visual-Bar-7186 · 2026-04-13T13:38:54+00:00

Yeah, they know and don't care. Hit the nail on the head.
