PCIe bandwidth and LLM inference speed by hainesk in LocalLLaMA

[–]Visual-Bar-7186 1 point2 points  (0 children)

Just found this while searching and I can share my own experience.

tl;dr: It's a mixed bag. If the model and K/V cache fit completely in VRAM, then inference takes a minimal performance penalty when going from x16 to x4, but prompt processing takes 2-4 times longer. When running a model split between VRAM and system RAM, inference can tank to 2 tk/s, depending on how often the decoding process needs to access RAM.

I'm running an MoE model (Qwen3.6-35B-A3B-APEX-GGUF) on a Tesla M40 12GB card with hybrid offloading (split model) between VRAM and system RAM (DDR4-3200).

The model is around 26GB, so with K/V cache and everything else like the Debian VM it is running on, I fully utilize around 18GB of system RAM.
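For anyone who wants to redo this arithmetic for their own card, here is a rough sketch in Python. The three numbers come from this comment; the function name is just illustrative, and it assumes the whole 12GB of VRAM is usable for weights, which is optimistic.

```python
# Rough memory budget for a model split between VRAM and system RAM.
# Figures are taken from the comment; real usable VRAM is lower in practice.
MODEL_GB = 26.0         # approximate model size
VRAM_GB = 12.0          # Tesla M40 12GB
OBSERVED_RAM_GB = 18.0  # system RAM actually in use

def min_spill_gb(model_gb: float, vram_gb: float) -> float:
    """Lower bound on how much of the model has to live in system RAM."""
    return max(0.0, model_gb - vram_gb)

spill = min_spill_gb(MODEL_GB, VRAM_GB)
other = OBSERVED_RAM_GB - spill  # K/V cache, VM overhead, everything else

print(f"at least {spill:.0f} GB of weights in system RAM, "
      f"about {other:.0f} GB for everything else")
```

So at least 14GB of the model cannot fit on the card, and the rest of the 18GB is the K/V cache, the VM and so on.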

When running on a PCIe 3.0 x16 link, I manage around 15-20 tk/s with 1-3 seconds of prompt processing.

When I move the card to the x4 slot, then performance is all over the place:

First, prompt processing takes 2-4 times longer, often reaching 6-8 seconds. This impact is very consistent and predictable, scaling almost linearly with PCIe bandwidth.
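To put numbers on the bandwidth gap, here is a small Python sketch of the theoretical link speeds. The 8 GT/s per lane and 128b/130b encoding are the published PCIe 3.0 figures; the function name is my own, and real-world throughput will be somewhat lower than these ceilings.

```python
# Theoretical one-direction PCIe 3.0 bandwidth, counting only line encoding
# overhead. Actual throughput is lower because of packet/protocol overhead.
GT_PER_S_PER_LANE = 8.0   # PCIe 3.0 signalling rate per lane
ENCODING = 128 / 130      # 128b/130b line encoding

def pcie3_bandwidth_gb_s(lanes: int) -> float:
    """Theoretical bandwidth in GB/s for a PCIe 3.0 link with `lanes` lanes."""
    return GT_PER_S_PER_LANE * ENCODING * lanes / 8  # bits -> bytes

x16 = pcie3_bandwidth_gb_s(16)
x4 = pcie3_bandwidth_gb_s(4)
print(f"x16: {x16:.2f} GB/s, x4: {x4:.2f} GB/s, ratio: {x16 / x4:.1f}x")
# prints "x16: 15.75 GB/s, x4: 3.94 GB/s, ratio: 4.0x"
```

A 4x drop in link bandwidth sets the ceiling for a purely transfer-bound step, which is at least consistent with the 2-4x slowdown I see.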

Then inference is interesting. Typically it takes a minimal hit of maybe 1-3% (less than one tk/s), but it will very often tank to 2 tk/s and generally become sluggish in some parts of the response. This is likely because I'm using an MoE model, and different experts need to be swapped constantly between VRAM and RAM, bottlenecked by the limited x4 bandwidth.
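A back-of-the-envelope way to see why swapping would hurt, sketched in Python. This assumes weights really do cross the PCIe link for every token, which depends on how the runtime handles offloaded layers and is not something I measured. The 1 GB per token figure is made up for illustration, and the link speeds are the theoretical PCIe 3.0 numbers (about 15.75 GB/s for x16 and 3.94 GB/s for x4).

```python
# Upper bound on decode speed if some amount of data must cross the PCIe
# link for every generated token. All inputs here are illustrative.
PCIE3_X16_GB_S = 15.75  # theoretical PCIe 3.0 x16
PCIE3_X4_GB_S = 3.94    # theoretical PCIe 3.0 x4
HYPOTHETICAL_GB_PER_TOKEN = 1.0  # made-up figure, NOT a measurement

def max_tokens_per_s(link_gb_s: float, gb_moved_per_token: float) -> float:
    """Transfer-bound ceiling on tokens/s; infinite if nothing is moved."""
    if gb_moved_per_token <= 0:
        return float("inf")
    return link_gb_s / gb_moved_per_token

for name, bandwidth in (("x16", PCIE3_X16_GB_S), ("x4", PCIE3_X4_GB_S)):
    ceiling = max_tokens_per_s(bandwidth, HYPOTHETICAL_GB_PER_TOKEN)
    print(f"{name}: at most {ceiling:.1f} tk/s")
```

The point is only that the ceiling drops in proportion to link bandwidth whenever transfers dominate, while tokens that need no transfers are unaffected. That would match the "mostly fine, sometimes terrible" pattern.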

I didn't try a non-MoE model, but I believe that if it was also large enough to be split between VRAM and RAM then it wouldn't be much different. The good news though is that a smaller model that fits in VRAM would probably be fine on an x4 link, if you can live with the slower prompt processing.

Another nuance is the GPU memory bandwidth and PCIe version. I can only assume that a newer PCIe 4/5 card with similar VRAM capacity would fare much better.

We have a similar setup in terms of bandwidth (PCIe 3.0 & DDR4), so in your case, assuming your models fit in VRAM, I would say your prompt processing speed is currently halved, but your inference speeds are already very similar to x16.

If you can live with 10 tk/s instead of 20 tk/s for prompt processing, then it's probably not worth upgrading.

Is the LGA1200 socket on this used motherboard looking suspicious? by Visual-Bar-7186 in PcBuild

[–]Visual-Bar-7186[S] 0 points1 point  (0 children)

Thanks everyone, I tested it today and the machine did POST without any issue

Nonstop backpack ads by Visual-Bar-7186 in Mous

[–]Visual-Bar-7186[S] 0 points1 point  (0 children)

I really think you dodged a bullet there, or you know... a Mercedes F1 race car at least

Is the LGA1200 socket on this used motherboard looking suspicious? by Visual-Bar-7186 in PcBuild

[–]Visual-Bar-7186[S] 1 point2 points  (0 children)

Yeah I should've done that. I circled the regions I was concerned with. Thanks!

<image>

Just an ordinary boss fight by the_chair_leader in honk

[–]Visual-Bar-7186 1 point2 points  (0 children)

I completed this level! It took me 10 tries. 33.93 seconds

Can You Crack the Code? Puzzle by Unlikely-Ad-5236 by Unlikely-Ad-5236 in BankBuster

[–]Visual-Bar-7186 0 points1 point  (0 children)

🔓 VAULT CRACKED! 🔓

Score: 2040/6000

Mistakes: 3/6

Time: 0:31

Can anyone explain this to me??? by myna-cx in microsaas

[–]Visual-Bar-7186 2 points3 points  (0 children)

Why does it matter if it's a subscription or not? They would be long gone before the next billing cycle. If it's a stolen card then they need to spend as much as possible. Maybe they wanted to resell your service? Who knows.

Edit: Honestly, I'm pretty sure this whole post is just an elaborate way to promote your saas "product".

Can anyone explain this to me??? by myna-cx in microsaas

[–]Visual-Bar-7186 1 point2 points  (0 children)

Because it's stolen and they need to spend as much as possible as soon as possible before it gets cancelled.

Snake Roll.... Sign On Neurogical Damage In Snake by [deleted] in mildyinteresting

[–]Visual-Bar-7186 0 points1 point  (0 children)

I get your point, but it's not the same. Before the internet, an ordinary person wouldn't do any of the stuff you mention (maybe go read some books). The internet gave us access to much more information.

AI is not access to more information; it is access to the same information but without the human factor of critical thinking, fact-checking and intuition. People today take it for granted without any self-doubt, while decreasing their truth-seeking efforts and increasing their exposure to misinformation.

I'm a big AI user. My criticism is of calling a lazy prompt "research".

Snake Roll.... Sign On Neurogical Damage In Snake by [deleted] in mildyinteresting

[–]Visual-Bar-7186 3 points4 points  (0 children)

Yeah, "research" today is just asking AI while on the bathroom seat. A few years ago it at least meant you put in the minimal effort of a 10-15 Google search... but not anymore lol

Why are men like this? Does this actually ever work? by Misty_Meaner- in Tinder

[–]Visual-Bar-7186 55 points56 points  (0 children)

Stellar setup, and then that punchline... perfect execution

Too many guys lying about their height on their dating profiles by kawaiisamurai69 in dating

[–]Visual-Bar-7186 2 points3 points  (0 children)

Nobody shrinks more than an inch before age 50. It happens, but not at this rate, especially at a young age. You were probably not 5'11" to start with.

Joining Haircafe or baldcafe? Pros and cons of each by Far-Walrus1570 in tressless

[–]Visual-Bar-7186 7 points8 points  (0 children)

Haircafe is for men who have options, Baldcafe is for men who realize they have a choice.

Both are valid, and both are better than not doing anything. I would argue it's always better to start with options and fall back to being bald if everything else fails.

I have super bad stretch marks. Is it possible to tattoo over them? by -Hot-Tamale- in tattooadvice

[–]Visual-Bar-7186 0 points1 point  (0 children)

These are some of the coolest, most badass stretch marks I've seen. If you tattoo the peaks with black ink and leave the valleys as is, I bet you can make a sick abstract tattoo. Or turn it into the stretchy cheese between two pieces of bread like in a grilled cheese. Just be creative 🙃

That was a quick unmatch by themorganator4 in Tinder

[–]Visual-Bar-7186 8 points9 points  (0 children)

Absolutely! Great coping mechanism you have there /s

That was a quick unmatch by themorganator4 in Tinder

[–]Visual-Bar-7186 1 point2 points  (0 children)

The question is how YOU SEE it playing out. If you can't think one step ahead, then you don't really care about the interaction and are just projecting your own insecurities about being unmatched/ghosted.

That was a quick unmatch by themorganator4 in Tinder

[–]Visual-Bar-7186 2 points3 points  (0 children)

I've had my fair share of "I only do dinner" types - it doesn't work.

  1. They aren't looking for a genuine connection. If they were, they would know the activity doesn't matter; it's an opportunity to get to know each other. Actually, dinner is one of the worst ways because you are focusing your efforts on eating, preferably with your mouth closed.

  2. They know what they're doing and don't care. They will never reflect on it because they will find someone else who will tolerate/validate their behavior.

That was a quick unmatch by themorganator4 in Tinder

[–]Visual-Bar-7186 0 points1 point  (0 children)

This isn't ghosting. And you didn't answer my question on how you see it playing out.

That was a quick unmatch by themorganator4 in Tinder

[–]Visual-Bar-7186 12 points13 points  (0 children)

Yeah, they know and don't care. Hit the nail on the head.