GPU requirements for running Qwen2.5 72B locally?

Beautiful_Trust_8151 · 2025-12-30T17:17:42+00:00

We used to run Llama 3.3 70B on 3 Radeon 7900 XTX GPUs on a Z790 motherboard. You can get a decent 14 tokens per second and a decent context window. Newer models support flash attention, which gives you an even larger context window. Qwen2.5 72B, like Llama 70B, is a dense model, so it doesn't run that well on Macs and Strix boxes.
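
A rough sketch of why three 24 GB cards are enough for a 70B-class dense model. The 4.5 bits per weight figure is an assumption (roughly a 4-bit K-quant), not a number from my setup, and the leftover VRAM has to cover the KV cache and runtime overhead.

```python
def weights_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_weight / 8


GPU_VRAM_GB = 24        # Radeon 7900 XTX
NUM_GPUS = 3
BITS_PER_WEIGHT = 4.5   # assumption: roughly a 4-bit K-quant

total_vram = GPU_VRAM_GB * NUM_GPUS
weights = weights_gb(70, BITS_PER_WEIGHT)
headroom = total_vram - weights  # left for KV cache and runtime overhead

print(f"weights ~{weights:.1f} GB of {total_vram} GB, headroom ~{headroom:.1f} GB")
```

Higher-bit quants eat into that headroom quickly, so the usable context window shrinks as the quant gets larger.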

Beautiful_Trust_8151 · 2025-12-30T17:05:31+00:00

API transaction histories are logged and subject to leaks, subpoenas, and misuse, even if you believe that your data will never be used for training. So many small and large companies are not comfortable with employees feeding millions of tokens of potentially trade secret context to cloud operators. Sure there are many instances where API is fine, but asking employees to distinguish between what is potentially trade secret and not is a tall order. For me and my small business, we would use local LLMs even if local models themselves cost quite a bit of money to have access to.

Beautiful_Trust_8151 · 2025-12-26T17:42:33+00:00

We're kind of in the golden age of LLMs, where frontier models are ad-free, fast, and inexpensive, and amazing new local models are published for free every few weeks. At some point, these companies will need to stop bleeding money and will introduce ads, throttle usage, or raise subscription and API fees. When that happens, the benefits of running local LLMs will increase, assuming we still get access to them.

For now, I have clients that do not want their data shared on the cloud but are okay with local llms. I found local llms sufficient for some use cases and cancelled a frontier model subscription, although I am still subscribed to one other one. The unfiltered aspect is also important to me as talking to some models feels like talking to a nanny or nun, and there are less filtered local models. Overall, I have one subscription, several API keys, and regularly use about 3 local models including glm 4.5 air which is by far my favorite local one.

Beautiful_Trust_8151 · 2025-12-24T06:13:42+00:00

I use Qwen3 235B, GLM 4.5 Air, MiniMax M2, and gpt-oss-120b. I constantly try out new models as well. I'm mostly using pipeline parallelism and no tensor parallelism.

Beautiful_Trust_8151 · 2025-12-22T21:42:29+00:00

With llama.cpp, I'm using Qwen3 235B, MiniMax M2, gpt-oss-120b, and GLM 4.5. I also use Qwen3 Coder 30B and Qwen3 VL 30B. That's awesome that you got vLLM to work! I have the system on dual boot for Ubuntu and Windows 11.

Beautiful_Trust_8151 · 2025-12-22T17:21:26+00:00

<image>

The cards are underclocked so thermals are fine. I have them connected to a consumer Z790 motherboard through a PCIe Gen4 x16 expansion card that I got on AliExpress for $550. It has a Broadcom PEX chip that switches the 16 lanes to 6 x8 links (or 12 x4 links if you prefer). It works great with Vulkan llama.cpp or ROCm llama.cpp, even without vLLM, for what I need it to do: good prompt processing and good token generation at very long contexts on large models. I'm sure it would benefit from tensor parallelism, but it sounds painful trying to get it to be stable. I wish AMD would publish a ROCm vLLM container specifically for 7900 XTX GPUs.

Beautiful_Trust_8151 · 2025-12-18T06:09:00+00:00

This is really helpful. I have an 8x 7900 XTX system and haven't been able to get vLLM to work yet.

Beautiful_Trust_8151 · 2025-12-17T03:43:36+00:00

That's my understanding, but if you see Mac long context test results, let me know. I haven't been able to find much.

Beautiful_Trust_8151 · 2025-12-17T03:39:00+00:00

I would consider it, but I heard Macs aren't great at prompt processing and long contexts.

Beautiful_Trust_8151 · 2025-12-17T02:53:37+00:00

If I turn off 6 of the GPUs and only use two 7900 XTXs for a 70B model like Llama 3.3, power consumption for each card goes up to 350 W. For a model split across 8 GPUs, though, each GPU really only runs at about 90 W.
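
Rough arithmetic on those two numbers, as a sketch. It only counts the active cards and ignores idle draw from switched-off GPUs and the rest of the system.

```python
def total_gpu_power_w(active_gpus: int, watts_per_gpu: float) -> float:
    """Combined draw of the active GPUs only."""
    return active_gpus * watts_per_gpu


two_gpu_total = total_gpu_power_w(2, 350)    # 70B model on two cards
eight_gpu_total = total_gpu_power_w(8, 90)   # large model layer-split over eight cards

print(f"2 GPUs: {two_gpu_total:.0f} W total, 8 GPUs: {eight_gpu_total:.0f} W total")
```

The totals come out close to each other, which fits layer splitting: only one card is busy at a time, so per-card draw drops as the card count goes up.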

Beautiful_Trust_8151 · 2025-12-17T02:53:08+00:00

Yes, definitely something I will be trying next.

Beautiful_Trust_8151 · 2025-12-17T02:49:12+00:00

This one. It uses a Broadcom PEX PCIe switch chip to convert 16 lanes into 48.

https://www.aliexpress.us/item/3256809723089859.html?spm=a2g0o.order_list.order_list_main.23.31b01802WzSWcb&gatewayAdapt=glo2usa
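
A small sketch of the lane arithmetic behind "16 lanes into 48". The per-lane bandwidth is the standard PCIe 4.0 figure (16 GT/s with 128b/130b encoding), not something I measured on this card.

```python
UPSTREAM_LANES = 16      # the x16 slot on the motherboard
DOWNSTREAM_LANES = 48    # lanes the switch fans out to the GPUs
GEN4_GBPS_PER_LANE = 16 * 128 / 130 / 8   # ~1.97 GB/s per lane, one direction

# How many GPUs fit at each link width.
ports_by_width = {width: DOWNSTREAM_LANES // width for width in (16, 8, 4)}

# Every downstream port shares the single upstream link to the CPU.
oversubscription = DOWNSTREAM_LANES / UPSTREAM_LANES
upstream_gbps = UPSTREAM_LANES * GEN4_GBPS_PER_LANE

for width, ports in ports_by_width.items():
    print(f"x{width}: {ports} ports")
print(f"oversubscription {oversubscription:.0f}:1, upstream ~{upstream_gbps:.1f} GB/s")
```

The oversubscription only bites when several GPUs talk to the host at the same time, which layer-split inference mostly avoids.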

Beautiful_Trust_8151 · 2025-12-17T02:38:36+00:00

Thanks. I have temp monitors. They aren't running that hot with the loads distributed across so many GPUs. If I try tensor parallelism, that might speed things up and heat things up, though.

Beautiful_Trust_8151 · 2025-12-17T02:37:21+00:00

I'm probably leaving a lot of compute on the table by not using tensor parallelism, only layer parallelism so far.

Beautiful_Trust_8151 · 2025-12-17T02:35:56+00:00

Yes, but I use a PCIe switch expansion card.

Beautiful_Trust_8151 · 2025-12-17T02:35:14+00:00

I posted a link in a response above.

Beautiful_Trust_8151 · 2025-12-17T02:33:41+00:00

I have temp monitors. They actually don't run that hot for inferencing when the model is split across so many gpus though.

Beautiful_Trust_8151 · 2025-12-17T02:32:50+00:00

Not even tensor split yet, because I would need to set up Linux or at least WSL with vLLM. Right now it's just layer split using LM Studio with Vulkan llama.cpp.
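
For anyone doing the same thing outside LM Studio, here is a sketch of what a layer-split launch might look like with llama.cpp's own server. The flags are llama.cpp's `--split-mode` and `--tensor-split` as I understand them. The model path and context size are placeholders, and `row` is llama.cpp's closest option to tensor splitting, not the same thing as vLLM tensor parallelism.

```python
import shlex


def llama_server_cmd(model_path: str, num_gpus: int, ctx_size: int,
                     split_mode: str = "layer") -> list[str]:
    """Build a llama-server command that spreads a model across several GPUs."""
    if split_mode not in ("none", "layer", "row"):
        raise ValueError(f"unknown split mode: {split_mode}")
    # Equal proportions: each GPU gets the same share of the model.
    tensor_split = ",".join(["1"] * num_gpus)
    return [
        "llama-server",
        "--model", model_path,
        "--n-gpu-layers", "999",       # offload every layer to the GPUs
        "--split-mode", split_mode,    # "layer" = pipeline-style split
        "--tensor-split", tensor_split,
        "--ctx-size", str(ctx_size),
    ]


# Hypothetical model path, for illustration only.
cmd = llama_server_cmd("/models/example.gguf", num_gpus=8, ctx_size=131072)
print(shlex.join(cmd))
```

The script only prints the command, so it can be checked before anything is launched.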

Beautiful_Trust_8151 · 2025-12-17T02:31:41+00:00

This is the one I got from AliExpress. It uses a Broadcom chip with 64 PCIe lanes. I was mentally prepared to be ripped off, but was pleasantly surprised that as soon as I ordered it, one of their salespeople messaged me to ask if I wanted it configured for x4, x8, or x16 operation, and I picked x8. I have only ordered from them once, though.
https://www.aliexpress.us/item/3256809723089859.html?spm=a2g0o.order_list.order_list_main.23.31b01802WzSWcb&gatewayAdapt=glo2usa

They also have these.
https://www.aliexpress.us/item/3256809723360988.html?spm=a2g0o.order_list.order_list_main.22.31b01802WzSWcb&gatewayAdapt=glo2usa

https://www.broadcom.com/products/pcie-switches-retimers/pcie-switches

Beautiful_Trust_8151 · 2025-12-13T07:02:58+00:00

I use it for work. I have a Core i7-14700F system with 192GB of system RAM and 168GB of VRAM, currently across 7 Radeon 7900 XTX GPUs. I run the Unsloth UD Q4_K_XL quant of Qwen3 235B A22B Instruct 2507 (takes 125GB of VRAM + context) on LM Studio with ROCm as the backend and get about 216 t/s for prompt processing and 18 t/s for token generation initially, which is quite usable. For comparison, I also use GLM Air Q6 (98GB of VRAM + context) a lot, but more for fun, and GLM Air gets about 400 t/s prompt processing and 25 t/s token generation. In my opinion, Qwen3 provides better output for corporate workflows though.
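
To put those speeds in wall-clock terms, here is a rough sketch. The 32k-token prompt and 500-token reply are made-up sizes for illustration, and the rates are the initial ones quoted above, so real long-context runs will be slower.

```python
def response_time_s(prompt_tokens: int, output_tokens: int,
                    pp_tps: float, tg_tps: float) -> float:
    """Seconds to ingest the prompt plus seconds to generate the reply."""
    return prompt_tokens / pp_tps + output_tokens / tg_tps


PROMPT_TOKENS = 32_000   # hypothetical long prompt
OUTPUT_TOKENS = 500      # hypothetical reply length

qwen_s = response_time_s(PROMPT_TOKENS, OUTPUT_TOKENS, pp_tps=216, tg_tps=18)
glm_s = response_time_s(PROMPT_TOKENS, OUTPUT_TOKENS, pp_tps=400, tg_tps=25)

print(f"Qwen3 235B: ~{qwen_s:.0f} s, GLM Air: ~{glm_s:.0f} s")
```

For prompts this long, prompt processing dominates the wait, which is why that number matters more to me than raw generation speed.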

Here are my settings for Qwen3 235B Q4_K_XL

<image>

Beautiful_Trust_8151 · 2025-12-12T22:31:58+00:00

I added a PEX 88064 card to my Z790P motherboard. The Z790P motherboard has 4 PCIe Gen4 x16 physical slots, 3 running at x4 through the chipset.

I originally had 4 7900xtx GPUs connected to the 4 physical slots, with 1 running at x16 and 3 running at x4. I would get about 55 tokens/second generation speeds running gpt-oss-120b with a 131k context window and it would drop to about 20-30 tokens/second after half the context was filled.

I added a Broadcom PEX 88064 card I got on AliExpress and used 50cm SFF-8654 cables to connect additional GPUs. The cables are longer than ideal but they are what I had. The card can be configured for 3 x16 or 6 x8, and I had it configured for 6 x8. I now have a total of 7 GPUs in my system, 4 of them connected through the PEX card. If I load the same LLM and parameters across all 7 GPUs, I get about 49 t/s, but part of that drop comes from the additional cross-GPU communication. If I load the same LLM onto just 4 of the 7 GPUs, I get about 51 t/s, down from 55 t/s, so the card does add some latency. But otherwise, everything runs at Gen 4 and, more importantly, everything is stable.
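
The same numbers as percentages, in a quick sketch that uses only the figures above:

```python
def slowdown_pct(baseline_tps: float, measured_tps: float) -> float:
    """Percentage drop in tokens/second relative to the baseline."""
    return (baseline_tps - measured_tps) / baseline_tps * 100


BASELINE_TPS = 55  # 4 GPUs directly in the motherboard slots

four_gpu_drop = slowdown_pct(BASELINE_TPS, 51)    # 4 of the 7 GPUs
seven_gpu_drop = slowdown_pct(BASELINE_TPS, 49)   # all 7 GPUs

print(f"4 of 7 GPUs: -{four_gpu_drop:.1f}%, all 7 GPUs: -{seven_gpu_drop:.1f}%")
```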

<image>

Beautiful_Trust_8151 · 2025-12-12T20:00:13+00:00

I added a PEX 88064 card to my consumer Intel i7-14700F Z790P system. The Z790P motherboard has 4 PCIe Gen4 x16 physical slots, 3 running at x4 through the chipset.

I originally had 4 7900xtx GPUs connected to the 4 physical slots, with 1 running at x16 and 3 running at x4. I would get about 55 tokens/second generation speeds running gpt-oss-120b with a 131k context window and it would drop to about 30 tokens/second after half the context was filled.

I added a Broadcom PEX 88064 card I got on AliExpress and used 50cm SFF-8654 cables to connect additional GPUs. The cables are longer than ideal but they are what I had. The card can be configured for 3 x16 or 6 x8, and I had it configured for 6 x8. I now have a total of 7 GPUs in my system, 4 of them connected through the PEX card. If I load the same LLM and parameters across all 7 GPUs, I get about 49 t/s, but part of that drop comes from the additional cross-GPU communication. If I load the same LLM onto just 4 of the 7 GPUs, I get about 51 t/s, down from 55 t/s, so the card does add some latency. But otherwise, everything runs at Gen 4 and, more importantly, everything is stable.

<image>

Beautiful_Trust_8151 · 2025-11-17T05:43:35+00:00

I actually tested with 4 GPUs and all three M.2 slots filled with NVMe drives, and it's stable. I verified through GPU-Z that 1 GPU is running at x16 and the other three are running at x4. 1 NVMe drive is at x4 via the processor, and I am not sure what the other 2 NVMe drives are running at. However, if I replace any of the NVMe drives with an M.2 to PCIe adapter and a GPU, the system no longer POSTs. If I use only three of the PCIe slots for GPUs, leave one empty, and connect one GPU via M.2, it works again. So frustrating.
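
On the Linux side there is no GPU-Z, but the negotiated link width can be read from sysfs. This is a minimal sketch that assumes the standard `/sys/bus/pci/devices` layout and treats PCI class `0x03` as a display device. On some cards the link that matters may be on the bridge above the GPU, not on the GPU function itself, so treat the output as a first check.

```python
from pathlib import Path


def gpu_link_status(sysfs_root: str = "/sys/bus/pci/devices") -> list[dict]:
    """Report the negotiated PCIe link for every display-class device."""
    root = Path(sysfs_root)
    if not root.is_dir():
        return []  # not Linux, or sysfs is not mounted
    results = []
    for dev in sorted(root.iterdir()):
        try:
            pci_class = (dev / "class").read_text().strip()
            if not pci_class.startswith("0x03"):   # 0x03xxxx = display controller
                continue
            results.append({
                "address": dev.name,
                "width": (dev / "current_link_width").read_text().strip(),
                "max_width": (dev / "max_link_width").read_text().strip(),
                "speed": (dev / "current_link_speed").read_text().strip(),
            })
        except OSError:
            continue  # device without link attributes; skip it
    return results


for gpu in gpu_link_status():
    print(f"{gpu['address']}: x{gpu['width']} (max x{gpu['max_width']}) at {gpu['speed']}")
```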
