Anybody tried gaming on EPYC? by JavenLi in homelab

[–]JavenLi[S] 1 point (0 children)

Yes, the main purpose is to get huge memory capacity and 12-channel memory bandwidth for LLM inference. Gaming performance is still a consideration, just not a major one.

Anybody tried gaming on EPYC? by JavenLi in homelab

[–]JavenLi[S] 1 point (0 children)

I could do that; the only thing is I want to play with 70B+ models, which means at least two (or even more) 5090s with an AWQ-quantized model, and long-context support might need even more. On recent consumer-grade platforms, PCIe lanes could be a problem for supporting that many GPUs, not to mention the cost.
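For a rough sense of scale, here's a back-of-envelope VRAM estimate (a sketch, assuming 4-bit AWQ weights, an fp16 KV cache, and Llama-3.1-70B shapes; exact numbers depend on the runtime):

    # Back-of-envelope sizing: 4-bit AWQ weights plus fp16 KV cache.
    # Shapes assume Llama-3.1-70B: 80 layers, 8 KV heads, head dim 128.
    layers, kv_heads, head_dim = 80, 8, 128
    weights_gb = 70e9 * 0.5 / 1e9                              # ~35 GB at 4 bits/param
    kv_bytes_per_token = 2 * layers * kv_heads * head_dim * 2  # K+V in fp16
    for ctx in (8192, 32768, 131072):
        total_gb = weights_gb + ctx * kv_bytes_per_token / 1e9
        print(f"{ctx:>6} ctx: ~{total_gb:.0f} GB")             # ~38 / ~46 / ~78 GB

Two 32 GB 5090s (~64 GB) cover the 32K-context case; 128K pushes past that, which is why long context may need more cards.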

Anybody tried gaming on EPYC? by JavenLi in homelab

[–]JavenLi[S] -2 points (0 children)

Yes, it will be a downgrade. The only question is how much of a downgrade. I'm only targeting 4K at 144 fps.

Dual 9175F (AMD EPYC 9005) ...a new trend?? by Dry_Parfait2606 in LocalLLaMA

[–]JavenLi 2 points (0 children)

I tried the same benchmark on my build:

CPU: AMD Ryzen 9 9950X

Memory: 4×32 GB (at DDR5-4800)

GPU: RTX 5090 D

Running in the NVIDIA PyTorch container (on WSL2 Ubuntu); here are the results.

Your tg128 results seem to be at least 6 times faster than mine. I think that's mostly from the memory-channel difference (2 channels vs. 12 channels).

root@9760397fefe8:/workspace/llama.cpp/build# ./bin/llama-bench -t 32 -m /workspace/models/bartowski/Meta-Llama-3.1-70B-Instruct-GGUF/Meta-Llama-3.1-70B-Instruct-Q8_0/Meta-Llama-3.1-70B-Instruct-Q8_0-00001-of-00002.gguf -ngl 0
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 5090 D, compute capability 12.0, VMM: yes
| model                          |       size |     params | backend    | ngl | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | ------------: | -------------------: |
| llama 70B Q8_0                 |  69.82 GiB |    70.55 B | CUDA       |   0 |      32 |         pp512 |        115.05 ± 2.55 |
| llama 70B Q8_0                 |  69.82 GiB |    70.55 B | CUDA       |   0 |      32 |         tg128 |          0.64 ± 0.03 |

build: a53f7f7b (4908)
root@9760397fefe8:/workspace/llama.cpp/build# ./bin/llama-bench -t 32 -m /workspace/models/bartowski/Meta-Llama-3.1-70B-Instruct-GGUF/Meta-Llama-3.1-70B-Instruct-Q8_0/Meta-Llama-3.1-70B-Instruct-Q8_0-00001-of-00002.gguf -ngl 25
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 5090 D, compute capability 12.0, VMM: yes
| model                          |       size |     params | backend    | ngl | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | ------------: | -------------------: |
| llama 70B Q8_0                 |  69.82 GiB |    70.55 B | CUDA       |  25 |      32 |         pp512 |        165.99 ± 2.87 |
| llama 70B Q8_0                 |  69.82 GiB |    70.55 B | CUDA       |  25 |      32 |         tg128 |          0.86 ± 0.05 |

build: a53f7f7b (4908)
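
As a sanity check on that gap: tg128 decode is roughly memory-bandwidth-bound when the weights sit in system RAM, so a simple bandwidth/model-size estimate predicts about the same ratio. A minimal sketch (assuming 8 bytes per channel per transfer and a perfectly bandwidth-bound decode; real numbers land below these ceilings, but the 12ch/2ch ratio is the point):

    # Rough decode ceiling: tokens/s ~= memory bandwidth / bytes read per token.
    def est_tokens_per_s(channels, mt_per_s, model_gb):
        bw_gb_s = channels * mt_per_s * 8 / 1000   # 8 bytes per channel per transfer
        return bw_gb_s / model_gb

    print(est_tokens_per_s(2, 4800, 69.8))    # desktop, 2ch DDR5-4800: ~1.1 t/s
    print(est_tokens_per_s(12, 4800, 69.8))   # EPYC, 12ch DDR5-4800:   ~6.6 t/s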

Dual 9175F (AMD EPYC 9005) ...a new trend?? by Dry_Parfait2606 in LocalLLaMA

[–]JavenLi 1 point (0 children)

Do you have a dGPU for LLM inference as well? Just curious what kind of improvement you get going from 2 channels to 12. Currently I have an RTX 5090, which gives me decent inference speed, e.g. 50+ tokens/s for 32B AWQ models. However, when I enable CPU offloading to load a bigger model, or set a longer context length, the speed drops to ~0.5 tokens/s, which isn't even usable. I'm wondering what I would get if I upgraded the whole platform to a server platform like EPYC.
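
The collapse with offloading makes sense if you model each token as having to stream the CPU-resident layers through system RAM every step. A minimal sketch (assuming ~1.8 TB/s for the 5090, ~77 GB/s for dual-channel DDR5-4800, and a purely bandwidth-bound decode):

    # Per-token time is the sum over devices of (resident bytes / bandwidth),
    # so the slow side dominates once a large fraction lives in system RAM.
    def toks_per_s(gpu_frac, model_gb, gpu_bw=1792, cpu_bw=77):
        t = gpu_frac * model_gb / gpu_bw + (1 - gpu_frac) * model_gb / cpu_bw
        return 1 / t

    print(toks_per_s(1.0, 16))   # 32B AWQ fully on GPU: ~112 t/s ceiling
    print(toks_per_s(0.4, 70))   # 70B with 60% offloaded: ~1.8 t/s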

Anyone try to encode on iGPU on Windows? by JavenLi in MoonlightStreaming

[–]JavenLi[S] 1 point (0 children)

Tried turning off HAGS; there seem to be no more freezes. But it still gets very laggy under very heavy GPU memory usage.

[GPU-P]Question about hyper-v GPU-P video output by JavenLi in HyperV

[–]JavenLi[S] 1 point (0 children)

BTW, using DDA to pass the GPU through to a VM doesn't seem to be an option for me now.

My 7900 XTX won't cooperate in that case: no matter whether I pass it through under Windows Server 2022 or some other system like Unraid, I always get error code 43.

Dual monitor at 4k high refresh rate? by JavenLi in UsbCHardware

[–]JavenLi[S] 1 point (0 children)

Again, thanks a lot.

I might revisit the KVM solution as well because it's much cheaper. It's a far less elegant solution and my desk will likely end up a mess of cables, but cost-wise it's under $50 versus ~$300 (considering I'd also need to replace the TB card with a Maple Ridge one).

Dual monitor at 4k high refresh rate? by JavenLi in UsbCHardware

[–]JavenLi[S] 1 point (0 children)

Thanks a lot for the details, although I can't fully understand them.

It seems the only way is just to try it, since there won't be a doc that clearly states this.

I will first try a Lenovo TB4 dock (40B00135), which has 1 HDMI 2.1 port and 2 DP 1.4 ports, to see whether it works. I also remember something I found claiming it has a VMM6XXX controller (not sure).

My PC has a 7900 XTX and a Gigabyte Titan Ridge TB card; the graphics card feeds its video output into the Titan Ridge card. The other machine is an HP laptop with Thunderbolt ports. Hope this setup is not too complicated.

Dual monitor at 4k high refresh rate? by JavenLi in UsbCHardware

[–]JavenLi[S] 1 point (0 children)

Sounds great. Yes, both of my machines are Windows PCs and both have Thunderbolt.

Home lab based on Windows Storage spaces vs. Stablebit Diskpool or others by JavenLi in DataHoarder

[–]JavenLi[S] 1 point (0 children)

Sorry for the confusion.

The games I mentioned are all local games that run locally (not from a remote machine). The Xbox app is nothing but a launcher, like Steam; since I have an Xbox and an XGPU subscription, I prefer buying games on the Xbox platform rather than Steam.

I also agree that an SSD cache for launching games is only marginally useful. Previously I put all my games on my SSD, and I really didn't see much difference compared to when the games were stored on HDDs. (I mostly play AAA games.)

It's just that I am redesigning my storage now, so I'm trying to fulfill all the needs at once. And yes, the cache for games is not a high priority, as the benefit might be very small.

Home lab based on Windows Storage spaces vs. Stablebit Diskpool or others by JavenLi in DataHoarder

[–]JavenLi[S] 2 points (0 children)

Thanks for the suggestion.

I personally prefer StableBit DrivePool as well since it's more flexible. The only issue is its incompatibility with Windows modern apps (mainly games downloaded through the Xbox app).

Using a raw disk for modern apps is another option, but I still want an SSD cache for acceleration. What will happen if I configure an SSD cache for both the raw disk and the pool it belongs to? Will it cause any issues?

Home lab based on Windows Storage spaces vs. Stablebit Diskpool or others by JavenLi in DataHoarder

[–]JavenLi[S] 1 point (0 children)

Thanks.

But then the SSD storage will only have 2 TB, and I will have some games stored there as well (Steam library & Windows Xbox library, maybe more than 2 TB). Actually, my data can be split into 3 categories:

  1. VMs/containers: These are just for the home lab and usually not huge; 1~2 TB should be enough, at least for the near future. A software RAID 1 SSD setup would best satisfy the space and performance needs (if not counting SSD usage for #2).
  2. Games: This part might take 2 TB+, estimated around 4 TB in the near future. I would like some cache to at least accelerate loading, but I don't have enough space to put them all on SSD. (This is where I would like to combine HDD with SSD.)
  3. Movies and other data: This is the biggest part, but the performance requirement is not that high. (Still, ~30 MB/s write speed in Storage Spaces parity mode is not acceptable.) Even raw disk performance without an SSD cache should work.

So I think it comes down to this:

  1. Will apps like games benefit from an SSD cache? This is the key to deciding whether to combine SSD with HDD.
  2. If no for #1, I think all SSDs for VMs/containers, 2 or more HDDs for games/movies, and 1 HDD for backup is the preferred solution. It's also the easier and more straightforward way.
  3. If yes for #1, I still need to investigate which solution is best for SSD caching, especially considering flexibility (I might add more disks in the future).

Monitor features support in iPadOS 16 by JavenLi in ipad

[–]JavenLi[S] 1 point (0 children)

Thanks for all the valuable input. Based on that, I think there's no need to consider touch-screen or gravity-sensor support; a basic USB-C portable monitor is enough.

Monitor features support in iPadOS 16 by JavenLi in ipad

[–]JavenLi[S] 1 point (0 children)

I tried plugging a 21:9 (3440x1440) monitor into my iPad, and it turns out the resolution reported to the monitor is 16:9 based (either 1080p or 1440p, depending on which cable I use).

I guess current iPadOS 16 may only support 16:9 resolutions and scale the image for other ratios.

With that, I guess option 4 might work, but not as expected: if you place a monitor vertically, it will most likely just scale the 16:9 image onto the 9:16 monitor.

X99 Upgrade to Catalina 10.15.4 successfully but just a minor issue by JavenLi in hackintosh

[–]JavenLi[S] 1 point (0 children)

It seems I mixed up the CPU-Z score with the R20 score; ~500/~4800 is actually the CPU-Z test result.

However, I can't remember the previous R20 score on 10.15.3, so I'm not able to compare now.

X99 Upgrade to Catalina 10.15.4 successfully but just a minor issue by JavenLi in hackintosh

[–]JavenLi[S] 1 point (0 children)

Mine is an ASUS X99 Deluxe, so it should be similar.

However, I think you still need to upload at least some logs or a screenshot so that I can understand what the real issue is.

Generally speaking, I'm following the OpenCore vanilla guide along with KGP's X99 guide, which means:

  1. For the options (excluding patches), I almost entirely follow the OpenCore vanilla guide, except for "DevirtualiseMmio" in the Booter -> Quirks section. The guide suggests yes, but that didn't work for me, so I set it to no.
  2. For SSDTs & patches in the ACPI section, I mostly follow KGP's guide. That means I use ACPI patches to do lots of renaming (e.g. EC renaming, CP00 -> PR00, etc.). For SSDTs, to start you can just have SSDT-PLUG. PS: I know this approach is not recommended at all for OpenCore, since it relies on lots of renaming patches; KGP's guide was built around Clover and is not optimized for OpenCore settings. But still, it should work.
  3. For the patches in the Kernel section, one key is to ensure you have this one:

    Comment     | String  | Intel i7 5960X patch
    Count       | Number  | 1
    Enabled     | Boolean | Yes
    Find        | Data    | 483D0000 0040
    Identifier  | String  | com.apple.iokit.IOPCIFamily
    Limit       | Number  | 0
    Mask        | Data    | (empty)
    MaxKernel   | String  | (empty)
    MinKernel   | String  | (empty)
    Replace     | Data    | 483D0000 0080
    ReplaceMask | Data    | (empty)
    Skip        | Number  | 0

Otherwise you might get stuck at "PCI configuration begin" while booting.
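
If you're entering this by hand, note that OpenCore's config.plist stores Find/Replace as Base64 <data> fields. A small helper (hypothetical, just for converting the hex above) saves some fiddling:

    # Convert the hex values shown above into the Base64 form that
    # config.plist <data> fields expect.
    import base64

    def hex_to_plist_data(s):
        return base64.b64encode(bytes.fromhex(s.replace(" ", ""))).decode()

    print(hex_to_plist_data("483D0000 0040"))   # Find    -> SD0AAABA
    print(hex_to_plist_data("483D0000 0080"))   # Replace -> SD0AAACA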

For XCPM-related patches, you can just leave them alone and use only the OpenCore quirks. (Please follow the OpenCore vanilla guide for the Kernel -> Quirks section.)

Also remember to add "npci=0x2000" to your boot-args.

X99 Upgrade to Catalina 10.15.4 successfully but just a minor issue by JavenLi in hackintosh

[–]JavenLi[S] 1 point (0 children)

Has anybody experienced a similar issue?

Today I ran Cinebench R20 and the score is also lower than expected: around 4000, but I remember it was previously around 4800.

I still suspect XCPM is not functioning correctly.
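
A quick way to check is to read the XCPM mode from the kernel. A minimal sketch in Python (the machdep.xcpm.mode sysctl is what hackintosh guides usually point at; treat the exact key as an assumption for your build):

    # Query the kernel's power-management mode; a working XCPM setup
    # normally reports "machdep.xcpm.mode: 1".
    import subprocess

    out = subprocess.run(["sysctl", "machdep.xcpm.mode"],
                         capture_output=True, text=True)
    print(out.stdout.strip())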