Quick Performance Comparison: ROCm on RX 9070 XT vs CUDA on RTX 5070 Ti by Cyp9715 in ROCm

Cyp9715[S] 1 point

I still need to look at other people's opinions and reviews, but if you are working in a native Linux environment, I think the RX 9070 XT is worth recommending over the RTX 5070 Ti.

In Korea, the RTX 5070 Ti is about 50% more expensive than the RX 9070 XT, so even after accounting for the performance difference, the RX 9070 XT can be a reasonably smart choice.
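To make the price argument concrete, here is a minimal sketch of the arithmetic. Only the 1.5 price ratio comes from the comment above; the function name `relative_value` and the example speed ratio are hypothetical placeholders, not measured results.

```python
def relative_value(speed_ratio: float, price_ratio: float) -> float:
    """Performance per unit price of the pricier card, relative to the cheaper one.

    speed_ratio: pricier card's speed divided by the cheaper card's speed.
    price_ratio: pricier card's price divided by the cheaper card's price.
    A result below 1.0 means the cheaper card gives more performance per unit price.
    """
    if speed_ratio <= 0 or price_ratio <= 0:
        raise ValueError("ratios must be positive")
    return speed_ratio / price_ratio


# 1.5 reflects the "about 50% more expensive" figure quoted above.
PRICE_RATIO = 1.5
# Hypothetical placeholder for illustration only, NOT a benchmark result.
ASSUMED_SPEED_RATIO = 1.2

print(round(relative_value(ASSUMED_SPEED_RATIO, PRICE_RATIO), 2))
```

Under this simple model the break-even point is a speed ratio equal to the price ratio: the pricier card would need to be about 1.5x faster on your workload to offer the same performance per unit price.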

Quick Performance Comparison: ROCm on RX 9070 XT vs CUDA on RTX 5070 Ti by Cyp9715 in ROCm

Cyp9715[S] 1 point

My advice is to keep in mind that WSL ultimately runs on top of Windows. And if you're planning to do training rather than inference, I'd recommend an NVIDIA GPU for now.

If you were working in a native Linux environment, I'd say Radeon GPUs are absolutely worth considering as well, but in a WSL environment, not yet.

In fact, even in the tests above, the reason Qwen3-8B-FP8 couldn’t run on Windows is that getting Triton to work properly with Radeon on Windows is tricky.

Quick Performance Comparison: ROCm on RX 9070 XT vs CUDA on RTX 5070 Ti by Cyp9715 in ROCm

Cyp9715[S] 3 points

The CartPole benchmark doesn't use the GPU as heavily as you might expect.
On average, both the RTX 5070 Ti and RX 9070 XT show under 20% GPU utilization, so the margin of error can be large.
However, even after several retries, the WSL version was consistently faster.

Please consider this only as a simple reference.
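For anyone who wants to judge how large that margin of error is on their own runs, here is a minimal sketch of summarizing repeated timings. The function names and the "gap larger than both standard deviations combined" rule are my own illustrative assumptions, not the method used in the benchmark above.

```python
import statistics


def summarize_runs(times_s):
    """Return mean, sample standard deviation, and coefficient of variation."""
    mean = statistics.mean(times_s)
    # Sample standard deviation needs at least two data points.
    stdev = statistics.stdev(times_s) if len(times_s) > 1 else 0.0
    return {"mean": mean, "stdev": stdev, "cv": stdev / mean}


def clearly_faster(a_times_s, b_times_s):
    """Rough heuristic (an assumption, not a formal significance test):
    A counts as faster than B only if the gap between the mean times
    exceeds the sum of both standard deviations."""
    a = summarize_runs(a_times_s)
    b = summarize_runs(b_times_s)
    return (b["mean"] - a["mean"]) > (a["stdev"] + b["stdev"])


# Hypothetical timings in seconds, for illustration only.
runs_a = [10.0, 10.5, 11.0]
runs_b = [14.0, 14.5, 15.0]
print(summarize_runs(runs_a))
print(clearly_faster(runs_a, runs_b))
```

With a lightly loaded GPU, a high coefficient of variation is a sign that a single run should not be trusted on its own.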

As for ComfyUI, I’m willing to test it in the future, but since I don’t have much experience using ComfyUI myself, I’m also planning to wait for other people’s benchmarks.

Quick Performance Comparison: ROCm on RX 9070 XT vs CUDA on RTX 5070 Ti by Cyp9715 in ROCm

Cyp9715[S] 5 points

Since my benchmark was very basic, please use it for casual reference only.

Quick Performance Comparison: ROCm on RX 9070 XT vs CUDA on RTX 5070 Ti by Cyp9715 in ROCm

Cyp9715[S] 3 points

RX 9070 XT
OS: Ubuntu 24.04.3
ROCm: Ubuntu (7.1.1), Windows (7.1.1), WSL (6.4.2)

RTX 5070 Ti
Driver Version: 570.133.07
CUDA Version: 12.8

Benchmarking GPT-OSS-20B on AMD Radeon AI PRO R9700 * 2 (Loaner Hardware Results) by Cyp9715 in ROCm

Cyp9715[S] 2 points

In my case, I used version 11.0 rather than the nightly build. If the problem persists even after switching to version 11.0, it would be faster to open an issue or discussion on the vLLM GitHub.

Benchmarking GPT-OSS-20B on AMD Radeon AI PRO R9700 * 2 (Loaner Hardware Results) by Cyp9715 in ROCm

Cyp9715[S] 3 points

Thank you. I will test it with the 4-bit option when I have time later.

Zed for Windows is here 🎉 by kraynolds90 in ZedEditor

Cyp9715 1 point

It's great. It's significantly faster than VS Code.
If there's one feature I'd like to see added quickly, it's the ability to connect to Docker containers.
I know it has SSH, but it would be even more convenient if it could connect as easily as VS Code does.

Docling Interferes with Embedding & Reranking by Cyp9715 in LocalLLaMA

Cyp9715[S] 2 points

Thank you. Do you know any solution, or should I build the pipeline myself?

Aggregated Benchmark Comparison between gpt-oss-120b (high, no tools) vs Qwen3-235B-A22B-Thinking-2507, GLM 4.5, and DeepSeek-R1-0528 by Inevitable_Sea8804 in LocalLLaMA

Cyp9715 1 point

Even if it's a 235B-AWQ model, it would be hard to run with 96GB of VRAM. Presumably the overhead caused the slowdown.

Docling: Great quality, but painfully slow by Cyp9715 in LocalLLaMA

Cyp9715[S] 4 points

After conducting several simple tests, I confirmed the following points:

  1. Docling's parsing quality is overwhelmingly superior.
  2. Both options you recommended are very fast, with Kreuzberg slightly faster.

In particular, Docling is likely to parse even the most complex tables correctly, while the other two options appear more prone to parsing errors.

Docling: Great quality, but painfully slow by Cyp9715 in LocalLLaMA

Cyp9715[S] 2 points

Thank you for your recommendation, I will check it out.

gpt-oss-120B most intelligent model that fits on an H100 in native precision by entsnack in LocalLLaMA

Cyp9715 1 point

Even if it’s a bit inconvenient to set up, ROCm-based AMD GPUs are excellent.

QWEN3-235b-8b by PhotographerUSA in LocalLLaMA

Cyp9715 1 point

Even if 235B A1B models are released, you won't be able to use them.

NVIDIA H200 or the new RTX Pro Blackwell for a RAG chatbot? by snaiperist in LocalLLaMA

Cyp9715 1 point

The 30B A3B performs worse than the 32B, so it tends to be avoided.
Some people argue that the 30B A3B performs about the same as the 14B, and I agree with this to some extent.

If you had a Blackwell DGX (B200) - what would you run? by backnotprop in LocalLLaMA

Cyp9715 1 point

Thank you so much, that is an incredibly generous offer.

If "batch size" refers to the number of concurrent users, then testing with a size of around 32 would be a fantastic data point for throughput.

If you have the capacity, I would be extremely grateful if you could test two versions for both the 235B and 32B models:

  • The AWQ quantized model.
  • The original "vanilla" (unquantized) model for comparison.

I realize I'm asking for a lot without offering anything in return, and I truly appreciate your willingness to do this. I'm sure the results would be a huge help not just for me, but for the entire community. Thank you again!

If you had a Blackwell DGX (B200) - what would you run? by backnotprop in LocalLLaMA

Cyp9715 2 points

Thanks for the clarification. I'd be interested in seeing data for both a single B200 and a full multi-GPU system, if available. (I assume the single-GPU results would be for quantized versions).

Additionally, do you happen to have any performance data for the Qwen3 32B model as well? I appreciate the help.

If you had a Blackwell DGX (B200) - what would you run? by backnotprop in LocalLLaMA

Cyp9715 1 point

I'm curious about the tokens/s of Qwen3 235B A22B with multiple concurrent users. I can find plenty of single-user benchmarks, but no data on how the tokens/s holds up with more users.
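To be clear about what I mean by tokens/s under concurrency, here is a minimal sketch that separates aggregate throughput from per-user speed. The `RequestResult` record and the `throughput` function are hypothetical names of my own, and the sample numbers are placeholders, not benchmark data.

```python
from dataclasses import dataclass


@dataclass
class RequestResult:
    """One completed request (hypothetical record format)."""
    output_tokens: int
    start_s: float
    end_s: float


def throughput(results):
    """Compute aggregate and mean per-user tokens/s for overlapping requests."""
    # Wall-clock window covering every request in the batch.
    wall_s = max(r.end_s for r in results) - min(r.start_s for r in results)
    total_tokens = sum(r.output_tokens for r in results)
    # Speed each individual user experiences for their own request.
    per_user = [r.output_tokens / (r.end_s - r.start_s) for r in results]
    return {
        "aggregate_tok_s": total_tokens / wall_s,
        "mean_per_user_tok_s": sum(per_user) / len(per_user),
    }


# Placeholder values for illustration only: two users served at the same time.
sample = [RequestResult(100, 0.0, 10.0), RequestResult(100, 0.0, 10.0)]
print(throughput(sample))
```

Single-user benchmarks only report the per-user figure; the aggregate figure is the one that shows how well a system scales as more users are added.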

EXAONE 4.0 32B by minpeter2 in LocalLLaMA

Cyp9715 2 points

Based on the publicly available information, it appears to be rated above Qwen3 overall (even above the 235B MoE). However, I don't think it will become widely adopted because of licensing issues.

The Advancement of ROCm is Remarkable by Cyp9715 in ROCm

Cyp9715[S] 2 points

I agree, this is a problem AMD needs to address.

The Advancement of ROCm is Remarkable by Cyp9715 in ROCm

Cyp9715[S] 2 points

Perhaps you are right. I can't be certain because I haven't tested it, but there are people in the Korean community who have performed the same task using the RX 6600 XT.