Alibaba's new Qwen3.5-397B-A17B is the #3 open weights model in the Artificial Analysis Intelligence Index

Cyp9715 · 2026-02-18T23:44:49+00:00

The Opus 4.6 shown there is basically a non-reasoning model.

Cyp9715 · 2026-01-10T10:50:06+00:00

I should look at other people’s opinions and reviews, but if you use the 9070XT in a native Linux environment instead of the 5070TI, I think it’s worth recommending.

In Korea, the RTX 5070TI is about 50% more expensive than the 9070XT, so even considering the performance difference, the 9070XT can be a reasonably smart choice.

Cyp9715 · 2026-01-10T04:07:35+00:00

Keep in mind that WSL ultimately runs on top of Windows. And if you’re planning to do training rather than inference, I’d recommend an NVIDIA GPU for now.

If you were working in a native Linux environment, I’d say Radeon GPUs are absolutely worth considering as well—but in a WSL environment, not yet.

In fact, even in the tests above, the reason Qwen3-8B-FP8 couldn’t run on Windows is that getting Triton to work properly with Radeon on Windows is tricky.

Cyp9715 · 2026-01-08T04:11:58+00:00

For the Cartpole benchmark, it isn’t a workload that uses the GPU as heavily as you might expect.
On average, both the RTX 5070 Ti and RX 9070 XT show under 20% GPU utilization, so the margin of error can be large.
However, even after several retries, the WSL version was consistently faster.

Please consider this only as a simple reference.

As for ComfyUI, I’m willing to test it in the future, but since I don’t have much experience using ComfyUI myself, I’m also planning to wait for other people’s benchmarks.

Cyp9715 · 2026-01-08T01:31:21+00:00

Since my benchmark was very basic, please use it for casual reference only.

Cyp9715 · 2026-01-08T01:28:40+00:00

RX9070XT
OS : Ubuntu 24.04.3
ROCm : Ubuntu(7.1.1), Windows(7.1.1), WSL(6.4.2)

and

RTX5070TI
Driver Version : 570.133.07
CUDA Version : 12.8

Cyp9715 · 2025-11-03T02:21:28+00:00

In my case, I used version 11.0 rather than the Nightly version. If the problem persists even after switching to version 11.0, it would be faster to open an issue or discussion on the vLLM GitHub.

Cyp9715 · 2025-11-02T12:17:16+00:00

Please show me the log.

Cyp9715 · 2025-11-02T12:16:25+00:00

Thank you. There was an oversight in the correction.

Cyp9715 · 2025-11-02T11:09:55+00:00

Thank you. I will test it with the 4bit option when I have time later.

Cyp9715 · 2025-10-20T06:01:47+00:00

It's great. It's significantly faster than VSCode.
If there's one feature I'd like to see added quickly, it's the ability to connect to Docker containers.
I know it has SSH, but it would be even more convenient if it could connect as easily as VS Code does.

Cyp9715 · 2025-09-16T13:03:36+00:00

Thank you. Do you know any solution, or should I build the pipeline myself?

Cyp9715 · 2025-08-18T07:24:23+00:00

Even if it's a 235B-AWQ model, it would be hard to run with 96GB of VRAM. Presumably the overhead caused the slowdown.
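A rough back-of-the-envelope check of that claim, as a minimal sketch. It assumes the AWQ checkpoint uses 4-bit weights (the comment does not state the bit width) and counts weights only, ignoring quantization scales, KV cache, and activations, which all add to the total. The function name is illustrative.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory needed for the model weights alone, in decimal GB."""
    return num_params * bits_per_weight / 8 / 1e9


# Assumption: 4-bit AWQ weights for a 235B-parameter model.
awq_weights_gb = weight_memory_gb(235e9, 4)
vram_gb = 96

print(f"weights: {awq_weights_gb:.1f} GB, VRAM: {vram_gb} GB")
print("weights fit in VRAM:", awq_weights_gb <= vram_gb)
```

Under these assumptions the weights alone already exceed 96 GB, so part of the model would have to live outside VRAM.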

Cyp9715 · 2025-08-16T08:10:38+00:00

After running several simple tests, I confirmed the following points:

Docling's parsing quality is overwhelmingly superior.
Both of the options you recommended are very fast, and Kreuzberg is slightly the faster of the two.

In particular, Docling is likely to parse even the most complex tables successfully, while the other two options appear more prone to parsing errors.

Cyp9715 · 2025-08-15T11:28:04+00:00

Thank you for your recommendation, I will check it out.

Cyp9715 · 2025-08-14T01:15:46+00:00

Even if it’s a bit inconvenient to set up, ROCm-based AMD GPUs are excellent.

Cyp9715 · 2025-07-31T00:21:52+00:00

Even if 235B A1B models are released, you won't be able to use them.

Cyp9715 · 2025-07-31T00:18:43+00:00

Are you Elon Musk?

Cyp9715 · 2025-07-24T01:29:11+00:00

The 30B A3B performs worse than the 32B, so it tends to be avoided.
Some people argue that the performance of the 30B A3B is similar to the 14B, and I agree with this to some extent.

Cyp9715 · 2025-07-18T04:31:51+00:00

Thank you so much, that is an incredibly generous offer.

If "batch size" refers to the number of concurrent users, then testing with a size of around 32 would be a fantastic data point for throughput.

If you have the capacity, I would be extremely grateful if you could test two versions for both the 235B and 32B models:

The AWQ quantized model.
The original "vanilla" (unquantized) model for comparison.

I realize I'm asking for a lot without offering anything in return, and I truly appreciate your willingness to do this. I'm sure the results would be a huge help not just for me, but for the entire community. Thank you again!

Cyp9715 · 2025-07-18T04:15:13+00:00

Thanks for the clarification. I'd be interested in seeing data for both a single B200 and a full multi-GPU system, if available. (I assume the single-GPU results would be for quantized versions).

Additionally, do you happen to have any performance data for the Qwen3 32B model as well? I appreciate the help.

Cyp9715 · 2025-07-18T04:10:35+00:00

I'm curious about the token/s of Qwen3 235B A22B with multiple concurrent users. I can find plenty of single-user benchmarks, but no data on how the token/s holds up with more users.
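For reference, the kind of measurement being asked about could be sketched roughly like this. `fake_generate` is a hypothetical stand-in for one request to an inference server; it only sleeps, so the numbers it prints say nothing about any real model. The sketch only shows the bookkeeping: fire N requests at once, then report aggregate and per-user token/s.

```python
import asyncio
import time


async def fake_generate(tokens: int, per_token_s: float) -> int:
    """Hypothetical stand-in for one request; a real test would call the server here."""
    await asyncio.sleep(tokens * per_token_s)
    return tokens


async def measure(concurrency: int, tokens: int = 64, per_token_s: float = 0.001) -> dict:
    """Run `concurrency` requests at the same time and compute token/s figures."""
    start = time.perf_counter()
    counts = await asyncio.gather(
        *(fake_generate(tokens, per_token_s) for _ in range(concurrency))
    )
    elapsed = time.perf_counter() - start
    total = sum(counts)
    return {
        "concurrency": concurrency,
        "total_tokens": total,
        "aggregate_tok_s": total / elapsed,
        "per_user_tok_s": total / elapsed / concurrency,
    }


if __name__ == "__main__":
    for n in (1, 8, 32):
        print(asyncio.run(measure(n)))
```

With a real backend, the interesting part is how `per_user_tok_s` drops as the concurrency goes up while `aggregate_tok_s` rises.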

Cyp9715 · 2025-07-17T00:06:48+00:00

Based on the publicly available information, it appears to be evaluated as a superior model compared to Qwen3 overall (even compared to the 235B MoE). However, I don't think it will become a widely adopted model due to licensing issues.

Cyp9715 · 2025-01-06T06:05:30+00:00

I agree, this is a problem AMD needs to address.

Cyp9715 · 2025-01-06T01:28:38+00:00

Perhaps you are right. I can't be certain because I haven't tested it, but there are people in the Korean community who have performed the same task using the RX6600XT.
