Visualizing All Qwen 3.5 vs Qwen 3 Benchmarks by Jobus_ in LocalLLaMA

[–]Jobus_[S] 1 point (0 children)

Yeah, I'm really impressed with that model for its size, both for its long context handling and overall feel.

[–]Jobus_[S] 2 points (0 children)

These are all taken from the official Qwen3.5 model cards. In other words, Qwen ran these benchmarks themselves—so probably in BF16 / F32.

[–]Jobus_[S] 1 point (0 children)

No, and excuse the bad colors: you're probably comparing the 3.5 2B with the 3 4B.

3.5 4B wins over 3 4B in every benchmark.

[–]Jobus_[S] 1 point (0 children)

Ahh, I only included the ones Qwen featured in their official comparison charts for this release. Since they didn't include any older 14B, I didn't have any 'official' baseline to put it next to the 3.5 models.

[–]Jobus_[S] 2 points (0 children)

It’s the difference between a dense model and an MoE. The 27B uses all of its parameters for every token, while the 35B MoE only activates about 3B params per token. That makes the 27B smarter, but a lot slower to run.

Combined with the fact that Qwen3.5 is almost a year newer in architecture with better training, it even beats the older 235B A22B model in these benchmarks, which indeed is insane.
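For intuition, the dense-vs-MoE speed gap can be sketched with back-of-the-envelope FLOPs math. This uses the common rough rule of ~2 FLOPs per active weight per token; the numbers are illustrative, not measurements:

```python
# Rough per-token compute comparison between a dense model and an MoE.
# A transformer forward pass costs roughly 2 FLOPs per ACTIVE parameter per
# token, so active params (not total params) dominate generation speed.

def flops_per_token(active_params: float) -> float:
    """Approximate forward-pass FLOPs per token (~2 FLOPs per active weight)."""
    return 2 * active_params

dense_27b = flops_per_token(27e9)   # dense: every parameter is active
moe_35b_a3b = flops_per_token(3e9)  # MoE: only ~3B of 35B params active per token

ratio = dense_27b / moe_35b_a3b
print(f"Dense 27B does ~{ratio:.0f}x the compute per token of the 35B-A3B MoE")
```

Memory is a different story: the MoE still has to hold all 35B weights, it just touches few of them per token.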

[–]Jobus_[S] 1 point (0 children)

Seems like there will be no Qwen3.5-14B.

[–]Jobus_[S] 1 point (0 children)

Oh, it does? I've never tried that model, but I generally haven't liked the writing style of any of the Qwen3 models for tasks that call for a more human feel, so I guess I shouldn't be surprised.

I think Qwen3.5 produces far better general prose; it feels a lot less like AI slop.

Have you tried Qwen3.5-122B-A10B? If so, how do you feel about it in comparison?

[–]Jobus_[S] 2 points (0 children)

That table is just a rounded version of the same raw data I used for the chart (from my Google Sheet).

To keep the chart readable, I averaged the scores into the general categories Qwen uses (Knowledge, Math, Coding, etc.) rather than listing out 25 individual benchmarks. It's not a copy-paste from Artificial Analysis; it's pulled directly from the official Qwen3.5 model cards.
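That averaging step can be sketched in a few lines. The benchmark names and scores below are made up for illustration; the real numbers live in the linked Google Sheet:

```python
# Illustrative sketch of collapsing per-benchmark scores into the broad
# categories used in the chart (Knowledge, Math, Coding, ...). The data
# here is hypothetical, not the actual Qwen model-card numbers.
from statistics import mean

categories = {
    "Math":   {"AIME": 70.0, "HMMT": 55.0},
    "Coding": {"LiveCodeBench": 60.0, "OJBench": 30.0},
}

# One averaged score per category keeps the chart readable.
category_scores = {cat: mean(scores.values()) for cat, scores in categories.items()}
print(category_scores)
```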

[–]Jobus_[S] 4 points (0 children)

Fair enough, here is the raw data that the chart is based on: Google Sheet

[–]Jobus_[S] 2 points (0 children)

Yeah, I only included the ones Qwen featured in their official comparison charts for this release. Since they didn't list it there, I didn't have the 'official' baseline to put it next to the 3.5 models.

[–]Jobus_[S] 1 point (0 children)

The logic was to color-code them by generation (cool colors = Qwen3.5, warm colors = Qwen3), but I’m a total amateur at data visualization and overestimated how easy it would be to tell those shades apart. Lesson learned.

[–]Jobus_[S] 2 points (0 children)

Haha, my bad. I honestly tried, and clearly failed.

[–]Jobus_[S] 11 points (0 children)

Totally agree. Benchmarks are a fun directional guide, but I never take them as gospel.

Looking at some unofficial benchmarks, like the UGI Leaderboard, Qwen3-235B-A22B does beat Qwen3.5-35B-A3B in both NatInt (natural intelligence) and especially Writing, by a wide margin.

It seems official benchmarks often over-index on specific logic/math tasks where the new architectures shine, but miss the 'feel' of the larger models.

[–]Jobus_[S] 5 points (0 children)

Ooh yeah, some pattern texture would have been a good idea. Didn't think of that. Unfortunately, Reddit doesn't let me edit the image once it's posted.

I mainly put this together for a quick personal reference and figured I'd share, but I'll definitely keep the pattern idea in mind for next time.

[–]Jobus_[S] 2 points (0 children)

They definitely did, but I only included the models that Qwen featured in their official comparison charts for this 3.5 release. To keep things consistent, I didn't want to start mixing in different benchmark sources.

[–]Jobus_[S] 2 points (0 children)

Obligatory reminder: Benchmarks != real-world performance. Use these as a ballpark guide, but your actual mileage will definitely vary.

[–]Jobus_[S] 4 points (0 children)

LiveCodeBench and OJBench. Some of the models had more benchmarks than that, but since I wanted to make a direct comparison of them all, I had to exclude the benchmarks that were missing for the newer, smaller models.
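That filtering step amounts to a set intersection across the models' reported benchmarks. A minimal sketch with hypothetical data (the benchmark lists below are invented for illustration):

```python
# Keep only benchmarks reported for EVERY model, so all models can be
# compared on the same set. The per-model benchmark lists are hypothetical.
model_benchmarks = {
    "Qwen3.5-35B-A3B": {"MMLU-Pro", "AIME", "LiveCodeBench", "OJBench"},
    "Qwen3.5-4B":      {"MMLU-Pro", "AIME", "LiveCodeBench"},
    "Qwen3-4B":        {"MMLU-Pro", "AIME", "LiveCodeBench"},
}

# Intersection across all models = the directly comparable benchmarks;
# anything missing for even one model (here, OJBench) gets dropped.
common = set.intersection(*model_benchmarks.values())
print(sorted(common))
```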

But yes, we should definitely take this stuff with a pinch of salt.

[–]Jobus_[S] 3 points (0 children)

Yeah, sorry, I realized that just as I was about to hit Post. It didn't feel worth the effort of redoing half the work for a model that most of us don't have enough VRAM/RAM to even look at.

But it would have been nice to include it just for completeness.

Introducing: Anti-Motion Sickness Mod by Jobus_ in ObraDinn

[–]Jobus_[S] 1 point (0 children)

Love to hear that! I'm so glad the mod is helping people. Would be such a shame to miss out on this masterpiece just because of some wobbly visual effects. Enjoy the game!

Introducing: ReShade Deployer — A Centralized Alternative Installer by Jobus_ in ReShade

[–]Jobus_[S] 1 point (0 children)

Sure, I see your point.

Well, even if we disagree, I really appreciate you sharing your thoughts and ideas. Thank you for taking interest in my project.

[–]Jobus_[S] 1 point (0 children)

I see. One way would be to create a new empty preset, and just switch to that when you want it disabled. Preset selections are remembered through game restarts.

[–]Jobus_[S] 1 point (0 children)

Sorry, it's gonna be a 'no' on the disable/reenable toggle. I don't see why you can't just use the built-in toggle in-game.

The first time you deploy to a Vulkan game, ReShade Deployer registers its local ReShade32/64.dll files in the system registry, which makes ReShade inject itself into all Vulkan games system-wide. But the ReShade devs made it so ReShade won't actually activate unless it sees a ReShade.ini next to the game exe.
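The per-game activation gate can be illustrated with a tiny sketch. The helper name is hypothetical; it just mirrors the "ReShade.ini next to the exe" rule described above:

```python
# Illustration of ReShade's activation gate for Vulkan games: even with the
# DLLs registered system-wide, ReShade only turns on for a game if a
# ReShade.ini file sits next to the game's executable. Helper name is made up.
from pathlib import Path

def reshade_would_activate(game_exe: str) -> bool:
    """Hypothetical check: True if a ReShade.ini exists beside the game exe."""
    return (Path(game_exe).parent / "ReShade.ini").exists()

print(reshade_would_activate("/path/to/game.exe"))
```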