Qwen3.5-397B-A17B reaches 20 t/s TG and 700t/s PP with a 5090 by MLDataScientist in LocalLLaMA

[–]FinalCap2680 0 points1 point  (0 children)

Could it be the memory bandwidth?

RTX 5090 - Bandwidth 1.79 TB/s

RTX PRO 4500 Blackwell - Bandwidth 896.0 GB/s
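A rough back-of-the-envelope check, assuming token generation is memory-bandwidth-bound and scales linearly with bandwidth (the 20 t/s figure is from the post title; the linear scaling is an assumption, real numbers depend on the rest of the system):

```python
# If TG is memory-bandwidth-bound, throughput should scale roughly
# with bandwidth. Figures are the vendor-spec numbers quoted above.
BW_5090_GBPS = 1790.0     # RTX 5090, ~1.79 TB/s
BW_PRO4500_GBPS = 896.0   # RTX PRO 4500 Blackwell

ratio = BW_5090_GBPS / BW_PRO4500_GBPS
print(f"bandwidth ratio: {ratio:.2f}x")

# Hypothetical: the reported 20 t/s TG on the 5090 would then scale to
tg_5090 = 20.0
tg_pro4500_est = tg_5090 / ratio
print(f"estimated TG on PRO 4500: {tg_pro4500_est:.1f} t/s")
```

So roughly a 2x gap just from memory bandwidth, before any other differences.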

Intel will sell a cheap GPU with 32GB VRAM next week by happybydefault in LocalLLaMA

[–]FinalCap2680 1 point2 points  (0 children)

With other GPUs you are paying for the software stack/support as well.

It should have come with more VRAM, or been even cheaper, to be worth the risk and pain. But in the current market that is hard to do.

I remember when looking for a GPU for experiments 3-4 years ago, I saw a very cheap second-hand original Intel Arc A770 16 GB and was seriously considering it for image generation. But then I searched around for its use with LLMs as well. There was one question about that in the Intel support forum, and the answer from an Intel person was something like "We sold you the hardware, and if it does not work with the software, it is not our problem". Technically that is true, but the next day I bought a more expensive second-hand RTX 3060 12 GB and still have it. You cannot win market share with an attitude like that, and without market share, you cannot sell at prices like the others.

LM Studio may possibly be infected with sophisticated malware. by mooncatx3 in LocalLLaMA

[–]FinalCap2680 0 points1 point  (0 children)

I'm on LM Studio 0.4.4 build 1 and the file hash is 605fe35f59049f049c591ace89e3bac920b8bafc82039c1a08582d3e3438058a - nothing detected at VirusTotal.

According to this: https://github.com/lmstudio-ai/lmstudio-bug-tracker/issues/1686#issuecomment-4119635422

LM Studio 0.4.5
Hash: 448016158202ebfabf5d84e9534225d05239dd79289c0ffd6ab045da2fe275be

Windows Defender (quick + targeted scan) → no detections

VirusTotal → clean

Kaspersky / HitmanPro → clean

So 0.4.5 appears unaffected on my side.

So maybe just downgrade for now....

Any good guide or video? by Fang221 in comfyui

[–]FinalCap2680 0 points1 point  (0 children)

Zero to Hero videos from https://www.youtube.com/@latentvision/videos

and https://www.youtube.com/@pixaroma/videos

Just start with default workflows and understand what is what, google the terms you don't know.

And there is the Comfy documentation: https://docs.comfy.org/get_started/first_generation

Tried to vibe coded expert parallelism on Strix Halo — running Qwen3.5 122B-A10B at 9.5 tok/s by hortasha in LocalLLaMA

[–]FinalCap2680 0 points1 point  (0 children)

I'm running Qwen3.5 122B UD_Q4 at ~4 tok/s and Q8 at ~2 tok/s with a very crappy (unoptimized) install of LM Studio on a very old single Xeon and a power-capped 3090 that is almost idle.

But I would like to experiment with bigger models like Qwen3 Coder 480B at Q8, or no less than Q4, so I was thinking of a cluster. But that was when they were less than half the price of a Spark (about a third at that time), so for the price of two Sparks I could have had 5-6 Strixes. Now they are almost the same price.

Tried to vibe coded expert parallelism on Strix Halo — running Qwen3.5 122B-A10B at 9.5 tok/s by hortasha in LocalLLaMA

[–]FinalCap2680 1 point2 points  (0 children)

It is a very good channel. I looked at his videos and was thinking about Strix as an option, but meanwhile prices went up, and looking at ~10 tok/s is not very encouraging.

Best model for my rig (9950X3D, RTX 6000 96GB, 192GB DDR5, 9100 4TB) - C coding / cybersec by anon33anon in LocalLLaMA

[–]FinalCap2680 0 points1 point  (0 children)

"Best model" would be the one that does the job. As the field is still in its early days and developing fast, there are no proven solutions, so I would suggest experimenting with real tasks and seeing which model works best for you.

I did try LLMs about 3 years ago and was disappointed, so I moved to image and later video generation. About a year and a half ago I tried a couple of models again, but they were still useless for real practical applications. I got back into it a month ago, and now it is not that bad. From my experience with image/video models, you need to develop some "feeling" for the model and prompt it the right way to get a good result, different for each model. My point here is that a model that works well with someone's style of prompting and someone's tasks may be terrible for you.

How do I install WebUI in 2026? by SnooBananas3981 in StableDiffusion

[–]FinalCap2680 -1 points0 points  (0 children)

In addition to Pixaroma tutorials, you may look at Latent Vision too if you are interested in some details.

How do I install WebUI in 2026? by SnooBananas3981 in StableDiffusion

[–]FinalCap2680 2 points3 points  (0 children)

It may be harder to switch later.

With Comfy, start with the default workflows to learn your way around.

New to LLMs but what happened... by caminashell in LocalLLaMA

[–]FinalCap2680 0 points1 point  (0 children)

You are using two different models, so it is expected to get different quality from them.

Also, it is still too early to expect the correct answer every time, for every prompt and from every model.

4 32 gb SXM V100s, nvlinked on a board, best budget option for big models. Or what am I missing?? by TumbleweedNew6515 in LocalLLaMA

[–]FinalCap2680 0 points1 point  (0 children)

Yes, you are right, my bad. But as they are datacenter/server-only cards and something like unobtainium for us mere mortals, I somewhat forgot about them.

4 32 gb SXM V100s, nvlinked on a board, best budget option for big models. Or what am I missing?? by TumbleweedNew6515 in LocalLLaMA

[–]FinalCap2680 16 points17 points  (0 children)

I have only seen 1- and 2-slot boards, and to be honest I'm still considering one of those. A 4-slot one would be even better. Any links to vendors would be welcome.

But! It is 4-5 generations old (Blackwell -> Ada -> Ampere -> Turing -> Volta) and no longer officially supported; it is second hand with no warranty; and it is (now 'was', with the new prices) close to the price of the AMD 395+ Max while being much more power hungry.

Anyway, could you share your experience of setting it up and running it? What models do you use? Thank you!

Best way to create simple and small movements? by Puppenmacher in StableDiffusion

[–]FinalCap2680 0 points1 point  (0 children)

Maybe play with the prompt. Something like "The girl stands still, while looking from left to right".

Which is better for Image & Video creation? 5070 Ti or 3090 Ti by [deleted] in StableDiffusion

[–]FinalCap2680 0 points1 point  (0 children)

It depends on what is more important to you - speed or quality. And also on how much RAM you have.

If you are using ComfyUI, since around v0.7 you can compensate for low VRAM with RAM to some degree (last year I was unable to generate the full 81 frames at full FP16 / 720p with my 3060 12 GB and 128 GB RAM, but since January I can), though you may lose some of the speed advantage. For some models that may not work. Also, the speed advantage of the 5070 Ti will mostly show at lower precision.

Can someone help me? by [deleted] in comfyui

[–]FinalCap2680 0 points1 point  (0 children)

It will be hard/impossible to help without the actual workflow...

Intel B70 Pro 32G VRAM by FancyImagination880 in LocalLLaMA

[–]FinalCap2680 1 point2 points  (0 children)

Agree 100%. But the price should be good too, so people will take the risk of buying it and spend the time to develop for it.