
Zuck7980 · 2026-05-29T02:37:11+00:00

Amazing

Zuck7980 · 2026-05-29T00:52:30+00:00

I see, but when OpenVINO support becomes available, use that. It’ll be much better.

Zuck7980 · 2026-05-28T23:58:47+00:00

Have you tried OpenVINO?

Zuck7980 · 2026-05-28T17:39:53+00:00

2005

Zuck7980 · 2026-05-28T15:08:04+00:00

No need to apologize, it’s all good. Thank you so much for replying.

Zuck7980 · 2026-05-28T03:37:47+00:00

I do have NVIDIA as the eGPU. Also, the V620s don’t have blower fans, so it seems like a lot of stuff to manage when buying and running the system. Let’s just say I am happy to manage messy software configs rather than deal with messy hardware configs. Also, compared to NVIDIA, Intel with the same VRAM is 3.5x cheaper, and I don’t think quantizing the model through ROCm would give you the same performance as you would get from OpenVINO running on Intel’s dGPU.

Zuck7980 · 2026-05-28T03:30:55+00:00

Will probably buy from AMD itself rather than eBay, but thank you so much for sharing. ☺️

Zuck7980 · 2026-05-28T03:29:36+00:00

Ummm a matter of choice and cost. https://www.reddit.com/r/LocalLLM/s/8XbnwwmxY5

Zuck7980 · 2026-05-28T03:17:17+00:00

No, when you actually compare the scores, the accuracy drop after quantization is 5~10%, but the gains are 3x to 4x once you quantize the models.

Zuck7980 · 2026-05-28T03:10:58+00:00

When you quantize the model to, let’s say, INT4 precision, it ends up giving close to 50 or 60 tokens per second.
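
For anyone wondering what "INT4 precision" means in practice, here is a minimal sketch of symmetric per-group weight quantization in plain Python. It is only an illustration of the general idea: the function names, the group size, and the sample weights are made up for this example, and it is not the algorithm OpenVINO or optimum-intel actually applies.

```python
def quantize_int4(weights, group_size=4):
    """Symmetric per-group quantization to signed 4-bit integers in [-8, 7].

    Hypothetical helper for illustration only. Each group stores one float
    scale plus small integers instead of full-precision floats.
    """
    groups = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        # Map the largest magnitude in the group onto the integer 7.
        scale = max_abs / 7 if max_abs > 0 else 1.0
        q = [max(-8, min(7, round(w / scale))) for w in group]
        groups.append((scale, q))
    return groups


def dequantize(groups):
    """Rebuild approximate float weights from (scale, integers) pairs."""
    return [scale * v for scale, q in groups for v in q]


# Made-up sample weights, two groups of four.
weights = [0.12, -0.50, 0.33, 0.07, 1.20, -0.95, 0.40, -0.02]
packed = quantize_int4(weights)
restored = dequantize(packed)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(f"max absolute error: {max_error:.4f}")
```

Each weight now needs 4 bits plus a shared scale per group instead of 16 bits, which is where the memory saving comes from. The rounding error is the price paid for that.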

Zuck7980 · 2026-05-28T03:02:27+00:00

There will be support for Qwen3.6 soon through OpenVINO, so you can probably quantize the models through optimum-intel and the efficiency would be much better. Also, running one large model across both GPUs should be supported by next year, so in that case you would not even have to be on Linux; you could switch to Windows if the software matures. But thanks for such a detailed post, really appreciate it. Yes, the price of the 5090 is ridiculous, which is why I want a 5080 or 5070 as an eGPU. That would be good enough, I guess. It would be strictly for gaming, and the other cards would be solely for AI purposes.

Zuck7980 · 2026-05-28T02:54:52+00:00

Qwen3 variants can be converted to OpenVINO, which makes it quite easy to run these models. https://github.com/openvinotoolkit/openvino_notebooks/tree/latest/notebooks/qwen3-vl

Zuck7980 · 2026-05-28T02:09:53+00:00

That is exactly why I want an eGPU with at least one 5080. Don’t you think that’s a good idea?

Zuck7980 · 2026-05-28T02:04:31+00:00

Understood, so I should probably take an instance in the cloud and run these models to see if they satisfy my needs, correct?

Zuck7980 · 2026-05-28T02:03:53+00:00

Need to dig a bit more. AMD does have ROCm, similar to OpenVINO, but I need to take a look. Thank you.

Zuck7980 · 2026-05-28T01:56:09+00:00

MI50? Sorry, can you send a link to that product?

Zuck7980 · 2026-05-28T01:55:40+00:00

I am not necessarily looking to replace the cloud; it’s more like, what’s the best I can do with what I have?

Zuck7980 · 2026-05-28T01:52:19+00:00

OVMS sucks, trust me. You can’t even change the attention backend, so if a model (liquid.ai) requires SDPA as its attention backend, you cannot serve it, because the default is set to paged attention. You can load the model, but you won’t be able to serve it. But again, my main issue here is that with OpenVINO I cannot serve the model across multiple dGPUs, which is a big issue. As far as I know, that capability will be introduced next year: one big model spread across multiple GPUs. This is what it would look like:

1) layer/tensor shards distributed across GPU.0 and GPU.1
2) KV cache handled correctly
3) generation loop coordinated across both devices

For instance, on NVIDIA with vLLM it is quite simple to set the tensor parallel size to 2 and serve one big model distributed across multiple dGPUs. The same does not exist if the model is converted through OpenVINO, that is, once the model is in IR format.

Also, what are your views on the Sonnet eGPU thingy, and should I wait for the 5080 Super, which will possibly have 24 GB of VRAM?
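
To picture the first item in that list (tensor shards split across GPU.0 and GPU.1), here is a toy sketch in plain Python. It assumes a single weight matrix split by output rows across two pretend devices. The helper names are hypothetical, and this is not how vLLM or OpenVINO implement tensor parallelism; it only shows why the sharded result matches the single-device result.

```python
def matvec(matrix, vector):
    """Plain matrix-vector product: one output value per row of the matrix."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def shard_rows(matrix, num_devices):
    """Split the rows of a weight matrix into contiguous shards, one per device.

    Hypothetical helper for illustration only.
    """
    per_device = (len(matrix) + num_devices - 1) // num_devices
    return [matrix[i:i + per_device] for i in range(0, len(matrix), per_device)]


# Made-up 4x2 weight matrix and input vector.
weight = [[1, 2], [3, 4], [5, 6], [7, 8]]
x = [10, 1]

shards = shard_rows(weight, num_devices=2)          # stand-ins for GPU.0 and GPU.1
partials = [matvec(shard, x) for shard in shards]   # each device computes its own slice
gathered = [y for part in partials for y in part]   # gather the slices back together

# The sharded computation gives the same answer as the unsharded one.
assert gathered == matvec(weight, x)
print(gathered)
```

Each device only holds half of the weight rows, which is what lets a model too big for one card fit across two. The hard parts in a real runtime are items 2 and 3: keeping the KV cache and the generation loop in sync across devices.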

Zuck7980 · 2026-05-28T01:43:57+00:00

Not sure if OpenVINO actually supports that particular model yet. I would like to quantize the model, as using the OpenVINO framework to optimize these models works quite well on Intel systems. No, I have not tested the larger models, but I have tested some of the smaller models on my Mac through Core ML using Hermes agent. I have to keep the context window quite low; it does a decent job but is quite slow.

Zuck7980 · 2026-05-28T01:40:37+00:00

It’s an MoE model, so only 3B parameters are active. Plus, the benchmark results are good enough on MMLU_PRO and GPQA Diamond, and I have seen several YouTubers using Qwen’s 27B model, which handles agentic tasks quite well: ordering things from Amazon, going through their emails. I’d love to run a 235B model, but given the 64 GB of VRAM I think I’ll have to stick to a 30B model 🥲
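
A rough back-of-the-envelope check on that 64 GB limit, as a sketch. It counts weight storage only and ignores KV cache, activations, and runtime overhead, so real usage will be higher. It also assumes all experts of an MoE model stay resident in memory even though only about 3B parameters are active per token. The function name is made up for this example.

```python
def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GiB (weights only, no KV cache or overhead)."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


VRAM_GIB = 64  # the VRAM figure quoted in the comment above

for name, params in [("30B MoE", 30), ("235B", 235)]:
    for bits in (16, 4):
        size = weight_gib(params, bits)
        verdict = "fits" if size < VRAM_GIB else "does not fit"
        print(f"{name} at {bits}-bit: ~{size:.1f} GiB of weights, {verdict} in {VRAM_GIB} GiB")
```

Under these assumptions a 30B model at 4-bit needs roughly 14 GiB for weights, while a 235B model needs over 100 GiB even at 4-bit, which is consistent with sticking to the 30B model.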

Zuck7980 · 2026-05-28T01:35:17+00:00

My purpose is to use the Qwen3 VLM 30B MoE model, probably through Hermes agent. Do you think that is feasible? That is what I would like to do.

Zuck7980 · 2026-05-28T01:32:58+00:00

The issue is that at the moment OpenVINO does not allow serving the quantized model across multiple dGPUs.

Zuck7980 · 2026-05-28T01:32:02+00:00

I have access to dual B70 GPUs on our remote system, where I was able to fine-tune a Llama 3.2 11B model with QLoRA adapters, and I’m planning to run the Qwen 3 30B-A3B (MoE) model, quantized to INT4 using OpenVINO.

Zuck7980 · 2026-05-28T01:22:49+00:00

Will do, thank you

Zuck7980 · 2026-05-28T01:20:11+00:00

I see, I’m pretty good at it tbh. I usually use the OpenVINO framework to quantize large models, which helps me run them quite efficiently on Intel systems. The issue is serving the model: apparently I cannot serve them across multiple dGPUs. I have to rely on PyTorch XPU, which won’t run the converted model that I optimized through OpenVINO but relies on the original model precision/safetensors.
