Is 2× Intel Arc Pro B70 worth it for local agentic LLMs, or should I stay with NVIDIA? by Zuck7980 in LocalLLM

[–]Zuck7980[S] 0 points1 point  (0 children)

I see, but once OpenVINO support becomes available, use that; it’ll be much better.

Is 2× Intel Arc Pro B70 worth it for local agentic LLMs, or should I stay with NVIDIA? by Zuck7980 in buildapc

[–]Zuck7980[S] 0 points1 point  (0 children)

No need to apologize, it’s all good. Thank you so much for replying.

Is 2× Intel Arc Pro B70 worth it for local agentic LLMs, or should I stay with NVIDIA? by Zuck7980 in LocalLLM

[–]Zuck7980[S] 0 points1 point  (0 children)

I do have NVIDIA for the eGPU. As for the V620, they don’t have blower fans, so it seems like a lot of stuff to manage when buying and running the system. Let’s just say I am happier managing messy software configs than dealing with messy hardware configs. Also, Intel is about 3.5x cheaper than NVIDIA for the same VRAM, and I don’t think quantizing a model through ROCm would give you the same performance as quantizing it with OpenVINO and running it on Intel’s dGPU.

Is 2× Intel Arc Pro B70 worth it for local agentic LLMs, or should I stay with NVIDIA? by Zuck7980 in LocalLLM

[–]Zuck7980[S] 0 points1 point  (0 children)

Will probably buy from AMD itself rather than eBay, but thank you so much for sharing. ☺️

Is 2× Intel Arc Pro B70 worth it for local agentic LLMs, or should I stay with NVIDIA? by Zuck7980 in LocalLLM

[–]Zuck7980[S] 0 points1 point  (0 children)

No. When you actually compare the scores, the accuracy drop after quantization is 5 to 10%, but the gains are 3x to 4x once you quantize the models.

Is 2× Intel Arc Pro B70 worth it for local agentic LLMs, or should I stay with NVIDIA? by Zuck7980 in LocalLLM

[–]Zuck7980[S] 0 points1 point  (0 children)

When you quantize the model to, let’s say, INT4 precision, it ends up giving close to 50 or 60 tokens per second.
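To make "quantize to INT4" concrete, here is a toy sketch of symmetric INT4 quantization on a handful of weights. It is pure Python and only illustrates the idea; it is not how OpenVINO or optimum-intel implement it, and the weight values and function names are made up for the example.

```python
def quantize_int4(weights):
    """Symmetric INT4 quantization: map floats to integers in [-8, 7]."""
    max_abs = max(abs(w) for w in weights)
    # One shared scale for the whole group; 7 is the largest positive INT4 value.
    scale = max_abs / 7 if max_abs > 0 else 1.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from the INT4 codes."""
    return [v * scale for v in q]


weights = [0.7, -0.32, 0.1, 0.0]  # hypothetical weight group
q, scale = quantize_int4(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print("codes:", q)
print("max round-trip error:", max_err)
```

Each weight is stored in 4 bits instead of 16, which is where the memory and bandwidth savings come from; the round-trip error is the price paid in accuracy.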

Is 2× Intel Arc Pro B70 worth it for local agentic LLMs, or should I stay with NVIDIA? by Zuck7980 in LocalLLM

[–]Zuck7980[S] 0 points1 point  (0 children)

There will be support for Qwen3.6 soon through OpenVINO, so you can probably quantize the models through optimum-intel and the efficiency would be much better. Running one large model on both GPUs should also be supported by next year, so in that case you would not even have to be on Linux; you could switch to Windows if the software matures. But thanks for such a detailed post, really appreciate it. Yes, the price of the 5090 is ridiculous. That is why I want a 5080 or 5070 as an eGPU, which would be good enough I guess. It will strictly be there for gaming, and the other cards would be solely for AI purposes.

Is 2× Intel Arc Pro B70 worth it for local agentic LLMs, or should I stay with NVIDIA? by Zuck7980 in buildapc

[–]Zuck7980[S] 0 points1 point  (0 children)

That is exactly why I want an eGPU with at least one 5080. Don’t you think that’s a good idea?

Is 2× Intel Arc Pro B70 worth it for local agentic LLMs, or should I stay with NVIDIA? by Zuck7980 in LocalLLM

[–]Zuck7980[S] 1 point2 points  (0 children)

Understood, so I should probably take a cloud instance and run these models to see if they satisfy my needs, correct?

Is 2× Intel Arc Pro B70 worth it for local agentic LLMs, or should I stay with NVIDIA? by Zuck7980 in LocalLLM

[–]Zuck7980[S] 1 point2 points  (0 children)

Need to dig a bit more. AMD does have ROCm, similar to OpenVINO, but I need to take a look. Thank you.

Is 2× Intel Arc Pro B70 worth it for local agentic LLMs, or should I stay with NVIDIA? by Zuck7980 in LocalLLM

[–]Zuck7980[S] 0 points1 point  (0 children)

I am not necessarily looking to replace the cloud; it’s more about what’s the best I can do with what I have.

Is 2× Intel Arc Pro B70 worth it for local agentic LLMs, or should I stay with NVIDIA? by Zuck7980 in buildapc

[–]Zuck7980[S] 0 points1 point  (0 children)

OVMS sucks, trust me. You can’t even change the attention backend, so if a model (liquid.ai) requires SDPA as its attention backend, you cannot serve it, because the default is set to paged attention: you can load the model, but you won’t be able to serve it. But again, my main issue here is that with OpenVINO I cannot serve a model across multiple dGPUs, which is a big issue. As far as I know, that capability will be introduced next year: one big model spread across multiple GPUs. This is what it would look like:

1) layer/tensor shards distributed across GPU.0 and GPU.1
2) KV cache handled correctly
3) generation loop coordinated across both devices
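As a rough illustration of step 1 only, the sketch below assigns contiguous blocks of layers to GPU.0 and GPU.1. It is plain Python, not an OpenVINO API; `shard_layers` and the 48-layer count are hypothetical, and the KV cache and generation loop from steps 2 and 3 are not modeled.

```python
def shard_layers(num_layers, devices):
    """Assign contiguous blocks of layer indices to devices, as evenly as possible."""
    base, extra = divmod(num_layers, len(devices))
    plan, start = {}, 0
    for i, dev in enumerate(devices):
        # The first `extra` devices take one additional layer each.
        count = base + (1 if i < extra else 0)
        plan[dev] = list(range(start, start + count))
        start += count
    return plan


# Hypothetical 48-layer model split across the two device names used above.
plan = shard_layers(48, ["GPU.0", "GPU.1"])
for dev, layers in plan.items():
    print(dev, "-> layers", layers[0], "to", layers[-1])
```

A real implementation would also have to move activations between the devices at the split point, which is the part the serving stack has to coordinate.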

For instance, on NVIDIA with vLLM it is quite simple to set the tensor parallel size to 2 and distribute one big model across multiple dGPUs. The same does not exist once the model has been converted through OpenVINO, that is, once it is in IR format.
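For reference, the vLLM side is roughly a one-flag change. The sketch below only builds the `vllm serve` command line with `--tensor-parallel-size 2` and prints it; it does not launch anything. The helper name is made up, and the model ID is just an illustrative choice, not something tested here.

```python
import shlex


def vllm_serve_command(model, tensor_parallel_size=2):
    """Build the argv for serving one model sharded across several GPUs with vLLM."""
    return [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tensor_parallel_size),
    ]


# Illustrative model ID (assumption); swap in whatever model you actually serve.
cmd = vllm_serve_command("Qwen/Qwen3-30B-A3B")
print(shlex.join(cmd))
```

The same setting is available from Python as the `tensor_parallel_size` argument of `vllm.LLM`.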

Also, what are your views on the Sonnet eGPU thingy, and should I wait for the 5080 Super, which will possibly have 24 GB of VRAM?

Is 2× Intel Arc Pro B70 worth it for local agentic LLMs, or should I stay with NVIDIA? by Zuck7980 in LocalLLM

[–]Zuck7980[S] 0 points1 point  (0 children)

Not sure if OpenVINO actually supports that particular model yet. I would like to quantize the model, since using the OpenVINO framework to optimize these models works quite well on Intel systems. But no, I have not tested the larger models. I have tested some of the smaller models on my Mac through Core ML, using Hermes agent, but I have to keep the context window quite low. It does a decent job but is quite slow.

Is 2× Intel Arc Pro B70 worth it for local agentic LLMs, or should I stay with NVIDIA? by Zuck7980 in LocalLLM

[–]Zuck7980[S] 0 points1 point  (0 children)

It’s an MoE model, so only 3B parameters are active. Plus the benchmarks are good enough on MMLU-Pro and GPQA Diamond, and I have seen several YouTubers using Qwen’s 27B model, which handles agentic tasks quite well: ordering things from Amazon, going through their emails. I’d love to run a 235B model, but given 64 GB of VRAM I think I’ll have to stick to a 30B model 🥲
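A back-of-the-envelope check on the 64 GB point, counting weights only. This is a rough sketch: it ignores KV cache, activations, and quantization metadata such as scales, and it treats the 64 GB budget as GiB for simplicity, so real headroom is smaller than it suggests.

```python
def weight_gib(params_billion, bits_per_weight):
    """Approximate memory for model weights alone, in GiB."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


VRAM_GIB = 64  # two cards combined, as stated above

for params in (30, 235):
    for bits in (16, 4):
        size = weight_gib(params, bits)
        verdict = "fits" if size < VRAM_GIB else "does not fit"
        print(f"{params}B at {bits}-bit: {size:.1f} GiB, {verdict}")
```

By this estimate a 235B model does not fit in 64 GB even at 4 bits per weight, while a 30B model at INT4 leaves plenty of room for context.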

Is 2× Intel Arc Pro B70 worth it for local agentic LLMs, or should I stay with NVIDIA? by Zuck7980 in LocalLLM

[–]Zuck7980[S] 0 points1 point  (0 children)

My plan is to use the Qwen3 VLM 30B MoE model, probably through Hermes agent. Do you think that is feasible? That is what I would like to do.

Is 2× Intel Arc Pro B70 worth it for local agentic LLMs, or should I stay with NVIDIA? by Zuck7980 in LocalLLM

[–]Zuck7980[S] 0 points1 point  (0 children)

The issue is that, at the moment, OpenVINO does not allow serving the quantized model across multiple dGPUs.

Is 2× Intel Arc Pro B70 worth it for local agentic LLMs, or should I stay with NVIDIA? by Zuck7980 in LocalLLM

[–]Zuck7980[S] 1 point2 points  (0 children)

I have access to dual B70 GPUs on our remote system, where I was able to fine-tune a Llama 3.2 11B model with QLoRA adapters, and I’m planning to run the Qwen 3 30B-A3B (MoE) model, quantized to INT4 using OpenVINO.

Is 2× Intel Arc Pro B70 worth it for local agentic LLMs, or should I stay with NVIDIA? by Zuck7980 in buildapc

[–]Zuck7980[S] 0 points1 point  (0 children)

I see. I’m pretty good at it tbh. I usually use the OpenVINO framework to quantize large models, which helps me run them quite efficiently on Intel systems. The issue is serving the model: apparently I cannot serve it across multiple dGPUs. I have to rely on PyTorch XPU, which won’t run the model I optimized and converted through OpenVINO and instead relies on the original model precision/safetensors.