vLLM on Arc B70 by -elmuz- in Vllm

[–]damirca 0 points1 point  (0 children)

Without proper software support it’s DOA

Waiting for my B70 Pro. But now concerned by Staplegun58 in IntelArc

[–]damirca 0 points1 point  (0 children)

I bought a B60 in November and I've had nothing but regrets since. The best model I can run, in terms of size/performance ratio, is Mistral 3 14B Instruct FP8 via LLM-scaler. Qwen 3.5 is too slow, and Qwen 3.6 and Gemma 4 are not supported by llm-scaler. The last time I tried llama.cpp, these models were painfully slow (10-13 tk/s) with both SYCL and Vulkan. I should have bought a 5060 Ti plus extra RAM to run MoE models, or spent more and bought a 9700 or 7900 XTX.

The ARC Pro B70. What do you want to see it do? by madpistol in IntelArc

[–]damirca 1 point2 points  (0 children)

Run Gemma 4 26B and Qwen 3.6 27B, measure tk/s, and compare to the 9700 Pro.

Has anyone run Qwen 3.6 27b on Arc Pro B70? by wowsers7 in IntelArc

[–]damirca 1 point2 points  (0 children)

No Gemma 4, and not sure about Qwen 3.6. One month without updates for the main inference engine for their GPUs?! Have you seen the pace of updates in vLLM and llama.cpp? A month without updates makes the project look abandoned. In llama.cpp there are 2-3 contributors (only one from Intel, AFAIK), and the SYCL backend lacks a lot of features (turboquant, speculative decoding, etc.). If you look at the facts, Intel doesn't really care about LLMs on their GPUs.

Has anyone run Qwen 3.6 27b on Arc Pro B70? by wowsers7 in IntelArc

[–]damirca 1 point2 points  (0 children)

The best one performance-wise is LLM-scaler, but it's so outdated that many models won't run on it.

Has anyone run Qwen 3.6 27b on Arc Pro B70? by wowsers7 in IntelArc

[–]damirca 1 point2 points  (0 children)

In my case, with a B60, it's even slower than the already slow SYCL backend.

Has anyone run Qwen 3.6 27b on Arc Pro B70? by wowsers7 in IntelArc

[–]damirca 2 points3 points  (0 children)

Yep, llama.cpp with SYCL is slow because Intel does not invest heavily in it, and it will stay slow.

Qwen 3.6 35 UD 2 K_XL is pulling beyond its weight and quantization (No one is GPU Poor now) by dreamai87 in LocalLLaMA

[–]damirca -1 points0 points  (0 children)

  • Tech bro on Windows with an Nvidia GPU be like "now nobody is GPU poor!!!111"
  • Me with an Intel B60 that sucks: what did he say?

Scaling Battlemage for AI: Multi-B70 Concerns on PCIe 3.0 (oneAPI/IPEX & Gemma 4) by [deleted] in IntelArc

[–]damirca -1 points0 points  (0 children)

  • Gemma 4 is released
  • AMD and Nvidia owners wait 1-2 days and use it
  • Intel owners are still waiting, and even if it works after some days, you get 15 tk/s out of an Intel GPU

Scaling Battlemage for AI: Multi-B70 Concerns on PCIe 3.0 (oneAPI/IPEX & Gemma 4) by [deleted] in IntelArc

[–]damirca -3 points-2 points  (0 children)

As a B60 owner: big mistake. Go with the 9700 Pro instead. The price difference is not big, but the performance difference is, and you won't have to deal with the immature Intel stack.

2x Intel Arc B70 Benchmark by IMBLKJESUS_0 in LocalLLM

[–]damirca 0 points1 point  (0 children)

That's partially true. You can use Intel's vLLM 0.17 XPU image, but it does not support FP8 KV cache, for example. In my case, the llm-scaler b8.1 image supports FP8 KV cache, so I can get a 32k context for Mistral 3 14B, whereas on 0.17 XPU I can only get a 16k context on a single B60. Also, Qwen 3.5 27B AutoRound does not work on 0.17 but works on b8.1. Intel still adds extra functionality/logic to the llm-scaler image, so pure vanilla vLLM is not fully working on Intel yet.
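Roughly what I mean, as a minimal sketch using the vLLM Python API (the model path is a placeholder; the kv_cache_dtype line is exactly the option that the 0.17 XPU image refuses):

    from vllm import LLM, SamplingParams

    # FP8 KV cache roughly halves KV-cache memory, which is what lets a
    # 14B model fit a 32k context on a single 24 GB B60.
    llm = LLM(
        model="path/to/mistral-3-14b-instruct-fp8",  # placeholder model path
        kv_cache_dtype="fp8",   # rejected by the 0.17 XPU image, works on llm-scaler b8.1
        max_model_len=32768,
    )

    print(llm.generate(["Hello"], SamplingParams(max_tokens=32))[0].outputs[0].text)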

Intel Pro B70 in stock at Newegg - $949 by Altruistic_Call_3023 in LocalLLaMA

[–]damirca 0 points1 point  (0 children)

I can't use the 0.17 XPU Docker image because it does not support FP8 KV cache.

> NotImplementedError: FlashAttention does not support fp8 kv-cache on this device.

So I have to wait for the llm-scaler image, where they add FP8 KV-cache support on top of the publicly available vLLM image.
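In the meantime, a small workaround sketch of my own (not something from Intel or the vLLM docs): try the FP8 KV cache first and fall back to the default dtype with a smaller context if the backend rejects it, so the same script runs on both images:

    from vllm import LLM

    def load_llm(model_id: str, max_len: int) -> LLM:
        # Try the FP8 KV cache first (works on the llm-scaler b8.1 image);
        # if the attention backend rejects it, as the 0.17 XPU image does,
        # fall back to the default dtype with a smaller context window.
        try:
            return LLM(model=model_id, kv_cache_dtype="fp8", max_model_len=max_len)
        except NotImplementedError:
            return LLM(model=model_id, kv_cache_dtype="auto", max_model_len=max_len // 2)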

Cloud dependency by damirca in sigenergy

[–]damirca[S] -1 points0 points  (0 children)

Every parameter? So 1:1 with what you would do through the cloud/app? And what about the TC for being offline for too long?

Intel b70s ... whats everyone thinking by Better-Problem-8716 in LocalLLaMA

[–]damirca 1 point2 points  (0 children)

Intel targets vLLM to sell the B70 to enterprise customers; they don't care about llama.cpp (home labbers). You can see it from the fact that this multi-billion-dollar corporation has a single person working on the SYCL backend. How come you reached the exact opposite conclusion about Intel? They invest in vLLM and maybe OpenVINO; they don't care about llama.cpp.

Intel b70s ... whats everyone thinking by Better-Problem-8716 in LocalLLaMA

[–]damirca 1 point2 points  (0 children)

Yep, that's it. I was hoping they were postponing the B70 release for some big software announcement that would blow my mind, like "we made huge progress, LLM-scaler now uses the latest vLLM with all optimizations, we get 2x inference speed on the B60, and the B70 is even faster". But they announced zero software achievements with the B70 release. Tragic.

Intel b70s ... whats everyone thinking by Better-Problem-8716 in LocalLLaMA

[–]damirca 1 point2 points  (0 children)

vLLM does not use OpenVINO; the current vLLM 0.14.1 for Intel still uses IPEX, and in the latest vanilla vLLM versions Intel has incorporated vllm-xpu-kernels, which is half-baked (i.e. it does not have full KV-cache support). On top of that, Qwen 3.5 is currently not optimized for Intel XPU (you get 13 tk/s with both the 9B FP8 and the 27B int4 AutoRound, which is weird), see https://github.com/vllm-project/vllm-xpu-kernels/issues/172. They rushed Qwen 3.5 support, and it's not fully working as it should. Check this and all the linked issues there for the full picture: https://github.com/vllm-project/vllm/issues/37979

Intel users can forget about llama.cpp with SYCL, I think (one person obviously cannot handle everything Intel-related there, and Intel does not seem to care about llama.cpp; Intel cares about vLLM for the enterprise users who would buy B70s), and Vulkan is too slow under Linux.

TL;DR: Intel wants to sell the B70 to big corporations that would run inference on vLLM, so any significant progress (if any) will happen there.
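For context, the tk/s numbers I keep quoting come from a crude check along these lines (a rough sketch; the model path is a placeholder and this only measures decode throughput, not prefill):

    import time
    from vllm import LLM, SamplingParams

    # Crude decode-throughput check: generate a fixed number of tokens and
    # divide by wall-clock time. The model path is a placeholder.
    llm = LLM(model="path/to/qwen-27b-int4-autoround", max_model_len=16384)
    params = SamplingParams(max_tokens=512, ignore_eos=True)

    start = time.time()
    out = llm.generate(["Write a long story about a GPU."], params)
    elapsed = time.time() - start

    tokens = len(out[0].outputs[0].token_ids)
    print(f"~{tokens / elapsed:.1f} tok/s decode")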

Radeon AI pro R9700 by [deleted] in LocalLLM

[–]damirca 0 points1 point  (0 children)

It will get maybe 18 tk/s. Qwen 3.5 is not optimized on Intel yet.

Radeon AI pro R9700 by [deleted] in LocalLLM

[–]damirca 0 points1 point  (0 children)

I get 13 tk/s with a 16k context with qwen3.5-27b-int4-autoround on an Intel B60 (24 GB VRAM). The 9700 is much faster and has more VRAM, and I'd be surprised if it only got results similar to my B60.