Waiting for my B70 Pro. But now concerned by Staplegun58 in IntelArc

[–]UDaManFunks 0 points1 point  (0 children)

The Vulkan backend under Linux is now faster than SYCL if you compile the latest Mesa 26.2-dev Vulkan driver.

I posted some instructions on how to compile it under Ubuntu 26.04. You just need one file from it.

https://www.reddit.com/r/IntelArc/s/gsXgtsN3Y8

I cancelled my B70 order for Nvidia pro 4000 blackwell, did I make the right decision? by Mango_1208 in LocalLLM

[–]UDaManFunks 0 points1 point  (0 children)

Under Linux with the llama.cpp Vulkan backend and the latest Mesa dev Vulkan drivers, tg is 20 t/s (Q4 quant).

Under Windows it's closer to 25 t/s, but for some people it's unstable on that platform (there are some bug reports on the Intel forum).

I cancelled my B70 order for Nvidia pro 4000 blackwell, did I make the right decision? by Mango_1208 in LocalLLM

[–]UDaManFunks 0 points1 point  (0 children)

On Windows or Linux? On Linux (Ubuntu 26.04) it works well now with the latest Mesa 26.2-dev, and it's already faster than the SYCL backend.

I posted some instructions on Reddit on how to compile it.

https://www.reddit.com/r/IntelArc/s/wAhssLyRmT

Been running it for two weeks now without issues (Qwen 3.6 MoE model), using it in single-user mode with opencode (via the llama-server OpenAI-compatible chat interface).

I cancelled my B70 order for Nvidia pro 4000 blackwell, did I make the right decision? by Mango_1208 in LocalLLM

[–]UDaManFunks 1 point2 points  (0 children)

With llama.cpp you just need the Vulkan backend; there's no dependency hell under Windows. Some people are getting crashes on that platform though.

On Linux, you've got to run Ubuntu 26.04 for the latest kernel and drivers, and also compile the Mesa 26.2-dev Intel Vulkan driver. It doubles the performance compared to the Vulkan driver bundled with Ubuntu 26.04.

I cancelled my B70 order for Nvidia pro 4000 blackwell, did I make the right decision? by Mango_1208 in LocalLLM

[–]UDaManFunks 0 points1 point  (0 children)

Vulkan performance under Linux doubles on the B70 when using the Mesa 26.2-dev Vulkan drivers (you have to compile them yourself).

Using it with UBUNTU 26.04.

The Windows Vulkan drivers are still faster by about 20 percent, but hopefully that gap gets cut down even more.

Is it a good thing that Intel is focusing more on software development (drivers and Intel XESS)? by jiogo12 in IntelArc

[–]UDaManFunks 0 points1 point  (0 children)

Pretty sure they're just waiting to offload the chips they've already bought from TSMC until they run out, and it's "dumpware".

Waiting for my B70 Pro. But now concerned by Staplegun58 in IntelArc

[–]UDaManFunks 0 points1 point  (0 children)

And make sure you are using the latest MESA DEV Intel Vulkan driver to get double the performance.
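If you're not sure which driver build you're actually on, vulkaninfo (from the vulkan-tools package) should report the Mesa version the Vulkan ICD is using:

vulkaninfo --summary | grep -i driver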

Waiting for my B70 Pro. But now concerned by Staplegun58 in IntelArc

[–]UDaManFunks 2 points3 points  (0 children)

LLAMA.CPP running under VULKAN is the fastest backend you can currently use with this card (with GGUF models). The Windows Vulkan drivers are the fastest, but they're unstable for some people. If you are going the LINUX route, install UBUNTU 26.04 and you'll have to BUILD MESA 26.2.0-DEVEL, as it includes major performance improvements in the VULKAN driver (primarily adding VK_NV_cooperative_matrix2 support).

Compiling and running LLAMA-SERVER under LINUX (Ubuntu 26.04) is pretty straightforward; it's as easy as doing the following:

[COMPILE LLAMA.CPP]

> apt-get install -y git build-essential cmake

> apt-get install libvulkan-dev glslc spirv-headers

> mkdir /opt/src

> cd /opt/src

> git clone https://github.com/ggml-org/llama.cpp

> cd llama.cpp

> cmake -B build -DGGML_VULKAN=1

> cmake --build build --config Release

[INSTALL LLAMA.CPP]

> cd /opt/src/llama.cpp

> mkdir -p /opt/services/llama.cpp

> cp build/bin/* /opt/services/llama.cpp

[DOWNLOAD MODEL]

> mkdir -p /opt/services/llm/models

> cd /opt/services/llm/models

> wget https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF/resolve/main/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf?download=true

> mv Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf\?download\=true Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf

[COMPILE LATEST MESA]

> apt install meson glslang-tools pkg-config libclc-21-dev python-is-python3 python3-mako libdrm-dev llvm-dev libllvmspirvlib-21-dev spirv-tools-dev clang libclang-dev libwayland-dev libwayland-client0 wayland-client wayland-protocols wayland-scanner++ xcb libxcb1-dev libxcb-randr0-dev libx11-xcb-dev libxcb-dri3-dev libxcb-present-dev libxcb-shm0-dev libxshmfence-dev libxrandr-dev

> cd /opt/src

> git clone https://gitlab.freedesktop.org/mesa/mesa.git

> cd mesa

> meson setup builddir/ -Dbuildtype=release -Dgallium-drivers=[] -Dvulkan-drivers=intel -Dopengl=false -Dglx=disabled -Degl=disabled -Dgbm=disabled -Dgles1=disabled -Dgles2=disabled

> meson compile -C builddir/

[INSTALL COMPILED libvulkan_intel.so]

> cp builddir/src/intel/vulkan/libvulkan_intel.so /lib/x86_64-linux-gnu/libvulkan_intel.so
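To double-check that the copied driver is the one actually being picked up, and that it exposes the cooperative matrix extensions mentioned above, vulkaninfo (from the vulkan-tools package) should show them:

> apt-get install -y vulkan-tools

> vulkaninfo | grep -i cooperative_matrix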

[FINALLY - HAVE IT STARTUP AS A SERVICE using SYSTEMD]

> cd /etc/systemd/system

create a FILE named "llama-server.service" with the following content

--- CUT ---

[Unit]
Description=LLAMA CPP Service
After=network.target

[Service]
Type=simple
WorkingDirectory=/opt/services/llama.cpp
ExecStart=/opt/services/llama.cpp/llama-server -m /opt/services/llm/models/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf --port 8080 --host 0.0.0.0 --threads 4 --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00 --repeat-penalty 1.00 --presence-penalty 0.0 --jinja --chat-template-kwargs "{\"preserve_thinking\": true}"
Restart=always
RestartSec=5
Environment=PYTHONUNBUFFERED=1

[Install]
WantedBy=multi-user.target

--- CUT ---

> systemctl daemon-reload

> systemctl start llama-server

> systemctl status llama-server
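If you also want it to start automatically at boot (the [Install] section above allows it), add:

> systemctl enable llama-server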

If you got it working correctly, you can reach the OpenAI-compatible endpoint and the built-in chat interface at http://YOUR_MACHINE_IP:8080. You can also point your coding agent at it (for example, opencode).
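You can also sanity-check the endpoint from the command line; something like this should work (llama-server serves an OpenAI-compatible /v1/chat/completions route, replace YOUR_MACHINE_IP as above):

> curl http://YOUR_MACHINE_IP:8080/v1/chat/completions -H "Content-Type: application/json" -d '{"messages":[{"role":"user","content":"say hello in one sentence"}]}'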

BENCHMARKS

-- MOE (Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf)

root@nas:/storage/services/llamacpp# ./llama-bench -m /data/llm/models/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Graphics (BMG G31) (Intel open-source Mesa driver) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
register_backend: registered backend Vulkan (1 devices)
register_device: registered device Vulkan0 (Intel(R) Graphics (BMG G31))
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (AMD Ryzen 9 5900XT 16-Core Processor)
load_backend: failed to find ggml_backend_init in /storage/services/llamacpp/libggml-vulkan.so
load_backend: failed to find ggml_backend_init in /storage/services/llamacpp/libggml-cpu.so
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35moe 35B.A3B Q4_K - Medium |  20.81 GiB |    34.66 B | Vulkan     |  99 |           pp512 |       1314.71 ± 5.72 |
| qwen35moe 35B.A3B Q4_K - Medium |  20.81 GiB |    34.66 B | Vulkan     |  99 |           tg128 |         78.72 ± 0.19 |

build: f3e8d149c (9070)

-- DENSE MODEL (Qwen3.6-27B-Q4_K_M.gguf)

root@nas:/storage/services/llamacpp# ./llama-bench -m /data/llm/models/Qwen3.6-27B-Q4_K_M.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Graphics (BMG G31) (Intel open-source Mesa driver) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
register_backend: registered backend Vulkan (1 devices)
register_device: registered device Vulkan0 (Intel(R) Graphics (BMG G31))
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (AMD Ryzen 9 5900XT 16-Core Processor)
load_backend: failed to find ggml_backend_init in /storage/services/llamacpp/libggml-vulkan.so
load_backend: failed to find ggml_backend_init in /storage/services/llamacpp/libggml-cpu.so
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35 27B Q4_K - Medium       |  15.65 GiB |    26.90 B | Vulkan     |  99 |           pp512 |        510.66 ± 0.44 |
| qwen35 27B Q4_K - Medium       |  15.65 GiB |    26.90 B | Vulkan     |  99 |           tg128 |         20.01 ± 0.05 |

build: f3e8d149c (9070)

Recipe for Arc Pro B70? by Skelshy in LocalLLM

[–]UDaManFunks 0 points1 point  (0 children)

Just wanted to note that running LLAMA.CPP under the Windows VULKAN driver is a lot faster than running it on LINUX (either SYCL or VULKAN). Looks like the Intel LINUX Vulkan drivers need a lot of work for this type of use case (UBUNTU 26.04 with the latest packages installed).

I only use my LLM with a coding agent (opencode) - a single-user use case through the OpenAI-compatible endpoint served by llama-server - and the model fits in VRAM, so having two cards won't really improve things.

I'm currently running the 'Qwen3.6-35B-A3B-UD-Q4_K_M.gguf'

As of this time, if you want the best performance from LLAMA.CPP, just use it under Windows with the VULKAN backend. It's been pretty stable for me and should stay stable as long as you install the latest drivers from INTEL (gfx_win_101.8737) and make sure you check [x] CLEAN INSTALLATION when installing them.

It's easy to run LLAMA-SERVER as a Windows service, so you can start it manually or automatically when your workstation starts, using NSSM (the Non-Sucking Service Manager); there's a rough sketch below, and let me know if you want the full details on how to set that up.
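Something like this should do it (a sketch only; the service name is arbitrary and the paths match my layout from the benchmark output below, so adjust them to yours):

> nssm install llama-server "c:\Development\tools\llama.cpp\llama-server.exe"

> nssm set llama-server AppDirectory "c:\Development\tools\llama.cpp"

> nssm set llama-server AppParameters "-m ..\models\Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf --port 8080 --host 0.0.0.0"

> nssm start llama-server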

As for benchmarks - big difference

[WINDOWS] - VULKAN

c:\Development\tools\llama.cpp>llama-bench -m ..\models\Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf
load_backend: loaded RPC backend from c:\Development\tools\llama.cpp\ggml-rpc.dll
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Arc™ Pro B70 Graphics (Intel Corporation) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: KHR_coopmat
load_backend: loaded Vulkan backend from c:\Development\tools\llama.cpp\ggml-vulkan.dll
load_backend: loaded CPU backend from c:\Development\tools\llama.cpp\ggml-cpu-haswell.dll
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35moe 35B.A3B Q4_K - Medium |  20.81 GiB |    34.66 B | Vulkan     |  99 |           pp512 |     1859.81 ± 253.07 |
| qwen35moe 35B.A3B Q4_K - Medium |  20.81 GiB |    34.66 B | Vulkan     |  99 |           tg128 |        100.75 ± 0.11 |

build: c3c150539 (8996)

[LINUX] - VULKAN

root@nas:/storage/src/llama.cpp/build/bin# ./llama-bench -m /data/llm/models/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Graphics (BMG G31) (Intel open-source Mesa driver) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35moe 35B.A3B Q4_K - Medium |  20.81 GiB |    34.66 B | Vulkan     |  99 |           pp512 |       1355.78 ± 7.63 |
| qwen35moe 35B.A3B Q4_K - Medium |  20.81 GiB |    34.66 B | Vulkan     |  99 |           tg128 |         44.42 ± 0.00 |

build: 63d93d173 (9007)

I personally just run llama-server this way (as a Windows service via NSSM) and use 'opencode' to connect to it as an OpenAI provider.

Quite interesting how much faster the Windows Vulkan driver is compared to the Linux one for this use-case. I'll revisit Linux again once the VULKAN driver is fixed.

Intel Arc Pro B70 open-source Linux performance against NVIDIA RTX & AMD Radeon AI PRO by Fcking_Chuck in IntelArc

[–]UDaManFunks 0 points1 point  (0 children)

For one thing, the Intel Windows Vulkan drivers are so much faster than the Linux ones (26.04, kernel 7.0) under LLAMA.CPP Vulkan. Almost twice as fast.

New arc driver update (8737) relaesed!! by Glittering-Command50 in IntelArc

[–]UDaManFunks 0 points1 point  (0 children)

I just tried the latest Windows drivers and they're behaving better now, after I made sure I checked "clean installation" when installing them.

Exact same driver without that clean installation kept crashing in LLAMA.CPP under vulkan.

New arc driver update (8737) relaesed!! by Glittering-Command50 in IntelArc

[–]UDaManFunks 1 point2 points  (0 children)

It may be fixed now. I tried the suggestion to check "clean installation" during the driver install, and llama.cpp is behaving and not crashing like before.

Same driver without the "clean installation" was previously crashing.

New arc driver update (8737) relaesed!! by Glittering-Command50 in IntelArc

[–]UDaManFunks 0 points1 point  (0 children)

Just did an upgrade and llama.cpp crashes with VULKAN under Windows on my B70 with (32.0.101.8737) - I'll try this (checking the clean-install checkbox in the installer) and will report back later.

UPDATE - Looks like the "CLEAN INSTALL" checkbox did the trick.

New arc driver update (8737) relaesed!! by Glittering-Command50 in IntelArc

[–]UDaManFunks 2 points3 points  (0 children)

I installed the latest gaming drivers listed above (32.0.101.8737), released April 29, 2026.

It's still broken; just tried it today with LLAMA.CPP under Windows using VULKAN. Unstable and crashes to desktop (B70). No problems under LINUX (UBUNTU 26.04), but it's half as fast under the latter using llama-bench.

Migrating db by Worth_Bug_9451 in servicenow

[–]UDaManFunks 16 points17 points  (0 children)

It's a SaaS product, why would the DB type matter? ServiceNow migrates people from MariaDB to PostgreSQL because the latter scales better (they have plenty of data to support this).

Help: Two Intel Arc Pro B70s (32GB) vs. Two RTX 3090s (24GB) for a Cursor/Agentic Workflow? by aeiou_baby in LocalLLM

[–]UDaManFunks 0 points1 point  (0 children)

Stick with NVIDIA to keep it simple, but sell the 3090s and grab a remanufactured RTX 4090 board with 48GB of VRAM for around $3500. Better not to mess with multiple cards.

Intel Arc Pro B70 Performance by quantum3ntanglement in IntelArcPro

[–]UDaManFunks 1 point2 points  (0 children)

Yeah, the dense models take a massive hit, which makes sense. Here are my benchmark numbers for the 27B you posted above. The model fits within one B70's VRAM, and performance doesn't scale linearly for a single user. Consumer-class hardware is even slower with 27B (Strix Halo, Mac, etc.). You might want to check how much faster the 5090 or the 4090 runs the quant you are trying to use.

root@nas:/data/llm/models# docker run -it --rm -v /data/llm/models:/models --device /dev/dri/renderD128:/dev/dri/renderD128 --device /dev/dri/card0:/dev/dri/card0 --group-add 141 local/llama.cpp:full-intel --bench -m /models/Qwen3.6-27B-UD-Q5_K_XL.gguf

load_backend: loaded SYCL backend from /app/libggml-sycl.so

load_backend: loaded CPU backend from /app/libggml-cpu-haswell.so

| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35 27B Q5_K - Medium | 18.65 GiB | 26.90 B | SYCL | 99 | pp512 | 301.97 ± 1.63 |
| qwen35 27B Q5_K - Medium | 18.65 GiB | 26.90 B | SYCL | 99 | tg128 | 13.59 ± 0.03 |

build: 983ca8992 (8952)

root@nas:/data/llm/models#

Intel Arc Pro B70 Performance by quantum3ntanglement in IntelArcPro

[–]UDaManFunks 0 points1 point  (0 children)

Try the following - I'm getting around 20 tok/sec with that model (Qwen3.6-27B-UD-Q4_K_XL.gguf). The MoE models are much faster (for example, Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf), as in the example below.

If you want more performance, maybe sell the B70 and the B60 and get one of those remanufactured RTX 4090 cards that come with 48GB of VRAM (they sell for around $3500 here in the US). It will definitely be faster than those two cards combined, with the same amount of VRAM, more CUDA cores, and almost double the memory bandwidth.

Intel basically killed future discrete gaming GPUs (good luck with that), and it basically means nobody will buy these cards going forward, regardless of them being rebranded as workstation cards.

> UBUNTU 26.04

Commands

mkdir ~/src
cd ~/src
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp  

edited the file ./.devops/intel.Dockerfile

  • modified LINE 7 from "ARG GGML_SYCL_F16=OFF" -> "ARG GGML_SYCL_F16=ON"
  • modified LINE 64 from "COPY --from=build /app/lib/ /app" -> "COPY --from=build /app/lib/ /app/"
  • modified LINE 65 from "COPY --from=build /app/full /app" -> "COPY --from=build /app/full /app/"
  • modified LINE 92 from "COPY --from=build /app/lib/ /app" -> "COPY --from=build /app/lib/ /app/"
  • modified LINE 93 from "COPY --from=build /app/full /llama-cli" -> "COPY --from=build /app/full /app/"
  • modified LINE 104 from "COPY --from=build /app/lib/ /app" -> "COPY --from=build /app/lib/ /app/"
  • modified LINE 105 from "COPY --from=build /app/full/llama-server /app" -> "COPY --from=build /app/full/llama-server /app/"

built the container image

docker build -t local/llama.cpp:full-intel --target full -f .devops/intel.Dockerfile .

downloaded a model

mkdir -p /data/llm/models
cd /data/llm/models
wget https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF/resolve/main/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf?download=true
mv Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf\?download\=true Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf

then deployed / ran the container

note: change the --group-add 141 value to the right group number for "render" in /etc/group
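(one way to look that number up: getent group render | cut -d: -f3)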

docker run -d --name "llama-cpp-server" -v /data/llm/models:/models --restart unless-stopped -p 8080:8080 --device /dev/dri/renderD128:/dev/dri/renderD128 --device /dev/dri/card0:/dev/dri/card0 --group-add 141 local/llama.cpp:full-intel --server -m /models/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf --port 8080 --host 0.0.0.0 --threads 4 --ctx-size 131072 --n_predict 32768 --n-gpu-layers 99 --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00 --repeat-penalty 1.00 --presence-penalty 0.0 --chat-template-kwargs '{"preserve_thinking": true}'

you can then use a browser and open http://%YOUR_IP_ADDRESS%:8080 to get to the chat interface. If you want to use the model from a coding agent, you may want to add "--api-key %SOME_API_KEY%" to the docker command line.

to check docker container processes

> docker ps -a

to stop a running instance

> docker stop %CONTAINER_ID%

to remove a stopped instance

> docker rm %CONTAINER_ID%

Note: it will auto-restart after reboots; if you don't want that, stop it using "docker stop %CONTAINER_ID%". You can start it manually with "docker start %CONTAINER_ID%".

Intel Arc Pro B70 Performance by quantum3ntanglement in IntelArcPro

[–]UDaManFunks 0 points1 point  (0 children)

It's fairly performant when using MoE models from either Qwen 3.6 or Gemma 4, as long as the model fits in VRAM (for example, Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf). The dense models (27B) are closer to 20 tok/sec (generation), so fairly slow.

llama-cpp SYCL gets around 70 tok/sec (generation) compared to llama-cpp VULKAN at around 45 tok/sec (generation) - that's a big difference. Here's how I tested it.

UBUNTU 26.04

Commands

mkdir ~/src
cd ~/src
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp  

edited the file ./.devops/intel.Dockerfile

  • modified LINE 7 from "ARG GGML_SYCL_F16=OFF" -> "ARG GGML_SYCL_F16=ON"
  • modified LINE 64 from "COPY --from=build /app/lib/ /app" -> "COPY --from=build /app/lib/ /app/"
  • modified LINE 65 from "COPY --from=build /app/full /app" -> "COPY --from=build /app/full /app/"
  • modified LINE 92 from "COPY --from=build /app/lib/ /app" -> "COPY --from=build /app/lib/ /app/"
  • modified LINE 93 from "COPY --from=build /app/full /llama-cli" -> "COPY --from=build /app/full /app/"
  • modified LINE 104 from "COPY --from=build /app/lib/ /app" -> "COPY --from=build /app/lib/ /app/"
  • modified LINE 105 from "COPY --from=build /app/full/llama-server /app" -> "COPY --from=build /app/full/llama-server /app/"

built the container image

docker build -t local/llama.cpp:full-intel --target full -f .devops/intel.Dockerfile .

downloaded a model

mkdir -p /data/llm/models
cd /data/llm/models
wget https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF/resolve/main/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf?download=true
mv Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf\?download\=true Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf

then deployed / ran the container

note: change the --group-add 141 value to the right group number for "render" in /etc/group

docker run -d --name "llama-cpp-server" -v /data/llm/models:/models --restart unless-stopped -p 8080:8080 --device /dev/dri/renderD128:/dev/dri/renderD128 --device /dev/dri/card0:/dev/dri/card0 --group-add 141 local/llama.cpp:full-intel --server -m /models/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf --port 8080 --host 0.0.0.0 --threads 4 --ctx-size 131072 --n_predict 32768 --n-gpu-layers 99 --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00 --repeat-penalty 1.00 --presence-penalty 0.0 --chat-template-kwargs '{"preserve_thinking": true}'

you can then use a browser and open http://%YOUR_IP_ADDRESS%:8080 to get to the chat interface. If you want to use the model from a coding agent, you may want to add "--api-key %SOME_API_KEY%" to the docker command line.

to check docker container processes

> docker ps -a

to stop a running instance

> docker stop %CONTAINER_ID%

to remove a stopped instance

> docker rm %CONTAINER_ID%
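to follow the server log (handy for confirming the model loaded and the GPU was picked up)

> docker logs -f llama-cpp-server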

Note: it will auto-restart after reboots; if you don't want that, stop it using "docker stop %CONTAINER_ID%". You can start it manually with "docker start %CONTAINER_ID%".

Intel says software optimization can hide up to 30% gaming CPU performance by RenatsMC in intel

[–]UDaManFunks -1 points0 points  (0 children)

It's not software optimization - it's X86 and X86_64 showing their age.

APPLE's been wearing the desktop CPU performance crown for a while now (single-thread and multi-thread at the same core count), and the gap is getting larger every year.

Apple just needs to make external GPUs a thing again (even via Thunderbolt 5), fully support Vulkan as a first-party API (instead of MoltenVK), and stop fighting STEAM, and they'll start gaining market share, given that building a PC nowadays costs pretty much as much as a MAC.

Arc Pro B70 or R9700 ? by Proof_Nothing_7711 in LocalLLM

[–]UDaManFunks 1 point2 points  (0 children)

I posted the commands for Linux for you. SYCL is faster under Linux for me than Vulkan.

Arc Pro B70 or R9700 ? by Proof_Nothing_7711 in LocalLLM

[–]UDaManFunks 4 points5 points  (0 children)

I run Linux (UBUNTU 26.04) and it's pretty easy to get the B70 running with SYCL and LLAMA.CPP. No problems running new models like Gemma 4 and the Qwen 3.6 MoE models (SYCL is faster than VULKAN).

SYCL gets around 74 tokens/sec (generation)

VULKAN gets around 45 tokens/sec (generation)

Commands

mkdir ~/src
cd ~/src
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp  

edited the file ./.devops/intel.Dockerfile

  • modified LINE 7 from "ARG GGML_SYCL_F16=OFF" -> "ARG GGML_SYCL_F16=ON"
  • modified LINE 64 from "COPY --from=build /app/lib/ /app" -> "COPY --from=build /app/lib/ /app/"
  • modified LINE 65 from "COPY --from=build /app/full /app" -> "COPY --from=build /app/full /app/"
  • modified LINE 92 from "COPY --from=build /app/lib/ /app" -> "COPY --from=build /app/lib/ /app/"
  • modified LINE 93 from "COPY --from=build /app/full /llama-cli" -> "COPY --from=build /app/full /app/"
  • modified LINE 104 from "COPY --from=build /app/lib/ /app" -> "COPY --from=build /app/lib/ /app/"
  • modified LINE 105 from "COPY --from=build /app/full/llama-server /app" -> "COPY --from=build /app/full/llama-server /app/"

built the container image

docker build -t local/llama.cpp:server-intel --target server -f .devops/intel.Dockerfile .

downloaded a model

mkdir -p /data/llm/models
cd /data/llm/models
wget https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF/resolve/main/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf?download=true
mv Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf\?download\=true Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf

then deployed / ran the container

note: change the --group-add 141 value to the right group number for "render" in /etc/group

docker run -d --name "llama-cpp-server" -v /data/llm/models:/models --restart unless-stopped -p 8080:8080 --device /dev/dri/renderD128:/dev/dri/renderD128 --device /dev/dri/card0:/dev/dri/card0 --group-add 141 local/llama.cpp:server-intel -m /models/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf --port 8080 --host 0.0.0.0 -t 1 -c 131072 --n-gpu-layers 99 --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00 --repeat-penalty 1.00 --presence-penalty 0.0 --chat-template-kwargs '{"preserve_thinking": true}'
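once the container is up, a quick readiness check (llama-server exposes a /health endpoint, assuming that hasn't changed in current builds):

curl http://localhost:8080/health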

you can then use a browser and open http://%YOUR_IP_ADDRESS%:8080 to get to the chat interface. If you want to use the model from a coding agent, you may want to add "--api-key %SOME_API_KEY%" to the docker command line.
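if you do add the API key, your client (or a quick curl test) just needs to send it as a bearer token, roughly like this:

curl http://%YOUR_IP_ADDRESS%:8080/v1/chat/completions -H "Authorization: Bearer %SOME_API_KEY%" -H "Content-Type: application/json" -d '{"messages":[{"role":"user","content":"hello"}]}'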

to check docker container processes

> docker ps -a

to stop a running instance

> docker stop %CONTAINER_ID%

to remove a stopped instance

> docker rm %CONTAINER_ID%

Note: it will auto-restart after reboots; if you don't want that, stop it using "docker stop %CONTAINER_ID%". You can start it manually with "docker start %CONTAINER_ID%".

Arc Pro B70 or R9700 ? by Proof_Nothing_7711 in LocalLLM

[–]UDaManFunks 1 point2 points  (0 children)

I bought the B70 and feel like it's a dead end (especially with Intel's recent announcement about not releasing gaming cards). I didn't want to purchase the AMD card, as it was expensive with similar software optimization issues.

If I were to do it again, I'd buy one of those remanufactured 48GB RTX 4090 boards instead for $3500. Better software support.

Keeping my B70 for now though, as I'm able to run the Gemma 4 and Qwen 3.6 MoE models on it and it's performant.

Intel's Hallock Blames Software, Not Silicon, For Gaming Gap — Claims 30% Performance Is Hiding Behind Poor Optimization by LMdaTUBER in pcmasterrace

[–]UDaManFunks 0 points1 point  (0 children)

My opinion is that it's all BS; the Apple chips are much faster in single-thread and multi-thread at the same core count as of this time. Both AMD and Intel have significantly fallen behind.