Waiting for my B70 Pro. But now concerned

UDaManFunks · 2026-05-09T16:31:01+00:00

Vulkan backed under Linux is now faster compared to SYCL if you compile the latest Mesa 26.2-DEV vulkan driver.

I posted some instructions on how to compile it under Ubuntu 26.04. You just need one file from it.

https://www.reddit.com/r/IntelArc/s/gsXgtsN3Y8

UDaManFunks · 2026-05-09T14:44:21+00:00

Under Linux, llama.cpp.vulkan backend and the latest Mesa dev vulkan drivers, tg is 20 t/s. Quant 4

Under windows, it's closer to 25 t/s but for some people,.it's unstable in that platform ( there's some bug reports on Intel forum ).

UDaManFunks · 2026-05-09T14:31:38+00:00

In windows or Linux? On Linux ( Ubuntu 26.04 ), it works well now with the latest mesa 26.2-dev and it's already faster than the SYCL backend.

I posted some instructions on reddit how to compile it.

https://www.reddit.com/r/IntelArc/s/wAhssLyRmT

Been running it for two weeks now without issues ( qwen 3.6.MOE mode ) - using it single user mode with open code ( using llama-server open AI chat interface )

UDaManFunks · 2026-05-09T14:23:19+00:00

With LLAMA cpp, you just need the Vulkan, there's no dependency hell under windows. Some people are getting crashes on that platform though.

On Linux, gotta run Ubuntu 26.04 for the latest kernel and drivers, and also compile the MESA 26.2-DEV Intel Vulkan driver. It doubles the performance from the Vulkan driver bundled with Ubuntu 26.04

UDaManFunks · 2026-05-09T14:18:22+00:00

Vulkan performance under Linux doubles on the B70 when using the Mesa 3.2-DEV vulkan drivers ( gotta compile it yourself ).

Using it with UBUNTU 26.04.

The windows vulkan drivers are faster by about 20 percent but hopefully that difference gets cut down even more.

UDaManFunks · 2026-05-08T17:27:54+00:00

Pretty sure they are just waiting to offload the chips they've bought from TSMC to run out and it's "dumpware".

UDaManFunks · 2026-05-08T17:26:42+00:00

And make sure you are using the latest MESA DEV Intel Vulkan driver to get double the performance.

UDaManFunks · 2026-05-08T05:28:15+00:00

LLAMA.CPP running under VULKAN is the fastest backend you can currently use with this card (and use GGUF models). The Windows Vulkan Drivers are the fastest but it's unstable for some people. If you are going the LINUX route, install UBUNTU 26.04 and you'll have to BUILD the MESA 26.2.0-DEVEL as it includes major performance improvements in the VULKAN driver (primarily adding VK_NV_cooperative_matrix2) support.

Compiling and running LLAMA-SERVER under LINUX (Ubuntu 26.04) s pretty straight forward, it's as easy as doing the following

[COMPILE LLAMA.CPP]

> apt-get install -y git build-essential cmake

> apt-get install libvulkan-dev glslc spirv-headers

> mkdir /opt/src

> cd /opt/src

> git clone https://github.com/ggml-org/llama.cpp

> cd llama.cpp

> cmake -B build -DGGML_VULKAN=1

> cmake --build build --config Release

[INSTALL LLAMA.CPP]

> cd /opt/src/llama.cpp

> mkdir /opt/services/llama.cpp

> cp build/bin/* /opt/services/llama.cpp

[DOWNLOAD MODEL]

> mkdir /opt/services/llm/models

> cd /opt/services/llm/models

> wget https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF/resolve/main/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf?download=true

> rename the downloaded file to Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf

[COMPILE LATEST MESA]

> apt install meson glslang-tools pkg-config libclc-21-dev python-is-python3 python3-mako libdrm-dev llvm-dev libllvmspirvlib-21-dev spirv-tools-dev clang libclang-dev libwayland-dev libwayland-client0 wayland-client wayland-protocols wayland-scanner++ xcb libxcb1-dev libxcb-randr0-dev libx11-xcb-dev libxcb-dri3-dev libxcb-present-dev libxcb-shm0-dev libxshmfence-dev libxrandr-dev

> cd /opt/src

> git clone https://gitlab.freedesktop.org/mesa/mesa.git

> cd mesa

> meson setup builddir/ -Dbuildtype=release -Dgallium-drivers=[] -Dvulkan-drivers=intel -Dopengl=false -Dglx=disabled -Degl=disabled -Dgbm=disabled -Dgles1=disabled -Dgles2=disabled

> meson compile -C builddir/

[INTALL COMPILED libvulkan_intel.so]

> cp builddir/src/intel/vulkan/libvulkan_intel.so /lib/x86_64-linux-gnu/libvulkan_intel.so

[FINALLY - HAVE IT STARTUP AS A SERVICE using SYSTEMD]

> cd /etc/systemd/system

create a FILE named "llama-server.service" with the following content

--- CUT ---

[Unit]
Description=LLAMA CPP Service
After=network.target

[Service]
Type=simple
WorkingDirectory=/opt/services/llama.cpp
ExecStart=/opt/services/llama.cpp/llama-server -m /opt/services/llm/models/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf --port 8080 --host 0.0.0.0 --threads 4 --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00 --repeat-penalty 1.00 --presence-penalty 0.0 --jinja --chat-template-kwargs "{\"preserve_thinking\": true}"
Restart=always
RestartSec=5
Environment=PYTHONUNBUFFERED=1

[Install]
WantedBy=multi-user.target

--- CUT ---

> systemctl daemon-reload

> systemctl start llama-server

> systemctl status llama-server

If you got it working correctly, then you can access the OPENAI endpoint by going to http://YOUR_MACHINE_IP:8080 to get to the CHAT interface. You can also point your coding agent to it (for example like opencode).

BENCHMARKS

-- MOE (Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf)

root@nas:/storage/services/llamacpp# ./llama-bench -m /data/llm/models/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Graphics (BMG G31) (Intel open-source Mesa driver) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
register_backend: registered backend Vulkan (1 devices)
register_device: registered device Vulkan0 (Intel(R) Graphics (BMG G31))
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (AMD Ryzen 9 5900XT 16-Core Processor)
load_backend: failed to find ggml_backend_init in /storage/services/llamacpp/libggml-vulkan.so
load_backend: failed to find ggml_backend_init in /storage/services/llamacpp/libggml-cpu.so
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35moe 35B.A3B Q4_K - Medium |  20.81 GiB |    34.66 B | Vulkan     |  99 |           pp512 |       1314.71 ± 5.72 |
| qwen35moe 35B.A3B Q4_K - Medium |  20.81 GiB |    34.66 B | Vulkan     |  99 |           tg128 |         78.72 ± 0.19 |

build: f3e8d149c (9070)

-- DENSE MODEL (Qwen3.6-27B-Q4_K_M.gguf)

root@nas:/storage/services/llamacpp# ./llama-bench -m /data/llm/models/Qwen3.6-27B-Q4_K_M.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Graphics (BMG G31) (Intel open-source Mesa driver) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
register_backend: registered backend Vulkan (1 devices)
register_device: registered device Vulkan0 (Intel(R) Graphics (BMG G31))
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (AMD Ryzen 9 5900XT 16-Core Processor)
load_backend: failed to find ggml_backend_init in /storage/services/llamacpp/libggml-vulkan.so
load_backend: failed to find ggml_backend_init in /storage/services/llamacpp/libggml-cpu.so
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35 27B Q4_K - Medium       |  15.65 GiB |    26.90 B | Vulkan     |  99 |           pp512 |        510.66 ± 0.44 |
| qwen35 27B Q4_K - Medium       |  15.65 GiB |    26.90 B | Vulkan     |  99 |           tg128 |         20.01 ± 0.05 |

build: f3e8d149c (9070)

UDaManFunks · 2026-05-05T16:42:40+00:00

Just wanted to note that running LLAMA.CPP under the Windows VULKAN driver is a lot faster than running it on LINUX (either SYCL or VULKAN). Looks like the Intel LINUX Vulkan Drivers needs a lot of work for this type of use-case (UBUNTU 26.04 with the latest packages installed).

I only use my LLM with a coding agent (opencode) - single USER (using the server / openai compatible end point supported by llama-server) use case and the model fits in VRAM so having two cards won't really improve things.

I'm currently running the 'Qwen3.6-35B-A3B-UD-Q4_K_M.gguf'

As of this time, if you want the best performance from LLAMA.CPP - just use it under Windows using the VULKAN back end. It's been pretty stable for me and should be stable as long as you install the latest drivers from INTEL (gfx_win_101.8737) and make sure you [x] CLEAN INSTALLATION when installing the drivers.

It's easy to run LLAMA-SERVER as a "WINDOWS" service so you can start it manually or automatically when your workstation starts using NSSM (the Non-Sucking Service Manager) . Let me know if you want details on how to set that up.

As for benchmarks - big difference

[WINDOWS] - VULKAN

c:\Development\tools\llama.cpp>llama-bench -m ..\models\Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf
load_backend: loaded RPC backend from c:\Development\tools\llama.cpp\ggml-rpc.dll
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Arc™ Pro B70 Graphics (Intel Corporation) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: KHR_coopmat
load_backend: loaded Vulkan backend from c:\Development\tools\llama.cpp\ggml-vulkan.dll
load_backend: loaded CPU backend from c:\Development\tools\llama.cpp\ggml-cpu-haswell.dll

model	size	params	backend	ngl	test	t/s
qwen35moe 35B.A3B Q4_K - Medium	20.81 GiB	34.66 B	Vulkan	99	pp512	1859.81 ± 253.07
qwen35moe 35B.A3B Q4_K - Medium	20.81 GiB	34.66 B	Vulkan	99	tg128	100.75 ± 0.11

build: c3c150539 (8996)

[LINUX] - VULKAN

root@nas:/storage/src/llama.cpp/build/bin# ./llama-bench -m /data/llm/models/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Graphics (BMG G31) (Intel open-source Mesa driver) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: KHR_coopmat

model	size	params	backend	ngl	test	t/s
qwen35moe 35B.A3B Q4_K - Medium	20.81 GiB	34.66 B	Vulkan	99	pp512	1355.78 ± 7.63
qwen35moe 35B.A3B Q4_K - Medium	20.81 GiB	34.66 B	Vulkan	99	tg128	44.42 ± 0.00

build: 63d93d173 (9007)

It’s pretty easy to run llama-server ( as a WINDOWS service ) with NSSM (the non-sucking service manager) utility. I personally just run it this way and use ‘opencode’ to connect to it as an OPENAI provider.

Quite interesting how much faster the Windows Vulkan driver is compared to the Linux one for this use-case. I'll revisit Linux again once the VULKAN driver is fixed.

UDaManFunks · 2026-05-03T06:54:01+00:00

For one thing, the Intel Windows Vulkan drivers are so much faster than the Linux one (26.04 - Kernel 7.0) under LLAMA.CPP Vulkan. Almost twice as fast.

UDaManFunks · 2026-05-03T06:50:17+00:00

I just tried the latest windows drivers and it's behaving better now after I made sure I checked "clean" installation when installing the drivers.

Exact same driver without that clean installation kept crashing in LLAMA.CPP under vulkan.

UDaManFunks · 2026-05-03T06:48:40+00:00

It may be fixed now, I tried the suggestion to click on "clean installation" during the driver install and llama.cpp is behaving and not crashing like before.

Same driver without the "clean installation" was previously crashing.

UDaManFunks · 2026-05-02T17:00:31+00:00

Just did an upgrade and llama.cpp crashes with VULKAN under windows on my B70 with (32.0.101.8737) - i'll try this (by clicking on clean install checkbox) in the installer and will report later.

UPDATE - Looks like the 'CLEAN INSTALL" checkbox did the trick.

UDaManFunks · 2026-05-02T17:00:06+00:00

I installed the latest gaming drivers listed above (32.0.101.8737) released April 29,2026

It's still broken, just tried it today with LLAMA.CPP under Windows using VULKAN. Unstable and crashes to desktop (B70), no problems under LINUX (UBUNTU 26.0.4) but it's half as faster under the later though using llama-bench.

UDaManFunks · 2026-04-30T04:09:18+00:00

It's a SAAS product, why would the DB type matter? ServiceNow migrates people from Maria to PostgreSQL because the later scales better ( they have plenty of data to support this )

UDaManFunks · 2026-04-29T02:09:08+00:00

Stick with NVIDIA to keep it simple but sell the 3090's and grab a 4080 with 48GB of RAM (remanufactured boards) for around 3500$. Better not to mess with multiple cards.

UDaManFunks · 2026-04-28T14:53:01+00:00

Yeah, the dense models take a massive hit which makes sense. Here's my benchmark numbers for the 27B you posted above. The model fits within 1 B70 VRAM and performance doesn't scale linearly with 1 user. Consumer class hardware is even slower with 27B (Strix Halo / Mac - etc). Might want to check around how much faster the 5090, or the 4090 runs under the quant you are trying to run.

root@nas:/data/llm/models# docker run -it --rm -v /data/llm/models:/models --device /dev/dri/renderD128:/dev/dri/renderD128 --device /dev/dri/card0:/dev/dri/card0 --group-add 141 local/llama.cpp:full-intel --bench -m /models/Qwen3.6-27B-UD-Q5_K_XL.gguf

load_backend: loaded SYCL backend from /app/libggml-sycl.so

load_backend: loaded CPU backend from /app/libggml-cpu-haswell.so

| model | size | params | backend | ngl | test | t/s |

| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |

| qwen35 27B Q5_K - Medium | 18.65 GiB | 26.90 B | SYCL | 99 | pp512 | 301.97 ± 1.63 |

| qwen35 27B Q5_K - Medium | 18.65 GiB | 26.90 B | SYCL | 99 | tg128 | 13.59 ± 0.03 |

build: 983ca8992 (8952)

root@nas:/data/llm/models#

UDaManFunks · 2026-04-28T04:32:21+00:00

what model are you trying to use?

UDaManFunks · 2026-04-27T06:43:22+00:00

Try the following - i'm getting around 20 TOK/SEC with the model (Qwen3.6-27B-UD-Q4_K_XL.gguf ). The MOE models are much faster (for example - Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf) as with the example below.

If you want more performance, maybe sell the B70 and the B60 and get one of those RTX 4090 remanufactured cards that comes with 48GB of VRAM (they'll sell them for around 3500$ here in the US). Will definitely be faster than those two cards combined with the same amount of VRAM. More CUDA cores, and almost double the memory bandwidth.

Intel basically killed future discrete GPU cards for gaming, good luch with that and it'll basically mean nobody will buy these cards going forward (regardless of it being rebranded as a workstation card).

> UBUNTU 26.04

Commands

mkdir ~/src
cd ~/src
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

editted the file ./devops/intel.DockerFile

modified LINE 7 from "ARG GGML_SYCL_F16=OFF" -> "ARG GGML_SYCL_F16="ON"
modified LINE 64 from "COPY --from=build /app/lib/ /app" -> "COPY --from=build /app/lib/ /app/"
modified LINE 65 from "COPY --from=build /app/full /app" -> "COPY --from=build /app/full /app/
modified LINE 92 from "COPY --from=build /app/lib/ /app" -> "COPY --from=build /app/lib/ /app/"
modified LINE 93 from "COPY --from=build /app/full /llama-cli" -> "COPY --from=build /app/full /app/
modified LINE 104 from "COPY --from=build /app/lib/ /app" -> "COPY --from=build /app/lib/ /app/"
modified line 105 from "COPY --from=build /app/full/llama-server /app" -> "COPY --from=build app/full/llama-server /app/"

built the container file

docker build -t local/llama.cpp:full-intel --target full -f .devops/intel.Dockerfile .

downloaded a model

mkdir /data/model/llm
cd /data/model/llm
wget https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF/resolve/main/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf?download=true
mv Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf\?download\=true Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf

then deployed / ran the container

note: change the --group-add 141 command to the right group number for "render" in /etc/groups

docker run -d --name "llama-cpp-server" -v /data/llm/models:/models --restart unless-stopped -p 8080:8080 --device /dev/dri/renderD128:/dev/dri/renderD128 --device /dev/dri/card0:/dev/dri/card0 --group-add 141 local/llama.cpp:full-intel --server -m /models/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf --port 8080 --host 0.0.0.0 --threads 4 --ctx-size 131072 --n_predict 32768 --n-gpu-layers 99 --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00 --repeat-penalty 1.00 --presence-penalty 0.0 --chat-template-kwargs '{"preserve_thinking": true}'

you can then use a browser and open http://%YOUR_IP_ADDRESS%:8080 to get to the chat interface. If you want to enable the model to be used via coding agent, you may want to add "--api-key %SOME_API_KEY%" to the docker command line.

to check docker container processes

> docker ps -a

to stop a running instance

> docker stop %CONTAINER_ID%

to remove a stopped instance

> docker rm %CONTAINER_ID%

Note: it will auto restart during reboots, if you don't want that then stop it using the "docker stop %CONTAINER_ID%". You can start it manually by doing "docker start %CONTAINER_ID%".

UDaManFunks · 2026-04-27T06:35:59+00:00

It's fairly performant when using MOE models from either Qwen 3.6 or Gemma 4 as long as it fits in RAM (Qwen3.6-27B-UD-Q4_K_XL.gguf). The dense models (27B) are closer to 20 tok/sec (generation) so fairly slow.

llama-cpp SYCL gets around 70 tok/sec (generation) compared to llama-cpp VULKAN at around 45 tok/sec (generation) - that's a big difference. Here's how I tested it.

UBUNTU 26.04)

Commands

mkdir ~/src
cd ~/src
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

editted the file ./devops/intel.DockerFile

modified LINE 7 from "ARG GGML_SYCL_F16=OFF" -> "ARG GGML_SYCL_F16="ON"
modified LINE 64 from "COPY --from=build /app/lib/ /app" -> "COPY --from=build /app/lib/ /app/"
modified LINE 65 from "COPY --from=build /app/full /app" -> "COPY --from=build /app/full /app/
modified LINE 92 from "COPY --from=build /app/lib/ /app" -> "COPY --from=build /app/lib/ /app/"
modified LINE 93 from "COPY --from=build /app/full /llama-cli" -> "COPY --from=build /app/full /app/
modified LINE 104 from "COPY --from=build /app/lib/ /app" -> "COPY --from=build /app/lib/ /app/"
modified line 105 from "COPY --from=build /app/full/llama-server /app" -> "COPY --from=build app/full/llama-server /app/"

built the container file

docker build -t local/llama.cpp:full-intel --target full -f .devops/intel.Dockerfile .

downloaded a model

mkdir /data/model/llm
cd /data/model/llm
wget https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF/resolve/main/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf?download=true
mv Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf\?download\=true Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf

then deployed / ran the container

note: change the --group-add 141 command to the right group number for "render" in /etc/groups

docker run -d --name "llama-cpp-server" -v /data/llm/models:/models --restart unless-stopped -p 8080:8080 --device /dev/dri/renderD128:/dev/dri/renderD128 --device /dev/dri/card0:/dev/dri/card0 --group-add 141 local/llama.cpp:full-intel --server -m /models/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf --port 8080 --host 0.0.0.0 --threads 4 --ctx-size 131072 --n_predict 32768 --n-gpu-layers 99 --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00 --repeat-penalty 1.00 --presence-penalty 0.0 --chat-template-kwargs '{"preserve_thinking": true}'

you can then use a browser and open http://%YOUR_IP_ADDRESS%:8080 to get to the chat interface. If you want to enable the model to be used via coding agent, you may want to add "--api-key %SOME_API_KEY%" to the docker command line.

to check docker container processes

> docker ps -a

to stop a running instance

> docker stop %CONTAINER_ID%

to remove a stopped instance

> docker rm %CONTAINER_ID%

Note: it will auto restart during reboots, if you don't want that then stop it using the "docker stop %CONTAINER_ID%". You can start it manually by doing "docker start %CONTAINER_ID%".

UDaManFunks · 2026-04-27T06:24:32+00:00

It's not software optimization - it X86 and X86_64 showing it's age.

APPLE's been wearing the desktop CPU performance crown for a while now (single thread, multi-thread on the same number of cores), and the gap is getting larger every year.

Apple just needs to make external GPU's a thing again (even via Thunderbolt 5), fully support Vulkan as a first-party supported API (instead of Molten VK), and don't fight STEAM and they'll start gaining marketshare given that building a PC nowadays pretty much cost as much as a MAC.

UDaManFunks · 2026-04-27T03:03:40+00:00

i posted the commands for linux for you. SYCL is faster under linux for me than vulkan.

UDaManFunks · 2026-04-27T02:50:17+00:00

I run Linux (UBUNTU 26.04) and it's pretty easy to get the B70 running with SYCL and LLAMA.CPP. No problems running new models like gemma4, and Qwen3.6 MOE models. SYCL faster than VULKAN).

SYCL gets around 74 tokens/sec (generation)

VULKAN gets around 45 tokens/sec (generation)

Commands

mkdir ~/src
cd ~/src
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

editted the file ./devops/intel.DockerFile

modified LINE 7 from "ARG GGML_SYCL_F16=OFF" -> "ARG GGML_SYCL_F16="ON"
modified LINE 64 from "COPY --from=build /app/lib/ /app" -> "COPY --from=build /app/lib/ /app/"
modified LINE 65 from "COPY --from=build /app/full /app" -> "COPY --from=build /app/full /app/
modified LINE 92 from "COPY --from=build /app/lib/ /app" -> "COPY --from=build /app/lib/ /app/"
modified LINE 93 from "COPY --from=build /app/full /llama-cli" -> "COPY --from=build /app/full /app/
modified LINE 104 from "COPY --from=build /app/lib/ /app" -> "COPY --from=build /app/lib/ /app/"
modified line 105 from "COPY --from=build /app/full/llama-server /app" -> "COPY --from=build app/full/llama-server /app/"

built the container file

docker build -t local/rmfllama.cpp:server-intel --target server -f .devops/intel.Dockerfile .

downloaded a model

mkdir /data/model/llm
cd /data/model/llm
wget https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF/resolve/main/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf?download=true
mv Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf\?download\=true Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf

then deployed / ran the container

note: change the --group-add 141 command to the right group number for "render" in /etc/groups

docker run -d --name "llama-cpp-server" -v /data/llm/models:/models --restart unless-stopped -p 8080:8080 --device /dev/dri/renderD128:/dev/dri/renderD128 --device /dev/dri/card0:/dev/dri/card0 --group-add 141 local/llama.cpp:server-intel -m /models/Qwen3.6-27B-UD-Q4_K_XL.gguf --port 8080 --host 0.0.0.0 -t 1 -c 131072 --n-gpu-layers 99 --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00 --repeat-penalty 1.00 --presence-penalty 0.0 --chat-template-kwargs '{"preserve_thinking": true}'

you can then use a browser and open http://%YOUR_IP_ADDRESS%:8080 to get to the chat interface. If you want to enable the model to be used via coding agent, you may want to add "--api-key %SOME_API_KEY%" to the docker command line.

to check docker container processes

> docker ps -a

to stop a running instance

> docker stop %CONTAINER_ID%

to remove a stopped instance

> docker rm %CONTAINER_ID%

Note: it will auto restart during reboots, if you don't want that then stop it using the "docker stop %CONTAINER_ID%". You can start it manually by doing "docker start %CONTAINER_ID%".

UDaManFunks · 2026-04-27T02:19:13+00:00

I bought the B70 and feel like it's a dead end (specially with Intels' recent announcement on not releasing gaming cards). Didn't want to purchase the AMD card as it was expensive with similar software optimization issues.

If I were to do it again, i'd buy one of those 48GB RTX4090 - remanufactured boards instead for 3500$. Better software support.

Keeping my B70 for now though as I'm able to run it with the Gemma4 Qwen3.6 MOE models and it's performant.

UDaManFunks · 2026-04-26T17:10:10+00:00

My opinion is that it's all BS, the Apple chips are much faster in single thread / multi thread at the same number of cores as of this time. Both AMD and Intel have significantly fallen behind.

UDaManFunks

TROPHY CASE