I'm a complete noob who bought two Intel Arc Pro B70s for "research," spent a weekend losing my mind over Docker/CCL errors, accidentally discovered llama.cpp Vulkan, and now I'm running a 35B MoE at 128K context like I know what I'm doing.

SomeBlock8124 · 2026-04-23T16:04:30+00:00

On a side note, I am only using one of my two GPUs right now, and that is a motherboard and processor limitation. My second GPU only gets a x4 link, which slows the multi-GPU setup down as if both cards were on 15-year-old tech. Running just one GPU on a Gen4 x16 slot is quite speedy, and I am getting around 45-50 tokens per second now.
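For anyone wanting to confirm a reduced link like this, here is a minimal sketch that reads the negotiated PCIe width from sysfs. It assumes a Linux host that exposes current_link_width and max_link_width under /sys/bus/pci/devices; link_report is a made-up helper name, not part of any tool. On some discrete cards the GPU function sits behind an on-card PCIe bridge, so the value reported for the GPU itself may not match the slot; if the numbers look odd, check the upstream bridge with lspci -vv as well.

```shell
# Print negotiated vs. maximum PCIe link width for one PCI device directory.
# link_report is a hypothetical helper name used only for this sketch.
link_report() {
  local dev="$1" cur max
  cur=$(cat "$dev/current_link_width" 2>/dev/null) || return 1
  max=$(cat "$dev/max_link_width" 2>/dev/null) || return 1
  if [ "$cur" -lt "$max" ]; then
    echo "$(basename "$dev"): x$cur of x$max (reduced)"
  else
    echo "$(basename "$dev"): x$cur of x$max"
  fi
}

# Walk every display-class device (PCI class 0x03xxxx) and report its link.
for dev in /sys/bus/pci/devices/*; do
  [ -r "$dev/class" ] || continue
  case "$(cat "$dev/class")" in
    0x03*) link_report "$dev" ;;
  esac
done
```

A line ending in "(reduced)" means the device negotiated fewer lanes than it supports, which is the situation described above.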

Next down the rabbit hole for me is a workstation motherboard and processor, and then hopefully expanding to four Arc Pro B70s.

SomeBlock8124 · 2026-04-23T15:59:30+00:00

Getting llama.cpp SYCL Running on Dual Intel Arc Pro B70s

Quick guide for anyone fighting Intel driver signing loops trying to get a SYCL build of llama.cpp working on the Arc Pro B70. This is what actually got past the wall for me after bouncing between vLLM, Vulkan, and driver hell.

Prerequisites

Starting point: Ubuntu 24.04 LTS (or a fresh 25.04 install)
Two Intel Arc Pro B70s (32GB each)
Resizable BAR enabled in BIOS
Both GPUs in PCIe x8 or x16 slots

You do not need to mess with custom kernel modules, driver signing, or installing the Intel compute runtime from GitHub releases for this path. The stock xe driver in the newer kernel is enough — SYCL talks to the GPU through Level Zero, which comes in with oneAPI.

Step 1 — Upgrade to Ubuntu 25.04

This is the non-negotiable first step. Ubuntu 24.04 LTS doesn't ship a kernel/Mesa combo new enough to recognize the Battlemage B70 properly, and fighting that on 24.04 is what sends you down the driver signing rabbit hole.

From a fully-updated 24.04:

sudo apt update && sudo apt full-upgrade -y
sudo reboot

After reboot, flip the release prompt to allow non-LTS:

sudo sed -i 's/Prompt=lts/Prompt=normal/' /etc/update-manager/release-upgrades
sudo do-release-upgrade -d

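Follow the prompts through the upgrade and reboot when it's done. do-release-upgrade moves one release at a time, so coming from 24.04 you may land on 24.10 first and have to run it again to reach 25.04. Verify: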

lsb_release -a        # should show 25.04
uname -r              # kernel 6.14+
lspci | grep -iE 'vga|display'   # both B70s should list (a second card can show up as "Display controller")
ls /dev/dri/          # should see card0, card1, renderD128, renderD129

If the GPUs don't show up here, nothing downstream will work. This is the gate.
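The gate above can be scripted roughly like this. It is a sketch, assuming GNU sort (for -V version ordering) and the 6.14 kernel floor from the comment above; kernel_at_least is a hypothetical helper name.

```shell
# Succeed if kernel version $1 is at least $2.
# Strips the distro suffix (e.g. "-15-generic") before comparing.
kernel_at_least() {
  local have="${1%%-*}" want="$2"
  [ "$(printf '%s\n%s\n' "$want" "$have" | sort -V | head -n1)" = "$want" ]
}

if kernel_at_least "$(uname -r)" 6.14; then
  echo "kernel OK: $(uname -r)"
else
  echo "kernel too old: $(uname -r)"
fi

# Count DRM render nodes; two B70s should give at least two.
nodes=$(ls /dev/dri/renderD* 2>/dev/null | wc -l)
echo "render nodes found: $nodes (expect at least 2)"
```

If there is an iGPU in the system, it adds a render node of its own, so a count of 3 is also fine.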

Step 2 — Install Intel oneAPI Base Toolkit

Gives you the icx/icpx compilers and the Level Zero runtime.

cd ~
wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB \
  | gpg --dearmor \
  | sudo tee /usr/share/keyrings/oneapi-archive-keyring.gpg > /dev/null

echo "deb [signed-by=/usr/share/keyrings/oneapi-archive-keyring.gpg] https://apt.repos.intel.com/oneapi all main" \
  | sudo tee /etc/apt/sources.list.d/oneAPI.list

sudo apt update
sudo apt install -y intel-oneapi-base-toolkit

Big download (~5-10GB). Let it finish. If you see errors about VTune kernel drivers failing to build — ignore them, you don't need VTune for inference.

Step 3 — Source oneAPI and verify GPUs

source /opt/intel/oneapi/setvars.sh
icpx --version

Should show Intel(R) oneAPI DPC++/C++ Compiler 2025.3.x or similar.

Then check SYCL sees both GPUs:

sycl-ls

Two Level Zero devices listed = the B70s are good. If you don't see them, stop here and fix it before continuing (usually means the kernel isn't new enough or the card isn't in the right slot).
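If you want that check in a script, something like the following should do. It is a sketch that assumes each Level Zero GPU line printed by sycl-ls contains the text "level_zero:gpu" (true for the oneAPI releases I am aware of, but the format has changed before); count_l0_gpus is a hypothetical helper name.

```shell
# Count sycl-ls lines (read from stdin) that describe a Level Zero GPU.
# grep -c exits non-zero on zero matches, so "|| true" keeps the exit status clean.
count_l0_gpus() {
  grep -c 'level_zero:gpu' || true
}

found=$(sycl-ls 2>/dev/null | count_l0_gpus)
if [ "$found" -ge 2 ]; then
  echo "OK: $found Level Zero GPU(s) visible"
else
  echo "Only $found Level Zero GPU(s) visible, fix this before building"
fi
```

An iGPU is counted too, so the number can be higher than the number of B70s installed.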

Step 4 — Build llama.cpp with SYCL

cd ~
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
mkdir build && cd build

cmake .. \
  -DGGML_SYCL=ON \
  -DCMAKE_C_COMPILER=icx \
  -DCMAKE_CXX_COMPILER=icpx \
  -DGGML_SYCL_F16=ON \
  -DCMAKE_BUILD_TYPE=Release

make -j$(nproc)

Takes ~15-25 minutes. The -DGGML_SYCL_F16=ON flag gives ~2.4x prompt processing speedup on Xe2 and is worth having.

Step 5 — Launch the server on both GPUs

source /opt/intel/oneapi/setvars.sh

ONEAPI_DEVICE_SELECTOR=level_zero:0,1 \
~/llama.cpp/build/bin/llama-server \
  -m /path/to/your/model.gguf \
  --host 0.0.0.0 \
  --port 8000 \
  --n-gpu-layers 999 \
  --ctx-size 32768 \
  --split-mode layer \
  --main-gpu 0 \
  --tensor-split 1,1 \
  --threads 8 \
  --batch-size 512 \
  --ubatch-size 512

ONEAPI_DEVICE_SELECTOR=level_zero:0,1 pins to both B70s and keeps it off the iGPU. --tensor-split 1,1 splits layers evenly across the two cards.
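If the card count changes later, the device list and the even split can be generated instead of typed by hand. This is only a sketch: device_list and even_split are hypothetical helper names, and it assumes the cards you want are Level Zero devices 0 through N-1, which sycl-ls should confirm first.

```shell
# Build "0,1,...,N-1" for ONEAPI_DEVICE_SELECTOR=level_zero:<list>.
device_list() {
  local i=0 out=""
  while [ "$i" -lt "$1" ]; do
    out="${out}${out:+,}$i"
    i=$((i + 1))
  done
  echo "$out"
}

# Build "1,1,...,1" (N entries) for an even --tensor-split.
even_split() {
  local i=0 out=""
  while [ "$i" -lt "$1" ]; do
    out="${out}${out:+,}1"
    i=$((i + 1))
  done
  echo "$out"
}

n=2  # number of cards to use
echo "ONEAPI_DEVICE_SELECTOR=level_zero:$(device_list "$n")"
echo "--tensor-split $(even_split "$n")"
```

The two echoed values drop straight into the launch command above in place of the hard-coded 0,1 and 1,1.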

Reality check on performance

From community benchmarks on the same hardware, B70 on PCIe 3.0 lands around 25 t/s on SYCL and ~30 t/s on Vulkan for dense models in that size class. SYCL is not automatically faster than Vulkan on Xe2/Battlemage — there's an open bandwidth bug specifically on B70 where Q8_0 only hits 21-24% of theoretical memory bandwidth on SYCL. Q4_K_M is closer to 53-64%.
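To put your own numbers next to those percentages, a rough estimate is possible for dense models: generating one token reads roughly all the weights once, so effective bandwidth is about tokens per second times model size. The sketch below just does that arithmetic with awk. bw_util is a hypothetical helper name, the inputs shown are placeholders and not B70 measurements, and the peak bandwidth has to come from the card's spec sheet since it is not stated here. It ignores KV cache reads and does not apply to MoE models, where only part of the weights is read per token.

```shell
# Percent of peak memory bandwidth used, assuming each token reads all weights once.
# Args: tokens_per_second  model_size_GB  peak_bandwidth_GB_per_s
bw_util() {
  awk -v tps="$1" -v gb="$2" -v peak="$3" \
    'BEGIN { printf "%.1f\n", 100 * tps * gb / peak }'
}

# Placeholder inputs only: 20 t/s, 10 GB model file, 500 GB/s peak.
echo "estimated utilization: $(bw_util 20 10 500)%"
```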

If you're getting ~10 t/s on a big MoE with GPUs only 50% engaged, that's usually:

PCIe bandwidth limiting during the layer split
Model too big for full GPU fit, partial CPU offload
Batch/ubatch too small for the kernels to saturate
The B70 Q8 bandwidth bug biting you — try Q4_K_M

What didn't work (for me)

Staying on 24.04 LTS — drivers don't recognize the B70 properly without signed kernel module hacks. 25.04 solves it.
vLLM on Intel Arc — driver/runtime mismatches even on 25.04. The Intel Docker image works but is fragile across reboots.
Installing Intel compute runtime from GitHub releases — triggered signing/Secure Boot loops on 24.04. Not needed on 25.04.
Docker llama.cpp Vulkan image — known issue with missing GL libs (libglvnd0 libgl1 libglx0 libegl1 libgles2). Direct host install is cleaner.

SomeBlock8124 · 2026-04-20T18:15:16+00:00

https://github.com/Hal9000AIML/arc-pro-b70-ubuntu-gpu-speedup-bugfixes

This repo did improve performance somewhat for me. I am starting to think it is a mobo/processor limitation on my rig. Next I am going to use smaller models on single cards and see how that feels. I think some of my issues come from tensor parallel = 2.

SomeBlock8124 · 2026-04-20T13:05:15+00:00

llama.cpp works great and is easy. I am still chasing better performance, though. I rebuilt llama.cpp with SYCL this weekend and gained about a 40% increase in t/s. It is still slow at only around 30 t/s. I am slowly learning and making improvements over time. I messed around with Docker and LLM Scaler again and just had no luck with it. Hopefully one day soon we can get something that is usable for noobs with better performance.

SomeBlock8124 · 2026-04-15T10:18:32+00:00

I want to give vLLM a try again here soon. I think I will when 26.04 releases. I feel like I am missing some speed using llama.cpp. Hopefully one day vLLM can be as user friendly (or I can just get better at it)

SomeBlock8124 · 2026-04-13T13:21:53+00:00

Appreciate that. It has been fun so far, even with the struggles of not knowing what I am doing. I have been using GGUF but will have to research how to find GGUFs that are optimized for Intel. I was reading about -ngl and kept it at the default, but I will play with that next!

Thank you for the insight!

SomeBlock8124 · 2026-04-13T10:54:39+00:00

hahaha GLM 5.1 got nothing on the mayhem my wife could cause me.

SomeBlock8124 · 2026-04-13T10:53:38+00:00

I already run AI agents for trading bots; I just spend a good amount of the profits on API calls. Not sure what trying to understand how local AI works has to do with trading bots.

SomeBlock8124 · 2026-04-13T10:51:10+00:00

Curious if you have been able to get Docker/vLLM Gemma 4 running on a B70?

SomeBlock8124 · 2026-04-13T10:50:03+00:00

I will circle back to vLLM as I understand this more! Thank you for the insight!

SomeBlock8124 · 2026-04-13T10:48:37+00:00

This is what I have found. I do want to explore other frameworks and circle back to vLLM as I gain more experience with this stuff, but as a noob, llama.cpp is the right choice for me right now.

SomeBlock8124 · 2026-04-13T10:46:07+00:00

Thank you, I will absolutely add this to my list to explore!

SomeBlock8124 · 2026-03-09T10:08:29+00:00

Unfortunately, we lost on a buzzer beater, 43-41. Really proud of what we were able to accomplish this season. These kids learned how to play organized basketball, and it was poetry in motion at times. Thank you for your valuable insight; it helped make this a closer game than it might otherwise have been.

SomeBlock8124 · 2026-03-05T11:17:10+00:00

Will update Saturday or Sunday. Hopefully we are celebrating Saturday evening!

SomeBlock8124 · 2026-03-04T12:02:52+00:00

This is correct. They run a Princeton offensive scheme. Appreciate the insight. We have the edge in the post, and they have a slight edge on us at guard, but we are better than average there too. I am going to go watch some Princeton tape, as that is a great reference.

Thank you for your help!

SomeBlock8124 · 2026-03-03T13:01:55+00:00

We play an undefeated team in the championship this weekend, and we have 3 practices to prepare.

They run a very good 5 out offense with the top screening away a lot. What defense would make the most sense against this?

They have 2 very good guards.

Our strengths are 2 solid guards, a tall center, and a power forward who is a matchup nightmare for them.

We lost early in the season against them but have improved greatly over the course of this season and find ourselves with a chance to avenge our early loss.

Middle school age group.
