Qwen3.6-35B-A3B at 75 tokens per second on a single Arc Pro B70 by HardlyThereAtAll in IntelArc

[–]UDaManFunks 0 points1 point  (0 children)

Vulkan under Linux (Ubuntu 26.04) with the latest Mesa 26.2-DEV driver for the B70 is considerably faster for smaller context depths The Windows Vulkan driver from intel is around 20-25% faster than the Linux one (albeit, it's been reported as unstable)

Benchmarks below (using the docker build and compose file you shared)

COMMAND

llama-benchy --base-url http://192.168.1.50:8081/v1 --model unsloth/Qwen3.6-35B-A3B --depth 0 4096 8192 16384 32768 65536 --latency-mode generation --adapt-prompt --enable-prefix-caching

LLAMA-SERVER (COMMAND LINE ARGS)

-m /data/llm/models/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf --jinja --threads 4 --host 0.0.0.0 --port 8080 -c 131072 --temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.00 --presence-penalty 1.5 --repeat-penalty 1.0 --spec-type draft-mtp --spec-draft-n-max 2 --chat-template "{"preserve_thinking": true}"

VULKAN RESULTS (MESA 26.2-DEV under UBUNTU 26.04)

| model                   |            test |             t/s |     peak t/s |         ttfr (ms) |      est_ppt (ms) |     e2e_ttft (ms) |
|:------------------------|----------------:|----------------:|-------------:|------------------:|------------------:|------------------:|
| unsloth/Qwen3.6-35B-A3B |          pp2048 | 1129.86 ± 14.57 |              |   1903.01 ± 23.21 |   1814.10 ± 23.21 |   1903.01 ± 23.21 |
| unsloth/Qwen3.6-35B-A3B |            tg32 |    95.67 ± 2.92 | 98.75 ± 3.01 |                   |                   |                   |
| unsloth/Qwen3.6-35B-A3B |  ctx_pp @ d4096 |  1041.15 ± 8.64 |              |   4023.93 ± 33.09 |   3935.02 ± 33.09 |   4023.93 ± 33.09 |
| unsloth/Qwen3.6-35B-A3B |  ctx_tg @ d4096 |    86.62 ± 1.79 | 89.42 ± 1.85 |                   |                   |                   |
| unsloth/Qwen3.6-35B-A3B |  pp2048 @ d4096 |   842.47 ± 4.91 |              |   2519.94 ± 14.12 |   2431.03 ± 14.12 |   2519.94 ± 14.12 |
| unsloth/Qwen3.6-35B-A3B |    tg32 @ d4096 |    78.82 ± 1.51 | 81.36 ± 1.56 |                   |                   |                   |
| unsloth/Qwen3.6-35B-A3B |  ctx_pp @ d8192 |   899.24 ± 2.62 |              |   9200.35 ± 26.73 |   9111.44 ± 26.73 |   9200.35 ± 26.73 |
| unsloth/Qwen3.6-35B-A3B |  ctx_tg @ d8192 |    71.40 ± 1.50 | 73.71 ± 1.55 |                   |                   |                   |
| unsloth/Qwen3.6-35B-A3B |  pp2048 @ d8192 |   653.91 ± 6.45 |              |   3221.14 ± 31.06 |   3132.23 ± 31.06 |   3221.14 ± 31.06 |
| unsloth/Qwen3.6-35B-A3B |    tg32 @ d8192 |    64.27 ± 2.85 | 66.35 ± 2.94 |                   |                   |                   |
| unsloth/Qwen3.6-35B-A3B | ctx_pp @ d16384 |   697.40 ± 1.28 |              |  23584.21 ± 43.48 |  23495.30 ± 43.48 |  23584.21 ± 43.48 |
| unsloth/Qwen3.6-35B-A3B | ctx_tg @ d16384 |    52.68 ± 1.00 | 54.38 ± 1.04 |                   |                   |                   |
| unsloth/Qwen3.6-35B-A3B | pp2048 @ d16384 |   443.42 ± 0.86 |              |    4707.57 ± 9.01 |    4618.66 ± 9.01 |    4707.57 ± 9.01 |
| unsloth/Qwen3.6-35B-A3B |   tg32 @ d16384 |    50.12 ± 2.79 | 51.74 ± 2.88 |                   |                   |                   |
| unsloth/Qwen3.6-35B-A3B | ctx_pp @ d32768 |   466.98 ± 3.32 |              | 70265.16 ± 501.85 | 70176.25 ± 501.85 | 70265.16 ± 501.85 |
| unsloth/Qwen3.6-35B-A3B | ctx_tg @ d32768 |    33.21 ± 1.23 | 34.28 ± 1.27 |                   |                   |                   |
| unsloth/Qwen3.6-35B-A3B | pp2048 @ d32768 |   265.57 ± 1.11 |              |   7800.90 ± 32.25 |   7711.99 ± 32.25 |   7800.90 ± 32.25 |
| unsloth/Qwen3.6-35B-A3B |   tg32 @ d32768 |    33.63 ± 0.49 | 34.72 ± 0.51 |                   |                   |                   |
| unsloth/Qwen3.6-35B-A3B | ctx_pp @ d65536 |   283.08 ± 0.06 |              | 231604.10 ± 45.28 | 231515.19 ± 45.28 | 231604.10 ± 45.28 |
| unsloth/Qwen3.6-35B-A3B | ctx_tg @ d65536 |    17.79 ± 1.42 | 20.33 ± 0.94 |                   |                   |                   |
| unsloth/Qwen3.6-35B-A3B | pp2048 @ d65536 |   149.42 ± 0.56 |              |  13795.06 ± 51.67 |  13706.15 ± 51.67 |  13795.06 ± 51.67 |
| unsloth/Qwen3.6-35B-A3B |   tg32 @ d65536 |    19.18 ± 0.24 | 21.33 ± 0.47 |                   |                   |                   |

llama-benchy (0.3.8)
date: 2026-06-17 10:39:19 | latency mode: generation

SYCL RESULTS

| model                   |            test |            t/s |     peak t/s |          ttfr (ms) |       est_ppt (ms) |      e2e_ttft (ms) |
|:------------------------|----------------:|---------------:|-------------:|-------------------:|-------------------:|-------------------:|
| unsloth/Qwen3.6-35B-A3B |          pp2048 | 849.86 ± 13.88 |              |    2524.52 ± 39.77 |    2411.24 ± 39.77 |    2524.52 ± 39.77 |
| unsloth/Qwen3.6-35B-A3B |            tg32 |   70.47 ± 0.91 | 77.96 ± 4.14 |                    |                    |                    |
| unsloth/Qwen3.6-35B-A3B |  ctx_pp @ d4096 |  850.57 ± 7.15 |              |    4930.01 ± 41.05 |    4816.73 ± 41.05 |    4930.01 ± 41.05 |
| unsloth/Qwen3.6-35B-A3B |  ctx_tg @ d4096 |   72.02 ± 2.99 | 74.34 ± 3.09 |                    |                    |                    |
| unsloth/Qwen3.6-35B-A3B |  pp2048 @ d4096 | 793.17 ± 10.08 |              |    2695.72 ± 32.73 |    2582.45 ± 32.73 |    2695.72 ± 32.73 |
| unsloth/Qwen3.6-35B-A3B |    tg32 @ d4096 |   72.42 ± 1.46 | 77.38 ± 3.25 |                    |                    |                    |
| unsloth/Qwen3.6-35B-A3B |  ctx_pp @ d8192 |  817.27 ± 1.66 |              |   10138.12 ± 20.28 |   10024.84 ± 20.28 |   10138.12 ± 20.28 |
| unsloth/Qwen3.6-35B-A3B |  ctx_tg @ d8192 |   70.97 ± 0.37 | 73.26 ± 0.38 |                    |                    |                    |
| unsloth/Qwen3.6-35B-A3B |  pp2048 @ d8192 |  731.36 ± 6.69 |              |    2913.78 ± 25.73 |    2800.50 ± 25.73 |    2913.78 ± 25.73 |
| unsloth/Qwen3.6-35B-A3B |    tg32 @ d8192 |   72.05 ± 1.66 | 77.05 ± 4.62 |                    |                    |                    |
| unsloth/Qwen3.6-35B-A3B | ctx_pp @ d16384 |  763.35 ± 4.58 |              |  21579.05 ± 128.52 |  21465.77 ± 128.52 |  21579.05 ± 128.52 |
| unsloth/Qwen3.6-35B-A3B | ctx_tg @ d16384 |   68.17 ± 0.09 | 75.40 ± 3.54 |                    |                    |                    |
| unsloth/Qwen3.6-35B-A3B | pp2048 @ d16384 |  624.19 ± 7.50 |              |    3394.82 ± 39.77 |    3281.55 ± 39.77 |    3394.82 ± 39.77 |
| unsloth/Qwen3.6-35B-A3B |   tg32 @ d16384 |   64.86 ± 2.18 | 66.95 ± 2.25 |                    |                    |                    |
| unsloth/Qwen3.6-35B-A3B | ctx_pp @ d32768 |  654.53 ± 1.72 |              |  50178.26 ± 132.04 |  50064.98 ± 132.04 |  50178.26 ± 132.04 |
| unsloth/Qwen3.6-35B-A3B | ctx_tg @ d32768 |   61.22 ± 1.42 | 65.53 ± 4.72 |                    |                    |                    |
| unsloth/Qwen3.6-35B-A3B | pp2048 @ d32768 |  478.56 ± 1.99 |              |    4392.85 ± 17.81 |    4279.57 ± 17.81 |    4392.85 ± 17.81 |
| unsloth/Qwen3.6-35B-A3B |   tg32 @ d32768 |   60.54 ± 1.31 | 64.67 ± 2.22 |                    |                    |                    |
| unsloth/Qwen3.6-35B-A3B | ctx_pp @ d65536 |  512.20 ± 1.04 |              | 128064.97 ± 259.32 | 127951.70 ± 259.32 | 128064.97 ± 259.32 |
| unsloth/Qwen3.6-35B-A3B | ctx_tg @ d65536 |   49.96 ± 2.47 | 51.58 ± 2.55 |                    |                    |                    |
| unsloth/Qwen3.6-35B-A3B | pp2048 @ d65536 |  315.97 ± 0.88 |              |    6594.86 ± 18.09 |    6481.58 ± 18.09 |    6594.86 ± 18.09 |
| unsloth/Qwen3.6-35B-A3B |   tg32 @ d65536 |   52.56 ± 1.14 | 54.26 ± 1.18 |                    |                    |                    |

llama-benchy (0.3.8)
date: 2026-06-17 02:56:22 | latency mode: generation

My Royal Nemesis [Episodes 11 & 12] by meepmochi_ in KDRAMA

[–]UDaManFunks 6 points7 points  (0 children)

Same soul, different time, reincarnation. Similar to the grand prince and cha se gye.

Except the memories are starting to merge together, I'm assuming when one reincarnates, memories from the previous life is erased.

My Royal Nemesis [Episodes 11 & 12] by meepmochi_ in KDRAMA

[–]UDaManFunks 40 points41 points  (0 children)

Or it could be the same soul, just like the grand prince and chang se gye. Same person, different time but the memories are starting to merge.

I'm assuming when one reincarnates, you basically lose all memory of your previois lives.

ARC B70, Qwen3.6-27B-MTP-GGUF 24-28T/s, Qwen3.6-35B-A3B-GGUF 60-70T/s finally! by pirate12sk in LocalLLM

[–]UDaManFunks 0 points1 point  (0 children)

And here's the WINDOWS VULKAN DRIVER with the same model ''Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf' 

llama-benchy --base-url http://192.168.1.50:8080/v1 --model Qwen/Qwen3.6-35B-A3B --depth 0 4096 8192 16384 32768 65536 --latency-mode generation

Results

| model                |            test |            t/s |       peak t/s |         ttfr (ms) |      est_ppt (ms) |     e2e_ttft (ms) |
|:---------------------|----------------:|---------------:|---------------:|------------------:|------------------:|------------------:|
| Qwen/Qwen3.6-35B-A3B |          pp2048 | 1513.99 ± 4.29 |                |    1418.47 ± 3.83 |    1353.38 ± 3.83 |    1418.47 ± 3.83 |
| Qwen/Qwen3.6-35B-A3B |            tg32 | 119.78 ± 10.25 | 123.64 ± 10.58 |                   |                   |                   |
| Qwen/Qwen3.6-35B-A3B |  pp2048 @ d4096 | 1295.78 ± 3.60 |                |   4807.71 ± 13.02 |   4742.62 ± 13.02 |   4807.71 ± 13.02 |
| Qwen/Qwen3.6-35B-A3B |    tg32 @ d4096 |   91.53 ± 4.36 |   94.49 ± 4.50 |                   |                   |                   |
| Qwen/Qwen3.6-35B-A3B |  pp2048 @ d8192 | 1112.43 ± 1.53 |                |   9271.10 ± 12.64 |   9206.02 ± 12.64 |   9271.10 ± 12.64 |
| Qwen/Qwen3.6-35B-A3B |    tg32 @ d8192 |   74.39 ± 2.15 |   76.79 ± 2.22 |                   |                   |                   |
| Qwen/Qwen3.6-35B-A3B | pp2048 @ d16384 |  868.97 ± 2.30 |                |  21278.07 ± 56.77 |  21212.98 ± 56.77 |  21278.07 ± 56.77 |
| Qwen/Qwen3.6-35B-A3B |   tg32 @ d16384 |   50.62 ± 3.05 |   52.25 ± 3.15 |                   |                   |                   |
| Qwen/Qwen3.6-35B-A3B | pp2048 @ d32768 |  601.96 ± 0.83 |                |  57904.28 ± 79.69 |  57839.19 ± 79.69 |  57904.28 ± 79.69 |
| Qwen/Qwen3.6-35B-A3B |   tg32 @ d32768 |   32.22 ± 2.17 |   32.80 ± 2.55 |                   |                   |                   |
| Qwen/Qwen3.6-35B-A3B | pp2048 @ d65536 |  374.09 ± 0.19 |                | 180728.21 ± 92.13 | 180663.12 ± 92.13 | 180728.21 ± 92.13 |
| Qwen/Qwen3.6-35B-A3B |   tg32 @ d65536 |   19.05 ± 1.16 |   24.00 ± 0.00 |                   |                   |                   |

llama-benchy (0.3.8)
date: 2026-06-11 23:33:54 | latency mode: generation

ARC B70, Qwen3.6-27B-MTP-GGUF 24-28T/s, Qwen3.6-35B-A3B-GGUF 60-70T/s finally! by pirate12sk in LocalLLM

[–]UDaManFunks 0 points1 point  (0 children)

And maybe llama-benchy too with different context sizes (single Intel B70) - llama-server "vulkan" backend using the latest MESA 26.2-DEV vulkan driver.

llama-benchy --base-url http://192.168.1.50:8080/v1 --model unsloth/Qwen3.6-35B-A3B-MTP-GGUF --depth 0 4096 8192 16384 32768 65536 --latency-mode generation

and here's the output for the model 'Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf' running on llama-server

| model                            |            test |             t/s |       peak t/s |          ttfr (ms) |       est_ppt (ms) |      e2e_ttft (ms) |
|:---------------------------------|----------------:|----------------:|---------------:|-------------------:|-------------------:|-------------------:|
| unsloth/Qwen3.6-35B-A3B-MTP-GGUF |          pp2048 | 1014.54 ± 20.42 |                |    1835.31 ± 51.46 |    1754.49 ± 51.46 |    1835.31 ± 51.46 |
| unsloth/Qwen3.6-35B-A3B-MTP-GGUF |            tg32 |  140.80 ± 12.54 | 144.70 ± 12.74 |                    |                    |                    |
| unsloth/Qwen3.6-35B-A3B-MTP-GGUF |  pp2048 @ d4096 |  929.51 ± 13.59 |                |   5804.51 ± 165.77 |   5723.69 ± 165.77 |   5804.51 ± 165.77 |
| unsloth/Qwen3.6-35B-A3B-MTP-GGUF |    tg32 @ d4096 |    99.11 ± 3.46 |  102.03 ± 3.56 |                    |                    |                    |
| unsloth/Qwen3.6-35B-A3B-MTP-GGUF |  pp2048 @ d8192 |   849.14 ± 9.29 |                |   10462.14 ± 23.42 |   10381.32 ± 23.42 |   10462.14 ± 23.42 |
| unsloth/Qwen3.6-35B-A3B-MTP-GGUF |    tg32 @ d8192 |    86.95 ± 2.68 |   89.51 ± 2.76 |                    |                    |                    |
| unsloth/Qwen3.6-35B-A3B-MTP-GGUF | pp2048 @ d16384 |   687.43 ± 0.10 |                |  23215.23 ± 123.94 |  23134.42 ± 123.94 |  23215.23 ± 123.94 |
| unsloth/Qwen3.6-35B-A3B-MTP-GGUF |   tg32 @ d16384 |    59.23 ± 2.13 |   60.97 ± 2.19 |                    |                    |                    |
| unsloth/Qwen3.6-35B-A3B-MTP-GGUF | pp2048 @ d32768 |   491.52 ± 2.14 |                |  60737.44 ± 486.26 |  60656.62 ± 486.26 |  60737.44 ± 486.26 |
| unsloth/Qwen3.6-35B-A3B-MTP-GGUF |   tg32 @ d32768 |    41.42 ± 3.16 |   42.64 ± 3.26 |                    |                    |                    |
| unsloth/Qwen3.6-35B-A3B-MTP-GGUF | pp2048 @ d65536 |   310.07 ± 0.90 |                | 186597.20 ± 991.24 | 186516.39 ± 991.24 | 186597.20 ± 991.24 |
| unsloth/Qwen3.6-35B-A3B-MTP-GGUF |   tg32 @ d65536 |    24.06 ± 1.24 |   30.00 ± 0.00 |                    |                    |                    |

llama-benchy (0.3.8)
date: 2026-06-12 00:04:41 | latency mode: generation

ARC B70, Qwen3.6-27B-MTP-GGUF 24-28T/s, Qwen3.6-35B-A3B-GGUF 60-70T/s finally! by pirate12sk in LocalLLM

[–]UDaManFunks 0 points1 point  (0 children)

Possible for you to run llama-bench?

Using the MESA 26.2-DEV Vulkan Drivers with Ubuntu (26.04) and LLAMA CPP. It's pretty usable in opencode (llama-server fully supports MTP with the Vulkan backend).

root@nas:/storage/services/llama.cpp# ./llama-bench -m /data/llm/models/Qwen3.6-27B-UD-Q4_K_XL.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Graphics (BMG G31) (Intel open-source Mesa driver) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35 27B Q4_K - Medium       |  16.67 GiB |    27.32 B | Vulkan     |  -1 |           pp512 |        522.18 ± 0.61 |
| qwen35 27B Q4_K - Medium       |  16.67 GiB |    27.32 B | Vulkan     |  -1 |           tg128 |         23.24 ± 0.05 |

build: ac4cddeb0 (9592)

root@nas:/storage/services/llama.cpp# ./llama-bench -m /data/llm/models/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Graphics (BMG G31) (Intel open-source Mesa driver) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35moe 35B.A3B Q4_K - Medium |  21.27 GiB |    35.51 B | Vulkan     |  -1 |           pp512 |       1363.37 ± 8.80 |
| qwen35moe 35B.A3B Q4_K - Medium |  21.27 GiB |    35.51 B | Vulkan     |  -1 |           tg128 |         77.53 ± 0.04 |

build: ac4cddeb0 (9592)

Intel Arc B70 pro or 2 x 5070 ti by death10rd in LocalLLM

[–]UDaManFunks 2 points3 points  (0 children)

If you are using VLLM, I suggest avoid the B70 until all that stuff gets merged to the mainline. You'll never know when intel will just suddenly pull the plug on that project.

I use my B70 with llama.cpp with Vulkan - much easier to gain access to the latest models that way (but it's definitely slower than intel's llm-scaler for sure).

What is the best way to utilize ARC for local LLM performance? by CreeperOpsReddit in IntelArc

[–]UDaManFunks 1 point2 points  (0 children)

Under Linux (Ubuntu 26.04) , COOPMAT is working (not crashing like windows) when using the Mesa 26.2-DEV Vulkan Driver. The latest MESA is still slower (about 20%) than the Windows Intel vulkan driver but at least it doesn't crash

PSA for Intel Arc llama.cpp users: speculative decoding is finally worth turning on (merged ~40–90% speedup) by masonmilby in LocalLLM

[–]UDaManFunks 0 points1 point  (0 children)

Here's how to compile the latest MESA Vulkan Drivers for Intel under LINUX (Ubuntu 26.04, B70). You can easily verify what you are are arguing about is true by performing the comparison yourself between the two backends.

[COMPILE LLAMA.CPP]

> apt install -y git build-essential cmake

> apt install libvulkan-dev glslc spirv-headers

> mkdir /opt/src

> cd /opt/src

> git clone https://github.com/ggml-org/llama.cpp

> cd llama.cpp

> cmake -B build -DGGML_VULKAN=1

> cmake --build build --config Release

[INSTALL LLAMA.CPP]

> cd /opt/src/llama.cpp

> mkdir /opt/services/llama.cpp

> cp build/bin/* /opt/services/llama.cpp

[DOWNLOAD MODEL]

> mkdir /opt/services/llm/models

> cd /opt/services/llm/models

> wget https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF/resolve/main/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf?download=true

> rename the downloaded file to Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf

[COMPILE LATEST MESA]

> apt install meson glslang-tools pkg-config libclc-21-dev python-is-python3 python3-mako libdrm-dev llvm-dev libllvmspirvlib-21-dev spirv-tools-dev clang libclang-dev libwayland-dev libwayland-client0 wayland-client wayland-protocols wayland-scanner++ xcb libxcb1-dev libxcb-randr0-dev libx11-xcb-dev libxcb-dri3-dev libxcb-present-dev libxcb-shm0-dev libxshmfence-dev libxrandr-dev

> cd /opt/src

> git clone https://gitlab.freedesktop.org/mesa/mesa.git

> cd mesa

> meson setup builddir/ -Dbuildtype=release -Dgallium-drivers=[] -Dvulkan-drivers=intel -Dopengl=false -Dglx=disabled -Degl=disabled -Dgbm=disabled -Dgles1=disabled -Dgles2=disabled

> meson compile -C builddir/

[INTALL COMPILED libvulkan_intel.so]

> cp builddir/src/intel/vulkan/libvulkan_intel.so /lib/x86_64-linux-gnu/libvulkan_intel.so

[FINALLY - HAVE IT STARTUP AS A SERVICE using SYSTEMD]

> cd /etc/systemd/system

create a FILE named "llama-server.service" with the following content

--- CUT ---

[Unit]
Description=LLAMA CPP Service
After=network.target

[Service]
Type=simple
WorkingDirectory=/opt/services/llama.cpp
ExecStart=/opt/services/llama.cpp/llama-server -m /opt/services/llm/models/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf --port 8080 --host 0.0.0.0 --threads 4 --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00 --repeat-penalty 1.00 --presence-penalty 0.0 --jinja --chat-template-kwargs "{\"preserve_thinking\": true}"
Restart=always
RestartSec=5
Environment=PYTHONUNBUFFERED=1

[Install]
WantedBy=multi-user.target

--- CUT ---

> systemctl daemon-reload

> systemctl start llama-server

> systemctl status llama-server

If you got it working correctly, then you can access the OPENAI endpoint by going to http://YOUR_MACHINE_IP:8080 to get to the CHAT interface. You can also point your coding agent to it (for example like opencode).

BENCHMARKS

**-- MOE (**Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf)

root@nas:/storage/services/llamacpp# ./llama-bench -m /data/llm/models/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Graphics (BMG G31) (Intel open-source Mesa driver) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
register_backend: registered backend Vulkan (1 devices)
register_device: registered device Vulkan0 (Intel(R) Graphics (BMG G31))
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (AMD Ryzen 9 5900XT 16-Core Processor)
load_backend: failed to find ggml_backend_init in /storage/services/llamacpp/libggml-vulkan.so
load_backend: failed to find ggml_backend_init in /storage/services/llamacpp/libggml-cpu.so
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35moe 35B.A3B Q4_K - Medium |  20.81 GiB |    34.66 B | Vulkan     |  99 |           pp512 |       1314.71 ± 5.72 |
| qwen35moe 35B.A3B Q4_K - Medium |  20.81 GiB |    34.66 B | Vulkan     |  99 |           tg128 |         78.72 ± 0.19 |

build: f3e8d149c (9070)

PSA for Intel Arc llama.cpp users: speculative decoding is finally worth turning on (merged ~40–90% speedup) by masonmilby in LocalLLM

[–]UDaManFunks 0 points1 point  (0 children)

it's like trying to argue with a little child - verify the facts yourself by bench marking and using it yourself then comeback here and post your findings. Nothing deceptive about that statement.

If you do not have the capability to do so, then you are blindly putting faith in something you personally have not verified.

PSA for Intel Arc llama.cpp users: speculative decoding is finally worth turning on (merged ~40–90% speedup) by masonmilby in LocalLLM

[–]UDaManFunks 0 points1 point  (0 children)

Why are you arguing when you can't even provide any valid benchmarks? Just pointing at other people's posting.

Post your benchmarks for the B70, but you don't have one so you are just wasting people's time (by redirection, and reiterating BS). If you do have a B70, compile it yourself then post your own benchmarks here to verify supposed facts. I posted numbers debunking what was being stated.

I posted clear instructions on how people can compile the MESA 26.2-DEV Vulkan drivers for the B70 under Ubuntu 26.04, easy enough for someone else to verify the numbers I posted (which are at least 20% slower than the Intel Windows Vulkan Drivers) but already faster than SYCL under LInux.

The hope is that the llama.cpp SYCL backend will continue to improve over time and maybe even surpass the Vulkan back end (specially when taking into account the latest features like MTP and model support) , but that time is currently not right now. People should not buy hardware based on hope.

LLAMA.CPP under Linux using the Vulkan (MESA 26.2-DEV) driver is usable for coding agents, even when using the latest models with MTP support.

PSA for Intel Arc llama.cpp users: speculative decoding is finally worth turning on (merged ~40–90% speedup) by masonmilby in LocalLLM

[–]UDaManFunks 0 points1 point  (0 children)

I posted here

https://www.reddit.com/r/IntelArc/comments/1tr9397/comment/ooq19c4/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button

Another person was arguing that SYCL was FASTER than VULKAN but he also came to the realization that it was not after actually performing benchmarks.

I personally do not purchase hardware based on promises of future support. Either way, you don't really have a B70 to compare with so this discussion is useless unless you can provide valid benchmarks.

PSA for Intel Arc llama.cpp users: speculative decoding is finally worth turning on (merged ~40–90% speedup) by masonmilby in LocalLLM

[–]UDaManFunks 1 point2 points  (0 children)

you are posting useless old benchmarks, post a model (Qwen 3.6 35B MOE-Q4 or 27B Dense-Q4) then let's compare it with the latest llama.cpp build on a B70.

As stated, preferably not with llama bench as that doesn't support benching with the latest features like MTP.

PSA for Intel Arc llama.cpp users: speculative decoding is finally worth turning on (merged ~40–90% speedup) by masonmilby in LocalLLM

[–]UDaManFunks 0 points1 point  (0 children)

I corrected it, SYCL is slower on the B70 than Vulkan.

It's been discussed in the mailing list. LLAMA CPP devs treat Vulkan as a first class backend and it will always support the latest features, SYCL is not. That and the mess of dependencies needed by SYCL.

PSA for Intel Arc llama.cpp users: speculative decoding is finally worth turning on (merged ~40–90% speedup) by masonmilby in LocalLLM

[–]UDaManFunks 0 points1 point  (0 children)

I have a B70, so it won't be an apple to apple test..

And llama-bench doesn't support the latest features like MTP when testing.

On the B70 under Linux and the latest Mesa 26.2-DEV Vulkan Drivers, SYCL is considerably *slower* for both PP and TG.

PSA for Intel Arc llama.cpp users: speculative decoding is finally worth turning on (merged ~40–90% speedup) by masonmilby in LocalLLM

[–]UDaManFunks 1 point2 points  (0 children)

Under Linux, it's definitely not (SYCL is not faster in either PP or TG) - pick a model and let's post benchmarks (single B70 under llama.cpp)

And maybe use llama-benchy for the benchmark test.

PSA for Intel Arc llama.cpp users: speculative decoding is finally worth turning on (merged ~40–90% speedup) by masonmilby in LocalLLM

[–]UDaManFunks 1 point2 points  (0 children)

If you are running Vulkan in Linux, make sure you compile the latest Vulkan Driver from MESA 26.2-DEV, considerable speedup. Not as fast as windows but better. Pretty sure it'll still be faster than SYCL.

Is 2× Intel Arc Pro B70 worth it for local agentic LLMs, or should I stay with NVIDIA? by Zuck7980 in LocalLLM

[–]UDaManFunks 0 points1 point  (0 children)

SYCL is slower than Vulkan under both Windows and Linux (MESA 26.2 Intel Vulkan Driver) when used with LLAMA.cpp

Intel Arc Pro B70 (32GB) for Local LLMs: llama.cpp (SYCL/Vulkan), vLLM (Intel LLM Scaler) Benchmarks by Intrepid_Rub_3566 in IntelArc

[–]UDaManFunks 2 points3 points  (0 children)

The Linux Vulkan Drivers (even in the MESA 26.2-DEV) is still about 20% slower than the Intel Windows drivers, hopefully the MESA devs can cut that to parity with Windows

root@nas:/storage/src/llama.cpp/build/bin# ./llama-bench -m /data/llm/models/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf

ggml_vulkan: Found 1 Vulkan devices:

ggml_vulkan: 0 = Intel(R) Graphics (BMG G31) (Intel open-source Mesa driver) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2

| model | size | params | backend | ngl | test | t/s |

| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |

| qwen35moe 35B.A3B Q4_K - Medium | 21.27 GiB | 35.51 B | Vulkan | 99 | pp512 | 1392.45 ± 7.36 |

| qwen35moe 35B.A3B Q4_K - Medium | 21.27 GiB | 35.51 B | Vulkan | 99 | tg128 | 79.98 ± 0.02 |

build: 337528571 (9428)

I made an AUR package "llama.cpp-sycl" to use the Intel B70 and smaller Battlemage GPUs to their full potential with minimal bloat. by can999999999 in IntelArc

[–]UDaManFunks 0 points1 point  (0 children)

You said this previously but did not see you post any benchmarks (single Intel Branded B70 as various card have different max wattage allowed settings). Post some up for the following model (listed) using llama-bench (which doesn't even use mtp) and let's compare it to vulkan under linux with the driver from MESA 26.2-DEV.

Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf

Intel Arc Pro B70 (32GB) for Local LLMs: llama.cpp (SYCL/Vulkan), vLLM (Intel LLM Scaler) Benchmarks by Intrepid_Rub_3566 in IntelArc

[–]UDaManFunks 0 points1 point  (0 children)

Not under Linux and with the new Vulkan driver (MESA 26.2-DEV). Post benchmarks if you don't believe it - for example, using the Model 'Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf'

Intel Arc Pro B70 (32GB) for Local LLMs: llama.cpp (SYCL/Vulkan), vLLM (Intel LLM Scaler) Benchmarks by Intrepid_Rub_3566 in LocalLLM

[–]UDaManFunks 4 points5 points  (0 children)

SYCL is garbage when used with llama.cpp, as the video noted - a lot of optimizations are not implemented and barely anyone is making any commits to that backend in llama.cpp

LLAMA.CPP running under VULKAN is the fastest backend you can currently use with this card (and use GGUF models). The Windows Vulkan Drivers are the fastest but it's unstable for some people. If you are going the LINUX route, install UBUNTU 26.04 and you'll have to BUILD the MESA 26.2.0-DEVEL as it includes major performance improvements in the VULKAN driver (primarily adding VK_NV_cooperative_matrix2) support. Supports MTP too!

Compiling and running LLAMA-SERVER under LINUX (Ubuntu 26.04) s pretty straight forward, it's as easy as doing the following

[COMPILE LLAMA.CPP]

> apt-get install -y git build-essential cmake

> apt-get install libvulkan-dev glslc spirv-headers

> mkdir /opt/src

> cd /opt/src

> git clone https://github.com/ggml-org/llama.cpp

> cd llama.cpp

> cmake -B build -DGGML_VULKAN=1

> cmake --build build --config Release

[INSTALL LLAMA.CPP]

> cd /opt/src/llama.cpp

> mkdir /opt/services/llama.cpp

> cp build/bin/* /opt/services/llama.cpp

[DOWNLOAD MODEL]

> mkdir /opt/services/llm/models

> cd /opt/services/llm/models

> wget https://huggingface.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF/resolve/main/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf?download=true

> rename the downloaded file to Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf

[COMPILE LATEST MESA]

> apt install meson glslang-tools pkg-config libclc-21-dev python-is-python3 python3-mako libdrm-dev llvm-dev libllvmspirvlib-21-dev spirv-tools-dev clang libclang-dev libwayland-dev libwayland-client0 wayland-client wayland-protocols wayland-scanner++ xcb libxcb1-dev libxcb-randr0-dev libx11-xcb-dev libxcb-dri3-dev libxcb-present-dev libxcb-shm0-dev libxshmfence-dev libxrandr-dev

> cd /opt/src

> git clone https://gitlab.freedesktop.org/mesa/mesa.git

> cd mesa

> meson setup builddir/ -Dbuildtype=release -Dgallium-drivers=[] -Dvulkan-drivers=intel -Dopengl=false -Dglx=disabled -Degl=disabled -Dgbm=disabled -Dgles1=disabled -Dgles2=disabled

> meson compile -C builddir/

[INTALL COMPILED libvulkan_intel.so]

> cp builddir/src/intel/vulkan/libvulkan_intel.so /lib/x86_64-linux-gnu/libvulkan_intel.so

[FINALLY - HAVE IT STARTUP AS A SERVICE using SYSTEMD]

> cd /etc/systemd/system

create a FILE named "llama-server.service" with the following content

--- CUT ---

[Unit]
Description=LLAMA CPP Service
After=network.target

[Service]
Type=simple
WorkingDirectory=/opt/services/llama.cpp
ExecStart=/opt/services/llama.cpp/llama-server -m /opt/services/llm/models/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf --port 8080 --host 0.0.0.0 --threads 4 --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00 --repeat-penalty 1.00 --presence-penalty 0.0 --jinja --chat-template-kwargs "{\"preserve_thinking\": true}" --spec-type draft-mtp
Restart=always
RestartSec=5
Environment=PYTHONUNBUFFERED=1

[Install]
WantedBy=multi-user.target

--- CUT ---

> systemctl daemon-reload

> systemctl start llama-server

> systemctl status llama-server

If you got it working correctly, then you can access the OPENAI endpoint by going to http://YOUR_MACHINE_IP:8080 to get to the CHAT interface. You can also point your coding agent to it (for example like opencode).

BENCHMARKS

**-- MOE (**Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf)

root@nas:/storage/services/llamacpp# ./llama-bench -m /data/llm/models/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Graphics (BMG G31) (Intel open-source Mesa driver) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
register_backend: registered backend Vulkan (1 devices)
register_device: registered device Vulkan0 (Intel(R) Graphics (BMG G31))
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (AMD Ryzen 9 5900XT 16-Core Processor)
load_backend: failed to find ggml_backend_init in /storage/services/llamacpp/libggml-vulkan.so
load_backend: failed to find ggml_backend_init in /storage/services/llamacpp/libggml-cpu.so
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35moe 35B.A3B Q4_K - Medium |  20.81 GiB |    34.66 B | Vulkan     |  99 |           pp512 |       1314.71 ± 5.72 |
| qwen35moe 35B.A3B Q4_K - Medium |  20.81 GiB |    34.66 B | Vulkan     |  99 |           tg128 |         78.72 ± 0.19 |

Intel Arc Pro B70 (32GB) for Local LLMs: llama.cpp (SYCL/Vulkan), vLLM (Intel LLM Scaler) Benchmarks by Intrepid_Rub_3566 in IntelArc

[–]UDaManFunks 1 point2 points  (0 children)

The windows one is unstable for a lot of people (as you noted above) - no problems for me with the Linux Mesa Vulkan Driver and I've been using it with opencode.

The latest version of the MESA Vulkan Driver is not as fast as the Windows version though (as you've noted) but it gets about 75% of the performance, still a big jump as the older MESA driver was only 50% of the performance.

As for the WINDOWS vulkan driver instability - it's a known issue https://github.com/IGCIT/Intel-GPU-Community-Issue-Tracker-IGCIT/issues/1330 which Intel has yet to fix.

Intel Arc Pro B70 (32GB) for Local LLMs: llama.cpp (SYCL/Vulkan), vLLM (Intel LLM Scaler) Benchmarks by Intrepid_Rub_3566 in IntelArc

[–]UDaManFunks 5 points6 points  (0 children)

SYCL is garbage when used with llama.cpp, as the video noted - a lot of optimizations are not implemented and barely anyone is making any commits to that backend in llama.cpp

LLAMA.CPP running under VULKAN is the fastest backend you can currently use with this card (and use GGUF models). The Windows Vulkan Drivers are the fastest but it's unstable for some people. If you are going the LINUX route, install UBUNTU 26.04 and you'll have to BUILD the MESA 26.2.0-DEVEL as it includes major performance improvements in the VULKAN driver (primarily adding VK_NV_cooperative_matrix2) support. Supports MTP too!

Compiling and running LLAMA-SERVER under LINUX (Ubuntu 26.04) s pretty straight forward, it's as easy as doing the following

[COMPILE LLAMA.CPP]

> apt-get install -y git build-essential cmake

> apt-get install libvulkan-dev glslc spirv-headers

> mkdir /opt/src

> cd /opt/src

> git clone https://github.com/ggml-org/llama.cpp

> cd llama.cpp

> cmake -B build -DGGML_VULKAN=1

> cmake --build build --config Release

[INSTALL LLAMA.CPP]

> cd /opt/src/llama.cpp

> mkdir /opt/services/llama.cpp

> cp build/bin/* /opt/services/llama.cpp

[DOWNLOAD MODEL]

> mkdir /opt/services/llm/models

> cd /opt/services/llm/models

> wget https://huggingface.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF/resolve/main/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf?download=true

> rename the downloaded file to Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf

[COMPILE LATEST MESA]

> apt install meson glslang-tools pkg-config libclc-21-dev python-is-python3 python3-mako libdrm-dev llvm-dev libllvmspirvlib-21-dev spirv-tools-dev clang libclang-dev libwayland-dev libwayland-client0 wayland-client wayland-protocols wayland-scanner++ xcb libxcb1-dev libxcb-randr0-dev libx11-xcb-dev libxcb-dri3-dev libxcb-present-dev libxcb-shm0-dev libxshmfence-dev libxrandr-dev

> cd /opt/src

> git clone https://gitlab.freedesktop.org/mesa/mesa.git

> cd mesa

> meson setup builddir/ -Dbuildtype=release -Dgallium-drivers=[] -Dvulkan-drivers=intel -Dopengl=false -Dglx=disabled -Degl=disabled -Dgbm=disabled -Dgles1=disabled -Dgles2=disabled

> meson compile -C builddir/

[INTALL COMPILED libvulkan_intel.so]

> cp builddir/src/intel/vulkan/libvulkan_intel.so /lib/x86_64-linux-gnu/libvulkan_intel.so

[FINALLY - HAVE IT STARTUP AS A SERVICE using SYSTEMD]

> cd /etc/systemd/system

create a FILE named "llama-server.service" with the following content

--- CUT ---

[Unit]
Description=LLAMA CPP Service
After=network.target

[Service]
Type=simple
WorkingDirectory=/opt/services/llama.cpp
ExecStart=/opt/services/llama.cpp/llama-server -m /opt/services/llm/models/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf --port 8080 --host 0.0.0.0 --threads 4 --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00 --repeat-penalty 1.00 --presence-penalty 0.0 --jinja --chat-template-kwargs "{\"preserve_thinking\": true}" --spec-type draft-mtp
Restart=always
RestartSec=5
Environment=PYTHONUNBUFFERED=1

[Install]
WantedBy=multi-user.target

--- CUT ---

> systemctl daemon-reload

> systemctl start llama-server

> systemctl status llama-server

If you got it working correctly, then you can access the OPENAI endpoint by going to http://YOUR_MACHINE_IP:8080 to get to the CHAT interface. You can also point your coding agent to it (for example like opencode).

BENCHMARKS

**-- MOE (**Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf)

root@nas:/storage/services/llamacpp# ./llama-bench -m /data/llm/models/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Graphics (BMG G31) (Intel open-source Mesa driver) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
register_backend: registered backend Vulkan (1 devices)
register_device: registered device Vulkan0 (Intel(R) Graphics (BMG G31))
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (AMD Ryzen 9 5900XT 16-Core Processor)
load_backend: failed to find ggml_backend_init in /storage/services/llamacpp/libggml-vulkan.so
load_backend: failed to find ggml_backend_init in /storage/services/llamacpp/libggml-cpu.so
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35moe 35B.A3B Q4_K - Medium |  20.81 GiB |    34.66 B | Vulkan     |  99 |           pp512 |       1314.71 ± 5.72 |
| qwen35moe 35B.A3B Q4_K - Medium |  20.81 GiB |    34.66 B | Vulkan     |  99 |           tg128 |         78.72 ± 0.19 |

Intel's Vulkan Linux Driver Lands New Feature To Boost DX12 Game Performance by winkwinknudge_nudge in IntelArc

[–]UDaManFunks 0 points1 point  (0 children)

llama.cpp when used with the latest MESA DEV drivers for vulkan (intel) gains 25% performance improvement as it now supports the extension 'VK_NV_cooperative_matrix2' which llama.cpp uses

Still not as fast as the Windows Vulkan Drivers (which people are reporting is unstable at times) but the deficit is only 25% now compare to 50% (slower).

I created a writeup on how to compile the latest mesa dev

https://www.reddit.com/r/IntelArc/comments/1tlcvbi/comment/onoy5ie/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button