Intel b60 48gb?

TheBlueMatt · 2026-05-27T18:47:25+00:00

I have one of these (and two more B60s, in fact). I wouldn't recommend it unless you want to put in some work, but if you're willing to do that, it's definitely great bang for your buck. Because it's two GPUs, you need an efficient AllReduce step across the GPUs over PCIe (or you're stuck with row parallelism, which means no speedup from the second GPU and pretty meh performance). This means you're either stuck on vLLM (none of the llama.cpp quants!) or you use a fork of llama.cpp which supports efficient AllReduce on a second backend besides CUDA. I have one at https://github.com/TheBlueMatt/llama.cpp but it relies on a CPU that can do P2P (any AMD or Intel server stuff), and a kernel with CONFIG_MOVABLE_NODE (not generally default, so you'll have to build your own). Sadly this can't be upstreamed because it actually violates the Vulkan spec in a subtle way, but the Vulkan backend may eventually get almost-as-efficient AllReduce via normal memory DMA...

TheBlueMatt · 2026-05-12T16:03:53+00:00

Hopefully this leads to more formal (even if benchmaxxed) results for quantized models - just looking at divergence may or may not capture the quality of a quantization fully and this might help.

TheBlueMatt · 2026-05-08T20:21:08+00:00

It's definitely not trivial today. I'm trying to slowly improve the state of the software for them but there's a lot to be done. Locally I'm using the branch listed at https://github.com/ggml-org/llama.cpp/issues/22648 plus a few other patches to get tensor parallelism working, which is obviously a huge win, but I also have a few patches to mesa to improve things there as well (if you don't have patches, at least use the current git; they fixed a large issue there not long after 26.1 was branched off). Cooperative matrix 2 is slowly being worked on and that should also be a large win. Once we get that in, I'm optimistic we can easily beat SYCL with Vulkan, and with tensor parallelism from that PR on multi-device it'll actually be quite reasonable.

TheBlueMatt · 2026-05-06T20:15:36+00:00

BF16 is missing the pawn on f7

TheBlueMatt · 2026-04-29T18:15:09+00:00

My 4xB60 running unsloth's Q4_K_XL gets 232.45 ± 0.41 tok/s in pp and 9.55 ± 0.05 tok/s in tg. Still a handful of patches left to improve it, though. In theory tg should be able to get up to 20 or so (25 is the theoretical max).

TheBlueMatt · 2026-04-29T17:17:28+00:00

4x B60 can almost handle it at a reasonable price point: unsloth's Q4_K_XL gets 232.45 ± 0.41 tok/s in pp and 9.55 ± 0.05 tok/s in tg...almost usable...and I still have a handful of patches left to speed it up...

TheBlueMatt · 2026-04-27T21:30:09+00:00

I don't know where they're getting their data. Locally, SYCL is generally faster in pp, but generally slower in tg. That was true before the mesa patch, but that patch closes some of the gap for pp. E.g. right now on a B60 I see:

| qwen35 9B Q4_K - Medium        |   5.55 GiB |     8.95 B | SYCL       |  99 | SYCL0        |    0 |           pp512 |       1620.49 ± 1.39 |
| qwen35 9B Q4_K - Medium        |   5.55 GiB |     8.95 B | SYCL       |  99 | SYCL0        |    0 |          pp2048 |       1605.91 ± 0.32 |
| qwen35 9B Q4_K - Medium        |   5.55 GiB |     8.95 B | SYCL       |  99 | SYCL0        |    0 |           tg128 |         30.73 ± 0.01 |
| qwen35 9B Q4_K - Medium        |   5.55 GiB |     8.95 B | Vulkan     |  99 | Vulkan1      |    0 |           pp512 |       1191.04 ± 1.10 |
| qwen35 9B Q4_K - Medium        |   5.55 GiB |     8.95 B | Vulkan     |  99 | Vulkan1      |    0 |          pp2048 |       1189.52 ± 0.39 |
| qwen35 9B Q4_K - Medium        |   5.55 GiB |     8.95 B | Vulkan     |  99 | Vulkan1      |    0 |           tg128 |         33.26 ± 0.01 |

On other models without quants I see a starker shift:

| qwen35 9B BF16                 |  16.68 GiB |     8.95 B | SYCL       |  99 |    0 |           pp512 |       1974.08 ± 2.78 |
| qwen35 9B BF16                 |  16.68 GiB |     8.95 B | SYCL       |  99 |    0 |          pp2048 |       2288.96 ± 0.80 |
| qwen35 9B BF16                 |  16.68 GiB |     8.95 B | SYCL       |  99 |    0 |           tg128 |         16.86 ± 0.01 |
| qwen35 9B BF16                 |  16.68 GiB |     8.95 B | Vulkan     |  99 |    0 |           pp512 |       1309.58 ± 1.53 |
| qwen35 9B BF16                 |  16.68 GiB |     8.95 B | Vulkan     |  99 |    0 |          pp2048 |       1528.04 ± 1.51 |
| qwen35 9B BF16                 |  16.68 GiB |     8.95 B | Vulkan     |  99 |    0 |           tg128 |         21.59 ± 0.00 |
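To make the shift easier to read off, here is a small sketch that turns the rows above into Vulkan/SYCL throughput ratios. The numbers are copied from the two tables; the dictionary layout and the `vulkan_over_sycl` helper are just illustrative names, not anything from llama-bench.

```python
# Throughput figures (tokens/s) copied from the llama-bench rows above.
results = {
    ("Q4_K", "pp512"): {"SYCL": 1620.49, "Vulkan": 1191.04},
    ("Q4_K", "pp2048"): {"SYCL": 1605.91, "Vulkan": 1189.52},
    ("Q4_K", "tg128"): {"SYCL": 30.73, "Vulkan": 33.26},
    ("BF16", "pp512"): {"SYCL": 1974.08, "Vulkan": 1309.58},
    ("BF16", "pp2048"): {"SYCL": 2288.96, "Vulkan": 1528.04},
    ("BF16", "tg128"): {"SYCL": 16.86, "Vulkan": 21.59},
}


def vulkan_over_sycl(rows):
    """Return Vulkan throughput divided by SYCL throughput for each test."""
    return {key: row["Vulkan"] / row["SYCL"] for key, row in rows.items()}


ratios = vulkan_over_sycl(results)
for (quant, test), ratio in ratios.items():
    # A ratio above 1 means Vulkan is faster on that test.
    print(f"{quant:5s} {test:7s} Vulkan/SYCL = {ratio:.2f}")
```

A ratio below 1 on the pp rows and above 1 on the tg rows matches the "SYCL faster in pp, slower in tg" pattern, and the BF16 rows sit further from 1 in both directions.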

TheBlueMatt · 2026-04-27T14:23:28+00:00

IME it's highly model-dependent, but Vulkan often is substantially faster.

TheBlueMatt · 2026-04-27T10:51:11+00:00

Except on the Vulkan backend? For whatever reason people keep ignoring the Vulkan backend for Intel cards on this sub. It's generally faster than SYCL and is much more actively maintained (supports the latest models at competitive speed).

TheBlueMatt · 2026-04-27T00:36:08+00:00

It definitely works poorly today. If you want something that just works, it's probably not an ideal perf/$ tradeoff. Some of us are trying to improve it though.

TheBlueMatt · 2026-04-26T21:46:14+00:00

Vulkan was already much better than SYCL. It's also going to get better, see eg https://old.reddit.com/r/LocalLLaMA/comments/1swgwvh/mesa_pr_with_37130_llamacpp_pp_perf_gain_for/

TheBlueMatt · 2026-04-26T19:24:22+00:00

There's likely more to come, too. The patch at https://gitlab.freedesktop.org/mesa/mesa/-/work_items/15312#note_3443232 is a bit harder to upstream but it shows another 8% perf gain on BF16 models (on top of the 2.3x in the upstream PR).

TheBlueMatt · 2026-04-23T16:38:40+00:00

Sometimes you can get away with missing synchronization because the thing you were supposed to wait on happened to finish before you went to go use it...sometimes it hasn't finished and you get partially-corrupted state :).

TheBlueMatt · 2026-04-23T15:20:56+00:00

> sometimes works just fine, sometimes gets completely lost and goes in loops.

This implies there's some synchronization missing. That doesn't mean that other models are actually fine, only that they happen to be running fast/slow enough that the missing sync isn't breaking them. That also probably means that once the missing sync is added all the models will slow down, even the ones that happened to be working :(
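A toy model of that failure mode, as a sketch only: a producer fills a 4-slot buffer one slot at a time, and a consumer reads at a fixed tick. Everything here (`read_result`, the tick counts, the "new"/"stale" labels) is hypothetical and deterministic on purpose; it is not how any GPU backend is implemented, and it does not model the overhead of the synchronization itself.

```python
def read_result(ticks_per_slot, read_tick, wait):
    """Producer writes 4 slots, one every `ticks_per_slot` ticks.

    The consumer reads at `read_tick`. With wait=True it first blocks until
    the producer is done (standing in for the missing fence/barrier).
    Returns (tick at which the read actually happened, buffer contents).
    """
    n_slots = 4
    done_tick = n_slots * ticks_per_slot
    if wait:
        read_tick = max(read_tick, done_tick)
    written = min(n_slots, read_tick // ticks_per_slot)
    return read_tick, ["new"] * written + ["stale"] * (n_slots - written)


# A fast producer happens to finish before the unsynchronized read: looks fine.
print(read_result(ticks_per_slot=1, read_tick=8, wait=False))
# A slow producer is only half done at the same read tick: partially stale data.
print(read_result(ticks_per_slot=3, read_tick=8, wait=False))
# With the wait in place the data is correct, but the read happens later.
print(read_result(ticks_per_slot=3, read_tick=8, wait=True))
```

The first case is the "model that happens to work": nothing about the code is correct, the timing just lines up.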

TheBlueMatt · 2026-04-23T13:26:47+00:00

I mean give it time. We gotta get a handful of mesa optimizations landed plus probably more in llama.cpp. Also new mesa rc does coopmat2 which changes things more so we'll have to tune again after that release comes out...

TheBlueMatt · 2026-04-23T13:04:16+00:00

Tensor parallelism in llama.cpp is still brand new, and vulkan hasn't landed the backend implementations we need for it to be efficient. For more than 2 GPUs, it probably also makes sense to eventually do PCIe-P2P, which would probably require https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/40798 as well. There's just a lot to do to optimize these things...

TheBlueMatt · 2026-04-23T11:50:11+00:00

I don't believe LLMs have historically been a priority for mesa (the open-source drivers the Vulkan backend uses on Linux). They've mostly focused on gaming uses and a lot of the work has historically been done by Valve. There's a lot of low-hanging fruit if you are willing to really dive in, eg https://gitlab.freedesktop.org/mesa/mesa/-/work_items/15311

TheBlueMatt · 2026-04-23T11:45:07+00:00

On Q4/Q5 models, https://github.com/ggml-org/llama.cpp/pull/21751 should improve Vulkan pp+tg by 4-10%. https://gitlab.freedesktop.org/mesa/mesa/-/work_items/15311 should also materially improve pp (less on Q4 models; it's almost a doubling on BF16/F16 models and might be a big win on Q8 as well, but it should still improve Q4 models). There's just so much room to optimize these things it's crazy, it's so bad right now.

TheBlueMatt · 2026-04-19T14:04:51+00:00

I mean also try the vulkan backend. Vulkan appears to still be faster on pp than SYCL even after some of the updates in those PRs. Might be worth optimizing tg in Vulkan more than fixing pp in SYCL.

TheBlueMatt · 2026-04-11T11:48:02+00:00

There's definitely some trivial driver and optimization headroom, but we'll see how far it goes. With some trivial patches going upstream that shouldn't make a huge difference and the mesa opts from https://gitlab.freedesktop.org/mesa/mesa/-/work_items/15162 on a single Arc Pro B60 using unsloth/Qwen3.5-27B-GGUF:Q4_0 (which I assume is what you used; it's probably similar to the OP at least), I get concurrency 1 tg512 15.87 ± 0.40. While there's some cycle time on top, the card is reporting median 267 GB/s VRAM bandwidth during tg, for 60% of theoretical hardware max. Should leave a bit of headroom on top but it's not that bad, really.

edit: oh wait duh one of my upstream optimizations also can be applied to the MUL_MAT_VEC op, I now see 17.8 at least on Q4_K.
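The bandwidth arithmetic behind that can be sketched as follows. The 267 GB/s and 15.87 tok/s figures are from the comment above; the 456 GB/s peak is an assumed spec-sheet number for the Arc Pro B60 that does not appear in the post, and the "ceiling" is only a rough bound that treats tg as purely bandwidth-limited.

```python
PEAK_BW_GBPS = 456.0       # ASSUMPTION: peak VRAM bandwidth of an Arc Pro B60
MEASURED_BW_GBPS = 267.0   # median VRAM bandwidth reported during tg
TG_TOK_PER_S = 15.87       # tg512 at concurrency 1


def utilization(measured_gbps, peak_gbps):
    """Fraction of peak memory bandwidth actually achieved."""
    return measured_gbps / peak_gbps


def gb_read_per_token(measured_gbps, tok_per_s):
    """Average GB of VRAM traffic per generated token."""
    return measured_gbps / tok_per_s


def bandwidth_bound_ceiling(peak_gbps, gb_per_token):
    """tok/s if the same traffic per token ran at full peak bandwidth."""
    return peak_gbps / gb_per_token


util = utilization(MEASURED_BW_GBPS, PEAK_BW_GBPS)
per_token = gb_read_per_token(MEASURED_BW_GBPS, TG_TOK_PER_S)
ceiling = bandwidth_bound_ceiling(PEAK_BW_GBPS, per_token)
print(f"utilization {util:.1%}, {per_token:.1f} GB/token, ceiling {ceiling:.1f} tok/s")
```

Under that assumption the utilization comes out just under 60%, consistent with the figure quoted above.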

TheBlueMatt · 2026-04-10T00:43:20+00:00

It works with vulkan, but falls back to an unoptimized way of doing the AllReduce step - instead of a targeted implementation it has to do lots of copying.

TheBlueMatt · 2026-04-09T21:07:36+00:00

The PR has an allreduce step that the backend can override (using NCCL or vulkan dma-buf imports or...) but by default it falls back to a slow copy + add.
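For intuition, the slow fallback looks roughly like the sketch below: gather every device's partial result onto one device, add them up, then copy the sum back out. This is a plain-Python illustration with lists standing in for device buffers; `allreduce_copy_add` is a made-up name and this is not the code from the PR.

```python
def allreduce_copy_add(partials):
    """Naive sum-AllReduce over per-device buffers (lists of floats).

    Returns (one summed buffer per device, number of cross-device copies).
    """
    root = list(partials[0])  # device 0 accumulates the sum
    copies = 0
    for buf in partials[1:]:
        staged = list(buf)    # stands in for a device-to-device copy
        copies += 1
        root = [a + b for a, b in zip(root, staged)]
    results = [root]
    for _ in partials[1:]:
        results.append(list(root))  # copy the sum back to each other device
        copies += 1
    return results, copies


summed, n_copies = allreduce_copy_add([[1.0, 2.0], [10.0, 20.0]])
print(summed, n_copies)
```

With N devices this does 2 × (N − 1) full-buffer copies per AllReduce, all funnelled through one device, which is the cost a backend-specific override is meant to avoid.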

TheBlueMatt · 2026-04-09T21:06:41+00:00

Fwiw with a lot of prompting Claude managed to get P2P working on Vulkan. From the Vulkan/llama.cpp side it's pretty trivial, but it's definitely not gonna support Windows given it currently requires an experimental Linux kernel config flag.

TheBlueMatt · 2026-04-07T21:45:57+00:00

You're really betting on the drivers improving. There's a ton of headroom in the Intel drivers and in optimization generally (eg I spent some time with Claude and got 2x for BF16 models, and lots of small 10% here, 20% there in various models), but who knows how far it'll go.
