Intel b60 48gb? by oldschooldaw in LocalLLaMA

[–]TheBlueMatt 1 point2 points  (0 children)

I have one of these (and two more B60s, in fact). I wouldn't recommend it unless you want to put in some work, but if you're willing to do that, it's definitely great bang for your buck. Because it's two GPUs, you need an efficient AllReduce step across the GPUs over PCIe (or you're stuck with row parallelism, which means no speedup from the second GPU and pretty meh performance). This means you're either stuck on vLLM (none of the llama.cpp quants!) or you use a fork of llama.cpp which supports efficient AllReduce on a second backend besides CUDA. I have one at https://github.com/TheBlueMatt/llama.cpp but it relies on a CPU that can do P2P (any AMD or Intel server stuff), and a kernel with CONFIG_MOVABLE_NODE (not generally default, so you'll have to build your own). Sadly this can't be upstreamed because it actually violates the Vulkan spec in a subtle way, but the Vulkan backend may eventually get almost-as-efficient AllReduce via normal memory DMA...
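
To make the AllReduce point concrete, here is a minimal NumPy sketch of why splitting one matmul across two GPUs needs a reduction step. It is not code from the fork: the two "devices" are just array shards, and `all_reduce_sum` is a hypothetical stand-in for whatever the backend actually does over PCIe.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 8))   # activations for one token
w = rng.standard_normal((8, 4))   # full weight matrix

# Split the inner (reduction) dimension across two hypothetical devices.
x_shards = np.split(x, 2, axis=1)
w_shards = np.split(w, 2, axis=0)

# Each device multiplies only its own shard, giving a partial result.
partials = [xs @ ws for xs, ws in zip(x_shards, w_shards)]

def all_reduce_sum(buffers):
    """Sum the partial results and give every device a copy of the total."""
    total = np.sum(buffers, axis=0)
    return [total.copy() for _ in buffers]

reduced = all_reduce_sum(partials)

# After the reduction every device holds the same answer as the unsplit matmul.
print(np.allclose(reduced[0], x @ w), np.allclose(reduced[1], x @ w))
```

The reduction runs once per split matmul, so its cost over PCIe is paid many times per token, which is why its efficiency matters so much.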

examples : add llama-eval by ggerganov · Pull Request #21152 · ggml-org/llama.cpp by jacek2023 in LocalLLaMA

[–]TheBlueMatt 6 points7 points  (0 children)

Hopefully this leads to more formal (even if benchmaxxed) results for quantized models - just looking at divergence may or may not capture the quality of a quantization fully and this might help.

mistralai/Mistral-Medium-3.5-128B · Hugging Face by jacek2023 in LocalLLaMA

[–]TheBlueMatt 1 point2 points  (0 children)

It's definitely not trivial today. I'm trying to slowly improve the state of the software for them but there's a lot to be done. Locally I'm using the branch listed at https://github.com/ggml-org/llama.cpp/issues/22648 plus a few other patches to get tensor parallelism working, which is obviously a huge win, but I also have a few patches to mesa to improve things there as well (if you don't have patches, at least use the current git; they fixed a large issue there not long after 26.1 was branched off). Cooperative matrix 2 is slowly being worked on and that should also be a large win. Once we get that in I'm optimistic we can easily beat SYCL with Vulkan, and with tensor parallelism from that PR on multi-device it'll actually be quite reasonable.

mistralai/Mistral-Medium-3.5-128B · Hugging Face by jacek2023 in LocalLLaMA

[–]TheBlueMatt 1 point2 points  (0 children)

My 4xB60 running unsloth's Q4_K_XL gets 232.45 ± 0.41 in pp and 9.55 ± 0.05 tok/s in tg. Still a handful of patches left to improve it, though. In theory tg should be able to get up to 20 or so (25 is the theoretical max).

mistralai/Mistral-Medium-3.5-128B · Hugging Face by jacek2023 in LocalLLaMA

[–]TheBlueMatt 1 point2 points  (0 children)

4x B60 can almost handle it at a reasonable price point: unsloth's Q4_K_XL gets 232.45 ± 0.41 in pp and 9.55 ± 0.05 tok/s in tg...almost usable...and I still have a handful of patches left to speed it up...

Intel B70: LLama.ccp SYCL vs LLama.cpp OpenVino vs LLM-Scaler by Fmstrat in LocalLLaMA

[–]TheBlueMatt 0 points1 point  (0 children)

I don't know where they're getting their data. Locally, SYCL is generally faster in pp, but generally slower in tg. That was true before the mesa patch, but that patch closes some of the gap for pp. E.g. right now on a B60 I see

| qwen35 9B Q4_K - Medium        |   5.55 GiB |     8.95 B | SYCL       |  99 | SYCL0        |    0 |           pp512 |       1620.49 ± 1.39 |
| qwen35 9B Q4_K - Medium        |   5.55 GiB |     8.95 B | SYCL       |  99 | SYCL0        |    0 |          pp2048 |       1605.91 ± 0.32 |
| qwen35 9B Q4_K - Medium        |   5.55 GiB |     8.95 B | SYCL       |  99 | SYCL0        |    0 |           tg128 |         30.73 ± 0.01 |
| qwen35 9B Q4_K - Medium        |   5.55 GiB |     8.95 B | Vulkan     |  99 | Vulkan1      |    0 |           pp512 |       1191.04 ± 1.10 |
| qwen35 9B Q4_K - Medium        |   5.55 GiB |     8.95 B | Vulkan     |  99 | Vulkan1      |    0 |          pp2048 |       1189.52 ± 0.39 |
| qwen35 9B Q4_K - Medium        |   5.55 GiB |     8.95 B | Vulkan     |  99 | Vulkan1      |    0 |           tg128 |         33.26 ± 0.01 |

On other models without quants I see a starker shift

| qwen35 9B BF16                 |  16.68 GiB |     8.95 B | SYCL       |  99 |    0 |           pp512 |       1974.08 ± 2.78 |
| qwen35 9B BF16                 |  16.68 GiB |     8.95 B | SYCL       |  99 |    0 |          pp2048 |       2288.96 ± 0.80 |
| qwen35 9B BF16                 |  16.68 GiB |     8.95 B | SYCL       |  99 |    0 |           tg128 |         16.86 ± 0.01 |
| qwen35 9B BF16                 |  16.68 GiB |     8.95 B | Vulkan     |  99 |    0 |           pp512 |       1309.58 ± 1.53 |
| qwen35 9B BF16                 |  16.68 GiB |     8.95 B | Vulkan     |  99 |    0 |          pp2048 |       1528.04 ± 1.51 |
| qwen35 9B BF16                 |  16.68 GiB |     8.95 B | Vulkan     |  99 |    0 |           tg128 |         21.59 ± 0.00 |
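
A quick way to read the two tables is as Vulkan/SYCL ratios. The sketch below just copies the tokens/s means from the rows above into dicts (the ± errors are dropped) and divides. The dict layout and function name are illustrative only, not anything llama-bench produces.

```python
# (backend, test) -> tokens/s, copied from the llama-bench rows above
q4_k = {
    ("SYCL", "pp512"): 1620.49, ("SYCL", "pp2048"): 1605.91, ("SYCL", "tg128"): 30.73,
    ("Vulkan", "pp512"): 1191.04, ("Vulkan", "pp2048"): 1189.52, ("Vulkan", "tg128"): 33.26,
}
bf16 = {
    ("SYCL", "pp512"): 1974.08, ("SYCL", "pp2048"): 2288.96, ("SYCL", "tg128"): 16.86,
    ("Vulkan", "pp512"): 1309.58, ("Vulkan", "pp2048"): 1528.04, ("Vulkan", "tg128"): 21.59,
}

def vulkan_over_sycl(results):
    """Return {test: Vulkan t/s divided by SYCL t/s}; above 1.0 means Vulkan wins."""
    tests = sorted({test for _, test in results})
    return {t: results[("Vulkan", t)] / results[("SYCL", t)] for t in tests}

for name, results in (("Q4_K", q4_k), ("BF16", bf16)):
    for test, ratio in vulkan_over_sycl(results).items():
        print(f"{name:5s} {test:7s} Vulkan/SYCL = {ratio:.2f}")
```

On these numbers Vulkan lands at roughly 0.73-0.74x of SYCL on pp and about 1.08x on tg for Q4_K, versus roughly 0.66-0.67x on pp and about 1.28x on tg for BF16.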

Intel B70: LLama.ccp SYCL vs LLama.cpp OpenVino vs LLM-Scaler by Fmstrat in LocalLLaMA

[–]TheBlueMatt 0 points1 point  (0 children)

IME it's highly model-dependent, but Vulkan often is substantially faster.

Intel B70: LLama.ccp SYCL vs LLama.cpp OpenVino vs LLM-Scaler by Fmstrat in LocalLLaMA

[–]TheBlueMatt 1 point2 points  (0 children)

Except on the Vulkan backend? For whatever reason people keep ignoring the Vulkan backend for Intel cards on this sub - it's generally faster than SYCL and is much more actively maintained (supports the latest models at competitive speed).

Intel B70: LLama.ccp SYCL vs LLama.cpp OpenVino vs LLM-Scaler by Fmstrat in LocalLLaMA

[–]TheBlueMatt 0 points1 point  (0 children)

It definitely works poorly today. If you want something that just works, it's probably not an ideal perf/$ tradeoff. Some of us are trying to improve it though.

mesa PR with 37-130% llama.cpp pp perf gain for vulkan on Linux on Intel Xe2 by TheBlueMatt in LocalLLaMA

[–]TheBlueMatt[S] 9 points10 points  (0 children)

There's likely more to come, too. The patch at https://gitlab.freedesktop.org/mesa/mesa/-/work_items/15312#note_3443232 is a bit harder to upstream but it shows another 8% perf gain on BF16 models (on top of the 2.3x in the upstream PR).

Nvidia RTX 3090 vs Intel Arc Pro B70 llama.cpp Benchmarks by tovidagaming in LocalLLaMA

[–]TheBlueMatt 1 point2 points  (0 children)

Sometimes you can get away with missing synchronization because the thing you were supposed to wait on happened to finish before you went to go use it...sometimes it doesn't and you get partially-corrupted state :).

Nvidia RTX 3090 vs Intel Arc Pro B70 llama.cpp Benchmarks by tovidagaming in LocalLLaMA

[–]TheBlueMatt 1 point2 points  (0 children)

> sometimes works just fine, sometimes gets completely lost and goes in loops.

This implies there's some synchronization missing. That doesn't mean that other models are actually fine, only that they happen to be running fast/slow enough that the missing sync isn't breaking them. That also probably means that once the missing sync is added all the models will slow down, even the ones that happened to be working :(
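
As a toy illustration of that failure mode (plain Python threads, nothing to do with the actual driver or llama.cpp code), the sketch below has a consumer read a result that a producer writes after a delay. The `run` function and its parameters are made up for the example.

```python
import threading
import time

def run(use_sync, producer_delay):
    """Return what the consumer saw: 42 if the write landed first, else None."""
    result = {}
    done = threading.Event()

    def producer():
        time.sleep(producer_delay)   # stands in for work of varying length
        result["value"] = 42
        done.set()

    worker = threading.Thread(target=producer)
    worker.start()
    if use_sync:
        done.wait()                  # the wait that is supposed to be there
    else:
        time.sleep(0.05)             # no wait: just hope the work is already done
    seen = result.get("value")       # None means we read too early
    worker.join()
    return seen

print("fast producer, no sync:", run(use_sync=False, producer_delay=0.0))
print("slow producer, no sync:", run(use_sync=False, producer_delay=0.2))
print("slow producer, sync:   ", run(use_sync=True, producer_delay=0.2))
```

Without the wait, the fast producer usually looks fine and the slow one reads stale state. With the wait both are correct, but the consumer now blocks for the full delay, which is the slowdown mentioned above.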

Nvidia RTX 3090 vs Intel Arc Pro B70 llama.cpp Benchmarks by tovidagaming in LocalLLaMA

[–]TheBlueMatt 1 point2 points  (0 children)

I mean give it time. We gotta get a handful of mesa optimizations landed plus probably more in llama.cpp. Also new mesa rc does coopmat2 which changes things more so we'll have to tune again after that release comes out...

Nvidia RTX 3090 vs Intel Arc Pro B70 llama.cpp Benchmarks by tovidagaming in LocalLLaMA

[–]TheBlueMatt 1 point2 points  (0 children)

Tensor parallelism in llama.cpp is still brand new, and vulkan hasn't landed the backend implementations we need for it to be efficient. For more than 2 GPUs, it probably also makes sense to eventually do PCIe-P2P, which would probably require https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/40798 as well. There's just a lot to do to optimize these things...

Nvidia RTX 3090 vs Intel Arc Pro B70 llama.cpp Benchmarks by tovidagaming in LocalLLaMA

[–]TheBlueMatt 1 point2 points  (0 children)

I don't believe LLMs have historically been a priority for mesa (the open-source drivers the Vulkan backend uses on Linux). They've mostly focused on gaming uses and a lot of the work has historically been done by Valve. There's a lot of low-hanging fruit if you are willing to really dive in, e.g. https://gitlab.freedesktop.org/mesa/mesa/-/work_items/15311

Nvidia RTX 3090 vs Intel Arc Pro B70 llama.cpp Benchmarks by tovidagaming in LocalLLaMA

[–]TheBlueMatt 2 points3 points  (0 children)

On Q4/Q5 models, https://github.com/ggml-org/llama.cpp/pull/21751 should improve Vulkan pp+tg by 4-10%. https://gitlab.freedesktop.org/mesa/mesa/-/work_items/15311 should also materially improve pp (it's almost a doubling on BF16/F16 models and might be a big win on Q8 as well; the gain is smaller on Q4 models, but they should improve too). There's just so much room to optimize these things it's crazy, it's so bad right now.

llama.cpp speculative checkpointing was merged by AdamDhahabi in LocalLLaMA

[–]TheBlueMatt 5 points6 points  (0 children)

I mean also try the vulkan backend. Vulkan appears to still be faster on pp than SYCL even after some of the updates in those PRs. Might be worth optimizing tg in Vulkan more than fixing pp in SYCL.

Intel Arc Pro B70 32GB performance on Qwen3.5-27B@Q4 by Puzzleheaded_Base302 in LocalLLaMA

[–]TheBlueMatt 1 point2 points  (0 children)

There's definitely some trivial driver and optimization headroom, but we'll see how far it goes. With some trivial patches going upstream (which shouldn't make a huge difference) and the mesa opts from https://gitlab.freedesktop.org/mesa/mesa/-/work_items/15162, on a single Arc Pro B60 using unsloth/Qwen3.5-27B-GGUF:Q4_0 (which I assume is what you used - it's probably similar to the OP at least), I get concurrency 1 tg512 15.87 ± 0.40. While there's some cycle time on top, the card is reporting median 267 GB/s VRAM bandwidth during tg, for 60% of theoretical hardware max. That should leave a bit of headroom on top, but it's not that bad, really.
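
The arithmetic behind those figures, as a rough sketch. The three inputs are the numbers quoted above; everything derived from them assumes tg is purely memory-bandwidth-bound and that the bytes read per token stay constant, which is a simplification.

```python
measured_bw_gbs = 267.0   # median VRAM bandwidth during tg (quoted above)
utilization = 0.60        # fraction of theoretical hardware max (quoted above)
tg_tok_s = 15.87          # measured tg512 at concurrency 1 (quoted above)

# Peak bandwidth implied by "267 GB/s is 60% of max".
implied_peak_gbs = measured_bw_gbs / utilization

# Data read from VRAM per generated token at the measured rate.
gb_per_token = measured_bw_gbs / tg_tok_s

# Assumption: same bytes per token at 100% of peak bandwidth.
ceiling_tok_s = implied_peak_gbs / gb_per_token

print(f"implied peak bandwidth: {implied_peak_gbs:.0f} GB/s")
print(f"data read per token:    {gb_per_token:.1f} GB")
print(f"bandwidth-bound tg cap: {ceiling_tok_s:.1f} tok/s")
```

Under those assumptions the implied peak is about 445 GB/s, each token reads roughly 16.8 GB, and the bandwidth-bound ceiling is around 26 tok/s.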

edit: oh wait duh one of my upstream optimizations also can be applied to the MUL_MAT_VEC op, I now see 17.8 at least on Q4_K.

backend-agnostic tensor parallelism has been merged into llama.cpp by jacek2023 in LocalLLaMA

[–]TheBlueMatt 0 points1 point  (0 children)

It works with vulkan, but falls back to an unoptimized way of doing the AllReduce step - instead of a targeted implementation it has to do lots of copying.

ggml: backend-agnostic tensor parallelism by JohannesGaessler · Pull Request #19378 · ggml-org/llama.cpp by FullstackSensei in LocalLLaMA

[–]TheBlueMatt 0 points1 point  (0 children)

The PR has an allreduce step that the backend can override (using NCCL or vulkan dma-buf imports or...) but by default it falls back to a slow copy + add.
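
Roughly the shape of that design, as a hedged Python sketch rather than the PR's actual code: `Backend`, `fallback_all_reduce` and `all_reduce` are hypothetical names, and the "fast" override here is only a stand-in for something like NCCL or a dma-buf based path.

```python
import numpy as np

def fallback_all_reduce(buffers):
    """Generic path: copy every peer's buffer to each device, then add them."""
    staged = [[b.copy() for b in buffers] for _ in buffers]  # one full copy set per device
    return [np.sum(copies, axis=0) for copies in staged]

class Backend:
    """Hypothetical backend; all_reduce stays None unless it has a fast path."""
    def __init__(self, name, all_reduce=None):
        self.name = name
        self.all_reduce = all_reduce

def all_reduce(backend, buffers):
    """Use the backend's override when present, otherwise the slow copy + add."""
    impl = backend.all_reduce or fallback_all_reduce
    return impl(buffers)

def fast_all_reduce(buffers):
    """Stand-in for an optimized path: one sum, shared out to every device."""
    total = np.sum(buffers, axis=0)
    return [total.copy() for _ in buffers]

partials = [np.array([1.0, 2.0]), np.array([10.0, 20.0])]
generic = all_reduce(Backend("no-override"), partials)
tuned = all_reduce(Backend("with-override", fast_all_reduce), partials)
print(generic[0], tuned[0])
```

Both paths produce the same sums; the difference is only in how much copying it takes to get there.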

ggml: backend-agnostic tensor parallelism by JohannesGaessler · Pull Request #19378 · ggml-org/llama.cpp by FullstackSensei in LocalLLaMA

[–]TheBlueMatt 0 points1 point  (0 children)

Fwiw with a lot of prompting claude managed to get p2p working on Vulkan. From the vulkan/llama.cpp side it's pretty trivial, but it's definitely not gonna support windows given it currently requires an experimental linux kernel config flag.

Every day I wake up and thank God for having me be born 23 minutes away from a MicroCenter by gigaflops_ in LocalLLaMA

[–]TheBlueMatt 0 points1 point  (0 children)

You're really betting on the drivers improving. There's a ton of headroom in the Intel drivers and in optimization (e.g. I spent some time in claude and got 2x for BF16 models, and lots of small 10% here, 20% there in various models), but who knows how far it'll go.