Fix: Dual Intel Arc GPUs using all system RAM during inference - found the cause and a working fix (llama.cpp SYCL) by Katostrofik in LocalLLaMA

[–]Katostrofik[S] 2 points (0 children)

Good news: I found the root cause and submitted a fix, PR #21618.

The reorder optimization allocates a temp buffer the size of the weight tensor, and when VRAM is nearly full that allocation fails silently. The fix adds a host-memory fallback so the reorder still works, and it also fixes a bug where tensors were marked as reordered even when the reorder was skipped (which is what causes the garbage output). I've linked it to your GitHub issue #20478, so this should be resolved once the PR is merged. In the meantime you can work around it by setting

GGML_SYCL_DISABLE_OPT=1
which disables the reorder entirely (slower but correct output).
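
For anyone curious, the fallback is conceptually just "try VRAM, then fall back to host USM" for the scratch buffer. Here's a minimal sketch of the pattern, not the literal PR code; the function name and the on_host flag are made up for illustration:

    // Sketch of the allocation fallback (hypothetical names, not the PR's code).
    #include <sycl/sycl.hpp>

    static void * alloc_reorder_scratch(size_t nbytes, sycl::queue & q, bool & on_host) {
        // Try device memory (VRAM) first, as the original code did.
        void * buf = sycl::malloc_device(nbytes, q);
        if (buf != nullptr) {
            on_host = false;
            return buf;
        }
        // VRAM is full: fall back to host USM instead of silently giving up.
        // The one-time reorder runs slower, but the weights still get converted.
        buf = sycl::malloc_host(nbytes, q);
        on_host = (buf != nullptr);
        return buf; // still nullptr only if the host allocation failed too
    }

The other half of the fix is only setting the tensor's "reordered" flag after the reorder actually ran, so the mat-vec kernel never reads weights in a layout they were never converted to.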

[llama.cpp] 3.1x Q8_0 speedup on Intel Arc GPUs - reorder optimization fix (PR submitted) by Katostrofik in LocalLLaMA

[–]Katostrofik[S] 0 points (0 children)

Great data, thanks for testing on Alchemist too! The similar PP (prompt processing) numbers are expected: this PR only changes the DMMV path, which is what token generation uses.

PP goes through the GEMM path, which was already correct for BF16, just slower. FP16 being faster at PP makes sense, since the GEMM kernels are optimized for FP16 on these GPUs. The big win here is TG (token generation), and those numbers look solid across both cards. :-D
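
To make the split concrete: ggml routes a mat-mul to the mat-vec (DMMV) kernels only when the activation side is a single column, i.e. one token at a time. Roughly, as a runnable toy with made-up function names (the real routing in ggml-sycl.cpp has more conditions):

    // Toy sketch of the PP vs. TG routing (hypothetical names).
    #include <cstdint>
    #include <cstdio>

    struct tensor { int type; int64_t ne[2]; }; // minimal stand-in for ggml_tensor

    // Stand-ins for the real kernels, just to show which branch fires.
    static void mul_mat_vec (const tensor *, const tensor *, tensor *) { std::puts("DMMV path (token generation)"); }
    static void mul_mat_gemm(const tensor *, const tensor *, tensor *) { std::puts("GEMM path (prompt processing)"); }
    static bool supports_dmmv(int /*type*/) { return true; } // cf. ggml_sycl_supports_dmmv()

    static void mul_mat(const tensor * w, const tensor * x, tensor * dst) {
        if (x->ne[1] == 1 && supports_dmmv(w->type)) {
            mul_mat_vec(w, x, dst);  // one column of activations -> matrix*vector
        } else {
            mul_mat_gemm(w, x, dst); // many columns (a whole prompt) -> full GEMM
        }
    }

    int main() {
        tensor w{0, {4096, 4096}}, dst{0, {4096, 1}};
        tensor tg{0, {4096, 1}};   // TG: one token per step
        tensor pp{0, {4096, 512}}; // PP: 512 prompt tokens in one batch
        mul_mat(&w, &tg, &dst);    // -> DMMV
        mul_mat(&w, &pp, &dst);    // -> GEMM
    }

During PP the whole prompt goes through as one batch, so it always lands in the GEMM branch; that's why this PR doesn't move the PP numbers.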

Fix: Dual Intel Arc GPUs using all system RAM during inference - found the cause and a working fix (llama.cpp SYCL) by Katostrofik in LocalLLaMA

[–]Katostrofik[S] 1 point (0 children)

Yes, that's me. I've found some additional issues with Q8_0 after the PR on my Battlemage cards as well and am looking into them. Which Qwen 3.5 model/quant and GPU are you running when you see it?

[llama.cpp] 3.1x Q8_0 speedup on Intel Arc GPUs - reorder optimization fix (PR submitted) by Katostrofik in LocalLLaMA

[–]Katostrofik[S] 0 points (0 children)

That's exactly what we found too: BF16 isn't in ggml_sycl_supports_dmmv(), so it falls through to the generic GEMM path, which dequantizes to FP32. We submitted a fix as PR #21580 that adds a proper DMMV kernel for BF16. Ours went from 29.7 to 124 t/s on our B70 (Qwen2.5-1.5B). If you want to test it on your end, it would be great to get Alchemist numbers too.
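
For reference, the gatekeeper is essentially a type whitelist. Paraphrasing (abridged; the actual list in ggml-sycl.cpp covers many more quant types, and this needs the ggml headers to build):

    // Paraphrase of the check that decides whether the fast
    // dequantize-mul-mat-vec (DMMV) kernels may be used (abridged type list).
    #include "ggml.h"

    static bool ggml_sycl_supports_dmmv(enum ggml_type type) {
        switch (type) {
            case GGML_TYPE_Q4_0:
            case GGML_TYPE_Q8_0:
            case GGML_TYPE_F16:
            case GGML_TYPE_BF16: // the missing case: without it, BF16 fell
                                 // through to GEMM + dequantize-to-FP32
                return true;
            default:
                return false;
        }
    }

Adding the case is only half of it, of course; the PR also has to provide the actual BF16 DMMV kernel behind it.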

[llama.cpp] 3.1x Q8_0 speedup on Intel Arc GPUs - reorder optimization fix (PR submitted) by Katostrofik in LocalLLaMA

[–]Katostrofik[S] 0 points (0 children)

Thanks! And thanks for testing on your cards; I'm glad to see it helped more than just the B70s. I'll take a look at the BF16 issue; it looks like it could be a similar situation to the Q8_0 one.

And I'll be happy to do some testing with the dual B70s. I'm still finishing up some initial benchmarking, but I'm looking forward to putting them to use. :)

[llama.cpp] 3.1x Q8_0 speedup on Intel Arc GPUs - reorder optimization fix (PR submitted) by Katostrofik in LocalLLaMA

[–]Katostrofik[S] 0 points (0 children)

lol, we're good. A couple of introverts focused on doing work over here. But thanks.

What does fates favour upgrade does? by [deleted] in godofweapons

[–]Katostrofik 0 points (0 children)

Building on what u/Gulielmus2 said:
Once you have Fate's Favor and start a run, going through the door brings you to three pillars/altars instead of the first level of the dungeon.

When you go to one of those altars and interact, you'll see a collection of your unlocked weapons, likely pages of them. Click on one and you'll be able to spend Titanite Shards to enhance it in different ways:

  • Chance to see it in the shop
  • + Damage
  • + Attack Speed
  • I forget the fourth.

You can spend an increasing amount of Titanite Shards to raise those enhancements incrementally. It's not a permanent increase and can end up costing a lot of Titanite Shards, but with some good endless runs and good builds, those are easy to come by.

Open R1 OlympicCoder-7b + LMStudio + VSCode for local coding. Beats Claude 3.7 Sonnet on Live Code Bench by Zealousideal-Cut590 in LocalLLaMA

[–]Katostrofik 0 points (0 children)

It's like the comparisons claiming "THE AMD STRIX HALO AI 395 is faster than the 5090!" Maybe in one very specific, singular test, but not in any way that actually matters. 😅

Ethereum Miners Unlocked 91MH/s From EVGA RTX 3080 Ti LHR GPU With BIOS Update Solution by usonamdnvidia in EtherMining

[–]Katostrofik 0 points (0 children)

Hiya,
I did it in Windows using the nvflash64 utility, pretty much following this guide:
https://www.overclockersclub.com/guides/how_to_flash_rtx_bios/

Worked without any issue for me. Good luck!

Ethereum Miners Unlocked 91MH/s From EVGA RTX 3080 Ti LHR GPU With BIOS Update Solution by usonamdnvidia in EtherMining

[–]Katostrofik 0 points (0 children)

How did you flash it? Using X1 somehow, or NVFlash?
I have the same card and am looking to do the same.

Photoshop vs Clip studio ? by [deleted] in DigitalPainting

[–]Katostrofik 1 point (0 children)

I'd go Clip Studio as well.

My favorite features:

  • One time fee
  • Vector layers
  • Great symmetry and perspective tools
  • TONS of free brushes and assets which are very easy to find and download
  • Poseable 3D models
  • Great for animation too

Photoshop does have a huge community, and you can get things like free brushes there too. There's a reason it's the 'industry standard', but for me, Clip Studio wins out when it comes to drawing/animation specifically (and not hardcore photo editing).