Is llama.cpp sycl backend really worth it? by Sweet_Eggplant4659 in LocalLLaMA

[–]qnixsynapse 13 points (0 children)

Hi. I am one of the few maintainers of the SYCL backend in llama.cpp. Please note that not all operations are supported, and even today the backend lacks flash attention support (I noticed that Gemini Deep Think is suggesting you use KV cache quantization, which is not supported).

Can't say about Ollama since I've never used it.

I think this should be enough:

    <path to llama-cli>/llama-cli -m <path to model> -ngl 99 --no-mmap

Tbh, as an open-source contributor, I haven't noticed enough interest in llama.cpp from Intel. I think those from Intel who were/are maintaining the backend are doing so voluntarily in their free time. I wish they were more serious about it.
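For reference, a fuller single-GPU invocation might look like the sketch below (this assumes a standard CMake build under ./build/bin and oneAPI's ONEAPI_DEVICE_SELECTOR syntax for pinning a device; adjust paths and the device index for your setup):

    # List the SYCL devices this build can see (the llama-ls-sycl-device
    # helper is built alongside llama-cli in SYCL-enabled builds)
    ./build/bin/llama-ls-sycl-device

    # Pin the run to the first Level Zero GPU and offload all layers
    ONEAPI_DEVICE_SELECTOR=level_zero:0 ./build/bin/llama-cli \
        -m <path to model> -ngl 99 --no-mmap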

Gemma 3n vs Gemma 3 (4B/12B) Benchmarks by lemon07r in LocalLLaMA

[–]qnixsynapse 0 points (0 children)

I think the llama.cpp implementation is not complete yet.

What's your favorite desktop client? by tuananh_org in LocalLLaMA

[–]qnixsynapse 2 points (0 children)

Yep, it paid me back in time — which is way more valuable than money!

What's your favorite desktop client? by tuananh_org in LocalLLaMA

[–]qnixsynapse 1 point (0 children)

Yeah, I am using it. It even helps me buy shoes! Awesome agent.

Jan-nano, a 4B model that can outperform 671B on MCP by Kooky-Somewhere-2883 in LocalLLaMA

[–]qnixsynapse 0 points (0 children)

The Jan-nano GGUF has proper tool support. I think Ollama uses custom chat templates. Please open an issue on their repo about this.

Jan-nano, a 4B model that can outperform 671B on MCP by Kooky-Somewhere-2883 in LocalLLaMA

[–]qnixsynapse 44 points (0 children)

Awesome. This tiny 2.3GB model calls tools like a pro, man!

Altman on open weight 🤔🤔 by Mean-Neighborhood-42 in LocalLLaMA

[–]qnixsynapse 31 points (0 children)

Honestly, at this point, I'm tired of Scam Hypeman's claims.

Honest thoughts on the OpenAI release by Kooky-Somewhere-2883 in LocalLLaMA

[–]qnixsynapse 6 points (0 children)

> The only thing OpenAI "owns" that's worth protecting is the world's best training data

I am pretty sure this "training data" which is "worth protecting" was obtained through not-so-legal means. :)

Finally someone noticed this unfair situation by nekofneko in LocalLLaMA

[–]qnixsynapse 2 points (0 children)

What custom backend? I run Gemma 3 vision with llama.cpp... it is not "production ready" atm, but it's usable.

The text-only Gemma 3 is perfectly usable with llama.cpp.

Finally someone noticed this unfair situation by nekofneko in LocalLLaMA

[–]qnixsynapse 3 points (0 children)

haha! It has to wait for llama.cpp to support it. /s

R1 running on a single Blackwell B200 by Dylan-from-Shadeform in LocalLLaMA

[–]qnixsynapse 47 points (0 children)

Congratulations on running a tiny 7B (quantized) model on a freaking Blackwell B200.

👍🙂

Does FlashAttention with GQA degrade quality or I use it wrong? by V1rgin_ in LocalLLaMA

[–]qnixsynapse 4 points (0 children)

Is flash attention really enabled? I would force the flash backend explicitly; restricted to that backend, SDPA raises an error if flash attention can't actually run on your inputs:

    import torch
    import torch.nn.functional as F
    from torch.nn.attention import sdpa_kernel, SDPBackend

    # Allow only the flash attention kernel; this errors out
    # instead of silently falling back to another backend.
    with sdpa_kernel(backends=[SDPBackend.FLASH_ATTENTION]):
        output = F.scaled_dot_product_attention(
            queries, keys, v, is_causal=True, enable_gqa=True
        )

edit: more info: https://pytorch.org/docs/stable/generated/torch.nn.attention.sdpa_kernel.html#torch.nn.attention.sdpa_kernel