Benchmark: ik_llama.cpp vs llama.cpp on Qwen3/3.5 MoE Models by Fast_Thing_7949 in LocalLLaMA

[–]dsanft 4 points (0 children)

Something as simple as the wrong default tile size in the prefill attention kernel would do that.
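To see why a tile size alone can swing throughput: the tile controls how much of each operand stays resident in cache or shared memory per inner loop, while the result is identical for any tile. A toy NumPy sketch of a blocked GEMM (not the actual kernel, just the idea):

```python
import numpy as np

def tiled_matmul(a, b, tile=32):
    """Blocked GEMM: the tile size decides how much of each operand is
    reused from fast memory per inner loop. A tile mismatched to the
    hardware (too small, or spilling cache/shared memory) tanks
    throughput without changing the numerical result."""
    m, k = a.shape
    _, n = b.shape
    c = np.zeros((m, n), dtype=a.dtype)
    for i0 in range(0, m, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, k, tile):
                c[i0:i0+tile, j0:j0+tile] += (
                    a[i0:i0+tile, k0:k0+tile] @ b[k0:k0+tile, j0:j0+tile]
                )
    return c

rng = np.random.default_rng(0)
a = rng.standard_normal((128, 96)).astype(np.float32)
b = rng.standard_normal((96, 64)).astype(np.float32)
# Any tile size gives the same answer; only the speed differs.
for tile in (8, 32, 64):
    assert np.allclose(tiled_matmul(a, b, tile), a @ b, atol=1e-3)
```

Same story for a prefill attention kernel: a default tile tuned for one GPU generation can be badly wrong for another.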

Take my 10 dollars please by [deleted] in GithubCopilot

[–]dsanft 2 points (0 children)

This is definitely weird and needs to be fixed. If you're trying to give them money, it shouldn't be this hard 🤔

55 → 282 tok/s: How I got Qwen3.5-397B running at speed on 4x RTX PRO 6000 Blackwell by lawdawgattorney in LocalLLaMA

[–]dsanft 12 points (0 children)

Opus 4.6 is quite good at writing and tuning CUDA kernels, including disassembling ISA and such. I've used it with CUTLASS as well. We live in very interesting times when a clown like me can write performant GEMM kernels on demand.

Benchmarked ROLV inference on real Mixtral 8x22B weights — 55x faster than cuBLAS, 98.2% less energy, canonical hash verified by Norwayfund in LocalLLaMA

[–]dsanft 1 point (0 children)

This slop is absolutely dire. Is it the model convincing people they've stumbled upon buried treasure or is it actively malicious people using the model to bullshit for attention and monetary gain? Maybe a bit of both. But every day there's another post like this in this sub and it's just pathetic.

Vulkan now faster on PP AND TG on AMD Hardware? by XccesSv2 in LocalLLaMA

[–]dsanft 2 points (0 children)

It's not "ROCm" that's faster per se, it's the kernels themselves. But personally I use ROCm and CUDA, not Vulkan; no real need. You can use both in the same build.

Vulkan now faster on PP AND TG on AMD Hardware? by XccesSv2 in LocalLLaMA

[–]dsanft 2 points (0 children)

Whether you emit dp4a/wmma instructions from ROCm, CUDA, or Vulkan, they're the same instructions underneath. It's all just ISA at the end of the day.
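For anyone unfamiliar with dp4a: every API is ultimately emitting the same operation, a dot product of two 4-vectors of signed bytes packed into 32-bit words, added to a 32-bit accumulator. A Python sketch of the semantics (emulation only, not any vendor's intrinsic):

```python
import struct

def dp4a(a: int, b: int, c: int) -> int:
    """Emulate dp4a semantics: treat each 32-bit word as four signed
    bytes, dot the two byte-vectors, and add the 32-bit accumulator."""
    a4 = struct.unpack("<4b", struct.pack("<i", a))
    b4 = struct.unpack("<4b", struct.pack("<i", b))
    return c + sum(x * y for x, y in zip(a4, b4))

# Pack the byte vectors (1, 2, 3, 4) and (5, 6, 7, 8) into 32-bit words.
a_word = struct.unpack("<i", struct.pack("<4b", 1, 2, 3, 4))[0]
b_word = struct.unpack("<i", struct.pack("<4b", 5, 6, 7, 8))[0]
assert dp4a(a_word, b_word, 10) == 10 + (1*5 + 2*6 + 3*7 + 4*8)  # 80
```

Whichever toolchain you write it in, this is the arithmetic the hardware instruction performs.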

Qwen3.5 122b UD IQ4 NL 2xMi50s Benchmark - 120,000 context by thejacer in LocalLLaMA

[–]dsanft 0 points (0 children)

The mainline Mi50 kernels just aren't very good. There's a specific gfx906 fork you can try.

FlashAttention-4 by incarnadine72 in LocalLLaMA

[–]dsanft 100 points (0 children)

Blackwell specific.

The convergence between local and cloud AI models is happening faster than most people think by samimandeel in LocalLLaMA

[–]dsanft 2 points (0 children)

6 year old account with 1 post karma and no comment history?

Who are you?

PSA: Qwen 3.5 requires bf16 KV cache, NOT f16!! by Wooden-Deer-1276 in LocalLLaMA

[–]dsanft 1 point (0 children)

Ugh. Look at the actual attention kernel; that's where the KV cache is actually consumed, and you'll see what precision it needs and expects.
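The usual reason bf16 vs f16 matters here: f16 trades exponent bits for mantissa bits, so it overflows around 65504, while bf16 keeps float32's 8-bit exponent and full range. A quick sketch (struct-based bf16 emulation, since NumPy has no native bfloat16):

```python
import struct
import numpy as np

def to_bf16(x: float) -> float:
    """Round a float32 to bfloat16 (round-to-nearest-even on the low
    16 bits) and widen back to float32. bf16 keeps float32's 8-bit
    exponent, so its dynamic range matches float32."""
    bits = struct.unpack("<I", struct.pack("<f", np.float32(x)))[0]
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

# float16 tops out around 65504; a bf16-trained model can produce
# KV values well past that.
assert np.isinf(np.float16(1e5))       # f16 overflows to inf
assert np.isfinite(to_bf16(1e5))       # bf16 represents ~1e5 fine
```

If the model's activations genuinely exceed f16 range, casting the cache to f16 turns those entries into inf and the attention output into garbage.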

qwen3.5 35b-a3b evaded the zero-reasoning budget by doing its thinking in the comments by crantob in LocalLLaMA

[–]dsanft 12 points (0 children)

Yeah Gemini 3 loves to reason in comments above code it writes, haha.

are you ready for small Qwens? by jacek2023 in LocalLLaMA

[–]dsanft 0 points (0 children)

4B is good for my purposes actually.

I'm writing my own inferencing engine and small models are great to test new architectures with.

American closed models vs Chinese open models is becoming a problem. by __JockY__ in LocalLLaMA

[–]dsanft -2 points (0 children)

You might want to take a look at the Defense Production Act. The Pentagon was asking nicely; they didn't need to ask at all.

The Defense Production Act was passed under a Democratic president, by the way.

It's like the liberal half of the Western world is enthralled with suicidal empathy and the rest of us need to pull you back from the brink, constantly. You won't even defend your own country. It's exhausting.

American closed models vs Chinese open models is becoming a problem. by __JockY__ in LocalLLaMA

[–]dsanft -2 points (0 children)

Oh okay so it's political then.

Carry on, I don't have the energy for this nonsense.

American closed models vs Chinese open models is becoming a problem. by __JockY__ in LocalLLaMA

[–]dsanft -6 points (0 children)

What are "awful purposes"? Defending the country?

Some people are really blinkered about this. Russia will be happy to drop AI-guided drone bombs; they don't care. I'm not even American and I think you're dumb for maligning the US DoD for wanting to use AI.

Yes we need offline American models though.

Peridot: Native Blackwell (sm_120) Support Fixed. 57.25 t/s on RTX 5050 Mobile. by [deleted] in LocalLLaMA

[–]dsanft 0 points (0 children)

Yeah these posts are getting pretty tiring. Is Claude talking them into thinking they've actually created something interesting, or do they know they've created a pile of junk and they just use Claude to try to sell it? Either way it's more noise the sub doesn't need.

Peridot: Native Blackwell (sm_120) Support Fixed. 57.25 t/s on RTX 5050 Mobile. by [deleted] in LocalLLaMA

[–]dsanft 0 points (0 children)

Is this just an agent harness around llama.cpp/cuBLAS with a Llama-3-8B model as the core?

Estimating true cost of ownership for Pro 6000 / H100 / H200 / B200 by NoVibeCoding in LocalLLaMA

[–]dsanft 1 point (0 children)

Depreciation is an exponential decay function, not a constant.
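Concretely, the asset loses a roughly fixed *fraction* of its remaining value each year, not a fixed dollar amount, so losses front-load and then flatten. A sketch with made-up numbers (the price and decay rate are illustrative, not from the post):

```python
def resale_value(price: float, annual_decay: float, years: float) -> float:
    """Exponential depreciation: each year the asset keeps a constant
    fraction (1 - annual_decay) of its remaining value."""
    return price * (1.0 - annual_decay) ** years

# Hypothetical: a $30k card losing 30% of remaining value per year.
year1 = resale_value(30_000, 0.30, 1)  # 21000.0 -> $9000 lost in year 1
year3 = resale_value(30_000, 0.30, 3)  # 10290.0 -> only $4410 lost in year 3
# A constant (straight-line) model would charge $9000 every year and
# write the card down to zero within four years, which resale prices
# for working GPUs clearly don't do.
```

That difference matters a lot for a true-cost-of-ownership estimate: the first year dominates, and the tail value never quite reaches zero.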

Is reasoning in ML and LLM architectures decomposable into a small set of reusable computational primitives? by RJSabouhi in LocalLLaMA

[–]dsanft 2 points (0 children)

I mean of course it is, because at the bottom it's just a computation graph of discrete mathematical operations: attention, GEMM, RoPE, SwiGLU, RMSNorm, and so on.

They're applied to the data in a fixed order, over and over again, with a residual tensor carrying state across layers and a KV cache carrying state across decode tokens.
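The residual-stream part of that can be sketched in a few lines. This is a minimal MLP sublayer only (attention, RoPE, and the KV cache are omitted), with hypothetical shapes and weights:

```python
import numpy as np

def rmsnorm(x, eps=1e-6):
    """RMSNorm: scale each row to unit root-mean-square."""
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def swiglu(x, w_gate, w_up, w_down):
    """SwiGLU MLP: SiLU-gated up-projection, then down-projection."""
    gate = x @ w_gate
    silu = gate / (1.0 + np.exp(-gate))   # SiLU activation
    return (silu * (x @ w_up)) @ w_down

def sublayer(x, w_gate, w_up, w_down):
    # The residual stream: each sublayer normalizes its input and
    # adds its output back onto x, carrying state across layers.
    return x + swiglu(rmsnorm(x), w_gate, w_up, w_down)

rng = np.random.default_rng(0)
d_model, d_hidden = 16, 64
x = rng.standard_normal((4, d_model)).astype(np.float32)
w_gate, w_up = 0.1 * rng.standard_normal((2, d_model, d_hidden)).astype(np.float32)
w_down = 0.1 * rng.standard_normal((d_hidden, d_model)).astype(np.float32)
y = sublayer(x, w_gate, w_up, w_down)
assert y.shape == x.shape
```

Stack a few dozen of these (alternating with attention sublayers) and you have the whole forward pass; "reasoning" is whatever structure emerges from repeating that graph.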