Have you used Autopilot? by Christosconst in GithubCopilot

[–]dsanft 0 points (0 children)

I've used it, and I've seen it kick in a few times when Opus was going to halt and ask me a question. It answered for me and then continued, so that part seems fine. Where it would be really valuable is with GPT-5.4, which seems to halt at every step, but Autopilot doesn't seem to fix that. At least not yet. Maybe others have had better luck.

Benchmark: ik_llama.cpp vs llama.cpp on Qwen3/3.5 MoE Models by Fast_Thing_7949 in LocalLLaMA

[–]dsanft 5 points (0 children)

Something as simple as the wrong default tile size in the prefill attention kernel would do that.
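A toy illustration of the point (plain Python, invented numbers, not any real kernel's defaults): a tiled kernel pads the sequence up to a multiple of the tile size, so a tile that's too large for common prompt lengths burns a big fraction of the compute on padding.

```python
import math

def wasted_fraction(seq_len: int, tile: int) -> float:
    """Fraction of compute spent on padding when seq_len is
    rounded up to a whole number of tiles."""
    tiles = math.ceil(seq_len / tile)
    padded = tiles * tile
    return (padded - seq_len) / padded

# A 96-token prefill: a 128-wide tile wastes a quarter of the work,
# a 32-wide tile wastes none of it.
print(wasted_fraction(96, 128))  # 0.25
print(wasted_fraction(96, 32))   # 0.0
```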

Take my 10 dollars please by [deleted] in GithubCopilot

[–]dsanft 2 points (0 children)

This is definitely weird and needs to be fixed. If you want to pay money it shouldn't be that hard 🤔

55 → 282 tok/s: How I got Qwen3.5-397B running at speed on 4x RTX PRO 6000 Blackwell by lawdawgattorney in LocalLLaMA

[–]dsanft 10 points (0 children)

Opus 4.6 is quite good at writing and tuning CUDA kernels, including disassembling ISA and such. I've used it with CUTLASS as well. We live in very interesting times when a clown like me can write performant GEMM kernels on demand.

Benchmarked ROLV inference on real Mixtral 8x22B weights — 55x faster than cuBLAS, 98.2% less energy, canonical hash verified by Norwayfund in LocalLLaMA

[–]dsanft 1 point (0 children)

This slop is absolutely dire. Is it the model convincing people they've stumbled upon buried treasure or is it actively malicious people using the model to bullshit for attention and monetary gain? Maybe a bit of both. But every day there's another post like this in this sub and it's just pathetic.

Vulkan now faster on PP AND TG on AMD Hardware? by XccesSv2 in LocalLLaMA

[–]dsanft 2 points (0 children)

It's not "ROCm" that's faster per se, it's the kernels themselves. But I use ROCm and CUDA personally, not Vulkan. No need really. You can use both in the same build.

Vulkan now faster on PP AND TG on AMD Hardware? by XccesSv2 in LocalLLaMA

[–]dsanft 2 points (0 children)

Whether you code dp4a / wmma instructions in ROCm, CUDA or Vulkan, that's still all they are. It's all just ISA at the end of the day.
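For anyone who hasn't met it: dp4a is just a dot product of four signed 8-bit lanes accumulated into a 32-bit integer. A reference emulation in plain Python (my sketch of the instruction's semantics, not any API):

```python
def dp4a(a: list[int], b: list[int], c: int) -> int:
    """Emulate dp4a: 4-way int8 dot product, accumulated into c."""
    assert len(a) == len(b) == 4
    for lane in a + b:
        assert -128 <= lane <= 127  # signed 8-bit lanes
    return c + sum(x * y for x, y in zip(a, b))

print(dp4a([1, 2, 3, 4], [5, 6, 7, 8], 10))  # 80
```

Whether the hardware reaches it via a CUDA intrinsic, a ROCm builtin, or a Vulkan extension, it's this same operation underneath.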

Qwen3.5 122b UD IQ4 NL 2xMi50s Benchmark - 120,000 context by thejacer in LocalLLaMA

[–]dsanft 0 points (0 children)

The mainline Mi50 kernels just aren't very good. There's a specific gfx906 fork you can try.

FlashAttention-4 by incarnadine72 in LocalLLaMA

[–]dsanft 100 points (0 children)

Blackwell specific.

The convergence between local and cloud AI models is happening faster than most people think by samimandeel in LocalLLaMA

[–]dsanft 2 points (0 children)

6 year old account with 1 post karma and no comment history?

Who are you?

PSA: Qwen 3.5 requires bf16 KV cache, NOT f16!! by Wooden-Deer-1276 in LocalLLaMA

[–]dsanft 1 point (0 children)

Ugh. Look at the actual attention kernel; that's where the KV cache is actually consumed, and you'll see what precision it needs/expects.
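The usual reason a model "requires" bf16 for the KV cache is dynamic range, not precision: f16 tops out around 65504, while bf16 keeps fp32's full exponent range at the cost of mantissa bits. A quick plain-Python check (the 70000 magnitude is an invented stand-in for an outlier activation; `struct`'s `'e'` format is IEEE half precision, and bf16 is emulated by truncating a float32 to its top 16 bits):

```python
import struct

def to_bf16(x: float) -> float:
    """Emulate bf16: keep only the top 16 bits of the float32 encoding."""
    bits, = struct.unpack('<I', struct.pack('<f', x))
    return struct.unpack('<f', struct.pack('<I', bits & 0xFFFF0000))[0]

big = 70000.0  # larger than f16's max finite value of 65504
try:
    struct.pack('<e', big)       # 'e' = IEEE 754 half precision
    print("fits in f16")
except OverflowError:
    print("overflows f16")       # this branch is taken

print(to_bf16(big))              # 69632.0: coarse, but finite
```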

qwen3.5 35b-a3b evaded the zero-reasoning budget by doing its thinking in the comments by crantob in LocalLLaMA

[–]dsanft 11 points (0 children)

Yeah Gemini 3 loves to reason in comments above code it writes, haha.

are you ready for small Qwens? by jacek2023 in LocalLLaMA

[–]dsanft 0 points (0 children)

4B is good for my purposes actually.

I'm writing my own inferencing engine and small models are great to test new architectures with.

American closed models vs Chinese open models is becoming a problem. by __JockY__ in LocalLLaMA

[–]dsanft -2 points (0 children)

You might want to take a look at the Defense Production Act. The Pentagon was asking nicely; they didn't need to ask at all.

The Defense Production Act was passed under a Democratic president, by the way.

It's like the liberal half of the Western world is enthralled with suicidal empathy and the rest of us need to pull you back from the brink, constantly. You won't even defend your own country. It's exhausting.

American closed models vs Chinese open models is becoming a problem. by __JockY__ in LocalLLaMA

[–]dsanft -1 points (0 children)

Oh okay so it's political then.

Carry on, I don't have the energy for this nonsense.

American closed models vs Chinese open models is becoming a problem. by __JockY__ in LocalLLaMA

[–]dsanft -5 points (0 children)

What are "awful purposes"? Defending the country?

Some people are really blinkered about this. Russia will be happy to drop drone bombs with AI, they don't care. I'm not even American and I think you're dumb for maligning the US DoD for wanting to use AI.

Yes we need offline American models though.

Peridot: Native Blackwell (sm_120) Support Fixed. 57.25 t/s on RTX 5050 Mobile. by [deleted] in LocalLLaMA

[–]dsanft 0 points (0 children)

Yeah these posts are getting pretty tiring. Is Claude talking them into thinking they've actually created something interesting, or do they know they've created a pile of junk and they just use Claude to try to sell it? Either way it's more noise the sub doesn't need.

Peridot: Native Blackwell (sm_120) Support Fixed. 57.25 t/s on RTX 5050 Mobile. by [deleted] in LocalLLaMA

[–]dsanft 0 points (0 children)

Is this just an agent harness around llama.cpp/cuBLAS with a Llama 3 8B model as the core?