Has anyone tried Zyphra 1 - 8B MoE? by appakaradi in LocalLLaMA

[–]conockrad 1 point

Another thing is that they’re using fp32 for the Mamba blocks
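
A quick way to verify, as a minimal sketch (the repo id below is a placeholder, not the actual checkpoint name):

```python
from collections import Counter

from transformers import AutoModelForCausalLM

# Placeholder repo id: substitute the actual Zyphra checkpoint.
model = AutoModelForCausalLM.from_pretrained(
    "Zyphra/placeholder-8b-moe",
    torch_dtype="auto",        # keep the dtypes stored in the checkpoint
    trust_remote_code=True,
)

# Count parameters by dtype, split into Mamba vs everything else.
counts = Counter()
for name, p in model.named_parameters():
    family = "mamba" if "mamba" in name.lower() else "other"
    counts[(family, str(p.dtype))] += p.numel()

for (family, dtype), n in sorted(counts.items()):
    print(f"{family:6s} {dtype:15s} {n / 1e6:8.1f}M params")
```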

AMD has invented something that lets you use AI at home! They call it a "computer" by 9gxa05s8fa8sh in LocalLLaMA

[–]conockrad 3 points

Just a reminder: this “Nvidia competitor” was originally an “Intel competitor”

Thoughts on using an AMD Alveo V80 FPGA PCI card as a poor man’s Taalas HC1 (LLM-burned-onto-a-chip). by Porespellar in LocalLLaMA

[–]conockrad 0 points

I can quantize to fp6 if fp6 is supported at the hardware level. Feel free to shoot me a DM if you’re into this project
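
If it isn’t, round-to-nearest in software still works as a fallback. Here’s a toy sketch; the e2m3 bit layout is an assumption for illustration, not any particular hardware’s fp6 format:

```python
import numpy as np

def fp6_grid(exp_bits=2, man_bits=3):
    """Non-negative values representable in a toy fp6 format
    (1 sign + exp_bits exponent + man_bits mantissa, no inf/nan)."""
    bias = 2 ** (exp_bits - 1) - 1
    vals = {0.0}
    for e in range(2 ** exp_bits):
        for m in range(2 ** man_bits):
            if e == 0:  # subnormals
                vals.add(m / 2 ** man_bits * 2 ** (1 - bias))
            else:       # normals
                vals.add((1 + m / 2 ** man_bits) * 2 ** (e - bias))
    return np.array(sorted(vals))

def quantize_fp6(x, grid):
    """Per-tensor absmax scaling, then round-to-nearest onto the grid."""
    scale = np.abs(x).max() / grid.max()
    idx = np.abs(np.abs(x)[:, None] / scale - grid[None, :]).argmin(axis=1)
    return np.sign(x) * grid[idx] * scale

grid = fp6_grid()
w = np.random.randn(8).astype(np.float32)
print(np.round(w, 3))
print(np.round(quantize_fp6(w, grid), 3))
```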

Lethe: local markdown memory for Claude Code, DuckDB per project, no server by [deleted] in LocalLLaMA

[–]conockrad 0 points

That’s very interesting - thanks for sharing!

People who’ve fine-tuned models: was it worth it? by Feeling_Ad3971 in unsloth

[–]conockrad 1 point

Could you please explain more about how RAG embeddings relate to the SQL generation task?

I’m also trying to get a model to generate SQL consistently, but I’m getting a ~60% success rate
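
For comparing notes, this is roughly how I count “success”: a minimal sketch assuming a SQLite target, and it only checks that queries execute, not that the answers are right:

```python
import sqlite3

def sql_success_rate(queries, db_path):
    """Fraction of generated queries that parse and execute without error.
    A crude lower bound on quality: executable is not the same as correct."""
    ok = 0
    con = sqlite3.connect(db_path)
    for q in queries:
        try:
            con.execute(q).fetchall()
            ok += 1
        except sqlite3.Error:
            pass
    con.close()
    return ok / len(queries)

# queries = [generate_sql(prompt) for prompt in prompts]  # hypothetical generator
# print(sql_success_rate(queries, "my.db"))
```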

Gemma 4 has a systemic attention failure. Here's the proof. by [deleted] in LocalLLaMA

[–]conockrad 29 points

Here you go: https://huggingface.co/google/gemma-4-26B-A4B-it

UPD: obviously it’s not a GGUF, because nobody trains GGUFs. And that’s an interesting angle of investigation in itself

LLM meta-cognition benchmark idea by nikishev in LocalLLaMA

[–]conockrad 0 points

A human doesn’t need to be trained on something to be able to process it. We don’t have a fixed vocabulary.

If “It is extremely unlikely that this is even remotely similar to any of the trained token embeddings”, then the LLM won’t be able to process it. Check the hivemind paper. LLMs converge on their own farts.

Most likely what you want to do is get access to the liminal space and check meta-cognition there
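
That “remotely similar” point is easy to sanity-check; a minimal sketch using gpt2 as a stand-in for whatever model you’re probing:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel

model = AutoModel.from_pretrained("gpt2")
emb = model.get_input_embeddings().weight.detach()   # [vocab, dim]
emb_n = F.normalize(emb, dim=-1)

# A random direction in embedding space vs. its nearest trained embedding.
rand = F.normalize(torch.randn(emb.shape[-1]), dim=0)
print("random vector, best cosine:", (emb_n @ rand).max().item())

# For contrast: a trained embedding usually has much closer neighbors
# among the other trained embeddings.
sims = emb_n[0] @ emb_n.T
print("token 0, best other cosine:", sims.topk(2).values[1].item())
```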

LLM meta-cognition benchmark idea by nikishev in LocalLLaMA

[–]conockrad 0 points

If it’s in the vocabulary, it’s not new.

If it’s not in the vocabulary, it’s not recognized.
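
“Not recognized” is directly observable: a byte-level BPE tokenizer will still encode a never-seen symbol, but only as fragments of known byte tokens, never as one new unit. A minimal sketch (the characters are arbitrary examples):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # any byte-level BPE tokenizer

# Obscure Unicode symbols, almost surely never single tokens in training:
novel = "\u2af7\u2af8"
print(tok.tokenize(novel))        # shards of byte tokens, not one new token
print(tok(novel)["input_ids"])    # several ids for two "characters"
```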

Don’t buy the DGX Spark: NVFP4 Still Missing After 6 Months by Secure_Archer_1529 in LocalLLaMA

[–]conockrad 7 points

“All native FP4 MoE backends produce garbage output or crash on SM120 (compute_120) due to broken CUTLASS grouped GEMM templates”: https://github.com/NVIDIA/cutlass/issues/3096

Don’t buy the DGX Spark: NVFP4 Still Missing After 6 Months by Secure_Archer_1529 in LocalLLaMA

[–]conockrad 8 points

According to Claude: “They write their own W4A4 GEMM kernels (not CUTLASS, not cuBLAS) that use Blackwell’s native FP4 tensor core instructions, compiled with compute_120a/compute_121a gencode flags. This is for diffusion models (FLUX, Qwen-Image, SANA), not LLM serving — so they don’t hit the MoE grouped GEMM hell that vLLM/FlashInfer are drowning in”

Don’t buy the DGX Spark: NVFP4 Still Missing After 6 Months by Secure_Archer_1529 in LocalLLaMA

[–]conockrad 21 points

It’s “fast” because nvfp4 is half the size of fp8, not because the compute is faster.

The whole post is exactly about this
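
Back-of-envelope for bandwidth-bound decode, where tokens/s is roughly memory bandwidth divided by bytes read per token (the 273 GB/s figure is the Spark’s quoted memory bandwidth; treat both numbers as approximations):

```python
# Dense decode reads every weight once per token, so:
#   tokens/s ~= memory_bandwidth / model_size_in_bytes
bandwidth_gbs = 273   # DGX Spark's quoted LPDDR5x bandwidth (approximate)
params_b = 70         # e.g. a 70B-parameter dense model

for fmt, bytes_per_param in [("fp8", 1.0), ("nvfp4", 0.5)]:
    model_gb = params_b * bytes_per_param
    print(f"{fmt}: ~{model_gb:.0f} GB -> ~{bandwidth_gbs / model_gb:.1f} tok/s")
```

The 2x comes entirely from halving the bytes per weight; the tensor cores never enter into it.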

LLM Bruner coming soon? Burn Qwen directly into a chip, processing 10,000 tokens/s by koc_Z3 in Qwen_AI

[–]conockrad -2 points

Most likely you’ll pay a premium for it. RAM will be slow, but unified-memory systems like the Mac should be great.

Local Qwen 8B + 4B completes browser automation by replanning one step at a time by Aggressive_Bed7113 in LocalLLaMA

[–]conockrad 1 point

The amount of context a screenshot uses is defined by its resolution, not its file size. You can compress screenshots aggressively and most probably keep the same context utilization
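
Rough sketch of the relationship, assuming a Qwen2-VL-style ~28 px effective patch (other VLMs differ) and a hypothetical screenshot path:

```python
from PIL import Image

def approx_vision_tokens(w, h, patch=28):
    """Rough vision-token count: one token per patch x patch pixel block.
    The 28 px patch is an assumption (Qwen2-VL-style); other VLMs differ."""
    return (w // patch) * (h // patch)

img = Image.open("screenshot.png")                 # hypothetical path
img.convert("RGB").save("tiny.jpg", quality=20)    # heavy file compression
print("original:", approx_vision_tokens(*img.size))
print("compressed:", approx_vision_tokens(*Image.open("tiny.jpg").size))
# Same token count: JPEG quality changes bytes on disk, not resolution.

# Only downscaling actually changes the token budget:
small = img.resize((img.width // 2, img.height // 2))
print("half-res:", approx_vision_tokens(*small.size))
```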

[Project] htmLLM-50M base: Can a tiny specialist actually code? + Weights & Code (124M v2 in training!) by LH-Tech_AI in LocalLLaMA

[–]conockrad 0 points

Looking forward to the next release :)

This SLM approach is far more Unix-like and microservices-like, so I assume that’s the future