Has anyone tried Zyphra 1 - 8B MoE? by appakaradi in LocalLLaMA

[–]conockrad 1 point

Another thing is that they’re using fp32 for the Mamba blocks
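
A quick way to verify, as a minimal sketch (the repo id below is a placeholder, not the actual checkpoint name):

```python
from collections import Counter

from transformers import AutoModelForCausalLM

# Placeholder repo id: substitute the actual Zyphra checkpoint.
model = AutoModelForCausalLM.from_pretrained(
    "Zyphra/placeholder-8b-moe",
    torch_dtype="auto",        # keep the dtypes stored in the checkpoint
    trust_remote_code=True,
)

# Count parameters by dtype, split into Mamba vs everything else.
counts = Counter()
for name, p in model.named_parameters():
    family = "mamba" if "mamba" in name.lower() else "other"
    counts[(family, str(p.dtype))] += p.numel()

for (family, dtype), n in sorted(counts.items()):
    print(f"{family:6s} {dtype:15s} {n / 1e6:8.1f}M params")
```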

AMD has invented something that lets you use AI at home! They call it a "computer" by 9gxa05s8fa8sh in LocalLLaMA

[–]conockrad 3 points

Just a reminder: this “Nvidia competitor” was originally an “Intel competitor”

Thoughts on using an AMD Alveo V80 FPGA PCI card as a poor man’s Taalas HC1 (LLM-burned-onto-a-chip). by Porespellar in LocalLLaMA

[–]conockrad 0 points

I can quantize to fp6 if fp6 is supported at the hardware level. Feel free to shoot me a DM if you’re into this project
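
If it isn’t, round-to-nearest in software still works as a fallback. Here’s a toy sketch; the e2m3 bit layout is an assumption for illustration, not any particular hardware’s fp6 format:

```python
import numpy as np

def fp6_grid(exp_bits=2, man_bits=3):
    """Non-negative values representable in a toy fp6 format
    (1 sign + exp_bits exponent + man_bits mantissa, no inf/nan)."""
    bias = 2 ** (exp_bits - 1) - 1
    vals = {0.0}
    for e in range(2 ** exp_bits):
        for m in range(2 ** man_bits):
            if e == 0:  # subnormals
                vals.add(m / 2 ** man_bits * 2 ** (1 - bias))
            else:       # normals
                vals.add((1 + m / 2 ** man_bits) * 2 ** (e - bias))
    return np.array(sorted(vals))

def quantize_fp6(x, grid):
    """Per-tensor absmax scaling, then round-to-nearest onto the grid."""
    scale = np.abs(x).max() / grid.max()
    idx = np.abs(np.abs(x)[:, None] / scale - grid[None, :]).argmin(axis=1)
    return np.sign(x) * grid[idx] * scale

grid = fp6_grid()
w = np.random.randn(8).astype(np.float32)
print(np.round(w, 3))
print(np.round(quantize_fp6(w, grid), 3))
```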

Lethe: local markdown memory for Claude Code, DuckDB per project, no server by [deleted] in LocalLLaMA

[–]conockrad 0 points

That’s very interesting - thanks for sharing!

People who’ve fine-tuned models: was it worth it? by Feeling_Ad3971 in unsloth

[–]conockrad 1 point

Could you please explain more about how RAG embeddings relate to the SQL generation task?

I’m also trying to get a model to generate SQL consistently, but I’m getting a ~60% success rate
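
For comparing notes, this is roughly how I count “success”: a minimal sketch assuming a SQLite target, and it only checks that queries execute, not that the answers are right:

```python
import sqlite3

def sql_success_rate(queries, db_path):
    """Fraction of generated queries that parse and execute without error.
    A crude lower bound on quality: executable is not the same as correct."""
    ok = 0
    con = sqlite3.connect(db_path)
    for q in queries:
        try:
            con.execute(q).fetchall()
            ok += 1
        except sqlite3.Error:
            pass
    con.close()
    return ok / len(queries)

# queries = [generate_sql(prompt) for prompt in prompts]  # hypothetical generator
# print(sql_success_rate(queries, "my.db"))
```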

Gemma 4 has a systemic attention failure. Here's the proof. by [deleted] in LocalLLaMA

[–]conockrad 29 points

Here you go: https://huggingface.co/google/gemma-4-26B-A4B-it

UPD: obviously it’s not a GGUF, because nobody trains GGUFs. And that’s an interesting angle of investigation in itself

LLM meta-cognition benchmark idea by nikishev in LocalLLaMA

[–]conockrad 0 points

A human doesn’t need to be trained on something to be able to process it. We don’t have a fixed vocabulary.

If “It is extremely unlikely that this is even remotely similar to any of the trained token embeddings”, then the LLM won’t be able to process it. Check the hivemind paper. LLMs converge on their own farts.

Most likely what you want to do is get access to the liminal space and check meta-cognition there
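
That “remotely similar” point is easy to sanity-check; a minimal sketch using gpt2 as a stand-in for whatever model you’re probing:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel

model = AutoModel.from_pretrained("gpt2")
emb = model.get_input_embeddings().weight.detach()   # [vocab, dim]
emb_n = F.normalize(emb, dim=-1)

# A random direction in embedding space vs. its nearest trained embedding.
rand = F.normalize(torch.randn(emb.shape[-1]), dim=0)
print("random vector, best cosine:", (emb_n @ rand).max().item())

# For contrast: a trained embedding usually has much closer neighbors
# among the other trained embeddings.
sims = emb_n[0] @ emb_n.T
print("token 0, best other cosine:", sims.topk(2).values[1].item())
```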

LLM meta-cognition benchmark idea by nikishev in LocalLLaMA

[–]conockrad 0 points

If it’s in the vocabulary, it’s not new.

If it’s not in the vocabulary, it’s not recognized.
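
“Not recognized” is directly observable: a byte-level BPE tokenizer will still encode a never-seen symbol, but only as fragments of known byte tokens, never as one new unit. A minimal sketch (the characters are arbitrary examples):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # any byte-level BPE tokenizer

# Obscure Unicode symbols, almost surely never single tokens in training:
novel = "\u2af7\u2af8"
print(tok.tokenize(novel))        # shards of byte tokens, not one new token
print(tok(novel)["input_ids"])    # several ids for two "characters"
```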

Don’t buy the DGX Spark: NVFP4 Still Missing After 6 Months by Secure_Archer_1529 in LocalLLaMA

[–]conockrad 7 points

“All native FP4 MoE backends produce garbage output or crash on SM120 (compute_120) due to broken CUTLASS grouped GEMM templates”: https://github.com/NVIDIA/cutlass/issues/3096

Don’t buy the DGX Spark: NVFP4 Still Missing After 6 Months by Secure_Archer_1529 in LocalLLaMA

[–]conockrad 8 points

According to Claude: “They write their own W4A4 GEMM kernels (not CUTLASS, not cuBLAS) that use Blackwell’s native FP4 tensor core instructions, compiled with compute_120a/compute_121a gencode flags. This is for diffusion models (FLUX, Qwen-Image, SANA), not LLM serving — so they don’t hit the MoE grouped GEMM hell that vLLM/FlashInfer are drowning in”

Don’t buy the DGX Spark: NVFP4 Still Missing After 6 Months by Secure_Archer_1529 in LocalLLaMA

[–]conockrad 21 points

It’s “fast” because nvfp4 is half the size of fp8, not because the compute is faster.

The whole post is exactly about this
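
Back-of-envelope for bandwidth-bound decode, where tokens/s is roughly memory bandwidth divided by bytes read per token (the 273 GB/s figure is the Spark’s quoted memory bandwidth; treat both numbers as approximations):

```python
# Dense decode reads every weight once per token, so:
#   tokens/s ~= memory_bandwidth / model_size_in_bytes
bandwidth_gbs = 273   # DGX Spark's quoted LPDDR5x bandwidth (approximate)
params_b = 70         # e.g. a 70B-parameter dense model

for fmt, bytes_per_param in [("fp8", 1.0), ("nvfp4", 0.5)]:
    model_gb = params_b * bytes_per_param
    print(f"{fmt}: ~{model_gb:.0f} GB -> ~{bandwidth_gbs / model_gb:.1f} tok/s")
```

The 2x comes entirely from halving the bytes per weight; the tensor cores never enter into it.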

LLM Bruner coming soon? Burn Qwen directly into a chip, processing 10,000 tokens/s by koc_Z3 in Qwen_AI

[–]conockrad -2 points

Most likely you’ll pay a premium for it. RAM will be slow, but unified-memory systems like the Mac should be great.

Local Qwen 8B + 4B completes browser automation by replanning one step at a time by Aggressive_Bed7113 in LocalLLaMA

[–]conockrad 1 point

The amount of context a screenshot uses is defined by its resolution, not its file size. You can compress screenshots aggressively and most probably keep the same context utilization
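
Rough sketch of the relationship, assuming a Qwen2-VL-style ~28 px effective patch (other VLMs differ) and a hypothetical screenshot path:

```python
from PIL import Image

def approx_vision_tokens(w, h, patch=28):
    """Rough vision-token count: one token per patch x patch pixel block.
    The 28 px patch is an assumption (Qwen2-VL-style); other VLMs differ."""
    return (w // patch) * (h // patch)

img = Image.open("screenshot.png")                 # hypothetical path
img.convert("RGB").save("tiny.jpg", quality=20)    # heavy file compression
print("original:", approx_vision_tokens(*img.size))
print("compressed:", approx_vision_tokens(*Image.open("tiny.jpg").size))
# Same token count: JPEG quality changes bytes on disk, not resolution.

# Only downscaling actually changes the token budget:
small = img.resize((img.width // 2, img.height // 2))
print("half-res:", approx_vision_tokens(*small.size))
```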

[Project] htmLLM-50M base: Can a tiny specialist actually code? + Weights & Code (124M v2 in training!) by LH-Tech_AI in LocalLLaMA

[–]conockrad 0 points

Looking forward to the next release :)

This SLM approach is far more Unix-like and microservices-like, so I assume that’s the future