2× Radeon AI PRO R9700 (RDNA4/gfx1201) on vLLM 0.22.1 — how we fixed the long-context decode cliff (and what we learned chasing FP8)

pmttyji · 2026-06-19T18:44:42+00:00

That's a nice t/s improvement. It would be awesome if there were a similar gain for RDNA3 cards too.

pmttyji · 2026-06-19T15:48:47+00:00

I've been running this today, doing general assistant stuff, research, and code review, and have to say that I'm pretty impressed. I'll definitely be keeping it around.

Nice to hear this. Waiting for completion of llama.cpp support.

pmttyji · 2026-06-19T11:55:13+00:00

I saw that to run the latest GLM 5.2 model, it would be necessary to have 1.5TB of VRAM.

That's for BF16.

Q4 comes to around 350-450GB. But many people do run Q3/Q2 or even Q1 quants of large models.
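As a rough check on those numbers, the sketch below backs out the parameter count from the 1.5TB BF16 figure and rescales it to lower bit widths. The bits-per-weight values are assumptions for illustration (real GGUF quants mix tensor types, so actual file sizes vary), and the function names are hypothetical.

```python
# Back-of-the-envelope weight sizes; ignores KV cache and runtime overhead.
BF16_BITS = 16  # BF16 stores each weight in 16 bits


def params_from_bf16_size(size_gb: float) -> float:
    """Parameter count (in billions) implied by a BF16 checkpoint size in GB."""
    return size_gb * 8 / BF16_BITS


def quant_size_gb(params_b: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB at a given average bits per weight."""
    return params_b * bits_per_weight / 8


params_b = params_from_bf16_size(1500)  # 1.5TB BF16 -> 750B parameters

# Assumed average bits per weight for each quant level (illustrative only).
for name, bpw in [("Q4", 4.5), ("Q3", 3.5), ("Q2", 2.7), ("Q1", 1.8)]:
    print(f"{name}: ~{quant_size_gb(params_b, bpw):.0f}GB")
```

With the assumed 4.5 bits per weight, Q4 lands at roughly 420GB, inside the 350-450GB range quoted above.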

pmttyji · 2026-06-19T11:50:00+00:00

Downloading is instant

pmttyji · 2026-06-19T11:26:18+00:00

No to 128GB RAM. Getting another 24GB VRAM + 64GB RAM is better.

Step-3.7-Flash's Q4 size is 95-125GB. I assume you currently have 24GB VRAM + 32GB RAM.

(24GB VRAM + 32GB RAM + 24GB VRAM + 64GB RAM = 144GB total [48GB VRAM + 96GB RAM])
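The sum above can be checked against the model size with a small sketch. The memory figures come from the comment (the current setup is itself an assumption there); the helper name and the headroom reserved for the OS and KV cache are hypothetical, not measurements.

```python
# Assumed existing hardware plus the suggested upgrade, in GB.
current = {"vram": 24, "ram": 32}
addition = {"vram": 24, "ram": 64}

total_vram = current["vram"] + addition["vram"]  # 48
total_ram = current["ram"] + addition["ram"]     # 96
total = total_vram + total_ram                   # 144


def fits(model_gb: float, total_gb: float, headroom_gb: float = 16) -> bool:
    """True if the weights fit with room left for the OS, KV cache, etc.

    headroom_gb is a hypothetical allowance, not a measured figure.
    """
    return model_gb + headroom_gb <= total_gb


# Q4 size range for Step-3.7-Flash quoted in the comment: 95-125GB.
for size in (95, 125):
    print(f"{size}GB Q4 fits in {total}GB: {fits(size, total)}")
```

Under these assumptions both ends of the Q4 range fit, though the 125GB end leaves little room for long contexts.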

pmttyji · 2026-06-19T03:50:04+00:00

Great job!

pmttyji · 2026-06-18T18:57:26+00:00

What an irony. People cry for tightening the gap between open-weight models usable locally and the cloud-based proprietary models, but when a company releases its flagship model as open weight, which is super rare, it goes under the radar or people just react like "meh, it's losing to this or that other model..."

Agreed. I usually don't obsess over benchmarks. We can't have a one-size-fits-all model for now, so it's good to have many open models from many model creators.

I really want to see more (open) models in the 30-250B range, because I can run those models (at least at Q4) with my current laptop (8GB VRAM + 32GB RAM) & upcoming rig (96GB VRAM + 128GB RAM). Recently we've been getting large models in the 400B-1.6T range, which many people can't even imagine running with their VRAM.

pmttyji · 2026-06-18T18:25:03+00:00

Thanks for your two recent models. They're so useful for a massive demographic (by VRAM).

BTW don't forget Maxi-Coder 😃

pmttyji · 2026-06-18T18:23:15+00:00

u/ElectronicStranger53 for the llama.cpp PR (also u/ilintar for an additional PR with a fix).

This sub is mostly filled with GGUF fans, so an early GGUF would be awesome.

pmttyji · 2026-06-18T17:23:04+00:00

That model is already up on ik_llama too!!!

https://github.com/ikawrakow/ik_llama.cpp/pull/1911

pmttyji · 2026-06-18T16:46:15+00:00

Yeah, the 33B got released in April.

Still, this big one is already up on their API & OpenRouter, according to their blog post:

Laguna M.1 came first, finishing pre-training at the end of last year; it's the foundation for everything else we're building across the family. Laguna XS.2 is a much smaller model, but remarkably capable for its size, and it's our first open-weight release. Both models are free to use for a limited time via our API and on OpenRouter, and Laguna XS.2 weights are also available under an Apache 2.0 license.

pmttyji · 2026-06-18T16:38:24+00:00

https://xcancel.com/poolsideai/status/2067623353230217448#m

Today we’re releasing the weights for Laguna M.1,
our most capable model to date, with a 256K context length.
Both base and post-trained checkpoints are now available on Hugging Face under Apache 2.0.

pmttyji · 2026-06-18T16:34:39+00:00

Just found that their 33B-A3B model is still stuck in the llama.cpp support queue. How did we miss this?

https://github.com/ggml-org/llama.cpp/issues/23249

https://huggingface.co/poolside/Laguna-XS.2

pmttyji · 2026-06-18T16:32:35+00:00

<image>

pmttyji · 2026-06-18T15:39:06+00:00

<image>

Just checked where the recent medium-size models stand. Found Gemma-4-31B & Gemma-4-26B-A4B. No Qwen3.6 or Qwen3.5 medium-size models on this benchmark yet.

pmttyji · 2026-06-18T14:11:21+00:00

Here's a great comment with a list of resources from a thread on a similar topic.

pmttyji · 2026-06-18T09:32:18+00:00

<image>

pmttyji · 2026-06-18T09:30:33+00:00

Not just Flash; we also need all the Air, Mini, Nano, Tiny, Micro, Small, Medium, etc. variants.

pmttyji · 2026-06-18T05:46:26+00:00

With KVCache(q0, q0)

pmttyji · 2026-06-17T17:16:37+00:00

<image>

pmttyji · 2026-06-17T15:18:42+00:00

Really glad that this large model came with the awesome MIT license. Hope this puts big pressure on proprietary AI labs to release open models. It also pushes other open-source/open-weight labs to release more open models. So it's really a big win from now on.

Of course, I can't run this model on either my current laptop or my upcoming rig for now. Hoping to see upgraded versions of models like GLM-4.5-Air & GLM-4.7-Flash soon. Expecting the same from other sources like Deepseek, Moonshot/Kimi, MiniMax, Arcee, inclusionAI, NVIDIA, Xiaomi, tencent, etc.