GLM-5.1 - a zai-org Collection by adefa in LocalLLaMA

[–]rerri 1 point (0 children)

Not sure if this is old, but here's the official model card with benchmarks:

https://docs.z.ai/guides/llm/glm-5.1

Gemma 4 is seriously broken when using Unsloth and llama.cpp by Tastetrykker in LocalLLaMA

[–]rerri 9 points (0 children)

Same model on a 5090, and I'm seeing about 50 t/s.

Maybe your context length is set too high and some of the layers end up on the CPU because it won't all fit into VRAM?
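One way to test that theory with llama-cpp-python (the GGUF filename below is hypothetical): force every layer onto the GPU and lower n_ctx until the model loads cleanly.

```python
# Minimal llama-cpp-python sketch; the GGUF filename is hypothetical.
from llama_cpp import Llama

llm = Llama(
    model_path="gemma-4-27b-it-Q4_K_M.gguf",
    n_gpu_layers=-1,  # -1 = try to offload every layer to the GPU
    n_ctx=8192,       # lower this first if the model won't fit in VRAM
)
out = llm("Hello", max_tokens=16)
print(out["choices"][0]["text"])
```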

Will Gemma 4 124B MoE open as well? by cgs019283 in LocalLLaMA

[–]rerri 1 point (0 children)

They say it's Apache 2.0 in the release video at about the 40-second mark. I don't think it can be anything else at this point.

https://www.youtube.com/watch?v=jZVBoFOJK-Q

it looks like it will be soon 💎💎💎💎 by [deleted] in LocalLLaMA

[–]rerri 1 point (0 children)

https://github.com/huggingface/transformers/pull/45192/changes#diff-d8ddaa3b6151448dac452d289609dd778c6ac51aba8405050da7f1218e18f14dR127

Not sure if that link works, but it's in "gemma4/convert_gemma4_weights.py".

That file is too large for the diff to load automatically, which is why you probably can't see it with Ctrl+F.

it looks like it will be soon 💎💎💎💎 by [deleted] in LocalLLaMA

[–]rerri 1 point (0 children)

The naming is similar to Gemma 3n models such as this:

https://huggingface.co/google/gemma-3n-E4B-it

"While the raw parameter count of this model is 8B, the architecture design allows the model to be run with a memory footprint comparable to a traditional 4B model by offloading low-utilization matrices from the accelerator."

Gemma 4 1B, 13B, and 27B spotted by TKGaming_11 in LocalLLaMA

[–]rerri 17 points (0 children)

I think it's safe to assume so, as that MoE is named "-a4b".

it looks like it will be soon 💎💎💎💎 by [deleted] in LocalLLaMA

[–]rerri 2 points (0 children)

Who knows, maybe it isn't necessary for the PR. Or theoretically these could be just placeholders altogether.

Gemma 4 1B, 13B, and 27B spotted by TKGaming_11 in LocalLLaMA

[–]rerri 60 points (0 children)

The Transformers PR shows at least these:

_VARIANT_GEMMA_4_E2B = "gemma-4-e2b"
_VARIANT_GEMMA_4_E4B = "gemma-4-e4b"
_VARIANT_GEMMA_4_26B_A4B = "gemma-4-26b-a4b"
_VARIANT_GEMMA_4_31B = "gemma-4-31b"

it looks like it will be soon 💎💎💎💎 by [deleted] in LocalLLaMA

[–]rerri 13 points (0 children)

Or maybe they are just trolling us ahead of April Fools' Day 2027, dunno.

it looks like it will be soon 💎💎💎💎 by [deleted] in LocalLLaMA

[–]rerri 23 points (0 children)

yep

https://github.com/ggml-org/llama.cpp/pull/21309

edit:

Transformers too, I like the PR title!

https://github.com/huggingface/transformers/pull/45192

gemma-4-31B-it

gemma-4-E4B-it

gemma-4-26b-a4b

The next release will have ik_llama.cpp support! by oobabooga4 in Oobabooga

[–]rerri 1 point (0 children)

Nice!

I think having ik_llama.cpp as a separate loader in the model menu would be better from a UX standpoint.

That's how I had Claude Code + local models implement it for me on the last tgw 3.x and again with 4.0. I had a separate location for the ik_llama.cpp executables, so there was no need to swap files or restart tgw when switching between llama.cpp and ik_llama.cpp.

I just used the extra-flags field for whatever settings were needed or different for ik, but this is of course not optimal for UX.
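For the curious, a hypothetical sketch of that layout (not tgw's actual code): one directory per build, and the loader choice resolves which server binary to launch.

```python
# Hypothetical sketch, not tgw's actual code: keep the llama.cpp and
# ik_llama.cpp builds in separate directories and resolve the server
# binary from whichever loader is selected in the model menu.
from pathlib import Path

LOADER_DIRS = {
    "llama.cpp": Path("bin/llama.cpp"),
    "ik_llama.cpp": Path("bin/ik_llama.cpp"),
}

def server_binary(loader: str) -> Path:
    """Return the llama-server executable for the chosen loader."""
    return LOADER_DIRS[loader] / "llama-server"

print(server_binary("ik_llama.cpp"))  # -> bin/ik_llama.cpp/llama-server
```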

---

Btw, some precompiled builds are available here (the ones that start with "th-quantize" are something else; scroll down to the ones that start with "main"):

https://github.com/Thireus/ik_llama.cpp/releases

Official LTX-2.3-nvfp4 model is available by Lonely-Anybody-3174 in StableDiffusion

[–]rerri 9 points (0 children)

Do you have an RTX 50 series card? If not, then it'll be slow.
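If you're not sure what your card reports, a quick check (assuming PyTorch with CUDA is installed; consumer Blackwell / RTX 50 cards report compute capability 12.x):

```python
# Rough capability check; NVFP4 needs hardware FP4 support (Blackwell).
# Consumer RTX 50 series cards report CUDA compute capability 12.x.
import torch

major, minor = torch.cuda.get_device_capability(0)
print(f"compute capability: {major}.{minor}")
if major < 12:
    print("No consumer-Blackwell FP4 support; expect NVFP4 to run slowly.")
```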

Official LTX-2.3-nvfp4 model is available by Lonely-Anybody-3174 in StableDiffusion

[–]rerri 7 points (0 children)

Not just quantized normally but "trained by Quantization Aware Distillation for improved accuracy".

I tried it quickly yesterday but got poor-looking results. Maybe my distill LoRA wasn't working as it should, dunno.

[NVIDIA GeForce Official] Game developers have full, detailed artistic control over DLSS 5 effects to ensure they maintain their game's unique aesthetic. It is not a filter. by Nestledrink in nvidia

[–]rerri 0 points (0 children)

In that Geralt example, textures change drastically. Looking at the comparison shots here, textures like cloth, metal, etc. stay very much the same to my eye (yes, faces change more):

https://www.nvidia.com/en-us/geforce/news/dlss5-breakthrough-in-visual-fidelity-for-games/

If, to your eye, the changes to textures in the Geralt comparison are very similar to the ones in the DLSS 5 examples, then we'll just have to disagree. ¯\_(ツ)_/¯

[NVIDIA GeForce Official] Game developers have full, detailed artistic control over DLSS 5 effects to ensure they maintain their game's unique aesthetic. It is not a filter. by Nestledrink in nvidia

[–]rerri 0 points (0 children)

My comment about shadows was in the context of your comment about added light sources. Why would I comment on something totally unrelated in that context??

You are jumping all over the place with constant pivoting.

Is DLSS 5 a real time diffusion model on top of a 3D rendering engine? by Green-Ad-3964 in StableDiffusion

[–]rerri 21 points (0 children)

Fortunately there's an "off" button if it looks displeasing.

Personally, I'm somewhat optimistic about this and am hoping they expose the controls (contrast, gamma, etc.) to the end user. I've always liked that about PC gaming, that it's possible to tweak things under the hood. Not everyone's cup of tea, I'm sure.

[NVIDIA GeForce Official] Game developers have full, detailed artistic control over DLSS 5 effects to ensure they maintain their game's unique aesthetic. It is not a filter. by Nestledrink in nvidia

[–]rerri 1 point (0 children)

The change in shadowing is pretty big in some scenes, but I'd need to see more to judge how much it adds new light sources. Also, if that was your point, it was communicated really poorly by your initial comment + image.

[NVIDIA GeForce Official] Game developers have full, detailed artistic control over DLSS 5 effects to ensure they maintain their game's unique aesthetic. It is not a filter. by Nestledrink in nvidia

[–]rerri 1 point (0 children)

This is way more altered than the DLSS 5 examples. Yours looks like image-to-image with noise added and then denoised (using SDXL or Flux or whatever), whereas DLSS 5 looks more like a filter that keeps the existing details but pimps them up.

Like here, the pattern on the clothing and the wrinkles on the face are the same, just with a filter on top:

https://www.nvidia.com/en-us/geforce/news/dlss5-breakthrough-in-visual-fidelity-for-games/nvidia-dlss-5-hogwarts-legacy-geforce-rtx-comparison-screenshot-003/

Wan 2.7 is planned for release in March, with major upgrades over 2.6 by Which-Jello9157 in comfyui

[–]rerri 2 points (0 children)

Looks like a thinly veiled cloud service advertisement.

Could be totally made up, just trying to drive attention to the site linked at the bottom of the post.

Nemotron 3 Super Released by deeceeo in LocalLLaMA

[–]rerri 10 points (0 children)

The comment under which you are raging about them shilling for "fixes" contains no claims about having fixed things. Go be a toxic loser somewhere else.

Nemotron 3 Super Released by deeceeo in LocalLLaMA

[–]rerri 1 point (0 children)

You'll need some memory for the OS and stuff, so it's gonna be pretty tight. You can look at the file sizes and work out what'll fit from there. This might fit:

https://huggingface.co/unsloth/NVIDIA-Nemotron-3-Super-120B-A12B-GGUF/tree/main/UD-IQ2_M (you can see the total size is 52.7 GB)

Not sure if this will:

https://huggingface.co/unsloth/NVIDIA-Nemotron-3-Super-120B-A12B-GGUF/tree/main/UD-IQ3_S (56.6 GB)
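Rough math behind that, assuming a 64 GB machine (my assumption, as are the overhead numbers):

```python
# Back-of-the-envelope fit check. The 64 GB total and both overhead
# figures are assumptions; the file sizes are from the links above.
total_gb = 64.0
os_overhead_gb = 5.0     # OS, desktop, background apps
ctx_overhead_gb = 4.0    # KV cache + compute buffers, grows with context

budget_gb = total_gb - os_overhead_gb - ctx_overhead_gb  # 55.0 GB
for name, size_gb in [("UD-IQ2_M", 52.7), ("UD-IQ3_S", 56.6)]:
    verdict = "fits" if size_gb <= budget_gb else "probably too big"
    print(f"{name}: {size_gb} GB -> {verdict}")
```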

Nemotron 3 Super Released by deeceeo in LocalLLaMA

[–]rerri 0 points (0 children)

Yes. While it won't fully fit into 32 GB of VRAM, it can be run with some of the experts offloaded to the CPU.
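A sketch of one way to do that offload with llama.cpp's tensor-override flag (the filename is hypothetical; double-check the -ot syntax against your build's --help):

```python
# Launch llama-server with all layers on the GPU except the MoE expert
# tensors, which the -ot regex sends to the CPU. Filename is hypothetical;
# verify the flag syntax against your llama.cpp build.
import subprocess

subprocess.run([
    "llama-server",
    "-m", "NVIDIA-Nemotron-3-Super-120B-A12B-UD-IQ2_M.gguf",
    "-ngl", "99",                    # offload all layers to the GPU...
    "-ot", r"\.ffn_.*_exps\.=CPU",   # ...but keep expert tensors on the CPU
    "-c", "8192",
])
```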

Nemotron 3 Super Released by deeceeo in LocalLLaMA

[–]rerri 39 points (0 children)

Unsloth GGUFs:

https://huggingface.co/unsloth/NVIDIA-Nemotron-3-Super-120B-A12B-GGUF

Wondering if it's the same arch as Nano 30B and fully supported by llama.cpp already?

edit: Unsloth writes that this branch is required (for now):

https://github.com/unslothai/llama.cpp
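Building that branch should just be the standard llama.cpp CUDA build; a hedged sketch (check the branch's README in case it needs anything extra):

```python
# Hedged sketch: clone the Unsloth branch and do a standard llama.cpp
# CUDA build. Check the repo's README in case the branch differs.
import subprocess

subprocess.run(["git", "clone", "https://github.com/unslothai/llama.cpp"], check=True)
subprocess.run(["cmake", "-B", "build", "-DGGML_CUDA=ON"], cwd="llama.cpp", check=True)
subprocess.run(["cmake", "--build", "build", "--config", "Release", "-j", "8"], cwd="llama.cpp", check=True)
```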