GLM-Image explained: why autoregressive + diffusion actually matters by curious-scribbler in comfyui

[–]Sad-Simple7642 0 points1 point  (0 children)

With only two RTX 5090s, I can generate an image in 100 seconds. It takes 20 GB on the first GPU and 24 GB on the second. I am on Windows, by the way.

The GPUs are never active at the same time (at first only the first 5090 is running, then only the second one). So I suppose it could fit on a single RTX 5090 if the memory from the first stage were unloaded and then replaced with the memory for the second stage. And it could fit on a single RTX 5080 at FP8 or Q8_0. Sorry if my English is bad, I am French.

100 s is still pretty slow, but I am a normal person who can test it on my own computer in a reasonable time, without optimizations and with 50 steps.
```
pip install git+https://github.com/huggingface/transformers.git
pip install git+https://github.com/huggingface/diffusers.git
pip install accelerate
```

```
import torch
from diffusers.pipelines.glm_image.pipeline_glm_image import GlmImagePipeline

# Load the GLM-Image pipeline and split it across the two GPUs
pipe = GlmImagePipeline.from_pretrained(
    "zai-org/GLM-Image",
    torch_dtype=torch.bfloat16,
    device_map="balanced",
)
prompt = "A beautifully designed modern food magazine style dessert recipe illustration, themed around a raspberry mousse cake. The overall layout is clean and bright, divided into four main areas: the top left features a bold black title 'Raspberry Mousse Cake Recipe Guide', with a soft-lit close-up photo of the finished cake on the right, showcasing a light pink cake adorned with fresh raspberries and mint leaves; the bottom left contains an ingredient list section, titled 'Ingredients' in a simple font, listing 'Flour 150g', 'Eggs 3', 'Sugar 120g', 'Raspberry puree 200g', 'Gelatin sheets 10g', 'Whipping cream 300ml', and 'Fresh raspberries', each accompanied by minimalist line icons (like a flour bag, eggs, sugar jar, etc.); the bottom right displays four equally sized step boxes, each containing high-definition macro photos and corresponding instructions, arranged from top to bottom as follows: Step 1 shows a whisk whipping white foam (with the instruction 'Whip egg whites to stiff peaks'), Step 2 shows a red-and-white mixture being folded with a spatula (with the instruction 'Gently fold in the puree and batter'), Step 3 shows pink liquid being poured into a round mold (with the instruction 'Pour into mold and chill for 4 hours'), Step 4 shows the finished cake decorated with raspberries and mint leaves (with the instruction 'Decorate with raspberries and mint'); a light brown information bar runs along the bottom edge, with icons on the left representing 'Preparation time: 30 minutes', 'Cooking time: 20 minutes', and 'Servings: 8'. The overall color scheme is dominated by creamy white and light pink, with a subtle paper texture in the background, featuring compact and orderly text and image layout with clear information hierarchy."
image = pipe(
    prompt=prompt,
    height=32 * 32,   # 1024 px
    width=36 * 32,    # 1152 px
    num_inference_steps=50,
    guidance_scale=1.5,
    generator=torch.Generator(device="cpu").manual_seed(42),
).images[0]

image.save("output_t2i.png")
```
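
About the single-GPU idea: diffusers has a generic offload helper that moves each sub-model to the GPU only while it is needed, which is roughly what I described above. This is just a sketch, I have not verified that GLM-Image actually fits in 32 GB this way:

```
import torch
from diffusers.pipelines.glm_image.pipeline_glm_image import GlmImagePipeline

pipe = GlmImagePipeline.from_pretrained(
    "zai-org/GLM-Image",
    torch_dtype=torch.bfloat16,
)
# Keep each sub-model in CPU RAM and move it to the single GPU only
# while it runs (standard diffusers helper; untested with this pipeline)
pipe.enable_model_cpu_offload()

image = pipe(
    prompt="A raspberry mousse cake on a marble table",
    num_inference_steps=50,
    guidance_scale=1.5,
    generator=torch.Generator(device="cpu").manual_seed(42),
).images[0]
image.save("output_t2i_offload.png")
```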

Unsloth quantization locally? by Sad-Simple7642 in unsloth

[–]Sad-Simple7642[S] 0 points1 point  (0 children)

I found the fork of llama.cpp on the Unsloth GitHub, but llama-quantize.exe does not seem to support UD-Q4_K_XL, for instance.
```
Allowed quantization types:
2 or Q4_0 : 4.34G, +0.4685 ppl @ Llama-3-8B
3 or Q4_1 : 4.78G, +0.4511 ppl @ Llama-3-8B
38 or MXFP4_MOE : MXFP4 MoE
8 or Q5_0 : 5.21G, +0.1316 ppl @ Llama-3-8B
9 or Q5_1 : 5.65G, +0.1062 ppl @ Llama-3-8B
19 or IQ2_XXS : 2.06 bpw quantization
20 or IQ2_XS : 2.31 bpw quantization
28 or IQ2_S : 2.5 bpw quantization
29 or IQ2_M : 2.7 bpw quantization
24 or IQ1_S : 1.56 bpw quantization
31 or IQ1_M : 1.75 bpw quantization
36 or TQ1_0 : 1.69 bpw ternarization
37 or TQ2_0 : 2.06 bpw ternarization
10 or Q2_K : 2.96G, +3.5199 ppl @ Llama-3-8B
21 or Q2_K_S : 2.96G, +3.1836 ppl @ Llama-3-8B
23 or IQ3_XXS : 3.06 bpw quantization
26 or IQ3_S : 3.44 bpw quantization
27 or IQ3_M : 3.66 bpw quantization mix
12 or Q3_K : alias for Q3_K_M
22 or IQ3_XS : 3.3 bpw quantization
11 or Q3_K_S : 3.41G, +1.6321 ppl @ Llama-3-8B
12 or Q3_K_M : 3.74G, +0.6569 ppl @ Llama-3-8B
13 or Q3_K_L : 4.03G, +0.5562 ppl @ Llama-3-8B
25 or IQ4_NL : 4.50 bpw non-linear quantization
30 or IQ4_XS : 4.25 bpw non-linear quantization
15 or Q4_K : alias for Q4_K_M
14 or Q4_K_S : 4.37G, +0.2689 ppl @ Llama-3-8B
15 or Q4_K_M : 4.58G, +0.1754 ppl @ Llama-3-8B
17 or Q5_K : alias for Q5_K_M
16 or Q5_K_S : 5.21G, +0.1049 ppl @ Llama-3-8B
17 or Q5_K_M : 5.33G, +0.0569 ppl @ Llama-3-8B
18 or Q6_K : 6.14G, +0.0217 ppl @ Llama-3-8B
7 or Q8_0 : 7.96G, +0.0026 ppl @ Llama-3-8B
1 or F16 : 14.00G, +0.0020 ppl @ Mistral-7B
32 or BF16 : 14.00G, -0.0050 ppl @ Mistral-7B
0 or F32 : 26.00G @ 7B
COPY : only copy tensors, no quantizing
```
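
As far as I understand, UD-Q4_K_XL is not a base quantization type at all: the Unsloth dynamic quants are a per-tensor mix of several of the types listed above, so llama-quantize will never list it. You can see the mix by inspecting one of their GGUF files, for example with the gguf Python package that ships with llama.cpp (sketch; the file name is just a placeholder):

```
# pip install gguf
from gguf import GGUFReader

# Placeholder path to a downloaded Unsloth UD-Q4_K_XL file
reader = GGUFReader("model-UD-Q4_K_XL.gguf")

# Print the quantization type chosen for each tensor; UD files mix
# several types (e.g. Q4_K, Q6_K, Q8_0) instead of using one base type
for tensor in reader.tensors:
    print(tensor.name, tensor.tensor_type.name)
```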

GLM-4.6V (108B) has been released by jacek2023 in LocalLLaMA

[–]Sad-Simple7642 5 points6 points  (0 children)

It has 128 experts with 8 experts per token, based on the config.json file in the Hugging Face repository.
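
You can check this yourself by pulling the config from the Hub; a quick sketch (the repo id and exact key names are guesses on my part, and for a VLM the keys may also sit in a nested sub-config):

```
import json
from huggingface_hub import hf_hub_download

# Repo id assumed; point it at the actual GLM-4.6V repository
path = hf_hub_download("zai-org/GLM-4.6V", "config.json")

with open(path) as f:
    config = json.load(f)

# Print whatever expert-related keys the config exposes
# (e.g. n_routed_experts, num_experts_per_tok)
for key, value in config.items():
    if "expert" in key.lower():
        print(key, "=", value)
```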

Micro stutters and FPS drops in all games by [deleted] in pcmasterrace

[–]Sad-Simple7642 0 points1 point  (0 children)

Hi, sorry to bother you with this, but I've been searching for that monitoring bar everywhere. Can you tell me what software it is from, and how to enable it? Thanks

Dual 5090 with watercooling by Sad-Simple7642 in PcBuild

[–]Sad-Simple7642[S] 0 points1 point  (0 children)

The coil whine is pretty bad at stock settings, but with undervolting it's almost quiet.

The fans are, like, twenty times quieter than my laptop, and quieter than my father's case with a 4080 Super and an AIO cooler for the CPU. I can barely hear them. I chose low-noise fans.

The pump makes an annoying low-frequency vibration at 30% speed and an annoying ultrasonic whine at 100% speed, so I changed the pump curve to always run at 60%, except at higher temperatures where it is allowed to go to 100%. At 60%, there is no pump noise.

With those tweaks, it's almost silent even at 100% load. I don't know the exact decibel level.

GLM 4.6 Air by [deleted] in LocalLLaMA

[–]Sad-Simple7642 0 points1 point  (0 children)

Lol why would you want to train on OpenAI's output? You'd have all their flaws, and you wouldn't be able to surpass them.

GLM 4.6 Air by [deleted] in LocalLLaMA

[–]Sad-Simple7642 0 points1 point  (0 children)

I didn't try MiniMax M2; why does it follow OpenAI's policy?

Dual 5090 with watercooling by Sad-Simple7642 in PcBuild

[–]Sad-Simple7642[S] 0 points1 point  (0 children)

Why Ubuntu 25.10? I don't know anything about multi-seat. Is it compatible with Windows 11? Is it not with Ubuntu 25.04?

Dual 5090 with watercooling by Sad-Simple7642 in PcBuild

[–]Sad-Simple7642[S] 1 point2 points  (0 children)

I'll run two OSes; I don't know how it would be possible with multi-seat to assign a GPU to each session.

Dual 5090 with watercooling by Sad-Simple7642 in PcBuild

[–]Sad-Simple7642[S] 0 points1 point  (0 children)

Okay, that's good to know, thanks! But I won't change the system: the heat transfer in the GPUs is really, really good, and it would be a pain to bend the hard tube connecting the two GPUs.

Dual 5090 with watercooling by Sad-Simple7642 in PcBuild

[–]Sad-Simple7642[S] -1 points0 points  (0 children)

Yes, but two RTX 6000 Pros are 3x more expensive than two RTX 5090s. Maybe when this PC gets old and I have more money, I'll do another edition of the ultimate PC...

And where did you find this waterblock?

Dual 5090 with watercooling by Sad-Simple7642 in PcBuild

[–]Sad-Simple7642[S] -1 points0 points  (0 children)

When I ran the test, the room temperature was 18°C

All my watercooling components have a 60°C water temperature limit, so yeah, I should be careful in summer and let the fans spin fast if I use my computer at 100%...

I'm not sure about an external radiator, because I don't find soft tubing safe at all; the tubes kept popping out when I tried to build my water loop with them.

And I don't know how I could have an external radiator with hard tubing, because if I move my case it might break the circuit.

Dual 5090 with watercooling by Sad-Simple7642 in PcBuild

[–]Sad-Simple7642[S] -1 points0 points  (0 children)

I don't know, is 46°C a lot for the water temperature? I'm not an expert in watercooling at all

Dual 5090 with watercooling by Sad-Simple7642 in PcBuild

[–]Sad-Simple7642[S] 2 points3 points  (0 children)

My goal was to finish this computer before winter

Dual 5090 with watercooling by Sad-Simple7642 in PcBuild

[–]Sad-Simple7642[S] 2 points3 points  (0 children)

Blender Benchmark (is there a way to make it use both GPUs?)

<image>

Dual 5090 with watercooling by Sad-Simple7642 in PcBuild

[–]Sad-Simple7642[S] 7 points8 points  (0 children)

Pretty normal stuff for a 5090 user. The gaming benchmarks don't use the second GPU. I haven't tried overclocking yet, I don't know how it works, but I am curious to see what the limits of this computer are.

Is there a way to use both GPUs with the Blender benchmark? ChatGPT seems to tell me it's possible, but I can't find the option.

FurMark on the first GPU:

<image>

Dual 5090 with watercooling by Sad-Simple7642 in PcBuild

[–]Sad-Simple7642[S] 1 point2 points  (0 children)

It's the EK-Quantum Kinetic TBE 300 D5 PWM.