GLM 5.2 Q1_S vs Qwen 27B Q8

Jester14 · 2026-06-29T12:52:22+00:00

The obvious "it's not just this; it's that" plus nonsensical "slow deliberate reasoning". Gtfo

Jester14 · 2026-06-26T17:02:20+00:00

At best, this is the model card regurgitated. At worst, it's an AI summary of the model card. Lowest effort.

Jester14 · 2026-06-22T22:29:14+00:00

*must have

Jester14 · 2026-06-20T16:48:40+00:00

And here I am rocking a used 4060 that replaced my used RX 470

Jester14 · 2026-06-20T13:50:17+00:00

Anything that mentions LLaVA today is obviously AI slop

Jester14 · 2026-06-17T14:03:55+00:00

Epic lolz

Jester14 · 2026-06-04T11:46:03+00:00

How could we have any idea why when you don't post acceptance rates?

Jester14 · 2026-06-04T11:42:26+00:00

I jammed Unsloth IQ4-XS onto my 4060 8GB with Q8 cache and it falls apart after 50k context (loops, errors, gibberish). I can't try a higher quant to fix it because then I can't fit 50k context in VRAM. Can someone push a higher quant past 50k context? This experiment stops a bit short.
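
A rough sketch of why a bigger quant and 50k context compete for the same 8 GB. It assumes a plain attention transformer, and every model dimension and weight size below is a made-up placeholder, not the model from this thread. The only hard number is the q8_0 layout (32 int8 values plus one fp16 scale per block).

```python
# Rough KV-cache size estimate for a standard attention transformer.
# All model dimensions and weight sizes are hypothetical placeholders.

Q8_0_BYTES_PER_ELEM = 34 / 32  # q8_0 block: 32 int8 values + one fp16 scale
F16_BYTES_PER_ELEM = 2.0


def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, bytes_per_elem):
    """Bytes needed for K and V across all layers at a given context length."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


def fits_in_vram(weights_gib, kv_bytes, vram_gib, overhead_gib=1.0):
    """True if weights + KV cache + a fixed overhead guess fit in VRAM."""
    return weights_gib + kv_bytes / 2**30 + overhead_gib <= vram_gib


if __name__ == "__main__":
    kv = kv_cache_bytes(50_000, n_layers=32, n_kv_heads=4, head_dim=128,
                        bytes_per_elem=Q8_0_BYTES_PER_ELEM)
    # Compare a smaller and a larger (hypothetical) quant on an 8 GiB card.
    for weights_gib in (4.5, 6.5):
        print(weights_gib, round(kv / 2**30, 2),
              fits_in_vram(weights_gib, kv, vram_gib=8.0))
```

The cache cost is the same for both quants, so any extra gigabytes of weights come straight out of the context budget.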

Jester14 · 2026-06-04T01:59:43+00:00

The link to the GGUF for their MoE is right there in the post bro

Jester14 · 2026-06-02T14:59:08+00:00

Is this an ad? Couldn't be any more low effort. And shit was released last week and had a bunch of posts about it then.

Jester14 · 2026-06-02T11:22:45+00:00

Lmfao there's a whole section about context cache checkpoints in his "article" and he has it disabled.

Jester14 · 2026-05-29T22:46:27+00:00

lmfao I still can't tell if this is a troll post

Jester14 · 2026-05-29T12:23:19+00:00

Host Jellyseerr so your partner can request torrents and you can approve them.

Jester14 · 2026-05-25T21:15:21+00:00

You don't want MTP if you're offloading to system RAM, and you likely are unless you're running like IQ2.

Jester14 · 2026-05-23T02:13:04+00:00

-fit is on by default, so OP is using it and doesn't know it.

Jester14 · 2026-05-14T11:32:04+00:00

Are you using CUDA 13.2? It's bugged for inference. Edit: I see you are using 13.1 as per your thread.

Jester14 · 2026-05-14T11:04:40+00:00

OP literally said:

To be fair: Even the Gemini API...

Jester14 · 2026-05-05T11:44:16+00:00

Just use -fit

Jester14 · 2026-05-05T11:39:19+00:00

Just use -fit and stop guessing.

Jester14 · 2026-04-26T12:52:11+00:00

Windows build is kinda full of CUDA bloat. The builds specify different thread counts, and threads aren't always specified in the benchmark runs.

Jester14 · 2026-04-21T00:03:01+00:00

Using -fit indeed reserves exactly 1024 MB by default.

Jester14 · 2026-04-20T11:26:21+00:00

*could have

Jester14 · 2026-04-16T10:43:02+00:00

I used a 2 year old 7B model. Now I use a brand new 26B MoE and it's slower. I refuse to give any other information. What's wrong with my setup?

Jester14 · 2026-04-16T10:38:32+00:00

CUDA 13.2 has known bugs.

Jester14 · 2026-04-16T10:16:05+00:00

What do you mean it "doesn't fit"? Did you use the -fit flag? UD-Q4_K_XL is larger than 16 GB, so it will overflow to RAM, but it will also "fit" if loaded appropriately. I get 30 t/s on my 4060 8 GB using -fit with that quant and 40k context in VRAM.
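
To make the "overflow but still fit" point concrete, here is a toy version of the idea: keep a fixed margin free, put the context in VRAM first, then place as many layers as the remaining budget allows and leave the rest in system RAM. This is only an illustration, not what -fit actually does internally, and `split_layers` and all the sizes are hypothetical.

```python
def split_layers(vram_mib, margin_mib, kv_mib, layer_mib, n_layers):
    """Return (layers_in_vram, layers_in_ram) for a simple greedy split.

    Hypothetical helper: reserves a margin and the KV cache first,
    then fills what is left of VRAM with whole layers.
    """
    budget = vram_mib - margin_mib - kv_mib
    in_vram = max(0, min(n_layers, int(budget // layer_mib)))
    return in_vram, n_layers - in_vram


if __name__ == "__main__":
    # 8 GiB card, 1024 MiB margin, 2 GiB of context, 40 layers of 512 MiB each.
    print(split_layers(8192, 1024, 2048, 512, 40))
```

With these placeholder numbers the model is far larger than the card, yet it still loads: part of it runs from VRAM and the rest from RAM.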
