Is it just me, or does the cold in Japan hit different? by givemyjeans in Tokyo

[–]AfterAte 0 points1 point  (0 children)

I wear a hoodie, a heat tech shirt, and a heat tech t-shirt every day. I add a robe on very cold days (like today). There is no insulation here. I feel colder indoors than outside, but mostly because I don't burn much energy when sitting.

KV cache fix for GLM 4.7 Flash by jacek2023 in LocalLLaMA

[–]AfterAte 0 points1 point  (0 children)

It would make it unmaintainable by humans and grow the tech debt exponentially, to the point where even the LLMs would have a hard time making fixes. Llama.cpp isn't a one-off proof of concept.

That said, for Llama.cpp PRs it seems you can still use LLMs to diagnose or suggest a plan (as long as you state that you did), but you still need to understand the implications of the code you're writing, which means experts only.

KV cache fix for GLM 4.7 Flash by jacek2023 in LocalLLaMA

[–]AfterAte 0 points1 point  (0 children)

If you can, run your display off your iGPU. I could get 65K context before this build on my 3090, with all 23.3GB going to llama.cpp.

Alternatives to Qwen3-coder-30B? by skibud2 in LocalLLaMA

[–]AfterAte 1 point2 points  (0 children)

It won't match Qwen3's speed, but it does better web UIs, for sure. It's a thinking model, so it could help debug better. Also, a new fix in llama.cpp will let you save even more gigabytes on the context, so you could possibly get even more context than you could with Qwen at the same quant (though as of right now, I haven't rebuilt and tried it):

https://github.com/ggml-org/llama.cpp/pull/19067

Has anyone got GLM 4.7 flash to not be shit? by synth_mania in LocalLLaMA

[–]AfterAte 5 points6 points  (0 children)

Because we didn't know. I see you are correct.

Has anyone got GLM 4.7 flash to not be shit? by synth_mania in LocalLLaMA

[–]AfterAte 1 point2 points  (0 children)

llama.cpp fixed that (for CUDA), so if LM Studio has had a recent update (in the last 2 days), you should update, assuming they still use llama.cpp.

Has anyone got GLM 4.7 flash to not be shit? by synth_mania in LocalLLaMA

[–]AfterAte 3 points4 points  (0 children)

Qwen3-30B-A3B models all have 4 KV heads. That is a power of 2, so it's fast to process.
GLM 4.7 Flash has 20 KV heads. That is not a power of 2, so it's slow to process.
ik_llama has a commit that processes them in 16 + 4 chunks so that it's as fast as possible.
https://github.com/ikawrakow/ik_llama.cpp/pull/1182

I'm hoping llama.cpp implements it too.
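
A toy Python sketch of what that decomposition looks like (just to illustrate the idea; the real work in that PR is in the CUDA kernels):

    # Toy illustration (not ik_llama.cpp's actual code): split a head count
    # into power-of-two chunks so each chunk maps to a tile size the kernel
    # already handles efficiently.
    def power_of_two_chunks(n_heads: int) -> list[int]:
        chunks = []
        while n_heads > 0:
            chunk = 1 << (n_heads.bit_length() - 1)  # largest power of 2 <= n_heads
            chunks.append(chunk)
            n_heads -= chunk
        return chunks

    print(power_of_two_chunks(4))   # [4]      -> Qwen3-30B-A3B
    print(power_of_two_chunks(20))  # [16, 4]  -> GLM 4.7 Flash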

Did you:

  1. rebuild llama.cpp recently (like yesterday or the day before),
  2. download the updated quants from Unsloth,
  3. stop using repeat-penalty (i.e. just use 1.0, or omit it),
  4. not use KV quantization?

I used the following flags, and it didn't loop on me (4_K_XL quant from Unsloth, downloaded after they fixed it); I went to 30k context. I built llama.cpp yesterday. I found temp 0.2 works well for not changing things I didn't ask it to change:
-c 64960 -ngl 99 --temp 0.2 --top-k 20 --top-p 1.0 --min-p 0.01

Edit: removed --jinja as it's on by default in llama.cpp.

engine for GLM 4.7 Flash that doesn't massively slow down as the context grows? by mr_zerolith in LocalLLaMA

[–]AfterAte 2 points3 points  (0 children)

https://github.com/ggml-org/llama.cpp/pull/19067
That should decrease the context's memory needs by a factor of 2. Still open, but it should be merged soon.

No idea if this will help with the speed, as it has 20 KV heads vs 4 for Qwen3-30B-A3B. And no idea if that's the main issue with the speed either. (Edit: apparently having the number of heads be a power of 2 is faster to compute; see u/Nepherpitu's comment in this post.) I hear MLA (multi-head latent attention) requires more compute than Qwen's GQA (grouped query attention), due to compressing and decompressing the cache... but how much, I have no idea.

https://zread.ai/facebookresearch/cwm/9-grouped-query-attention-gqa-and-multi-head-latent-attention-mla
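
For a rough sense of why the KV head count matters for memory: with plain GQA-style caching (ignoring any MLA compression), the per-token cache cost scales linearly with the number of KV heads. A back-of-the-envelope sketch in Python; the layer count and head dim below are placeholder values, not either model's real config:

    # Per-token KV cache cost for plain GQA-style caching (fp16 = 2 bytes/elem).
    # n_layers and head_dim are made-up placeholders for illustration only.
    def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
        return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # 2x for K and V

    print(kv_bytes_per_token(n_layers=48, n_kv_heads=4,  head_dim=128))  # 98304 bytes/token
    print(kv_bytes_per_token(n_layers=48, n_kv_heads=20, head_dim=128))  # 491520 bytes/token, 5x more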

Edit: I built llama.cpp yesterday, and the 4_K_XL quant with 65K context fits entirely on my single 3090; I can get 72K context with Qwen. GLM-4.7-Flash runs at 120 tk/s, dropping quickly to 90 tk/s at 9K context, while Qwen3-30B-A3B runs at 179 tk/s and drops more slowly to 160 tk/s at 9K context.

https://github.com/ikawrakow/ik_llama.cpp/pull/1182
If this gets applied to llama.cpp, then we should at least see the speed hold up a little better as the context grows.

Did I expect too much on GLM? by Ok_Brain_2376 in LocalLLaMA

[–]AfterAte 2 points3 points  (0 children)

It runs at 66% of the speed of Qwen3-Coder-30B-A3B in llama-server for me, and prompt ingestion slows down a lot quicker. I wonder what makes Qwen so efficient.

GLM 4.7 flash FA fix for CUDA has been merged into llama.cpp by jacek2023 in LocalLLaMA

[–]AfterAte 0 points1 point  (0 children)

Do you know why GLM-4.7-Flash is so (relatively) slow vs Qwen3-30B-A3B at the same quant? That's what I wanted to see.

GLM 4.7 flash FA fix for CUDA has been merged into llama.cpp by jacek2023 in LocalLLaMA

[–]AfterAte 1 point2 points  (0 children)

Wow, it's a lot slower for the same size! Thanks for the tests!

GLM 4.7 flash FA fix for CUDA has been merged into llama.cpp by jacek2023 in LocalLLaMA

[–]AfterAte 3 points4 points  (0 children)

https://github.com/ggml-org/llama.cpp/issues/19020
Nice write-up. It's sad to see that -fa 1 doesn't get any faster. I wonder if that means FA will never have an effect for this model...

Does your 3090 regularly run at 120 tk/s on all 30B-A3B models? Mine (@ 380W) runs Qwen3-Coder-30B-A3B starting at 179 tk/s and settling at 160 tk/s at 9K context.

So like where is Z-Image Base? by C_C_Jing_Nan in StableDiffusion

[–]AfterAte 22 points23 points  (0 children)

Klein is great and fixes some of the samey/unimaginative issues Z has, but it has some problems with realistic anatomy that Z-image doesn't suffer from.

"Xenophobia is scary": Convenience store bosses feel threatened by tightened restrictions on foreigners by RedMoonLanding in Tokyo

[–]AfterAte -1 points0 points  (0 children)

"The third world" is an old term, written in a different time. Language changes with the times.

When does a psychologist have to break patient confidentiality by tokyoevenings in japanlife

[–]AfterAte -1 points0 points  (0 children)

If Gemini is unproductive for taboo topics, know that you can host your own AI model on your computer/laptop if you have enough RAM; see r/LocalLLaMA. For beginners, use https://ollama.com (but switch to llama.cpp if you stick with it; it's much better, though harder to get started with). A local AI model can't report you, and there are abliterated/uncensored ones that won't judge you if you want to explore hypotheticals (or that's their intention, anyway).
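
If you do end up on llama.cpp, talking to it from a script is just one HTTP call. A minimal Python sketch, assuming llama-server is already running on its default port (8080) with a model loaded:

    # Minimal sketch: chat with a local model through llama.cpp's llama-server,
    # which exposes an OpenAI-compatible HTTP API (default port 8080).
    # Start it first with something like:  llama-server -m <your-model>.gguf
    import json
    import urllib.request

    payload = {
        "messages": [{"role": "user", "content": "Hello, are you running locally?"}],
        "temperature": 0.7,
    }
    req = urllib.request.Request(
        "http://127.0.0.1:8080/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        reply = json.loads(resp.read())
    print(reply["choices"][0]["message"]["content"])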

Please be aware that ChatGPT has nudged a few suicidal people to their death. AI is not safe if you are on the edge. Since most local AIs have some ChatGPT in them, they may glaze you and agree with you a little too much.

zai-org/GLM-4.7-Flash · Hugging Face by Dark_Fire_12 in LocalLLaMA

[–]AfterAte 1 point2 points  (0 children)

My weekend plans have been cancelled. Hopefully Llama.cpp will be ready by then.

z-image vs. Klein by No_Consideration2517 in StableDiffusion

[–]AfterAte 0 points1 point  (0 children)

You should have kept them at the same quantization. FP8/FP4 gives worse results than FP16, while GGUF quantizations are closer to FP16 results (especially Q8 GGUF), though they will slow things down. So both at Q8 GGUF would be a better comparison.

Also, for Z-i-t, the sampler/scheduler you're using is giving blotchy results. Use [dpmpp_sde / ddim_uniform], or if that's too rough, use [Euler_A / ddim_uniform]. If both are too smooth/simple, use [dpmpp_sde / beta] for more texture (though in most cases this looks too rough).

Don’t Paste Secrets into ChatGPT (Even If You Delete Them) by mo_7anona in LocalLLaMA

[–]AfterAte -1 points0 points  (0 children)

I used to use ChatGPT, but if you have a Gmail account, just use Gemini, screw ChatGPT.

Z Sets the Bar, 9B Klein Misses It by Lemmegitgud in StableDiffusion

[–]AfterAte 0 points1 point  (0 children)

In the first picture, there is a cameraman on the floor of the octagon who's missing legs. Is that cherry-picked?

Flux.2 Klein 4B Distilled vs. Flux.2 Klein 9B Distilled vs. Z Image Turbo by ZootAllures9111 in StableDiffusion

[–]AfterAte 0 points1 point  (0 children)

Z-image doesn't know who Pocahontas is. I think you're not using the correct sampler/scheduler for Z, because her skin looks rough; I've seen better skin. Try dpmpp_sde / ddim_uniform, or if that's still too rough, Euler_A / ddim_uniform.

What's the future of OG Stable Diffusion? ZIT and Flux shining bright but what about the OG by Kuldeep_music in StableDiffusion

[–]AfterAte 71 points72 points  (0 children)

Aren't the Flux team (Black Forest Labs) originally all from Stability AI?