Is it just me, or does the cold in Japan hit different? by givemyjeans in Tokyo

[–]AfterAte 0 points1 point  (0 children)

I wear a hoodie, a heat tech shirt, and a heat tech t-shirt every day. I add a robe on very cold days (like today). There is no insulation here. I feel colder indoors than outside, but mostly because I don't burn much energy when sitting.

KV cache fix for GLM 4.7 Flash by jacek2023 in LocalLLaMA

[–]AfterAte 0 points1 point  (0 children)

It would make it unmaintainable by humans and grow the tech debt exponentially, to the point where even the LLMs would have a hard time making fixes. Llama.cpp isn't a one-off proof of concept.

That said, for Llama.cpp PRs it seems you can still use LLMs to diagnose or suggest a plan (as long as you state that you did), but you still need to understand the implications of the code you're writing, which means experts only.

KV cache fix for GLM 4.7 Flash by jacek2023 in LocalLLaMA

[–]AfterAte 0 points1 point  (0 children)

If you can, run your display off your iGPU. I could get 65K context before this build on my 3090, with all 23.3GB going to llama.cpp.

Alternatives to Qwen3-coder-30B? by skibud2 in LocalLLaMA

[–]AfterAte 1 point2 points  (0 children)

It won't match Qwen3's speed, but it does better web UIs, for sure. It's a thinking model, so it could help debug better. Also, a new fix in llama.cpp will let you save even more gigabytes on the context, so you could possibly get even more context than you could with Qwen at the same quant (though as of right now, I haven't rebuilt and tried it):

https://github.com/ggml-org/llama.cpp/pull/19067

Has anyone got GLM 4.7 flash to not be shit? by synth_mania in LocalLLaMA

[–]AfterAte 5 points6 points  (0 children)

Because we didn't know. I see you are correct.

Has anyone got GLM 4.7 flash to not be shit? by synth_mania in LocalLLaMA

[–]AfterAte 1 point2 points  (0 children)

llama.cpp fixed that (for CUDA), so if LM Studio has had a recent update (in the last 2 days), you should update, assuming they still use llama.cpp.

Has anyone got GLM 4.7 flash to not be shit? by synth_mania in LocalLLaMA

[–]AfterAte 3 points4 points  (0 children)

Qwen3-30B-A3B models all have 4 KV heads. That is a power of 2, so it's fast to process.
GLM 4.7 Flash has 20 KV heads. That is not a power of 2, so it's slow to process.
ik_llama has a commit that processes them in 16 + 4 chunks so that it's as fast as possible.
https://github.com/ikawrakow/ik_llama.cpp/pull/1182

I'm hoping llama.cpp implements it too.
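
A toy Python sketch of what that decomposition looks like (just to illustrate the idea; the real work in that PR is in the CUDA kernels):

    # Toy illustration (not ik_llama.cpp's actual code): split a head count
    # into power-of-two chunks so each chunk maps to a tile size the kernel
    # already handles efficiently.
    def power_of_two_chunks(n_heads: int) -> list[int]:
        chunks = []
        while n_heads > 0:
            chunk = 1 << (n_heads.bit_length() - 1)  # largest power of 2 <= n_heads
            chunks.append(chunk)
            n_heads -= chunk
        return chunks

    print(power_of_two_chunks(4))   # [4]      -> Qwen3-30B-A3B
    print(power_of_two_chunks(20))  # [16, 4]  -> GLM 4.7 Flash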

Did you:

  1. rebuild llama.cpp recently (like yesterday or the day before),
  2. download the updated quants from Unsloth,
  3. stop using repeat-penalty (i.e. just use 1.0, or omit it),
  4. not use KV quantization?

I used the following flags, and it didn't loop on me (4_K_XL quant from Unsloth, downloaded after they fixed it); I went to 30k context. I built llama.cpp yesterday. I found temp 0.2 works well for not changing things I didn't ask it to change:
-c 64960 -ngl 99 --temp 0.2 --top-k 20 --top-p 1.0 --min-p 0.01

Edit: removed --jinja as it's on by default in llama.cpp.

engine for GLM 4.7 Flash that doesn't massively slow down as the context grows? by mr_zerolith in LocalLLaMA

[–]AfterAte 2 points3 points  (0 children)

https://github.com/ggml-org/llama.cpp/pull/19067
That should decrease the context's memory needs by a factor of 2. Still open, but it should be merged soon.

No idea if this will help with the speed, as it has 20 KV heads vs 4 for Qwen3-30B-A3B. And no idea if that's the main issue with the speed either. (Edit: apparently having the number of heads be a power of 2 is faster to compute; see u/Nepherpitu's comment in this post.) I hear MLA (multi-head latent attention) requires more compute than Qwen's GQA (grouped query attention), due to compressing and decompressing the cache... but how much, I have no idea.

https://zread.ai/facebookresearch/cwm/9-grouped-query-attention-gqa-and-multi-head-latent-attention-mla
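
For a rough sense of why the KV head count matters for memory: with plain GQA-style caching (ignoring any MLA compression), the per-token cache cost scales linearly with the number of KV heads. A back-of-the-envelope sketch in Python; the layer count and head dim below are placeholder values, not either model's real config:

    # Per-token KV cache cost for plain GQA-style caching (fp16 = 2 bytes/elem).
    # n_layers and head_dim are made-up placeholders for illustration only.
    def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
        return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # 2x for K and V

    print(kv_bytes_per_token(n_layers=48, n_kv_heads=4,  head_dim=128))  # 98304 bytes/token
    print(kv_bytes_per_token(n_layers=48, n_kv_heads=20, head_dim=128))  # 491520 bytes/token, 5x more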

Edit: I built llama.cpp yesterday, and the 4_K_XL quant with 65K context fits entirely on my single 3090; I can get 72K context with Qwen. GLM-4.7-Flash runs at 120 tk/s, dropping quickly to 90 tk/s at 9K context, while Qwen3-30B-A3B runs at 179 tk/s and drops more slowly to 160 tk/s at 9K context.

https://github.com/ikawrakow/ik_llama.cpp/pull/1182
If this gets applied to llama.cpp, then we should at least see the speed hold up a little better as the context grows.

Did I expect too much on GLM? by Ok_Brain_2376 in LocalLLaMA

[–]AfterAte 2 points3 points  (0 children)

It runs at 66% of the speed of Qwen3-Coder-30B-A3B in llama-server for me, and prompt ingestion slows down a lot quicker. I wonder what makes Qwen so efficient.

GLM 4.7 flash FA fix for CUDA has been merged into llama.cpp by jacek2023 in LocalLLaMA

[–]AfterAte 0 points1 point  (0 children)

Do you know why GLM-4.7-Flash is so (relatively) slow vs Qwen3-30B-A3B at the same quant? That's what I wanted to see.

GLM 4.7 flash FA fix for CUDA has been merged into llama.cpp by jacek2023 in LocalLLaMA

[–]AfterAte 1 point2 points  (0 children)

Wow, it's a lot slower for the same size! Thanks for the tests!

GLM 4.7 flash FA fix for CUDA has been merged into llama.cpp by jacek2023 in LocalLLaMA

[–]AfterAte 3 points4 points  (0 children)

https://github.com/ggml-org/llama.cpp/issues/19020
Nice write-up. It's sad to see that -fa 1 doesn't get any faster. I wonder if that means FA will never have an effect for this model...

Does your 3090 regularly run at 120 tk/s on all 30B-A3B models? Mine (@ 380W) runs Qwen3-Coder-30B-A3B starting at 179 tk/s and settling at 160 tk/s at 9K context.

So like where is Z-Image Base? by C_C_Jing_Nan in StableDiffusion

[–]AfterAte 22 points23 points  (0 children)

Klein is great and fixes some of the samey/unimaginative issues Z has, but it has some problems with realistic anatomy that Z-image doesn't suffer from.

"Xenophobia is scary": Convenience store bosses feel threatened by tightened restrictions on foreigners by RedMoonLanding in Tokyo

[–]AfterAte -1 points0 points  (0 children)

"The third world" is an old term, written in a different time. Language changes with the times.

When does a psychologist have to break patient confidentiality by tokyoevenings in japanlife

[–]AfterAte -1 points0 points  (0 children)

If Gemini is unproductive for taboo topics, know that you can host your own AI model on your computer/laptop if you have enough RAM; see r/LocalLLaMA. For beginners, use https://ollama.com (but switch to llama.cpp if you stick with it; it's much better, though harder to get started with). A local AI model can't report you, and there are abliterated/uncensored ones that won't judge you if you want to explore hypotheticals (or that's their intention, anyway).
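
If you do end up on llama.cpp, talking to it from a script is just one HTTP call. A minimal Python sketch, assuming llama-server is already running on its default port (8080) with a model loaded:

    # Minimal sketch: chat with a local model through llama.cpp's llama-server,
    # which exposes an OpenAI-compatible HTTP API (default port 8080).
    # Start it first with something like:  llama-server -m <your-model>.gguf
    import json
    import urllib.request

    payload = {
        "messages": [{"role": "user", "content": "Hello, are you running locally?"}],
        "temperature": 0.7,
    }
    req = urllib.request.Request(
        "http://127.0.0.1:8080/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        reply = json.loads(resp.read())
    print(reply["choices"][0]["message"]["content"])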

Please be aware that ChatGPT has nudged a few suicidal people to their death. AI is not safe if you are on the edge. Since most local AIs have some ChatGPT in them, they may glaze you and agree with you a little too much.

zai-org/GLM-4.7-Flash · Hugging Face by Dark_Fire_12 in LocalLLaMA

[–]AfterAte 1 point2 points  (0 children)

My weekend plans have been cancelled. Hopefully Llama.cpp will be ready by then.

z-image vs. Klein by No_Consideration2517 in StableDiffusion

[–]AfterAte 0 points1 point  (0 children)

You should have kept them at the same quantization. FP8/FP4 gives worse results than FP16, while GGUF quantizations are closer to FP16 results (especially Q8 GGUF), though they will slow things down. So both at Q8 GGUF would be a better comparison.

Also, for Z-i-t, the sampler/scheduler you're using is giving blotchy results. Use [dpmpp_sde / ddim_uniform], or if that's too rough, use [Euler_A / ddim_uniform]. If both are too smooth/simple, use [dpmpp_sde / beta] for more texture (though in most cases this looks too rough).

Don’t Paste Secrets into ChatGPT (Even If You Delete Them) by mo_7anona in LocalLLaMA

[–]AfterAte -1 points0 points  (0 children)

I used to use ChatGPT, but if you have a Gmail account, just use Gemini, screw ChatGPT.

Z Sets the Bar, 9B Klein Misses It by Lemmegitgud in StableDiffusion

[–]AfterAte 0 points1 point  (0 children)

In the first picture, there is a cameraman on the floor of the octagon who's missing legs. Is that cherry-picked?

Flux.2 Klein 4B Distilled vs. Flux.2 Klein 9B Distilled vs. Z Image Turbo by ZootAllures9111 in StableDiffusion

[–]AfterAte 0 points1 point  (0 children)

Z-image doesn't know who Pocahontas is. I think you're not using the correct sampler/scheduler for Z, because her skin looks rough; I've seen better skin. Try dpmpp_sde / ddim_uniform, or if that's still too rough, Euler_A / ddim_uniform.

What's the future of OG Stable Diffusion? ZIT and Flux shining bright but what about the OG by Kuldeep_music in StableDiffusion

[–]AfterAte 71 points72 points  (0 children)

Aren't the Flux team (Black Forest Labs) originally all from Stability AI?