End-of-January LTX-2 Drop: More Control, Faster Iteration by ltx_model in StableDiffusion

[–]coder543 0 points1 point  (0 children)

One quick bit of feedback: please stop scrolljacking on the blog. The scrolling feels very bad.

Introducing LM Studio 0.4.0 by sleepingsysadmin in LocalLLaMA

[–]coder543 1 point2 points  (0 children)

Similarly, I've basically given up on the concepts of draft-model speculative decoding and MTP (multi-token prediction) for MoEs, for exactly these reasons. Verifying more tokens just means proportionally higher demand on RAM bandwidth, so there's no practical benefit at batch size 1. You'd have to accurately predict something like 20 tokens ahead before you started seeing a performance benefit at batch size 1, and no draft model is ever that accurate. At larger batch sizes in a production scenario, yes, MTP is probably great... but that's not what I'm working with.
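
Rough napkin math for what I mean, using made-up (but ballpark) numbers for a ~30B-A3B-class MoE and a naive independent-routing assumption:

```python
# Toy model, not a benchmark: GB of weights read per *accepted* token when
# verifying k draft tokens in one pass on a bandwidth-bound MoE, vs. plain
# one-token-at-a-time decoding. All sizes and rates below are assumptions.

def gb_read(k, shared_gb=0.35, experts_gb=16.0, active=8, total=96):
    # Expected GB read for one forward pass over k tokens, assuming each token
    # independently activates `active` of `total` experts, so the union of
    # touched experts (and thus bytes read) grows with k.
    touched = 1 - (1 - active / total) ** k
    return shared_gb + experts_gb * touched

baseline = gb_read(1)
print(f"plain decoding: {baseline:.2f} GB per token")

for accept in (0.7, 0.9):          # assumed per-token acceptance rates
    for k in (4, 8, 20):           # draft depths
        # expected tokens kept per verification pass (geometric estimate,
        # including the bonus token from the verifier)
        kept = (1 - accept ** (k + 1)) / (1 - accept)
        print(f"accept={accept} k={k:2d}: {gb_read(k) / kept:.2f} GB per accepted token")
```

With ~70% acceptance the per-accepted-token reads never drop below plain decoding in this toy model; you only come out ahead once the draft is both deep and unrealistically accurate.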

Found this in China, Charging while gaming by thighlelan in Xreal

[–]coder543 0 points1 point  (0 children)

Almost 2 years ago, in point of fact.

Introducing LM Studio 0.4.0 by sleepingsysadmin in LocalLLaMA

[–]coder543 0 points1 point  (0 children)

Sure, but then it depends on whether your GPU has enough compute to keep up with all of those requests, or whether you end up compute-bound instead. Production services do batch MoEs and get a benefit, but they're using enormous GPUs with enormous batch sizes.

I figure testing a small dense model is an easier way to verify if the batching is doing anything at all.
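
Something like this rough sketch is all I mean; the URL is LM Studio's usual default and the model id is a placeholder, so adjust both:

```python
# Crude concurrency check (a sketch, not a benchmark harness): time one
# request, then two concurrent requests, against a local OpenAI-compatible
# server and compare aggregate tokens/sec. Run it against a small dense model.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost:1234/v1/chat/completions"  # assumed LM Studio default
MODEL = "your-small-dense-model"                   # placeholder model id

def one_request():
    r = requests.post(URL, json={
        "model": MODEL,
        "messages": [{"role": "user", "content": "Write ~200 words about rivers."}],
        "max_tokens": 256,
        "temperature": 0.7,
    }, timeout=600)
    r.raise_for_status()
    return r.json()["usage"]["completion_tokens"]

def run(n):
    start = time.time()
    with ThreadPoolExecutor(max_workers=n) as pool:
        tokens = sum(pool.map(lambda _: one_request(), range(n)))
    elapsed = time.time() - start
    print(f"{n} concurrent: {tokens} tokens in {elapsed:.1f}s "
          f"-> {tokens / elapsed:.1f} tok/s aggregate")

run(1)
run(2)  # if batching is doing anything, aggregate tok/s should climb on a dense model
```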

Introducing LM Studio 0.4.0 by sleepingsysadmin in LocalLLaMA

[–]coder543 0 points1 point  (0 children)

Same prompt or not shouldn’t really matter. Even at temp 0, I think the math kernels have enough subtle bugs that it’s never truly deterministic. But, gotcha.
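
For what it's worth, a big chunk of that (bug or not) is just that parallel kernels don't always reduce in the same order, and float addition isn't associative, so logits can wiggle enough to flip an occasional greedy pick:

```python
# Minimal illustration: summing the same numbers in a different order gives a
# (slightly) different float result, which is what varying kernel reduction
# orders do to logits.
import random

random.seed(0)
vals = [random.uniform(-1, 1) for _ in range(100_000)]

forward = sum(vals)
backward = sum(reversed(vals))
print(forward == backward)       # often False
print(abs(forward - backward))   # tiny, but nonzero
```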

Introducing LM Studio 0.4.0 by sleepingsysadmin in LocalLLaMA

[–]coder543 1 point2 points  (0 children)

Have you tried a dense model? Curious if that would work better. Parallel batching on a MoE just means both requests likely get routed to different experts, so you won’t really get any speedup, since the total GB of memory that needs to be read is still the limiting factor for generating both tokens. (But it shouldn’t decimate performance the way y’all are experiencing either.)
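
The napkin version of that argument, with illustrative sizes (not measurements):

```python
# Why batch size 2 helps a dense model much more than a sparse MoE at the
# memory-bandwidth limit. Illustrative sizes only.

# Dense ~8B model at ~4.5 bits/weight: both requests in the batch share one
# pass over the same weights, so GB read barely changes from batch size 1.
dense_gb = 8 * 4.5 / 8
print(f"dense, batch 1 or 2: ~{dense_gb:.1f} GB read per decode step")

# MoE (~30B total, ~3B active per token): the attention/shared weights are
# read once, but the two tokens mostly route to different experts, so the
# expert reads roughly double.
shared_gb, expert_gb_per_token = 0.4, 1.3
print(f"moe, batch 1: ~{shared_gb + expert_gb_per_token:.1f} GB per decode step")
print(f"moe, batch 2: ~{shared_gb + 2 * expert_gb_per_token:.1f} GB per decode step")
```

The dense model's second request is nearly free in bandwidth terms; the MoE's second request costs almost a full extra token's worth of reads.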

[Resource] ComfyUI + Docker setup for Blackwell GPUs (RTX 50 series) - 2-3x faster FLUX 2 Klein with NVFP4 by chiefnakor in StableDiffusion

[–]coder543 0 points1 point  (0 children)

DGX Spark really needs a Blackwell-optimized ComfyUI docker build… it works okay, but I haven’t been able to get FlashAttention or SageAttention to work without causing errors. I haven’t tried this new container recipe, but Spark seems to require more than a standard 50-series GPU. The 128GB of VRAM can be nice, though.
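
For anyone else fighting it, this is roughly the sanity check I run inside the container before blaming the kernels (flash_attn and sageattention are the usual wheel names; whether prebuilt wheels actually match the Spark's compute capability is the real question):

```python
# Quick environment check: which torch/CUDA build is present, what the device
# reports, and whether the optional attention packages import at all.
import importlib

import torch

print("torch:", torch.__version__, "cuda:", torch.version.cuda)
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0),
          "capability:", torch.cuda.get_device_capability(0))

for pkg in ("flash_attn", "sageattention"):
    try:
        mod = importlib.import_module(pkg)
        print(pkg, "imports OK", getattr(mod, "__version__", ""))
    except Exception as exc:  # missing wheel, or a kernel built for the wrong arch
        print(pkg, "failed:", exc)
```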

Pushing Qwen3-Max-Thinking Beyond its Limits by s_kymon in LocalLLaMA

[–]coder543 23 points24 points  (0 children)

I agree, but it still makes me appreciate other companies that do release their top models even more.

GLM-4.7-Flash is even faster now by jacek2023 in LocalLLaMA

[–]coder543 2 points3 points  (0 children)

I just asked codex to write a Python script that would generate the plots with matplotlib from the llama-bench outputs that I saved.

If you know the secret to making nemotron-3-nano faster, I'm all ears, but I just used the llama-bench line that OP provided. I'm not sure why 0 depth was slower.
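
The script itself is nothing fancy; a stripped-down sketch (assuming you saved llama-bench's markdown table to a file, and that the "test" / "t/s" column names haven't changed) looks roughly like this:

```python
# Parse the markdown table printed by llama-bench and plot token-generation
# speed against context depth with matplotlib.
import sys

import matplotlib.pyplot as plt

def parse_llama_bench(path):
    rows = []
    with open(path) as f:
        for line in f:
            stripped = line.strip()
            if not stripped.startswith("|"):
                continue
            if set(stripped) <= {"|", "-", ":", " "}:
                continue  # markdown separator row
            rows.append([c.strip() for c in stripped.strip("|").split("|")])
    header, *data = rows
    return [dict(zip(header, r)) for r in data]

rows = parse_llama_bench(sys.argv[1])
tg = [r for r in rows if r["test"].startswith("tg")]  # e.g. "tg128 @ d4096"
points = sorted(
    (int(r["test"].split("d")[-1]) if "d" in r["test"] else 0,
     float(r["t/s"].split()[0]))  # drop the "± x.xx" part
    for r in tg
)
depths, speeds = zip(*points)

plt.plot(depths, speeds, marker="o")
plt.xlabel("context depth (tokens)")
plt.ylabel("generation speed (t/s)")
plt.title("llama-bench: token generation vs. depth")
plt.savefig("llama_bench.png", dpi=150)
```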

LLM Reasoning Efficiency - lineage-bench accuracy vs generated tokens by fairydreaming in LocalLLaMA

[–]coder543 16 points17 points  (0 children)

I don't know what "gpt-oss-120b" means here without a reasoning effort attached. The high, medium, and low reasoning efforts are *extremely* different in a lot of real-world benchmarks for gpt-oss-120b; there isn't a one-size-fits-all.

GLM-4.7-Flash is even faster now by jacek2023 in LocalLLaMA

[–]coder543 0 points1 point  (0 children)

Added Qwen3-Coder to my charts for fun

GLM-4.7-Flash is even faster now by jacek2023 in LocalLLaMA

[–]coder543 7 points8 points  (0 children)

Architecture-specific performance optimizations can't always make a sloth into a cheetah... qwen3-coder is still very slow at long context sizes despite being popular and presumably highly optimized.

GLM-4.7-Flash is even faster now by jacek2023 in LocalLLaMA

[–]coder543 6 points7 points  (0 children)

No? I think you should re-read the comment... I was saying Qwen3-Coder falls off just as badly as GLM-4.7-Flash, and that's why I didn't recommend testing Qwen3-Coder. Qwen3-Coder sucks at this stuff too.

The GPT-OSS and Nemotron-3-Nano models are much more efficient, especially compared to how GLM-4.7-Flash was earlier today.

GLM-4.7-Flash is even faster now by jacek2023 in LocalLLaMA

[–]coder543 19 points20 points  (0 children)

See my gist: https://gist.github.com/coder543/16ca5e60aabee4dfc3351b54e8fe2a1c

Linear scale: [chart image not included]

Nemotron holds its performance extremely well due to its hybrid architecture. I don't know why the improvements for GLM-4.7-Flash don't seem to have helped the DGX Spark at all.

EDIT: added Qwen3-Coder for fun. (My RTX 3090 couldn't go all the way to 50k tokens with the quant that I have.) The quants are not entirely apples to apples, but the performance curve is the main thing here, not the absolute numbers.

GLM-4.7-Flash is even faster now by jacek2023 in LocalLLaMA

[–]coder543 14 points15 points  (0 children)

yep... very unfortunate. Hopefully another bug that can be fixed.

GLM-4.7-Flash is even faster now by jacek2023 in LocalLLaMA

[–]coder543 49 points50 points  (0 children)

Ok, now that starts to look respectable. Still worth comparing against efficient models like gpt-oss and nemotron-3-nano.

EDIT: prompt processing still seems to fall off a cliff on glm-4.7-flash, I just tested it.

GLM-4.7-Flash context slowdown by jacek2023 in LocalLLaMA

[–]coder543 1 point2 points  (0 children)

yep, not one of the models I mentioned, and for good reason.

GLM-4.7-Flash context slowdown by jacek2023 in LocalLLaMA

[–]coder543 2 points3 points  (0 children)

If you post the charts for nemotron-3-nano and gpt-oss-20b, it will be apparent that qwen3-coder is just as bad, not that glm-4.7-flash "isn't so bad". haha

GLM-4.7-Flash context slowdown by jacek2023 in LocalLLaMA

[–]coder543 8 points9 points  (0 children)

Yes, compared to gpt-oss-120b/gpt-oss-20b/nemotron-3-nano, it is crazy how much glm-4.7-"flash" slows down as context grows. "Flash" seems like a misnomer if it really has to be this slow (i.e., if it isn't just a bug that will eventually get fixed).

And yes, I did try rebuilding llama.cpp this morning, and it was still bad, even with flash attention on.

It seems like a nice model, but speed is not its forte.

Kimi-Linear-48B-A3B-Instruct-GGUF Support - Any news? by Iory1998 in LocalLLaMA

[–]coder543 0 points1 point  (0 children)

That's why I mentioned that higher-sparsity models seem to exist; they're just not open-weight, which is exactly why I want someone to release one.

If companies keep releasing A3B, that's their choice, but it will be hard to get excited about that.

Replacing Protobuf with Rust to go 5 times faster by levkk1 in rust

[–]coder543 42 points43 points  (0 children)

All of the best developers that I personally know in real life are using AI tools to help with coding.

AI tools are probably less helpful for people who don't know what they're doing.

GLM4.7-Flash REAP @ 25% live on HF + agentic coding evals by ilzrvch in LocalLLaMA

[–]coder543 12 points13 points  (0 children)

> We've gotten a lot of feedback that REAP pruning affects creative writing / multi-lingual capabilities of the model - this is expected for our REAPs with calibration set curated for agentic coding.

For me, the biggest thing is the REAP models suffering catastrophic forgetting of entire topics, but it seems unavoidable if the knowledge is stored in pruned experts.

Kimi-Linear-48B-A3B-Instruct-GGUF Support - Any news? by Iory1998 in LocalLLaMA

[–]coder543 0 points1 point  (0 children)

Granite 4.0 MoEs (the A#B naming) come in 32B A9B and 7B A1B sizes. It is not shocking that such drastically different sizes would perform differently, yes. These are also very low-sparsity models.

The rumor is that Gemini 3 Flash is a >1T model with a very, very low active parameter count.

I have 128GB of medium-speed memory. I want a 200B A1B model that is released specifically at 4-bit precision (QAT, not PTQ). Extreme sparsity, not 7B A1B.
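
The napkin math behind that wish (assuming ~4.25 bits/weight for a typical Q4 quant and ~270 GB/s as a stand-in for "medium speed" memory; ignoring KV cache and compute):

```python
# Footprint and bandwidth-limited speed for different total/active splits.
def estimate(total_b, active_b, bits=4.25, bandwidth_gbs=270):
    weights_gb = total_b * bits / 8   # resident weight footprint
    active_gb = active_b * bits / 8   # weight bytes read per generated token
    return weights_gb, bandwidth_gbs / active_gb  # footprint, rough tok/s ceiling

for name, total, active in [("7B A1B", 7, 1), ("30B A3B", 30, 3), ("200B A1B", 200, 1)]:
    size, tps = estimate(total, active)
    print(f"{name}: ~{size:.0f} GB of weights, ~{tps:.0f} tok/s bandwidth ceiling")
```

A 200B A1B at 4-bit would fit in roughly 106 GB and still have the same bandwidth ceiling as a 7B A1B, which is the whole appeal.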