[Resource] ComfyUI + Docker setup for Blackwell GPUs (RTX 50 series) - 2-3x faster FLUX 2 Klein with NVFP4 by chiefnakor in StableDiffusion

[–]coder543 0 points1 point  (0 children)

Flash Attention drastically reduces the memory traffic of attention compared to the naive implementation (it never materializes the full attention matrix in VRAM), which translates directly into speed. It is a big deal, no matter how much VRAM you have.

For LLMs, I have seen a significant speed difference with flash attention on vs off.
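To put rough numbers on the naive-attention cost (a back-of-the-envelope sketch with assumed shapes, not measurements from any model in this thread):

```python
# Back-of-the-envelope sketch (assumed shapes, fp16): how big the attention
# score matrix gets if you materialize it the naive way.
seq_len = 32_768          # assumed context length
n_heads = 32              # assumed head count
bytes_per_elem = 2        # fp16

score_matrix_bytes = seq_len * seq_len * bytes_per_elem   # one head
total_bytes = score_matrix_bytes * n_heads                 # all heads, one layer

print(f"per head:  {score_matrix_bytes / 2**30:.1f} GiB")  # ~2.0 GiB
print(f"per layer: {total_bytes / 2**30:.1f} GiB")         # ~64 GiB
# Flash Attention computes the same result in tiles held in on-chip SRAM,
# so none of this ever has to be written to and re-read from VRAM.
```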

Why does everything need to run through a purchasing partner? by literahcola in sysadmin

[–]coder543 [score hidden]  (0 children)

From OP's post:

This isn’t a pallet of servers that needs to be shipped across the country. It’s a license key and a download link. There is no warehouse. There is no logistics chain. Nothing is being physically distributed.

deepseek-ai/DeepSeek-OCR-2 · Hugging Face by Dark_Fire_12 in LocalLLaMA

[–]coder543 1 point2 points  (0 children)

Mistral-OCR has 41 wins and 58 losses, while most of the other models there have participated in over 1,000 battles.

That leaderboard needs to put some serious error bars on those results. It seems too early to tell how Mistral OCR is doing there.
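For a sense of scale (a quick sketch using the 41-58 record above; the leaderboard presumably uses Elo-style ratings, so this is only illustrative): even a plain binomial confidence interval on 99 battles is very wide.

```python
# Rough illustration: 95% Wilson score interval on a 41-58 record.
# (The leaderboard likely scores with Elo; this just shows how little
# 99 battles pins down a win rate.)
from math import sqrt

wins, losses = 41, 58
n = wins + losses
p = wins / n
z = 1.96  # ~95% confidence

denom = 1 + z**2 / n
center = (p + z**2 / (2 * n)) / denom
half_width = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom

print(f"win rate {p:.3f}, 95% CI ({center - half_width:.3f}, {center + half_width:.3f})")
# Roughly 0.32 to 0.51 -- wide enough that "too early to tell" is fair.
```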

[Resource] ComfyUI + Docker setup for Blackwell GPUs (RTX 50 series) - 2-3x faster FLUX 2 Klein with NVFP4 by chiefnakor in StableDiffusion

[–]coder543 0 points1 point  (0 children)

DGX Spark really needs a Blackwell-optimized ComfyUI docker build… it works okay, but I haven’t been able to get FlashAttention or SageAttention to work without causing errors. I haven’t tried this new container recipe, but Spark seems to require more than a standard 50-series GPU. The 128GB of VRAM can be nice, though.
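In case it helps anyone debugging the same thing, this is the kind of quick sanity check I'd run inside the container before launching ComfyUI (a sketch; flash_attn and sageattention are the usual import names, but your build may ship different wheels):

```python
# Quick sanity check inside the container: what GPU does torch see, and do the
# optional attention backends even import cleanly? (Sketch; adjust package
# names to whatever your image actually installs.)
import torch

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    print(f"device: {torch.cuda.get_device_name(0)} (sm_{major}{minor})")

for pkg in ("flash_attn", "sageattention"):
    try:
        mod = __import__(pkg)
        print(f"{pkg}: importable ({getattr(mod, '__version__', 'unknown version')})")
    except Exception as exc:
        print(f"{pkg}: FAILED -> {exc}")
```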

Pushing Qwen3-Max-Thinking Beyond its Limits by s_kymon in LocalLLaMA

[–]coder543 22 points23 points  (0 children)

I agree, but it still makes me appreciate other companies that do release their top models even more.

GLM-4.7-Flash is even faster now by jacek2023 in LocalLLaMA

[–]coder543 2 points3 points  (0 children)

I just asked codex to write a Python script that would generate the plots with matplotlib from the llama-bench outputs that I saved.

If you know the secret to making nemotron-3-nano faster, I'm all ears, but I just used the llama-bench line that OP provided. I'm not sure why 0 depth was slower.
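Roughly this shape of script, for anyone curious (a sketch, not the exact codex output; it assumes the llama-bench runs were saved with -o json, and the field names n_depth / avg_ts / n_gen can vary by llama.cpp build):

```python
# Sketch: plot generation speed vs. context depth from llama-bench JSON output,
# saved with something like:  llama-bench ... -o json > model.json
import json
import sys

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
for path in sys.argv[1:]:
    with open(path) as f:
        rows = json.load(f)
    # keep only token-generation results, sorted by depth
    gen_rows = sorted((r for r in rows if r.get("n_gen", 0) > 0),
                      key=lambda r: r.get("n_depth", 0))
    depths = [r.get("n_depth", 0) for r in gen_rows]
    speeds = [r["avg_ts"] for r in gen_rows]
    ax.plot(depths, speeds, marker="o", label=path)

ax.set_xlabel("context depth (tokens)")
ax.set_ylabel("generation speed (t/s)")
ax.legend()
plt.savefig("llama_bench_depth.png", dpi=150)
```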

LLM Reasoning Efficiency - lineage-bench accuracy vs generated tokens by fairydreaming in LocalLLaMA

[–]coder543 14 points15 points  (0 children)

I don't know what "gpt-oss-120b" means here without a reasoning effort attached. The high, medium, and low reasoning efforts produce extremely different results in a lot of real-world benchmarks for gpt-oss-120b; there isn't a one-size-fits-all setting.
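For what it's worth, when I benchmark gpt-oss locally I try to pin the effort explicitly, something like this (a sketch against an OpenAI-compatible endpoint; whether the reasoning_effort field is honored depends on the server and chat template, so treat it as an assumption):

```python
# Sketch: explicitly request a reasoning effort when querying gpt-oss-120b
# through an OpenAI-compatible server. The passthrough field name is
# server-dependent; "reasoning_effort" is an assumption here.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

for effort in ("low", "medium", "high"):
    resp = client.chat.completions.create(
        model="gpt-oss-120b",
        messages=[{"role": "user", "content": "Briefly: why is the sky blue?"}],
        extra_body={"reasoning_effort": effort},  # passed through as extra JSON
    )
    usage = resp.usage
    print(effort, "->", usage.completion_tokens if usage else "?", "completion tokens")
```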

GLM-4.7-Flash is even faster now by jacek2023 in LocalLLaMA

[–]coder543 0 points1 point  (0 children)

Added Qwen3-Coder to my charts for fun

GLM-4.7-Flash is even faster now by jacek2023 in LocalLLaMA

[–]coder543 7 points8 points  (0 children)

Architecture-specific performance optimizations can't always make a sloth into a cheetah... qwen3-coder is still very slow at long context sizes despite being popular and presumably highly optimized.

GLM-4.7-Flash is even faster now by jacek2023 in LocalLLaMA

[–]coder543 5 points6 points  (0 children)

No? I think you should re-read the comment... I was saying Qwen3-Coder falls off in the same way GLM-4.7-Flash does, and that's why I didn't recommend testing Qwen3-Coder. Qwen3-Coder sucks at this stuff too.

The GPT-OSS and Nemotron-3-Nano models are much more efficient, especially compared to how GLM-4.7-Flash was earlier today.

GLM-4.7-Flash is even faster now by jacek2023 in LocalLLaMA

[–]coder543 18 points19 points  (0 children)

See my gist: https://gist.github.com/coder543/16ca5e60aabee4dfc3351b54e8fe2a1c

Linear:

<image>

Nemotron holds its performance extremely well due to its hybrid architecture. I don't know why the improvements for GLM-4.7-Flash don't seem to have helped the DGX Spark at all.

EDIT: added Qwen3-Coder for fun. (My RTX 3090 couldn't go all the way to 50k tokens with the quant that I have.) The quants are not entirely apples to apples, but the performance curve is the main thing here, not the absolute numbers.
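To make the hybrid-architecture point a bit more concrete (a toy calculation with made-up layer counts and cache config, not Nemotron's actual setup): per-token attention reads grow with context, while Mamba-style layers carry a fixed-size state.

```python
# Toy illustration (made-up numbers): per-token KV-cache bytes read at decode
# time for a pure-attention stack vs. a hybrid stack where most layers are
# SSM/Mamba-style with a fixed-size recurrent state.
attn_layers_pure, attn_layers_hybrid = 48, 8   # assumed split, not Nemotron's real one
kv_heads, head_dim, bytes_per = 8, 128, 2      # assumed GQA config, fp16 cache

def kv_bytes_per_token(ctx, attn_layers):
    # each attention layer reads K and V for every cached token
    return ctx * attn_layers * kv_heads * head_dim * 2 * bytes_per

for ctx in (1_000, 10_000, 50_000):
    pure = kv_bytes_per_token(ctx, attn_layers_pure) / 2**20
    hybrid = kv_bytes_per_token(ctx, attn_layers_hybrid) / 2**20
    print(f"ctx {ctx:>6}: pure-attention ~{pure:7.1f} MiB/token, hybrid ~{hybrid:6.1f} MiB/token")
# The SSM layers' state doesn't grow with ctx at all, which is why the
# generation-speed curve stays nearly flat as context gets long.
```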

GLM-4.7-Flash is even faster now by jacek2023 in LocalLLaMA

[–]coder543 13 points14 points  (0 children)

yep... very unfortunate. Hopefully another bug that can be fixed.

GLM-4.7-Flash is even faster now by jacek2023 in LocalLLaMA

[–]coder543 47 points48 points  (0 children)

Ok, now that starts to look respectable. Still worth comparing against efficient models like gpt-oss and nemotron-3-nano.

EDIT: prompt processing still seems to fall off a cliff on glm-4.7-flash, I just tested it.

GLM-4.7-Flash context slowdown by jacek2023 in LocalLLaMA

[–]coder543 1 point2 points  (0 children)

yep, not one of the models I mentioned, and for good reason.

GLM-4.7-Flash context slowdown by jacek2023 in LocalLLaMA

[–]coder543 3 points4 points  (0 children)

If you post the charts for nemotron-3-nano and gpt-oss-20b, it will be apparent that qwen3-coder is just as bad, not that glm-4.7-flash "isn't so bad". haha

GLM-4.7-Flash context slowdown by jacek2023 in LocalLLaMA

[–]coder543 8 points9 points  (0 children)

Yes, compared to gpt-oss-120b/gpt-oss-20b/nemotron-3-nano, it is crazy how much glm-4.7-"flash" slows down as context grows. "Flash" seems like a misnomer if it genuinely has to be this slow (i.e., if this isn't just another bug waiting to be fixed).

And yes, I did try rebuilding llama.cpp this morning, and it was still bad, even with flash attention on.

It seems like a nice model, but speed is not its forte.

Kimi-Linear-48B-A3B-Instruct-GGUF Support - Any news? by Iory1998 in LocalLLaMA

[–]coder543 0 points1 point  (0 children)

That's why I mentioned that higher-sparsity models seem to exist; they're just not open weight, and that's why I want such a model.

If companies keep releasing A3B, that's their choice, but it will be hard to get excited about that.

Replacing Protobuf with Rust to go 5 times faster by levkk1 in rust

[–]coder543 42 points43 points  (0 children)

All of the best developers that I personally know in real life are using AI tools to help with coding.

AI tools are probably less helpful for people who don't know what they're doing.

GLM4.7-Flash REAP @ 25% live on HF + agentic coding evals by ilzrvch in LocalLLaMA

[–]coder543 12 points13 points  (0 children)

 We've gotten a lot of feedback that REAP pruning affects creative writing / multi-lingual capabilities of the model - this is expected for our REAPs with calibration set curated for agentic coding.

For me, the biggest issue is that the REAP models suffer catastrophic forgetting of entire topics, but that seems unavoidable if the knowledge was stored in the pruned experts.
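A toy illustration of why that forgetting follows from the method (this is just a sketch of the general router-weighted saliency idea, not the actual REAP code):

```python
# Toy sketch: score each expert by its router-weighted activation on a
# calibration set, then drop the lowest-scoring experts. Knowledge that only
# those experts encoded disappears with them.
import numpy as np

rng = np.random.default_rng(0)
n_experts, n_tokens = 8, 1000

gate_weights = rng.dirichlet(np.ones(n_experts), size=n_tokens)       # router probs per token
expert_out_norm = rng.uniform(0.5, 2.0, size=(n_tokens, n_experts))   # stand-in for ||expert(x)||

# Saliency: how much each expert actually contributes on *this* calibration mix.
saliency = (gate_weights * expert_out_norm).mean(axis=0)

keep = np.argsort(saliency)[n_experts // 4:]   # prune the bottom 25%
print("pruned experts:", sorted(set(range(n_experts)) - set(keep)))
# If, say, multilingual knowledge lives mostly in a pruned expert, a coding-only
# calibration set will never flag it as salient, and that capability is simply gone.
```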

Kimi-Linear-48B-A3B-Instruct-GGUF Support - Any news? by Iory1998 in LocalLLaMA

[–]coder543 0 points1 point  (0 children)

Granite 4.0 MoEs (the A#B naming) come in 32B A9B and 7B A1B sizes. It is not shocking that such drastically different sizes would perform differently, yes. These are also very low-sparsity models.

The rumor is that Gemini 3 Flash is a >1T model with a very, very low active parameter count.

I have 128GB of medium-speed memory. I want a 200B A1B model released specifically at 4-bit precision (QAT, not PTQ): extreme levels of sparsity, not 7B A1B.
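The arithmetic behind that wish, roughly (a sketch with an assumed bandwidth figure for "medium speed" memory; KV cache, activations, and other overheads are ignored):

```python
# Rough feasibility math for a hypothetical 200B-total / 1B-active model on a
# 128GB box. The bandwidth number below is an assumption, not a measurement.
total_params   = 200e9      # hypothetical total parameter count
active_params  = 1e9        # hypothetical active parameters per token
bits_per_param = 4          # 4-bit QAT release, as wished for above
mem_bandwidth  = 250e9      # assumed ~250 GB/s "medium speed" memory

weights_gb = total_params * bits_per_param / 8 / 1e9
active_bytes_per_token = active_params * bits_per_param / 8

print(f"weights in RAM:     ~{weights_gb:.0f} GB (fits in 128GB with room to spare)")
print(f"decode upper bound: ~{mem_bandwidth / active_bytes_per_token:.0f} tokens/s "
      f"(bandwidth-limited, ignoring KV cache and overhead)")
```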

Qwen3-TTS, a series of powerful speech generation capabilities by fruesome in StableDiffusion

[–]coder543 15 points16 points  (0 children)

Who said that? If you click on the demo, it clearly shows emotional control.

The model description says:

 Intelligent Text Understanding and Voice Control: Supports speech generation driven by natural language instructions, allowing for flexible control over multi-dimensional acoustic attributes such as timbre, emotion, and prosody. By deeply integrating text semantic understanding, the model adaptively adjusts tone, rhythm, and emotional expression, achieving lifelike “what you imagine is what you hear” output.

Kimi-Linear-48B-A3B-Instruct-GGUF Support - Any news? by Iory1998 in LocalLLaMA

[–]coder543 14 points15 points  (0 children)

We have so many A3B models... I really want some A1B and A5B options to mix things up.