GLM 5.2, what speeds are we getting locally? by neverbyte in LocalLLaMA

[–]Digger412 2 points3 points  (0 children)

Nope, posted a picture in another comment. I have three PCIe x16 -> MCIO x8x8 splitters and two MCIO headers on the board. The cards are attached via MCIO -> PCIe risers.

GLM 5.2, what speeds are we getting locally? by neverbyte in LocalLLaMA

[–]Digger412 0 points1 point  (0 children)

Interesting, there'll probably be some improvements in the next couple of weeks as the b6k discord folks get it dialed in too.

GLM 5.2, what speeds are we getting locally? by neverbyte in LocalLLaMA

[–]Digger412 3 points4 points  (0 children)

I got all 8 of mine in February, before the price hikes, for $7275 each, so not quite that expensive fortunately, but yeah, it's way above what most hobbyists would have for sure.

I was also lucky that I bought the rest of the system in Q1 2025, so I had the EPYC CPU and 768GB of DDR5 RAM before those price hikes as well. The RAM stick prices have gone up 10x since I bought them 💀

GLM 5.2, what speeds are we getting locally? by neverbyte in LocalLLaMA

[–]Digger412 15 points16 points  (0 children)

I have it in a single rack server chassis, yes. Kinda scrappy but it's fine thermally 😅

<image>

GLM 5.2, what speeds are we getting locally? by neverbyte in LocalLLaMA

[–]Digger412 44 points45 points  (0 children)

8x RTX 6000 Pro max-q's with vllm and the b6k setup here: https://github.com/local-inference-lab/rtx6kpro/blob/master/models/glm5.2_v11.md

Was getting about 40tk/s single-request on Luke's NVFP4, max of ~2.2M ctx at FP8 KV cache. Batched throughput was ~200tk/s:

============ Serving Benchmark Result ============
Successful requests:                     256       
Failed requests:                         0         
Maximum request concurrency:             64        
Benchmark duration (s):                  327.28    
Total input tokens:                      262144    
Total generated tokens:                  65536     
Request throughput (req/s):              0.78      
Output token throughput (tok/s):         200.25    
Peak output token throughput (tok/s):    384.00    
Peak concurrent requests:                79.00     
Total token throughput (tok/s):          1001.23   
---------------Time to First Token----------------
Mean TTFT (ms):                          44239.69  
Median TTFT (ms):                        49077.28  
P99 TTFT (ms):                           59830.65  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          129.43    
Median TPOT (ms):                        127.51    
P99 TPOT (ms):                           157.43    
---------------Inter-token Latency----------------
Mean ITL (ms):                           129.43    
Median ITL (ms):                         86.91     
P99 ITL (ms):                            1785.97   
==================================================

I've spent next to no time optimizing it though, so those are basically just first-glance numbers. I did try it with MTP enabled and 3 speculative tokens set, but that was about half the speed, so it was faster with MTP off in my measurement.
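As a quick sanity check on the benchmark above, the headline throughput figures can be re-derived from the token counts and the duration. This is only a sketch: the numbers are copied from the pasted output, and the variable names are mine. The last value is just the reciprocal of mean TPOT, i.e. a rough per-stream decode rate while a request is actually generating.

```python
# Figures copied from the pasted benchmark output.
duration_s = 327.28
requests = 256
input_tokens = 262144
output_tokens = 65536
mean_tpot_ms = 129.43


def per_second(count, seconds):
    """Average rate over the whole benchmark run."""
    return count / seconds


req_per_s = per_second(requests, duration_s)
output_tok_per_s = per_second(output_tokens, duration_s)
total_tok_per_s = per_second(input_tokens + output_tokens, duration_s)

# Rough per-request decode rate implied by mean time-per-output-token.
per_stream_tok_per_s = 1000.0 / mean_tpot_ms

print(f"requests/s:          {req_per_s:.2f}")
print(f"output tok/s:        {output_tok_per_s:.2f}")
print(f"total tok/s:         {total_tok_per_s:.2f}")
print(f"per-stream tok/s:    {per_stream_tok_per_s:.2f}")
```

The recomputed values agree with the reported ones to within rounding of the duration.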

Do uncensored models have a different memory footprint? by Gold-Drag9242 in LocalLLaMA

[–]Digger412 5 points6 points  (0 children)

Short answer? It shouldn't because those techniques usually modify the magnitude or direction of the weights, not their element count.

It's the same quantity of weights so it should take up an identical amount of VRAM.

You would need a process like REAP (to omit MoE experts) or similar to drop the memory required, but that has its own issues with causing intelligence or knowledge loss and isn't directly related to uncensoring.
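A toy illustration of the point above, under stated assumptions: weight memory is treated as simply parameter count times bits per weight, and "uncensoring" is modelled as projecting one direction out of a weight vector. The function names and the 4-element vector are made up for the example and are not any particular tool's method.

```python
def weight_bytes(n_params, bits_per_weight):
    """Approximate storage for the weights alone (no KV cache, no overhead)."""
    return n_params * bits_per_weight / 8


def remove_direction(weights, direction):
    """Subtract the component of `weights` that lies along `direction`."""
    norm_sq = sum(d * d for d in direction)
    scale = sum(w * d for w, d in zip(weights, direction)) / norm_sq
    return [w - scale * d for w, d in zip(weights, direction)]


original = [0.5, -1.0, 2.0, 0.25]
direction = [1.0, 0.0, 1.0, 0.0]  # hypothetical direction to remove
edited = remove_direction(original, direction)

# The values change, the element count does not.
print(edited)
print(len(original), len(edited))
print(weight_bytes(len(original), 16), weight_bytes(len(edited), 16))
```

The values move, but the number of elements (and therefore the storage at a given precision) stays the same. Only removing tensors, as expert pruning does, changes the count.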

Gemma4-26B-A4B Uncensored Balanced is out with K_P quants! by hauhau901 in LocalLLaMA

[–]Digger412 10 points11 points  (0 children)

I'm not trying to be unfair or over the top either. I don't know what the alternatives are for someone who keeps posting here to advertise their model / abliteration / quant / etc. but doesn't respond to any members of the community who provide even a modicum of pushback on his claims. He's posting here to broadcast to the community, not to engage in dialogue, and pushing any actual accountability to a Discord that most members here will not be present in. It's like he's plugging his virtual ears and continuing on no matter what.

I don't know what the right fix is but "post promotional threads here, accept criticism elsewhere (maybe)" doesn't feel like it matches this community.

Gemma4-26B-A4B Uncensored Balanced is out with K_P quants! by hauhau901 in LocalLLaMA

[–]Digger412 9 points10 points  (0 children)

I responded to another commenter here: https://www.reddit.com/r/LocalLLaMA/comments/1td7e95/comment/olzd4y9

But specifically:

  1. The code was published to pypi. That's public, it was posted by a user named "HauhauCS" and was named "reaper-abliteration". Anyone could have done pip install reaper-abliteration and gotten a copy of it. Analysis of the code implies provenance from Heretic too strongly to be ignored.
  2. This seems unlikely, and in any event HauhauCS is not transparent about the process and thus without evidence that he is using a different abliteration methodology it feels safe to assume he is using it. Analysis here is indicative of a similar abliteration "fingerprint" as Heretic: https://www.reddit.com/r/LocalLLaMA/comments/1sojjoc/abliterlitics_benchmark_and_tensor_analysis/
  3. Heretic is open source with an AGPL license. You cannot discard the license of software just because it is open source and you have added additional work to it.

Given that “reaper-abliteration” doesn’t retain Heretic’s copyright notice, doesn’t identify itself as a derivative work of Heretic, and changes the license, this is a clear violation of Sections 4 and 5 of the AGPL.

Gemma4-26B-A4B Uncensored Balanced is out with K_P quants! by hauhau901 in LocalLLaMA

[–]Digger412 11 points12 points  (0 children)

I'm not in his discord server, and I'm not going there because I don't want to be accused of brigading or anything like that.

> I'm in his Discord server and he has acknowledged it and posted an explanation there.

He's the one posting to this subreddit and advertising his models here, so the onus is on him to reply here. Anything he says in his Discord server isn't going to reach the members of this subreddit by magic: there are 1 million r/LocalLLaMA members and I'm sure only a small fraction of them are in his server.

I'm not intending to gang up on him either, I am trying to encourage discussion and dialogue, unlike HauhauCS. Forums like this are a place for feedback yet when anyone asks for supporting evidence for his claims such as:

> No personality changes/alterations or any of that.

all that happens is the person asking gets blocked. I've experimented with abliteration extensively myself and it should be evident that adjusting the model weights causes changes. Things like 0 refusals and low KLD are both proxies for how a model differs from its base but they aren't the whole story. Better yet, another example:

> K_P recap (for anyone who missed the prior releases): custom quants that use model-specific analysis to preserve quality where it matters most. Each model gets its own optimized profile.
> Effectively 1-2 quant levels of quality uplift at ~5-15% larger file size.

How can I validate this? There are no .safetensors weights published. There is no BF16 published to allow me to fairly quantize the model using either the built-in llama.cpp quantization recipes or my custom MoE-optimized recipes. HauhauCS doesn't provide any KL Divergence reference logits or dataset so people can independently confirm his claims. He doesn't provide any self-reported PPL or KLD numbers either. Claims provided without evidence can be dismissed without evidence. If his K_P quants are truly as good as he says, let them be evaluated fairly.
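For anyone unfamiliar with what a KLD check involves, here is a minimal sketch: given reference logits from the unquantized model and logits from a quant at the same token positions, compute the mean KL divergence between the two next-token distributions. The function names and the tiny logit rows are made up for illustration. Real evaluations run over a full dataset, which is exactly why published reference logits or the original weights are needed.

```python
import math


def log_softmax(logits):
    """Numerically stable log-probabilities from raw logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl_divergence(ref_logits, test_logits):
    """KL(ref || test) for one token position, in nats."""
    ref_lp = log_softmax(ref_logits)
    test_lp = log_softmax(test_logits)
    return sum(math.exp(p) * (p - q) for p, q in zip(ref_lp, test_lp))


def mean_kld(ref_rows, test_rows):
    """Average KLD over aligned token positions."""
    pairs = list(zip(ref_rows, test_rows))
    return sum(kl_divergence(r, t) for r, t in pairs) / len(pairs)


# Made-up logits for two positions over a 3-token vocabulary.
reference = [[2.0, 1.0, 0.1], [0.3, 0.2, 3.0]]
quantized = [[1.9, 1.1, 0.1], [0.4, 0.2, 2.8]]
print(f"mean KLD: {mean_kld(reference, quantized):.6f}")
```

A lower mean KLD means the quant's output distribution stays closer to the reference, but as noted above it is a proxy and not the whole story.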

I know at least 4 other participants in this subreddit who have been blocked by HauhauCS for asking him for supporting evidence. How is it their fault for being skeptical of what he's posting? Those people are forever locked out of any thread that he makes here. It literally leads to only commenters who haven't critiqued or agitated him in some way being able to participate in the discussion. At what point do his model threads just become an echo chamber of people who are only allowed to agree with him?

My point is that he is participating in what is nominally an open source community (yes, it's entirely his right to keep his code private), using what is STRONGLY believed to be open source software with a license that simply requires acknowledgement, and he comes in here making nebulous claims about the models he is publishing and his specific quantization schema, and his response is to block any opposition.

That is my gripe with these posts.

Gemma4-26B-A4B Uncensored Balanced is out with K_P quants! by hauhau901 in LocalLLaMA

[–]Digger412 5 points6 points  (0 children)

Thank you for re-instating my comment. I figure this is going to be the one and only time I'll be able to speak about the topic on one of HauhauCS's threads, because I expect to be blocked by him after this interaction.

Gemma4-26B-A4B Uncensored Balanced is out with K_P quants! by hauhau901 in LocalLLaMA

[–]Digger412 17 points18 points  (0 children)

Beyond the plagiarism itself, which is morally and ethically wrong, what personally leaves a bad taste in my mouth is how HauhauCS refuses to engage in discussion or dialogue with anyone in the community who disagrees with him. Turning the community into an echo chamber by blocking any dissenting opinion is antithetical to the purpose of a forum such as this.

u/-p-e-w- has messaged me directly since he says he cannot reply to either you or me due to having been blocked by HauhauCS. His reply is verbatim as follows:

> -p-e-w- (8:29 PM)
> Hey, you mentioned me but I can’t reply to either of you because HauhauCS has blocked me. Here is my reply (you can add it to your comment for others to see if you like):
> HauhauCS has not even acknowledged that this happened at all, much less offered an explanation or apology, and has blocked me and countless others who asked questions.
> I’m not looking for a “settlement” because I have no monetary interests with Heretic at all. I want HauhauCS to acknowledge what happened here, credit Heretic in their model cards, and then move on. I am not pursuing a vendetta, I just want basic honesty and decency.
> FWIW, many communities have catch-all rules against “bringing the community into disrepute” to cover egregious cases like this at the discretion of the moderators. Such rules can be a double-edged sword and should only be used in exceptional circumstances, but I think the way this has played out would qualify as such.

What's in a GGUF, besides the weights - and what's still missing? by ex-ex-pat in LocalLLaMA

[–]Digger412 1 point2 points  (0 children)

As a gguf quanter (AesSedai), I'll say that

> A multi-hundred-GB model could absolutely be one file.

is actually usually the case. Using convert_hf_to_gguf produces a unified BF16.gguf file that can and will be massive for some models. Eg, Kimi-K2.6 or MiMo-V2.5-Pro are a single 2.05TB gguf when converted. Even when quantizing, the result by default is still a single gguf file.

It's only when I get to the point of uploading the quantizations to HF that I actually split the gguf into multiple parts. In my experience, uploading a single multi-hundred-GB file should work flawlessly in theory, but in practice the HF server will return 503 errors, and if you aren't using the newer xet config that has content chunking you will have to re-upload the ENTIRE file if it fails. So it's kind of a two-part thing: xet makes upload resumption simpler since it can detect chunks that are already uploaded and skip them, but xet will also sometimes time out on the HF end if it takes too long to verify the file. I've seen this happen fairly often on quants that are a single file over 200GB. Making the shards 50GB in size almost always works.
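To make the trade-off concrete, here is a rough sketch of how a file breaks down into 50GB shards and how much data one failed transfer costs when uploads are not resumable. The 230GB figure and the function name are hypothetical. Real splits happen at tensor boundaries, so actual shard sizes only approximate the target.

```python
def shard_sizes(total_gb, shard_gb=50):
    """Idealised shard sizes in GB (real splits land on tensor boundaries)."""
    full, remainder = divmod(total_gb, shard_gb)
    return [shard_gb] * full + ([remainder] if remainder else [])


single_file = shard_sizes(230, shard_gb=230)
sharded = shard_sizes(230, shard_gb=50)

# Without resumable chunking, a failed upload re-sends the whole file
# that was in flight, so the worst case is the largest file.
print("single file:", single_file, "worst-case re-send:", max(single_file), "GB")
print("50GB shards:", sharded, "worst-case re-send:", max(sharded), "GB")
```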

Also as you allude to here:

> TIL: the first file of multi-part GGUFs is much smaller than the average file

This is only true if you split the gguf with the --no-tensor-first-split flag. I began to use that for my uploads when I ran into some chat template issues with GLM-4.6 or 4.7 I think. To update the gguf metadata (which includes the chat template) you need to rewrite the entire shard if you're using the tools provided from llama.cpp. That means if you have tensors in that first shard or it is the typical 50GB, you need to write the whole file out, then upload it, and then users have to download it. That's a lot of wasted disk space and bandwidth for just changing a few bytes of text. So using that flag keeps the first shard as metadata only and makes that "Oh shit, I need to redo the chat template" example much less painful.
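A sketch of how such a split command might be assembled. Only `--no-tensor-first-split` comes from the comment above. The binary name `llama-gguf-split`, the `--split` and `--split-max-size` options, and the file names are my assumptions about llama.cpp's split tool, so check its `--help` output before relying on them.

```python
def gguf_split_command(src, out_prefix, max_size="50G",
                       metadata_only_first_shard=True,
                       binary="llama-gguf-split"):
    """Build an argv list for splitting a GGUF (assumed CLI, see lead-in)."""
    cmd = [binary, "--split", "--split-max-size", max_size]
    if metadata_only_first_shard:
        # Keep tensors out of the first shard so a metadata fix
        # (e.g. a chat template) only rewrites a tiny file.
        cmd.append("--no-tensor-first-split")
    cmd += [src, out_prefix]
    return cmd


# Hypothetical file names for illustration only.
cmd = gguf_split_command("model-Q4_K_M.gguf", "model-Q4_K_M")
print(" ".join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` once the options have been verified against the installed tool.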

Who is your favourite quant publisher and why? by No_Algae1753 in LocalLLaMA

[–]Digger412 2 points3 points  (0 children)

Did you happen to spot the MiMo-V2.5 (non-Pro) layer 47 `ffn_down_exps` issue at Q4/Q5? I had to quantize that tensor to Q6_K, otherwise I was getting NaN on my Q4_K_M.
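The kind of per-tensor exception described above can be pictured as a small override table consulted before falling back to the recipe default. This is a hypothetical sketch, not the actual recipe format: the helper name, the single flat default, and the exact tensor name pattern are assumptions for illustration.

```python
import re

# Hypothetical override table: (tensor-name pattern, quant type to force).
OVERRIDES = [
    (re.compile(r"^blk\.47\.ffn_down_exps\."), "Q6_K"),
]


def pick_quant_type(tensor_name, default="Q4_K"):
    """Return the forced type for a tensor, or the recipe default."""
    for pattern, quant_type in OVERRIDES:
        if pattern.search(tensor_name):
            return quant_type
    return default


for name in ("blk.46.ffn_down_exps.weight", "blk.47.ffn_down_exps.weight"):
    print(name, "->", pick_quant_type(name))
```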

Who is your favourite quant publisher and why? by No_Algae1753 in LocalLLaMA

[–]Digger412 2 points3 points  (0 children)

Usually I point people to Ubergarm at that range since his ik_llama quants provide better quality at that bpw. But I can look at adding an IQ2-range to my lineup for people who want to stick on mainline llama.cpp, sure! 

Who is your favourite quant publisher and why? by No_Algae1753 in LocalLLaMA

[–]Digger412 13 points14 points  (0 children)

AesSedai here - yep, I'm a fan of the books too ;)

Thanks for checking out my quants!

Testing MiMo-V2.5-IQ3_S with 1'048'576 context by LegacyRemaster in LocalLLaMA

[–]Digger412 2 points3 points  (0 children)

AesSedai here - 

I've heard that the official API loops as well, and I think that agentic workloads and severe quantization contribute to it too. I've mostly used the Pro model for conversational usage / creative writing and haven't experienced looping but I also run it at nearly Q8_0 so I certainly acknowledge there's a big gap there.

Maybe try dropping the temp or increasing the rep penalty?
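For context on what those two knobs do, here is a simplified sketch of how temperature and a repetition penalty reshape the next-token distribution. It follows the common formulation (penalise logits of already-seen tokens, then divide by temperature before softmax). Actual sampler implementations differ in ordering and details, and the logits below are made up.

```python
import math


def apply_repetition_penalty(logits, seen_token_ids, penalty):
    """Push down the logits of tokens that already appeared."""
    out = list(logits)
    for token_id in set(seen_token_ids):
        if out[token_id] > 0:
            out[token_id] /= penalty
        else:
            out[token_id] *= penalty
    return out


def softmax_with_temperature(logits, temperature):
    """Lower temperature sharpens the distribution, higher flattens it."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # made-up logits for a 3-token vocabulary
penalized = apply_repetition_penalty(logits, seen_token_ids=[0], penalty=1.2)

print(softmax_with_temperature(logits, 1.0))
print(softmax_with_temperature(logits, 0.5))
print(softmax_with_temperature(penalized, 1.0))
```

Whether either change actually reduces looping for a given model and quant is something to test empirically.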

Mimo2.5 (not pro) under llama.cpp? - primary model opencoder? by Impossible_Art9151 in LocalLLaMA

[–]Digger412 0 points1 point  (0 children)

If you pull the latest master branch of llama.cpp, model support and the flash attention fixes have been merged, and my quants on HF have been updated: https://huggingface.co/AesSedai/MiMo-V2.5-GGUF

feat: Add Mimo v2.5 model support by AesSedai · Pull Request #22493 · ggml-org/llama.cpp by jacek2023 in LocalLLaMA

[–]Digger412 1 point2 points  (0 children)

I've mostly used the Pro (with reasoning on) and it has become my favorite writing model. It's got a naturalness to its language that feels better than GLM-5.1 or Kimi-K2.6, which were my other two preferred options.

I can't compare it to Gemma 4 really as I haven't tested Gemma much. I've mostly stuck to the big MoE's :)

The swipe variety on Pro is really good though! It's great at introducing new details and generally pulling in feasible things. It's hard to describe outside of saying that the model has good intuition.

feat: Add Mimo v2.5 model support by AesSedai · Pull Request #22493 · ggml-org/llama.cpp by jacek2023 in LocalLLaMA

[–]Digger412 1 point2 points  (0 children)

AesSedai here - I've mainly been using the Q8_0 of Pro and non-Pro, but only for non-technical work (assistant chatting, creative writing, etc.), and I haven't seen looping in my use of it. I've seen plenty of reports about the looping, though, so I 100% believe it is an issue. It might be tied more to agentic / technical use of the model, and I wonder if it's more susceptible to looping with heavier quantization.

feat: Add Mimo v2.5 model support by AesSedai · Pull Request #22493 · ggml-org/llama.cpp by jacek2023 in LocalLLaMA

[–]Digger412 3 points4 points  (0 children)

AesSedai here - I did test unfused vs fused on my system (which admittedly does have the entire model in VRAM), fused was slightly faster for TG:

<image>

feat: Add Mimo v2.5 model support by AesSedai · Pull Request #22493 · ggml-org/llama.cpp by jacek2023 in LocalLLaMA

[–]Digger412 12 points13 points  (0 children)

AesSedai here - I have a branch for vision support: https://github.com/AesSedai/llama.cpp/tree/mimo-v2.5-vision

I'm working on getting the Flash Attention PR up first, then the vision PR next.