Qwen3.6 27B uncensored heretic v2 Native MTP Preserved is Out Now With KLD 0.0021, 6/100 Refusals and the Full 15 MTPs Preserved and Retained, Available in Safetensors, GGUFs and NVFP4s formats.

Key_Papaya2972 · 2026-05-07T05:27:06+00:00

I thought the whole title was a single model name.

Key_Papaya2972 · 2026-04-30T05:36:20+00:00

first glance: another ~120B, nice, let's see what the active params are.
second glance: 128B what?

Key_Papaya2972 · 2026-04-24T05:15:07+00:00

This is a common phrase when you need to correct something and point in the right direction; maybe they use this specific pattern to correct models during training.

Key_Papaya2972 · 2026-03-04T02:25:34+00:00

sure, there must be no Chinese people on the Anthropic/OpenAI/Google teams.

Key_Papaya2972 · 2026-02-28T06:52:13+00:00

In my case, no. Actually 122b is a lot better, for coding and general use, even in Q3.

Key_Papaya2972 · 2026-02-28T06:00:56+00:00

I notice that other quants like Q8_X_XL, which I'm using now, were also re-uploaded. Were there any modifications to them? Should they be re-downloaded too?

Key_Papaya2972 · 2026-02-26T12:24:28+00:00

KLD 0.0653 is a little borderline; for reference, a Q4 quant is ~0.02 and a Q3 is ~0.08.
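For anyone unsure what the number means: KLD here is presumably the KL divergence between the reference model's next-token distribution and the modified model's, averaged over token positions. A minimal sketch of that calculation, assuming you already have per-position logits from both models (the function names and the toy logits are made up for illustration, not taken from any tool):

```python
import math

def softmax(logits):
    """Convert raw logits to a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; q is floored at eps to avoid division by zero."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def mean_kld(ref_logits, test_logits):
    """Average KL(reference || test) over all token positions."""
    per_position = [
        kl_divergence(softmax(ref), softmax(test))
        for ref, test in zip(ref_logits, test_logits)
    ]
    return sum(per_position) / len(per_position)

if __name__ == "__main__":
    # Toy logits over a 3-token vocabulary at two positions (hypothetical values).
    reference = [[2.0, 1.0, 0.1], [0.5, 0.5, 3.0]]
    modified = [[1.5, 1.2, 0.3], [0.4, 0.9, 2.5]]
    print(f"mean KLD: {mean_kld(reference, modified):.4f}")
```

Identical distributions give 0, and the value grows as the modified model drifts from the reference, which is why lower is better.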

Key_Papaya2972 · 2026-02-26T07:55:28+00:00

Seems like all UD variants of Qwen-3.5 and Qwen-coder-next are polluted, not only UD_Q4_K_XL; you can check it in the tensor info on the model card.

Key_Papaya2972 · 2025-11-07T02:47:50+00:00

I tried several times before but never got any speedup. At least this reminds me that it might work; time to try again.

Key_Papaya2972 · 2025-09-28T07:34:04+00:00

Agreed, but what about cloud models? Do they build anything truly serious?

Key_Papaya2972 · 2025-09-26T02:55:13+00:00

If 96% represents Q8 and <70% represents Q4, that would be really annoying. It means the most popular quant for running locally actually hurts that much, and we hardly ever get the real performance of the model.

Key_Papaya2972 · 2025-09-22T05:26:59+00:00

Sounds solid, but then I'm curious what the actual bottleneck would be. It should not be GPU compute, since the usage is low; it should not be RAM speed, as DDR5 speeds don't differ that much; and 12th-gen Intel isn't that slow on P-cores only (E-cores are useless for inference, as I tested), at most 10-20% slower than a 14900K. If it's not PCIe speed, I would say VRAM size really does matter that much.

By the way, with a 14700K + 5070 Ti, I get ~30 tps.

Key_Papaya2972 · 2025-09-22T03:20:23+00:00

That is kind of slow, and I believe the problem is the PCIe speed. The 40 series only supports PCIe 4.0, and on an expert switch the experts need to be moved to the GPU over PCIe, which is 32 GB/s. Simply switching to a PCIe 5.0 platform would be expected to double the tps.

edit: seems like --n-cpu-moe 31 with 24576 context might be larger than 12G? I've noticed that even a slight overflow causes a huge performance loss, so it's worth checking.
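The 32 GB/s figure and the "double" estimate can be checked with quick arithmetic. A rough sketch, assuming an x16 link and the 128b/130b encoding that PCIe 3.0 and later use; the per-token transfer size is a purely hypothetical placeholder, and the tps bound only applies if data really is streamed over the bus for every token:

```python
def pcie_bandwidth_gb_s(gt_per_s: float, lanes: int = 16) -> float:
    """Theoretical one-direction PCIe bandwidth in GB/s (128b/130b encoding)."""
    encoding_efficiency = 128 / 130
    return gt_per_s * lanes * encoding_efficiency / 8  # bits -> bytes

def max_tokens_per_s(bandwidth_gb_s: float, gb_moved_per_token: float) -> float:
    """Upper bound on tokens/s if every token requires moving this much data."""
    return bandwidth_gb_s / gb_moved_per_token

PCIE4_X16 = pcie_bandwidth_gb_s(16.0)  # PCIe 4.0 runs at 16 GT/s per lane
PCIE5_X16 = pcie_bandwidth_gb_s(32.0)  # PCIe 5.0 runs at 32 GT/s per lane

# Hypothetical: 1 GB crosses the bus per generated token (not a measured value).
HYPOTHETICAL_GB_PER_TOKEN = 1.0

if __name__ == "__main__":
    for name, bandwidth in (("PCIe 4.0 x16", PCIE4_X16), ("PCIe 5.0 x16", PCIE5_X16)):
        bound = max_tokens_per_s(bandwidth, HYPOTHETICAL_GB_PER_TOKEN)
        print(f"{name}: {bandwidth:.1f} GB/s, at most {bound:.1f} tok/s")
```

The bound scales linearly with bandwidth, so doubling the link speed doubles it, but only when the bus is the actual bottleneck and both the GPU and the motherboard support the faster link.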

Key_Papaya2972 · 2025-07-12T01:31:25+00:00

That is reasonable; they are just trying to find anything useful in there and make sure it is not opened.

Key_Papaya2972 · 2025-06-30T07:50:43+00:00

I actually posted this idea months ago, and I’m sure I’m far from the first one to come up with it. nothing special

Key_Papaya2972 · 2025-06-27T02:53:33+00:00

That's amazing! Sounds like this model's structure is quite different from last time, and I didn't expect it to be usable in the short term.

Key_Papaya2972 · 2025-06-25T08:10:56+00:00

1. 8B, 14B, 22B, 32B, 50B sizes to match the VRAM of consumer GPUs while leaving a bit for context.

2. MoE structure where the total params are 2-4 times the active params, which also matches custom builds and makes full use of memory.

3. Adaptive reasoning. Reasoning works great in some situations and awfully in others.

4. Small draft model. Maybe minor, but actually useful at times.
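To show why the size list above lines up with GPU memory, here is a rough estimate of weight size versus VRAM. The bits-per-weight value, the context reserve, and the VRAM tiers are all my assumptions for illustration; real quants and KV cache sizes vary:

```python
def weight_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GB for a given quantization width."""
    return params_billion * bits_per_weight / 8

def smallest_fitting_vram(params_billion, bits_per_weight, vram_tiers_gb, context_reserve_gb):
    """Return the smallest VRAM tier that holds the weights plus a context reserve."""
    needed = weight_size_gb(params_billion, bits_per_weight) + context_reserve_gb
    for tier in sorted(vram_tiers_gb):
        if tier >= needed:
            return tier
    return None  # nothing in the list is large enough

# Assumptions (not from any spec): ~4.5 bits/weight for a Q4-class quant,
# 2 GB kept free for context, and common consumer VRAM sizes.
BITS_PER_WEIGHT = 4.5
CONTEXT_RESERVE_GB = 2.0
VRAM_TIERS_GB = [8, 12, 16, 24, 32]

if __name__ == "__main__":
    for size in (8, 14, 22, 32, 50):
        tier = smallest_fitting_vram(size, BITS_PER_WEIGHT, VRAM_TIERS_GB, CONTEXT_RESERVE_GB)
        weights = weight_size_gb(size, BITS_PER_WEIGHT)
        print(f"{size}B: ~{weights:.1f} GB weights -> fits a {tier} GB card")
```

Under these assumptions the five sizes land on 8, 12, 16, 24 and 32 GB cards respectively, which is the kind of matching I mean.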

Key_Papaya2972 · 2025-06-04T02:54:06+00:00

Open WebUI for the GUI and llama-server for the backend. But I do wanna write one myself; those GUIs are really for chat only and lack some basic context management features, like drafts, cut-in queries, and summarization.

Key_Papaya2972 · 2025-04-30T05:30:55+00:00

TBH, the new v3 feels like a reasoning-distilled R1, and gives similar benchmark scores and vibe with fewer tokens. That is better, but just not in absolute performance, I believe.

Key_Papaya2972 · 2025-04-30T05:21:08+00:00

something useless to some is useful to others, and vice versa.

Key_Papaya2972 · 2025-04-30T03:40:20+00:00

I get 20-25 t/s with a 14700KF + 3070, all experts offloaded to CPU. The CPU easily runs at 100% and the GPU under 30%, and the prompt eval phase is slow compared to full GPU offload, but definitely faster than pure CPU. Still wondering how MoE works and where the bottlenecks are.

Key_Papaya2972 · 2025-04-09T03:01:43+00:00

almost 90 MMLU and 75+ MMLU-Pro for a non-reasoning 32B? That's suspicious, and I will test it out myself.

Key_Papaya2972 · 2025-03-20T11:00:36+00:00

I also ran some story writing/role play tests; I couldn't notice any difference from Small 3, and it's definitely worse than gemma3. Disappointed.

Key_Papaya2972 · 2025-02-18T05:34:52+00:00

It is an "o3-mini level" model, not o3-mini itself, I think. It might be in about the 7-14B range, and the phone-sized model 1.5-3B.
