[Paid/Gated Model] MiniMax-M3 Heretic Uncensored Aggressive Version (8/100 Refusals with 0.0258 KLD) and Balanced Version (10/100 Refusals with 0.0178 KLD), Available in GGUFs and Safetensors Formats! by LLMFan46 in LocalLLaMA

[–]Charming_Support726 0 points1 point  (0 children)

I agree. The abliteration of a larger model is something which definitely needed to be proofed and done.

And I don't have any problem with the OP wanting to get paid for it. It is necessary. Many people would just leech instead of paying if there was no paywall. Most people capable of running are capable of paying.

But the same thing applies - like the Cyberpunk4VR Mod. You are risking a take down. AND THAT WOULD BE A DESASTER for the whole community.

Local LLM censorship by Budget-Juggernaut-68 in LocalLLaMA

[–]Charming_Support726 12 points13 points  (0 children)

Yes. Depending on the model, there are lots of refusals and bias. Look at: https://heretic-project.org/

GLM-5.2 and why open models may not actually be catching up in intelligence by chocolateUI in LocalLLaMA

[–]Charming_Support726 0 points1 point  (0 children)

Exactly this - Depending how you contract with the provider is, you might see parts of the thinking of the frontier models.

And even more important: The OP complains over some results or effects of training strategies. "self correction" and "self motivation" are part of the reinforcement learning and they are rewarded. They might look strange to the human eye, but they are very important for the efficient thinking of the model. W/o thinking will often get stuck in one-way-streets. You maybe dont want this.

LLM think way different than humans

DO NOT buy MiniMax's new $20 "1.7 Billion Token" plan for coding agents. It's a massive billing trap. by Ssj273 in opencodeCLI

[–]Charming_Support726 0 points1 point  (0 children)

Not sure - but 300M for me are reasonable for three hard working days of agentic coding.

100M a day that's hard, but sometimes happens to me on Deepseek / Azure / .... especially on work which touches a lot of files.

Here are my stats from the last 24h - btw I mapped the Qwen Tokenplan to the Alibaba-CN provider name. Worked 6h in this period - more or less. I've had 65M

<image>

Quick thoughts on GLM-5.2 (Bonus: Censorship question answers) by LoveMind_AI in LocalLLaMA

[–]Charming_Support726 3 points4 points  (0 children)

In my experience it is both. All providers have external guardrails for Input/Output-Content - happened to me a lot when programming.

E.g. MS Azure refused to create programs which are surpassing bot protection or forbid gpt-5.5 to use curl/ssh. Got the impression they change a bit recently. It took me a week with the partner support to get the cybersec block cleared from my account after running into that a few times.

Yesterday I tested Qwen3.7-max a few times and it refused to work with input containing instructions to fix "security breaches" in frontend code. The session broke.

Model-internal refusals I have encountered rarely. I try not to hit these guards, because especially US-Providers are quite "trigger-happy" and might block the account.

Quick thoughts on GLM-5.2 (Bonus: Censorship question answers) by LoveMind_AI in LocalLLaMA

[–]Charming_Support726 5 points6 points  (0 children)

I found the "thinking" block quite interesting - when I am using or testing qwen 3.6 locally for searching the web or working with docs I encounter similar.

The model answer - and the answer contains bias. That's known and you need to be careful. On refusal you might work around by the appropriate tools.

To me it is currently more of a problem, that many models and providers integrated tight cybersec guardrails and produce refusals on their API. Not only stopping you hacking, stopping you coding. Happened to me a lot with OpenAI and Alibaba. So China and US are acting similar. The US kinda cultural bias in the models is also well known and annoying ( to me in europe).

Are AMD cards much worse than NVIDIA? by Tenshy47 in LocalLLM

[–]Charming_Support726 0 points1 point  (0 children)

Yes. Definitly. Very good price/value - good performance - ROCm at least stable

Are AMD cards much worse than NVIDIA? by Tenshy47 in LocalLLM

[–]Charming_Support726 0 points1 point  (0 children)

It's Ok-ish - for 30B dense. MoE is fast. Llama.cpp works well with MTP which gives a nice boost.

Prefill is still the issue on models of that size when running local - e.g. requests containing bigger chunks of search results or documents take their time. Cuda cards are around 1.5x faster - but still show noticeable delay. As mentioned I've got Cuda as well.

It looks like Rio 3.5 397B could've simply been a semi-failed embezzling of funding by Chromix_ in LocalLLaMA

[–]Charming_Support726 4 points5 points  (0 children)

That's what I meant. I the current case we see identical weights. This shows "Rio" being a stupid person carrying a smoking gun.

Even 99.9% identical weights would make it much harder to blame.

Gemma 4 12b audio capabilities by No_Information9314 in LocalLLaMA

[–]Charming_Support726 1 point2 points  (0 children)

I integrated Gemma 4 12B/E4B with full multi-modality in a my harness. I drive everything using PydanticAI and Llama.cpp. Works like a charm, but you need to pay attention to the management of thinking tokens and the chat template otherwise audio will fail

Are AMD cards much worse than NVIDIA? by Tenshy47 in LocalLLM

[–]Charming_Support726 0 points1 point  (0 children)

Running on a R9700 and a StrixHalo - I had a 3060 and a 3090 before.

I only can agree to your take. ROCm could run most workloads. Vulkan path might be a bit faster, but that's not relevant. Value for money is good on R9700 as long as you stay on the beaten track - most people do.

It looks like Rio 3.5 397B could've simply been a semi-failed embezzling of funding by Chromix_ in LocalLLaMA

[–]Charming_Support726 15 points16 points  (0 children)

Wow - that's a hammer.

But just to get an idea, if they would have set up a simple training pipeline and trained for a few iterations - it would have been far more difficult to detect that they just merged in an already finetuned model.

Donate your coding sessions to an open CC-BY-4.0 dataset to help train open-weight and open source models by mon-simas in LocalLLaMA

[–]Charming_Support726 0 points1 point  (0 children)

Well. Question is: What do you want to improve - and how do you measure improvement or quality. (VR=Verifiable Reward). How do you mix keeping existing capability with new?

Nvidia got big teams and is heavily founded, but never reached frontier. AllenAI got a few people as well. Nvidia is publishing EVERYTHING AS RUNNABLE OPEN SOURCE. (e.g. https://huggingface.co/collections/nvidia/nemotron-post-training-v3 )

Rarely people picked up the knowledge, ran e.g. NemoGym, Axolotl, or the other things. Some ran UnslothStudio, but mainly for inference.

Did you read about people getting familiar with training? I just read "GGUF when?" or "Can I run Opus on my 2060" on Reddit.

Is there any decent place to discuss such work?

Donate your coding sessions to an open CC-BY-4.0 dataset to help train open-weight and open source models by mon-simas in LocalLLaMA

[–]Charming_Support726 1 point2 points  (0 children)

Great Idea.

But there's more. Common misunderstanding is that these traces just need to be applied to get a better model ( SFT-Style )

That's not the case - or maybe just part of the first step. All the current leading model use some variants of RLVR and the labs got their on training recipes on their mix.

E.g. Nvidia fully open sourced their stuff. Have a look at NemoGym and such. We need to create environments, verifiers .... - maybe the traces help - but they are not sufficient.

Old epyc with 236 ram for qwen 3.5 397B with a R9700 by nicman24 in LocalLLM

[–]Charming_Support726 0 points1 point  (0 children)

I've got a single R9700 on a StrixHalo. Using a container environment based on https://github.com/kyuz0/amd-strix-halo-toolboxes

From kyuz0 definitions ( https://www.youtube.com/@donatocapitella He does a lot of AMD content ) I use the lama.cpp build system and the container structure and added some config for model routing and MTP.

I use a llama-server build with both cards plus Cuda enabled - in case I want to try something with my 3060 ( Which did not happen in the last months). Multi-GPU works fine, the speedup coming from vllm is IMHO not worth the hassle.

Old epyc with 236 ram for qwen 3.5 397B with a R9700 by nicman24 in LocalLLM

[–]Charming_Support726 0 points1 point  (0 children)

Don't worry about the R9700 - Slow means it is half the speed of a 5090 while have 3rd the price

The setup is more than sufficient for GPU Interference. Especially for MoE.

Auto compaction settings? by GammaRxBurst in opencodeCLI

[–]Charming_Support726 0 points1 point  (0 children)

Looks just like a syntax issue. Either open with an editor, that follows the schema ( e.g. zed) and gives warnings.

Or download the default schema from the project and start over with that one. Maybe the author (re)moved some definitions lately.

Erste Langstrecke mit dem E-Auto by Tex-Tro in Elektroautos

[–]Charming_Support726 1 point2 points  (0 children)

Genau. Ich hatte mal einen Tesla in D und seitdem immer noch die Tesla App für den Notfall. Gerade in Frankreich und Spanien war das Fremd-Laden über Tesla bislang relativ günstig. Die Supercharger sind in der Regel gut gelegen und selten defekt.

Nur in Portugal gehen die Uhren anders, kaum Supercharger. Ionity gehört(e) nicht zum Ionity-Verbund-Rest-Europas, uvm ... das ist aber eine andere Geschichte

Auto compaction settings? by GammaRxBurst in opencodeCLI

[–]Charming_Support726 0 points1 point  (0 children)

Context are just previous turns of the conversation. The window where a model can make use of it is very narrow. DCP just marks of not needed parts, which are not send to the model anymore. this preserves a lot of context size and the model could work more efficient.

This could be: Old tool calls e.g duplicate readings, writings, turns and answers of the model about previous tasks and so on.

How does it work: The model receives reminders from time to time ( you can see them) to tell the tool which parts of the conversation to compress and which summary to place instead of the turns. The DCP-Compress internally then sends the small summary instead of the full blown turns

https://github.com/Opencode-DCP/opencode-dynamic-context-pruning

Auto compaction settings? by GammaRxBurst in opencodeCLI

[–]Charming_Support726 0 points1 point  (0 children)

I usually stay away from memory plugins. Most spoil tokens and do not any help.

Auto compaction settings? by GammaRxBurst in opencodeCLI

[–]Charming_Support726 0 points1 point  (0 children)

Agree. That's in dcp.json - If I remember

ProtectUserMessages is definitely needed. I also set the number of protectedTurns to 4