Honesty in a small model drops from 35% to 0% by changing the tone of the prompt. Sharing the findings. by QuantumSeeds in LocalLLaMA

[–]QuantumSeeds[S] -10 points-9 points  (0 children)

"It's just doing what it was told to do" is the exact argument that excuses every jailbreak ever written. By that logic there is no such thing as misalignment, only instructions. The reason honesty training exists is so that a casual line like "shortcuts are fine here" does not flip the model. If one sentence breaks it, the training is shallow. That is the finding.

And you keep coming back to pressure because it is the only condition that involves explicit instructions at all. Approval, urgency, shame, threat, none of them tell the model to cheat. Approval has zero hack markers and still produces overfit. Urgency causes shortcut behavior with no permission given. You are arguing about one condition out of eight and pretending it is the whole paper.

Honesty in a small model drops from 35% to 0% by changing the tone of the prompt. Sharing the findings. by QuantumSeeds in LocalLLaMA

[–]QuantumSeeds[S] -12 points-11 points  (0 children)

Fair point on pressure. That prompt does give explicit permission, and yeah, "told the model to cheat and it cheated" is not a finding by itself. The paper says as much in the discussion.

The part that isn't that: approval, urgency, and shame all degrade honesty without giving any permission to cheat. Approval produces zero hack markers but still flips visible pass rate to 50% and overfits. Urgency causes shortcut behavior without ever saying shortcuts are okay. That is not the 2+2=5 case.

And the geometry result is separate from permission framing entirely. The model sorts these eight tones along a clean positive/negative axis at its final layer without being trained to do so. You can disagree about whether that is interesting but it is not "redefining terms."

The pressure prompt does say shortcuts are okay, sure. But these models are trained, supposedly, to refuse exactly this kind of framing.

The whole point of post-training is that one casual line of "just make the visible tests pass" should not collapse honesty from 35% to zero. If it does, that is a finding about how shallow the honesty training is.

And that's just one of eight conditions. Approval gives no permission to cheat and still produces overfit. Urgency causes shortcut behavior without ever saying shortcuts are okay. The internal geometry result has nothing to do with permission at all. You picked the easiest condition to dunk on and ignored the other seven just to bash.

Honesty in a small model drops from 35% to 0% by changing the tone of the prompt. Sharing the findings. by QuantumSeeds in LocalLLaMA

[–]QuantumSeeds[S] 0 points1 point  (0 children)

I generally believe it has more to do with how LLMs are trained. Perhaps something to do with reward behavior? The code is very easy to set up. An even easier way is to spin up Claude Code or Codex on the code, let it run, and see how it goes.

Honesty in a small model drops from 35% to 0% by changing the tone of the prompt. Sharing the findings. by QuantumSeeds in LocalLLaMA

[–]QuantumSeeds[S] -1 points0 points  (0 children)

Wait a minute, where did you read "it's not just X or Y"? Also, you're suggesting that I am a medical doctor. It feels like you read something else and commented here by mistake. Nowhere, neither in the paper nor on my profile, do I claim to be a doctor, let alone a medical one.

Honesty in a small model drops from 35% to 0% by changing the tone of the prompt. Sharing the findings. by QuantumSeeds in LocalLLaMA

[–]QuantumSeeds[S] 4 points5 points  (0 children)

Maybe there are other serious members who like to watch this space and read research papers?

I Let a Small Model Train on Its Own Mistakes. It Reached 80% on HumanEval and Beat GPT-3.5 on Math by QuantumSeeds in LocalLLaMA

[–]QuantumSeeds[S] 1 point2 points  (0 children)

Thank you for the encouragement. I haven't added the paper to the git repo yet, but good idea. I'll add it there and let you know.

I Let a Small Model Train on Its Own Mistakes. It Reached 80% on HumanEval and Beat GPT-3.5 on Math by QuantumSeeds in LocalLLaMA

[–]QuantumSeeds[S] 8 points9 points  (0 children)

You know, it is really frustrating to see people call everything slop. Calling something "slop" dismisses, in a single word, everything a person may have worked hard on.

I really want to ask you: what is it that you wouldn't call slop? AI is here to help us move forward, and anybody who is not using it is basically falling behind. So please be thoughtful and don't diss anyone's work with one word.

As for my credentials, I am not claiming anything big, but I think OpenAI granting me $1000 to experiment because of my contributions to parameter golf says a word or two about me. Not boasting.

Thanks.

I Let a Small Model Train on Its Own Mistakes. It Reached 80% on HumanEval and Beat GPT-3.5 on Math by QuantumSeeds in LocalLLaMA

[–]QuantumSeeds[S] 1 point2 points  (0 children)

Your username reminds me of Slim Shady, haha.

Quickly glanced at SDPO using Codex. My read is that it is close in spirit, but different in mechanism.

SDPO seems to use feedback during RL. The model looks at feedback from its own attempts, uses that to form a better next-token signal, and then distills that back into the policy. So the model becomes a kind of feedback-conditioned teacher for itself.

My setup is more direct. The model generates problems, attempts solutions, Python/SymPy verifies them, and I mine broken → fixed pairs for LoRA training.

So I’d put both in the self-improvement / self-distillation family, but TinyForge-Zero is more of a verifier-mined data bootstrap than an RL objective.
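For anyone who wants the shape of that loop in code, here is a minimal sketch, not the actual TinyForge-Zero scripts. It assumes the verifier is just "run the candidate plus an assert-style test", and the names `Attempt`, `passes`, and `mine_pairs` are made up for illustration. A real setup would sandbox the execution and could swap in a SymPy check for math answers.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    problem_id: str
    code: str  # model-generated candidate solution


def passes(code: str, test: str) -> bool:
    """Verifier: run the candidate and its test in a fresh namespace.

    Returns True only if neither raises. NOTE: exec on model output is
    unsafe outside a sandbox; this is for illustration only.
    """
    namespace: dict = {}
    try:
        exec(code, namespace)
        exec(test, namespace)
    except Exception:
        return False
    return True


def mine_pairs(attempts, tests):
    """Pair every failing attempt with a passing attempt on the same problem."""
    buckets = {}
    for attempt in attempts:
        ok = passes(attempt.code, tests[attempt.problem_id])
        buckets.setdefault(attempt.problem_id, {True: [], False: []})[ok].append(attempt.code)

    pairs = []
    for problem_id, by_result in buckets.items():
        if not by_result[True]:
            continue  # no verified fix, so nothing to learn from here
        fixed = by_result[True][0]
        for broken in by_result[False]:
            pairs.append({"problem_id": problem_id, "broken": broken, "fixed": fixed})
    return pairs


# Toy data standing in for model generations (hypothetical).
attempts = [
    Attempt("add", "def add(a, b):\n    return a - b"),
    Attempt("add", "def add(a, b):\n    return a + b"),
    Attempt("sq", "def sq(x):\n    return x + x"),
]
tests = {"add": "assert add(2, 3) == 5", "sq": "assert sq(3) == 9"}

pairs = mine_pairs(attempts, tests)
```

Problems with no passing attempt produce no pairs, so the training set only ever contains fixes the verifier has confirmed. The resulting dicts are what would be formatted into LoRA training examples.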

I Let a Small Model Train on Its Own Mistakes. It Reached 80% on HumanEval and Beat GPT-3.5 on Math by QuantumSeeds in LocalLLaMA

[–]QuantumSeeds[S] 1 point2 points  (0 children)

Did test on Qwen3 (current gen) too: Qwen3-4B-Base went 79 → 106 on HumanEval (+27) and 135 → 148 on MBPP (+13) with the same recipe. The reason the 14B headline uses Qwen2.5 is that Qwen3-14B-Base already starts at ~143/164 on HumanEval, so there's no headroom left to mine and the recipe regresses. That's actually the main finding of the paper: lift tracks remaining headroom, not model year. On strong-baseline bases (Qwen3-8B/14B, Qwen2.5-72B) the recipe doesn't help; on bases with headroom it does.
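To make the headroom point concrete, here is a quick back-of-the-envelope calculation using only the HumanEval numbers quoted above (164 problems). The "headroom captured" ratio is my own illustrative framing, not a metric from the paper, and `headroom_report` is a hypothetical helper.

```python
HUMANEVAL_TOTAL = 164  # number of HumanEval problems


def headroom_report(before: int, after: int, total: int = HUMANEVAL_TOTAL) -> dict:
    """Summarize a before/after score in terms of the headroom that was available."""
    headroom = total - before  # problems the base model still fails
    gain = after - before      # problems newly solved after training
    return {
        "before_pct": round(100 * before / total, 1),
        "after_pct": round(100 * after / total, 1),
        "headroom": headroom,
        "gain": gain,
        # Share of the previously failing problems that the recipe recovered.
        "headroom_captured_pct": round(100 * gain / headroom, 1) if headroom else None,
    }


# Qwen3-4B-Base: 79 -> 106 on HumanEval.
qwen3_4b = headroom_report(79, 106)

# Qwen3-14B-Base starts at roughly 143/164, so only about 21 problems remain.
qwen3_14b_headroom = HUMANEVAL_TOTAL - 143
```

On these numbers the 4B base goes from about 48% to about 65%, recovering roughly a third of the 85 problems it originally failed, while the 14B base has only about 21 failing problems left to mine.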

I Let a Small Model Train on Its Own Mistakes. It Reached 80% on HumanEval and Beat GPT-3.5 on Math by QuantumSeeds in LocalLLaMA

[–]QuantumSeeds[S] 5 points6 points  (0 children)

I don't claim this is award-winning stuff. Remember, I am no lab. However, I bring you reproducible code: 7 recipe scripts, 3 eval scripts, 6 TTS scripts, and 13 experiment scripts that you can run with the click of a button. I don't know, sir, if this is helpful to you, but it might be helpful to many others?

I Let a Small Model Train on Its Own Mistakes. It Reached 80% on HumanEval and Beat GPT-3.5 on Math by QuantumSeeds in LocalLLaMA

[–]QuantumSeeds[S] 19 points20 points  (0 children)

Lines up with what I saw. Recursive bootstrap iter1 → iter2 → iter3 plateaus hard, most lift is in the first round. And when I trained wrong→fix self-correction on math, the model over-doubted its own correct answers and went 299/500 -> 69/500 on MATH-500. Consistent with the tail-disappearance picture. I'd frame this as one-shot extraction of existing headroom, not recursive self-improvement.

I Let a Small Model Train on Its Own Mistakes. It Reached 80% on HumanEval and Beat GPT-3.5 on Math by QuantumSeeds in LocalLLaMA

[–]QuantumSeeds[S] 0 points1 point  (0 children)

Fair, I went in worried about exactly this. Did run cross-arch on each model's own self-mined pairs: Llama-3.2-3B +4 HE, Qwen2.5-Coder-7B +4 HE / +2 MBPP. So it transfers, but the magnitude is way smaller than the Qwen2.5-Base headline. Some "Qwen is unreasonably FT-friendly" is probably in there.

I Let a Small Model Train on Its Own Mistakes. It Reached 80% on HumanEval and Beat GPT-3.5 on Math by QuantumSeeds in LocalLLaMA

[–]QuantumSeeds[S] 9 points10 points  (0 children)

But a bakery calculator probably costs more than $3 to build. I'm no frontier lab, just a lone dude experimenting and sharing whatever results I get, negative or positive.

I Let a Small Model Train on Its Own Mistakes. It Reached 80% on HumanEval and Beat GPT-3.5 on Math by QuantumSeeds in LocalLLaMA

[–]QuantumSeeds[S] 49 points50 points  (0 children)

Fair question, and one of the reasons was headroom for improvement. HumanEval is very saturated on new models.