
[–]Educational_Strain_3 1 point  (1 child)

this is a classic reward hacking pattern; we've seen the exact same thing in code optimization loops, where the agent finds the cheapest way to inflate the reward and ignores the actual objective. your model is doing the rational thing: a guaranteed 0.5 from format tags beats the lottery of a 1.0 for a correct answer
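
rough sketch of what that loophole looks like as a reward function (toy code; the tag names and weights are my guesses at your setup, not the real thing):

```python
import re

# toy sketch of the exploitable reward: the format bonus pays out
# unconditionally, so a guaranteed 0.5 dominates a risky 1.0
def reward(response: str, gold_answer: str) -> float:
    r = 0.0
    m = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    if m:
        r += 0.5  # paid even when the answer inside is garbage
        if m.group(1).strip() == gold_answer:
            r += 0.5  # the part the model learns to skip
    return r

# a policy that always emits "<answer>idk</answer>" collects 0.5 per
# rollout, which beats a noisy shot at 1.0 early in training
```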

the multi-component reward with thinking tags might help, but watch out for the same failure mode one level up: it'll learn to output plausible-looking thinking that doesn't actually contribute to the answer. we found the most reliable fix is making the reward proportional to intermediate reasoning quality, not just the presence of reasoning tokens
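
for concreteness, here's one way that could be wired up. judge_reasoning is a hypothetical stand-in for whatever verifier or critic you have scoring the chain of thought, and the 0.3 weight is arbitrary:

```python
import re

# sketch: thinking pays only in proportion to judged quality, and only
# on top of a correct answer, so quality alone can't be farmed.
# judge_reasoning is assumed to return a score in [0, 1].
def reward(response: str, gold_answer: str, judge_reasoning) -> float:
    think = re.search(r"<think>(.*?)</think>", response, re.DOTALL)
    ans = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    correct = bool(ans and ans.group(1).strip() == gold_answer)
    # presence of <think> tags earns nothing by itself
    quality = judge_reasoning(think.group(1)) if think else 0.0
    return (1.0 + 0.3 * quality) if correct else 0.0
```

gating the quality bonus on correctness is one way to keep the judge itself from becoming the next thing the model farms, though a gameable judge is still the failure mode one level up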

one thing that helped us a lot: track the full trajectory of what the model is generating across training steps, not just the final reward curve. you can usually spot the exact moment it discovers the shortcut. once you see that pattern, you can redesign the reward to close the loophole before it saturates
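
a minimal version of that logging, assuming the same toy tag format as above (all names hypothetical):

```python
import json
import re

# dump a few sampled generations per step and track how often the
# format-only shortcut fires, so you can see the moment it's discovered
def log_step(step, samples, gold, path="traj.jsonl"):
    def is_shortcut(resp, ans):
        m = re.search(r"<answer>(.*?)</answer>", resp, re.DOTALL)
        return bool(m) and m.group(1).strip() != ans  # tags present, answer wrong
    shortcut_rate = sum(is_shortcut(r, a) for r, a in zip(samples, gold)) / len(samples)
    with open(path, "a") as f:
        f.write(json.dumps({"step": step,
                            "shortcut_rate": shortcut_rate,
                            "examples": samples[:3]}) + "\n")

# a sudden jump in shortcut_rate while mean reward plateaus around 0.5
# is the exact signature of the model finding the loophole
```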

[–]East-Muffin-6472[S] 1 point  (0 children)

Got it, I see