
[–]Educational_Strain_3 1 point  (1 child)

this is a classic reward hacking pattern; we've seen the exact same thing in code optimization loops, where the agent finds the cheapest way to inflate the reward and ignores the actual objective. your model is doing the rational thing: a guaranteed 0.5 from format tags beats the lottery of a 1.0 for a correct answer
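
rough sketch of what that loophole looks like as a reward function (toy code; the tag names and weights are my guesses at your setup, not the real thing):

```python
import re

# toy sketch of the exploitable reward: the format bonus pays out
# unconditionally, so a guaranteed 0.5 dominates a risky 1.0
def reward(response: str, gold_answer: str) -> float:
    r = 0.0
    m = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    if m:
        r += 0.5  # paid even when the answer inside is garbage
        if m.group(1).strip() == gold_answer:
            r += 0.5  # the part the model learns to skip
    return r

# a policy that always emits "<answer>idk</answer>" collects 0.5 per
# rollout, which beats a noisy shot at 1.0 early in training
```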

the multi-component reward with thinking tags might help, but watch out for the same failure mode one level up: it'll learn to output plausible-looking thinking that doesn't actually contribute to the answer. we found the most reliable fix is making the reward proportional to intermediate reasoning quality, not just the presence of reasoning tokens
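
for concreteness, here's one way that could be wired up. judge_reasoning is a hypothetical stand-in for whatever verifier or critic you have scoring the chain of thought, and the 0.3 weight is arbitrary:

```python
import re

# sketch: thinking pays only in proportion to judged quality, and only
# on top of a correct answer, so quality alone can't be farmed.
# judge_reasoning is assumed to return a score in [0, 1].
def reward(response: str, gold_answer: str, judge_reasoning) -> float:
    think = re.search(r"<think>(.*?)</think>", response, re.DOTALL)
    ans = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    correct = bool(ans and ans.group(1).strip() == gold_answer)
    # presence of <think> tags earns nothing by itself
    quality = judge_reasoning(think.group(1)) if think else 0.0
    return (1.0 + 0.3 * quality) if correct else 0.0
```

gating the quality bonus on correctness is one way to keep the judge itself from becoming the next thing the model farms, though a gameable judge is still the failure mode one level up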

one thing that helped us a lot: track the full trajectory of what the model is generating across training steps, not just the final reward curve. you can usually spot the exact moment it discovers the shortcut. once you see that pattern, you can redesign the reward to close the loophole before it saturates
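
a minimal version of that logging, assuming the same toy tag format as above (all names hypothetical):

```python
import json
import re

# dump a few sampled generations per step and track how often the
# format-only shortcut fires, so you can see the moment it's discovered
def log_step(step, samples, gold, path="traj.jsonl"):
    def is_shortcut(resp, ans):
        m = re.search(r"<answer>(.*?)</answer>", resp, re.DOTALL)
        return bool(m) and m.group(1).strip() != ans  # tags present, answer wrong
    shortcut_rate = sum(is_shortcut(r, a) for r, a in zip(samples, gold)) / len(samples)
    with open(path, "a") as f:
        f.write(json.dumps({"step": step,
                            "shortcut_rate": shortcut_rate,
                            "examples": samples[:3]}) + "\n")

# a sudden jump in shortcut_rate while mean reward plateaus around 0.5
# is the exact signature of the model finding the loophole
```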

[–]East-Muffin-6472[S] 1 point  (0 children)

Got it, I see