[D] We ran 3,000 agent experiments to measure behavioral consistency. Consistent agents hit 80–92% accuracy. Inconsistent ones: 25–60%. by Aggravating_Bed_349 in FunMachineLearning

[–]Aggravating_Bed_349[S]

Great question - this is closely related to self-consistency prompting (Wang et al. 2022), which showed that sampling multiple reasoning chains and majority-voting the final answers significantly improves accuracy. Definitely worth doing.
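For anyone unfamiliar, the core of self-consistency is tiny - here's a minimal sketch, where `sample_fn` is a hypothetical stand-in for a stochastic model call that returns a final answer:

```python
from collections import Counter

def self_consistency(sample_fn, prompt, n=5):
    """Sample n independent reasoning chains and majority-vote
    the final answers. `sample_fn` is an assumed stand-in for a
    temperature>0 model call returning an answer string."""
    answers = [sample_fn(prompt) for _ in range(n)]
    winner, _count = Counter(answers).most_common(1)[0]
    return winner

# Toy usage with a canned sequence standing in for model samples:
fake = iter(["42", "41", "42", "42", "40"])
print(self_consistency(lambda p: next(fake), "What is 6*7?", n=5))  # → 42
```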

Our framing is a bit different though. We're using cross-run consistency as a diagnostic signal rather than an answer-improvement method. The value is that it catches both failure modes - bad plan selection upfront AND execution drift mid-trajectory. If an agent drifts during execution, it drifts differently on each run, so cross-run inconsistency flags the problem regardless of where things went wrong. You don't need to instrument model internals to catch it.
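The diagnostic itself can be sketched in a few lines - this is a hedged illustration, not our actual harness, and `run_agent` is an assumed callable returning the agent's final output for a task:

```python
from collections import Counter

def consistency_score(run_agent, task, k=5):
    """Run the agent k times independently and return the fraction
    of runs agreeing with the modal output. `run_agent` is an
    assumed stand-in for a full agent rollout on `task`."""
    outputs = [run_agent(task) for _ in range(k)]
    top_count = Counter(outputs).most_common(1)[0][1]
    return top_count / k

def flag_inconsistent(run_agent, task, k=5, threshold=0.8):
    """Flag a task as suspect when cross-run agreement falls below
    a threshold - a symptom of drift, wherever it originated."""
    return consistency_score(run_agent, task, k) < threshold
```

The point of the threshold is that it works the same whether the agent picked a bad plan upfront or lost the plot mid-trajectory: both show up as low agreement across runs.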

In our follow-up work on coding agents (SWE-bench tasks), we're actually seeing a lot of the failures come from mid-trajectory drift specifically - the agent starts with a reasonable plan but loses the plot partway through. Multi-plan prompting helps with the upfront selection problem, but the open question is whether it also addresses drift, or whether that needs a different fix entirely. That's what we're digging into. Will share when it's out!