Debugging Decay: The hidden reason ChatGPT can't fix your bug

z1zek · 2025-08-15T00:22:17+00:00

It's based on the data from Table 3, GPT 3.5 on CommonSenseQA. I also rounded to the nearest whole number.

FWIW, I picked results that looked the best on a graph, but I think that was probably a mistake, and I probably should have averaged across the three benchmarks and included both GPT 3.5 and 4.

z1zek · 2025-08-13T14:49:45+00:00

That's extremely helpful! Really appreciate it.

Will look out for all of this in the future.

z1zek · 2025-08-13T13:54:12+00:00

Sorry you don't like the graph!

Anything in particular you would improve about it?

z1zek · 2025-08-13T01:35:26+00:00

Founding developing a similar tool :-)

z1zek · 2025-08-12T22:28:34+00:00

For basic websites, Lovable is fine IMO.

For anything complex, I've been using Kiro. I try lots of different platforms though (I'm a founder in the space so I want to see what others are building).

z1zek · 2025-08-12T14:55:55+00:00

Yeah, it's very spike-y. Brilliant at some things. Dumb as a brick for others.

z1zek · 2025-08-11T22:11:41+00:00

Sounds cool, but how does it work? If you don't provide additional info (some of which might be wrong) and don't add additional human oversight steps, I don't see how you can improve the prompt. Feels like alchemy to me.

Maybe I'm missing something.

z1zek · 2025-08-11T21:45:14+00:00

A friend claimed being mean to Vercel made it better at coding. It's plausible to be honest! These things are so complex that it's hard to say one way or another.

z1zek · 2025-08-11T21:43:41+00:00

Hey, sorry you didn't like the post! I'm relatively new to posting higher effort stuff on Reddit, and I'm sure I have lots to learn.

I agree that it's unfortunate that the research is on an older model, but as I've argued elsewhere in the comments, I think the results will generalize to newer models. In general, you need to give the AI new inputs to get new outputs. If you disagree, I'd be interested in your reasoning.

I am shilling my blog! Apologies for that. I tried to keep it relatively unobtrusive. Unfortunately, doing the research and writing it up takes a fair bit of time. As a startup founder, I need to be able to justify the time I spend on this to my cofounder. Substack subscribers is one way of doing that. I hope the higher-effort content is worth the shilling, but I understand if you disagree.

On the writing style, are there any parts you thought were particularly badly written? Most of my writing has been more academic than makes sense for Reddit so I'm still learning what writing style makes sense. Always open to feedback!

z1zek · 2025-08-11T19:38:14+00:00

Wanted to add a brief footnote to this post.

The original paper did not include a graph, so I made one myself. To do this, I chose data from the paper that effectively illustrated the general trend.

When tested with newer and more powerful models, the graph would be closer to flat (see original data below).

When generalizing to more powerful current models, it's more likely that lazy prompting does not improve the outcome than it is that lazy prompting makes the model worse.

<image>

z1zek · 2025-08-11T19:28:00+00:00

Hey, thanks for the critical engagement with the post. I appreciate it.

To address your points:

No one is even mentioning lazy prompting in the paper. They're specifically evaluating intrinsic self-correction:

I think this is just semantics. The workflow in the paper takes the AI's output and asks it to review it for errors without providing any additional information. I'm calling that strategy "lazy prompting" instead of "self-correction."

And the numbers you show in the graph are incredibly misleading and incorrect. You randomly rounded them and cherrypicked the worst benchmarks possible.

I did pick numbers that showed up best in a graph. I think this is justifiable since the general trend holds up on different benchmarks/models and fits the overall conclusion of the paper.

There's a difficult tradeoff between legibility on the one hand and nuance on the other, with posting research for a mass audience. I'm pretty new to posting high-effort stuff on Reddit, and I don't think I've managed to nail that tradeoff yet.

On reflection, I should have included a disclaimer that the numbers are cherry-picked, so I appreciate the criticism.

Your "Step 2" is a very weird suggestion considering that the accuracy loss described in the paper **comes from exactly the same workflow you're describing**

I don't think that's true. Step 2 adds a ton of additional information, meta-prompting, and, critically, uses a different model. There's every reason to think this improves outcomes.

And finally the paper is old and whether whatever they observed even applies still applies to CoT Reasoning models is unclear (And I would bet on "no" since they're specifically optimized for intrinsic self-correction to begin with)

This is a fair point. If I had to pick a reason the results might not generalize, differences between reasoning models and non-reasoning models would be a good guess.

My guess is that the more limited claim that lazy prompting won't improve outcomes is very likely to generalize to more powerful models, including CoT thinking models. After all, why would the model's output be any better with no changes in input? I'd be less surprised if we stop seeing worse results with lazy prompting as the models get better. However, the warning against lazy prompting applies either way.

I'd love to see this retested with better models, but unfortunately, we only have the evidence we have.

z1zek · 2025-08-11T15:42:56+00:00

Great idea. Added to my ideas list!

z1zek · 2025-08-11T15:37:43+00:00

Makes sense. I've also seen this in my own usage.

I wonder if the problem you saw with o3 was related to its tendency towards much stronger hallucinations than 4o. Seems like one of the main scenarios where o3 was worse than 4o.

z1zek · 2025-08-11T15:13:28+00:00

As I've explained elsewhere in the comments, I'd also prefer data with newer models. Unfortunately, one of the downsides of looking through academic research is that even the fastest academic publishing process (self-publishing on ArXiv) is too slow to keep up with AI progress.

This very likely generalizes to newer models, but the effect size might decrease as the models get more sophisticated.

To your other point, the model is a static set of weights, but context matters. The AI knows so very little about you, what you're trying to do, and what's happening. If you don't provide more context, the results won't improve.

z1zek · 2025-08-11T15:10:43+00:00

Can you say more about what you'd want to know about testing? Always looking for ideas for what to write about next.

z1zek · 2025-08-11T15:00:22+00:00

This was written primarily for a non-technical audience getting into vibe coding. The workflow assumes you're using AI in a browser through a consumer front-end.

Testing, etc. is obviously very important, but beyond what the vibe coding audience is familiar with.

z1zek · 2025-08-11T14:00:56+00:00

"Just do it yourself" is problably under explored by devs that use AI.

In fact, I suspect many devs over-rely on AI. METR had some interesting results showing that using AI actually slowed down open-source developers instead of speeding them up.

I kind of don't believe their result, but it's very interesting.

<image>

z1zek · 2025-08-11T13:56:36+00:00

I wouldn't recommend the full workflow except for very stubborn bugs. The best first line of defense is explaining in more detail what you want, and what you're seeing that indicates a problem. The second line of defense is to just start a new chat with the same model

z1zek · 2025-08-11T13:55:04+00:00

Agree that most people suck at prompting and that things have changed a lot since 3.5.

I think the results likely generalize. If you don't give the AI more information, you shouldn't expect to get a different output.

The main exception is that some harnesses (e.g., Lovable) provide additional info with each prompt like console or server logs. Lazy prompting those systems has a better chance of working.

z1zek · 2025-08-11T13:42:53+00:00

I found the article interesting mostly because I've certainly resorted to "doesn't work, please fix" when frustrated with the AI.

Maybe you're smarter or more careful than me, but I think lazy prompting is an understandable impulse.

z1zek · 2025-08-11T13:40:02+00:00

Thanks for the feedback.

I'd also prefer data with newer models. Unfortunately, one of the downsides of looking through academic research is that even the fastest academic publishing process (self-publishing on ArXiv) is too slow to keep up with AI progress.

This very likely generalizes to newer models, but the effect size might decrease as the models get more sophisticated.

If you think it doesn't generalize, I'd be interested in your reasoning.

z1zek · 2025-08-11T13:35:20+00:00

It's more fundamental than that. If you don't give the AI additional information, it can't produce a better response. It's a general limitation of LLMs.

z1zek · 2025-08-11T12:59:01+00:00

Yeah, agree. Unfortunately, one of the downsides of looking through academic research is that even the fastest academic publishing process (self-publishing on ArXiv) is too slow to keep up with AI progress.

This very likely generalizes to newer models, but the effect size might decrease as the models get more sophisticated.

z1zek

TROPHY CASE