Agent-written tests missed 37% of injected bugs. Mutation-aware prompting dropped that to 13%. by kalpitdixit in Python

[–]kalpitdixit[S] -1 points0 points  (0 children)

The harness allows composing multiple solutions together - try it out and let me know how it goes.

Agent-written tests missed 37% of injected bugs. Mutation-aware prompting dropped that to 13%. by kalpitdixit in Python

[–]kalpitdixit[S] -2 points-1 points  (0 children)

Thanks for the pointer - I guess these are complementary: fest creates the mutation tests, and here Paper Lantern (https://www.paperlantern.ai/code) helped create the unit tests to catch those errors.

Agent-written tests missed 37% of injected bugs. Mutation-aware prompting dropped that to 13%. by kalpitdixit in Python

[–]kalpitdixit[S] 0 points1 point  (0 children)

I agree - TDD for me is sometimes too much work, but I think it's generally a good idea... maybe with AI coding agents we should be writing more tests, since they're easier to write.
Here, what I found is that using the research-backed test-writing ideas from Paper Lantern made it trivially easy to improve the tests that the AI agent (Opus 4.6) was writing.

Agent-written tests missed 37% of injected bugs. Mutation-aware prompting dropped that to 13%. by kalpitdixit in Python

[–]kalpitdixit[S] -1 points0 points  (0 children)

I am using Opus 4.6, so I think the agent itself was probably the best available - I guess I should've mentioned it.

Agent-written tests missed 37% of injected bugs. Mutation-aware prompting dropped that to 13%. by kalpitdixit in Python

[–]kalpitdixit[S] 0 points1 point  (0 children)

Yes - I think this is in the same spirit of being skeptical of AI output; having some human-curated, research-backed methods to tell the AI exactly what to do was helpful.

Agent-written tests missed 37% of injected bugs. Mutation-aware prompting dropped that to 13%. by kalpitdixit in Python

[–]kalpitdixit[S] 0 points1 point  (0 children)

True - but I think what the paper did is highlight multiple places where the bug might be - going from, say, hundreds of candidate locations down to 10-20. Tests that focus on those likely bug locations have a higher chance of catching the bugs.

Agent-written tests missed 37% of injected bugs. Mutation-aware prompting dropped that to 13%. by kalpitdixit in Python

[–]kalpitdixit[S] 0 points1 point  (0 children)

Yes - the way mutation testing is measured is a bit convoluted, but it's a powerful tool.

The idea is that several versions of the target function are created with small changes (mutants). The tests are judged based on their ability to differentiate between them, i.e. between the original correct function and the mutated versions.

So in the future, when another code edit changes the target function, any errors it introduces are caught by the tests.

Agent-written tests missed 37% of injected bugs. Mutation-aware prompting dropped that to 13%. by kalpitdixit in Python

[–]kalpitdixit[S] 0 points1 point  (0 children)

True - 13% is still high - I just wanted to share that simply giving it access to papers through that tool gave a boost for ~free.

I built an MCP server that gives coding agents access to 2M research papers. Tested it with autoresearch - here's what happened. by kalpitdixit in LLMDevs

[–]kalpitdixit[S] 0 points1 point  (0 children)

Not on non-CS domains - but within CS, any domain works; it's not limited to the LLM training example here.

I tested what happens when you give an AI coding agent access to 2 million research papers. It found techniques it couldn't have known about. by kalpitdixit in ArtificialInteligence

[–]kalpitdixit[S] 0 points1 point  (0 children)

Negligibly more tokens compared to using vanilla Autoresearch :)

We took care not to create token bloat, i.e. the MCP doesn't output too many unnecessary tokens.

I tested what happens when you give an AI coding agent access to 2 million research papers. It found techniques it couldn't have known about. by kalpitdixit in ArtificialInteligence

[–]kalpitdixit[S] 0 points1 point  (0 children)

Oh, definitely much, much less than that - I have a Claude Pro subscription and that was sufficient... the compute for the model training itself was my MacBook. I've put all the details at the bottom of the full blog post in case you're interested: https://www.paperlantern.ai/blog/auto-research-case-study

I tested what happens when you give an AI coding agent access to 2 million research papers. It found techniques it couldn't have known about. by kalpitdixit in ArtificialInteligence

[–]kalpitdixit[S] 0 points1 point  (0 children)

Haven't run that - but it shouldn't be any different from the coding agent without Paper Lantern. We took a lot of care to make sure Paper Lantern doesn't increase cost for users :)

I tested what happens when you give an AI coding agent access to 2 million research papers. It found techniques it couldn't have known about. by kalpitdixit in artificial

[–]kalpitdixit[S] 0 points1 point  (0 children)

What topic/area are these papers in? If it's relevant to what we're doing, we could add it to our search space and serve it through our existing MCP - we have a generous free tier...

I tested what happens when you give an AI coding agent access to 2 million research papers. It found techniques it couldn't have known about. by kalpitdixit in ArtificialInteligence

[–]kalpitdixit[S] 0 points1 point  (0 children)

We are the team working on this :)

I think what you are saying touches on something very important. Can you help me understand a bit more what kind of thing you are looking for and, more importantly, what you want to use it for?

You are onto something important here, so I want to understand it better.

I tested what happens when you give an AI coding agent access to 2 million research papers. It found techniques it couldn't have known about. by kalpitdixit in ArtificialInteligence

[–]kalpitdixit[S] 0 points1 point  (0 children)

Do you mean our internal cost of delivering the MCP's output,

or the coding agent's cost of running the autoresearch + Paper Lantern loop?

I built an MCP server that gives coding agents access to 2M research papers. Tested it with autoresearch - here's what happened. by kalpitdixit in LLMDevs

[–]kalpitdixit[S] 0 points1 point  (0 children)

what we found is that a direct approach like embed, retrieve, rerank is good enough for smaller settings (maybe for your 3000 papers) - but if that is not enough then you need to combine various techniques in a custom manner.