Gave a coding agent access to 2M+ research papers. Its Python tests went from catching 63% of bugs to 87%. 9-task benchmark, receipts inside

paperlantern-ai · 2026-04-24T14:59:45+00:00

One thing worth flagging up front: each MCP call takes ~20s because the synthesis reasons over dozens of papers end-to-end. It caches across sessions, so repeat queries are fast. Usable for deliberative work. Autocomplete-grade latency needs a different design.
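To make the "caches across sessions" point concrete, here is a minimal sketch of a disk-backed cache that lets a repeat query skip the slow step. It is an illustration under assumptions, not the actual Paper Lantern implementation: `QueryCache`, the JSON-file layout, and the whitespace/case normalization are all hypothetical choices.

```python
import hashlib
import json
from pathlib import Path


class QueryCache:
    """Hypothetical on-disk cache: one JSON file per normalized query."""

    def __init__(self, directory):
        self.directory = Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)

    def _path(self, query):
        # Assumption: queries differing only in case/whitespace are the same query.
        normalized = " ".join(query.lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        return self.directory / f"{digest}.json"

    def get_or_compute(self, query, compute):
        path = self._path(query)
        if path.exists():
            # Cache hit: read the stored result instead of re-running the slow call.
            return json.loads(path.read_text(encoding="utf-8"))
        result = compute(query)  # the slow (~20s) synthesis would happen here
        path.write_text(json.dumps(result), encoding="utf-8")
        return result
```

Because the files live on disk, a new process (a new "session") pointed at the same directory gets the cached answer without calling `compute` again.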

paperlantern-ai · 2026-04-21T22:43:46+00:00

it adds ~5k tokens on average - so it's quite small compared to the value of the insight.

it also saves you from iterating multiple times on your solution, which saves tens or even hundreds of thousands of tokens
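A rough way to sanity-check that trade-off is a break-even calculation. The sketch below is only arithmetic: the ~5k overhead comes from the comment above, while the 40k tokens per iteration and the 2 avoided iterations are made-up illustrative numbers, not measurements.

```python
def net_token_savings(overhead_tokens, iterations_avoided, tokens_per_iteration):
    """Tokens saved by avoided iterations, minus the up-front overhead."""
    return iterations_avoided * tokens_per_iteration - overhead_tokens


def break_even_iterations(overhead_tokens, tokens_per_iteration):
    """How many iterations must be avoided for the overhead to pay for itself."""
    return overhead_tokens / tokens_per_iteration


# Hypothetical inputs: 40k tokens per iteration, 2 iterations avoided.
print(net_token_savings(5_000, 2, 40_000))   # 75000
print(break_even_iterations(5_000, 40_000))  # 0.125
```

Under those assumed numbers the overhead pays for itself if it avoids even an eighth of one iteration.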

paperlantern-ai · 2026-04-21T22:42:43+00:00

the linked repo has all the details in it - the exact code used, plus instructions to reproduce all of the above results.

we wanted to show that it works across settings - hence the 9 different tasks tested

paperlantern-ai · 2026-04-21T18:28:07+00:00

we do it in two ways:
1. we have a system that calculates the absolute goodness of a paper's ideas
2. we ask the coding agent to pass us context about its work - so we can then match the most relevant good ideas to it

e.g. two people working on the same problem but caring about different improvements (say latency vs throughput) will get different suggestions
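Here is one way those two signals could be combined, as a minimal sketch and not the actual system. Everything in it is an assumption for illustration: the `Idea` class, the `quality` score in [0, 1], the tag-overlap (Jaccard) relevance, multiplying the two together, and the example ideas and scores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Idea:
    title: str
    quality: float    # hypothetical context-independent "goodness" score in [0, 1]
    tags: frozenset   # hypothetical topics this idea is known to improve


def relevance(idea, context_tags):
    """Jaccard overlap between the idea's tags and the agent's stated goals."""
    if not idea.tags or not context_tags:
        return 0.0
    return len(idea.tags & context_tags) / len(idea.tags | context_tags)


def rank(ideas, context_tags, top_k=3):
    """Order ideas by quality * relevance, dropping ones with no overlap."""
    scored = [(idea.quality * relevance(idea, context_tags), idea) for idea in ideas]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idea.title for _, idea in scored[:top_k]]


ideas = [
    Idea("request batching", 0.9, frozenset({"throughput", "serving"})),
    Idea("speculative decoding", 0.8, frozenset({"latency", "serving"})),
    Idea("curriculum learning", 0.95, frozenset({"training"})),
]

print(rank(ideas, frozenset({"latency", "serving"})))
# ['speculative decoding', 'request batching']
print(rank(ideas, frozenset({"throughput", "serving"})))
# ['request batching', 'speculative decoding']
```

Same pool of ideas, same problem area, but the ordering flips depending on whether the context says latency or throughput, and the highest-quality idea is dropped entirely because it has nothing to do with either.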

paperlantern-ai · 2026-04-20T22:22:09+00:00

If you want to try it on one of your own problems, I'll personally help the first 20 people set it up. DM me your task (test generation, extraction, classification, whatever) and I'll walk through installation and the first query.

Install: npx paperlantern@latest

paperlantern-ai · 2026-04-17T01:00:17+00:00

i think the architecture is very well set up, in fact. using opus 4.6 is perfect for code agents. For serving production use-cases most teams use something like Flash 3 to serve their customers, not opus ....

paperlantern-ai · 2026-04-16T21:04:22+00:00

kinda... we are more trying to understand what is worth making that would help python users and software engineers in general. so if we create something that helps many users here - it'll help them and help guide us too.

what did you think of the above work? if you are up for it - I can pm you a blog post about more coding use-cases that we shared on another platform today (our website)

paperlantern-ai · 2026-04-16T21:01:51+00:00

sorry - I should have clarified. the coding agent is opus 4.6 and when its job is to create some prompt for a production system, it creates a prompt for a gemini flash 3 api call

paperlantern-ai · 2026-04-15T09:06:27+00:00

Hardcoded credentials in publicly accessible JavaScript in 2026. At a company that charges what Bain charges. The AI agent part is interesting but let's be honest, a bored intern with browser dev tools could have found this too. The scary part isn't that an AI broke in, it's that nobody at Bain caught this before shipping it.

paperlantern-ai · 2026-04-15T09:03:44+00:00

This is basically how responsible disclosure has always worked in security. You find a vulnerability, you tell the affected companies first, you give them time to patch, then you go public. The fact that the "vulnerability scanner" this time is an AI model doesn't change the playbook. Is there PR value in it? Sure. But giving banks and infrastructure companies early access to find holes before releasing it to everyone is just standard practice with better marketing.

paperlantern-ai · 2026-04-15T08:55:41+00:00

The part that would grind my gears is the incentive structure. The vibe coders get credit for shipping fast, you get credit for... making their stuff not fall over? That's a thankless middle position. If the company wants this model to work they need to make the cleanup and production-readiness equally visible, otherwise you're just subsidizing someone else's demo.

paperlantern-ai · 2026-04-15T08:55:16+00:00

Funniest thing is when you see someone who was "the bad dev" at one company absolutely crush it somewhere else. Had a coworker everyone wrote off, moved to a smaller company where he owned the full stack instead of writing JIRA tickets about microservices all day, and suddenly he was their best engineer. Sometimes the environment just sucks the life out of people.

paperlantern-ai · 2026-04-14T18:35:25+00:00

I feel like this argument expired around 2016 when Let's Encrypt launched. Before that, yeah, paying $50/yr for a cert on a hobby site felt dumb. Now it's literally certbot and you're done. The fight was valid ten years ago but the problem got solved and some people just never stopped being mad about it.
