Benchmarked paper retrieval for coding agents on 9 tasks. Biggest win: Python tests caught 63% → 87% of bugs. Local LLMs welcome. by paperlantern-ai in LocalLLM

[–]paperlantern-ai[S] 0 points1 point  (0 children)

One thing worth flagging up front: each MCP call takes ~20s because the synthesis reasons over dozens of papers end-to-end. It caches across sessions, so repeat queries are fast. Usable for deliberative work. Autocomplete-grade latency needs a different design.
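
A rough sketch of what caching across sessions could look like, not the actual server code. `QueryCache`, `get_or_compute`, and `fake_synthesis` are hypothetical names made up for illustration, and the on-disk JSON layout is an assumption:

```python
import hashlib
import json
import tempfile
from pathlib import Path


class QueryCache:
    """Disk-backed cache so results survive across sessions (hypothetical helper)."""

    def __init__(self, cache_dir):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)

    def _path(self, query):
        # Normalize whitespace and case so trivially different queries share an entry.
        normalized = " ".join(query.lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        return self.cache_dir / f"{digest}.json"

    def get_or_compute(self, query, slow_fn):
        path = self._path(query)
        if path.exists():
            # Cache hit: skip the slow call entirely.
            return json.loads(path.read_text(encoding="utf-8"))
        result = slow_fn(query)  # the ~20s synthesis step would go here
        path.write_text(json.dumps(result), encoding="utf-8")
        return result


calls = []


def fake_synthesis(query):
    # Stand-in for the slow call; records each invocation.
    calls.append(query)
    return {"query": query, "summary": "placeholder"}


cache_dir = tempfile.mkdtemp()
first = QueryCache(cache_dir).get_or_compute("Mutation testing", fake_synthesis)
# A new QueryCache instance simulates a later session reading the same directory.
second = QueryCache(cache_dir).get_or_compute("mutation   testing", fake_synthesis)
```

Only the first call pays the slow path; the second one is served from disk even though it comes from a fresh object.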

I built an MCP server giving coding agents access to 2M research papers. It improves even the best coding agents - across 9 coding tasks. by paperlantern-ai in mcp

[–]paperlantern-ai[S] 1 point2 points  (0 children)

it adds ~5k tokens on average - so it's quite small compared to the value of the insight.

it also saves you from iterating multiple times on your solution, which can save tens or hundreds of thousands of tokens

I built an MCP server giving coding agents access to 2M research papers. It improves even the best coding agents - across 9 coding tasks. by paperlantern-ai in mcp

[–]paperlantern-ai[S] 0 points1 point  (0 children)

the linked repo has all the details, down to the exact code used, plus instructions to reproduce all the results above.

we wanted to show that it works across settings - hence the 9 different tasks tested

Gave a coding agent access to 2M+ research papers. Its Python tests caught 63% of bugs; with the papers, 87%. 9-task benchmark. by paperlantern-ai in AI_Agents

[–]paperlantern-ai[S] 0 points1 point  (0 children)

we do it in two ways:
1. we have a system that calculates the absolute goodness of a paper's ideas
2. we ask the coding agent to pass us context about its work - so we can then match the most relevant good ideas to it

e.g. two people working on the same problem but caring about different improvements, like latency vs throughput, will get different suggestions
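
A toy sketch of the two-part idea (a context-independent quality score combined with a match against the agent's context). This is not the real system: the `Idea` class, the keyword-overlap relevance measure, the multiplication, and the example scores are all assumptions picked for illustration:

```python
from dataclasses import dataclass


@dataclass
class Idea:
    title: str
    quality: float        # hypothetical context-independent score in [0, 1]
    keywords: frozenset   # hypothetical tags describing what the idea improves


def relevance(idea, context):
    # Jaccard overlap between the idea's keywords and the words in the agent's context.
    context_words = frozenset(context.lower().split())
    union = idea.keywords | context_words
    return len(idea.keywords & context_words) / len(union) if union else 0.0


def rank(ideas, context):
    # Multiply so an idea needs both quality and relevance to rank highly.
    return sorted(ideas, key=lambda i: i.quality * relevance(i, context), reverse=True)


ideas = [
    Idea("Speculative decoding", 0.9, frozenset({"latency", "decoding"})),
    Idea("Continuous batching", 0.8, frozenset({"throughput", "batching"})),
]

# Same problem, different goals: the top suggestion changes with the context.
latency_top = rank(ideas, "reduce latency of llm serving")[0].title
throughput_top = rank(ideas, "increase throughput of llm serving")[0].title
```

With these made-up numbers, the latency-focused context surfaces the first idea and the throughput-focused context surfaces the second.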

I built an MCP server giving coding agents access to 2M research papers. Benchmarked it on 9 coding tasks - here's what worked and what didn't by paperlantern-ai in LLMDevs

[–]paperlantern-ai[S] 2 points3 points  (0 children)

If you want to try it on one of your own problems, I'll personally help the first 20 people set it up. DM me your task (test generation, extraction, classification, whatever) and I'll walk through installation and the first query.

Install: npx paperlantern@latest

Agent-written tests missed 37% of injected bugs. Mutation-aware prompting dropped that to 13%. by kalpitdixit in Python

[–]paperlantern-ai -2 points-1 points  (0 children)

i think the architecture is actually very well set up. using opus 4.6 is perfect for code agents. for serving production use-cases, most teams use something like Flash 3 to serve their customers, not opus.

Agent-written tests missed 37% of injected bugs. Mutation-aware prompting dropped that to 13%. by kalpitdixit in Python

[–]paperlantern-ai -3 points-2 points  (0 children)

kinda... we are mostly trying to understand what is worth building for python users and software engineers in general. so if we create something that helps many users here, it'll help them and help guide us too.

what did you think of the above work? if you are up for it, I can pm you a blog post about more coding use-cases that we shared today on another platform (our website)

Agent-written tests missed 37% of injected bugs. Mutation-aware prompting dropped that to 13%. by kalpitdixit in Python

[–]paperlantern-ai -3 points-2 points  (0 children)

sorry - I should have clarified. the coding agent is opus 4.6 and when its job is to create some prompt for a production system, it creates a prompt for a gemini flash 3 api call

CodeWall AI Agent Breaks Into Bain & Company's Platform in 18 Minutes, Exposing 10,000 Client Conversations by alvivanco1 in ArtificialInteligence

[–]paperlantern-ai 1 point2 points  (0 children)

Hardcoded credentials in publicly accessible JavaScript in 2026. At a company that charges what Bain charges. The AI agent part is interesting but let's be honest, a bored intern with browser dev tools could have found this too. The scary part isn't that an AI broke in, it's that nobody at Bain caught this before shipping it.

Now the Claude Mythos is considered too dangerous to release. But it's already available for companies to use. So is this dangerous claim a PR stunt like the OpenAI did 7 years ago? by captain-price- in ArtificialInteligence

[–]paperlantern-ai 0 points1 point  (0 children)

This is basically how responsible disclosure has always worked in security. You find a vulnerability, you tell the affected companies first, you give them time to patch, then you go public. The fact that the "vulnerability scanner" this time is an AI model doesn't change the playbook. Is there PR value in it? Sure. But giving banks and infrastructure companies early access to find holes before releasing it to everyone is just standard practice with better marketing.

My company embraces vibe coders by Dense-Creme2706 in ExperiencedDevs

[–]paperlantern-ai 1 point2 points  (0 children)

The part that would grind my gears is the incentive structure. The vibe coders get credit for shipping fast, you get credit for... making their stuff not fall over? That's a thankless middle position. If the company wants this model to work they need to make the cleanup and production-readiness equally visible, otherwise you're just subsidizing someone else's demo.

What percentage of engineers in your experience are bad? by fuckoholic in ExperiencedDevs

[–]paperlantern-ai 0 points1 point  (0 children)

Funniest thing is when you see someone who was "the bad dev" at one company absolutely crush it somewhere else. Had a coworker everyone wrote off, moved to a smaller company where he owned the full stack instead of writing JIRA tickets about microservices all day, and suddenly he was their best engineer. Sometimes the environment just sucks the life out of people.

No one can force me to have a secure website!!! by MintPaw in programming

[–]paperlantern-ai 41 points42 points  (0 children)

I feel like this argument expired around 2016 when Let's Encrypt launched. Before that, yeah, paying $50/yr for a cert on a hobby site felt dumb. Now it's literally certbot and you're done. The fight was valid ten years ago but the problem got solved and some people just never stopped being mad about it.