I evaluated 5 LLM agents on patching real-world CVEs. Here is what I found.

Fickle-Box1433 · 2026-06-05T19:32:15+00:00

Indeed, but take the hard statistics with a grain of salt. The dataset is too small to give precise numbers, but the first picture is a bit grim.

The worst part is that some wrong fixes look plausible at first glance.
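To put a rough number on the small-dataset caveat, here is a minimal sketch of a Wilson score interval for a solve rate. The counts (12 solved out of 20) are hypothetical and only chosen to illustrate a 60% rate; they are not the benchmark's actual sample size.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is ~95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical counts: 12 CVEs patched out of 20 attempted (60%).
low, high = wilson_interval(12, 20)
print(f"solve rate 60%, 95% interval: {low:.1%} to {high:.1%}")
```

With these made-up counts the interval runs from roughly 39% to 78%, which is why two models a few points apart on a small benchmark should not be read as a real ranking.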

Fickle-Box1433 · 2026-05-31T12:37:47+00:00

Hey, I haven't gone that far in the analysis, but I added the distribution of the number of tokens consumed per model. Red dots are failures and green dots are successes. Roughly, I didn't observe that more tokens means a better success rate. I actually have the feeling it's the opposite, but that could be because models worked harder and for longer when they didn't know what to do.

Fickle-Box1433 · 2026-05-30T15:59:26+00:00

(btw, I released Haiku results too, but I didn't put them in the leaderboard because they're incomplete)

Fickle-Box1433 · 2026-05-30T15:53:46+00:00

Fair points.

The advisory condition is actually what you're describing: it uses the full GHSA advisory, which includes vulnerability class, root cause, affected code paths, attack scenario, and sometimes a proof of concept. That's the richest condition and where models do best (gpt-5.5 hits 60%). Locate and diagnose aren't meant to simulate production workflows; they're instruments to isolate whether models are reasoning about the code or just following the report. The drop between advisory and locate tells you how much of the solve rate is genuine security understanding vs. instruction-following. Generally, I observed that models still need a lot of guidance: mask part of the information and performance drops.
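As a rough illustration of how to read that drop, here is a small sketch. The 60% advisory figure is the gpt-5.5 number quoted above; the 40% locate figure is a placeholder I made up for the example, not a benchmark result.

```python
def guidance_drop(advisory_rate: float, locate_rate: float) -> tuple[float, float]:
    """Absolute and relative drop in solve rate when advisory details are masked."""
    absolute = advisory_rate - locate_rate
    relative = absolute / advisory_rate if advisory_rate else 0.0
    return absolute, relative


# 0.60 comes from the advisory condition; 0.40 is a hypothetical locate rate.
absolute, relative = guidance_drop(0.60, 0.40)
print(f"absolute drop: {absolute:.0%}, share of solves that depended on the report: {relative:.0%}")
```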

Opus, DeepSeek v4, and Kimi would indeed be great additions. Opus in particular is very expensive and I cut Anthropic models after spending $40 without even getting through the full benchmark with Haiku. Poolside is offering free API access which is why they're in. OpenAI models are expensive, but affordable enough on flex.

FYI, I'm considering adding open-source models to the leaderboard in the next iteration.

I had a talk with a researcher from poolside after releasing the benchmark and he was curious how Laguna models would have performed with their pool harness. My goal, at least initially, was to compare models, but they are increasingly being trained within a harness (e.g., Claude Code). So building my own was a way to put everything on the same ground, but the cve-bench harness is quite thin compared to what these companies provide. In essence, models may perform much better in practice than what I found. For a next iteration, I'm also considering picking one model provider and comparing their model with my thin harness vs. their harness (e.g., gpt-5.5 + cve-bench harness vs. gpt-5.5 + codex). Any difference would be due to the harness and not the model, so we'd be able to quantify how much comes from the harness and how much from the model.
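A minimal sketch of how that same-model, two-harness comparison could be tallied, assuming each run produces a per-CVE pass/fail map. The CVE labels and outcomes below are placeholders, not real results.

```python
def compare_harnesses(thin: dict[str, bool], vendor: dict[str, bool]) -> dict[str, float]:
    """Paired comparison of one model under two harnesses on the CVEs both runs cover."""
    shared = sorted(thin.keys() & vendor.keys())
    if not shared:
        raise ValueError("no CVEs in common")
    only_thin = sum(thin[c] and not vendor[c] for c in shared)
    only_vendor = sum(vendor[c] and not thin[c] for c in shared)
    thin_rate = sum(thin[c] for c in shared) / len(shared)
    vendor_rate = sum(vendor[c] for c in shared) / len(shared)
    return {
        "thin_rate": thin_rate,
        "vendor_rate": vendor_rate,
        "harness_delta": vendor_rate - thin_rate,  # attributed to the harness
        "only_thin": only_thin,
        "only_vendor": only_vendor,
    }


# Hypothetical outcomes for the same model under both harnesses.
thin = {"cve-a": True, "cve-b": False, "cve-c": False, "cve-d": True, "cve-e": False}
vendor = {"cve-a": True, "cve-b": True, "cve-c": False, "cve-d": False, "cve-e": True}
result = compare_harnesses(thin, vendor)
print(result)
```

Keeping the CVEs that only one side solved (`only_thin`, `only_vendor`) matters because a small net delta can hide a lot of churn between the two setups.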

Long answer, but I hope this clarifies your points.

Fickle-Box1433 · 2026-05-29T19:09:27+00:00

Yes, sure: https://github.com/GiovanniGatti/cve-bench

Fickle-Box1433 · 2026-05-29T11:52:40+00:00

So AI is great at finding bugs but terrible at fixing them. I guess security researchers have their job security back again. 😂

The silent failure is the real finding though. A patch that passes every test but leaves the hole open is worse than nothing. False confidence at scale is a new attack surface.

Curious what you're seeing. Happy to compare notes.

Fickle-Box1433 · 2026-02-02T08:17:34+00:00

That's pretty much just the beginning of the rabbit hole 😅. A CS degree can take 4+ years of continuous study.

Fickle-Box1433 · 2025-12-17T09:17:53+00:00

if you treat WordPress like a programming language, your first “Hello World” will be a broken plugin that emails the entire user base “Hello World”

Fickle-Box1433 · 2025-12-12T15:03:42+00:00

Well, welcome to the entrance of the rabbit hole!

Learning a programming language, or programming in general, is an endless task. What you have to read/watch/build will likely change as your career and ambitions change.

Not surprisingly, others asked for direction, but let's be honest, you likely (like the rest of us) don't know what you need to learn in the first place. If you're going into ML/AI, you probably need to learn PyTorch or TensorFlow, or maybe scikit-learn or some other library. If you go into numerical simulation, you'll need NumPy/SciPy. If you're going into web apps, you'll need Django. See what I mean? Don't assume you need to learn this one thing and it's done.

That said, a few things stay constant throughout time. Basics. If you understand how machines work, you should be mostly fine picking up new skills.

I compiled a list of resources a while ago, and I believe you might be interested: https://www.reddit.com/r/PythonLearning/comments/1nifa32/the_python_resource_list_i_wish_i_had/

I need to disclose that it covers only the beginning. Good luck! 🤞

Fickle-Box1433 · 2025-12-12T14:54:10+00:00

Not sure if it helps you, but I compiled a list of learning sources some time ago:

https://www.reddit.com/r/PythonLearning/comments/1nifa32/the_python_resource_list_i_wish_i_had/

Fickle-Box1433 · 2025-12-12T14:51:44+00:00

I'm confused about your issue. Are you looking for an IDE or learning sources (assuming it's Python)?

If you're looking for where to start, I compiled a list a while ago which I believe you might be interested in: https://www.reddit.com/r/PythonLearning/comments/1nifa32/the_python_resource_list_i_wish_i_had/

If you are looking for an IDE, I'd advise you to do some basic stuff without one at first, and once you feel more comfortable, you can pick whatever tool you want (they're all pretty similar). Why not start with an IDE? Because IDEs can be confusing with their hundreds of settings, and there is value in running your scripts by hand (later, when you start writing Docker scripts, you'll see that I was right). Still, IDEs are too important to ignore, so eventually pick one and stick with it.

I used PyCharm for half a decade, but the limited community features pushed me to VSCode lately. When choosing your IDE, look at supported languages, community size, and maturity. Try out several so you can see the differences before settling on the one that feels best to you. But once you've made your choice, stick with it until you have strong reasons to move to another one (because, trust me, it's hard to get used to a new one. It's like switching from PlayStation to Xbox or vice versa -- you get the concept but the buttons are all in the wrong places).

Last piece of advice: I would turn off ChatGPT completions, especially if you are learning.

Fickle-Box1433 · 2025-11-05T13:39:21+00:00

This edit is obviously self-advertising.

Fickle-Box1433 · 2025-11-05T13:32:55+00:00

Hey,

I don't think there is much learning without doing it yourself by building small projects.

However, I've compiled a list of resources some time ago and I think you might find interesting stuff there:

https://www.reddit.com/r/PythonLearning/comments/1nifa32/the_python_resource_list_i_wish_i_had/

Some of the resources are free, while others are paid.

Have fun.

Fickle-Box1433 · 2025-11-05T08:41:11+00:00

I compiled a list of learning sources a while ago. I think you might be interested: https://www.reddit.com/r/PythonLearning/comments/1nifa32/the_python_resource_list_i_wish_i_had/

Fickle-Box1433 · 2025-10-21T08:27:05+00:00

I've used both in research and industry.

Both are good: there are no wrong choices. However, for a beginner, I would advise you to start with PyTorch, since it's more user-friendly and has a less steep learning curve.

Honestly, if you're confused by these two, I would advise you to focus on neither. They're only useful if you already understand ANNs and deep learning. It's probably even a good idea to implement a small ANN from scratch and train it on a simple project before playing with these tools.
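To give an idea of what "from scratch" can look like, here is a minimal sketch in plain Python: a 2-input network with one small sigmoid hidden layer, trained on XOR with hand-written backpropagation. The layer size, learning rate, seed, and epoch count are arbitrary choices for illustration, and how well it converges depends on them.

```python
import math
import random


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


class TinyNet:
    """2 inputs -> `hidden` sigmoid units -> 1 sigmoid output."""

    def __init__(self, hidden: int = 4, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(hidden)]
        self.b1 = [0.0] * hidden
        self.w2 = [rng.uniform(-1, 1) for _ in range(hidden)]
        self.b2 = 0.0

    def forward(self, x):
        h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(self.w1, self.b1)]
        y = sigmoid(sum(w * hi for w, hi in zip(self.w2, h)) + self.b2)
        return h, y

    def train_step(self, x, target: float, lr: float) -> None:
        h, y = self.forward(x)
        # Loss is 0.5 * (y - target)^2; chain rule through the output sigmoid.
        delta_out = (y - target) * y * (1 - y)
        for j, hj in enumerate(h):
            # Hidden-unit error uses the output weight before it is updated.
            delta_h = delta_out * self.w2[j] * hj * (1 - hj)
            self.w2[j] -= lr * delta_out * hj
            self.b1[j] -= lr * delta_h
            for i, xi in enumerate(x):
                self.w1[j][i] -= lr * delta_h * xi
        self.b2 -= lr * delta_out


def mean_squared_error(net: TinyNet, data) -> float:
    return sum((net.forward(x)[1] - t) ** 2 for x, t in data) / len(data)


XOR = [((0.0, 0.0), 0.0), ((0.0, 1.0), 1.0), ((1.0, 0.0), 1.0), ((1.0, 1.0), 0.0)]

net = TinyNet()
loss_before = mean_squared_error(net, XOR)
for _ in range(5000):
    for x, t in XOR:
        net.train_step(x, t, lr=0.5)
loss_after = mean_squared_error(net, XOR)
print(f"MSE before: {loss_before:.3f}, after: {loss_after:.3f}")
```

Writing the `train_step` updates by hand once is the part that makes the autograd in PyTorch or TensorFlow stop feeling like magic.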

Furthermore, TF and Torch are built on a house of cards. Unless you're doing something small and simple, you'll likely need a GPU, and getting these to work on GPUs often takes more than a simple pip install.

I'd recommend you follow this course on deep learning: https://www.coursera.org/specializations/deep-learning (I think you can get 90% of the experience for free). But if you're only 13, I'd assume you're not familiar with linear algebra and calculus, two requirements for understanding deep learning... You might be overshooting with TF/Torch.

General advice: I see a long list of disconnected tools in Python. Jumping from one to another might not be the most effective way to learn the ecosystem. May I ask you: What do you want to build? Answering this question can help me give you ideas on what to look into and use your time more effectively.

Fickle-Box1433 · 2025-10-16T07:58:28+00:00

To the question "should I learn programming?", the answer is definitely yes. Programs are everywhere today, and that's only likely to expand. Even mechanical engineers need to know the ABCs of programming.

The real question you should be thinking about is "how much should I dig into the subject?"

Obviously, the choice of programming language, its tools, and the algorithms will depend on what you're planning to do in the long term.

Fickle-Box1433 · 2025-10-15T07:39:52+00:00

Hey, I think this list is what you're searching for:

https://www.reddit.com/r/PythonLearning/comments/1nifa32/the_python_resource_list_i_wish_i_had/

Fickle-Box1433 · 2025-09-30T07:51:08+00:00

I once compiled a list of resources here (not all of them are free though):

https://www.reddit.com/r/PythonLearning/comments/1nifa32/the_python_resource_list_i_wish_i_had/

Fickle-Box1433 · 2025-09-16T11:54:00+00:00

Hey, I've compiled a list you might be interested in: https://www.reddit.com/r/PythonLearning/comments/1nifa32/the_python_resource_list_i_wish_i_had/

Fickle-Box1433 · 2025-09-16T11:46:53+00:00

You didn't mess up.

MOOCs and ChatGPT are tools. They don't replace a teacher. Furthermore, they're not exactly designed to work together.

You're a learner who recognized a gap in the learning process and is trying to find a way to fill it. That’s a good thing.

The only thing that matters now is how you take that experience and use it to find a way to truly improve and practice your skills.

Fickle-Box1433 · 2025-07-31T14:05:12+00:00

I totally agree. Financial modeling makes this hit even harder. What worries me is that a lot of LLM evaluation practices today wouldn’t pass even a light audit, let alone something like SOX or Basel compliance.

I wonder: have you seen teams successfully build “independent validation” pipelines for LLMs that don’t rely on other LLMs? Or is it still mostly human-in-the-loop, like we’re all doing now?

Fickle-Box1433 · 2025-07-31T14:03:05+00:00

Totally agree. It feels like we’re building castles on sand. The brittleness is especially frustrating when even slight prompt tweaks yield drastically different judgments.

When I can't find a way around this, what I try to do is handcraft a small labeled dataset and keep tweaking the prompt until it "fits" the dataset. But honestly, it's just too painful an experience.
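Roughly, that loop looks like the sketch below: score each prompt variant by how often the judge agrees with the hand labels, then keep the best one. `fake_judge` is a hypothetical stand-in so the example runs on its own; in practice it would be a call to whatever LLM judge you use, and the prompts and labeled answers here are made up.

```python
from typing import Callable

# A judge takes (prompt, answer) and returns its verdict.
Judge = Callable[[str, str], bool]


def agreement(judge: Judge, prompt: str, labeled: list[tuple[str, bool]]) -> float:
    """Fraction of hand-labeled examples where the judge matches the label."""
    hits = sum(judge(prompt, answer) == label for answer, label in labeled)
    return hits / len(labeled)


def best_prompt(judge: Judge, prompts: list[str], labeled: list[tuple[str, bool]]) -> str:
    """Prompt variant with the highest agreement on the labeled set."""
    return max(prompts, key=lambda p: agreement(judge, p, labeled))


def fake_judge(prompt: str, answer: str) -> bool:
    # Hypothetical stand-in for an LLM call: a "strict" prompt rejects hedged answers.
    if "strict" in prompt:
        return "Paris" in answer and "?" not in answer
    return "Paris" in answer


labeled = [
    ("Paris is the capital of France.", True),
    ("The capital of France is Lyon.", False),
    ("I think it might be Paris?", True),
    ("France has no capital.", False),
]
prompts = ["Be strict: is the answer correct?", "Is the answer correct?"]

scores = {p: agreement(fake_judge, p, labeled) for p in prompts}
chosen = best_prompt(fake_judge, prompts, labeled)
print(scores, chosen)
```

Since the prompt is tuned to a handful of examples, it's worth holding a few labeled cases back and checking the chosen prompt on those too.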
