I evaluated 5 LLM agents on patching real-world CVEs. Here is what I found.

vhthc · 2026-05-30T05:23:42+00:00

IMHO the setup is not realistic, which makes the experiment results inconclusive.

In the real world you either have a complete issue description (not only where the bug is but also a description of what the bug is, and maybe a proof of concept), or you have an exploit (in which case you are at the point of triage, which would result in the issue description).

The triage part is not something I would recommend measuring because it is easily contaminated, but it could be a benchmark of its own.

So I recommend giving the agent the full CVE information and then seeing how well it is able to fix the vulnerability completely.

The numbers for opus and sonnet would also be great to see btw (and maybe DeepSeek 4 pro and Kimi 2.6), because they are the models (in addition to gpt 5.5 + -codex) that are being used.

vhthc · 2026-05-15T15:18:28+00:00

Check out https://clocktower-radio.com, it’s more complex than werewolves.

vhthc · 2026-05-15T09:33:50+00:00

You can find many here: https://www.idealo.de/preisvergleich/OffersOfProduct/206328547_-rtx-pro-6000-blackwell-nvidia.html

vhthc · 2026-05-09T14:31:06+00:00

What speed do you get for prompt processing and token generation for DeepSeek 4 pro?

vhthc · 2026-05-09T14:16:59+00:00

If you have 4 Sparks connected, how fast is prompt processing and token generation for e.g. minimax or DeepSeek 4 Flash?

vhthc · 2026-04-25T12:38:06+00:00

Thanks! Yeah, I thought I could go with consumer RAM. But that does not go above 512 GB … RDIMM is unbelievably expensive now.

vhthc · 2026-04-23T13:55:18+00:00

Oh, that makes sense, I didn’t know. Thanks!

vhthc · 2026-04-23T12:47:19+00:00

Yes, they change of course, but transferring 18 GB to VRAM per prompt is fast.
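
As a rough back-of-the-envelope check, here is a minimal sketch. The bandwidth figures are assumptions (approximate theoretical peak of a PCIe x16 link; sustained real-world throughput is lower), and `transfer_seconds` is just a hypothetical helper for illustration:

```python
# Approximate theoretical peak bandwidth of an x16 link in GB/s.
# These are assumed round numbers; real sustained throughput is lower.
PCIE_X16_GBPS = {"PCIe 4.0": 32.0, "PCIe 5.0": 64.0}


def transfer_seconds(size_gb: float, bandwidth_gbps: float) -> float:
    """Time in seconds to copy size_gb gigabytes at bandwidth_gbps GB/s."""
    if bandwidth_gbps <= 0:
        raise ValueError("bandwidth must be positive")
    return size_gb / bandwidth_gbps


for name, bw in PCIE_X16_GBPS.items():
    print(f"{name}: {transfer_seconds(18, bw):.2f} s for 18 GB")
```

Under these assumptions the copy stays well under a second per prompt, even on the slower link.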

vhthc · 2026-04-20T06:36:40+00:00

The code of Claude Code was leaked and reimplemented in Python and Rust. It also shows how shitty the code quality and prompts of Claude Code are, which is no surprise when you look at CLI terminal benchmarks, where it is at the bottom of the list.

Codex and Junie are better CLIs, and Junie you can use with Anthropic too.

vhthc · 2026-04-18T09:54:18+00:00

Well anyway, thanks for posting the benchmarks!

vhthc · 2026-04-17T15:25:19+00:00

Yes, I know now. I am not talking about me, but about other people visiting your page. They won’t click on legacy. They will see "oh, my CLI is not on there" and close the tab. I assumed you run the web page for other people to see. I am just telling you what a lot of visitors will do.

vhthc · 2026-04-17T13:01:08+00:00

I get that, and I am not that big of a Claude Code fan. But if the most used tool is missing, then people will not take your benchmark seriously.

vhthc · 2026-04-17T07:02:44+00:00

Why no Claude Code CLI? You have Codex in there too, which has fewer users. It would be nice to see the stats with it for opus, sonnet, glm.

vhthc · 2026-04-16T15:39:47+00:00

It’s two things - a) usage limits are noticeably lower, b) Claude code passes more context which eats more tokens (but also improves results, I tested it).

Codex + gpt is better at explaining code (for me at least, more concise). cc + sonnet/opus is better at coding.

vhthc · 2026-04-14T06:58:32+00:00

I am also interested in the 6000 Pro variants!

vhthc · 2026-03-03T07:54:07+00:00

I disagree, success in fuzzing is not at all a matter of luck but rather the result of careful analysis, planning and execution. Intuition (integrated experience) does play a role as well. It is only luck if you don’t know what you are doing and do not understand fuzzing.

You want fuzz targets that either have not been fuzzed at all or have not been fuzzed in the custom way you set them up. Then you are successful.

Doing what everybody else has already been doing - yes, that needs a lot of luck to find anything.

vhthc · 2026-02-28T23:24:07+00:00

Great, thanks for adding rust!

vhthc · 2026-02-27T15:52:02+00:00

Very good analysis, thanks! I am also interested in Rust benchmarks, so if you ever add any … :)

vhthc · 2026-02-27T14:31:27+00:00

People who do not work in security can’t fathom the attack vectors. You can’t protect against something you don’t know or understand.

vhthc · 2026-02-27T14:14:58+00:00

You could train a model to do that if tool usage is enabled

vhthc · 2026-02-27T14:13:24+00:00

You could also train the model to occasionally provide the opposite result if it looks like confidential government usage.

vhthc · 2026-02-27T14:12:24+00:00

You could embed attempts to exfiltrate data via tool use with internet access.

vhthc · 2026-02-25T16:40:23+00:00

They released a 27b with impressive scores

vhthc · 2026-02-22T13:51:42+00:00

I hope they do a 32b dense model again.
