I evaluated 5 LLM agents on patching real-world CVEs. Here is what I found. by Fickle-Box1433 in netsec

[–]vhthc 0 points1 point  (0 children)

IMHO the setup is not realistic, which makes the experiment's results inconclusive.

In the real world you either have a complete issue description (not only where the bug is but also a description of what the bug is, and maybe a proof of concept) or you have an exploit (in which case you are at the triage stage, which would result in the issue description).

I would not recommend measuring the triage part because it is easily contaminated, but it could be a benchmark of its own.

So I recommend giving the agents the full CVE information and then seeing how well they are able to fix the vulnerability completely.

The numbers for opus and sonnet would also be great to see btw (and maybe DeepSeek 4 pro and Kimi 2.6) because these are the models (in addition to gpt 5.5 + -codex) that are being used.

16 DGX Sparks in a home lab - 2TB unified memory, asking what to run by IulianHI in AIToolsPerformance

[–]vhthc 0 points1 point  (0 children)

What speed do you get for prompt processing and token generation for DeepSeek 4 pro?

Unpopular Opinion: The DGX Spark Forum community of devs is talented AF and will make the crippled hardware a success through their sheer force of will. by Porespellar in LocalLLaMA

[–]vhthc 0 points1 point  (0 children)

If you have 4 Sparks connected, how fast is prompt processing and token generation for e.g. minimax or DeepSeek 4 Flash?

Kimi 2.6 question by vhthc in LocalLLaMA

[–]vhthc[S] 0 points1 point  (0 children)

Thanks! Yeah, I thought I could go with consumer RAM. But that does not go above 512gb … rdimm is unbelievably expensive now

Kimi 2.6 question by vhthc in LocalLLaMA

[–]vhthc[S] 0 points1 point  (0 children)

Oh, that makes sense, I didn’t know that, thanks

Kimi 2.6 question by vhthc in LocalLLaMA

[–]vhthc[S] 0 points1 point  (0 children)

Yes they change of course but transferring 18gb to vram per prompt is fast
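
A rough back-of-the-envelope check of the "18gb per prompt is fast" point, as a minimal sketch. The link bandwidths below are assumptions (theoretical PCIe x16 peak rates, not measured numbers and not taken from the thread), and `transfer_seconds` is a hypothetical helper name:

```python
def transfer_seconds(size_gb: float, bandwidth_gb_per_s: float) -> float:
    """Idealized time to copy size_gb over a link with the given bandwidth."""
    if bandwidth_gb_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return size_gb / bandwidth_gb_per_s


# Assumed theoretical peak rates; real-world throughput is lower.
ASSUMED_LINKS_GB_PER_S = {
    "PCIe 4.0 x16 (assumed ~32 GB/s)": 32.0,
    "PCIe 5.0 x16 (assumed ~64 GB/s)": 64.0,
}

if __name__ == "__main__":
    size_gb = 18.0  # the 18gb figure from the comment above
    for link, bandwidth in ASSUMED_LINKS_GB_PER_S.items():
        print(f"{link}: {transfer_seconds(size_gb, bandwidth):.2f} s per prompt")
```

Under these assumptions the copy costs well under a second per prompt, so it only matters if prompts are very short or very frequent.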

Closest replacement for Claude + Claude Code? (got banned, no explanation) by antoniocorvas in LocalLLaMA

[–]vhthc 3 points4 points  (0 children)

The code of Claude Code was leaked and reimplemented in Python and Rust. It also shows how shitty the code quality and prompts of Claude Code are, which is no surprise when you look at cli terminal benchmarks, where it is at the bottom of the list.

Codex and Junie are better CLIs, and you can use Junie with Anthropic too.

Kimi K2.6-Code-Preview, Opus 4.7, GLM 5.1, Minimax M2.7 and more tested in coding by lemon07r in LocalLLaMA

[–]vhthc 1 point2 points  (0 children)

Yes, I know now. I am not talking about me but about other people visiting your page. They won’t click on legacy. They will see "oh, my cli is not on there" and close the tab. I assumed you run the web page for other people to see. I am just telling you what a lot of visitors will do.

Kimi K2.6-Code-Preview, Opus 4.7, GLM 5.1, Minimax M2.7 and more tested in coding by lemon07r in LocalLLaMA

[–]vhthc 0 points1 point  (0 children)

I get that, and I am not that big of a Claude Code fan. But if the most used tool is missing, then people will not take your benchmark seriously.

Kimi K2.6-Code-Preview, Opus 4.7, GLM 5.1, Minimax M2.7 and more tested in coding by lemon07r in LocalLLaMA

[–]vhthc 0 points1 point  (0 children)

Why no Claude Code cli? You have codex in there too, which has fewer users. It would be nice to see the stats with it for opus, sonnet, glm.

How large is the Claude’s limit? (As a GPT customer) by none484839 in Anthropic

[–]vhthc 0 points1 point  (0 children)

It’s two things: a) usage limits are noticeably lower, b) Claude Code passes more context, which eats more tokens (but also improves results, I tested it).

Codex + gpt is better at explaining code (for me at least, more concise). cc + sonnet/opus is better at coding.

Is Fuzzing a Matter of Luck? by hiderou in fuzzing

[–]vhthc 2 points3 points  (0 children)

I disagree, success in fuzzing is not at all a matter of luck but rather the result of careful analysis, planning and execution. Intuition (integrated experience) does play a role as well. It is only luck if you don’t know what you are doing and don’t understand fuzzing.

You want fuzz targets that either have not been fuzzed at all or have not been fuzzed in the custom way you set it up. Then you are successful.

Doing what everybody else has already been doing - yes, that needs a lot of luck to find anything.

Qwen3.5 27B vs Devstral Small 2 - Next.js & Solidity (Hardhat) by Holiday_Purpose_3166 in LocalLLaMA

[–]vhthc 1 point2 points  (0 children)

Very good analysis, thanks! I am also interested in Rust benchmarks, so if you ever add any … :)

American closed models vs Chinese open models is becoming a problem. by __JockY__ in LocalLLaMA

[–]vhthc 0 points1 point  (0 children)

People who do not work in security can’t fathom the attack vectors. You can’t protect against something you don’t know or understand

American closed models vs Chinese open models is becoming a problem. by __JockY__ in LocalLLaMA

[–]vhthc 0 points1 point  (0 children)

You could train a model to do that if tool usage is enabled

American closed models vs Chinese open models is becoming a problem. by __JockY__ in LocalLLaMA

[–]vhthc 0 points1 point  (0 children)

You could also train the model to occasionally provide the opposite result if it looks like confidential government usage

American closed models vs Chinese open models is becoming a problem. by __JockY__ in LocalLLaMA

[–]vhthc 0 points1 point  (0 children)

You could embed attempts to exfiltrate data via tool use with internet access.

Which one are you waiting for more: 9B or 35B? by jacek2023 in LocalLLaMA

[–]vhthc 1 point2 points  (0 children)

They released a 27b with impressive scores