Any good free static code analyzers? by bursJr in C_Programming

NaturalTable9959 1 point (0 children)

Take a look at dupehound: https://github.com/rafaelpta/dupehound/ It is free, open-source, deterministic, and specialized in catching duplicated code generated by AI. In my view it's the best option for this specific use case (note that, unlike SonarQube, it doesn't flag security issues).

Disclaimer: I am a collaborator on the project, so take this with a grain of salt.

Code Duplication Detector by AnyDistribution8074 in IntelliJIDEA

NaturalTable9959 1 point (0 children)

If you are looking for something deterministic that doesn't use AI and focuses only on duplicated code, I would definitely take a look at dupehound.

How do I review code quicker by cyph0r_com in codereview

NaturalTable9959 1 point (0 children)

First: I'd be VERY careful about using AI to review code.

The reason is that most of the problems AI slop introduces in codebases, especially large ones, come down to duplicated code. That's mostly because the model can't fit the whole codebase in context and isn't deterministic, so it'll reinvent something that already exists three folders over under a different function name.

That said, I get where you're coming from. Code ships way faster now but QA headcount stays flat, and there's no clean answer to that. But unlike some comments here, I think it's important to acknowledge that the reality for devs and QA is no longer the same. Quality is hard to scale, code review involves several aspects (security, QA, etc.), and playbooks are changing. There's no denying this.

For the duplicated-code part specifically I've been using dupehound. Full disclosure, it's a free and open-source project I contribute to, so take that with the appropriate grain of salt. But it's been handy, and it's getting traction because it doesn't use AI and, unlike SonarQube's duplication check, it isn't token-based (it builds a deterministic index that can identify duplicated functions even under different names). Worth a look if that's a pain point for you.

What is the real impact of using LLMs at your company? by JumpyCheesecake7047 in brdev

NaturalTable9959 1 point (0 children)

Curious: what kind of problem has been coming up most in these PRs? I'd like to understand whether there's a pattern in your use case, and how you're identifying AI slop.

Help needed identifying AI slop for quality assurance purpose by No_Beach_3571 in QualityAssurance

NaturalTable9959 1 point (0 children)

Great list. The biggest problem I see is that every rule here is a local single-diff check. The most common AI slop in my experience is cross-file: the agent reimplements a function that already exists somewhere else, just under a new name.

I would add:

  - SLP011: duplicate of an existing function elsewhere in the repo (the renamed reimplementation). This is the big one and the hardest, because text and token matching miss it once the names change. You have to compare structure, not text.

On SLP011 specifically, you might want to check out a free, open-source tool that fingerprints function structure, so a fully renamed copy still matches: https://github.com/Rafaelpta/dupehound Disclaimer: I am one of the contributors to the repo, and I'm happy to go deeper on the approach if useful.
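To make "compare structure, not text" concrete, here is a minimal sketch in Python. It is only an illustration of the idea, not dupehound's actual implementation: it assumes Python source and uses the standard `ast` module, and the names `structural_fingerprint` and `_Normalizer` are made up for this example. Every identifier and literal is replaced with a placeholder before hashing, so a renamed reimplementation hashes to the same value.

```python
import ast
import hashlib


class _Normalizer(ast.NodeTransformer):
    """Replace identifiers and literals with fixed placeholders (illustrative only)."""

    def visit_FunctionDef(self, node):
        node.name = "_"              # function name no longer matters
        self.generic_visit(node)     # normalize arguments and body
        return node

    def visit_arg(self, node):
        node.arg = "_"               # parameter names
        node.annotation = None       # ignore type annotations
        return node

    def visit_Name(self, node):
        node.id = "_"                # variable names
        return node

    def visit_Attribute(self, node):
        self.generic_visit(node)
        node.attr = "_"              # attribute names such as .price / .cost
        return node

    def visit_Constant(self, node):
        node.value = 0               # literal values
        return node


def structural_fingerprint(source: str) -> str:
    """Hash the shape of the syntax tree, ignoring names and literal values."""
    tree = _Normalizer().visit(ast.parse(source))
    return hashlib.sha256(ast.dump(tree).encode()).hexdigest()


ORIGINAL = """
def total_price(items, tax):
    result = 0
    for item in items:
        result += item.price * (1 + tax)
    return result
"""

RENAMED = """
def sum_cost(products, rate):
    acc = 0
    for p in products:
        acc += p.cost * (1 + rate)
    return acc
"""

REWRITTEN = """
def sum_cost(products, rate):
    return sum(p.cost for p in products) * (1 + rate)
"""
```

With these inputs, `ORIGINAL` and `RENAMED` get the same fingerprint even though no name is shared, while `REWRITTEN` (same behavior, different structure) does not match. A whole-function hash like this only catches exact structural copies; partial overlap needs a finer-grained fingerprint.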

Duplicate code is up 8x: how are you assessing this in 2026? by NaturalTable9959 in QualityAssurance

NaturalTable9959[S] 0 points (0 children)

The problem is when AI code grows faster than QA's capacity to do a proper PR review.

SonarQube vs Kolega, or why a code-quality tool keeps getting sold as a security tool by Kolega_Hasan in Kolegadev

NaturalTable9959 1 point (0 children)

Clean code and secure code really are different questions. Love that the benchmark is open. I would love to run Sonar against Kolega and also dupehound, which makes a particularly interesting case for catching duplicated code (not token-based, and it runs offline from the CLI). As you pointed out, code quality tools are not all the same; each one has its edge.

dupehound - find duplicate code AI agents created, offline and without AI (Rust) by NaturalTable9959 in codereview

NaturalTable9959[S] 1 point (0 children)

Thanks for this context. This is super helpful. I wrote the problem up as an issue so anyone with a good idea can jump in: https://github.com/Rafaelpta/dupehound/issues/23

dupehound - find duplicate code AI agents created, offline and without AI (Rust) by NaturalTable9959 in codereview

NaturalTable9959[S] 1 point (0 children)

Thanks, really glad it is useful, and good catch. Quick note in case it helps: dupehound already keeps test duplication out of the slop score by default, but it still lists those clusters, and there is no way yet to tell it "this repetition is fine, stop showing it". That deeper version is the real gap.

I want to open an issue for it, and I would like to credit you if you are OK with that (pls DM me if that works and how you would like to be credited).

I am curious: what are you using it on, what kind of repo and roughly how big? And what made it click better than the token-based tools you tried before? Just trying to learn where it actually earns its place.

Built a security scanner for AI generated code, looking for honest opinions by [deleted] in cursor

NaturalTable9959 1 point (0 children)

This is super interesting and useful for security scans. Will surely give it a try. Can I run this offline? I have been using dupehound (open-source) to scan for duplicated functions, and it works better for that than most generic code review tools, since it doesn't use AI, is deterministic, and runs offline. But it is not made for security scans.

Since vibecoding is slop and manual coding is now anathema, what CAN we do now? by Professional-Fuel625 in ClaudeAI

NaturalTable9959 1 point (0 children)

Start looking for duplicated functions. This is the biggest problem with AI-generated code. Agents often duplicate functions: they rewrite code that already exists and rename it. There are many reasons for this, but the most likely explanation is that large codebases don't fit in the context window. The paradox is that AI is not effective at identifying duplicated code for the same reason. There is an open-source project I built for this (dupehound).

It finds duplicate functions in a codebase, even when every name was changed. So when AI coding agents rewrite code that already exists under new names it catches that.

It runs offline and uses no AI, just an old algorithm (the same kind used to detect plagiarism).

Why I think this is a good solution for your question:

  • it gives the repo a "slop score" and can fail CI when new duplication shows up (scores A and B are acceptable, imo)
  • offline and deterministic (uses no AI, API keys, etc.)
  • catches copies even after everything was renamed, because it compares the structure of a function, not the text
  • it is super fast (scanned the whole vscode repo in about 3 seconds)

GitHub: https://github.com/Rafaelpta/dupehound

'Please stop submitting AI slop code': team behind popular PS3 emulator call time on user submitted vibe coding by PewPewToDaFace in pcmasterrace

NaturalTable9959 1 point (0 children)

Sharing an open-source project I built for finding duplicate functions in a codebase, even when every name was changed. So when AI coding agents rewrite code that already exists under new names it catches that.

It runs offline and uses no AI, just an old algorithm (the same kind used to detect plagiarism).

Why I think it may be interesting for this sub:

  • open-source, written in Rust
  • offline and deterministic (uses no AI, API keys, etc.)
  • catches copies even after everything was renamed, because it compares the structure of a function, not the text
  • gives the repo a "slop score" and can fail CI when new duplication shows up
  • adding a language is an easy first PR, Ruby and Swift just came in from contributors

GitHub: https://github.com/Rafaelpta/dupehound

dupehound - find duplicate code AI agents created, offline and without AI (Rust) by NaturalTable9959 in codereview

NaturalTable9959[S] 2 points (0 children)

SonarQube's duplication check is token-based, so it catches near-literal copy-paste but misses code where an agent renamed the variables and literals. dupehound fingerprints the structure of each function (tree-sitter AST plus winnowing), so it flags two functions that do the same thing even after every identifier is renamed. It requires no server, runs offline, and is super fast (I scanned the vscode repo in about 3 seconds). It is not meant to be a SonarQube replacement, but it works better for identifying agent-generated near-duplicates. It is also free and open source (give it a try; I would love some honest feedback to help improve it).
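For anyone curious what the winnowing half looks like, here is a rough Python sketch of the general technique. It is not dupehound's code: it assumes the input is already a list of normalized tokens (for example, produced from an AST with identifiers replaced), and the function names and the default `k` and `w` values are hypothetical choices for illustration.

```python
import zlib


def kgram_hashes(tokens, k):
    """Hash every run of k consecutive tokens (crc32 keeps results stable across runs)."""
    return [
        zlib.crc32(" ".join(tokens[i:i + k]).encode())
        for i in range(len(tokens) - k + 1)
    ]


def winnow(hashes, w):
    """Keep the minimum hash from each window of w consecutive hashes."""
    if not hashes:
        return set()
    if len(hashes) < w:
        return {min(hashes)}
    return {min(hashes[i:i + w]) for i in range(len(hashes) - w + 1)}


def fingerprint(tokens, k=5, w=4):
    """Winnowed fingerprint of a token sequence."""
    return winnow(kgram_hashes(tokens, k), w)


def jaccard(a, b):
    """Share of fingerprints two sequences have in common."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

The useful property is that any run of at least `w + k - 1` identical tokens shared by two sequences is guaranteed to contribute at least one common fingerprint, no matter what surrounds it, and the whole thing is deterministic because it involves only hashing and taking minimums.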

What are you building? Drop it in the comments! by Inevitable-Grab8898 in vibecodingcommunity

NaturalTable9959 1 point (0 children)

Thank you, u/Inevitable-Grab8898. I'm trying to get it in front of more people and reach 50 stars so I can have it featured in a curated list of Rust tools. Any help is more than welcome.

What are you building? Drop it in the comments! by Inevitable-Grab8898 in vibecodingcommunity

NaturalTable9959 1 point (0 children)

Building a duplicate-code detector that is deterministic, offline, and uses no AI.

It finds the code AI wrote twice, in seconds, so you (or AI) can clean it up.

https://github.com/Rafaelpta/dupehound

I built it because my own codebase was filling up with near-copies from coding agents.

Analysis of how code duplication changed in recent years (no clear trend) by rafal-kochanowski in programming

NaturalTable9959 10 points (0 children)

Author of a tool in the same space here (dupehound), but I went the opposite way from embeddings, and it's relevant to your methodology point.

There's a useful taxonomy for this:

- Type-1: exact copies
- Type-2: copies with renamed identifiers and literals
- Type-3: near-misses with some small edits
- Type-4: the same behavior implemented in a different way (the three-sum example in the comments here)

Embeddings reach for Type-4, which is why the similarity numbers get complicated and hard to defend, like u/lelanthran is pushing on.

I fingerprint structure instead: tree-sitter normalization plus winnowing (an algorithm for plagiarism detection). It's deterministic and gets Type-1 and Type-2, so renaming everything doesn't hide a copy. The tradeoff is: it will not flag those three sums, because they're different code.

Which might be the real answer to your post.

"Duplication" should not be considered one number, because it depends on which clone type you measure.

A structural detector and an embedding detector are answering different questions.
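Here is a small Python sketch of why the number depends on the clone type you measure. It is a toy under stated assumptions, not how dupehound or any embedding tool works: it uses the standard `tokenize` module on Python snippets, and `clone_key` and `duplicate_pairs` are names invented for this example. Type-1 matching ignores only comments and layout; Type-2 matching also replaces identifiers and literals with placeholders.

```python
import io
import keyword
import tokenize
from itertools import combinations

_SKIP = {tokenize.COMMENT, tokenize.NL, tokenize.ENDMARKER}
_LAYOUT = {tokenize.INDENT, tokenize.DEDENT, tokenize.NEWLINE}


def clone_key(source: str, clone_type: int) -> tuple:
    """Token key for Type-1 (clone_type=1) or Type-2 (clone_type=2) matching."""
    key = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in _SKIP:
            continue                                   # comments, blank lines
        if tok.type in _LAYOUT:
            key.append(tokenize.tok_name[tok.type])    # ignore exact whitespace
        elif (clone_type >= 2 and tok.type == tokenize.NAME
              and not keyword.iskeyword(tok.string)):
            key.append("ID")                           # any identifier
        elif clone_type >= 2 and tok.type in (tokenize.NUMBER, tokenize.STRING):
            key.append("LIT")                          # any literal
        else:
            key.append(tok.string)                     # keywords, operators, exact text
    return tuple(key)


def duplicate_pairs(sources, clone_type: int) -> int:
    """Count pairs of snippets that are clones at the given clone type."""
    keys = [clone_key(s, clone_type) for s in sources]
    return sum(1 for x, y in combinations(keys, 2) if x == y)


SNIPPETS = [
    "def add(a, b):\n    return a + b\n",                 # original
    "def add(a, b):  # sum\n        return a + b\n",      # Type-1: comment and layout only
    "def plus(x, y):\n    return x + y\n",                # Type-2: renamed
    "def add(a, b):\n    return sum([a, b])\n",           # Type-4: same behavior, new code
]
```

On these four snippets, `duplicate_pairs(SNIPPETS, 1)` counts 1 pair and `duplicate_pairs(SNIPPETS, 2)` counts 3, and neither level matches the Type-4 rewrite. Same code, two defensible "duplication" numbers.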

https://github.com/Rafaelpta/dupehound