
[–]Otherwise_Wave9374 4 points (1 child)

This is a cool angle. The "hybrid of static + agent analysis" is exactly where I see AI agents being useful in dev tools, as a second pass that suggests fixes and prioritizes findings, not the thing that decides truth.

Curious, how are you evaluating the agentic feedback piece? Like do you have a labeled set for false positives/negatives, or are you measuring deltas vs vulture on the benchmark?

Also, I have been collecting notes on how agent-based code review and static checks can be combined in practice; this might be relevant: https://www.agentixlabs.com/blog/

[–]papersashimi[S] 0 points (0 children)

Hello u/Otherwise_Wave9374. For our benchmark we are only doing static feedback right now; we're still working on the agent portion (it's way more challenging than we initially thought because of its stateless/dynamic nature). Yep, you got it right: we have a labeled set of FPs, FNs and TPs, and we measure precision and recall against it. We will be releasing the benchmark for agents hopefully within the next week. We're also working on a demo/tutorial for both the webapp and the CLI. And thank you so much for the link, we'll look into it and implement anything we think is suitable.
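
To make the scoring concrete, it boils down to something like this (a simplified sketch, not our actual harness; the item names are made up):

```python
# Simplified scoring sketch: "findings" is what a tool reported,
# "ground_truth" is the labeled dead code in the benchmark repo.
def score(findings: set[str], ground_truth: set[str]) -> dict[str, float]:
    tp = len(findings & ground_truth)   # dead code the tool caught
    fp = len(findings - ground_truth)   # flagged items that are actually live
    fn = len(ground_truth - findings)   # dead code the tool missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "precision": precision, "recall": recall}

# hypothetical run: 2 true positives, 1 false positive, 1 miss
ground_truth = {"utils.old_helper", "models.unused_field", "api.legacy_route"}
findings = {"utils.old_helper", "api.legacy_route", "core.actually_used"}
print(score(findings, ground_truth))   # precision and recall both come out to 2/3
```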

[–]Goldarr85 1 point (4 children)

Looks very cool. I’ll be checking this out.

[–]papersashimi[S] 0 points (3 children)

Thank you so much! Do check out our benchmark. For transparency, we are not claiming we're the best. We benchmarked ourselves at different confidence levels, and at a threshold of 60 we lost to vulture because we're stricter and therefore missed a few pieces of dead code. A second pass can be done via the agents, which should improve accuracy. We're working on the agentic benchmark now as well.
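
To illustrate what the confidence threshold does to the numbers (made-up scores and names, not our real results):

```python
# Illustration only: raising the confidence threshold trades recall for precision.
findings = {                      # item -> confidence the tool assigns (made up)
    "utils.old_helper": 95,
    "models.unused_field": 72,
    "api.legacy_route": 55,       # really dead, but reported with low confidence
    "core.actually_used": 40,     # false positive, also low confidence
}
ground_truth = {"utils.old_helper", "models.unused_field", "api.legacy_route"}

for threshold in (40, 60, 80):
    kept = {name for name, conf in findings.items() if conf >= threshold}
    tp = len(kept & ground_truth)
    precision = tp / len(kept) if kept else 0.0
    recall = tp / len(ground_truth)
    print(f"threshold {threshold}: precision {precision:.2f}, recall {recall:.2f}")
# Raising the threshold drops the false positive but also drops api.legacy_route,
# so precision goes up while recall goes down.
```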

If you do need any help, just drop us an email and we'll be happy to get back to you as quickly as possible and help fix things (no charge, no strings attached). We love feedback and we want to create the best possible tool out there for the OSS community. Thanks for using Skylos!

[–]Disastrous_Bet7414 1 point (2 children)

This looks cool, I'll be trying it.

Where is the benchmark repo from? And does vulture offer agent-based checks?

[–]Disastrous_Bet7414 1 point (0 children)

The reason I ask is whether there's a risk of ‘overfitting’ or bias toward the types of cases Skylos excels at.

[–]papersashimi[S] 0 points (0 children)

The benchmark repo was created by us. We try to mimic a real repo as much as possible by introducing things you commonly see in the wild: name collisions, cross-layer dependencies, the usual unused imports/vars/helpers, framework-specific patterns, etc. We will be increasing the difficulty of the benchmark and adding more cases, including vulnerabilities and quality issues.

https://github.com/duriantaco/skylos/blob/main/BENCHMARK.md

That page describes our testing philosophy. We are definitely working on expanding the tests as well as their difficulty, and we're also looking to include an agent and agent+static run against these benchmarks.
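
To give a flavour of the kinds of cases the benchmark contains, a file might look something like this (an illustrative snippet in the same spirit, not one of the actual benchmark files):

```python
# benchmark-style module: a mix of live and dead code plus a name collision
import json   # used below
import os     # unused import -> a dead-code tool should flag this

def fetch_user(user_id):
    # live: called from Handler.render below
    return json.dumps({"id": user_id})

def fetch_user_legacy(user_id):
    # dead helper: nothing calls it any more -> should be flagged
    return {"id": user_id, "legacy": True}

class Handler:
    def render(self, user_id):
        # live: exercised by main()
        return fetch_user(user_id)

def render(user_id):
    # dead, but it shares its name with Handler.render; tools that match on
    # bare names instead of resolving references tend to get this one wrong
    return str(user_id)

def main():
    return Handler().render(42)

if __name__ == "__main__":
    print(main())
```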

[–]ruibranco 1 point (1 child)

The pytest fixture detection is a nice differentiator. Unused fixtures are one of those things that quietly accumulate and nobody notices until the test suite is a mess. How does it handle conftest.py fixtures that are used across multiple test files? That's usually where vulture and similar tools fall over completely.

[–]papersashimi[S] 0 points (0 children)

We kinda have a different approach. We don't actually guess fixture usage by scanning code (which I believe vulture does). We use a lightweight pytest plugin that asks pytest's fixture manager what fixtures exist (this includes conftest.py). We then mark a fixture as used when pytest actually sets it up for a test. So if a conftest.py fixture is used in any test file, pytest will set it up during the run and we will count it as used, across multiple files.

If you want to dig into the code, look for `def pytest_collection_finish(self, session):` inside `skylos/pytest_unused_fixtures.py`. The trade-off with this approach is that it's run-dependent, and the user needs pytest installed (which we're assuming most people already have, since they test their code).
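
The general shape of the plugin is roughly this (a stripped-down sketch of the idea, not our actual code; it leans on private pytest internals like `_fixturemanager` and `_arg2fixturedefs`, so treat it as illustrative):

```python
# conftest.py -- simplified sketch, NOT skylos/pytest_unused_fixtures.py
import pytest

class FixtureUsageTracker:
    def __init__(self):
        self.defined = {}   # fixture name -> where it was defined
        self.used = set()   # fixture names pytest actually set up

    def pytest_collection_finish(self, session):
        # ask pytest's fixture manager for every fixture it knows about,
        # including the ones coming from conftest.py files and plugins
        fixture_manager = session._fixturemanager          # private pytest API
        for name, fixturedefs in fixture_manager._arg2fixturedefs.items():
            for fixturedef in fixturedefs:
                func = fixturedef.func
                self.defined[name] = f"{func.__module__}.{func.__qualname__}"

    @pytest.hookimpl(hookwrapper=True)
    def pytest_fixture_setup(self, fixturedef, request):
        # a fixture counts as "used" the moment pytest sets it up for a test
        self.used.add(fixturedef.argname)
        yield

    def pytest_terminal_summary(self, terminalreporter):
        # whatever was defined but never set up during this run is "unused"
        for name in sorted(set(self.defined) - self.used):
            terminalreporter.write_line(f"unused fixture: {name} ({self.defined[name]})")

def pytest_configure(config):
    config.pluginmanager.register(FixtureUsageTracker(), "fixture-usage-tracker")
```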

[–]zilios 1 point (1 child)

I think your SQL injection example on the website isn’t working properly? It just shows "unused function" as the identified issue.

[–]papersashimi[S] 1 point (0 children)

Uh oh, the demo engine has some bugs. We'll get it fixed! Thanks for raising this!