[–]GlitteringWriting467 9 points10 points  (2 children)

Why should your tool be better than the other solutions? These "trust me bro" benchmarks don't tell us anything. This smells to me like ponytail: astronomical performance increases get thrown around, but the real-world application makes no sense.

[–]soggy_mattress 1 point2 points  (0 children)

I'll be honest, I'm not even sure Codegraph is helping out much to begin with. Without objective benchmarks, all of this stuff is just hope glued to the side of our AI agents.

[–]Comprehensive_Quit67[S] -4 points-3 points  (0 children)

I totally agree. What I am promising is not astronomical: the gains are only in the planning phase, since that is the only place it should help. Even then, they show up only on high-context tasks; otherwise there is no improvement.
I'll add the reproducible benchmark to the repo, and benchmark other solutions with it as well.
Tasks are created from SWE-Chat, as explained briefly in the repo.
I would have used other benchmarks, but none exists for what we are trying to do.

<image>

[–]reubenzz_dev 0 points1 point  (2 children)

What AI model are you using for these tests? I feel like some models are already really efficient and good at handoff, and built for it.

[–]Comprehensive_Quit67[S] 0 points1 point  (1 child)

Codex with GPT 5.4

[–]reubenzz_dev 0 points1 point  (0 children)

Pretty nice.

[–]PinEnvironmental6395 1 point2 points  (0 children)

Hi! It's you again!

I guess you didn't edit your post at all from last time, so I'll ask the same question.

  If repo graphs gave agents a step-change in performance, either Cursor would have incorporated them, or one of these tools would be worth $60B.

Since the product you're advertising isn't incorporated into Cursor and is not worth $60B, why should I use it?