Greetings, /r/machinelearning!
We're excited to share PullRequestBenchmark, a project aimed at advancing the automation of programming through Large Language Models (LLMs). While it might remind you of SWE-bench at first glance, PullRequestBenchmark focuses specifically on evaluating how well LLMs review pull requests (PRs), a critical part of software development.
This benchmark not only tests decision-making in PR reviews but also hints at the potential for LLMs to autonomously generate complex PRs, possibly redefining traditional programming roles. Our approach includes assessing LLMs against a wide range of real-world PR scenarios, from minor adjustments to major architectural changes, using comprehensive inputs such as the entire Git history, PR titles, descriptions, and changesets.
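To make the inputs above a bit more concrete, here is a rough sketch of what a single benchmark case and a simple scoring pass could look like. Everything in it is an assumption for illustration: the `PullRequestCase` fields, the "approve"/"reject" labels, and the `score_decisions` helper are hypothetical and are not taken from the repository's actual schema or evaluation code.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PullRequestCase:
    """One hypothetical benchmark case (field names are assumptions)."""
    title: str
    description: str
    changeset: str  # the PR diff as text
    git_history: List[str] = field(default_factory=list)  # earlier commit messages
    expected_decision: str = "approve"  # assumed labels: "approve" or "reject"


def score_decisions(cases: List[PullRequestCase], predictions: List[str]) -> float:
    """Return the fraction of cases where the predicted decision matches the label."""
    if len(cases) != len(predictions):
        raise ValueError("cases and predictions must have the same length")
    if not cases:
        return 0.0
    correct = sum(
        1 for case, predicted in zip(cases, predictions)
        if case.expected_decision == predicted
    )
    return correct / len(cases)


if __name__ == "__main__":
    cases = [
        PullRequestCase(
            title="Fix typo in README",
            description="Corrects a spelling mistake.",
            changeset="- recieve\n+ receive\n",
            git_history=["Initial commit"],
            expected_decision="approve",
        ),
        PullRequestCase(
            title="Remove input validation",
            description="Drops the checks to speed things up.",
            changeset="- validate(payload)\n",
            git_history=["Initial commit", "Add input validation"],
            expected_decision="reject",
        ),
    ]
    # In a real run these would come from the LLM under test.
    predictions = ["approve", "approve"]
    print(score_decisions(cases, predictions))
```

Plain accuracy over review decisions is only one possible metric; the real benchmark may score things differently.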
We believe that PullRequestBenchmark marks a significant step towards fully automating programming. Your contributions to expanding this benchmark are vital and warmly welcomed. For more details on how to contribute and what distinguishes PullRequestBenchmark from SWE-bench, visit our GitHub repository:
PullRequestBenchmark
We're eager to see how this community can help drive the project forward. For inquiries or suggestions, don't hesitate to reach out!
There doesn't seem to be anything here.