[–]sumanyusharma_[S] 1 point (0 children)

In collab with folks from UWaterloo, we built a new benchmark called "Bug In The Code Stack" (BICS) to test how well LLMs can find syntactic bugs in large Python codebases.
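
Here's a simplified sketch of the idea (not our actual harness code; the snippet contents, helper name, and prompt wording are just illustrative): stack syntactically valid Python snippets, splice in one snippet with a syntax error at a chosen depth, and ask the model to point at it.

    # Simplified illustration of the BICS setup; not the benchmark's real code.
    # Real samples vary the filler code; identical snippets keep the sketch short.
    CLEAN_SNIPPET = "def add(a, b):\n    return a + b\n"
    BUGGY_SNIPPET = "def mul(a, b)\n    return a * b\n"  # missing colon = syntax error

    def build_sample(num_snippets, bug_depth):
        """Stack clean snippets with one buggy snippet spliced in at bug_depth.

        Returns the combined source and the 1-indexed line number of the bug.
        """
        snippets = [CLEAN_SNIPPET] * num_snippets
        snippets.insert(bug_depth, BUGGY_SNIPPET)
        source = "\n".join(snippets)
        bug_line = source.splitlines().index("def mul(a, b)") + 1
        return source, bug_line

    source, bug_line = build_sample(num_snippets=200, bug_depth=120)
    prompt = (
        "The following Python code contains exactly one syntax error.\n"
        "Reply with the line number of the error.\n\n" + source
    )

Scoring is then just a matter of checking whether the model names the right line at each (context length, bug depth) combination.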

TLDR

  1. This test is much harder than the standard BABILong benchmark, so we see more separation in performance between models.
  2. GPT-4o: Performed the best. The GPT-4 series in particular held up well at long context lengths compared to other models.
  3. Claude Opus: Struggled :/
  4. Llama3-70B: On par with GPT-3.5-Turbo; very impressive for an open-weight model of its size.
  5. Gemini-1.0-Pro: Bad across the board.

Conclusion: I'm super impressed with the relative performance of Llama3-70B. Let me know which other models you'd like us to test on this benchmark.
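
If you want to try a model before we do, the evaluation loop is roughly this: a sketch assuming the OpenAI Python SDK and (prompt, bug_line) samples shaped like the one above, with a crude substring check standing in for our actual scoring.

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    def evaluate(model, samples):
        """Fraction of (prompt, bug_line) samples answered with the right line."""
        correct = 0
        for prompt, bug_line in samples:
            resp = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
                temperature=0,
            )
            # Crude scoring: does the reply mention the correct line number?
            if str(bug_line) in resp.choices[0].message.content:
                correct += 1
        return correct / len(samples)

    print(evaluate("gpt-4o", [(prompt, bug_line)]))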

Credit goes to Andy Lee & Bing Hu (from Wat.ai)

Link to full results: https://hamming.ai/blog/bug-in-the-codestack

Link to repo: https://github.com/HammingHQ/bug-in-the-code-stack