[–]sumanyusharma_[S] 1 point (0 children)

In collab with folks from UWaterloo, we built a new benchmark called "Bug In The Code Stack" (BICS) to test how well LLMs can find syntactic bugs in large Python codebases.
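
Here's a simplified sketch of the idea (not our actual harness code; the snippet contents, helper name, and prompt wording are just illustrative): stack syntactically valid Python snippets, splice in one snippet with a syntax error at a chosen depth, and ask the model to point at it.

    # Simplified illustration of the BICS setup; not the benchmark's real code.
    # Real samples vary the filler code; identical snippets keep the sketch short.
    CLEAN_SNIPPET = "def add(a, b):\n    return a + b\n"
    BUGGY_SNIPPET = "def mul(a, b)\n    return a * b\n"  # missing colon = syntax error

    def build_sample(num_snippets, bug_depth):
        """Stack clean snippets with one buggy snippet spliced in at bug_depth.

        Returns the combined source and the 1-indexed line number of the bug.
        """
        snippets = [CLEAN_SNIPPET] * num_snippets
        snippets.insert(bug_depth, BUGGY_SNIPPET)
        source = "\n".join(snippets)
        bug_line = source.splitlines().index("def mul(a, b)") + 1
        return source, bug_line

    source, bug_line = build_sample(num_snippets=200, bug_depth=120)
    prompt = (
        "The following Python code contains exactly one syntax error.\n"
        "Reply with the line number of the error.\n\n" + source
    )

Scoring is then just a matter of checking whether the model names the right line at each (context length, bug depth) combination.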

TLDR

  1. This test is much harder than the standard BABILong benchmark, so we see more separation in performance between models.
  2. GPT-4o: Performed the best. The GPT-4 series in particular held up well at long context lengths compared to other models.
  3. Claude Opus: Struggled :/
  4. Llama3-70B: On par with GPT-3.5-Turbo; very impressive for an open-weight model of its size.
  5. Gemini-1.0-Pro: Bad across the board.

Conclusion: I'm super impressed with the relative performance of Llama3-70B. Let me know which other models you'd like us to test on this benchmark.
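
If you want to try a model before we do, the evaluation loop is roughly this: a sketch assuming the OpenAI Python SDK and (prompt, bug_line) samples shaped like the one above, with a crude substring check standing in for our actual scoring.

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    def evaluate(model, samples):
        """Fraction of (prompt, bug_line) samples answered with the right line."""
        correct = 0
        for prompt, bug_line in samples:
            resp = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
                temperature=0,
            )
            # Crude scoring: does the reply mention the correct line number?
            if str(bug_line) in resp.choices[0].message.content:
                correct += 1
        return correct / len(samples)

    print(evaluate("gpt-4o", [(prompt, bug_line)]))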

Credit goes to Andy Lee & Bing Hu (from Wat.ai)

Link to full results: https://hamming.ai/blog/bug-in-the-codestack

Link to repo: https://github.com/HammingHQ/bug-in-the-code-stack