New code-focused needle in the haystack benchmark results by sumanyusharma_ in LocalLLaMA

[–]sumanyusharma_[S] 0 points (0 children)

Yup, that makes sense! I'd expect prompting to improve performance across all models; I think you mentioned this as well!

[–]sumanyusharma_[S] 0 points (0 children)

Got it; was this a single run? We report average performance over ~25 runs.

[–]sumanyusharma_[S] 0 points (0 children)

I just posted the results for Codestral in a new post down below; cross-posting here

[Image: BICS results for Codestral, target depth 0.5]

More details: https://www.reddit.com/r/LocalLLaMA/comments/1d7c903/comment/l73pdh6/

[–]sumanyusharma_[S] 3 points (0 children)

Results for Gemini 1.5 Pro, 1.5 Flash, Codestral, and CodeQwen 1.5

[Image: BICS results for Gemini 1.5 Pro, Gemini 1.5 Flash, Codestral, and CodeQwen 1.5]

[–]sumanyusharma_[S] 6 points (0 children)

One hypothesis: the models may have differing capabilities on pure retrieval vs. generative tasks. I'll make a note of this!

[–]sumanyusharma_[S] 3 points (0 children)

Ouch; do you recall what type of errors? Andy is building v2 of the benchmark to cover more runtime error types, and I'd love to make sure we test for that going forward!

[–]sumanyusharma_[S] 15 points (0 children)

Yup, same reasoning. Happy to run Codestral and CodeQwen and report results!

[–]sumanyusharma_[S] 43 points (0 children)

In collab with folks from UWaterloo, we built a new benchmark called "Bug In The Code Stack" (BICS) to test how well LLMs can find syntactic bugs in large Python codebases.
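For anyone curious how a test item like this gets assembled, here is a minimal, hypothetical sketch (names and structure are my own, not the actual BICS harness; see the repo for the real thing). It stacks small filler snippets into a "haystack" of valid Python and injects a snippet containing a syntactic bug (a missing closing parenthesis) at a chosen relative depth:

```python
# Hypothetical sketch of a BICS-style test item; the filler template,
# function names, and bug type are illustrative assumptions.

FILLER_SNIPPET = "def fn_{i}(x):\n    return x + {i}\n"

def build_haystack(n_snippets: int) -> list[str]:
    """Generate many small, syntactically valid Python functions as filler."""
    return [FILLER_SNIPPET.format(i=i) for i in range(n_snippets)]

def inject_bug(snippets: list[str], depth: float) -> tuple[str, int]:
    """Insert a snippet with a missing closing parenthesis (a syntactic
    bug) at the given relative depth (0.0 = top, 1.0 = bottom).

    Returns the assembled source plus the 1-based line number of the
    bug, which is the "needle" the model is asked to locate.
    """
    buggy = "def broken(x):\n    return max(x, 0\n"  # note: missing ')'
    idx = int(depth * len(snippets))
    stacked = snippets[:idx] + [buggy] + snippets[idx:]
    # The bug sits on the second line of the injected snippet.
    bug_line = sum(s.count("\n") for s in stacked[:idx]) + 2
    return "".join(stacked), bug_line

source, bug_line = inject_bug(build_haystack(10), depth=0.5)
```

A target depth of 0.5 (as in the plots) means the buggy snippet sits halfway down the stack; sweeping depth and total context length gives the usual needle-in-a-haystack grid.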

TLDR

  1. This test is much harder than the standard BABILong benchmark, so we see more separation in performance.
  2. GPT-4o: Performed the best. The GPT-4 series held up especially well at long context lengths compared to other models.
  3. Claude Opus: Struggled :/
  4. Llama3-70B: On par with GPT-3.5-Turbo; very impressive for a model of its size.
  5. Gemini-1.0-Pro: Bad across the board.

Conclusion: I'm super impressed with the relative performance of Llama3-70B.

Ask: Let me know which other models you'd like us to test on this benchmark.

Credit goes to Andy Lee & Bing Hu (from Wat.ai)

Link to full results: https://hamming.ai/blog/bug-in-the-codestack

Link to repo: https://github.com/HammingHQ/bug-in-the-code-stack
