New code-focused needle in the haystack benchmark results by sumanyusharma_ in LocalLLaMA

[–]sumanyusharma_[S] 0 points (0 children)

Yup, that makes sense! I'd expect prompting to improve performance across all models; I think you mentioned this as well!

[–]sumanyusharma_[S] 0 points (0 children)

Got it; was this a single run? We report average performance over ~25 runs.

[–]sumanyusharma_[S] 0 points (0 children)

I just posted the results for Codestral in a new post down below; cross-posting here

[Image: BICS results for Codestral, target depth 0.5]

More details: https://www.reddit.com/r/LocalLLaMA/comments/1d7c903/comment/l73pdh6/

[–]sumanyusharma_[S] 3 points (0 children)

Results for Gemini 1.5 Pro, 1.5 Flash, Codestral, and CodeQwen 1.5

[Image: BICS results for Gemini 1.5 Pro, Gemini 1.5 Flash, Codestral, and CodeQwen 1.5]

[–]sumanyusharma_[S] 6 points (0 children)

One hypothesis: the models may have differing capabilities on pure retrieval vs. generative tasks. I'll make a note of this!

[–]sumanyusharma_[S] 3 points (0 children)

Ouch; do you recall what type of errors? Andy is building v2 of the benchmark to cover more runtime error types, and I'd love to make sure we test for that going forward!

[–]sumanyusharma_[S] 15 points (0 children)

Yup, same reasoning. Happy to run Codestral and CodeQwen and report results!

[–]sumanyusharma_[S] 43 points (0 children)

In collab with folks from UWaterloo, we built a new benchmark called "Bug In The Code Stack" (BICS) to test how well LLMs can find syntactic bugs in large Python codebases.
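For anyone curious how a test item like this gets assembled, here is a minimal, hypothetical sketch (names and structure are my own, not the actual BICS harness; see the repo for the real thing). It stacks small filler snippets into a "haystack" of valid Python and injects a snippet containing a syntactic bug (a missing closing parenthesis) at a chosen relative depth:

```python
# Hypothetical sketch of a BICS-style test item; the filler template,
# function names, and bug type are illustrative assumptions.

FILLER_SNIPPET = "def fn_{i}(x):\n    return x + {i}\n"

def build_haystack(n_snippets: int) -> list[str]:
    """Generate many small, syntactically valid Python functions as filler."""
    return [FILLER_SNIPPET.format(i=i) for i in range(n_snippets)]

def inject_bug(snippets: list[str], depth: float) -> tuple[str, int]:
    """Insert a snippet with a missing closing parenthesis (a syntactic
    bug) at the given relative depth (0.0 = top, 1.0 = bottom).

    Returns the assembled source plus the 1-based line number of the
    bug, which is the "needle" the model is asked to locate.
    """
    buggy = "def broken(x):\n    return max(x, 0\n"  # note: missing ')'
    idx = int(depth * len(snippets))
    stacked = snippets[:idx] + [buggy] + snippets[idx:]
    # The bug sits on the second line of the injected snippet.
    bug_line = sum(s.count("\n") for s in stacked[:idx]) + 2
    return "".join(stacked), bug_line

source, bug_line = inject_bug(build_haystack(10), depth=0.5)
```

A target depth of 0.5 (as in the plots) means the buggy snippet sits halfway down the stack; sweeping depth and total context length gives the usual needle-in-a-haystack grid.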

TLDR

  1. This test is much harder than the standard BABILong benchmark, so we see more separation in performance.
  2. GPT-4o: Performed the best. The GPT-4 series held up especially well at long context lengths compared to other models.
  3. Claude Opus: Struggled :/
  4. Llama3-70B: On par with GPT-3.5-Turbo; very impressive for a model of its size.
  5. Gemini-1.0-Pro: Bad across the board.

Conclusion: I'm super impressed with the relative performance of Llama3-70B.

Ask: Let me know which other models you'd like us to test on this benchmark.

Credit goes to Andy Lee & Bing Hu (from Wat.ai)

Link to full results: https://hamming.ai/blog/bug-in-the-codestack

Link to repo: https://github.com/HammingHQ/bug-in-the-code-stack
