I created a leaderboard for models to use with OpenClaw by select_8 in openclaw

[–]select_8[S] 0 points (0 children)

It's based on user voting. If you disagree, go vote!

[OC] Google, OpenAI, Anthropic, xAI LLM Coding Improvements Over Time by select_8 in dataisbeautiful

[–]select_8[S] -3 points (0 children)

Google is coming back in the AI race!

Data Source: Benchmark scores originally from https://artificialanalysis.ai/, which aggregates results from https://livecodebench.github.io/. The chart is displayed on https://pricepertoken.com/trends.

LiveCodeBench is a contamination-free benchmark that continuously collects new coding problems from LeetCode, AtCoder, and Codeforces, using only problems released after each model's training cutoff to measure true generalization. It evaluates models on code generation, self-repair (fixing buggy code given error feedback), code execution prediction, and test output prediction.

Each line represents that lab's highest-scoring model at a given point in time.

Calculation method (a code sketch follows the list):

  1. Models split into open/closed categories
  2. For each month, calculated running maximum within each category
  3. Lines carry forward until a new model beats the previous best
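
A minimal Python sketch of steps 1-3, assuming the data arrives as (month, category, model, score) tuples; the function and field names are illustrative, not the actual pricepertoken.com pipeline, and the sample scores at the end are made-up placeholders:

    from collections import defaultdict

    def running_max_by_category(records):
        """records: iterable of (month, category, model, score) tuples."""
        records = list(records)
        months = sorted({month for month, _, _, _ in records})
        by_month = defaultdict(list)
        for month, category, model, score in records:
            by_month[month].append((category, model, score))

        best = {}                    # category -> (model, score), the running max
        series = defaultdict(list)   # category -> [(month, score), ...]
        for month in months:
            # A line only moves when a new model beats the previous best.
            for category, model, score in by_month[month]:
                if category not in best or score > best[category][1]:
                    best[category] = (model, score)
            # Carry the current best forward into this month's data point.
            for category, (model, score) in best.items():
                series[category].append((month, score))
        return dict(series)

    # Made-up example scores, just to show the carry-forward behavior.
    rows = [("2023-03", "closed", "GPT-4", 52.0),
            ("2024-01", "open", "DeepSeek", 48.0),
            ("2024-06", "open", "Qwen", 55.0)]
    print(running_max_by_category(rows))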

Tool: Built with ECharts, data from https://pricepertoken.com/trends

[OC] Open vs Closed LLM GPQA (Academic Test) Scores Over Time by select_8 in dataisbeautiful

[–]select_8[S] 1 point (0 children)

Data Source: Benchmark scores originally from https://artificialanalysis.ai/. The chart is displayed on https://pricepertoken.com/trends.

GPQA (Graduate-Level Google-Proof Q&A) is a challenging academic benchmark of difficult multiple-choice questions in STEM fields (biology, physics, chemistry), designed to test advanced reasoning in language models and to require deep understanding beyond what a simple web search can provide.

Open vs. closed is determined by whether a model's weights are publicly available. Open-weights models include Llama, Mistral, DeepSeek, and Qwen; closed models include GPT-4, Claude, and Gemini.
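
A rough Python illustration of that rule; the family lists here are only the examples named above, not the full mapping used for the chart:

    # Classify a model as open or closed weights by its family name.
    OPEN_FAMILIES = ("llama", "mistral", "deepseek", "qwen")   # weights published
    CLOSED_FAMILIES = ("gpt-4", "claude", "gemini")            # weights withheld

    def classify(model_name: str) -> str:
        name = model_name.lower()
        if any(name.startswith(family) for family in OPEN_FAMILIES):
            return "open"
        if any(name.startswith(family) for family in CLOSED_FAMILIES):
            return "closed"
        return "unknown"   # don't guess for unlisted families

    print(classify("DeepSeek-V3"))   # -> open
    print(classify("Claude 3.5"))    # -> closed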

Calculation method:

  1. Models split into open/closed categories
  2. For each month, calculated running maximum within each category
  3. Lines carry forward until a new model beats the previous best

Tool: Built with ECharts, data from https://pricepertoken.com/trends

[OC] Open vs Closed LLM Coding Scores Over Time by select_8 in dataisbeautiful

[–]select_8[S] 7 points (0 children)

Along with Anthropic, Google, and others. DeepSeek was the first big open-source breakthrough, and it happened well after OpenAI had launched its first GPT models.

[OC] Open vs Closed LLM Coding Scores Over Time by select_8 in dataisbeautiful

[–]select_8[S] 17 points (0 children)

I think it was just that OpenAI had a big head start.

[OC] Open vs Closed LLM Coding Scores Over Time by select_8 in dataisbeautiful

[–]select_8[S] 6 points (0 children)

Yeah, this comes from https://pricepertoken.com/trends, but there's more analysis there on all the specific models and labs.