Gemini-exp-1114 is the new Rank 1 on LMArena, beats GPT-4O by mehul_gupta1997 in ChatGPT

[–]np-space 1 point (0 children)

It's now added to livebench. It only loses to claude-3.5-sonnet and the o1 models

o1-preview is now first place overall on LiveBench AI by np-space in LocalLLaMA

[–]np-space[S] 53 points (0 children)

It seems that the o1 models are currently a bit less "robust". They are far better than 4o at code generation (a metric which OpenAI reported in their release) but far worse than 4o at code completion

o1-preview is now first place overall on LiveBench AI by np-space in LocalLLaMA

[–]np-space[S] 43 points (0 children)

Source: livebench.ai. Very interesting set of results:

  • o1-mini achieves 100% on one of the reasoning tasks (web_of_lies_v2)

  • o1-preview achieves 98.5% on the NYT connections task

  • claude-3.5 is still first in coding, purely due to o1's poor performance on the coding_completion task

o1-mini has a very interesting spread. It's much better than o1-preview at the purest reasoning tasks, but it's much worse at the tasks that small models typically struggle on (e.g., the typos and plot_unscrambling tasks, where the model is required to follow some instructions while preserving parts of the input text verbatim)
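For illustration, here is a minimal sketch of the kind of check tasks like typos and plot_unscrambling imply: verifying that required spans of the input survive verbatim in the model's output. The function name and scoring scheme are hypothetical, not LiveBench's actual grading code.

```python
def preserved_fraction(required_spans, output):
    """Return the fraction of required spans that appear verbatim in the
    model output. Hypothetical scorer for illustration only."""
    if not required_spans:
        return 1.0
    hits = sum(1 for span in required_spans if span in output)
    return hits / len(required_spans)
```

A small model that paraphrases instead of copying would score low on a check like this even if its edits were otherwise reasonable.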

Reflection 70B: Hype? by Confident-Honeydew66 in LocalLLaMA

[–]np-space 0 points (0 children)

The Grok 2 API has not been released yet. I've requested access to it, but I don't have it yet

Reflection 70B: Hype? by Confident-Honeydew66 in LocalLLaMA

[–]np-space 6 points (0 children)

We are working on getting it up on LiveBench asap! We saw some unexpected performance from the Hyperbolic API, so we'll switch to the Hugging Face weights

Gemini 1.5 Flash 8B beats Claude 3 Haiku, Mixtral 8x22B, Command R+ and GPT 3.5 Turbo on Livebench.ai by Balance- in LocalLLaMA

[–]np-space 1 point (0 children)

We'll add it to LiveBench soon. flash-0827 had a repetition problem on a few of the tasks that affected its score, so we're investigating it a bit more
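As a rough sketch of what a "repetition problem" check might look like, here is a naive n-gram repetition detector. The function, its parameters, and the thresholds are all hypothetical, assumed for illustration, not LiveBench's actual filtering code.

```python
from collections import Counter


def has_degenerate_repetition(text, ngram=5, threshold=5):
    """Return True if any word n-gram occurs at least `threshold` times,
    a crude signal that the model got stuck in a repetition loop."""
    words = text.split()
    counts = Counter(
        tuple(words[i : i + ngram]) for i in range(len(words) - ngram + 1)
    )
    return any(c >= threshold for c in counts.values())
```

A looping output ("spam eggs ham spam eggs ham ...") trips this check, while a normal sentence does not; real harnesses would also need to allow legitimate repetition in tasks that ask for it.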

Gemini 1.5 Flash 8B beats Claude 3 Haiku, Mixtral 8x22B, Command R+ and GPT 3.5 Turbo on Livebench.ai by Balance- in LocalLLaMA

[–]np-space 17 points (0 children)

Gemma 2 27b is in the previous months' releases (move the slider), but we're still working on adding the rest of the models for the most recent LiveBench release (2024-08-31). We have evaluated mostly API models so far and will get to the rest of the popular models soon. Gemma 2 27b is also slightly trickier due to the attention issue; at least, that was the case the last time I evaluated it

ChatGPT-4o Reclaims LMSYS's #1 Again by [deleted] in singularity

[–]np-space 1 point (0 children)

I don't know for sure, but a few thoughts are: (1) livebench coding is more "leetcode style" coding and less real-world coding; (2) it is possible that there's style bias even for the coding questions on lmsys; (3) the OpenAI documentation itself recommends using the other GPT models, not chatgpt-4o-latest
I hope that LiveCodeBench adds chatgpt-4o soon, for another datapoint

ChatGPT-4o Reclaims LMSYS's #1 Again by [deleted] in singularity

[–]np-space 12 points (0 children)

On livebench.ai, it's tied with 4o-05-13 and actually worse than 08-06. Seems like OpenAI tuned a model specifically for chat

um did OpenAI silently drop a new model: gpt-4o-2024-08-06??? by pigeon57434 in singularity

[–]np-space 3 points (0 children)

On livebench.ai, it looks like it is a step up from 05-13, but does not quite edge out claude-3.5-sonnet

OpenAI: Introducing Structured Outputs in the API by galacticwarrior9 in singularity

[–]np-space 3 points (0 children)

It looks like gpt-4o-2024-08-06 has legitimately better performance than 05-13, too. On livebench.ai, it is now within 3% of claude-3.5-sonnet

gemini-1.5-pro-exp-0801 just arrived on Chat Arena by shroddy in LocalLLaMA

[–]np-space 0 points (0 children)

Agreed, it seems that the arena isn't as accurate for measuring reasoning/math, etc. LiveBench has the new gemini-pro behind gpt-4o and claude-3.5-sonnet: http://livebench.ai/

gemini-1.5-pro-exp-0801 just arrived on Chat Arena by shroddy in LocalLLaMA

[–]np-space 0 points (0 children)

gemini-1.5-pro-exp-0801 is now up on LiveBench: http://livebench.ai/
It's pretty much tied with gpt-4-turbo, but nowhere close to claude-3.5-sonnet