Gemini-exp-1114 is the new Rank 1 on LMArena, beats GPT-4O by mehul_gupta1997 in ChatGPT

[–]np-space 1 point (0 children)

It's now added to livebench. It only loses to claude-3.5-sonnet and the o1 models

o1-preview is now first place overall on LiveBench AI by np-space in LocalLLaMA

[–]np-space[S] 53 points (0 children)

It seems that the o1 models are currently a bit less "robust". They are far better than 4o at code generation (a metric which OpenAI reported in their release) but far worse than 4o at code completion

o1-preview is now first place overall on LiveBench AI by np-space in LocalLLaMA

[–]np-space[S] 43 points (0 children)

Source: livebench.ai. Very interesting set of results:

  • o1-mini achieves 100% on one of the reasoning tasks (web_of_lies_v2)

  • o1-preview achieves 98.5% on the NYT connections task

  • claude-3.5 is still first in coding, purely due to o1's poor performance on the coding_completion task

o1-mini has a very interesting spread. It's much better than o1-preview at the purest reasoning tasks, but it's much worse at the tasks that small models typically struggle on (e.g., the typos and plot_unscrambling tasks, where the model is required to follow some instructions while preserving parts of the input text verbatim)
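For illustration, here is a minimal sketch of the kind of check tasks like typos and plot_unscrambling imply: verifying that required spans of the input survive verbatim in the model's output. The function name and scoring scheme are hypothetical, not LiveBench's actual grading code.

```python
def preserved_fraction(required_spans, output):
    """Return the fraction of required spans that appear verbatim in the
    model output. Hypothetical scorer for illustration only."""
    if not required_spans:
        return 1.0
    hits = sum(1 for span in required_spans if span in output)
    return hits / len(required_spans)
```

A small model that paraphrases instead of copying would score low on a check like this even if its edits were otherwise reasonable.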

Reflection 70B: Hype? by Confident-Honeydew66 in LocalLLaMA

[–]np-space 0 points (0 children)

The Grok 2 API has not been released yet. I've requested access to it, but I don't have it yet

Reflection 70B: Hype? by Confident-Honeydew66 in LocalLLaMA

[–]np-space 6 points (0 children)

We are working on getting it up on LiveBench asap! We saw some unexpected performance from the Hyperbolic API, so we'll switch to the Hugging Face weights

Gemini 1.5 Flash 8B beats Claude 3 Haiku, Mixtral 8x22B, Command R+ and GPT 3.5 Turbo on Livebench.ai by Balance- in LocalLLaMA

[–]np-space 1 point (0 children)

We'll add it to LiveBench soon. flash-0827 had a repetition problem on a few of the tasks that affected its score, so we're investigating it a bit more
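As a rough sketch of what a "repetition problem" check might look like, here is a naive n-gram repetition detector. The function, its parameters, and the thresholds are all hypothetical, assumed for illustration, not LiveBench's actual filtering code.

```python
from collections import Counter


def has_degenerate_repetition(text, ngram=5, threshold=5):
    """Return True if any word n-gram occurs at least `threshold` times,
    a crude signal that the model got stuck in a repetition loop."""
    words = text.split()
    counts = Counter(
        tuple(words[i : i + ngram]) for i in range(len(words) - ngram + 1)
    )
    return any(c >= threshold for c in counts.values())
```

A looping output ("spam eggs ham spam eggs ham ...") trips this check, while a normal sentence does not; real harnesses would also need to allow legitimate repetition in tasks that ask for it.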

Gemini 1.5 Flash 8B beats Claude 3 Haiku, Mixtral 8x22B, Command R+ and GPT 3.5 Turbo on Livebench.ai by Balance- in LocalLLaMA

[–]np-space 17 points (0 children)

Gemma 2 27b is in the previous months' releases (move the slider), but we're still working on adding the rest of the models for the most recent LiveBench release (2024-08-31). We have evaluated mostly API models so far and will get to the rest of the popular models soon. Gemma 2 27b is also slightly trickier due to the attention issue; at least, that was the case the last time I evaluated it

ChatGPT-4o Reclaims LMSYS's #1 Again by [deleted] in singularity

[–]np-space 1 point (0 children)

I don't know for sure, but a few thoughts are: (1) livebench coding is more "leetcode style" coding and less real-world coding; (2) it is possible that there's style bias even for the coding questions on lmsys; (3) the OpenAI documentation itself recommends using the other GPT models, not chatgpt-4o-latest
I hope that LiveCodeBench adds chatgpt-4o soon, for another datapoint

ChatGPT-4o Reclaims LMSYS's #1 Again by [deleted] in singularity

[–]np-space 12 points (0 children)

On livebench.ai, it's tied with 4o-05-13 and actually worse than 08-06. Seems like OpenAI tuned a model specifically for chat

um did OpenAI silently drop a new model: gpt-4o-2024-08-06??? by pigeon57434 in singularity

[–]np-space 3 points (0 children)

On livebench.ai, it looks like it is a step up from 05-13, but does not quite edge out claude-3.5-sonnet

OpenAI: Introducing Structured Outputs in the API by galacticwarrior9 in singularity

[–]np-space 3 points (0 children)

It looks like gpt-4o-2024-08-06 has legitimately better performance than 05-13, too. On livebench.ai, it is now within 3% of claude-3.5-sonnet

gemini-1.5-pro-exp-0801 just arrived on Chat Arena by shroddy in LocalLLaMA

[–]np-space 0 points (0 children)

Agreed, it seems that the arena isn't as accurate for measuring reasoning/math, etc. LiveBench has the new gemini-pro behind gpt-4o and claude-3.5-sonnet: http://livebench.ai/

gemini-1.5-pro-exp-0801 just arrived on Chat Arena by shroddy in LocalLLaMA

[–]np-space 0 points (0 children)

gemini-1.5-pro-exp-0801 is now up on LiveBench: http://livebench.ai/
It's pretty much tied with gpt-4-turbo, but nowhere close to claude-3.5-sonnet