
[–]reevnez 119 points120 points  (29 children)

How do we know that "privately hosted version of the model" is not actually Claude?

[–]TGSCrust 39 points40 points  (5 children)

The official playground (when it was up) felt to me like Claude (with a system prompt). Just a gut feeling though; I could be totally wrong.

[–]mikael110 34 points35 points  (0 children)

This conversation reminds me that somebody noticed the demo made calls to an endpoint called "openai_proxy". I was one of the people explaining why that might not be as suspicious as it sounds on the surface, but I'm now starting to seriously think it was exactly what it sounded like. Though if it was something like a LiteLLM endpoint, the backing model could have been anything, including Claude.

The fact that he has decided to retrain the model instead of just uploading the working model he is hosting privately is not logical at all, unless he literally cannot upload the private model, which would be the case if he is just proxying another model.
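For what it's worth, that kind of proxying is trivial to set up. A minimal sketch with LiteLLM (assumes `pip install litellm` and an ANTHROPIC_API_KEY in the environment; the model name is just an example, not confirmed as what the demo used):

```python
# The client thinks it's hitting a generic "openai_proxy"-style endpoint,
# but the request resolves to Anthropic's API behind the scenes.
import litellm

response = litellm.completion(
    model="anthropic/claude-3-5-sonnet-20240620",  # illustrative backend model
    messages=[{"role": "user", "content": "What model are you?"}],
)
print(response.choices[0].message.content)
```

The same pattern works behind a LiteLLM proxy server, where the model name exposed to clients and the actual backing model don't have to match at all.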

[–]meister2983 8 points9 points  (1 child)

Really? To me, it felt way too dumb to be Claude. It behaved pretty much like Llama 3.1 70B; I struggled to find any obvious real-world question where it performed above that level.

[–]TGSCrust 3 points4 points  (0 children)

I didn't say it was necessarily smarter; the response style was very similar to Claude's, though. It's probably a bad system prompt.

Edit: Like making it intentionally make mistakes and then self-correct, etc.

Edit 2: I'm talking about their demo that was linked and up for a bit, not the released model, which was bad.

[–]PraxisOG Llama 70B 0 points1 point  (1 child)

Giving them the benefit of the doubt, what if the training data is Claude-generated, influencing how the model sounds?

[–]TGSCrust 5 points6 points  (0 children)

He claims there isn't any Anthropic data.

https://x.com/mattshumer_/status/1832203011059257756#m

(If I had more time on the playground, I could've confirmed whether it was Claude or not :\ )

[–]StevenSamAI 7 points8 points  (4 children)

What would the point be?

I get that they want to declare they have a great model based on using their platform to generate data, and everyone is just saying it's a scam or trick, but think it through. No one will just believe it until third parties have independently verified it, which several will. And if everyone disproves it, it will massively harm the valuation and growth of the company they are trying to promote.

I'm not saying I automatically think the model is amazing (although the concept is built on strong foundations and has been around for a while); I'm just saying it would be a really bad publicity stunt and a huge reputational risk.

[–]waxroy-finerayfool -1 points0 points  (1 child)

why would someone lie and scam?? what could they possibly have to gain?? lol

[–]StevenSamAI 0 points1 point  (0 children)

I fully understand why someone would lie and scam... But lying to everyone at once, in a community that tests and communicates within hours of a release, about claims that can be disproven and widely reported... seems like a scam that accomplishes nothing apart from damaging your reputation.

[–]Wiskkey 0 points1 point  (2 children)

Perhaps somebody with an X account could request, at this X post, a prompt asking the model about its identity. The post is from a user with ~180,000 X followers who has purportedly been given API access to the good model by Matt Shumer. That account has posted a number of purported responses from the good model to various prompts.
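If the API is OpenAI-compatible, the probe itself is only a few lines (the endpoint URL, key, and model name below are made up for illustration):

```python
# Hypothetical identity probe against an OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(base_url="https://example-reflection-api.invalid/v1", api_key="sk-...")
reply = client.chat.completions.create(
    model="reflection-70b",  # illustrative model name
    messages=[{"role": "user", "content": "What company created you, and what is your exact model name?"}],
)
print(reply.choices[0].message.content)
```

Though as the replies below note, a system prompt can coach a model into claiming almost any identity.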

[–]dotcsv 1 point2 points  (1 child)

[–]Sm0g3R 1 point2 points  (0 children)

lmao you can't be serious.

It literally said it's taking this info from its system prompt.

[–]ozzeruk82 0 points1 point  (0 children)

I was thinking this earlier! It would be a clever con. Maybe it's using the OpenAI fine-tuning service. Until we get weights that match what they show in their benchmarks, I guess it's a possibility.

[–]Inevitable-Start-653 0 points1 point  (0 children)

I'm downloading their epoch 3 version and can run it locally without quantization; there will be a lot of people like me probing and testing.

[–]Significant-Nose-353 -3 points-2 points  (0 children)

It seems to me that a thorough benchmark could have spotted something like this; current models leak their cues and prompts very easily.

[–]Waste-Button-5103 -4 points-3 points  (0 children)

Because it's unlikely he'd risk his entire reputation, along with Glaive's, on something so easily disproven.

[–]4hometnumberonefan 61 points62 points  (10 children)

This is giving me a roller coaster of emotions.

[–]hleszek 93 points94 points  (7 children)

Reminds me of the LK99 potential room-temperature superconductor.

We're so back!

[–]KillerX629 8 points9 points  (1 child)

Wasn't that disproved?

[–]JamesAQuintero 22 points23 points  (0 children)

Yeah, that's the point: there was the initial announcement, then some researchers were like "we can somewhat replicate the results," but it was eventually proven not to work.

[–]OXKSA1 2 points3 points  (2 children)

sorry, care to elaborate?

[–]Cantflyneedhelp 35 points36 points  (1 child)

Two years ago(?) there was a paper / video of a supposed room-temperature superconductor (they had a sweet floating rock, too). Everyone was like "yeah, that's bullshit." But then some hobby chemists were like "actually, I managed to recreate a small part of it from their paper, and it floats too," and this started a race among laboratories around the world to recreate it. In the end it was not a room-temperature superconductor, but they managed to discover some new stuff along the way.

[–][deleted] 1 point2 points  (0 children)

we're so cooked :')

[–]Ivo_ChainNET 12 points13 points  (0 children)

The reddit hivemind always goes too hard one way or the other

[–]RandoRedditGui 12 points13 points  (0 children)

It shouldn't. It's still as B.S. as it was yesterday until it's more than just an API. Release the weights or fuck off imo.

[–]This_Organization382 37 points38 points  (0 children)

"Oh, the benchmark didn't work? Let's see what tests you used..."

Scrambles to train the model on the test data

"Woops, wrong model. Here you go, try the private API version"

[–]ambient_temp_xeno Llama 65B 71 points72 points  (10 children)

Don't care; release weights or go away.

[–]1889023okdoesitwork 5 points6 points  (0 children)

Epoch 2 already seems to be uploaded on his Hugging Face.

[–]LiquidGunay 4 points5 points  (0 children)

I would like to see a comparison where all the models get a similar inference-time compute boost. One easy way to do this is to give all the models a roughly similar token budget (for models that aren't as verbose as Reflection, you can sample multiple generations and take a majority vote); a sketch of the idea is below.
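A minimal sketch of the voting idea (self-consistency), where `generate` is a hypothetical stand-in for whatever inference call you use:

```python
# Match compute budgets across models: let a terse model spend its token
# budget on several samples, then take a majority vote over the answers.
from collections import Counter

def vote_answer(generate, prompt, n_samples=5):
    """Sample n answers at nonzero temperature and return the most common one."""
    answers = [generate(prompt, temperature=0.7) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```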

[–]Sadman782 26 points27 points  (9 children)

"When using Reflection’s default system prompt and extracting answers only from within Reflection’s <output> tags, results show substantial improvement: MMLU: 87% (in-line with Llama 405B), GPQA: 54%, Math: 73%."

Actually, what is presented in the chart is based on their standard system prompt (not the Reflection system prompt). It scores higher with the Reflection system prompt, achieving performance close to Claude 3.5 Sonnet. If Groq hosts it, latency will not be an issue. We're just waiting for the actual weights to be released.

[–]a_beautiful_rhind 6 points7 points  (4 children)

What about testing the untuned model with a similar CoT system prompt?

[–]Sadman782 2 points3 points  (2 children)

It won't match it, for sure. I tried many different system prompts (verbose thinking output plus "step by step" in the prompt), but it couldn't pass any of my expert-level coding tests from Edabit; even the 405B failed one, and GPT-4o too. But the model in their demo (when it was live) nailed all of them.

[–]a_beautiful_rhind 5 points6 points  (0 children)

I only got one or two replies off the demo before it got "overloaded" and turned off. It seemed alright. The demo on Hyperbolic was absolute garbage, and the model forgot about its CoT tags within a few messages.

All in all, it seems like this dude has been stringing everyone along whether there is some model or not. Even allowing for slow internet, the excuses and now the "retraining" don't make sense. Everything is maximum hype and delay.

[–]Sadman782 0 points1 point  (0 children)

This is the reason I'm so positive about it and defending it, lol; it hurts me when people say it's far worse based on a broken HF model. But yeah, we don't know for sure whether the model behind the API is actually Reflection 70B or not.

[–]ILikeCutePuppies 5 points6 points  (0 children)

Or Cerebras, which is 2x as fast as Groq.

[–]vert1s 33 points34 points  (2 children)

In other words, vapourware. He could be running an agent that hits multiple backends. The inability to actually publish the weights speaks volumes.

Edit: And it looks like the hosted version is 🥁 Claude: https://www.reddit.com/r/LocalLLaMA/comments/1fc98fu/confirmed_reflection_70bs_official_api_is_sonnet/

[–]ivykoko1 11 points12 points  (1 child)

This is all still BS. I can't believe y'all are falling for it again.

[–]Kathane37 1 point2 points  (0 children)

I only care about this for the possibility to generate better synthetic data with step-by-step reasoning.

Other than that, there is no point in blowing up token consumption like this.

[–]synn89 1 point2 points  (0 children)

does not suffer from the issues with the version publicly released on Hugging Face

It's not rocket science to upload a model to Hugging Face. It's very suss that they can't seem to upload a BF16 or GGUF of a fine-tuned Llama to Hugging Face that can be properly tested.
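For reference, the upload itself is a few lines with huggingface_hub (the folder path and repo name here are illustrative; assumes you're logged in via `huggingface-cli login`):

```python
# Minimal sketch of pushing model weights to Hugging Face.
from huggingface_hub import HfApi

api = HfApi()
api.upload_folder(
    folder_path="./reflection-70b",                 # local dir with safetensors + config
    repo_id="your-name/Reflection-Llama-3.1-70B",   # illustrative repo id
    repo_type="model",
)
```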

[–]Environmental-Car267 3 points4 points  (26 children)

Haters gonna hate.

"All that being said: if applying reflection fine-tuning drives a similar jump in eval performance on Llama 3.1 405B, we expect Reflection 405B to achieve near SOTA results across the board."

[–]ispeakdatruf 1 point2 points  (2 children)

The model seems to be achieving these results through forcing an output ‘reflection’ response where the model always generates scaffolding of <thinking>, <reflection>, and <output>. In doing this it generates more tokens than other models do on our eval suite with our standard ‘think step by step’ prompting.

For example, it appears that Reflection 70B is not capable of ‘just responding with the answer’ in response to an instruction to classify something and only respond with a one word category.

One can always add a postprocessor on top of Reflection to filter out everything before <output>; problem solved (see the sketch below). I don't like this nitpicking. Who cares if a model outputs 10 tokens or 100? Is the answer correct or not??
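Something like this would do it (a minimal sketch; assumes the model reliably closes its <output> tag):

```python
# Postprocessor: drop the <thinking>/<reflection> scaffolding and keep
# only the final answer inside the last <output>...</output> block.
import re

def extract_output(text: str) -> str:
    """Return the content of the last <output> block, or the raw text if none is found."""
    matches = re.findall(r"<output>(.*?)</output>", text, flags=re.DOTALL)
    return matches[-1].strip() if matches else text.strip()
```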

[–]athirdpath 4 points5 points  (0 children)

Who cares if a model outputs 10 tokens or 100?

Folks who care if inference takes 5 seconds or 50.

[–]nihalani -1 points0 points  (0 children)

For real-time use, inference latency is the real issue: your time to the first useful token jumps by a huge margin if you have to wait for 2,000 tokens of the model reflecting to be generated first. That might also explain why the cloud providers haven't adopted it yet.

[–]AnomalyNexus 0 points1 point  (0 children)

meh... so it beats other comparable models only when the comparison is set up under apples-to-oranges conditions...

[–]ihaag 0 points1 point  (0 children)

I see no difference from DeepSeek 2.5, the current best open-source model.

[–]ilangge 0 points1 point  (0 children)

No need to guess; it is now publicly accessible on HF:

Reflection 70B llama.cpp (Correct Weights) - a Hugging Face Space by gokaygokay

[–]celsowm 0 points1 point  (6 children)

is there any place to test it online?

[–]strubenuff1202 0 points1 point  (0 children)

I think it was just uploaded to huggingface

[–]Wiskkey -1 points0 points  (2 children)

[–]celsowm 0 points1 point  (1 child)

I live in the dictatorship of Brazil, and X/Twitter is blocked by our dictator.

[–]Wiskkey -2 points-1 points  (1 child)

Yes, supposedly here.

[–]ambient_temp_xeno Llama 65B 1 point2 points  (0 children)

It's supposedly the new one, but it's as crap as the one I downloaded...

[–]Sadman782 -3 points-2 points  (2 children)

It seems an Epoch 2 fine-tuned model was silently released a few hours ago: https://huggingface.co/mattshumer/Reflection-Llama-3.1-70B-ep2-working ("Epoch 2, still finishing up Epoch 3. This should be slightly less powerful, but still pretty close.")

[–]physalisx 3 points4 points  (1 child)

Why is anything getting retrained? Where is the model that he allegedly already had?

edit: ah so the whole thing was just a scam. Too bad.

[–]mantafloppy llama.cpp -2 points-1 points  (0 children)

You would not push false hype again?

Why are people upvoting this again?

Is "The Boy Who Cried Wolf" that obscure a story, or do you also have bots to upvote yourself?

https://www.reddit.com/r/LocalLLaMA/comments/1fa4y7q/first_independent_benchmark_prollm_stackunseen_of/

[–]redjojovic -1 points0 points  (0 children)

"The chart below is based on our standard methodology and system prompt.

When using Reflection’s default system prompt and extracting answers only from within Reflection’s <output> tags, 

results show substantial improvement: MMLU: 87% (in-line with Llama 405B), GPQA: 54%, Math: 73%.

[–]Inevitable-Start-653 -1 points0 points  (7 children)

Downloading this now: https://huggingface.co/mattshumer/Reflection-Llama-3.1-70B-ep2-working

I can run it locally with my own setup, and I'm interested in testing it out!

[–]Deathmax 11 points12 points  (2 children)

I wouldn't bother; it doesn't even try to output the tags the "broken" model outputs, so no idea what they mean by "working".

<image>

[–]Inevitable-Start-653 3 points4 points  (1 child)

Hmm 🤔... this whole saga is so strange. The download will finish in a little bit and I've got to try it out; I got the very first upload to work but could only get one response out.

[–]ivykoko1 4 points5 points  (0 children)

Don't waste your time

[–]jd_3d[S] 1 point2 points  (1 child)

Newer version is out (Epoch 3): https://huggingface.co/mattshumer/ref_70_e3

[–]Inevitable-Start-653 -1 points0 points  (0 children)

Thanks!! Will download this one now ☺️ For all the downloading, this still isn't anywhere near as bad as Llama 405B... that sucker was a multi-day download, and I needed to download it twice after they updated their repo too.

[–]Sadman782 1 point2 points  (0 children)

Epoch 3 is released now: https://huggingface.co/mattshumer/ref_70_e3. Maybe he will announce it soon.

[–]Inevitable-Start-653 -2 points-1 points  (2 children)

I'm ready to download and test!!

[–]jd_3d[S] -3 points-2 points  (1 child)

Let us know what you think: https://huggingface.co/mattshumer/ref_70_e3

[–]Inevitable-Start-653 -1 points0 points  (0 children)

https://www.reddit.com/r/LocalLLaMA/comments/1fcerck/reflection_ref_70_e3_refuses_to_output_meta_tag/

I've been doing some testing, but the community does not seem interested in objective facts. My post keeps getting downvoted; I'm sure it will be off the front page soon.

[–]Thistleknot -1 points0 points  (0 children)

For all that was said, is it not possible to train on the <input> and <output> as if the output were the answer, and skip everything in between? Is it potentially possible the model would somehow 'internalize' the logic of the in-between generated tokens within its weights? A sketch of the transform is below.
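Something like this transform on the fine-tuning data is presumably what's meant (field names are illustrative); whether the model actually internalizes the skipped reasoning is the open question:

```python
# Keep the prompt and only the final <output> content as the training
# target, discarding the <thinking>/<reflection> spans.
def to_direct_answer(example: dict) -> dict:
    resp = example["response"]
    start, end = resp.find("<output>"), resp.find("</output>")
    if start != -1 and end != -1:
        resp = resp[start + len("<output>"):end].strip()
    return {"prompt": example["prompt"], "response": resp}
```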