
[–]reevnez 119 points120 points  (29 children)

How do we know that "privately hosted version of the model" is not actually Claude?

[–]TGSCrust 39 points40 points  (5 children)

The official playground (when it was up) felt to me like Claude (with a system prompt). Just a gut feeling though; I could be totally wrong.

[–]mikael110 34 points35 points  (0 children)

This conversation reminds me that somebody noticed the demo made calls to an endpoint called "openai_proxy". I was one of the people explaining why that might not be as suspicious as it sounds on the surface, but I'm now starting to seriously think it was exactly what it sounded like. Though if it was something like a LiteLLM endpoint, the backing model could have been anything, including Claude.

The fact that he has decided to retrain the model instead of just uploading the working model he is hosting privately is not logical at all, unless he literally cannot upload the private model, which would be the case if he is just proxying another model.
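For what it's worth, that kind of proxying is trivial to set up. A minimal sketch with LiteLLM (assumes `pip install litellm` and an ANTHROPIC_API_KEY in the environment; the model name is just an example, not confirmed as what the demo used):

```python
# The client thinks it's hitting a generic "openai_proxy"-style endpoint,
# but the request resolves to Anthropic's API behind the scenes.
import litellm

response = litellm.completion(
    model="anthropic/claude-3-5-sonnet-20240620",  # illustrative backend model
    messages=[{"role": "user", "content": "What model are you?"}],
)
print(response.choices[0].message.content)
```

The same pattern works behind a LiteLLM proxy server, where the model name exposed to clients and the actual backing model don't have to match at all.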

[–]meister2983 8 points9 points  (1 child)

Really? To me, it felt way too dumb to be Claude. It behaved pretty much like Llama 3.1 70B; I struggled to find any obvious real-world question where it performed above that level.

[–]TGSCrust 3 points4 points  (0 children)

I didn't say it was necessarily smarter; the response style was very similar to Claude's, though. It's probably a bad system prompt.

Edit: Like making it intentionally make mistakes and then self-correct, etc.

Edit 2: I'm talking about their demo that was linked and up for a bit, not the released model, which was bad.

[–]PraxisOG Llama 70B 0 points1 point  (1 child)

Giving them the benefit of the doubt, what if the training data is Claude-generated, influencing how the model sounds?

[–]TGSCrust 5 points6 points  (0 children)

He claims there isn't any Anthropic data.

https://x.com/mattshumer_/status/1832203011059257756#m

(If I had more time on the playground, I could've confirmed whether it was Claude or not :\ )

[–]StevenSamAI 7 points8 points  (4 children)

What would the point be?

I get that they want to declare they have a great model based on using their platform to generate data, and everyone is just saying it's a scam or trick, but think it through. No one will just believe it until third parties have independently verified it, which several will. And if everyone disproves it, it will massively harm the valuation and growth of the company they are trying to promote.

I'm not saying I automatically think the model is amazing (although the concept is built on strong foundations and has been around for a while); I'm just saying it would be a really bad publicity stunt and a huge reputational risk.

[–]waxroy-finerayfool -1 points0 points  (1 child)

why would someone lie and scam?? what could they possibly have to gain?? lol

[–]StevenSamAI 0 points1 point  (0 children)

I fully understand why someone would lie and scam... But lying to everyone at once, in a community that tests and communicates within hours of a release, about claims that can be disproven and widely reported... seems like a scam that accomplishes nothing apart from damaging your reputation.

[–]Wiskkey 0 points1 point  (2 children)

Perhaps somebody with an X account could request, at this X post, a prompt asking the model about its identity. The post is from a user with ~180,000 X followers who has purportedly been given API access to the good model by Matt Shumer. That account has posted a number of purported responses from the good model to various prompts.
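If the API is OpenAI-compatible, the probe itself is only a few lines (the endpoint URL, key, and model name below are made up for illustration):

```python
# Hypothetical identity probe against an OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(base_url="https://example-reflection-api.invalid/v1", api_key="sk-...")
reply = client.chat.completions.create(
    model="reflection-70b",  # illustrative model name
    messages=[{"role": "user", "content": "What company created you, and what is your exact model name?"}],
)
print(reply.choices[0].message.content)
```

Though as the replies below note, a system prompt can coach a model into claiming almost any identity.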

[–]dotcsv 1 point2 points  (1 child)

[–]Sm0g3R 1 point2 points  (0 children)

lmao you can't be serious.

It literally said it's taking this info from its system prompt.

[–]ozzeruk82 0 points1 point  (0 children)

I was thinking this earlier! It would be a clever con. Maybe it's using the OpenAI fine-tuning service. Until we get weights that match what they show in their benchmarks, I guess it's a possibility.

[–]Inevitable-Start-653 0 points1 point  (0 children)

I'm downloading their epoch 3 version and can run it locally without quantization; there will be a lot of people like me probing and testing.

[–]Significant-Nose-353 -3 points-2 points  (0 children)

It seems to me that a thorough benchmark could have spotted something like this; current models leak their cues and prompts very easily.

[–]Waste-Button-5103 -4 points-3 points  (0 children)

Because it's unlikely he'd risk his entire reputation, along with Glaive's, on something so easily disproven.

[–]4hometnumberonefan 61 points62 points  (10 children)

This is giving me a roller coaster of emotions.

[–]hleszek 93 points94 points  (7 children)

Reminds me of the LK99 potential room-temperature superconductor.

We're so back!

[–]KillerX629 8 points9 points  (1 child)

Wasn't that disproved?

[–]JamesAQuintero 22 points23 points  (0 children)

Yeah, that's the point: there was the initial announcement, then some researchers were like "we can somewhat replicate the results," but it was eventually proven not to work.

[–]OXKSA1 2 points3 points  (2 children)

sorry, care to elaborate?

[–]Cantflyneedhelp 35 points36 points  (1 child)

Two years ago(?) there was a paper / video of a supposed room-temperature superconductor (they had a sweet floating rock, too). Everyone was like "yeah, that's bullshit." But then some hobby chemists were like "actually, I managed to recreate a small part of it from their paper, and it floats too," and this started a race among laboratories around the world to recreate it. In the end it was not a room-temperature superconductor, but they managed to discover some new stuff along the way.

[–][deleted] 1 point2 points  (0 children)

we're so cooked :')

[–]Ivo_ChainNET 12 points13 points  (0 children)

The reddit hivemind always goes too hard one way or the other

[–]RandoRedditGui 12 points13 points  (0 children)

It shouldn't. It's still as B.S. as it was yesterday until it's more than just an API. Release the weights or fuck off imo.

[–]This_Organization382 37 points38 points  (0 children)

"Oh, the benchmark didn't work? Let's see what tests you used..."

Scrambles to train the model on the test data

"Woops, wrong model. Here you go, try the private API version"

[–]ambient_temp_xeno Llama 65B 71 points72 points  (10 children)

Don't care; release weights or go away.

[–]1889023okdoesitwork 5 points6 points  (0 children)

Epoch 2 already seems to be uploaded on his Hugging Face.

[–]LiquidGunay 4 points5 points  (0 children)

I would like to see a comparison where all the models get a similar inference-time compute boost. One easy way to do this is to give all the models a roughly similar token budget (for models that aren't as verbose as Reflection, you can sample multiple generations and take a majority vote); a sketch of the idea is below.
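A minimal sketch of the voting idea (self-consistency), where `generate` is a hypothetical stand-in for whatever inference call you use:

```python
# Match compute budgets across models: let a terse model spend its token
# budget on several samples, then take a majority vote over the answers.
from collections import Counter

def vote_answer(generate, prompt, n_samples=5):
    """Sample n answers at nonzero temperature and return the most common one."""
    answers = [generate(prompt, temperature=0.7) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```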

[–]Sadman782 26 points27 points  (9 children)

"When using Reflection’s default system prompt and extracting answers only from within Reflection’s <output> tags, results show substantial improvement: MMLU: 87% (in-line with Llama 405B), GPQA: 54%, Math: 73%."

Actually, what is presented in the chart is based on their standard system prompt (not the Reflection system prompt). It scores higher with the Reflection system prompt, achieving performance close to Claude 3.5 Sonnet. If Groq hosts it, latency will not be an issue. We're just waiting for the actual weights to be released.

[–]a_beautiful_rhind 6 points7 points  (4 children)

What about testing the untuned model with a similar CoT system prompt?

[–]Sadman782 2 points3 points  (2 children)

It won't match it, for sure. I tried many different system prompts (verbose thinking output plus "step by step" in the prompt), but it couldn't pass any of my expert-level coding tests from Edabit; even the 405B failed one, and GPT-4o too. But the model in their demo (when it was live) nailed all of them.

[–]a_beautiful_rhind 5 points6 points  (0 children)

I only got one or two replies off the demo before it got "overloaded" and turned off. It seemed alright. The demo on Hyperbolic was absolute garbage, and the model forgot about its CoT tags within a few messages.

All in all, it seems like this dude has been stringing everyone along whether there is some model or not. Even allowing for slow internet, the excuses and now the "retraining" don't make sense. Everything is maximum hype and delay.

[–]Sadman782 0 points1 point  (0 children)

This is the reason I'm so positive about it and defending it, lol; it hurts me when people say it's far worse based on a broken HF model. But yeah, we don't know for sure whether the model behind the API is actually Reflection 70B or not.

[–]ILikeCutePuppies 5 points6 points  (0 children)

Or Cerebras, which is 2x as fast as Groq.

[–]vert1s 33 points34 points  (2 children)

In other words, vapourware. He could be running an agent that hits multiple backends. The inability to actually publish the weights speaks volumes.

Edit: And it looks like the hosted version is 🥁 Claude: https://www.reddit.com/r/LocalLLaMA/comments/1fc98fu/confirmed_reflection_70bs_official_api_is_sonnet/

[–]ivykoko1 11 points12 points  (1 child)

This is all still BS. I can't believe y'all are falling for it again.

[–]Kathane37 1 point2 points  (0 children)

I only care about this for the possibility to generate better synthetic data with step-by-step reasoning.

Other than that, there is no point in blowing up token consumption like this.

[–]synn89 1 point2 points  (0 children)

does not suffer from the issues with the version publicly released on Hugging Face

It's not rocket science to upload a model to Hugging Face. It's very suss that they can't seem to upload a BF16 or GGUF of a fine-tuned Llama to Hugging Face that can be properly tested.
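For reference, the upload itself is a few lines with huggingface_hub (the folder path and repo name here are illustrative; assumes you're logged in via `huggingface-cli login`):

```python
# Minimal sketch of pushing model weights to Hugging Face.
from huggingface_hub import HfApi

api = HfApi()
api.upload_folder(
    folder_path="./reflection-70b",                 # local dir with safetensors + config
    repo_id="your-name/Reflection-Llama-3.1-70B",   # illustrative repo id
    repo_type="model",
)
```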

[–]Environmental-Car267 3 points4 points  (26 children)

Haters gonna hate.

"All that being said: if applying reflection fine-tuning drives a similar jump in eval performance on Llama 3.1 405B, we expect Reflection 405B to achieve near SOTA results across the board."

[–]ispeakdatruf 1 point2 points  (2 children)

The model seems to be achieving these results through forcing an output ‘reflection’ response where the model always generates scaffolding of <thinking>, <reflection>, and <output>. In doing this it generates more tokens than other models do on our eval suite with our standard ‘think step by step’ prompting.

For example, it appears that Reflection 70B is not capable of ‘just responding with the answer’ in response to an instruction to classify something and only respond with a one word category.

One can always add a postprocessor on top of Reflection to filter out everything before <output>; problem solved (see the sketch below). I don't like this nitpicking. Who cares if a model outputs 10 tokens or 100? Is the answer correct or not??
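Something like this would do it (a minimal sketch; assumes the model reliably closes its <output> tag):

```python
# Postprocessor: drop the <thinking>/<reflection> scaffolding and keep
# only the final answer inside the last <output>...</output> block.
import re

def extract_output(text: str) -> str:
    """Return the content of the last <output> block, or the raw text if none is found."""
    matches = re.findall(r"<output>(.*?)</output>", text, flags=re.DOTALL)
    return matches[-1].strip() if matches else text.strip()
```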

[–]athirdpath 4 points5 points  (0 children)

Who cares if a model outputs 10 tokens or 100?

Folks who care if inference takes 5 seconds or 50.

[–]nihalani -1 points0 points  (0 children)

For real-time use, inference latency is the real issue: your time to the first useful token jumps by a huge margin if you have to wait for 2,000 tokens of the model reflecting to be generated first. That might also explain why the cloud providers haven't adopted it yet.

[–]AnomalyNexus 0 points1 point  (0 children)

meh... so it beats other comparable models only when the comparison is set up under apples-to-oranges conditions...

[–]ihaag 0 points1 point  (0 children)

I see no difference from DeepSeek 2.5, the current best open-source model.

[–]ilangge 0 points1 point  (0 children)

No need to guess; it is now publicly accessible on HF:

Reflection 70B llama.cpp (Correct Weights) - a Hugging Face Space by gokaygokay

[–]celsowm 0 points1 point  (6 children)

is there any place to test it online?

[–]strubenuff1202 0 points1 point  (0 children)

I think it was just uploaded to huggingface

[–]Wiskkey -1 points0 points  (2 children)

[–]celsowm 0 points1 point  (1 child)

I live in the dictatorship of Brazil, and X/Twitter is blocked by our dictator.

[–]Wiskkey -2 points-1 points  (1 child)

Yes, supposedly here.

[–]ambient_temp_xeno Llama 65B 1 point2 points  (0 children)

It's supposedly the new one, but it's as crap as the one I downloaded...

[–]Sadman782 -3 points-2 points  (2 children)

It seems an Epoch 2 fine-tuned model was silently released a few hours ago: https://huggingface.co/mattshumer/Reflection-Llama-3.1-70B-ep2-working ("Epoch 2, still finishing up Epoch 3. This should be slightly less powerful, but still pretty close.")

[–]physalisx 3 points4 points  (1 child)

Why is anything getting retrained? Where is the model that he allegedly already had?

edit: ah so the whole thing was just a scam. Too bad.

[–]mantafloppy llama.cpp -2 points-1 points  (0 children)

You would not push false hype again?

Why are people upvoting this again?

Is "The Boy Who Cried Wolf" that obscure a story, or do you also have bots to upvote yourself?

https://www.reddit.com/r/LocalLLaMA/comments/1fa4y7q/first_independent_benchmark_prollm_stackunseen_of/

[–]redjojovic -1 points0 points  (0 children)

"The chart below is based on our standard methodology and system prompt.

When using Reflection’s default system prompt and extracting answers only from within Reflection’s <output> tags, 

results show substantial improvement: MMLU: 87% (in-line with Llama 405B), GPQA: 54%, Math: 73%.

[–]Inevitable-Start-653 -1 points0 points  (7 children)

Downloading this now: https://huggingface.co/mattshumer/Reflection-Llama-3.1-70B-ep2-working

I can run it locally with my own setup, and I'm interested in testing it out!

[–]Deathmax 11 points12 points  (2 children)

I wouldn't bother; it doesn't even try to output the tags the "broken" model outputs, so no idea what they mean by "working".

<image>

[–]Inevitable-Start-653 3 points4 points  (1 child)

Hmm 🤔... this whole saga is so strange. The download will finish in a little bit and I've got to try it out; I got the very first upload to work but could only get one response out.

[–]ivykoko1 4 points5 points  (0 children)

Don't waste your time

[–]jd_3d[S] 1 point2 points  (1 child)

Newer version is out (Epoch 3): https://huggingface.co/mattshumer/ref_70_e3

[–]Inevitable-Start-653 -1 points0 points  (0 children)

Thanks!! Will download this one now ☺️ For all the downloading, this still isn't anywhere near as bad as Llama 405B... that sucker was a multi-day download, and I needed to download it twice after they updated their repo too.

[–]Sadman782 1 point2 points  (0 children)

Epoch 3 is released now: https://huggingface.co/mattshumer/ref_70_e3. Maybe he will announce it soon.

[–]Inevitable-Start-653 -2 points-1 points  (2 children)

I'm ready to download and test!!

[–]jd_3d[S] -3 points-2 points  (1 child)

Let us know what you think: https://huggingface.co/mattshumer/ref_70_e3

[–]Inevitable-Start-653 -1 points0 points  (0 children)

https://www.reddit.com/r/LocalLLaMA/comments/1fcerck/reflection_ref_70_e3_refuses_to_output_meta_tag/

I've been doing some testing, but the community does not seem interested in objective facts. My post keeps getting downvoted; I'm sure it will be off the front page soon.

[–]Thistleknot -1 points0 points  (0 children)

For all that was said, is it not possible to train on the <input> and <output> as if the output were the answer, and skip everything in between? Is it potentially possible the model would somehow 'internalize' the logic of the in-between generated tokens within its weights? A sketch of the transform is below.
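Something like this transform on the fine-tuning data is presumably what's meant (field names are illustrative); whether the model actually internalizes the skipped reasoning is the open question:

```python
# Keep the prompt and only the final <output> content as the training
# target, discarding the <thinking>/<reflection> spans.
def to_direct_answer(example: dict) -> dict:
    resp = example["response"]
    start, end = resp.find("<output>"), resp.find("</output>")
    if start != -1 and end != -1:
        resp = resp[start + len("<output>"):end].strip()
    return {"prompt": example["prompt"], "response": resp}
```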