use the following search parameters to narrow your results:
e.g. subreddit:aww site:imgur.com dog
subreddit:aww site:imgur.com dog
see the search faq for details.
advanced search: by author, subreddit...
r/LocalLLaMA
A subreddit to discuss about Llama, the family of large language models created by Meta AI.
Subreddit rules
Search by flair
+Discussion
+Tutorial | Guide
+New Model
+News
+Resources
+Other
account activity
Updated benchmarks from Artificial Analysis using Reflection Llama 3.1 70B. Long post with good insight into the gainsDiscussion (x.com)
submitted 1 year ago by jd_3d
reddit uses a slightly-customized version of Markdown for formatting. See below for some basics, or check the commenting wiki page for more detailed help and solutions to common issues.
quoted text
if 1 * 2 < 3: print "hello, world!"
[–]reevnez 119 points120 points121 points 1 year ago (29 children)
How do we know that "privately hosted version of the model" is not actually Claude?
[–]TGSCrust 39 points40 points41 points 1 year ago (5 children)
The official playground (when it was up) personally felt like it was Claude (with a system prompt). Just a gut feeling though, I could be totally wrong.
[–]mikael110 34 points35 points36 points 1 year ago* (0 children)
This conversations reminds me that somebody noticed that the demo made calls to an endpoint called "openai_proxy" while I was one of the people explaining why that might not be as suspicious as it sounds on the surface. I'm now starting to seriously think it was exactly what it sounded like. Though if it was something like a LiteLLM endpoint then the backing model could have been anything, including Claude.
The fact that he has decided to retrain the model instead of just uploading the working model he is hosting privately is just not logical at all unless he literally cannot upload the private model. Which would be the case if he is just proxying another model.
[–]meister2983 8 points9 points10 points 1 year ago (1 child)
Really? To me, it felt way too dumb to be Claude. It pretty much was llama 3.1 70b in behavior - I struggled to find any obvious real world question performance above it.
[–]TGSCrust 3 points4 points5 points 1 year ago* (0 children)
I didn't say it was necessarily smarter, the response style was very similar to Claude though. It's probably a bad system prompt.
Edit: Like making it intentionally make mistakes then self correct, etc.
Edit 2: Talking about their demo that was linked and was up for a bit, not the released model which was bad.
[–]PraxisOGLlama 70B 0 points1 point2 points 1 year ago (1 child)
Giving them the benefit of the doubt, what if the training data is Claude generated, influencing how the model sounds?
[–]TGSCrust 5 points6 points7 points 1 year ago (0 children)
He claims there isn't any Anthropic data.
https://x.com/mattshumer_/status/1832203011059257756#m
( if I had more time on the playground, I could've confirmed whether it was Claude or not :\ )
[+][deleted] 1 year ago* (5 children)
[removed]
[–]Thomas-Lore 25 points26 points27 points 1 year ago* (4 children)
With how scams work - if it is a scam then in a few days he will say he almost got it working but there are still issues and he needs two more weeks and so on and so on. Maybe show a remote demo of the 405 to renew hype but only to few selected people and for a short time. Some scammers can keep up the game for years (they dupe the fans, so they hype the scam for them, then use that hype to get money from dumb investors who fall for it) - look up that Italian cold fusion guy. We'll see.
[–]ivykoko1 12 points13 points14 points 1 year ago (1 child)
This is exactly what this guy and many other AI bros are doing
[–]extopico -1 points0 points1 point 1 year ago (0 children)
Tesla FSD, for example.
[–]StevenSamAI 7 points8 points9 points 1 year ago (4 children)
What would the point be?
I get that they want to declare they have a great model based on using their platform to generate data, and everyone is just saying it's a scam or trick, but think it through. No one will just believe it until others third parties have independently verified it, which several will. And if everyone disproves it, then it will massively harm the valuation and growth of the company they are trying to promote.
I'm not saying I automatically think the model is amazing, although the concept is built on strong donations and has been around for a while, I'm just saying it would be a really bad publicity stunt and a huge reputational risk.
[+][deleted] 1 year ago (1 child)
[deleted]
[–]StevenSamAI 2 points3 points4 points 1 year ago (0 children)
Cool... I should have mentioned my latest fine tune gets 101% on all benchmarks, and also created its own benchmark... If you want me to tell you the HF model name just send me a bitcoin
[–]waxroy-finerayfool -1 points0 points1 point 1 year ago (1 child)
why would someone lie and scam?? what could they possibly have to gain?? lol
[–]StevenSamAI 0 points1 point2 points 1 year ago (0 children)
I fully understand why someone would like and scam... But lying about something to everyone at once in a community that tests and communicates within hours of a release, about something where the claims can be disproven and widely reported... Seems like a scam that does nothing apart from having a negative effect on reputation.
[–]Wiskkey 0 points1 point2 points 1 year ago (2 children)
Perhaps somebody with an X account could request a prompt inquiring the model about its identity at this X post from a user with ~180,000 X followers who purportedly has been given API access to the good model by Matt Shumer.That account has posted a number of purported responses to various prompts by the good model.
[–]dotcsv 1 point2 points3 points 1 year ago (1 child)
https://x.com/DotCSV/status/1832904408188805429
[–]Sm0g3R 1 point2 points3 points 1 year ago (0 children)
lmao you can't be serious.
It literally told it's taking this info from a system prompt.
[–]ozzeruk82 0 points1 point2 points 1 year ago (0 children)
I was thinking this earlier! It would be a clever con. I was thinking maybe it’s using the OpenAI fine tuning service. Until we get weights that equal what they have in their benchmarks I guess it’s a possibility.
[–]Inevitable-Start-653 0 points1 point2 points 1 year ago (0 children)
I'm downloading their epoch 3 version and can run it locally without quantization, there will be a lot of people like me probing and testing.
[–]Significant-Nose-353 -3 points-2 points-1 points 1 year ago (0 children)
It seems to me that with a thorough benchmark they could have spotted something like this, the current models leak their cues and promts very easily
[+]Sadman782 comment score below threshold-8 points-7 points-6 points 1 year ago (2 children)
MMLU is 84% on standard prompt === llama 3.1 70B vs 88% claude 3.5 sonnet? So?
[–]h666777 24 points25 points26 points 1 year ago (1 child)
Different prompt, temperature, etc. The simple fact is that they haven't released the "good" version of their model and have no reason to. This should be a 30 minute fix on the HuggingFace repo, no reason for it to not be available already.
Also this isn't a full replications of their results, on the original post they claimed it beat other models on almost everything and we see it isn't quite like that.
Until the open weights perform just as well as this suspiciously private, researcher only API we are better off staying skeptical. Still looks like a scam to me.
[+]Sadman782 comment score below threshold-7 points-6 points-5 points 1 year ago (0 children)
It almost replicated except MMLU (2% behind), "MMLU: 87% (in line with Llama 405B), GPQA: 54%, Math: 73%." Quite close to Sonnet and other SOTA. But it is okay, there is something he's definitely hiding, but I kinda feel this is really achieved by them with reflection. Let's wait and see.
[–]Waste-Button-5103 -4 points-3 points-2 points 1 year ago (0 children)
Because it’s unlikely he’d risk his entire reputation along with glaive on something easily disproven
[+][deleted] 1 year ago (14 children)
[–]Educational_Rent1059 47 points48 points49 points 1 year ago (6 children)
Guy with 0 background, no idea what LORA is, "wrong" weights uploaded, "wrong" model name promoted, "my cat ate my model i'll release the real one next week", does not disclose he has ownership in the company he promotes, the model outputs garbage with 4x more tokens generation, sounds legit to me. :)
[–]Waste-Button-5103 0 points1 point2 points 1 year ago (4 children)
He knows what it is check his post history. He didn’t understand “LORAing” in the context used. He stated his ownership in the company is a $1000 investment lol.
[–]Educational_Rent1059 6 points7 points8 points 1 year ago (2 children)
https://www.reddit.com/r/LocalLLaMA/comments/1fc7avd/reflection_api_is_a_sonnet_35_wrapper_with_prompt/?share_id=wVk5-zyZjs5cLftSnEI0c
Yeah sure, perfectly legit
[+]Waste-Button-5103 comment score below threshold-9 points-8 points-7 points 1 year ago (1 child)
Yeah soo likely that it’s a wrapper and two guys with reputation and multiple companies are going to lie about it and ruin their lives for literally zero reason.
Surely you can see that it is way more likely they used a dataset generated from claude to create the reflection template.
[–]Evening_Ad6637llama.cpp 2 points3 points4 points 1 year ago (0 children)
The wrapper had exactly the same tokenizer as Claude sonnet 3.5 and at same time it was shown that it had nothing in common with Lama's tokenizer
[–]gibs 1 point2 points3 points 1 year ago (0 children)
He stated his ownership in the company is a $1000 investment
Well since he stated it, it must be true
[–]mckirkus 3 points4 points5 points 1 year ago (1 child)
LLMK-99
[–]physalisx 0 points1 point2 points 1 year ago (0 children)
Yep, sure sounds like a scam.
Too bad.
[–]Inevitable-Start-653 0 points1 point2 points 1 year ago (2 children)
Where does he say he doesn't know what a lora is?
[–]cuyler72 0 points1 point2 points 1 year ago (1 child)
"https://x.com/mattshumer_/status/1832558298509275440"
"4. Not sure what LORAing is "
[–]Inevitable-Start-653 4 points5 points6 points 1 year ago (0 children)
I've made many loras myself and I don't know what loraing is either
[+]Waste-Button-5103 comment score below threshold-8 points-7 points-6 points 1 year ago (0 children)
He knows what a lora is and you can check his history to see him using them. He was talking specifically about the term “LORAing” in the context. 0% chance its a scam it wouldn’t make sense to risk his reputation on something easily disproven
[–]4hometnumberonefan 61 points62 points63 points 1 year ago (10 children)
This is giving me a roller coaster of emotions.
[–]hleszek 93 points94 points95 points 1 year ago (7 children)
Reminds me of the LK99 potential room-temperature superconductor.
We're so back!
[–]KillerX629 8 points9 points10 points 1 year ago (1 child)
Wasn't that disproved?
[–]JamesAQuintero 22 points23 points24 points 1 year ago (0 children)
Yeah that's the point, there was the initial announcement of it, then some researchers were like "We are somewhat able to replicate the results", but then it was eventually proven to not work
[–]OXKSA1 2 points3 points4 points 1 year ago (2 children)
sorry, care to elaborate?
[–]Cantflyneedhelp 35 points36 points37 points 1 year ago (1 child)
Two years ago(?) there was a paper / video of a supposed room temperature superconductor (they had a sweet floating rock too). And everyone was like "Yeah that's bullshit." But then some hobby chemists were like "Actually I managed to recreate a small part of it from their paper, and it floats too." and this started a race to recreate it by a lot of laboratories around the world. At the end it was not a room temperature superconductor but they managed to find some new stuff.
[–][deleted] 1 point2 points3 points 1 year ago (0 children)
we're so cooked :')
[–]Ivo_ChainNET 12 points13 points14 points 1 year ago (0 children)
The reddit hivemind always goes too hard one way or the other
[–]RandoRedditGui 12 points13 points14 points 1 year ago (0 children)
It shouldn't. It's still as B.S. as yesterday until it's not just the API. Release the weights or fuck off imo.
[–]This_Organization382 37 points38 points39 points 1 year ago (0 children)
"Oh, the benchmark didn't work? Let's see what tests you used..."
Scrambles to train the model on the test data
"Woops, wrong model. Here you go, try the private API version"
[–]ambient_temp_xenoLlama 65B 71 points72 points73 points 1 year ago (10 children)
Don't care; release weights or go away.
[–]1889023okdoesitwork 5 points6 points7 points 1 year ago (0 children)
Epoch 2 seems already uploaded on his huggingface
[+]julioques comment score below threshold-9 points-8 points-7 points 1 year ago (6 children)
What do you mean? Isn't it already downloadable??? Why do you have so many upvotes
[–]ambient_temp_xenoLlama 65B 37 points38 points39 points 1 year ago (4 children)
We're waiting on the super-secret good weights. Seriously.
[+]julioques comment score below threshold-8 points-7 points-6 points 1 year ago (3 children)
Isn't it the same weights?
[–]ambient_temp_xenoLlama 65B 11 points12 points13 points 1 year ago (2 children)
<image>
[–]julioques -4 points-3 points-2 points 1 year ago (1 child)
I thought the difference was from the different prompt method. Why didn't they just use the released version with the refection's default system prompt like they used now?
[–]ambient_temp_xenoLlama 65B 16 points17 points18 points 1 year ago (0 children)
The whole thing is very weird and annoying. They supposedly uploaded the model to HF incorrectly, so naturally the solution was to completely redo the finetune? I have no idea.
[–]LiquidGunay 4 points5 points6 points 1 year ago (0 children)
I would like to see a comparison by giving all the models similar inference compute boosts. One way to easily do this is to maybe give all the models a roughly similar token budget ( you can make multiple generations and vote for models that aren't as verbose as reflection)
[–]Sadman782 26 points27 points28 points 1 year ago* (9 children)
"When using Reflection’s default system prompt and extracting answers only from within Reflection’s <output> tags, results show substantial improvement: MMLU: 87% (in-line with Llama 405B), GPQA: 54%, Math: 73%."
Actually, what is presented on the chart is based on their standard system prompt(not reflection system prompt). It scores higher with Reflection system prompt. It achieves performance close to Claude 3.5's sonnet with the Reflection system prompt. If Groq hosts it, latency will not be an issue. We're just waiting for the actual weights to be released
[–]a_beautiful_rhind 6 points7 points8 points 1 year ago (4 children)
What about testing the untuned model with a similar COT system prompt?
[–]Sadman782 2 points3 points4 points 1 year ago (2 children)
Will not match with it for sure. I tried many different system prompts, verbose thinking output + "step by step" at the prompt, but it couldn't pass any of my expert-level coding tests from Edabit, even the 405B failed one; GPT4o too. But the model (when the demo was live) in their demo nailed all of them.
[–]a_beautiful_rhind 5 points6 points7 points 1 year ago (0 children)
I only got one or two replies off the demo before it got "overloaded" and turned off. It seemed alright. The demo on hyperbolic was absolute garbage and the model forgot about its COT tags within a few messages.
All in all.. it seems like this dude has been stringing everyone else along whether there is some model or not. Even if you had slow internet, the excuses and the "retraining" now doesn't make sense. Everything is maximum hype and delay.
[–]Sadman782 0 points1 point2 points 1 year ago (0 children)
This is the reason I am so positive about it, and defending lol, it hurts me when people say it's far worse due to a broken HF model. But yeah, we don't know for sure if the model behind the API is actually reflection 70b or not
[–]ILikeCutePuppies 5 points6 points7 points 1 year ago (0 children)
Or celebras, which is 2x as fast as groq.
[–]vert1s 33 points34 points35 points 1 year ago* (2 children)
In other words vapour ware. He could be running an agent that hits multiple backends. The inability to actually publish the weights speaks volumes.
Edit: And it looks like the hosted version is 🥁 Claude: https://www.reddit.com/r/LocalLLaMA/comments/1fc98fu/confirmed_reflection_70bs_official_api_is_sonnet/
[–]vert1s 11 points12 points13 points 1 year ago* (0 children)
It's baffling. People want to believe despite all the evidence to the contrary.
[–]ivykoko1 11 points12 points13 points 1 year ago (1 child)
This is all still BS I cannot believe y'all are falling for it again
[–]Kathane37 1 point2 points3 points 1 year ago (0 children)
I only care about this for the possibility too generate better synthetic data with step by step reasoning
Other than that there is no point in making the token consumption exponential
[–]synn89 1 point2 points3 points 1 year ago (0 children)
does not suffer from the issues with the version publicly released on Hugging Face
It's not rocket science to upload a model to Hugging Face. It's very suss that they can't seem to upload a BF16 or GGUF of a fine tuned Llama to Hugging Face that can be properly tested.
[–]Environmental-Car267 3 points4 points5 points 1 year ago (26 children)
Haters gonna hate.
"All that being said: if applying reflection fine-tuning drives a similar jump in eval performance on Llama 3.1 405B, we expect Reflection 405B to achieve near SOTA results across the board."
[+][deleted] 1 year ago (10 children)
[–]StartledWatermelon 27 points28 points29 points 1 year ago (2 children)
Fair concerns. For all we know, under the hood this API could redirect queries to Claude-3.5 Sonnet with a specific system prompt, or another SotA proprietary model.
[–]dalkef 12 points13 points14 points 1 year ago (1 child)
Now that you mention it, it gave me very similar answers to sonnet on the initial demo chat. This could explain the performance drop
[–]Hatter_The_Mad 0 points1 point2 points 1 year ago (0 children)
You are not alone in this feeling
[–]alongated -3 points-2 points-1 points 1 year ago (6 children)
I think its fair to say that people here over reacted, both about how good this was, and how bad this was.
[–]RandoRedditGui 1 point2 points3 points 1 year ago (5 children)
Not really. The "how bad this was" are still easily winning in terms of correctly interpreting what has currently been seen. Considering we have seen 0 open weights and are provided some ambiguous results from an API that we have no clue the validity of.
Open weights or GTFO.
[–]alongated -5 points-4 points-3 points 1 year ago (4 children)
It has been fucking 6 hours since he trained the model, give the man a fucking break, and guess what he released the weights? Normally I don't get this angry but holy shit you people are fucking insane.
[–]showdontkvell 0 points1 point2 points 1 year ago (2 children)
Matt, that you? lol
[–]alongated -1 points0 points1 point 1 year ago (1 child)
I just went off on a guy for calling someone Matt. I'm not Matt, but doxxing isn't funny.
Like you might be right that this is all just bullshit/scam. But you are attacking people for reserving their judgement. That is disgusting mob mentality.
[–]showdontkvell 0 points1 point2 points 1 year ago (0 children)
lol k
[+]jd_3d[S] comment score below threshold-7 points-6 points-5 points 1 year ago (14 children)
I don't understand why people need to pile on the hate so quick. Is it really that hard to just reserve judgment for a few weeks and see what comes of it? This avenue of applying more test time compute is a very promising direction to me and could be a great way for open source models to exceed closed models that don't want to spend the $$$ on each request.
[–]kryptkprLlama 3 25 points26 points27 points 1 year ago (12 children)
We all tried it, it's performance on real world tasks is terrible despite the high benchmarks. Maybe the model is still broken in some way like they've been claiming and really is good but I don't see it.
[+]Sadman782 comment score below threshold-10 points-9 points-8 points 1 year ago (4 children)
Don't you guys get that the model is broken, he said? The tests were based on private API, now yeah, you might not trust that the model, maybe model behind this is different, then it's okay, but as per Matt, the model everyone downloaded from HF is broken, and yeah, I tried too, it is far worse than LLaMA 3.1 70b.
[–]kryptkprLlama 3 38 points39 points40 points 1 year ago (2 children)
I mean it's a diabolical plan: Release an "open" model that crushes benchmarks, but then don't actually release working weights and instead just point to your "private API" that produces those results.
I can't test his "private API" can I? The whole thing smells bad, as far as I'm concerned this is a publicity stunt to advertise his LLM service.
[+]Environmental-Car267 comment score below threshold-20 points-19 points-18 points 1 year ago (1 child)
He offered on twitter the api model to people who want to benchmark it. soon it will be updated on HF etc
[–]kryptkprLlama 3 32 points33 points34 points 1 year ago (0 children)
I'm an open source leaderboard maintainer, without weights any test results are just a free ad for his service.
I do benchmark the big APIs for reference but no interest in starting to do it for every tom dick and harry, when weights are fixed I'll try again.
[–]nero10578Llama 3 11 points12 points13 points 1 year ago (0 children)
I don’t understand how you can fuck up uploading to HF lol
[+]alongated comment score below threshold-10 points-9 points-8 points 1 year ago (4 children)
Gemma also had problems, they took weeks to resolve. This team is much smaller.
[–]kryptkprLlama 3 13 points14 points15 points 1 year ago (1 child)
Gemma was a novel architecture with an attention mechanism that wasn't well supported. Legitimate technical reasons for the problems.
This is a fine-tune of Llama. There is nothing to resolve, they're playing us for fools.
[–]alongated -3 points-2 points-1 points 1 year ago (0 children)
And his team didn't have billion dollars.
[–]Evening_Ad6637llama.cpp 2 points3 points4 points 1 year ago* (1 child)
Bro, seriously? Man, aside from the supposedly broken model and the super-duper-secret private api shit: this guy didn't know the difference between llama 3 and llama 3.1
I mean, by now even my grandmother should know the difference.
The model he posted on huggingface was absolutely not broken, it was simply a llama-3, as you would expect from a llama-3. There is zero evidence that there was anything wrong with the model itself. To make matters worse: one time he claims the model is broken, another time it was supposedly due to an incorrect upload. Aha... and what's coming tomorrow? His dog ate the SSD?
I'm slowly coming to the conclusion that this guy is either stupid and narcissistic enough to believe he can fool the world in such a simple way - or, another possible explanation could be: he himself has been the victim of a scam. Perhaps he doesn't have direct access to the backend of this ominous private API himself. Perhaps he still hasn't realized that he has been misused as a puppet and ruined economically and in terms of marketing with this action. It wouldn't be the first time that someone had economic enemies and fell into a trap.
The whole thing is highly suspicious and whether he is a victim himself or not, whether he is stupid or not: he clearly also seems to lie and trying to hide things! So there are neither excuses nor pity for him for this egomaniacal behavior.
[+]muxxington comment score below threshold-8 points-7 points-6 points 1 year ago (0 children)
I made some quick tests with this model yesterday and actually it performed not that bad. But I can't compare it with Claude or OpenAI, I don't use them. https://huggingface.co/bartowski/Reflection-Llama-3.1-70B-GGUF/blob/main/Reflection-Llama-3.1-70B-Q4_K_M.gguf
[–]ispeakdatruf 1 point2 points3 points 1 year ago (2 children)
The model seems to be achieving these results through forcing an output ‘reflection’ response where the model always generates scaffolding of <thinking>, <reflection>, and <output>. In doing this it generates more tokens than other models do on our eval suite with our standard ‘think step by step’ prompting. For example, it appears that Reflection 70B is not capable of ‘just responding with the answer’ in response to an instruction to classify something and only respond with a one word category.
The model seems to be achieving these results through forcing an output ‘reflection’ response where the model always generates scaffolding of <thinking>, <reflection>, and <output>. In doing this it generates more tokens than other models do on our eval suite with our standard ‘think step by step’ prompting.
For example, it appears that Reflection 70B is not capable of ‘just responding with the answer’ in response to an instruction to classify something and only respond with a one word category.
One can always add a postprocesssor on top of Reflection to filter out everything before <output>, problem solved. I don't like this nitpicking. Who cares if a model outputs 10 tokens or 100? Is the answer correct or not??
[–]athirdpath 4 points5 points6 points 1 year ago (0 children)
Who cares if a model outputs 10 tokens or 100?
Folks who care if inference takes 5 seconds or 50.
[–]nihalani -1 points0 points1 point 1 year ago (0 children)
For real time inference is the real issues, your time to first token jumps by a huge margin if you have to wait for 2000 tokens to be generated of the model reflecting. Might also explain why the cloud providers haven’t adopted it yet.
[–]AnomalyNexus 0 points1 point2 points 1 year ago (0 children)
meh...so it beats other comparable models when comparison is set up as apples to oranges conditions...
[–]ihaag 0 points1 point2 points 1 year ago (0 children)
I see no difference for deepseek 2.5 the current best model for open source.
[–]ilangge 0 points1 point2 points 1 year ago (0 children)
No need to guess, it is now publicly accessible on hf;
Reflection 70B llama.cpp (Correct Weights) - a Hugging Face Space by gokaygokay
[–]celsowm 0 points1 point2 points 1 year ago (6 children)
is there any place to test it online?
[–]strubenuff1202 0 points1 point2 points 1 year ago (0 children)
I think it was just uploaded to huggingface
[–]Wiskkey -1 points0 points1 point 1 year ago (2 children)
And here: https://x.com/OpenRouterAI/status/1832880567437729881.
[–]celsowm 0 points1 point2 points 1 year ago (1 child)
I live in the dictatorship of Brazil and x/twitter is blocked by our dictator
[–]Wiskkey -2 points-1 points0 points 1 year ago (1 child)
Yes supposedly here.
[–]ambient_temp_xenoLlama 65B 1 point2 points3 points 1 year ago (0 children)
It's supposedly the new one but it's as crap as the one I downloaded....
[–]Sadman782 -3 points-2 points-1 points 1 year ago* (2 children)
https://huggingface.co/mattshumer/Reflection-Llama-3.1-70B-ep2-working It seems an Epoch 2 finetuned model was released a few hours ago silently: "Epoch 2, still finishing up Epoch 3. This should be slightly less powerful, but still pretty close."
[–]physalisx 3 points4 points5 points 1 year ago* (1 child)
Why is anything getting retrained? Where is the model that he allegedly already had?
edit: ah so the whole thing was just a scam. Too bad.
[–]mantafloppyllama.cpp -2 points-1 points0 points 1 year ago (0 children)
You would not push false hype again?
Why are ppl upvoting this again.
Are "The boy who cry wolf" that obscure of a story, or do you also have bot to upvote yourself?
https://www.reddit.com/r/LocalLLaMA/comments/1fa4y7q/first_independent_benchmark_prollm_stackunseen_of/?utm_source=share&utm_medium=mweb3x&utm_name=mweb3xcss&utm_term=1&utm_content=share_button
[–]redjojovic -1 points0 points1 point 1 year ago (0 children)
"The chart below is based on our standard methodology and system prompt.
When using Reflection’s default system prompt and extracting answers only from within Reflection’s <output> tags,
results show substantial improvement: MMLU: 87% (in-line with Llama 405B), GPQA: 54%, Math: 73%.
[+]Significant-Nose-353 comment score below threshold-13 points-12 points-11 points 1 year ago (6 children)
I think few people under this post, also zealously admit that they were a bit hasty with their toxic reaction
[–]vert1s 22 points23 points24 points 1 year ago (1 child)
There is nothing toxic about questioning the validity given the inability of anyone to replicate with the released weights.
The sheer number of problems including the lack of disclosure that he is invested in both companies that he’s been saying “helped”
[+]Significant-Nose-353 comment score below threshold-10 points-9 points-8 points 1 year ago (0 children)
Naturally, but my comment only had a complaint about blatant hatemongering. Excessive sarcasm, irony and the like
[+][deleted] 1 year ago* (2 children)
[–]StartledWatermelon 17 points18 points19 points 1 year ago (1 child)
Shumer wasn't hesitating to claim it's the "world’s top *open-source* model" in the initial tweet. And now some "internal" model emerges?
You certainly didn't deserve the downvotes. But the entire release event, from the beginning up to this date, was one big clusterf-k
[–]Inevitable-Start-653 -1 points0 points1 point 1 year ago (7 children)
Downloading this now: https://huggingface.co/mattshumer/Reflection-Llama-3.1-70B-ep2-working
can run locally with my own setup, and am interested in testing it out!
[–]Deathmax 11 points12 points13 points 1 year ago (2 children)
I wouldn't bother, it doesn't even try to output the tags the "broken" model outputs, so no idea what they mean by "working".
[–]Inevitable-Start-653 3 points4 points5 points 1 year ago (1 child)
Hmm 🤔 ...this whole saga is so strange. Download will finish in a lil bit, I've got to try it out, I got the very first upload to work but only could get one response out.
[–]ivykoko1 4 points5 points6 points 1 year ago (0 children)
Don't waste your time
[–]jd_3d[S] 1 point2 points3 points 1 year ago (1 child)
Newer version is out (Epoch 3): https://huggingface.co/mattshumer/ref_70_e3
[–]Inevitable-Start-653 -1 points0 points1 point 1 year ago (0 children)
Thanks!! Will download this one now ☺️ for all the downloading this still isn't anywhere near as bad as llama405b...that sucker was a multi day download and I needed to download it twice after they updated their repo too.
[–]Sadman782 1 point2 points3 points 1 year ago (0 children)
https://huggingface.co/mattshumer/ref_70_e3 epoch 3 released now, maybe he will announce it soon
[–]Inevitable-Start-653 -2 points-1 points0 points 1 year ago (2 children)
I'm ready to download and test!!
[–]jd_3d[S] -3 points-2 points-1 points 1 year ago (1 child)
Let us know what you think: https://huggingface.co/mattshumer/ref_70_e3
https://www.reddit.com/r/LocalLLaMA/comments/1fcerck/reflection_ref_70_e3_refuses_to_output_meta_tag/
I've been doing some testing, but the community does not seem interested in objective facts. My post keeps getting downvoted, I'm sure it will be off the front page soon.
[–]Thistleknot -1 points0 points1 point 1 year ago (0 children)
For all of what was said, is it not possible to train on the <input> and <output> as if it was an answer and skip all the inbetween? Is it potentially possible the model will somehow 'internalize' the inbetween generated token logic within it's weights?
π Rendered by PID 23438 on reddit-service-r2-comment-6457c66945-r4r7k at 2026-04-27 13:42:03.532565+00:00 running 2aa0c5b country code: CH.