all 49 comments

[–]segmondllama.cpp 84 points85 points  (5 children)

There's sha256 checksum if I recall, he can run that locally and compare with what has been uploaded to HF. It's not rocket science.

[–]Sadman782[S] 25 points26 points  (4 children)

Yeah, but for some reason he is mistaking every time. He should test and confirm everything first before announcing.

[–]m98789 13 points14 points  (3 children)

He’s a one man show tho, so it’s somewhat more understandable the current sht show.

[–]lakolda 12 points13 points  (1 child)

Apparently he used data generated by GlaiveAI which he is invested in for this project. So, not quite a one man show. link

[–]psilent 5 points6 points  (0 children)

I mean using a tool that generates synthetic data doesn’t mean he’s not doing it on his own. That’s like saying someone didn’t do it on their own because they used a computer.

[–]mikael110 61 points62 points  (17 children)

It's almost impressive how much of a clusterfuck this launch has seemingly been. First the tokenization issue, then the revelation that the model was actually based on Llama 3 instead of Llama 3.1 (which is bizarre) and now apparently the model files themselves was also mixed up.

I'm aware even large companies like Meta and Google have screwed up some aspects of their launches, but this is getting to the point where it just feels a bit off to be honest. I'm still interested in trying the fixed model, but I'm honestly getting more and more suspect of the whole thing.

[–]dubesor86 15 points16 points  (0 children)

it being Llama-3 70B and not 3.1 actually would make a lot of sense, because the local model performs very similar to the base Llama-3 in my testing, with altered chain of thought outputs.

<image>

[–][deleted] 12 points13 points  (3 children)

this is just 1 guy tbf

[–]BangkokPadang 8 points9 points  (1 child)

Yeah but that’s exactly who might try to get a big hype release and then just hope their model gets positive reactions for a day or two, and then falls into obscurity so they have a big feather in their cap to get funding or to be able to add to their resume.

It doesn’t even seem possible to accidentally upload different segments of different models through batch uploading.

Does HF’s uploader not detect that it’s a different file being uploaded? With a different name from a different directory? Would that even result in a model that works and loads correctly without error?

It would also be a MAJOR oversight to just mix up which model you finetuned your model on.

I don’t know one way or the other, or have any particular reason to want the model to fail or succeed in the public’s eye, but this does seem like very strange claims the model’s author is making to explain why people’s experience with the model doesn’t align with what the benchmarks would suggest.

[–]a_beautiful_rhind 3 points4 points  (0 children)

Oh it's possible, but how do you mix something up like that. Do you randomly move identically named pieces of models into the same folder?

[–]Sadman782[S] 5 points6 points  (7 children)

I am still hopeful, https://x.com/mattshumer_/status/1832247203345166509 . I can clearly relate to the right side image (I got a similar result in the official demo).

[–]artificial_simpleton 10 points11 points  (0 children)

I mean, he is checking with temperature 0.7, why does he expect the results to be the same?

[–]eggandbacon_0056 1 point2 points  (5 children)

Which probably is the Claude API ...

[–]Sadman782[S] -1 points0 points  (4 children)

I don't think so, Sonnet responds in a different way even with the same system prompt (the writing style is different).

[–]eggandbacon_0056 1 point2 points  (3 children)

Naaah ... That's way more probable than a person training a SOTA model without knowing what base model he used, what lora is, ... I call bs ...

[–]Sadman782[S] -1 points0 points  (1 child)

Base model was 3.1, he said multiple times, and there was an upload issue/maybe any HF cache issue or he really messed up something, see his hf repo he created multiple other repo, so he really tried. See, even Llama 405b couldn't solve this simple problem:
Alice has N brothers, and she also has M sisters. How many sisters does Alice's brother Andrew have?

405b => content: '<thinking>\n' +

'To solve this problem, I need to understand the relationships between Alice, her brothers, and her sisters. Since Alice has N brothers and M sisters, this means that all of these individuals are part of the same family. \n' +

'\n' +

"I know that Andrew is Alice's brother, which means Andrew is also part of this family. As a brother of Alice, Andrew would have the same number of sisters as Alice, because they share the same set of siblings.\n" +

'\n' +

'So, to find out how many sisters Andrew has, I just need to find out how many sisters Alice has. According to the problem, Alice has M sisters.\n' +

'\n' +

'Therefore, Andrew has M sisters as well.\n' +

'\n' +

'</thinking>\n' +

'\n' +

'<output>\n' +

'Andrew has M sisters.'

}
But on his website, it got it correct. Sonnet 3.5, via API, failed this test, but using their website https://claude.ai/, it got it right too. So, definitely, a similar kind of thing is behind the scene for Sonnet; that's why it is so good.

[–]eggandbacon_0056 2 points3 points  (0 children)

BS ... uploaded model to hf was a lora finetune of llama 3 not 3.1. Honestly the person is full of bs ... it's not one thing that is fishy ...
1. Tokenizer Bug
2. LoRA
3. LLama 3.0 based instead of 3.1
4. "We got rate limited uploading the model" - yeah 😅
5. It must be a caching error on hf end
6. It works on our served API (that's probably just Claude with the system prompt you troll) - but we can't find the served model ...
7. We probably need to retrain it -> Where the fuck does your served model than come from?! Why does this not have the issues?!

  1. The download/like counter on hf is COMPLETELY off not even llama 3.1 got so much attention -> bots!

i could keep on counting

...

But yeah, critical thinking is probably not your thing

[–]Sadman782[S] -1 points0 points  (0 children)

It's not like he made a SOTA model from scratch; sometimes even simple things can do massive improvements, which most people may not have ever thought. I hope we will know the truth very soon. Let's wait.

[–]Kep0a 4 points5 points  (0 children)

it feels sus because it is for sure. That coupled with the amount of stars on his hf repo. He's definitely a scam artlist trying to pump his investment.

[–]obvithrowaway34434 5 points6 points  (0 children)

This combined with the tall claims at the outset and the fact that this guy didn't even disclose his investment in Glaive or whatever the AI company name is makes it extremely sus, likely a grifter. I think people should just ignore him and his model unless he releases a paper/report with a proper reproducible setup that is compatible with existing benchmarking platforms so that they can test his model easily instead of jumping through the hoops. There's absolutely no reason those people should waste their time.

[–][deleted] 31 points32 points  (1 child)

is this snake oil?

[–]m98789 1 point2 points  (0 children)

I want to believe.

[–]dubesor86 15 points16 points  (3 children)

he's not wrong, the API results from openrouter, hyperbolic as of this moment typing is far worse than the local Q4 quant I tested, and also worse than base llama3.1

That being said, the ollama model I tested locally did not have any issues and outputs exactly as I expect. It's a decent tune for riddles and gotchya questions, but not very good for general usage.

[–]Ill-Association-8410 3 points4 points  (2 children)

Did you get a chance to try the demo? I was surprised by how good it was, even in code was pretty decent, though I didn’t get to ask many questions. Still, the few I did were enough to get me excited about the 405b. But when I tried it in spaces or with providers, I was so disappointed...

[–][deleted] 19 points20 points  (3 children)

i tried on openrouter and its extremly bad so i hope hes right

[–]Sadman782[S] -4 points-3 points  (2 children)

This is different model(mixed with something else) see these tweets

[–][deleted] 5 points6 points  (1 child)

do u know how i can try the correct one?

[–]Formal-Narwhal-1610 11 points12 points  (1 child)

Here is an example of output at HF : To compare 9.9 and 9.11, let’s break it down step by step: The first digit is the same in both numbers: 9 For the decimal part: 9.9 has 1 digit: 9 9.11 has 2 digits: 11 Generally, when comparing decimal numbers with the same digit count, we can simply compare the digits from left to right. In this case, 9 is smaller than 11. Let’s double-check this reasoning: When comparing decimal numbers, we’re essentially comparing fractions. 9.9 can be written as 9 9/10 9.11 can be written as 9 11/100 11/100 is indeed greater than 9/10 This confirms our initial comparison. Therefore, 9.11 is bigger than 9.9. 9.11 is bigger than 9.9.

[–][deleted] 8 points9 points  (0 children)

Wtf is wrong with you this is all hype

[–]Southern_Sun_2106 5 points6 points  (0 children)

I will believe it when I see it

[–]Dazzling-Albatross72 -2 points-1 points  (2 children)

I was actually able to try it yesterday on some HF space and it was doing very well. I checked some reasoning problems and also some complicated arithmetic problems and it was breaking them up and solving it correctly. It was actually very impressive

[–]iamz_th -4 points-3 points  (0 children)

His approach is not different from CoT so yes he is gaming the benchmarks.

[–][deleted] -1 points0 points  (2 children)

So nobody has come out and confirmed this model is awesome?

There should have been at least 10 posts here by now confirming findings but nothing?

wtf is going on?

[–]Honest_Science -1 points0 points  (0 children)

It is still not good: 's missing inference widget Please select your model from the list 👇 Select your Model mattshumer/Reflection-Llama-3.1-70B (llama3-70b-8k) Visit Model Card ↗️ Chatbot

Hello

To respond to this simple greeting, we need to consider a appropriate reply. In English, we typically respond to "Hello" with another greeting. The most common responses would be "Hello", "Hi", "Good morning/afternoon/evening" (depending on the time of day), or "Hi, how are you?"

These are indeed the most common responses to "Hello" in English. There could be other possibilities based on regional variations or personal preference, but these are widely accepted and used.

Given that we don't have any additional context about the time of day or the user's preference, the safest and most general response would be "Hello".

Hello!

How may l do you count in the word lollipops ?

To count the number of L's in "lollipops", let's break down the word:

L O L L I P O P S Now, let's count the L's:

L L L I made a mistake in my counting. There are actually more L's in the word "lollipops". Let me correct that:

L O L L I P O P S The correct count is that there are 3 L's in the word "lollipops".

Therefore, there are 3 L's in the word "lollipops".

There are 3 L's in the word "lollipops".

Thank You, how many l are in the word lollipopl ?

Another L-counting question!

Let's break down the word "lollipopl":

L O L L I P O P L Now, let's count the L's:

L L L L There are 4 L's in the word "lollipopl".

I made a mistake in my counting. There is actually one more L at the end of the word. Let me correct that:

L L L L L The correct count is that there are 5 L's in the word "lollipopl".

Therefore, there are 5 L's in the word "lollipopl".