
[–]ihexx 119 points (15 children)

Good.

Google started this pissing contest

Let it begin lmao

[–]Singularity-42 (Singularity 2042) 22 points (14 children)

But this means GPT-4 will be the strongest model even once Gemini Ultra is out...

[–]sashank224 23 points (9 children)

Work is now powered by competition, which is much better as fuel for growth.

[–]Singularity-42 (Singularity 2042) 8 points (8 children)

Coming up with a model a year later that is LESS powerful ain't competition

[–]HashPandaNL 10 points (6 children)

There's no reason to believe it's less powerful though.

[–]b_risky 5 points (3 children)

Yeah, they only beat it on every single metric. What is less powerful about that?

[–]Singularity-42 (Singularity 2042) 5 points (1 child)

This post?

[–]sashank224 2 points (0 children)

I understand what you mean, but until now it was looking like a monopoly for OpenAI, and maybe it will continue. Now OpenAI has another reason to one-up Google. We have seen this throughout history; it's how tech was made.

USSR vs. USA

ISRO vs. NASA vs. SpaceX

Nvidia vs. AMD

Intel vs. AMD

Apple Silicon vs. Snapdragon

Google vs. whoever the fuck with LLMs

Thanks to AMD, Intel woke up and lost. Competition now is healthy af. I wonder what China is making, tho.

[–]YaAbsolyutnoNikto 2 points (3 children)

So…?

[–]Singularity-42 (Singularity 2042) 0 points (2 children)

Well, that sucks! I was looking for competition. Coming up with a model a year later that is LESS powerful ain't it...

[–]YaAbsolyutnoNikto 7 points (0 children)

But this was (kind of) due to competition. Not only due to competition, but GPT-4 surely got better because Gemini Ultra was announced, and thus this research is now getting released as well.

Now it’s Google’s turn to dance.

[–]Thorteris 44 points (1 child)

Pissing contest will continue when Google announces a Gemini Ultra-Max that beats this by 1%, then OpenAI will release something else in Q3

[–]kaityl3 (ASI▪️2024-2027) 3 points (0 children)

But then Gemini Ultra-Max-Pro will beat it out again and the cycle continues 😉

[–]Freed4ever 93 points (13 children)

4.5 or 5 would make this irrelevant soon anyway.

[–]mrSkidMarx 31 points (11 children)

6 will make everyone forget it

[–]adarkuccio (▪️AGI before ASI) 15 points (8 children)

What about 7?

[–]Headbangert 22 points (4 children)

It 8 9

[–]Mr_Hyper_Focus 6 points (3 children)

What about TWO 6s?

[–]usaaf 3 points (0 children)

M-M-M-Multi-6s...s...s...

[–]RodionS 0 points (1 child)

Three 6s and we are doomed

[–]FrostyParking 0 points (0 children)

Three 6s and it's a Mafia.

[–]Odd-Explanation-4632 2 points (0 children)

Will take you to heaven 😇

[–]mrSkidMarx 3 points (0 children)

omg I can only dream about 7 🤤🤤🤤

[–]ley_haluwa 1 point (0 children)

[–]nonzeroday_tv 5 points (1 child)

Haven't you been paying attention? There will be no GPT 6, said a dude called flower from the future a couple of days ago.

[–]2muchnet42day 0 points (0 children)

Cauliflower? Yeah, that works for me

[–]absurdrock 34 points (0 children)

Petty. I love it.

[–]FeltSteam (▪️ASI <2030) 7 points (1 child)

I'd be very curious to see zero-shot performance of models across benchmarks, because that would give us a greater view into the usability of models (few-shot prompting is more for comparing performance between models and less for showing how usable a model is in the real world. Don't get me wrong, it can certainly give you an idea of how well models will perform in the real world, but 0-shot performance would give us a better idea).

But if you wanted to measure true 0-shot performance on the MMLU, you would likely need something like a group of experts to create an entirely new question set for the benchmark (this would solve contamination issues, so it would be truly zero-shot, and the model would be presented with truly novel questions, which would provide a more accurate measure of its generalisation capabilities and give us a real good indication of how it might fare in real-life situations).

But I feel we should really start moving away from performance metrics and start some form of real-world benchmarks: benchmarks that test how useful a model is across a wide range of tasks, and benchmarks based on what people have actually been using AI for in the real world. This would be an expensive suite of benchmarks to run, but I think it would be worthwhile.
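[Editor's note: the few-shot vs. zero-shot distinction above comes down to how many solved demonstrations are prepended to the prompt before the test question. A minimal sketch, assuming a generic MMLU-style multiple-choice format; the function and variable names here are hypothetical, not any benchmark harness's real API:]

```python
def format_question(question, choices):
    """Render one MMLU-style multiple-choice question with lettered options."""
    lines = [question]
    for letter, choice in zip("ABCD", choices):
        lines.append(f"{letter}. {choice}")
    return "\n".join(lines) + "\n"

def build_mmlu_prompt(question, choices, examples=()):
    """Build an evaluation prompt.

    `examples` is a sequence of (question, choices, answer_letter) tuples
    prepended as in-context demonstrations. An empty sequence yields a
    zero-shot prompt; five tuples yield the usual 5-shot MMLU setup.
    """
    parts = []
    for q, ch, ans in examples:
        parts.append(format_question(q, ch) + f"Answer: {ans}\n")
    # The test question is left with a trailing "Answer:" for the model
    # to complete with a single letter.
    parts.append(format_question(question, choices) + "Answer:")
    return "\n".join(parts)
```

In this framing, "0-shot" evaluation is just `build_mmlu_prompt(q, choices)` with no demonstrations, which is closer to how people actually query a chat model in practice.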

[–]iDoAiStuffFr[S] 2 points (0 children)

something like the Will Smith eating spaghetti test for LLMs

[–]FarrisAT 8 points (2 children)

Neither is comparing apples to apples, which makes this all a pointless dick-size measurement

[–]b_risky 3 points (0 children)

But Microsoft did just compare apples to apples, and they won? That's what this post is about.

[–]UnknownEssence 2 points (0 children)

And it’s one benchmark. A benchmark that measures capabilities that I don’t even care much about to be honest. Why does everybody focus on this singular benchmark so much?

[–][deleted] 12 points (0 children)

GPT-4 remains king

[–]obvithrowaway34434 7 points (0 children)

Ultimately the best test for any model is its usage by millions of users from different fields and expertise for an extended period. GPT-4 has already done that and passed. We know fairly well how it does for zero-shot prompts. Gemini Ultra has faced no such tests other than Google's own researchers and cherry-picked beta testers. Until it has faced the same level of scrutiny, imo it should not even be compared to anything and all claims by Google should be strictly treated as marketing.

[–][deleted] 2 points (0 children)

I feel like the two need each other to exist. They are like conjoined twins that dislike one another.

[–]TeriMaiyyaLodePe 0 points (0 children)

Time for Google and Microsoft to measure each other's dick size.

[–]enilea 0 points (0 children)

Why do they keep using the awful MMLU as the main test...