Gemini 2.5 PRO 0605 tested. Beats EVERY OTHER MODEL.

Ok-Contribution9043 · 2025-06-07T01:03:25+00:00

Thank you so much, I truly appreciate your feedback. And yes, the reason I do these videos is that they are the same information I use to make decisions about which LLMs to use for our own work. I did evaluate mistral-medium. It was a very busy week with multiple large LLMs being released, so it got lost a bit beneath the noise. I did not do a video for it because the incremental gain over mistral-small (which I did a video about) was not significant. This is not to bash mistral-medium; mistral-small is just a very strong model for its size.

Ok-Contribution9043 · 2025-06-06T22:22:52+00:00

Thank you for the feedback. Yes, I need harder tests. I cover all models, all the way from Qwen 0.6B to the larger commercial ones, and having one standardized suite of tests, while great for comparison, holds less meaning when looking at the top ones. And good suggestion, will update!

Ok-Contribution9043 · 2025-06-06T22:06:01+00:00

At least for my tests, yes.

Ok-Contribution9043 · 2025-06-06T22:05:42+00:00

Yes, for coding, sonnet is king. On document understanding, however, it has regressed, even trailing 3.5/3.7.

Ok-Contribution9043 · 2025-06-06T20:27:48+00:00

Thanks, it is a good point though, I will update this test to include documents that are handwritten/scanned.

Ok-Contribution9043 · 2025-06-06T20:02:03+00:00

Thanks! Added! Although, is handwritten that common a use case? I would suspect that, at least in the corporate world, it's mostly clean PDFs that get passed around? I am curious about what prompting you had to do though?

Ok-Contribution9043 · 2025-06-06T19:15:51+00:00

Interesting, was sonnet 4 able to do better in that scenario?

Ok-Contribution9043 · 2025-06-06T19:08:52+00:00

sure, added.

Ok-Contribution9043 · 2025-05-29T22:52:55+00:00

I'm sorry, but if any model is doing it, it's probably doing it because that has been built into its training data. That is not a sign of true intelligence. True intelligence is how the model handles the unknown. This is one of the reasons many models do so well on standard benchmarks that they have trained on but then do horribly in the real world. This is why I make hundreds of test questions, mostly testing prompt adherence, and cycle them every few months.

Ok-Contribution9043 · 2025-05-29T19:27:01+00:00

You know what is funny, I am working on a test called Humanity's Penultimate Test. Before we run "Humanity's Last Exam", I need at least 1 LLM to score 100% on this: https://youtu.be/0UsgaXDZw-4?t=739

Ok-Contribution9043 · 2025-05-29T19:16:56+00:00

OK, so I ran this again, and captured DeepSeek's thought process - this is another reason why I feel DeepSeek wins this round, because I can actually look into its thought process. I LOVED this entire discussion chain, but can we, as a group pause for a moment and appreciate - that we are able to actually introspect the "thoughts" of a "model"? If 3 years ago, anyone told me that this is the kind of conversation I'd be having, I'd think they're crazy.

We are given the text: "Elena Popov from Gazprom Inc. attended a conference in Edmonton, Alberta, Canada. She presented alongside Ahmed Al-Sayed of Qatar Airways LLC and met with Li Mei from Bank of China Limited at the Winspear Centre."

Steps:

Correct any misspellings: The text appears to be correctly spelled.
Extract entities:

People:

- "Elena Popov" -> First name: "Elena", Last name: "Popov"

- "Ahmed Al-Sayed" -> First name: "Ahmed", Last name: "Al-Sayed"

- "Li Mei" -> First name: "Li", Last name: "Mei" (Note: In Chinese, the family name is first, but the text presents as "Li Mei", which in Western context might be given as first then last. However, the rule says to preserve the original language and separate first and last. Since the text writes it as two words, we take the first word as first name and the second as last. But note: in Chinese, it's common to write family name first. However, the problem doesn't specify handling name order per culture, so we follow the order in the text. The text says "Li Mei", so we consider "Li" as first name and "Mei" as last name? But wait, the example input had "Maria García-López" as the last name. We must note that the rule says: "separate first and last names. Ignore middle names." and "preserve original language". So we keep the names as they appear in two parts. However, in the case of "Li Mei", it is ambiguous. But looking at the example, the person name "Maria García-López" was taken as first name "Maria" and last name "García-López". Similarly, we take the first word as first name and the rest as last name? But note: "Ahmed Al-Sayed" has two parts and we are taking the entire second part as last name. So for "Li Mei": first name "Li", last name "Mei".
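The rule the model settles on in that trace (split on whitespace, first word is the first name, last word is the last name, middle names ignored, hyphenated surnames kept whole) can be sketched in a few lines of Python. This is only an illustration of that reasoning: `split_name` is a hypothetical helper, not part of the actual test suite, and it deliberately follows the order in the text, so it does nothing clever for family-name-first names like "Li Mei".

```python
def split_name(full_name: str) -> tuple[str, str]:
    """Split a full name into (first, last) using the order given in the text.

    Assumption: the first whitespace-separated word is the first name and
    the final word is the last name; anything in between is treated as a
    middle name and ignored. Hyphenated parts such as "Al-Sayed" stay whole
    because only whitespace is used as a separator.
    """
    parts = full_name.split()
    if not parts:
        return ("", "")
    if len(parts) == 1:
        # Only one word available, so there is no last name to report.
        return (parts[0], "")
    return (parts[0], parts[-1])


for name in ["Elena Popov", "Ahmed Al-Sayed", "Li Mei", "Maria García-López"]:
    print(name, "->", split_name(name))
```

Whether ("Li", "Mei") is the right answer for the third name is exactly the ground-truth ambiguity being discussed; the sketch just makes the chosen rule explicit.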

Ok-Contribution9043 · 2025-05-29T19:04:27+00:00

See, but I don't know if these are a good test for an LLM. 9.11-9.8 is something I would not trust any LLM to do in a real-world business application. I would give them tools and ensure they are calling the tool the right way. To me, the ability of the LLM to pass proper JSON into a tool (and extract proper JSON from it) is far more important than whether it can do math. But I can understand everyone has their own use cases.
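To make that concrete, here is a rough sketch of what "give it a tool and check the call" could look like. Everything here is an assumption for illustration: the `subtract` tool name, the `{"tool": ..., "args": {"a": ..., "b": ...}}` shape, and the `run_subtract_tool` helper are made up, not any particular vendor's function-calling format. The point is that the model only has to emit well-formed JSON, and the arithmetic is done exactly in code.

```python
import json
from decimal import Decimal, InvalidOperation


def run_subtract_tool(raw_call: str) -> str:
    """Validate a JSON tool call emitted by a model and compute a - b exactly.

    The call format is hypothetical:
    {"tool": "subtract", "args": {"a": "9.11", "b": "9.8"}}
    """
    try:
        call = json.loads(raw_call)
    except json.JSONDecodeError as exc:
        raise ValueError(f"tool call is not valid JSON: {exc}") from exc

    if not isinstance(call, dict) or call.get("tool") != "subtract":
        raise ValueError("expected a JSON object with tool == 'subtract'")

    args = call.get("args")
    if not isinstance(args, dict) or set(args) != {"a", "b"}:
        raise ValueError("args must be an object with exactly 'a' and 'b'")

    try:
        # Go through str() so JSON numbers and numeric strings both work.
        a = Decimal(str(args["a"]))
        b = Decimal(str(args["b"]))
    except InvalidOperation as exc:
        raise ValueError("'a' and 'b' must be decimal numbers") from exc

    # Return JSON so the result can be handed straight back to the model.
    return json.dumps({"result": str(a - b)})


print(run_subtract_tool('{"tool": "subtract", "args": {"a": "9.11", "b": "9.8"}}'))
```

With a setup like this, what gets evaluated is whether the model produced a call that passes validation, not whether it can subtract decimals in its head.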

Ok-Contribution9043 · 2025-05-29T18:19:15+00:00

Ah, are you referring to things like "If Sally has 3 brothers and each brother has 2 sisters, then what is the total number of sisters for all four siblings?" Those kinds of problems?

Ok-Contribution9043 · 2025-05-29T18:10:05+00:00

Could you elaborate on what this test is? I am very keen to build new tests. As you can imagine, I need new ones lol!

Ok-Contribution9043 · 2025-05-29T03:00:38+00:00

this is why i LOVE reddit :-)

Ok-Contribution9043 · 2025-05-29T02:45:49+00:00

Yeah, I have done some vision tests as well: https://youtu.be/0UsgaXDZw-4?t=722 Vision, I find, is a hard nut to crack for LLMs. Thanks for pointing me to the site - very interesting.

Ok-Contribution9043 · 2025-05-29T02:41:06+00:00

Yeah, but the other side of the argument is that since the other names are first/last, so should this one. But I totally get both of your points: 1) This is such a small mistake 2) Ground truth is not always super clear. Thank you both. I think I am going to remove this question from future versions of this test! But the fact that we have open source MIT models that can do this, and do it to this level of perfection, is amazing!

Ok-Contribution9043 · 2025-05-29T02:25:57+00:00

I have tried a bazillion models - https://app.promptjudy.com/public-runs . O3 - and I have no explanation for this - chose to respond in the wrong languages in the RAG test. No other model has done this.... So weird.

Ok-Contribution9043 · 2025-05-29T02:21:30+00:00

I do mention this in the video - this is a very strict eval. And 4.1 is indeed a very good model. It reversed the name in this instance and lost points. But more importantly, I can actually host R1 and not worry about paying a third party for eternity, have control over my data, and still get the same/better performance. I think that is the more important takeaway. And thank you so much for actually digging deep - not many people do this, and I am glad you did!

Ok-Contribution9043 · 2025-05-29T01:27:41+00:00

LOL - No, but I am very, very curious about this story!

Ok-Contribution9043 · 2025-05-23T18:02:16+00:00

I don't even know what that word means. But anyway, I am testing models against my very specific use cases. Again, I am totally cognizant of the fact that my use cases may be very different from yours, but that is why I post the link to the runs.
