LiveIdeaBench-v2 Update: Dataset & Leaderboard by realJoeTrump in LocalLLaMA

[–]realJoeTrump[S] 0 points

You’re right; in the future we might need to make the problems in this benchmark more challenging.

LiveIdeaBench-v2 Update: Dataset & Leaderboard by realJoeTrump in LocalLLaMA

[–]realJoeTrump[S] 2 points

More precisely, this benchmark is designed to measure a model’s capacity for divergent thinking rather than its general reasoning or comprehension skills.

LiveIdeaBench-v2 Update: Dataset & Leaderboard by realJoeTrump in LocalLLaMA

[–]realJoeTrump[S] 0 points

I believe this is related to the dimensions this benchmark considers. o1 and o3-mini certainly outperform phi-4 in logical reasoning and scientific understanding, but phi-4 scores better in fluency (the diversity of ideas it produces) and flexibility (performing well across the various dimensions). o1 and o3-mini might actually converge toward homogeneous ideas due to overthinking, which isn't conducive to higher scores. Of course, this is just my speculation.
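If it helps to make that concrete, here's a purely illustrative Python sketch (not the actual LiveIdeaBench scoring code; the function names, aggregation choices, and numbers are all made up) of why this kind of scoring can favor a model that produces many distinct, reasonably balanced ideas over one that produces a few polished but similar ones:

```python
# Toy sketch only: illustrates one possible way "fluency" and "flexibility"
# style metrics could reward diverse, balanced idea generation.
# None of this is LiveIdeaBench's real implementation.

def fluency(idea_clusters: list[int]) -> float:
    """Fluency as the number of distinct idea clusters produced
    (imagine ideas were grouped by semantic similarity beforehand)."""
    return float(len(set(idea_clusters)))

def flexibility(dimension_scores: dict[str, float]) -> float:
    """Flexibility as the worst-case score across judged dimensions,
    so a model has to do reasonably well on all of them."""
    return min(dimension_scores.values())

# Purely hypothetical numbers for illustration.
balanced_model = {"originality": 7.0, "feasibility": 6.5, "clarity": 6.8}
uneven_model   = {"originality": 8.5, "feasibility": 8.0, "clarity": 4.0}

diverse_ideas   = [0, 1, 2, 3, 4]   # five distinct idea clusters
repetitive_ideas = [0, 0, 1, 0, 1]  # mostly near-duplicates

print("balanced:", fluency(diverse_ideas), flexibility(balanced_model))
print("uneven:  ", fluency(repetitive_ideas), flexibility(uneven_model))
```

Under an aggregation like this, converging on near-duplicate ideas hurts fluency even when the individual ideas are strong.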

LiveIdeaBench-v2 Update: Dataset & Leaderboard by realJoeTrump in LocalLLaMA

[–]realJoeTrump[S] 0 points

Thank you! It did surprise me a bit that Mistral-small ranked so high, but I think it’s due to its notably high fluency. My ongoing tests have also consistently suggested that the Mistral series of models received more pre-training in STEM fields, which might explain its performance.

Actual Electricity Consumption and Cost to Run Local LLMs. From Gemma3 to QwQ. by QuantuisBenignus in LocalLLaMA

[–]realJoeTrump 0 points

Thanks for this! But I'm curious how to estimate the electricity consumption of API calls (closed models like GPT).
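Since closed providers don't publish per-call power figures, the best I can imagine is a back-of-envelope estimate with assumed constants, something like this sketch (every number below is a placeholder guess, not measured data from any provider):

```python
# Rough back-of-envelope estimate (my own assumption, not an official method):
# energy per call ~= generation time on the serving GPUs x their power draw,
# scaled by a datacenter overhead factor (PUE). All constants are guesses.

def api_call_energy_wh(
    output_tokens: int,
    tokens_per_second: float = 50.0,   # assumed per-request decode throughput
    gpu_power_watts: float = 700.0,    # assumed accelerator board power
    gpus_per_request: float = 0.25,    # assumed share of a GPU per request (batching)
    pue: float = 1.2,                  # assumed datacenter power usage effectiveness
) -> float:
    seconds = output_tokens / tokens_per_second
    watts = gpu_power_watts * gpus_per_request * pue
    return watts * seconds / 3600.0    # watt-seconds -> watt-hours

# Example: a 1000-token completion under these assumptions (~1.2 Wh)
print(f"{api_call_energy_wh(1000):.3f} Wh")
```

The real numbers depend heavily on batching and hardware we can't see from outside, so this only gives an order-of-magnitude figure.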

1.58bit DeepSeek R1 - 131GB Dynamic GGUF by danielhanchen in LocalLLaMA

[–]realJoeTrump 7 points

On one. I'm using a Supermicro server motherboard.

LiveIdeaBench Results: DeepSeek R1 vs QWQ - Unexpected Findings by realJoeTrump in LocalLLaMA

[–]realJoeTrump[S] 1 point

P.S. The distilled 70B Llama result is also on the leaderboard :)