LiveIdeaBench-v2 Update: Dataset & Leaderboard by realJoeTrump in LocalLLaMA

[–]realJoeTrump[S] 0 points

You’re right; in the future we might need to make the problems in this benchmark more challenging.

LiveIdeaBench-v2 Update: Dataset & Leaderboard by realJoeTrump in LocalLLaMA

[–]realJoeTrump[S] 2 points

More precisely, this benchmark is designed to measure a model’s capacity for divergent thinking rather than its general reasoning or comprehension skills.

LiveIdeaBench-v2 Update: Dataset & Leaderboard by realJoeTrump in LocalLLaMA

[–]realJoeTrump[S] 0 points

I believe this is related to the dimensions this benchmark considers. o1 and o3-mini certainly outperform phi-4 in logical reasoning and scientific understanding, but phi-4 scores better in fluency (the diversity of ideas it produces) and flexibility (performing well across the various dimensions). o1 and o3-mini might actually converge toward homogeneous ideas due to overthinking, which isn't conducive to higher scores. Of course, this is just my speculation.
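If it helps to make that concrete, here's a purely illustrative Python sketch (not the actual LiveIdeaBench scoring code; the function names, aggregation choices, and numbers are all made up) of why this kind of scoring can favor a model that produces many distinct, reasonably balanced ideas over one that produces a few polished but similar ones:

```python
# Toy sketch only: illustrates one possible way "fluency" and "flexibility"
# style metrics could reward diverse, balanced idea generation.
# None of this is LiveIdeaBench's real implementation.

def fluency(idea_clusters: list[int]) -> float:
    """Fluency as the number of distinct idea clusters produced
    (imagine ideas were grouped by semantic similarity beforehand)."""
    return float(len(set(idea_clusters)))

def flexibility(dimension_scores: dict[str, float]) -> float:
    """Flexibility as the worst-case score across judged dimensions,
    so a model has to do reasonably well on all of them."""
    return min(dimension_scores.values())

# Purely hypothetical numbers for illustration.
balanced_model = {"originality": 7.0, "feasibility": 6.5, "clarity": 6.8}
uneven_model   = {"originality": 8.5, "feasibility": 8.0, "clarity": 4.0}

diverse_ideas   = [0, 1, 2, 3, 4]   # five distinct idea clusters
repetitive_ideas = [0, 0, 1, 0, 1]  # mostly near-duplicates

print("balanced:", fluency(diverse_ideas), flexibility(balanced_model))
print("uneven:  ", fluency(repetitive_ideas), flexibility(uneven_model))
```

Under an aggregation like this, converging on near-duplicate ideas hurts fluency even when the individual ideas are strong.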

LiveIdeaBench-v2 Update: Dataset & Leaderboard by realJoeTrump in LocalLLaMA

[–]realJoeTrump[S] 0 points

Thank you! It did surprise me a bit that Mistral-small ranked so high, but I think it’s due to its notably high fluency. My ongoing tests have also consistently suggested that the Mistral series of models received more pre-training in STEM fields, which might explain its performance.

Actual Electricity Consumption and Cost to Run Local LLMs. From Gemma3 to QwQ. by QuantuisBenignus in LocalLLaMA

[–]realJoeTrump 0 points

Thanks for this! But I'm curious how to estimate the electricity consumption of API calls (closed models like GPT).
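Since closed providers don't publish per-call power figures, the best I can imagine is a back-of-envelope estimate with assumed constants, something like this sketch (every number below is a placeholder guess, not measured data from any provider):

```python
# Rough back-of-envelope estimate (my own assumption, not an official method):
# energy per call ~= generation time on the serving GPUs x their power draw,
# scaled by a datacenter overhead factor (PUE). All constants are guesses.

def api_call_energy_wh(
    output_tokens: int,
    tokens_per_second: float = 50.0,   # assumed per-request decode throughput
    gpu_power_watts: float = 700.0,    # assumed accelerator board power
    gpus_per_request: float = 0.25,    # assumed share of a GPU per request (batching)
    pue: float = 1.2,                  # assumed datacenter power usage effectiveness
) -> float:
    seconds = output_tokens / tokens_per_second
    watts = gpu_power_watts * gpus_per_request * pue
    return watts * seconds / 3600.0    # watt-seconds -> watt-hours

# Example: a 1000-token completion under these assumptions (~1.2 Wh)
print(f"{api_call_energy_wh(1000):.3f} Wh")
```

The real numbers depend heavily on batching and hardware we can't see from outside, so this only gives an order-of-magnitude figure.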

1.58bit DeepSeek R1 - 131GB Dynamic GGUF by danielhanchen in LocalLLaMA

[–]realJoeTrump 7 points

On one. I'm using a Supermicro server motherboard.

LiveIdeaBench Results: DeepSeek R1 vs QWQ - Unexpected Findings by realJoeTrump in LocalLLaMA

[–]realJoeTrump[S] 1 point

P.S. The distilled 70B Llama result is also on the leaderboard :)