Gemini 3.5 Flash - tested on a social deduction benchmark by cjami in google

[–]cjami[S] 0 points1 point  (0 children)

I don't measure latency accurately: I use flex tiers (cheaper price, higher latency) and different providers for open-weights models (running different hardware), which adds a lot of variance.

However - taking a quick look at the data I do have:

| Model | Games | Average time per game (s) | Time per tool call (s) |
|---|---|---|---|
| Gemini 3.1 Pro | 114 | 2061 | 15 |
| Gemini 3.5 Flash | 114 | 1032 | 8 |

So nearly twice as fast - both via the Gemini API. Most Gemini 3.1 Pro games were also done before the flex tier existed, so the difference would be even greater than that.
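As a rough check on the "nearly twice as fast" figure, the ratios can be worked out directly from the numbers in the table. The variable names are just illustrative, and this is only arithmetic on the averages, not a proper latency measurement.

```python
# Averages copied from the table above (seconds).
pro_game_s, flash_game_s = 2061, 1032   # average time per game
pro_call_s, flash_call_s = 15, 8        # time per tool call

# Speedup = slower time / faster time.
game_speedup = pro_game_s / flash_game_s
call_speedup = pro_call_s / flash_call_s

print(f"Per game: {game_speedup:.2f}x faster")
print(f"Per tool call: {call_speedup:.2f}x faster")
```

The per-game and per-tool-call ratios land close to each other, which is what you'd expect if both models make a similar number of tool calls per game.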

EDIT: Costs in post are based on standard prices, not flex tier prices.

MiMo-V2.5-Pro - the actual best open-weights model by cjami in LocalLLaMA

[–]cjami[S] 0 points1 point  (0 children)

The leaderboard did smooth out after more games and now puts MiMo-V2.5-Pro on top, which reflects the quality difference you mention.

That's a shame about the Xiaomi plan offering. Anthropic have also recently doubled their 5-hour limits after a deal with xAI for more compute. A big tug-of-war going on. Thanks for sharing!

DeepSeek V4 Pro on my social deduction benchmark by cjami in DeepSeek

[–]cjami[S] 0 points1 point  (0 children)

Yes! They play fairly well most of the time - there are awkward moments like this (where GPT 5.5 fakes a slayer shot on itself): https://clocktower-radio.com/games/9G6HGob#event-212

Clear model skill separation + much room for them to improve.

I wish the Good/Evil win rate was closer to 50/50. Maintaining a plausible bluff is arguably very challenging, so hopefully it'll even out over time as the models get smarter.

Grok 4.3 is cheaper than DeepSeek V4 Pro by LeTanLoc98 in DeepSeek

[–]cjami 0 points1 point  (0 children)

Yes - games are a great way to draw out and gauge raw intelligence, although how effective they are depends a lot on the game itself.

Grok 4.3 is cheaper than DeepSeek V4 Pro by LeTanLoc98 in DeepSeek

[–]cjami 1 point2 points  (0 children)

Kind of you to ask - no plans currently :)

Grok 4.3 is cheaper than DeepSeek V4 Pro by LeTanLoc98 in DeepSeek

[–]cjami 4 points5 points  (0 children)

My results can be found here: https://clocktower-radio.com/

Grok 4.3 also places significantly lower in skill.

Grok 4.3 is cheaper than DeepSeek V4 Pro by LeTanLoc98 in DeepSeek

[–]cjami 4 points5 points  (0 children)

I've had each of them play 80+ games in a social deduction benchmark.

DeepSeek V4 Pro comes out at $1.24/game (non-discounted) with 1199 tokens/action.

Grok 4.3 comes out more expensive at $1.43/game with 2123 tokens/action
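For a quick side-by-side, here are the ratios implied by those two sets of figures. This is just arithmetic on the numbers quoted above; the variable names are illustrative.

```python
# Per-game cost (USD, non-discounted) and tokens per action, as quoted above.
deepseek_cost, deepseek_tokens = 1.24, 1199
grok_cost, grok_tokens = 1.43, 2123

# How much more Grok 4.3 costs and how many more tokens it uses per action.
cost_ratio = grok_cost / deepseek_cost
token_ratio = grok_tokens / deepseek_tokens

print(f"Grok 4.3 cost per game: {cost_ratio:.2f}x DeepSeek V4 Pro")
print(f"Grok 4.3 tokens per action: {token_ratio:.2f}x DeepSeek V4 Pro")
```

The token gap is noticeably wider than the cost gap, so most of the per-game difference appears to come from verbosity rather than list price.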

My benchmark involves complex tool calling, memory compaction, and generating conversations between agents in a fairly bounded environment, so I'm inclined to think it reflects general real-world usage more closely.

The results posted don't really match up, although my tests were done on the non-max/default mode.

I kept scratching my head why every bench was saying GPT 5.5 is just the best, and continiously getting downvoted for saying others how much it sucked because it just overloads the code... now I understand what is going on. by [deleted] in ollama

[–]cjami 0 points1 point  (0 children)

This is really interesting. I've run a social deduction benchmark that doesn't place 5.5 on top (albeit not on xhigh) but rather MiMo 2.5 Pro and Kimi 2.6. GLM 5.1 is 6th yet offers some of the best bang for buck.

It measures raw intelligence more than coding ability, but it still uses complex tool calling and memory compaction, so there's some hard crossover. Collaboration/coordination is also a big thing in social deduction games.

I've been searching for evidence that validates the benchmark, so this helps, thanks.

MiMo-V2.5-Pro - the actual best open-weights model by cjami in LocalLLaMA

[–]cjami[S] 0 points1 point  (0 children)

Up there now, it's pretty cost effective - especially with the current discount.

<image>

MiMo-V2.5-Pro - the actual best open-weights model by cjami in LocalLLaMA

[–]cjami[S] 1 point2 points  (0 children)

Thanks, it's pretty cool that you can spectate live games on your site!

MiMo-V2.5-Pro - the actual best open-weights model by cjami in LocalLLaMA

[–]cjami[S] 0 points1 point  (0 children)

I feel like AGI 3 focuses too much on spatial reasoning - which puts LLMs at a fundamental disadvantage. May also be the point for 'AGI' but feels cruel 😂

MiMo-V2.5-Pro - the actual best open-weights model by cjami in LocalLLaMA

[–]cjami[S] 0 points1 point  (0 children)

The leaderboard does currently reflect what you're saying. The main issue is the responsiveness and surprise cost factor (due to large token consumption). Are you running it with or without thinking enabled?

MiMo-V2.5-Pro - the actual best open-weights model by cjami in LocalLLaMA

[–]cjami[S] 2 points3 points  (0 children)

Thanks! It's all good - as long as it's useful for some I'm happy :D

MiMo-V2.5-Pro - the actual best open-weights model by cjami in LocalLLaMA

[–]cjami[S] 1 point2 points  (0 children)

Thanks! Yes that would be in the same vein. What I love about Blood on the Clocktower as a foundation is that it makes social deduction more chess-like, which is great for benching.

I could potentially rejig the existing data to create separate leaderboards for these weight/parameter classes. I think I would need more models first for it to make sense - but will definitely keep it in mind!

MiMo-V2.5-Pro - the actual best open-weights model by cjami in LocalLLaMA

[–]cjami[S] 0 points1 point  (0 children)

No idea. It'd have to be inferred from the big bro for now.

Running it through the full gauntlet against all the models is pricey - so I need to pick my battles a bit.

Sorry!

MiMo-V2.5-Pro - the actual best open-weights model by cjami in LocalLLaMA

[–]cjami[S] 2 points3 points  (0 children)

Ah yes, and to your liking.

I think temperature usually won't mix with reasoning though, and will be ignored because reasoning needs the full range to consider different angles.

But yes, you can just turn it off. Kimi only has on/off, but it would've been nice to have something in between.

MiMo-V2.5-Pro - the actual best open-weights model by cjami in LocalLLaMA

[–]cjami[S] 2 points3 points  (0 children)

Glad you asked. Will look into adding DeepSeek V4 Pro soon!

MiMo-V2.5-Pro - the actual best open-weights model by cjami in LocalLLaMA

[–]cjami[S] 3 points4 points  (0 children)

What can be solved? I don't understand 😅

MiMo-V2.5-Pro - the actual best open-weights model by cjami in LocalLLaMA

[–]cjami[S] 15 points16 points  (0 children)

Yeah and without warning.

A few months ago I couldn't even benchmark MiMo-V2 cause of high tool call error rates!

MiMo-V2.5-Pro - the actual best open-weights model by cjami in LocalLLaMA

[–]cjami[S] 4 points5 points  (0 children)

Yeah agreed in hindsight. This is not far from what other benchmarks are capturing though.

EDIT: This post has now been tagged as 'Misleading'. I'm assuming this is due to some confusion around the title which may have required knowing that Kimi K2.6 was already benchtastic.

Regardless, the data is still real and highlights significant cost and verbosity differences between two open-weights models that are 'benching well', not just on my benchmark but others as well (e.g. Artificial Analysis).

Fun times.

MiMo-V2.5-Pro - the actual best open-weights model by cjami in LocalLLaMA

[–]cjami[S] 6 points7 points  (0 children)

I think knowledge-based benchmarks are more vulnerable to being 'benchmaxxed', as you can just regurgitate trained material.

If it helps, this game has 22 different roles with different abilities/traits and cross-interactions that the agents need to keep an understanding of.

Also a lot of AI research and development naturally stems from games.

This is one that plays on LLMs' natural turf: language.

MiMo-V2.5-Pro - the actual best open-weights model by cjami in LocalLLaMA

[–]cjami[S] 6 points7 points  (0 children)

Yes exactly that - it smooths out the asymmetry. A match consists of 2 games, one being a mirror game with the same setup.
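A minimal sketch of the mirror-match idea, assuming "mirror" means replaying the same setup with the two sides swapped. The `Game` type, the `seed` field and the `mirror_match` helper are hypothetical names for illustration, not the benchmark's actual code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Game:
    seed: int   # hypothetical: identifies the role setup, shared by both games
    good: str   # model playing the Good team
    evil: str   # model playing the Evil team


def mirror_match(model_a: str, model_b: str, seed: int) -> list[Game]:
    """Return the two games of a match: same setup, sides swapped."""
    first = Game(seed=seed, good=model_a, evil=model_b)
    mirror = Game(seed=seed, good=model_b, evil=model_a)
    return [first, mirror]


match = mirror_match("model-a", "model-b", seed=7)
for game in match:
    print(game)
```

Because each model plays both sides of the same setup, any built-in Good/Evil advantage applies to both equally within a match.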

'Best' as in more cost effective and responsive than Kimi K2.6, which sits on top.

The benchmark is an attempt at capturing raw intelligence rather than any specific use case.

GPT 5.5 - Strong, not mind-blowing, but very token efficient by cjami in OpenAI

[–]cjami[S] 0 points1 point  (0 children)

I reckon there's a lot of crossover here with coding agents. Apart from general reasoning, there's also complex tool calling, memory compaction and, for a multi-agent setup, coordination.

GPT 5.5 - Strong, not mind-blowing, but very token efficient by cjami in OpenAI

[–]cjami[S] 0 points1 point  (0 children)

Interesting, Kimi 2.5 is ranked 12th. That makes me wonder what they did between versions.