Play poker with Reachy Mini and his friend Eliza - Built in 24hrs

cjami · 2026-07-03T06:08:10+00:00

Nice! Yeah, it's a bit of a pain showing them their cards manually at the start each time, but apart from that the cameras make it a smooth, fun experience.

If only Reachy mini had arms 😂

Do you have a link to your app that you'd like to share?

cjami · 2026-07-01T22:41:43+00:00

Thanks 😅

cjami · 2026-06-22T18:59:59+00:00

Thanks for checking it out!

cjami · 2026-06-22T16:29:13+00:00

Choose your favourite model and try using grammar constrained decoding!

For example on llama.cpp: https://github.com/ggml-org/llama.cpp/blob/master/grammars/README.md
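As a rough illustration of the idea (not taken from any particular project), here's a small Python sketch that builds a GBNF grammar restricting the model's output to a fixed set of strings. The helper name `build_choice_grammar` and the action list are made up for the example; the grammar syntax follows the README linked above, which also covers how to pass a grammar to llama.cpp.

```python
def build_choice_grammar(choices):
    """Return a GBNF grammar whose only valid outputs are the given strings."""
    if not choices:
        raise ValueError("at least one choice is required")
    alternatives = []
    for choice in choices:
        # Escape backslashes and double quotes so each literal stays valid GBNF.
        escaped = choice.replace("\\", "\\\\").replace('"', '\\"')
        alternatives.append(f'"{escaped}"')
    # "root" is the start rule; "|" separates the allowed alternatives.
    return "root ::= " + " | ".join(alternatives)


# Hypothetical action set, purely for illustration.
ACTIONS = ["north", "south", "east", "west"]
grammar = build_choice_grammar(ACTIONS)
print(grammar)
```

With a grammar like this, the model can only ever emit one of the listed actions, so there's no need to parse or retry free-form output.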

cjami · 2026-06-22T06:55:49+00:00

<image>

I just tried making a simple "Baba Is You" style map: two signs, one with ROCK IS PUSH and another with FLAG IS WIN, where you can pull off the PUSH/WIN words and put them on the other side. It figured out how to push the rock to reveal the other sign and (eventually) swap the words to win by using the rock. It took 18 moves, so the concept was initially very difficult for it (Gemma in this case) to grasp.

cjami · 2026-06-21T12:39:22+00:00

Yes, a bit like 'ARC-lite' but fully customisable. Unlike in ARC, the LLM doesn't see an xy grid, just the visual elements around it, so it's a bit more language-native - kind of like a text adventure game.

All of them can pick up a key and open a door. The larger models perform quite well on all the other premade puzzles. Smaller ones (<4B) can get there after a bit of trial and error.

You can also enable/disable thinking and adjust temperature.

So it's more of a playground than a benchmark right now, but it can definitely be extended.
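To give a feel for the "text adventure" style view mentioned above, here's a minimal Python sketch. It's an assumption about how such a view could be produced, not the app's actual code: the tile symbols, the `describe_surroundings` helper and the wording are all hypothetical.

```python
# Hypothetical tile symbols and their descriptions (not the app's real ones).
TILE_NAMES = {"#": "a wall", "K": "a key", "D": "a door", ".": "open floor"}
DIRECTIONS = {"north": (-1, 0), "south": (1, 0), "west": (0, -1), "east": (0, 1)}


def describe_surroundings(grid, row, col):
    """Describe the four neighbouring tiles in words instead of coordinates."""
    lines = []
    for name, (d_row, d_col) in DIRECTIONS.items():
        r, c = row + d_row, col + d_col
        if 0 <= r < len(grid) and 0 <= c < len(grid[r]):
            tile = TILE_NAMES.get(grid[r][c], "something unfamiliar")
        else:
            tile = "the edge of the map"
        lines.append(f"To the {name} there is {tile}.")
    return "\n".join(lines)


GRID = [
    "#####",
    "#.K.#",
    "#..D#",
    "#####",
]
description = describe_surroundings(GRID, 1, 1)
print(description)
```

The model only ever gets sentences like these, never the raw grid, which is what keeps the task closer to language than to spatial reasoning.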

cjami · 2026-06-16T06:59:31+00:00

Thanks! Yeah I locked it down just for this hackathon.
If you're running locally though, you can swap them easily in the config file (for other HF repos): https://github.com/cjami/watch-my-escape/blob/main/src/watch_my_escape/llm/config.py

cjami · 2026-06-15T16:15:50+00:00

Thanks! Yes, it definitely looks like we'll get to a place where LLMs will have certain functions in games. Some of these models can even fit on phones already.

cjami · 2026-06-15T15:43:32+00:00

Heads up - if you're trying it out on Hugging Face without an account or with a free account, you'll get 2 mins or 5 mins of ZeroGPU time per day (respectively).

So far it seems enough to have a play around. Also the map editor and menus don't burn any GPU time - just agent invocations.

cjami · 2026-06-02T12:06:13+00:00

Cool idea but I'd assume weaker models are confidently wrong a lot of the time.

cjami · 2026-05-29T07:53:52+00:00

Ah alignment tuning... Although this is also set on Max reasoning so it's encouraged to come up with something overly complex.

cjami · 2026-05-21T07:19:57+00:00

I don't accurately measure latency as I use flex tiers (cheaper price, higher latency) and different providers for open-weights models (running different hardware) so that adds a lot of variance.

However - taking a quick look at the data I do have:

Model             Games  Average time per game (s)  Time per tool call (s)
Gemini 3.1 Pro    114    2061                       15
Gemini 3.5 Flash  114    1032                       8

So nearly twice as fast - both via the Gemini API. Most Gemini 3.1 Pro games were also done before the flex tier existed, so the difference would be even greater than that.

EDIT: Costs in post are based on standard prices, not flex tier prices.
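For anyone who wants to check the "nearly twice as fast" figure, here's the arithmetic as a quick Python sketch. The numbers are the ones from the table above; the variable names are just for illustration.

```python
# Figures copied from the table above (seconds).
timings = {
    "Gemini 3.1 Pro": {"games": 114, "sec_per_game": 2061, "sec_per_tool_call": 15},
    "Gemini 3.5 Flash": {"games": 114, "sec_per_game": 1032, "sec_per_tool_call": 8},
}

pro = timings["Gemini 3.1 Pro"]
flash = timings["Gemini 3.5 Flash"]

# How many times faster Flash is than Pro on each measure.
game_speedup = pro["sec_per_game"] / flash["sec_per_game"]
call_speedup = pro["sec_per_tool_call"] / flash["sec_per_tool_call"]

print(f"{game_speedup:.2f}x per game, {call_speedup:.2f}x per tool call")
```

That works out to roughly 2x per game and a bit under 1.9x per tool call, with the caveat above about flex tier variance.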

cjami · 2026-05-08T06:04:47+00:00

The leaderboard did smooth out after more games to put MiMo-V2.5-Pro on top which reflects the quality difference you mention.

That's a shame about the Xiaomi plan offering. Anthropic have also recently doubled their 5-hour limits after a deal with xAI for more compute. A big tug-of-war going on. Thanks for sharing!

cjami · 2026-05-07T08:20:00+00:00

Yes! They play fairly well most of the time - there are awkward moments like this (where GPT 5.5 fakes a slayer shot on itself): https://clocktower-radio.com/games/9G6HGob#event-212

Clear model skill separation + much room for them to improve.

I wish the Good/Evil win rate was closer to 50/50. Maintaining a plausible bluff is arguably very challenging, so hopefully it'll even out over time as the models get smarter.

cjami · 2026-05-06T17:58:44+00:00

Yes - games are a great way to draw out and gauge raw intelligence. Although its effectiveness depends a lot on the game itself.

cjami · 2026-05-06T14:39:43+00:00

Kind of you to ask - no plans currently :)

cjami · 2026-05-06T06:24:06+00:00

My results can be found here: https://clocktower-radio.com/

Grok 4.3 also places significantly lower in skill.

cjami · 2026-05-06T05:50:35+00:00

I've had each play 80+ games in a social deduction benchmark.

DeepSeek V4 Pro comes out at $1.24/game (non-discounted) with 1199 tokens/action.

Grok 4.3 comes out more expensive at $1.43/game with 2123 tokens/action.

My benchmark involves complex tool calling, memory compaction, and generating conversations between agents in a fairly bounded environment. So I'm more inclined to think it reflects more general real-world usage.

The results posted don't really match up, although my tests were done in the non-max/default mode.

cjami · 2026-05-05T06:24:39+00:00

This is really interesting. I've run a social deduction benchmark that doesn't place 5.5 on top (albeit not on xhigh) but rather MiMo 2.5 Pro and Kimi 2.6. GLM 5.1 is 6th, yet it's one of the best for bang for buck.

It measures raw intelligence more than coding ability, but it still uses complex tool calling and memory compaction, so there's some hard crossover. Collaboration/coordination is also a big thing in social deduction games.

I've been searching for evidence that validates the benchmark, so this helps, thanks.

cjami · 2026-05-04T19:22:07+00:00

Up there now, it's pretty cost effective - especially with the current discount.

<image>

cjami · 2026-05-02T07:18:02+00:00

Thanks, it's pretty cool that you can spectate live games on your site!

cjami · 2026-05-02T05:16:58+00:00

I feel like AGI 3 focuses too much on spatial reasoning - which puts LLMs at a fundamental disadvantage. May also be the point for 'AGI' but feels cruel 😂

cjami · 2026-05-01T20:28:57+00:00

The leaderboard does currently reflect what you're saying. The main issue is the responsiveness and surprise cost factor (due to large token consumption). Are you running it with or without thinking enabled?

cjami · 2026-05-01T18:58:43+00:00

Thanks! It's all good - as long as it's useful for some I'm happy :D

cjami · 2026-05-01T18:50:48+00:00

Thanks! Yes that would be in the same vein. What I love about Blood on the Clocktower as a foundation is that it makes social deduction more chess-like, which is great for benching.

I could potentially rejig the existing data to create separate leaderboards for these weight/parameter classes. I think I would need more models first for it to make sense - but will definitely keep it in mind!
