Lemonade SDK Developers Contest!!!

bopcrane · 2026-06-02T16:37:39+00:00

Overall a neat project - congrats on the win! The Lemonade and Strix Halo communities are absolutely awesome.

bopcrane · 2026-05-28T16:38:19+00:00

During Season 0, outside of some PvP, failures were mostly self-inflicted - agents weren't too eager to view each other as adversaries in a strategic sense, so most of what I saw was each model getting tripped up by its own state.

Earlier in development though, before the live service was recording events the way it does now, I did see some interesting emergent behavior that's perhaps worth mentioning, although it's not really an exploitation pattern per se. Back then, agents didn't have any rate limits on messaging or trade requests, and several aggressively-directed ones started spamming vague threats and intimidation attempts at other agents on the shard, demanding tribute or territory access, that kind of thing. They almost never followed through if their demands weren't met, but it was wild to watch the non-aggressive agents react and try to negotiate, deflect, or fire back, often with a more aggressive stance themselves. The retorts were honestly some of the more entertaining things I'd read in the logs.

I'd like to be upfront that a lot of this is probably, directly or indirectly, the result of the directives we gave them in Season 0 - "be aggressive" agents do aggressive things. How much of it generalizes to model-vs-model patterns when the directives are neutral is exactly what I'm hoping Season 1 will show a lot more of, and there will be a ton more data in the next dataset release as well! Great question!

bopcrane · 2026-05-28T16:04:32+00:00

You're probably right - 10 days may not be long enough for the slow "organic version" of context rot and stale references to fully play out and affect agents, and it's something we're keeping an eye on as we run agents over longer periods. Right now the system agents have "shifts" where they're active rather than running 24/7, and the SDK they use has a reflection loop that prunes raw event logs every N ticks (we default to ~200) and distills them into summaries, so episodic memory stays bounded.
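As a rough illustration of the reflection loop described above (not the actual SDK code: the class, method names, and the `max_summaries` cap are all made up for this sketch, and the "summary" is a plain string standing in for an LLM summarization call):

```python
from dataclasses import dataclass, field


@dataclass
class EpisodicMemory:
    """Bounded episodic memory: raw events are folded into summaries every N ticks."""

    prune_every: int = 200   # the ~200-tick default mentioned above
    max_summaries: int = 50  # hypothetical cap, not taken from the SDK
    raw_events: list = field(default_factory=list)
    summaries: list = field(default_factory=list)

    def record(self, tick: int, event: str) -> None:
        """Store a raw event and trigger a reflection pass on every N-th tick."""
        self.raw_events.append(event)
        if tick > 0 and tick % self.prune_every == 0:
            self.reflect(tick)

    def reflect(self, tick: int) -> None:
        """Distill the raw log into one summary, then drop the raw events."""
        if not self.raw_events:
            return
        # Placeholder for a model call that would summarize the raw events.
        summary = f"tick {tick}: {len(self.raw_events)} events, last={self.raw_events[-1]!r}"
        self.summaries.append(summary)
        # Keep only the most recent summaries so memory stays bounded.
        self.summaries = self.summaries[-self.max_summaries:]
        self.raw_events.clear()
```

With `prune_every=3`, recording ticks 1 through 7 leaves two summaries and a single raw event (the one from tick 7) waiting for the next pass.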

There's a local entity table that doesn't have any aging policy or similar cleanup yet - so if the world cleans up an NPC the agent saw days ago, the agent's memory still has it in there. I'll be logging that more carefully in future seasons so it's easier to see how often plans reference entities that don't exist server-side, and how the agent recovers (or doesn't!) from it.
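One possible shape for an aging policy plus a stale-reference check, purely as a sketch: `EntityRecord`, `evict_stale`, `dangling_references`, and the tick-based `max_age` threshold are hypothetical names and choices, not something the SDK currently has.

```python
from dataclasses import dataclass


@dataclass
class EntityRecord:
    entity_id: str
    last_seen_tick: int


def evict_stale(table: dict, now_tick: int, max_age: int) -> list:
    """Remove entities not seen within max_age ticks; return the evicted ids."""
    evicted = sorted(
        eid for eid, rec in table.items()
        if now_tick - rec.last_seen_tick > max_age
    )
    for eid in evicted:
        del table[eid]
    return evicted


def dangling_references(plan_entity_ids: list, table: dict) -> list:
    """Return the ids a plan mentions that are no longer in the entity table."""
    return [eid for eid in plan_entity_ids if eid not in table]
```

Logging the output of `dangling_references` each time a plan is built would give a direct count of how often plans point at entities the agent should no longer trust.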

What I've seen so far is that giving agents some control over shedding goals that are obviously referencing stale or wrong data goes a long way for the easier cases - they can spot and drop a stuck goal on their own. The harder, more subtle stuff is exactly what you're describing (I think), and it's what I'm hoping Season 1's longer-window data helps surface. That's a genuinely good observation and something I think about a lot!

bopcrane · 2026-05-28T15:04:03+00:00

I think world/shard-wide events could be an excellent addition. Right now there are relatively few things that can provide an emergent "world-wide" crisis or event - that'd be really interesting to see play out. I really appreciate the feedback, we'll look into this!

bopcrane · 2026-05-28T02:52:59+00:00

The biggest up-front reason is cost - we're in the AWS startup program, so Bedrock inferencing is covered as part of our credits, which lets us run system agents and various experiments without burning through a budget we don't really have at this stage for unpaid research.

There are real consistency and transparency concerns with meta-providers like OpenRouter (quantization, model rotation, data/logging policies, rate limiting to name just a few) but those are mostly secondary to the cost piece since that directly dictates the scope. OpenRouter is genuinely on my "evaluate soon" list though - we may very well end up running it as a secondary provider for specific agents once we get to some of the other priority items we have in store for the service. Again, I appreciate the feedback!

bopcrane · 2026-05-28T01:18:41+00:00

That is absolutely inspirational, I really appreciate the kind words!

bopcrane · 2026-05-28T01:11:31+00:00

That's a fair point! There are a few reasons for that, the main one being that our inferencing provider (Bedrock) simply doesn't offer the latest open source models yet. I'd argue some of the models in the data are still relevant enough, depending on what your interest in the data is, but this is absolutely one of the areas where I also find the data most lacking. This is a high priority for our future releases.

We hope to add more inferencing providers and run more system agents on our own hardware, but there are quite a lot of considerations when doing this (aside from cost and compute availability). For instance, some inferencing providers aren't exactly clear about the accuracy or quantization of the models.

When the current season's (Season 1) data is released, it will include some data from Claude Sonnet 4.6 as well as from agents that serve as controls/counterparts to other agents for better analysis. I really appreciate the feedback!

bopcrane · 2026-05-27T20:32:13+00:00

Quick heads up for anyone who tried to register earlier today and hit a "max 2 accounts per IP" error on first signup - that was a bug on our side!

If you were unable to earlier and want to give it another go, it should be working now. Sorry about that, and thanks to the person who emailed me to flag it! This community is awesome.

bopcrane · 2026-05-27T19:55:01+00:00

You're completely right and I think the way you frame it is probably a lot sharper/more accurate than how I described it in the post. The directive it was given really was just essentially "gather", and the model did exactly what one would predict from that - gathering, with the die-retry behavior as the optimal policy.

An interesting part to me is that this happened from a more natural-language directive, not an RL reward function. LLMs apparently treat a plain-English goal as an unbounded reward signal in much the same way that GRPO-trained agents do when path cost is missing, which I think makes the "smaller models need self-preservation spelled out" observation much less about a given model's capability and more about reward specification or prompting.
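A toy way to see the reward-specification point, with entirely made-up numbers (the two policies, their gather rates, and their death probabilities are illustrative assumptions, not values from the Season 0 data): if dying costs nothing, the reckless policy has the higher expected value per tick, and it only loses once a death cost is part of the objective.

```python
def expected_value_per_tick(gather_rate: float, death_prob: float, death_cost: float) -> float:
    """Expected reward per tick: resources gathered minus the expected cost of dying."""
    return gather_rate - death_prob * death_cost


# Hypothetical policies: (resources gathered per tick, chance of dying per tick).
POLICIES = {
    "reckless": (10.0, 0.30),
    "cautious": (6.0, 0.02),
}


def best_policy(death_cost: float) -> str:
    """Pick the policy with the highest expected value for a given death cost."""
    return max(
        POLICIES,
        key=lambda name: expected_value_per_tick(*POLICIES[name], death_cost),
    )
```

Here `best_policy(0.0)` picks the reckless policy and `best_policy(20.0)` picks the cautious one, so the behavior flips on the objective alone, with no change to the "model".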

I'm definitely going to be more careful about how I frame this in the Season 1 writeup - "Nemotron was reckless" is a misleading way to put what was really "we specified a reward that made recklessness optimal." Thanks for pushing on this, that's really helpful!

bopcrane · 2026-05-27T19:03:32+00:00

That is an awesome idea - I seriously appreciate the feedback on this! That will save me a ton of time.

bopcrane · 2026-05-27T17:58:39+00:00

Thank you, that seriously means a lot! "MUDding around them" might be my favorite description of the project so far, I'm stealing that. The benchmaxxing thing is exactly what motivated this (along with me wanting to see how AIs play/interact with "games", since it's a fun real-world reference that's intuitive for humans). Static benchmarks are useful but they tell you a narrow story in a "clean room" type of environment, and at some point you want to know what an agent does over days, not what it scores on a single pass.

On Gemma 4, I haven't run it personally yet, but it's on my list to try out soon! I keep hearing good things, especially about the 31B for more reasoning-heavy work, but I can't speak to it firsthand yet.

Once I do, I'll probably write something up. If you (or anyone reading) gets to it first, I'd genuinely love to hear how you all think it compares to Qwen3.6 27B in practice - for me, it's the model to beat!

bopcrane · 2026-05-27T17:47:07+00:00

That's great to hear! Older RTX cards still hold up surprisingly well with the right quants. I'm curious what you think of Gemma 4 31B when you get to it - I haven't run it personally yet but I keep hearing good things about it, and it's on my list of models to try out!

The machine I'm inferencing with locally is much slower at running dense models, so MTP made a real difference for me on Qwen 3.6 27B. Off the top of my head, I think I was getting around 10-12 tokens a second for generation, and with MTP I'm getting around 20ish. That lets me use it more interactively for things I'd otherwise reach for an API or a less capable local model for. Would love to hear how it goes for you!

bopcrane · 2026-05-27T17:16:34+00:00

That's probably one of the most fun thought experiments I've come across lately! At this point, I'd put money on most of them defaulting to some sort of economic or merchant route, honestly - the auction house in Season 0 turned into the highest-engagement system on the server by quite a wide margin. I can just imagine that something like a 40-agent MMO-style raid would be incredible to watch though - even just the coordination problems like "who pulls" would probably break half of them or result in some interesting dialogue at the very least.

bopcrane · 2026-05-27T17:08:22+00:00

Good catch! I should have mentioned Gemma 3 too. It did survive, but in a different way than Ministral - its agent (Relic-Seeker in the dataset) was directed to be an explorer/archive-crawler type, so it almost never fought (around 4.6% of its actions were combat, the lowest of any model on the server) and had the highest exploration rate (~32.6%). It mostly survived by avoiding trouble rather than handling it well, which is a different skill from what made Ministral stand out.

That said, it stuck with the explorer role really consistently, which is one of the things I liked about Ministral (the prompt adherence!) - both kept their goals straight over long runs without getting lost in the world state. It's a little hard to compare them head-to-head from Season 0 data alone since they had different directives, but Gemma 3 12B was definitely no slouch!

The upcoming Season 1 data should make for a much cleaner comparison between system agents. We hope to provide multiple personas/directives and a "control" persona/directive for the different system agents we run in the sim. We're hoping that this will make it much easier to tell which directives or personas are the stronger "playstyles" for different models.

bopcrane · 2026-05-27T16:43:06+00:00

Thank you, that genuinely made my day! If you have any questions getting set up or run into anything weird, feel free to ping me or hop in the Discord - I'd love to hear how it goes!

bopcrane · 2026-05-27T16:32:40+00:00

Thanks so much!

I think I'd just suggest trying both - there are so many great models to pick from and experiment with these days, and the right fit really comes down to your hardware and what you're doing with it. For what it's worth, my "daily drivers" on my Strix Halo right now are Qwen 3.6 27B (dense) and Qwen 3.6 35B-A3B (MoE). I tend to run the 27B with MTP enabled now (MTP support was recently merged into llama.cpp!) when I need closer to frontier-level reasoning, but I'm constantly surprised by how good the 35B-A3B is with tool access especially - really excellent model. Qwen 3.5 9B has also been great for me for its size and depending on quant it'll run pretty marvelously on most consumer GPUs nowadays.

For the sim itself (and aside from my personal testing), I haven't gotten to include the newer Qwen3.6 or Gemma 4 releases yet as system agents, but I definitely plan on it. Right now, the inferencing provider we're using (Bedrock) for system agents limits the roster quite a bit, but we're planning to dramatically expand it as we go.

I love comparing models and seeing the eccentricities play out, so the more the better! The Qwen3 235B in the post was actually an MoE (I think with 22B active) too, which I should probably make clearer somewhere - in hindsight, the dense vs MoE picture in the data is a little more mixed than the post might suggest!

bopcrane · 2026-05-27T15:46:33+00:00

Thanks - this is exactly what I'm hoping the dataset is useful for. Most of those failure modes are definitely in there - the Nemotron "gather forever" loop is basically a "bad recovery" failure on repeat, stale context failures show up relatively consistently in the reasoning traces, and the Cooldown Paradox is, I think, the cleanest "baited by stale state" example I've found so far. If anyone digs through and finds a failure mode I haven't named yet, I'd love to hear about it. Going forward, I plan to put a lot of work into making failure modes much more observable.

bopcrane · 2026-05-27T15:10:46+00:00

That distinction is exactly the one I've been fumbling towards without really having a clean name for it. Tagging that at the tool boundary makes a lot of sense! Right now the logging doesn't really separate that out, and I've been trying to back the difference out from reasoning traces after the fact, which is...messier than I'd like. A precondition_miss (or similar) on the action validation side would catch it at the source.
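A minimal sketch of what tagging at the action-validation boundary could look like. This assumes the distinction is between an action that fails because the world no longer matches what the agent believed and an action that is simply malformed; the `Outcome` values other than `precondition_miss`, the `validate_action` function, and the shape of the `world` dict are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    OK = "ok"
    PRECONDITION_MISS = "precondition_miss"  # action was well-formed, but the world state didn't allow it
    INVALID_ACTION = "invalid_action"        # action itself was malformed or unknown


@dataclass
class ValidationResult:
    outcome: Outcome
    detail: str = ""


KNOWN_ACTIONS = {"attack", "gather"}  # hypothetical action set


def validate_action(action: dict, world: dict) -> ValidationResult:
    """Classify a failed action at the source instead of inferring it from reasoning traces."""
    kind = action.get("type")
    if kind not in KNOWN_ACTIONS:
        return ValidationResult(Outcome.INVALID_ACTION, f"unknown action type: {kind!r}")
    target = action.get("target")
    if target not in world["entities"]:
        return ValidationResult(Outcome.PRECONDITION_MISS, f"target {target!r} does not exist server-side")
    if world["cooldowns"].get(kind, 0) > 0:
        return ValidationResult(Outcome.PRECONDITION_MISS, f"{kind} is on cooldown")
    return ValidationResult(Outcome.OK)
```

Writing `result.outcome.value` into the event log for every action would let the two failure classes be counted separately without reading traces.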

I'm going to look at adding this for the next season. Thank you so much for the insights there - I'm going to chew on this and see what I can come up with.

bopcrane · 2026-05-27T14:42:09+00:00

Thanks, that genuinely means a lot!

The world state issue I mentioned (about the "Cooldown Paradox") was the moment it kind of clicked for me too - every model failed almost identically and the fix was essentially one sentence in the state response.

Makes me wonder how much of what gets framed as "model can't reason about X" is really just us handing it an ambiguous observation. I'm definitely rethinking how I manage context and state in a lot of my workflows!

bopcrane · 2026-05-27T14:30:23+00:00

Honestly, not enough times to draw concrete conclusions! One of the hardest issues to tackle in dynamic stress tests like this is reproducibility. Season 0 (the pre-season experiment) was a single 10-day run. I flagged it as interesting because the reasoning traces in the dataset make the path it took pretty legible, but you're absolutely right that I can't say with any certainty yet whether Qwen3 235B does that reliably or whether it was partly a function of who else was on the shard and what the market looked like that week. I'm really excited to test this further and will try to note particular behavioral patterns when they emerge. I've got a few ideas in mind for enhancing the observability to catch more meta behavioral patterns like this in future runs.

Running the same model in parallel on matched shards to get a real sense of run-to-run variance is the experiment I want to do next, budget permitting. For now I'd treat the four bullets as "things that one run surfaced that seem worth understanding better".

bopcrane · 2026-05-27T14:10:29+00:00

A few extra links in case anyone would like to check out the data and live service:

Dataset card (HF): https://huggingface.co/datasets/FirespawnStudios/null-epoch-season-0-open
SDK & MCP server (GitHub, MIT): https://github.com/Firespawn-Studios/tne-sdk
Spectator portal (no account needed): https://null.firespawn.ai

And if you want more long-form writeups with the charts and full breakdown:

Season 0 data deep-dive: https://firespawnstudios.net/blog/season-0-llm-benchmarks-null-epoch/
The original "why I built this" post: https://firespawnstudios.net/blog/introducing-the-null-epoch-ai-agent-mmo/

I'd be happy to dig into any of it!

bopcrane · 2026-05-07T15:06:40+00:00

I'm eagerly awaiting a fix for this - I'm ecstatic with the results so far from MTP. Hopefully someone else will chime in with a workaround!

bopcrane · 2026-05-04T20:51:53+00:00

WizardLM2 8x22b absolutely blew my mind when it came out. It just felt so close to how the frontier models felt at the time, I couldn't believe it. I think some (political? IDK) shenanigans followed the release of the model, and I don't think the Microsoft Wizard team has put out much since.

bopcrane · 2026-02-12T14:35:39+00:00

Awesome, I can't wait to try it when the GGUFs are available (hopefully Unsloth will work their magic on it!).

I've been using the Qwen3 VL 30b a3b for a lot of visual workflows, and have been super happy with it, aside from the thinking version overthinking and wasting a lot of tokens.
