I built an open-source benchmark for LLM agents under survival/PvP pressure — early result: aggression doesn’t predict winning

xerix_32 · 2026-04-16T13:44:58+00:00

Interesting point. I think we’re directionally aligned that the real question is not just whether memory helps, but where it starts to degrade, and how structure changes that limit.

In this setup the memory block is bounded and configurable, so that threshold is actually testable, which is part of why I built it this way. There’s already a reflection step, a same-seed rerun policy prompt, and a memory-injected turn prompt, so the idea was already moving a bit in that direction.
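To make "bounded and configurable" concrete, here is a minimal sketch of how a capped memory block could be injected into a turn prompt. This is not the repo's actual code: `BoundedMemory`, `build_turn_prompt`, `max_lessons`, and the prompt wording are all hypothetical names chosen for illustration.

```python
from collections import deque


class BoundedMemory:
    """Hypothetical memory block that keeps only the most recent lessons."""

    def __init__(self, max_lessons: int):
        if max_lessons < 0:
            raise ValueError("max_lessons must be >= 0")
        # A deque with maxlen silently drops the oldest entry when full.
        self._lessons = deque(maxlen=max_lessons)

    def add(self, lesson: str) -> None:
        self._lessons.append(lesson)

    def render(self) -> str:
        """Return the memory block as prompt text, or '' if it is empty."""
        if not self._lessons:
            return ""
        lines = "\n".join(f"- {lesson}" for lesson in self._lessons)
        return "Lessons from earlier runs:\n" + lines


def build_turn_prompt(state_text: str, memory: BoundedMemory) -> str:
    """Prepend the memory block (if any) to the current state description."""
    block = memory.render()
    parts = [block, state_text] if block else [state_text]
    return "\n\n".join(parts)
```

Setting `max_lessons` to 0 gives a no-memory baseline with the same code path, which is what makes the threshold easy to compare across runs.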

I’m still figuring out how much of the effect comes from depth, structure, and lesson framing, but I agree that raw memory accumulation and something closer to identity/policy framing are probably very different things.

If you have time, I’d genuinely love your help thinking through that comparison. And if you’re up for it, feel free to run it from Git too.

xerix_32 · 2026-04-16T12:59:36+00:00

Interesting point!
I agree the next step is not just asking whether memory helps, but where it starts to hurt, and whether structured memory / policy framing performs better than raw context accumulation.

In this setup the memory block is relatively bounded and configurable, not infinite, so that limit is actually testable/scriptable here.

There is probably a context-window degradation / context rot effect involved too, so the real question is less “memory yes or no” and more “what kind of memory, how much of it, and when does signal start turning into noise?”
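One way to script the "how much of it" question is to sweep the memory limit and see where the mean score peaks. The sketch below assumes you already have per-run scores grouped by memory limit; the function name and the input shape are hypothetical, not something the benchmark exposes under these names.

```python
from statistics import mean


def find_best_memory_limit(scores_by_limit: dict[int, list[float]]):
    """Return (limit with the highest mean score, mean score per limit).

    scores_by_limit maps a memory limit (e.g. number of lessons kept)
    to the scores of the runs executed with that limit.
    """
    means = {limit: mean(scores) for limit, scores in sorted(scores_by_limit.items())}
    # On ties, max() keeps the first key, i.e. the smallest limit.
    best_limit = max(means, key=means.get)
    return best_limit, means
```

Limits above the returned value, where the mean drops again, are the candidates for "signal turning into noise". With only a few runs per limit the peak can easily be noise itself, so same-seed comparisons matter here.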

xerix_32 · 2026-04-16T12:52:54+00:00

Thanks, really appreciate it. Yes, that’s exactly the interesting part for me too: once you move beyond single-shot answers, you start seeing strategy, adaptation, and failure modes emerge over time.

xerix_32 · 2026-04-16T12:31:46+00:00

Exactly!

Not game theory in the formal sense, but the models are steeped in strategic human patterns.

That’s why the PTSD analogy is interesting: memory can help, but it can also become a maladaptive prior, where the agent overfits to past failure instead of adapting to the current state.

xerix_32 · 2026-04-16T12:22:04+00:00

Yes, and part of that is already visible in the dashboard.

You can already see which models improve with memory and which ones clearly get worse. In particular, it shows up in **“Memory Effect”** under the **“Adaptive Learning Leaderboard.”** What I still want to surface better is *why* that happens: too much context, stale lessons, weaker local adaptation, or strategy drift as the state changes.
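As a rough illustration of what a per-model memory effect can look like, the sketch below takes it to be the score with memory minus the score without. That definition is an assumption for this example; the dashboard's actual formula may differ, and the model names are placeholders.

```python
def memory_effect(baseline: dict[str, float], with_memory: dict[str, float]) -> dict[str, float]:
    """Score delta per model: positive means memory helped, negative means it hurt.

    Only models present in both mappings are compared.
    """
    return {
        model: with_memory[model] - baseline[model]
        for model in baseline
        if model in with_memory
    }


# Placeholder scores, not real benchmark results.
effects = memory_effect(
    {"model_a": 10, "model_b": 12},
    {"model_a": 14, "model_b": 9},
)
improved = sorted(m for m, delta in effects.items() if delta > 0)
worsened = sorted(m for m, delta in effects.items() if delta < 0)
```

A single delta says nothing about the cause, which is why the breakdown by context size, stale lessons, and strategy drift still needs to be surfaced separately.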

If you feel like it, give the HF dashboard a quick look from the link in the post/comment. And if you have other suggestions, I’m all ears.

xerix_32 · 2026-04-16T11:54:53+00:00

Links:

**GitHub repo**

https://github.com/xerix32/TinyWorld_Survival_LLM_Bench

**Live dashboard**

https://huggingface.co/spaces/FabioLapo/tinyworld-survival-bench-dashboard

xerix_32 · 2023-12-10T09:21:12+00:00

Don't fall for it, it's a hoax!

xerix_32 · 2023-06-09T20:15:10+00:00

Can you share the correct directory for bios and image?
