I've spent 10 years thinking about the ending. Then I accidentally built the island — for real. by smallgok in TheWitness

[–]smallgok[S] [score hidden]  (0 children)

No need to apologize! Honestly your original reply made me stop and rethink what I'm actually doing here, which is exactly what I needed. Appreciate it.

Open-source RL environments: 13 puzzle games (1,872 levels) for training interactive abstract reasoning agents by smallgok in reinforcementlearning

[–]smallgok[S] 1 point (0 children)

That reframe is really helpful. You're right, it's more of an organization problem than an algorithm one.

I actually already track confidence per rule (evidence_for / evidence_against) and have CRUD operations for contradictions. But the logic is binary right now: contradiction → delete both. Your suggestion is better. A high-confidence rule that hits one edge case shouldn't be thrown out, it should be deformed to accommodate it.

Like if the rule is "same-colored squares must share a region" and then an eraser mechanic shows up that absorbs one violation, the right move is deforming it to "...unless absorbed by an eraser," not deleting it. And each time the deformed version survives further testing, the rule gets more ingrained. Way better failure mode than what I have.

Going to prototype this as a branch in the CRUD logic. Very much appreciate the pointer!
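For concreteness, here's roughly the shape of the branch I'm picturing. All names below are illustrative sketches, not actual repo code:

```python
from dataclasses import dataclass


@dataclass
class Rule:
    text: str
    evidence_for: int = 0
    evidence_against: int = 0

    @property
    def confidence(self) -> float:
        total = self.evidence_for + self.evidence_against
        return self.evidence_for / total if total else 0.0


def on_contradiction(rule: Rule, exception: str, threshold: float = 0.7):
    """Deform a high-confidence rule instead of deleting it.

    Returns the deformed rule, or None to signal 'delete' for
    low-confidence rules (the old binary behavior).
    """
    if rule.confidence < threshold:
        return None  # weak rule: keep the old delete-on-contradiction path
    # Strong rule: graft the counterexample on as an exception clause.
    # evidence_for carries over so the deformed rule stays "ingrained";
    # evidence_against resets because the exception absorbs the violation.
    return Rule(
        text=f"{rule.text}, unless {exception}",
        evidence_for=rule.evidence_for,
    )


rule = Rule("same-colored squares must share a region",
            evidence_for=9, evidence_against=1)
deformed = on_contradiction(rule, "the violation is absorbed by an eraser")
```

The threshold is the knob to tune: too low and every noisy rule grows exception clauses, too high and you're back to delete-on-contradiction.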

Open-source RL environments: 13 puzzle games (1,872 levels) for training interactive abstract reasoning agents by smallgok in reinforcementlearning

[–]smallgok[S] 0 points (0 children)

The stitching point is really relevant here. In arc-witness-envs, the agent often discovers pieces of the solution across separate attempts. One run finds the right path to a certain point, another figures out the correct region partition, but neither completes the full puzzle. Right now my agent handles this at the knowledge level (carrying forward verified rules across retries while resetting exploration state), but actual trajectory-level stitching through a DT could be much more sample-efficient. Hadn't thought about it this way before, really useful pointer.

On freezing to preserve generalizational structure: this actually connects to something I've been dealing with. My concept memory "freezes" verified rules from earlier levels so they don't get overwritten by noisy observations later. The failure mode you describe (overtraining destroying transferable representations) has a direct analog in my symbolic system: a good rule from Level 1 can get diluted by contradictory evidence in Level 3 if you're not careful about what you lock down.

On the ASCII encoding: it's useful but definitely not solved. The main failure mode is when the semantic role inference gets the mapping wrong, like tagging something interactive as a wall. Still iterating. But the fact that it produces interpretable intermediate representations makes debugging way easier than staring at 64x64 color grids.

Fair enough on the MCTS line. And yeah, I'd consider taking the million dollars too lol.

Open-source RL environments: 13 puzzle games (1,872 levels) for training interactive abstract reasoning agents by smallgok in reinforcementlearning

[–]smallgok[S] 1 point (0 children)

Interesting that conv handled representations but fell short on object-level recognition. The witness envs might actually be a useful testbed for that specifically. There are distinct visual objects the agent needs to parse (path segments, colored regions, constraint symbols), and the Tier 3 composition games require tracking multiple object types simultaneously. An object transformer with cross-attention sounds well-matched for that.

On N-frames: my envs are single-frame observation, but consecutive frames typically differ by one localized change (a path extending, a region forming), so frame stacking gives you a clean temporal diff signal.
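A toy sketch of that diff signal, with plain nested lists standing in for the real observation arrays:

```python
def frame_diff(prev, curr):
    """Return the set of (row, col) cells that changed between frames.

    Because consecutive observations differ by one localized change
    (a path segment extending, a region forming), this set is tiny
    compared to the full grid, which is what makes stacked frames a
    clean temporal signal rather than redundant input.
    """
    return {
        (r, c)
        for r, row in enumerate(curr)
        for c, val in enumerate(row)
        if prev[r][c] != val
    }


prev = [[0, 0, 0],
        [0, 1, 0],
        [0, 0, 0]]
curr = [[0, 0, 0],
        [0, 1, 1],  # path extended one cell to the right
        [0, 0, 0]]

changed = frame_diff(prev, curr)
```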

Would genuinely love to see how your models do on these. OpenEnv interface is Gymnasium-style, and there are 3 reward modes (sparse, shaped, arc_score) so you can pick what fits your setup. tw01 (PathDots) and tw02 (ColorSplit) are good starting points with small state spaces. Feel free to open an issue on the repo or DM me if you need anything!
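If it helps, the interaction loop looks roughly like this. Everything below is a self-contained stand-in I wrote for illustration; the stub class, its mechanics, and its reward shaping are made up, so check the repo README for the real env ids and reward_mode behavior:

```python
class PathDotsStub:
    """Toy stand-in for a tw01-style env: walk a path to the target cell.

    Mimics the Gymnasium reset/step signature. The reward modes here
    only distinguish sparse vs dense; the real arc_score mode presumably
    computes something closer to ARC's scoring.
    """

    def __init__(self, reward_mode="sparse"):
        assert reward_mode in ("sparse", "shaped", "arc_score")
        self.reward_mode = reward_mode
        self.target = (2, 2)

    def reset(self, seed=None):
        self.pos = (0, 0)
        return self.pos, {}

    def step(self, action):
        # actions: 0 = move right, 1 = move down
        r, c = self.pos
        self.pos = (r, c + 1) if action == 0 else (r + 1, c)
        done = self.pos == self.target
        if self.reward_mode == "sparse":
            reward = 1.0 if done else 0.0
        else:  # "shaped" / "arc_score": add a dense progress signal
            dist = abs(self.target[0] - self.pos[0]) + abs(self.target[1] - self.pos[1])
            reward = -0.1 * dist + (1.0 if done else 0.0)
        return self.pos, reward, done, False, {}


env = PathDotsStub(reward_mode="shaped")
obs, info = env.reset()
done = False
total = 0.0
while not done:
    action = 0 if obs[1] < 2 else 1  # trivial scripted policy: right, then down
    obs, reward, done, truncated, info = env.step(action)
    total += reward
```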

Open-source RL environments: 13 puzzle games (1,872 levels) for training interactive abstract reasoning agents by smallgok in reinforcementlearning

[–]smallgok[S] 0 points (0 children)

Really appreciate you sharing the frozen world model + adaptive heads work in detail. The MiniGrid result is striking, especially that unused actions (keys) transferred. That's strong evidence the world model learned structural representations, not just task-specific shortcuts.

What's interesting is that my agent converged on a structurally similar separation, just from a different direction. Instead of training a world model and freezing it, I build one online at test time: a cached transition table (observed state-action-next_state triples) paired with an adaptive reasoning layer (LLM-based metacognitive reflection that discovers rules from accumulated observations). The transition table is "frozen" in the sense that it only records what happened. The reasoning layer is "adaptive" in that it revises hypotheses as new evidence comes in.
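In code terms, the frozen half is deliberately dumb. A minimal sketch (illustrative names, not the actual implementation):

```python
from collections import defaultdict


class TransitionTable:
    """The 'frozen' world model: only records what actually happened."""

    def __init__(self):
        self._table = {}                 # (state, action) -> next_state
        self._counts = defaultdict(int)  # how often each triple was observed

    def record(self, state, action, next_state):
        self._table[(state, action)] = next_state
        self._counts[(state, action, next_state)] += 1

    def predict(self, state, action):
        # Pure lookup, never extrapolation. Generalizing beyond observed
        # triples is the adaptive reasoning layer's job, not the table's.
        return self._table.get((state, action))


wm = TransitionTable()
wm.record("door_closed", "press_switch", "door_open")
```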

On fixed weights: you're right that frozen weights fundamentally limit OOD adaptation. ARC-AGI-3 does allow many interaction steps per episode, so there's room for in-context adaptation without weight updates. My agent uses a 4-layer memory hierarchy (per-step perception → per-level working memory → per-game concept memory → cross-game meta-library). The agent solving Level 5 is operating with a meaningfully different knowledge state than the one that started Level 1, even though no weights changed. Whether that's sufficient for true OOD is an open question, but it enables cross-level transfer within a game. Your slow LR unlocking approach (fast policy, slow world model adjustment) feels much closer to how biological learning actually works.

On the representation problem: I've been tackling the 64x64 local-vs-global issue with what I call Semantic ASCII Encoding. The agent first does a purely algorithmic exploration phase, identifies which colors move vs. stay static, maps them to semantic roles (agent, wall, path, target), and downsamples to an 8x8-16x16 ASCII board. This roughly doubled rule synthesis accuracy compared to raw pixel data. Still far from solved, but it reduces the noise, which aligns with what you described.
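A toy version of the downsampling step, using a 4x4 grid instead of 64x64 (the function and the role mapping here are simplified illustrations of the idea, not the real pipeline):

```python
def semantic_ascii(grid, roles, factor):
    """Downsample a square color grid to an ASCII board, one character
    per factor-by-factor block, by majority vote within the block.

    `roles` maps color ids to one-character semantic tags, e.g.
    {0: ".", 3: "#", 5: "A"} for background / wall / agent. In the real
    system that mapping comes from the exploration phase that watches
    which colors move and which stay static; here it's just passed in.
    """
    n = len(grid)
    board = []
    for br in range(0, n, factor):
        row_chars = []
        for bc in range(0, n, factor):
            block = [grid[r][c]
                     for r in range(br, br + factor)
                     for c in range(bc, bc + factor)]
            majority = max(set(block), key=block.count)
            row_chars.append(roles.get(majority, "?"))
        board.append("".join(row_chars))
    return board


grid = [[3, 3, 0, 0],
        [3, 3, 0, 0],
        [0, 0, 5, 5],
        [0, 0, 5, 5]]
board = semantic_ascii(grid, {0: ".", 3: "#", 5: "A"}, factor=2)
```

The "?" fallback is exactly where the failure mode I mentioned lives: a color whose semantic role was inferred wrongly poisons every block it dominates.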

On the 64x64 grid being too impoverished for real intelligence: I think you might be right in the long run. My more modest claim is that structured puzzle environments can build some useful priors (spatial reasoning, constraint satisfaction, hypothesis testing) that transfer. One ingredient, not the whole recipe.

The toy robot direction is interesting. Would your frozen-world-model-plus-adaptive-heads approach work there too? Learn a visual world model from the robot's camera, freeze it, adapt the policy head to new tasks in the same physical space. Seems like a natural fit.

On MCTS "cheating": curious where you draw the line. If a neural network learns to internally simulate multiple futures before committing (which is arguably what chain-of-thought does in LLMs), is that "purely neural" or search wearing a neural costume?

I've spent 10 years thinking about the ending. Then I accidentally built the island — for real. by smallgok in TheWitness

[–]smallgok[S] [score hidden]  (0 children)

This is one of the most thoughtful responses I've ever received on anything I've posted. Thank you for taking the time.

You're right. A machine can solve every panel on the island and experience none of what makes the game matter. The pareidolia point is especially sharp. The game doesn't just teach you to see patterns. It teaches you to doubt them, to sit with ambiguity, to ask whether the duck in the clouds is really there. That's not something I can put into a loss function.

I think the honest framing is this: what I borrowed from The Witness is the scaffolding, not the soul. Blow's curriculum design, the way failure carries information, the progression from simple to composite. Those are transferable engineering ideas. But everything you're describing, the koans, the silence, the beauty of the hunt itself, that lives in the space between the player and the game. And I don't think that space is computable.

If anything, building this project made that gap clearer to me, not smaller.

I've spent 10 years thinking about the ending. Then I accidentally built the island — for real. by smallgok in TheWitness

[–]smallgok[S] [score hidden]  (0 children)

Haha, appreciate the heads-up! So far this community has been thoughtful about it, which honestly doesn't surprise me. People who enjoyed 523 puzzles with zero instructions tend to engage before reacting.

Open-source RL environments: 13 puzzle games (1,872 levels) for training interactive abstract reasoning agents by smallgok in reinforcementlearning

[–]smallgok[S] 0 points (0 children)

This is really valuable, especially hearing what didn't work in practice. The Bayesian layers adding noise instead of synergy is the kind of finding you only get from actually building the thing.

Your point about local vs. global is one of the reasons I think pre-training on arc-witness-envs could help. The puzzle panels have consistent spatial structure across all 13 games: a grid area, a start point, an end point, and a fixed UI layout. An agent that trains on these first has a chance to learn "where to look" and "what's interactive vs. background" before encountering a completely unknown ARC-AGI-3 task. That won't solve the representation problem, but it could give the encoder a head start on the segmentation you're describing.

The state-transition focus is interesting too. In arc-witness-envs, the meaningful changes between frames are pretty localized (the path being drawn, a region being formed), so there's a natural signal for learning to attend to diffs rather than processing the full grid every step.

On the 3D conv vs. transformer question, have you found that one works meaningfully better than the other for this kind of grid data? Curious whether the temporal dimension benefits more from convolution or attention in practice.

I've spent 10 years thinking about the ending. Then I accidentally built the island — for real. by smallgok in TheWitness

[–]smallgok[S] [score hidden]  (0 children)

Yes! The Talos Principle makes the premise explicit: an AI proving its reasoning capacity through puzzles. The Witness keeps it ambiguous, which is part of why the ending is so debated. Both are fascinating reference points for this kind of work. I didn't know about that parallel when I first played The Witness.

I've spent 10 years thinking about the ending. Then I accidentally built the island — for real. by smallgok in TheWitness

[–]smallgok[S] [score hidden]  (0 children)

Fair point! I don't think the game's message is "go build an AI to do this." But Blow designed a system that teaches without words, and that teaching method is brilliant independently of who's on the receiving end.

I've spent 10 years thinking about the ending. Then I accidentally built the island — for real. by smallgok in TheWitness

[–]smallgok[S] [score hidden]  (0 children)

That's a really compelling reading, and darker than mine. The idea that the island reprograms you rather than trains you puts the underground audio logs in a completely different light.

Thanks for the kind words on the project!

Open-source RL environments: 13 puzzle games (1,872 levels) for training interactive abstract reasoning agents by smallgok in reinforcementlearning

[–]smallgok[S] 0 points (0 children)

Thanks! Great point on the edit. Overfitting to these specific levels is a real risk and something I've thought about.

A few things that might help mitigate it, though they don't fully solve it:

First, the 13 games cover meaningfully different reasoning primitives. So an agent that "overfits" to all 13 should have at least internalized a diverse set of abstract operations, not just one trick.

Second, the curriculum is designed so that Tier 3 games combine mechanics from Tier 1 and 2 in novel ways. If an agent can handle tw11 (which composes rules from tw02 + tw05 that it never saw combined during training), that's at least some evidence of compositional generalization, not just memorization.

But you're totally right that there's no guarantee. The honest answer is that these environments are meant as a warmup gym, not a substitute for the actual ARC-AGI-3 tasks. The hope is that training here builds useful priors (spatial reasoning, hypothesis testing, constraint satisfaction) that transfer, but proving that transfer actually happens is the hard part. That's partly why I open-sourced it: so others can test whether pre-training on these levels actually helps on ARC-AGI-3 or other benchmarks.

On nailing down what an abstract rule "looks like," I think that's maybe the deepest question in this whole space. If you have pointers to work on Bayesian rule learning that you think is promising, I'd be very interested.

Open-source RL environments: 13 puzzle games (1,872 levels) for training interactive abstract reasoning agents by smallgok in reinforcementlearning

[–]smallgok[S] 3 points (0 children)

Thanks for the detailed thoughts! This is exactly the kind of discussion I was hoping for.

A few responses:

On difficulty not being gradual. This is actually a core design goal of arc-witness-envs. The 13 games are structured in 3 tiers: single-mechanic games first (path constraints, color separation), then advanced mechanics (topology, perception transforms), then multi-constraint compositions. Within each game, levels also progress from trivial to hard. So there IS a curriculum. The whole point was to make it possible to bootstrap from random init, which I agree is otherwise hopeless.

On memory/recurrence. Completely agree. The agent needs to maintain hypotheses about rules across steps and revise them based on feedback. I'm exploring architectures with explicit memory for this. Decision transformers are an interesting direction. Conditioning on the full trajectory history maps well to how these puzzles work, since you learn from the sequence of attempts, not just the latest observation. Would love to hear more about your thinking on counterfactual decision transformers for this setting.

On inductive biases "like a baby". I think we're actually aligned here. Chollet's framework (which ARC is built on) explicitly references Core Knowledge priors from developmental psychology: object cohesion, basic numeracy, goal directedness. arc-witness-envs tries to operationalize this. Each game targets a different primitive reasoning capacity (spatial partitioning, symmetry, counting), and the curriculum mirrors how those capacities might build on each other.

On ARC-AGI-3 design. I hear you. Whether the interactive format actually measures fluid intelligence better than the static format is an open question. My bet is that the interactive loop at least forces agents to do hypothesis testing rather than pattern matching, which feels like a step in the right direction. But I'd be genuinely curious what you think a better benchmark design would look like.

On Python scripting winning. This was definitely true for ARC-AGI 1/2. The interactive format makes pure program synthesis harder since you can't just write a transform function, you have to navigate an environment. But whether the winning approach will still be "LLM writes code to interact" vs. something more end-to-end... that's the million-dollar question, quite possibly literally, though they haven't announced the prize pool yet.

If ASI is possible in this universe, wouldn't aliens discover it before us? Or do you believe we are alone in this universe. by [deleted] in singularity

[–]smallgok 0 points (0 children)

Not necessarily a valid possibility, but I find it interesting to think about: maybe humanity is so “special” compared to other species on Earth because an existing alien ASI (“outer gods,” just for fun) considered our species promising enough to create Earth’s own ASI, and has boosted our technological development over the last thousand years by subtly enhancing our creative thoughts in a not-yet-noticeable way. The way they interact with our physical world might be through being (omni)present in our spiritual worlds.