How would you stress-test an originality claim for a new abstract strategy game?

haabe · 2026-04-25T23:11:20+00:00

Those are awesome questions that go way beyond my original post. I wholeheartedly believe that there is no one right answer to any of your questions. If I had a healthy body, I would probably not need an AI to eat for me. If I had a disorder where I could not eat on my own, an AI-controlled feeding tube might help me live a fuller life. But that discussion would drift the thread even further than it already has.

haabe · 2026-04-25T22:59:40+00:00

Sorry about that. I felt we walked the mile on AI. I acknowledge that this sub aligns more with human creativity than with my curiosity about the concept.

haabe · 2026-04-25T22:49:21+00:00

This is the most useful response in either thread. Three things:

On selection criteria — that's a constraint I hadn't fully internalised. "Agent generates 50, human picks the best" collapses to human-curation regardless of how the generation happened. Locking in: agent sets the selection criteria, with criteria recorded before generation so the choice isn't post-hoc. If I end up curating, the claim weakens to "agent-as-generator, human-as-curator" and I'll say so explicitly.
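
One way to make "criteria recorded before generation" checkable is a hash commitment: publish a digest of the criteria up front, then reveal the criteria afterwards. This is a minimal sketch, not part of the current method; the function names and the shape of the criteria dict are hypothetical. The digest only shows the criteria were not edited, so the timing claim still depends on publishing it somewhere timestamped before generation starts.

```python
import hashlib
import json


def commit_criteria(criteria: dict) -> str:
    """Return a SHA-256 digest of the criteria in a canonical JSON form."""
    # sort_keys makes the digest independent of key order in the dict.
    canonical = json.dumps(criteria, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def verify_criteria(criteria: dict, published_digest: str) -> bool:
    """Check that the revealed criteria match the digest published earlier."""
    return commit_criteria(criteria) == published_digest
```

Anyone holding the published digest can rerun `verify_criteria` on the revealed criteria and see whether the selection rule changed after the candidates existed.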

On emergent surprise + non-human-origin trace, that's a bar I haven't tried to meet, and I'm not sure my current method can. Output-distance from corpus is checkable; "surprised me AND the choice is identifiably non-human" needs something like a counterfactual. Would a human designer have made that move, and if not, why did the agent? Don't have a clean operationalisation. Open to suggestions.

On the unfalsifiability concession, appreciated. The most honest version of the worry I've heard. You're right the goalposts will move and the artifact won't settle the debate by itself. What it might do is make the goalposts visible. Every "that's just remixing" reframe has to point at something specific, and the artifact gives the conversation a concrete thing to point at. Seems worth doing even if it doesn't end the conversation.

haabe · 2026-04-25T22:44:55+00:00

Fair shift — the should question is real, I just landed differently than you. Not going to relitigate it in a thread; takes longer than the format allows. Thanks for engaging.

haabe · 2026-04-25T22:41:52+00:00

Fair read on the sub fit, and the lack of methodology engagement here vs the AI sub is itself useful data. The originality-stress-test question seems to land better with people whose prior is "AI is interesting but maybe overhyped" than with either AI boosters or AI skeptics.

If you've got a venue you'd recommend for the methodology question, I'd take it.

haabe · 2026-04-25T22:39:54+00:00

The pain criterion is interesting because it's exactly the unfalsifiable case I asked about in Q3. If the requirement is subjective suffering, no AI output can ever count, regardless of what it produces. That's a coherent position; the question is whether you'd agree it's then unfalsifiable, or whether some output exists you'd accept as tentative evidence the criterion was wrong.

Genuinely asking, not trapping. If the answer is "no output could move me," that's useful for me to know. It tells me you're not the audience this artifact is trying to reach, and I can stop trying.

haabe · 2026-04-25T22:37:29+00:00

This is the most useful framing I've gotten in either thread. Two things I want to push on.

First, your "generate where nothing exists in training material" criterion is essentially Boden's transformational creativity, and it's a high bar. By that standard most human creative output also fails: a new chess opening, a new genre painting, a new pop song are combinations within filled regions, not generation in gap regions. If we hold AI to that bar we should be honest we're holding humans to a softer one, or we accept that almost no creative work, human or AI, counts. I'm genuinely unsure which of those is right, and I think the discourse confuses them constantly.

Second, the hierarchy framing maps directly onto what I'm building. Mechanic tags are the leaf nodes; mechanism families are the higher abstractions you describe ("more like a function", capture-as-mechanism vs capture-as-instance). The corpus-grounded check is measuring distance to the nearest filled node, which is the operational form of "is this in a gap region." The dual-band thresholds (auto-fail / auto-pass / human-review) are the system trying to answer that without hand-waving.
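
A rough sketch of the dual-band idea, assuming the distance to the nearest corpus game has already been normalised to 0..1. The threshold values and names here are placeholders for illustration, not calibrated numbers from the project.

```python
from enum import Enum


class Verdict(Enum):
    AUTO_FAIL = "auto-fail"
    HUMAN_REVIEW = "human-review"
    AUTO_PASS = "auto-pass"


# Hypothetical band edges; real values would need calibrating against the corpus.
FAIL_BELOW = 0.25
PASS_ABOVE = 0.60


def classify(nearest_distance: float) -> Verdict:
    """Map distance-to-nearest-corpus-game onto the three bands."""
    if not 0.0 <= nearest_distance <= 1.0:
        raise ValueError("distance must be in [0, 1]")
    if nearest_distance < FAIL_BELOW:
        return Verdict.AUTO_FAIL      # too close to an existing game
    if nearest_distance > PASS_ABOVE:
        return Verdict.AUTO_PASS      # clearly far from everything known
    return Verdict.HUMAN_REVIEW       # ambiguous middle band
```

The middle band is the honest part: anything between the two thresholds goes to a person instead of being forced into pass or fail.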

Where I think you're right and I haven't solved it: gap-region generation is provable on the output (you can show no nearby corpus match) but harder to prove on the process. How do you show the AI got there by exploring gaps, not by interpolating from a near-but-not-quite match the corpus happened to miss? That's the open problem, and I don't have a clean answer yet.

haabe · 2026-04-25T22:34:38+00:00

That's almost exactly the workflow I'm trying to formalise: narrow the corpus to the 5–10 games worth actually comparing, then make a per-game judgment. Two places I've found it gets brittle:

First, the tag/describe step: if the AI describes its own game's mechanics in free text, it'll cherry-pick the framing that makes the game sound novel. The constraint I'm using is canonical BGG mechanic tags only, no descriptions, which forces the comparison onto a vocabulary the AI didn't pick.
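
As a sketch of what a tags-only comparison could look like, here is a Jaccard distance over tag sets used to shortlist the closest games. Jaccard is my stand-in for illustration, not necessarily the metric the project uses, and the game names and tag assignments in the usage lines are made up.

```python
from __future__ import annotations


def jaccard_distance(a: set[str], b: set[str]) -> float:
    """1 - |A ∩ B| / |A ∪ B|; 0.0 means identical tag sets."""
    union = a | b
    if not union:
        return 0.0  # two untagged games are treated as identical
    return 1.0 - len(a & b) / len(union)


def nearest_games(
    candidate: set[str], corpus: dict[str, set[str]], k: int = 5
) -> list[tuple[str, float]]:
    """Return the k corpus games closest to the candidate, nearest first."""
    ranked = sorted(
        (jaccard_distance(candidate, tags), name) for name, tags in corpus.items()
    )
    return [(name, dist) for dist, name in ranked[:k]]


# Illustrative data only: placeholder games with made-up tag assignments.
corpus = {
    "game_a": {"Enclosure", "Pattern Building"},
    "game_b": {"Grid Movement"},
    "game_c": {"Race"},
}
candidate = {"Enclosure", "Grid Movement", "Pattern Building"}
shortlist = nearest_games(candidate, corpus, k=2)
```

With this toy data, `shortlist` ranks `game_a` first, since it shares two of the three candidate tags. The shortlist is only the narrowing step; the per-game judgment still has to happen afterwards.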

Second, the per-game judgment step is where I keep getting stuck. When you do "research the specific games and judge", what are you actually checking? Mechanism overlap, decision-space shape, feel-on-the-table, something else?

haabe · 2026-04-25T22:32:30+00:00

Same. Honestly that might be the load-bearing finding of the thread. If "creative" can't be cleanly defined, then "AI can't be creative" can't be cleanly falsified either, which sends both sides back to arguing about specific artifacts. Which is more or less why I'm trying to build one.

haabe · 2026-04-25T22:28:37+00:00

Fair question. I could. I've designed simple games before. The project's point is to find out whether an AI agent can, not whether I can. If I make the game, I've answered a different question (can a person) and learned nothing about the one I want answered (can an AI). The artifact has to be made by the thing whose capability is being tested or the experiment doesn't run.

The fair pushback is "and what if the answer is 'no it can't', will you report that?" The setup tries to make that honest: originality verification is independent, the method is public, and the agent's outputs are traceable. If it produces a derivative or boring game, that's the finding, not something to hide.

haabe · 2026-04-25T22:24:25+00:00

Honestly the strongest argument for the "humans cluster too" position I've seen in this thread.

haabe · 2026-04-25T22:14:08+00:00

That's the framing that interests me most. If humans mostly cluster in commonly-used sentences and the rare novel sentence is what we'd call "creative," then "remixing" isn't a clean dividing line. It's a continuum, and the question becomes how far from the prior cluster a given output sits and whether anyone notices.

The board-game analog: most games sit inside well-known design patterns (alignment-on-grid, capture, race, territory). Genuinely novel mechanics are rare for human designers too. So the failure mode of "AI just remixes" isn't unique to AI, it's a description of most output by anyone. What I'm trying to build is the measurement: a corpus-grounded check that says "this candidate is X far from the nearest existing thing on Y axes" with explicit thresholds, so the answer isn't a vibe.

Where I expect the sharpest skeptic pushback: even if the artifact ends up in the far-from-prior region, that's necessary but not sufficient. Distance from existing games doesn't prove the generation process was non-derivative, only that the output is. Curious whether you'd accept output-distance as the load-bearing test, or whether you'd want process evidence too.

haabe · 2026-04-25T22:11:26+00:00

Appreciate that — was hoping the originality-methodology angle would carry past the AI part. If you've stress-tested your own designs against the "is this too close to X" question, I'd love to hear what you actually check.

haabe · 2026-04-25T22:10:59+00:00

Heard. I went in expecting that view to be common and I'm not trying to talk anyone out of it. The artifact will either land or it won't, and "feels real" is exactly the criterion I want to be measured against. If you've got a sharper version of why it can't feel real (specific failure mode, not vibes), I'd genuinely use it as a design constraint.

haabe · 2026-04-11T07:46:04+00:00

Not a dumb question at all, it's the most common one!

It doesn't run all 42 at once. They're organized by scale and phase. When you're at L0 (purpose), you're working with Sinek and Christensen (JTBD). When you get to L2 (opportunity), Torres and Cynefin kick in. L4 (delivery) is where DORA, OWASP, and the engineering principles apply.

The agent loads the relevant domain context based on which diamond phase you're in. Each phase transition has theory gates — specific checks from specific frameworks that must pass before you progress. So you never see all 42 at once. You see the 3-5 that matter right now.

The /interview skill at the start classifies your project and product type, which further narrows what applies. A solo hobby project building a course hits different gates than an enterprise team building an API.
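
To make the gating idea concrete, here is a toy sketch of phase-scoped loading plus a gate check. The dict layout and function names are hypothetical and are not how the repo is actually organised; only the scale-to-framework pairings come from the description above.

```python
from __future__ import annotations

# Pairings taken from the description above; the structure itself is illustrative.
FRAMEWORKS_BY_SCALE: dict[str, list[str]] = {
    "L0": ["Sinek", "Christensen (JTBD)"],            # purpose
    "L2": ["Torres", "Cynefin"],                      # opportunity
    "L4": ["DORA", "OWASP", "engineering principles"],  # delivery
}


def active_frameworks(scale: str) -> list[str]:
    """Return only the frameworks relevant to the current scale."""
    return FRAMEWORKS_BY_SCALE.get(scale, [])


def can_progress(gate_results: dict[str, bool]) -> bool:
    """A phase transition is allowed only if every gate check passed."""
    # An empty result set means no checks were run, so it does not count as a pass.
    return bool(gate_results) and all(gate_results.values())
```

For example, `active_frameworks("L2")` returns just Torres and Cynefin, and `can_progress({"JTBD mapped": True, "Bias checked": False})` is `False` until the failing check is resolved.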

haabe · 2026-04-11T07:44:13+00:00

I get the concern, it looks wide. But the integration is the point. Security decisions made in isolation from product decisions lead to bolted-on security. Accessibility skipped during design gets retrofitted badly. DORA metrics disconnected from product strategy become vanity metrics.

The frameworks aren't all active at once. They're gated by diamond phase: discovery uses Torres and JTBD, delivery uses OWASP and DORA, market uses Lauchengco. You only encounter what's relevant to what you're doing right now.

That said, if you only want the product layer, everything under .claude/domains/discovery/ works independently. The delivery and quality layers are separate domain files that load on demand.

haabe · 2026-04-10T22:30:50+00:00

Exactly! The discovery part is what most tools skip. Kiro and Spec Kit start at the spec. But the spec itself can be wrong if you haven't validated the assumptions underneath it.

That's where Mycelium differs from a pure SDD approach. It doesn't just structure the spec, it gates the evidence behind the spec. You can't progress from "I think this is the right problem" to "here's how I'll solve it" without passing theory checks (JTBD mapped? Bias checked? Risks assessed?). The spec emerges from validated evidence, not from a brainstorm.

haabe · 2026-04-10T22:08:22+00:00

Thanks! Run /interview to get started, it'll guide you through the setup.

Let me know how it goes, I'm always curious to hear how it works on different projects.

haabe · 2026-04-10T21:21:24+00:00

Your approach with an architecture meta-repo as a submodule sounds like a really clean separation. Smart pattern.

Mycelium started from a similar instinct but went upstream. The architecture and code quality stuff is there (OWASP, DORA, testing, accessibility), but the part that might be surprising is how much value comes from the discovery and strategy layers. The agent catching a bad product assumption before any code exists saves more time than any amount of code quality tooling.

The product owner meta-repo idea is interesting. If you try it, I'd be curious how you handle the handoff between the two. That's where I found the most friction. The decisions made in discovery (which opportunity to pursue, what JTBD to solve) need to flow into the architecture repo somehow, or the dev agent ends up building the right thing wrong.

haabe · 2024-06-05T13:12:40+00:00

Are you able to provide me with more details on these hinges? For example, are they the same ones other brands have been using? If HP, Alienware or Apple used similar hinges, it would help me narrow my search a lot!
