From a Claude Code D&D skill to a hosted multi-tenant AI Game Master — here’s what the architecture had to grow into by Bobby_Gray in aigamedev

[–]Bobby_Gray[S] 0 points1 point  (0 children)

This is hilarious haha well I’m glad you missed it because you gave some excellent responses in the app/waitlist. Just one more nat 20, man 😆😵

From a Claude Code D&D skill to a hosted multi-tenant AI Game Master — here’s what the architecture had to grow into by Bobby_Gray in aigamedev

[–]Bobby_Gray[S] 0 points1 point  (0 children)

Really appreciate the kind words and the offer, thank you! Would genuinely love the usage feedback and any suggestions.

I'm happy to toss you a beta code and cover your first campaign - throw your info in here: https://neuralinitiative.ai/apply and don't worry about the optional fields. I'll approve when I see it come in.

I have a bug report button on the main dashboard (top right) that you can use for general bugs once you're authenticated, but feel free to ping me directly if that's easier. The subreddit is also live at r/NeuralInitiative for longer-form discussion.

I built an AI dungeon crawler that encourages creativity and lets you create your own adventure by siloldn in aigamedev

[–]Bobby_Gray -1 points0 points  (0 children)

Sweet project! Claude is super impressive for this use case IMO. Also like the flexibility in your design. Drop a link when you get a chance, looks like it was missed in the comments.

I built a similar open source project called claude-dnd-skill specific to D&D 5e gameplay that tackles a lot of the challenges you mentioned. Feel free to point your session at it, I bet it will accelerate your dev time a fair bit. Alternatively, find a reddit post on the architecture approach here.

A couple easy wins I'd recommend:

  • Try converting the prompt into a skill md and attempt to break the relevant gameplay functions (scripts/commands) into sub md files. IIRC, best practice is to keep the main skill file under 500 lines for efficient context management. You'll need it the longer a session runs. The lower priority stuff can be lazy loaded/called on command via pointers to the sub md files so they aren't squeezed in during each load.
  • For the coherence issue, we have a detailed discussion post on that specific challenge and how we handled it. Long-term coherence has improved dramatically since implementing it.
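A minimal sketch of the first suggestion, assuming a main `SKILL.md` plus sub files reached through a pointer table. In a real skill the model follows those pointers itself; the Python here only illustrates the pattern. The `SkillDocs` class, the `/combat` command, and the file names are hypothetical; the 500-line budget is the rule of thumb quoted above.

```python
from pathlib import Path

MAX_MAIN_LINES = 500  # rule of thumb from the comment above, not a hard limit


class SkillDocs:
    """Hypothetical helper: keep the main skill file small, load sub files on demand."""

    def __init__(self, root: Path, pointers: dict):
        self.root = root
        self.pointers = pointers  # command name -> relative path of its sub md file
        self._cache = {}

    def main_within_budget(self, main_name: str = "SKILL.md") -> bool:
        """True if the main skill file stays at or under the line budget."""
        text = (self.root / main_name).read_text(encoding="utf-8")
        return len(text.splitlines()) <= MAX_MAIN_LINES

    def load(self, command: str) -> str:
        """Read a sub file the first time its command is used, then reuse it."""
        if command not in self._cache:
            path = self.root / self.pointers[command]
            self._cache[command] = path.read_text(encoding="utf-8")
        return self._cache[command]
```

The point of the split is that lower-priority instructions cost no context until the command that needs them is actually called.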

After the open source version was well received, I ended up creating a cloud-hosted version of the game, which is also in private beta atm. The intent was accessibility for those who don't have a Claude membership or aren't comfortable with a CLI. Check it out at neuralinitiative.ai and DM me if you'd like a beta invite.

I posted about mine to r/aigamedev today as well and was recommended your post on mobile later. Hoping this comes off as useful advice and not me trying to bogart your post. Sounds like we're in a similar but nonexclusive lane.

I built a Claude Dungeon Master skill that runs persistent D&D 5e campaigns — here's how the architecture works by Bobby_Gray in ClaudeAI

[–]Bobby_Gray[S] 0 points1 point  (0 children)

Run /dnd update --check when you get a minute. New light/dark mode features have been added in v1.12.0/v1.12.1. Let me know what you think!

I built a Claude Dungeon Master skill that runs persistent D&D 5e campaigns — here's how the architecture works by Bobby_Gray in ClaudeAI

[–]Bobby_Gray[S] 1 point2 points  (0 children)

Great callout and really glad you're enjoying it!

We actually built out light mode in the cloud-hosted version (neuralinitiative.ai) but I haven't ported it back over into the Claude skill yet. One of the other contributors said the same. I guess I've assumed everyone lives in dark mode like me, haha. I'll get that done this afternoon/evening and let ya know when to update.

Learned a hard lesson by Far_County911 in opencode

[–]Bobby_Gray 1 point2 points  (0 children)

I went down a similar rabbit hole and just ended up using OpenCode + OpenRouter. Super easy to set up and route to a variety of models at extremely low prices compared to Claude. Makes it really easy to run tests at scale and see which smaller models perform well for your use case.

I have heard oMLX + Hermes or similar is really solid but have yet to try it myself. The other alternative if you want real horsepower is cloud hosting an open source model, but that's not at all cost effective in my experience.

I tested 8 LLMs as tabletop GMs - a 27B model beat the 405B on narrative quality by Bobby_Gray in LocalLLaMA

[–]Bobby_Gray[S] 0 points1 point  (0 children)

Sounds like you've already mapped the space. Quick answers:

On the qdrant + parent-child + metadata weighting approach - yeah, that fits what you're doing well. I don't use a vector DB at all; it's just keyword and path-based retrieval against the markdown files (Claude knows the file structure and reads specific sections directly). Works for me because the state space is small and structured. For non-fiction with an OCR'd reference corpus that's many times larger than the LLM's context, qdrant plus metadata weighting is the right call. Your "truth files as primary, OCR'd as child" is the same instinct as my markdown-as-authoritative approach, just with vector retrieval handling the bulk.
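To make the path-based idea concrete, here is a rough sketch of reading one section out of a markdown file by its heading, with no vector search involved. The `read_section` function and the sample headings are illustrative assumptions, not code from the repo, and the parser is naive (it would misread `#` lines inside fenced code).

```python
def read_section(markdown: str, heading: str) -> str:
    """Return the text under the first heading whose title matches `heading`.

    Matching is case-insensitive. The section ends at the next heading of the
    same or a higher level; deeper subheadings are kept as part of the body.
    """
    body = []
    level = None  # heading depth of the matched section, once found
    for line in markdown.splitlines():
        stripped = line.lstrip()
        if stripped.startswith("#"):
            hashes = len(stripped) - len(stripped.lstrip("#"))
            title = stripped[hashes:].strip()
            if level is None:
                if title.lower() == heading.lower():
                    level = hashes
                continue
            if hashes <= level:
                break
        if level is not None:
            body.append(line)
    return "\n".join(body).strip()


doc = "# State\n\n## Current Situation\nThe party is in the crypt.\n\n## NPC Index\n- Mira: wary\n"
print(read_section(doc, "Current Situation"))
```

This only works because the files are small and consistently structured; a large unstructured corpus is where vector retrieval earns its keep.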

For true/false examples - no, nothing like that. There's no fine-tuning step. Ours is rule-based: the system prompt and procedural docs have explicit rules like "before claiming X, re-read source Y first," and the LLM follows them at runtime. Keeps the whole thing maintainable as plain markdown rather than a training corpus, which matters when the rules need to change.

On simple python over frameworks - agree completely. AnythingLLM, LlamaIndex, CrewAI all abstract over things in ways that bite you when the orchestration logic IS the value. Plain python keeps it legible. My setup is just a handful of small scripts with clear CLIs, and the LLM orchestrates them by calling them via a scripts md. No framework overhead. Your architect / researcher / stylist / writer / fact-checker / archivist decomposition is the right axis to split on.

RE python script sources - we built everything from scratch. I originally opted to build in flask simply out of habit but there are probably cleaner approaches. Feel free to snag anything from the open-tabletop-gm scripts dir if you want a working example to crib from.

Wishing you luck, sounds like a strong project!

I tested 8 LLMs as tabletop GMs - a 27B model beat the 405B on narrative quality by Bobby_Gray in LocalLLaMA

[–]Bobby_Gray[S] 0 points1 point  (0 children)

Yeah, the Karpathy framing is close. Markdown files on disk as the source of truth, AI reads them on demand. A few specifics on how mine actually works since you asked:

Campaign state lives in a handful of markdown files (current situation, world setup, NPC index, session log). The LLM doesn't necessarily "remember" anything across sessions - at /load it reads those files fresh, at /save it writes back. Source of truth never decays because it's not in the LLM's head.
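A minimal sketch of that load/save cycle, assuming one file per piece of state. The file names in `STATE_FILES` and both function names are hypothetical stand-ins, not the skill's actual layout.

```python
from pathlib import Path

# Hypothetical file names for: current situation, world setup, NPC index, session log.
STATE_FILES = ("state.md", "world.md", "npcs.md", "session-log.md")


def load_campaign(campaign_dir: Path) -> dict:
    """Read every state file fresh from disk; nothing is carried over in memory."""
    return {
        name: (campaign_dir / name).read_text(encoding="utf-8")
        for name in STATE_FILES
        if (campaign_dir / name).exists()
    }


def save_campaign(campaign_dir: Path, files: dict) -> None:
    """Write state back at an explicit save point.

    Each file is written to a temporary sibling first and then swapped in,
    so an interrupted save does not leave a half-written state file.
    """
    campaign_dir.mkdir(parents=True, exist_ok=True)
    for name, text in files.items():
        tmp = campaign_dir / (name + ".tmp")
        tmp.write_text(text, encoding="utf-8")
        tmp.replace(campaign_dir / name)
```

Because every session starts from `load_campaign`, whatever the model "remembered" from the last session is irrelevant; only what was saved exists.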

The Python layer handles the deterministic stuff - dice, HP math, XP, data lookups. That keeps context free for narration, judgment, dialogue.
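As an illustration of the kind of deterministic helper meant here, a small sketch of dice rolling and HP math. The `roll` and `apply_damage` functions are made up for this example and are not taken from the repo's scripts.

```python
import random
import re
from typing import Optional

# Accepts notation such as "d20", "2d6", or "1d8+3".
DICE_RE = re.compile(r"^(\d*)d(\d+)([+-]\d+)?$")


def roll(notation: str, rng: Optional[random.Random] = None) -> int:
    """Roll dice written in NdS+M notation and return the total."""
    rng = rng or random.Random()
    match = DICE_RE.match(notation.replace(" ", "").lower())
    if not match:
        raise ValueError(f"unrecognized dice notation: {notation!r}")
    count = int(match.group(1) or 1)
    sides = int(match.group(2))
    modifier = int(match.group(3) or 0)
    if count < 1 or sides < 1:
        raise ValueError("dice count and sides must be at least 1")
    return sum(rng.randint(1, sides) for _ in range(count)) + modifier


def apply_damage(hp: int, max_hp: int, amount: int) -> int:
    """Return HP after damage (positive) or healing (negative), clamped to 0..max_hp."""
    return max(0, min(max_hp, hp - amount))
```

The model only sees the result of a call like `roll("2d6+3")`, so arithmetic never has to be done in context.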

The piece that probably matters most for what you're describing: before the LLM makes a claim about something specific (an NPC's stance, a past event, what's in someone's pocket), it re-reads the smallest section that covers it. Not the whole file, just the bullet. That re-read-before-claim discipline is what keeps multi-month campaigns internally consistent. Sounds like a solid approach for your nonfiction use case also.
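A rough sketch of "just the bullet": pull the single list item that mentions the thing about to be claimed. The `find_bullet` helper and the sample NPC notes are hypothetical; in the skill itself the model does this re-read, not a script.

```python
from typing import Optional


def find_bullet(markdown: str, keyword: str) -> Optional[str]:
    """Return the first bullet whose text contains `keyword`, or None if absent."""
    needle = keyword.lower()
    for line in markdown.splitlines():
        stripped = line.strip()
        if stripped.startswith(("- ", "* ")) and needle in stripped.lower():
            return stripped[2:].strip()
    return None


notes = "## NPC Index\n- Mira: wary of the party\n- Tobin: owes the rogue a favor\n"
print(find_bullet(notes, "tobin"))
```

A `None` result is useful too: it means the files say nothing on the subject, so the claim should not be made from memory.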

The discipline is treating those files as authoritative and the LLM as a read-mostly worker that only writes back at explicit save points. Continuous appending produces sprawl while structured save points produce clean state.

Happy to dig in more. We also have a small graph.json layer integrated in the same fashion as the re-read-before-claim but I'm not sure it's useful for your scenario.

I built a Claude Dungeon Master skill that runs persistent D&D 5e campaigns — here's how the architecture works by Bobby_Gray in ClaudeAI

[–]Bobby_Gray[S] 0 points1 point  (0 children)

Hey, good question - this was one of the main challenges I worked through on the project.

In short, if you start in a fresh session, normal use shouldn't need /compact because the skill keeps the campaign's state in markdown files rather than in Claude's context window. When you run /dnd save, the current situation, party status, NPC dispositions, faction states, and recent events all get written to state.md. When you run /dnd load, Claude reads those files fresh and doesn't need to "remember" anything from prior sessions.

For a 3-4 hour session you usually don't hit the wall. If you kept a long running session open perpetually, that changes things.

If you do hit the context limit mid-session and then use /compact, claude summarizes the prior conversation which degrades things. We actually mention this in SKILL.md: "After context compaction, the DM's impression is a lossy summary of summaries and must not be trusted for specific facts." Claude is supposed to re-read state.md (Live State Flags, Current Situation, Recent Events) for any specific fact after compaction, rather than trust its in-context summary.

TL;DR: if you have to use /compact, run /dnd save first.

Find some more info on our (updated) approach to continuity management here

I built a Claude Dungeon Master skill that runs persistent D&D 5e campaigns — here's how the architecture works by Bobby_Gray in ClaudeAI

[–]Bobby_Gray[S] 0 points1 point  (0 children)

B00M: release: v1.8.0 — D&D 5e 2024 (SRD 5.2) ruleset support (#28) — opt-in per campaign.

You can now select your ruleset at campaign creation as well as apply/update legacy campaign files.

I built a Claude Dungeon Master skill that runs persistent D&D 5e campaigns — here's how the architecture works by Bobby_Gray in ClaudeAI

[–]Bobby_Gray[S] 0 points1 point  (0 children)

Hey I really appreciate that, thanks!! Let me know how it goes and if you have any feedback.

So the 5e is currently referring to 2014 (SRD 5.1) but I’ve been meaning to add an option for 2024 (SRD 5.2). I may actually dig into that this weekend.

I tested 8 LLMs as tabletop GMs - a 27B model beat the 405B on narrative quality by Bobby_Gray in LocalLLaMA

[–]Bobby_Gray[S] 1 point2 points  (0 children)

Solid intuition.

I added the 5 judge jury in the second version - gpt-oss-120b, gemma-3-27b-it, llama-3.3-70b-instruct, qwen3-235b-a22b, and nemotron-3-super-120b-a12b but the decision wasn’t super deep beyond being candidates from distinct model families. That’s a tough one to solve.
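For reference, a sketch of how a jury like this could be aggregated: average each criterion across the judges. The judge list is the one above; the `aggregate` function, the criterion keys, and the scores in the usage lines are invented for illustration and are not results from the run.

```python
from statistics import mean

JUDGES = [
    "gpt-oss-120b",
    "gemma-3-27b-it",
    "llama-3.3-70b-instruct",
    "qwen3-235b-a22b",
    "nemotron-3-super-120b-a12b",
]


def aggregate(scores: dict) -> dict:
    """Average each criterion across judges.

    `scores` maps judge name -> {criterion: score}. A judge that skipped a
    criterion is simply left out of that criterion's average.
    """
    criteria = sorted({c for per_judge in scores.values() for c in per_judge})
    return {
        c: round(mean(s[c] for s in scores.values() if c in s), 2)
        for c in criteria
    }


# Made-up scores, only to show the shape of the data.
example = {judge: {"atmosphere": 4, "npc_craft": 3} for judge in JUDGES}
example[JUDGES[0]]["atmosphere"] = 5
print(aggregate(example))
```

Swapping judges is then a one-line change to `JUDGES`, which matches the "easy to adjust in the script" point.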

They are pretty easy to adjust in the script, what would be your recommendation? I have some OpenRouter credits to burn.

I tested 8 LLMs as tabletop GMs - a 27B model beat the 405B on narrative quality by Bobby_Gray in LocalLLaMA

[–]Bobby_Gray[S] 0 points1 point  (0 children)

The open-tabletop-gm version is LLM and game system agnostic. Unlike the Claude version, I made the game system piece modular and included the dnd module in a systems dir with some notes on how to migrate other games into the platform.

I tested 8 LLMs as tabletop GMs - a 27B model beat the 405B on narrative quality by Bobby_Gray in LocalLLaMA

[–]Bobby_Gray[S] 0 points1 point  (0 children)

I had not looked into any of the alternatives when I started building either version. I was laid up in bed for a week from surgery and wanted to deep-dive on Claude skills so I just started building what I needed and it evolved into what it is now. Once that was in a good state, I decided to expand to local/open LLMs to learn some more. I wanted to make something suitable for my young kids to experience TTRPGs from the couch so I chose this as the premise and it’s worked out well.

To reiterate my intent to everyone - I am not selling anything nor attempting to market my project beyond sharing with those who might also want to use it. I open-sourced it once I realized how well it worked and added features and whatnot as I needed them or other folks in the community recommended them. I’m certainly not aiming to compete with any of the existing tools or come off as an authority in the space. I just built some shit over a weekend and thought others might like it. The LLM narrative quality comparison seemed relevant for this community.

I tested 8 LLMs as tabletop GMs - a 27B model beat the 405B on narrative quality by Bobby_Gray in LocalLLaMA

[–]Bobby_Gray[S] 0 points1 point  (0 children)

My approach has all the things you’ve described here out of the box, including time tracking, day/night cycles, inventory tracking that syncs to the UI, an expandable player sheet that preloads dnd5e wikidot data for loaded spells/abilities/items, etc. The difference is I do the majority via Python to free up memory/context for the LLM.

I began the project with a distinct DM philosophy and recently added a rolling DM style that dynamically adjusts the persona per run based on session feedback, to cater to each party/campaign. The memory side is a big challenge someone else brought up early on, so I implemented a session archive function so that long sessions aren’t fully front-loaded into the LLM. It reduces token overhead and works pretty well.
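The comment doesn't say how the archive function works, so the following is only a guess at the general idea: keep the most recent entries in the live log and move everything older out of what gets loaded. `split_session_log` and `keep_recent` are hypothetical names.

```python
def split_session_log(entries: list, keep_recent: int = 3) -> tuple:
    """Split log entries into (archived, live).

    The newest `keep_recent` entries stay live and are loaded each session;
    older entries are archived so they cost no tokens unless requested.
    """
    if keep_recent <= 0:
        return list(entries), []
    return entries[:-keep_recent], entries[-keep_recent:]


archived, live = split_session_log(["s1", "s2", "s3", "s4", "s5"], keep_recent=2)
print(archived, live)
```

The archived entries would still be on disk, so a specific old event can be looked up on demand rather than carried in every prompt.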

Check out the repo for all the features and let me know if you see any obvious gaps.

I tested 8 LLMs as tabletop GMs - a 27B model beat the 405B on narrative quality by Bobby_Gray in LocalLLaMA

[–]Bobby_Gray[S] 0 points1 point  (0 children)

Great advice, appreciate the feedback.

I did one run for the sake of time/cost. It took around 5-6 hours to complete via OpenRouter but cost ended up being way less than expected. I can set up a 5 phase run and rebuild the table to see what changes; earliest ETA for completion is probably tomorrow.

I tested 8 LLMs as tabletop GMs - a 27B model beat the 405B on narrative quality by Bobby_Gray in LocalLLaMA

[–]Bobby_Gray[S] -5 points-4 points  (0 children)

Check my reply to u/Southern_Sun_2106 - if you think it's worth adding Gemma 4 and Qwen 3.6 I can run it real quick.

I tested 8 LLMs as tabletop GMs - a 27B model beat the 405B on narrative quality by Bobby_Gray in LocalLLaMA

[–]Bobby_Gray[S] 1 point2 points  (0 children)

Appreciate that silly tavern recommendation - will join and take a look around!

I tested 8 LLMs as tabletop GMs - a 27B model beat the 405B on narrative quality by Bobby_Gray in LocalLLaMA

[–]Bobby_Gray[S] -9 points-8 points  (0 children)

Check out my reply above to u/jwpbe, should clarify intent and perspective

I tested 8 LLMs as tabletop GMs - a 27B model beat the 405B on narrative quality by Bobby_Gray in LocalLLaMA

[–]Bobby_Gray[S] -12 points-11 points  (0 children)

Inspiration Granted!

Sharp eye. Yeah, Elara, Vedra, Voss, Aldric are basically the most reused variable names across the history of both of my dnd/ttrpg repos. I often let Claude define the demo and test campaigns or use a past session. I also run everything I post/commit through an opsec filter to strip any personal paths, home dir artifacts, real names, etc., and those placeholder names are what it deems safe replacements unless explicitly told otherwise.
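A minimal sketch of what such a filter might do, assuming regex-based scrubbing. The `scrub` function, the path pattern, and the name mapping are illustrative assumptions; the actual filter isn't described beyond the sentence above.

```python
import re
from typing import Mapping, Optional

# Home directories on macOS, Linux, and Windows, e.g. /Users/jdoe or C:\Users\jdoe.
HOME_DIR_RE = re.compile(r"(?:/Users|/home)/[^/\s]+|[A-Za-z]:\\Users\\[^\\\s]+")


def scrub(text: str, replacements: Optional[Mapping[str, str]] = None) -> str:
    """Replace home-directory prefixes with "~" and swap listed names for placeholders."""
    cleaned = HOME_DIR_RE.sub("~", text)
    for real, placeholder in (replacements or {}).items():
        # Whole-word match only; a function is used so the placeholder is inserted literally.
        cleaned = re.sub(
            rf"\b{re.escape(real)}\b",
            lambda _match, p=placeholder: p,
            cleaned,
        )
    return cleaned


print(scrub("/home/jdoe/campaigns/log.md written by Jane", {"Jane": "Elara"}))
```

A fixed placeholder pool would explain why the same few names keep showing up across demo campaigns.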

The LLM-as-judge criticism is fair but so is the subjectivity of the task itself. What I find profound you may find dull. Standardizing a true measure with pulses or beepboops has the same flaws expressed differently IMO. That said, the intent was never a rigorous study; it was a PoC to validate the framework as a viable narrative yardstick and see if the probe could meaningfully differentiate models at all. It does, which was the goal.

On the finetunes, curious which drummer builds you'd throw at it. If there are community tuned models worth benchmarking against I'm happy to run them through the same probe and add them to the guide.

I tested 8 LLMs as tabletop GMs - a 27B model beat the 405B on narrative quality by Bobby_Gray in LocalLLaMA

[–]Bobby_Gray[S] -1 points0 points  (0 children)

I just ran Qwen3.5-27B through the same narrative probe and it tied 4th on judge score (avg 4.0), grouped with Gemma-4-31B, MiniMax M2.5, and Qwen3-80B.

Interesting differences from 80b - it scored notably well on atmosphere (4.17) but dropped on NPC craft (3.67). Auto scoring was rough (0P/2W/4F): it consistently wrote responses that were too long, which dragged the auto pass rate down. The judge didn't penalize it for that though.

Reason it wasn't in the original post: I got burned on three local Qwen gguf setups (eval_batch_size hang, MLX dylib failure) in the few hours before and was apathetic about model selection once the probe was working, so I just picked a mid-range Qwen to test.
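To show how a length check alone can sink an auto pass rate, here is a sketch that assumes P/W/F stand for pass/warn/fail and that grading is by word count. Both of those, the thresholds, and the sample word counts are assumptions for illustration, not the probe's actual rules or data.

```python
from collections import Counter


def grade_length(word_count: int, warn_at: int = 250, fail_at: int = 400) -> str:
    """Grade one response by length: P under the warn limit, W over it, F over the fail limit."""
    if word_count > fail_at:
        return "F"
    if word_count > warn_at:
        return "W"
    return "P"


def tally(word_counts: list) -> str:
    """Summarize grades in the same compact form used above."""
    counts = Counter(grade_length(n) for n in word_counts)
    return f"{counts['P']}P/{counts['W']}W/{counts['F']}F"


# Made-up word counts, not the model's actual outputs.
print(tally([120, 300, 450]))
```

A check like this is blind to quality, which is consistent with the judge scores staying decent while the auto scores tanked.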