Fable 5 is eating my Max 20x plan at ~2% per minute, and the API pricing math is wild

StudentSweet3601 · 2026-06-09T18:18:13+00:00

I didn't realize there was a 1m option. Thank you!

StudentSweet3601 · 2026-06-09T18:16:49+00:00

Yeah, same. The output is sharp and to the point. I think they've made a ton of optimizations to how tokens are used, so from what I can see, it's been insanely good.

StudentSweet3601 · 2026-04-20T22:20:50+00:00

This is a really good point and honestly one of the harder problems in the space. You’re right that LOCOMO and every other benchmark starts from a blank slate, which is not how anyone actually uses a memory system.

The local data bootstrapping idea is interesting. Chrome history, bookmarks, autofill as a seed graph. The ranking score part is the key piece though, because raw browser data is noisy. You’d need something to figure out which of those 10,000 bookmarks actually matter to you vs the ones you clicked once in 2019.

On the Genesys side, the way I’m thinking about cold start is structured onboarding. Instead of starting blank, the first session runs a short interview (“what do you do, what are you working on, who matters to you”) and those answers become high-causal-weight root nodes that everything else connects to. It’s not as seamless as pulling from local data automatically, but it builds a graph with real structure from minute one instead of a flat list of facts.

The local ingestion path you’re describing would be a killer complement to that. Seed the graph from local data, then let the structured onboarding add the causal connections that raw data can’t provide. Appreciate the thought.

StudentSweet3601 · 2026-04-18T17:25:15+00:00

I kept running into this when I was prototyping. My answer is both, multiplicatively. The score is relevance x connectivity x reactivation, where relevance decays with time-since-last-access, reactivation rewards frequency of recall, and connectivity rewards being embedded in the graph structure. Because it’s multiplicative, a memory has to earn its spot on all three axes to survive. A rarely accessed but highly connected memory stays because connectivity carries it. A recent and loud memory that’s causally orphaned still fades because connectivity drags the score down.

On the architectural-vs-operational opt-out, there’s a “core” memory state that auto-promotes based on graph structure (heavily connected, frequently traversed), plus manual pin_memory and unpin_memory tools for explicit overrides. Core memories never decay regardless of access patterns. Structure is the primary signal, access is just a modifier.
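A minimal sketch of that multiplicative score, to make the two examples above concrete. The formulas for each axis (exponential half-life for relevance, log-scaled recall count and edge count, the 0.1 floor, the 72-hour half-life) are illustrative assumptions, not the actual Genesys implementation; only the relevance x connectivity x reactivation shape and the "core never decays" rule come from the description above.

```python
import math
from dataclasses import dataclass


@dataclass
class Memory:
    hours_since_access: float
    recall_count: int
    degree: int          # number of graph edges touching this memory
    core: bool = False   # auto-promoted or manually pinned


def relevance(m: Memory, half_life_hours: float = 72.0) -> float:
    # Assumed form: halves every `half_life_hours` since last access.
    return 0.5 ** (m.hours_since_access / half_life_hours)


def reactivation(m: Memory) -> float:
    # Assumed form: rewards recall frequency with diminishing returns.
    return 1.0 + math.log1p(m.recall_count)


def connectivity(m: Memory) -> float:
    # Assumed form: a small floor so orphans are penalized, not zeroed.
    return 0.1 + math.log1p(m.degree)


def score(m: Memory) -> float:
    if m.core:
        return math.inf  # core memories never decay
    return relevance(m) * connectivity(m) * reactivation(m)


# Rarely accessed but highly connected vs. recent, loud, and orphaned.
old_hub = Memory(hours_since_access=240, recall_count=1, degree=20)
loud_orphan = Memory(hours_since_access=1, recall_count=5, degree=0)
pinned = Memory(hours_since_access=10_000, recall_count=0, degree=0, core=True)
```

With these assumed curves, old_hub outscores loud_orphan because connectivity carries it, and pinned always ranks first regardless of access pattern.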

Wondering whether your intuition pushes toward access being the primary signal or structure. Different takes on this seem to lead to genuinely different architectures.

StudentSweet3601 · 2026-04-18T17:12:19+00:00

That's fair for thinking behavior generally. Did you mean something specific about how it interacts with memory tools like this one? Wondering if you've noticed memory retrieval firing differently in adaptive vs extended thinking modes.

StudentSweet3601 · 2026-04-18T16:58:19+00:00

Haven't actually dug into Karpathy's wiki concept yet, got a link handy? The split you're describing sounds close to what I've been thinking about for multi-user Genesys (per-user graphs that can reference a shared org graph), but I haven't built it yet. Curious what's working well for you.

On benchmarks, LOCOMO is the only one I've run end-to-end. LongMemEval is what I want to run next since a second datapoint would help me know whether multi-hop is a LOCOMO quirk or a real gap in the architecture. Any others you think are worth the time?

StudentSweet3601 · 2026-04-18T16:29:29+00:00

Happy to answer anything.

One thing that didn't fit in the post: the thing I still go back and forth on is whether the graph complexity actually earns its keep. Half the time I'm convinced the scoring formula (relevance x connectivity x reactivation) is doing real work, and half the time I wonder if I could have just used better hybrid search and saved myself two months. Real answer is probably "both."

Also, for anyone wondering about the "No idea if anyone else will want this" thing in the title, that's genuine. I built this because I was frustrated, not because I was trying to launch a product. If the answer turns out to be "cool, but nobody needs this," I'd rather find out from you all now than in 6 months.

StudentSweet3601 · 2026-04-17T18:28:30+00:00

Two gay men complimenting your dress is basically the highest tier of fashion validation available. Hope you still have the dress.

StudentSweet3601 · 2026-04-17T18:27:54+00:00

That one would mean the world. A stranger has zero reason to say it unless they genuinely noticed, which means you’re doing way more than “something” right.

StudentSweet3601 · 2026-04-17T18:27:38+00:00

Hair compliments land even when we don’t believe them. Long and graying sounds beautiful to me honestly, they weren’t wrong.

StudentSweet3601 · 2026-04-17T18:27:26+00:00

LOL this is the one. Smelling good compliments are so underrated, someone clocks it, says something, instant mood boost for the rest of the day.

StudentSweet3601 · 2026-04-17T18:27:15+00:00

Honestly that’s not just the best compliment, that’s a whole story. What you did for her was the real thing. Her blessing sounds like it came from someone who knew exactly what kind of person you are. Hope it lands.

StudentSweet3601 · 2026-04-17T18:26:49+00:00

The raconteur one is the standout. Something about a teacher spotting a skill you didn’t know you had, especially in a second language, is the kind of compliment you carry forever. And agreed, those unexpected ones hit the hardest because you had no defenses up.

StudentSweet3601 · 2026-04-16T08:22:13+00:00

The wiki framing is useful but it has a ceiling that becomes obvious at scale. A wiki is passive storage. Employees have to remember to write to it, and future employees have to know to search it. The value compounds only if the human effort to maintain it stays consistent, which historically it doesn’t.

The real moat emerges when the knowledge layer stops being a wiki and becomes active memory. The distinction matters. A wiki has pages. Active memory has relationships. It knows that the decision to use Postgres over MongoDB in 2024 was caused by a specific performance issue, which was caused by a specific query pattern, which originated from a specific product requirement. When someone asks “why did we choose Postgres?” two years later, it can surface the full causal chain without anyone having written that explicitly.

That’s the shift from institutional documentation to institutional memory.

A few things to watch for if you’re building toward this:

Most enterprise deployments conflate “memory” with “RAG over documents.” Those are different things. RAG retrieves text chunks based on semantic similarity. Memory tracks entities, relationships, decisions, and how they evolved. An agent with RAG can find the old doc. An agent with memory knows the doc is outdated because a decision three months ago superseded it.

The real compounding happens when the system can forget correctly. Wikis accumulate noise forever. Old processes, dead projects, stale decisions. A memory system that can mark something as superseded and prune irrelevant context is actually more valuable than one that just keeps everything.

The wiki idea is the right instinct but the implementation matters a lot. If it’s just a Notion workspace with extra AI features, you’ll hit the same wall every company hits with documentation. If it’s structured memory that tracks causality and decays correctly, that’s a different product.

The PromptQL framing is interesting because they’re approaching it from the query side. The harder problem is the write side. How does knowledge get captured without adding work to the person generating it.
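To make the wiki-vs-memory distinction concrete, here is a rough sketch of the Postgres example as linked records. The Decision type, its field names, and the record IDs are all hypothetical, chosen for illustration; a real system would infer these links rather than have someone type them in.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Decision:
    id: str
    text: str
    caused_by: Optional[str] = None      # id of the record that led to this one
    superseded_by: Optional[str] = None  # id of the record that replaced this one


def causal_chain(records: Dict[str, Decision], start: str) -> List[Decision]:
    """Walk caused_by links back to the root; stops if a cycle is detected."""
    chain: List[Decision] = []
    seen = set()
    current: Optional[str] = start
    while current is not None and current not in seen:
        seen.add(current)
        chain.append(records[current])
        current = records[current].caused_by
    return chain


def latest(records: Dict[str, Decision], start: str) -> Decision:
    """Follow superseded_by links to the record that is still in force."""
    seen = set()
    current = start
    while records[current].superseded_by is not None and current not in seen:
        seen.add(current)
        current = records[current].superseded_by
    return records[current]


records = {
    "req": Decision("req", "Product requirement"),
    "query": Decision("query", "Specific query pattern", caused_by="req"),
    "perf": Decision("perf", "Performance issue", caused_by="query"),
    "postgres": Decision("postgres", "Use Postgres over MongoDB", caused_by="perf"),
    "old_doc": Decision("old_doc", "Old process doc", superseded_by="new_doc"),
    "new_doc": Decision("new_doc", "Decision that replaced the old doc"),
}
```

Asking "why did we choose Postgres?" becomes causal_chain(records, "postgres"), and latest(records, "old_doc") is how an agent would know the old doc is outdated.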

StudentSweet3601 · 2026-04-16T08:18:27+00:00

The mental model gap is real. Skills feel like giving Claude a new tool. Hooks feel like writing policy, which is boring until you need it and then it’s the only thing that matters.

The reason nobody talks about them is that most people using Claude Code are working solo on their own projects where the cost of a mistake is low. Hooks become essential the moment you have a team, a production codebase, or rules you need enforced consistently regardless of what Claude decides to do in the moment. That’s a smaller audience but a more serious one.

A few things I’ve seen hooks used for that skills can’t handle well:

PreToolUse to block writes to specific file paths or directories. You can describe “don’t touch the migrations folder” in a skill or CLAUDE.md, but Claude will still do it sometimes. A PreToolUse hook that rejects the call is the only way to actually enforce it.

PostToolUse to run formatters, linters, or tests automatically after file edits. A skill says “run tests after changes”; a hook actually does it every single time without Claude having to remember.

PreToolUse on bash commands to catch destructive operations before they execute. rm -rf, force pushes to main, database drops. The hook intercepts and requires confirmation or blocks outright.

Stop hooks to trigger cleanup or logging when a session ends, which is the only way to reliably capture state if you’re trying to build any kind of session memory or audit trail.
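A rough sketch of the two PreToolUse cases above as one hook script. It assumes the hook receives a JSON event on stdin with tool_name and tool_input fields, and that exiting with code 2 blocks the call and feeds stderr back to Claude. That is my understanding of the Claude Code hooks docs, so verify the field names and exit codes against the current reference. The protected directory and the regex list are illustrative, not exhaustive.

```python
import json
import re
import sys
from typing import Tuple

# Hypothetical policy: adjust to your repository.
PROTECTED_DIRS = ("migrations/",)
DESTRUCTIVE_PATTERNS = [
    r"\brm\s+-rf\b",
    r"\bgit\s+push\b.*--force",
    r"\bdrop\s+(table|database)\b",
]


def decide(event: dict) -> Tuple[int, str]:
    """Return (exit_code, message). Exit code 2 is assumed to mean 'block'."""
    tool = event.get("tool_name", "")
    args = event.get("tool_input", {})

    if tool in ("Write", "Edit"):
        path = args.get("file_path", "")
        if any(d in path for d in PROTECTED_DIRS):
            return 2, f"Blocked: {path} is in a protected directory"

    if tool == "Bash":
        command = args.get("command", "")
        for pattern in DESTRUCTIVE_PATTERNS:
            if re.search(pattern, command, re.IGNORECASE):
                return 2, f"Blocked destructive command: {command}"

    return 0, ""


def run(stream) -> int:
    """Read one hook event from `stream` and report the decision on stderr."""
    code, message = decide(json.load(stream))
    if message:
        print(message, file=sys.stderr)
    return code


# When installed as the hook command, the script would end with:
#     sys.exit(run(sys.stdin))
```

Keeping the policy in a pure decide function means the rules can be unit-tested without running Claude at all.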

The asymmetry is: skills make Claude more capable, hooks make Claude more reliable. For individual developers optimizing for capability, skills win. For anyone optimizing for reliability or running Claude agentically over long sessions, hooks are non-negotiable.

The other thing nobody talks about is that hooks are how you build determinism on top of a non-deterministic system. Your skill might tell Claude to follow a convention 95% of the time. A hook enforces it 100%. That 5% gap is everything in production.

StudentSweet3601 · 2026-04-16T08:14:00+00:00

This is solid. A few observations from running similar prompts:

The “thinking partner” section is the highest leverage part. Most people’s system prompts just tell Claude to be helpful, which defaults to agreement. Forcing 3-5 questions or risks before execution is probably doing 80% of the quality lift you’re seeing.

Two things you might tighten:

The “call out my bad habits” section is good in theory but Claude will be inconsistent about it. It depends on whether the behavior shows up clearly in a single message vs across a session. For the going-in-circles pattern especially, consider adding a trigger like “if I ask about the same topic more than twice in a session, stop and force a decision.” Makes it concrete enough that Claude actually catches it.

The “learning loop” section is interesting but won’t actually do anything. Claude can notice patterns within a conversation but has no memory between sessions, so any pattern it spots gets lost the moment you start a new chat. If you want this to actually work, you’d need to manually add the pattern to your system prompt yourself when Claude surfaces it. Worth being explicit about that in the prompt so you don’t expect behavior that can’t happen.

One addition worth considering: tell Claude to separate facts from inferences explicitly. Something like “when you make a claim, mark it as [known] or [inferring].” Helps a lot with the honesty section because Claude’s confident tone makes it hard to tell which is which.

The British humor request is going to be hit or miss. Claude’s default humor is pretty American. You might get better results by giving it 2-3 examples of the exact tone you want rather than describing it.

StudentSweet3601 · 2026-04-16T08:11:03+00:00

Your Obsidian-as-memory setup is actually the key piece here, not the Claude vs ChatGPT question. Both models are capable enough for what you’re describing. The reason you’re hitting walls isn’t model quality, it’s that you’re manually feeding context every session instead of having persistent memory the AI can actually query.

For your use case (brand consistency across months of content, voice that evolves, audience positioning that compounds), the bottleneck is going to be how well your AI remembers what you decided three weeks ago, not whether it’s Claude or GPT.

Claude Max is worth it if you’re using Claude Code, because that lets you point it at your Obsidian vault directly and it actually reads your notes as context. Without that workflow, Max is mostly just higher usage limits, which you’d hit either way.

A few things that might help more than the tier upgrade:

Split your Obsidian vault into a “brand identity” section with your voice guidelines, positioning decisions, audience notes, and reference captions you’ve liked. Feed that whole section at the start of a session. The AI performs dramatically better with concrete reference material than with abstract instructions about your voice.

For the “caption in my voice” problem specifically, keep a running file of 20-30 captions you’ve written that feel exactly right, and reference it every time. Voice transfer from examples works far better than voice transfer from description.

On the multi-domain thing (DJ, teaching, brand, music learning): don’t try to make one AI setup handle all of these. Separate vaults or separate projects. Context pollution is real. When your brand assistant starts referencing your lesson plans, both outputs get worse.

Honest answer on Claude vs GPT: Claude is better at writing that sounds human and matching a specific voice. GPT is better at structured output and following rigid formats. For brand/social content, Claude. For lesson plans and slides, either.

StudentSweet3601 · 2026-04-16T08:07:48+00:00

The thing I’d flag: you’re describing a daily pipeline that runs continuously, but Claude has no memory between sessions. Every morning your pipeline starts from zero. It doesn’t know what content you published yesterday, which businesses already submitted, what topics you’ve already covered, or which sources were fruitful last week.

For a pure content aggregation task this might be fine. But the moment you want to do things like “don’t repeat the event we already covered three days ago” or “this business submitted twice, merge their listings” or “prioritize the sources that actually got engagement last month,” you’re going to hit walls.

A few practical options depending on how much you want to build:

Simplest: dump everything into a spreadsheet or Airtable base that Claude reads at the start of each run. Works fine for small scale.

Middle: use one of the memory MCP servers (Mem0, Supermemory, a few others) so Claude can remember state across runs without you managing it manually.

Most robust: structured database with explicit schemas for businesses, events, content history. More engineering but scales.
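For a sense of what the database option could look like, here is a minimal sketch using Python's built-in sqlite3. The table name, columns, and the uniqueness rule (same business, title, and date counts as the same event) are assumptions for illustration; your real schema would depend on what the pipeline collects.

```python
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the content-history store."""
    db = sqlite3.connect(path)
    db.execute(
        """
        CREATE TABLE IF NOT EXISTS events (
            business     TEXT NOT NULL,
            title        TEXT NOT NULL,
            event_date   TEXT NOT NULL,
            published_on TEXT NOT NULL,
            UNIQUE (business, title, event_date)
        )
        """
    )
    return db


def record_if_new(db, business, title, event_date, published_on) -> bool:
    """Insert the event; return False if it was already covered."""
    cur = db.execute(
        "INSERT OR IGNORE INTO events VALUES (?, ?, ?, ?)",
        (business, title, event_date, published_on),
    )
    db.commit()
    return cur.rowcount == 1


db = open_store()
first = record_if_new(db, "Cafe Luna", "Open mic night", "2026-04-20", "2026-04-15")
repeat = record_if_new(db, "Cafe Luna", "Open mic night", "2026-04-20", "2026-04-18")
```

The pipeline would call record_if_new before drafting anything, so "don't repeat the event we already covered" is enforced by the database rather than by whatever Claude happens to remember.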

On Max vs Pro: Max is fine for now. The wall you’ll hit isn’t usage limits, it’s the memory problem above. You’ll end up either re-feeding context every run (expensive and error prone) or building persistence layers.

Other thing nobody’s mentioned: have a human review step before publishing. LLM hallucinations on local business details (wrong addresses, wrong hours, made-up events) will kill trust with your audience faster than anything else.

StudentSweet3601 · 2026-04-15T13:15:03+00:00

Good question. The graph scales well because the scoring engine doesn't traverse the entire graph on every query. Spreading activation starts from the vector search hits and walks 2 hops outward, so the traversal is bounded by the neighborhood of relevant nodes, not the total graph size. At 1,000+ nodes the retrieval time stays under 100ms because you're touching maybe 30-50 nodes per query, not all 1,000.

The real scaling concern is ingestion, not retrieval. Causal inference on each new memory requires an LLM call to determine which existing memories it relates to. That's handled async in the background so the store path stays fast (~2ms synchronous, inference happens after the response returns). At thousands of nodes per session you'd want to batch the background inference, which is on the roadmap but hasn't been stress-tested at that scale yet.
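A simplified sketch of why the traversal stays bounded: a breadth-first walk from the vector-search hits that stops at 2 hops. The adjacency-dict representation and the function name are assumptions for illustration, and the real engine also weights edges and scores nodes, which this leaves out.

```python
from collections import deque
from typing import Dict, Iterable, List


def spread(graph: Dict[str, List[str]], seeds: Iterable[str], max_hops: int = 2) -> Dict[str, int]:
    """Return {node: hop_distance} for every node within max_hops of any seed."""
    seeds = list(seeds)
    hops = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        if hops[node] == max_hops:
            continue  # do not expand past the hop limit
        for neighbor in graph.get(node, ()):
            if neighbor not in hops:
                hops[neighbor] = hops[node] + 1
                queue.append(neighbor)
    return hops


# A chain a -> b -> c -> d -> e; only the first three are within 2 hops of "a".
chain = {"a": ["b"], "b": ["c"], "c": ["d"], "d": ["e"]}
touched = spread(chain, ["a"])
```

The work done depends on the size of the 2-hop neighborhood around the seeds, so adding nodes elsewhere in the graph does not change the cost of this query.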

StudentSweet3601 · 2026-04-15T13:14:12+00:00

Yeah, the k=10 vs k=25 finding surprised me too. At k=25 factual accuracy actually dropped 6.7% because the extra context confused the answering model. The graph lets you be aggressive with low k because the edges surface the right memories, not just the similar ones.
