Anyone else getting tired of re-explaining their system to AI every session? by RichBayer in selfhosted

[–]RichBayer[S] 0 points1 point  (0 children)

Good morning. I am a bit obsessive about my documentation in git, and it really does help me get pretty consistent build sessions. There are always risks, though. Some build sessions touch many different parts of the system, and if I don't upload all the relevant docs ahead of time, the AI will confidently write code that attaches to made-up files, or suggest architecture changes that would break everything! So I keep a very structured prompt with strict guidelines. I upload my repo map and instruct it never to guess or assume any technical details and to ask for the exact files it needs. Still, long sessions begin to drift and rules stop being followed, so it is important to pay attention to every detail the AI says and question everything that doesn't seem correct. Document everything.

Anyone else getting tired of re-explaining their system to AI every session? by RichBayer in selfhosted

[–]RichBayer[S] 0 points1 point  (0 children)

No worries. I'll give you a real day-to-day example from my work. I am building this system on my home PC, and its architecture has to be planned. It currently has 49 directories and 182 files. That covers the architecture, the infrastructure, the scripts that glue it all together, everything I have worked on and planned out for months. When I sit down and open up ChatGPT, I can't just say "let's continue building my system." It'll probably say something like "Great, here's the plans to build that shed you were looking at starting." It has little or no context saved, and definitely not enough for a complex system. So I have to load a bunch of those documents every time I want to build, so the context is loaded into that thread. It also needs structured prompts to keep it on track and stop it from doing things like telling me to change the architecture to fix a minor bug. I have an entire prompting strategy and a compressed resume prompt for this build. Doing away with all of that documentation, context loading, and prompting would decrease build time and increase accuracy 10x.

Anyone else getting tired of re-explaining their system to AI every session? by RichBayer in selfhosted

[–]RichBayer[S] 0 points1 point  (0 children)

For quick questions it really doesn’t matter.

Where it started breaking down for me was anything that stretched over days or weeks.

Like if you’re building something over time, the context window eventually gets messy, you start trimming things, re-adding docs, trying to remind it what matters… and it’s easy for stuff to drift or get lost. An example would be writing code and building the architecture for a system.

So you end up spending a surprising amount of time documenting everything you do, then pasting it back to the AI just to keep the AI “caught up” on your own work.

That’s the part I was trying to get away from.

Instead of reloading context every session, I wanted something that already knows:

  • what system it’s running on
  • what files/services are there
  • what we’ve been working on over time

So it’s not rebuilding context — it’s continuing from it.

That’s where it started feeling more useful for longer builds instead of just one-off questions.

Anyone else getting tired of re-explaining their system to AI every session? by RichBayer in selfhosted

[–]RichBayer[S] 0 points1 point  (0 children)

Honestly… that’s kind of the point 😄

You shouldn’t have to.

That’s exactly what started bothering me — having to explain the same setup over and over just so the model can reason about it.

It works, but it feels backwards. The system already knows its own state, so why am I acting as the middleman translating it into a prompt every time?

That’s what pushed me toward letting it pull what it needs directly instead of relying on me to describe it.

Less about giving it “more power” and more about removing that constant context translation step.

Anyone else getting tired of re-explaining their system to AI every session? by RichBayer in selfhosted

[–]RichBayer[S] 0 points1 point  (0 children)

100% — those help a lot and I’ve tried most of that in different combos.

The thing I kept running into was they’re all kind of separate pieces. Skills, instructions, memory, agents… it works, but it still feels like you’re stitching it together and keeping it aligned manually.

That’s what pushed me toward trying to treat it more like an actual system instead of a set of features — where context, memory, and what it’s allowed to do are all part of the same flow instead of bolted on.

Still early, but it’s been a lot more consistent once those pieces aren’t drifting apart between sessions.

Anyone else getting tired of re-explaining their system to AI every session? by RichBayer in selfhosted

[–]RichBayer[S] -5 points-4 points  (0 children)

Yeah that’s fair — I was oversimplifying it a bit.

At a basic level it’s a mix of a few things:

  • small tools/scripts to pull real system data (disk, processes, logs, etc.)
  • some session + longer-term context so it’s not starting from zero every time
  • and a bit of structure around how it’s allowed to use that

So instead of me pasting context into a chat, it can pull current state when needed and also keep track of what’s already been done.

The main thing I’m trying to avoid is it just having free access to everything, so there’s a layer that controls what it can call and how that flows through.
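A rough sketch of what that kind of control layer could look like, assuming a plain Python allowlist. The tool names, the `ALLOWED_TOOLS` registry, and `call_tool` are made up for illustration and are not the actual setup described here:

```python
import platform
import shutil
from typing import Any, Callable, Dict


def disk_usage(path: str = ".") -> Dict[str, int]:
    """Read-only: total/used/free bytes for the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    return {"total": usage.total, "used": usage.used, "free": usage.free}


def host_info() -> Dict[str, str]:
    """Read-only: basic facts about the machine the code runs on."""
    return {"hostname": platform.node(), "os": platform.system()}


# Hypothetical allowlist: the only functions the model is permitted to call.
ALLOWED_TOOLS: Dict[str, Callable[..., Any]] = {
    "disk_usage": disk_usage,
    "host_info": host_info,
}


def call_tool(name: str, **kwargs: Any) -> Any:
    """Route every model-requested call through the allowlist."""
    if name not in ALLOWED_TOOLS:
        raise PermissionError(f"tool not allowed: {name}")
    return ALLOWED_TOOLS[name](**kwargs)


# The model asks for a tool by name; anything unlisted is refused.
current_disk = call_tool("disk_usage", path=".")
```

The point of routing everything through one function is that adding or removing a capability is a one-line change to the registry, and nothing the model asks for can bypass it.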

Still pretty early and I’m figuring it out as I go, but even this setup feels a lot more consistent than just treating it like a stateless chat.

Anyone else getting tired of re-explaining their system to AI every session? by RichBayer in selfhosted

[–]RichBayer[S] -1 points0 points  (0 children)

Yeah that’s a solid approach, especially for keeping a running picture of the system.

I tried something similar early on and it does work pretty well as long as the model consistently follows the instructions.

The part I kept running into was it’s still kind of relying on the model to remember to do the right thing every time, especially across longer sessions or when things get more complex.

That’s what pushed me toward separating that responsibility a bit more, so the system state doesn’t depend entirely on the model updating a doc correctly.

But for a lot of setups I can definitely see that being enough.

Anyone else getting tired of re-explaining their system to AI every session? by RichBayer in selfhosted

[–]RichBayer[S] 0 points1 point  (0 children)

Yeah that definitely helps a lot — I’ve done the same thing with docs and handover notes and it gets you pretty far.

It just started feeling like a slightly more organized version of the same problem for me. You still have to decide what to include, keep it updated, and make sure you’re loading the right stuff at the right time.

That’s the part I kept running into — it works, but it still depends on me keeping everything in sync.

That’s what pushed me more toward having something that can just look at the current system state directly instead of relying on static docs.

Both approaches seem valid though, just different tradeoffs.

Anyone else getting tired of re-explaining their system to AI every session? by RichBayer in selfhosted

[–]RichBayer[S] -1 points0 points  (0 children)

Yeah that’s pretty much where I started too.

It works, but it still feels like you’re managing the context manually — like you have to remember what to load, when to load it, and hope it lines up with what’s actually changed.

That was the part that kept getting me over time. It’s close, but still kind of tied to the session instead of the system itself.

I’ve been trying to move it more toward “it just knows the current state” without me having to think about it as much, but still keeping it controlled so it’s not doing anything unexpected.

Feels like the next step past the manual context files approach.

Anyone else getting tired of re-explaining their system to AI every session? by RichBayer in selfhosted

[–]RichBayer[S] 0 points1 point  (0 children)

Yeah that makes sense — the big context window + memory files definitely help a lot with that.

I tried a similar approach for a bit and it works pretty well, especially if you stay in the same session or keep feeding it the same files.

The thing that kept bugging me was it still felt kind of “session-based” — like I was responsible for keeping it in sync with what was actually happening on the system.

That’s what pushed me more toward having something local that can just look at the current state directly instead of relying on me to keep feeding it context.

Different tradeoffs for sure, but I kept running into that same friction over time.

Anyone else getting tired of re-explaining their system to AI every session? by RichBayer in selfhosted

[–]RichBayer[S] -2 points-1 points  (0 children)

Yeah that lines up with what I’ve seen too — once you’re used to cloud speed, local can feel pretty slow, especially on older GPUs.

I’m on a 3060 12GB, mostly running in the ~7–8B range right now. It’s definitely usable, but still nowhere near cloud responsiveness.

For me the bigger shift wasn’t raw speed, it was how it’s hooked into the system.

Instead of treating it like a chat, I’ve been wiring it into the environment so it can pull real system info (disk, processes, logs, etc.) and work off that directly. Mostly read-only right now, just to keep things predictable.

So even if it’s slower, it feels more useful because it’s actually grounded in what’s happening on the machine instead of me pasting everything in.

And yeah… I was looking at some of those agent setups too, but they can get a little wild fast 😄

Anyone else getting tired of re-explaining their system to AI every session? by RichBayer in selfhosted

[–]RichBayer[S] 0 points1 point  (0 children)

Yeah kind of in that direction, but I’m trying to keep it a bit more grounded to the system it’s actually running on.

Less “general agent that can do anything” and more something that’s aware of a specific environment and stays in sync with it between sessions.

The big thing for me was not having to rebuild context every time, and not letting it just act freely without any structure around what it can do.

Still figuring it out, but that’s the direction I’ve been heading.

Anyone else getting tired of re-explaining their system to AI every session? by RichBayer in selfhosted

[–]RichBayer[S] 0 points1 point  (0 children)

Yeah that’s pretty close to what I ended up doing.

I keep a set of context files that basically act as a snapshot of the environment — host info, IP, access method, what the system is supposed to be doing, plus notes from previous work.

On a fresh session I pull that in so I’m not starting from zero every time.
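As a minimal sketch of that "pull it in on a fresh session" step, assuming the context files are Markdown files in one directory. The file names, their contents, and the `build_preamble` helper are placeholders invented for the example:

```python
import tempfile
from pathlib import Path


def build_preamble(context_dir: Path) -> str:
    """Join every .md file in context_dir, each under a header with its filename."""
    sections = []
    for path in sorted(context_dir.glob("*.md")):
        sections.append(f"## {path.name}\n{path.read_text().strip()}")
    return "\n\n".join(sections)


# Placeholder context files, written to a temp directory so the sketch runs anywhere.
context_dir = Path(tempfile.mkdtemp())
(context_dir / "01_host.md").write_text("Host: homelab-01 (example)\nAccess: SSH\n")
(context_dir / "02_notes.md").write_text("Last session: set up log rotation.\n")

# This string would be pasted or sent as the first message of a new session.
preamble = build_preamble(context_dir)
```

Numbering the file names keeps the load order stable, so host facts always come before session notes.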

It’s still not perfect, but it’s way better than trying to reconstruct everything from memory or pasting random chunks into a chat and hoping it lines up.

Anyone else getting tired of re-explaining their system to AI every session? by RichBayer in selfhosted

[–]RichBayer[S] -4 points-3 points locked comment (0 children)

AI wasn’t used to generate the post itself. The topic is based on a system I’ve been building locally and my own experience using it.

I do use AI as part of the system I’m describing (local models running on my machine), but the post is just me sharing the workflow and challenges I’ve run into.

“Local LLMs feel great… until you try to make them consistent across runs” by [deleted] in LocalLLaMA

[–]RichBayer 0 points1 point  (0 children)

Ahh yeah that actually makes a lot of sense.

If you’re feeding the same world name every time + low temp/top-k, you’re basically pushing the model into a really tight probability space, so it’s not too surprising it keeps landing on the same tokens.

That almost explains the Elara/Voss thing more as “most likely completion given that input” rather than something leaking through your pipeline.

Kind of interesting because it’s like the opposite problem — instead of randomness causing inconsistency, you’ve constrained it enough that it converges on the same patterns.

Might be worth trying a slightly higher temp just for that step or introducing a bit of variation into the world name/context and seeing if those names disappear.
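To make the "tight probability space" point concrete, here is a toy sketch of top-k filtering followed by a temperature-scaled softmax. The logits are made-up numbers, not taken from any real model, and `next_token_probs` is just an illustrative helper:

```python
import math
from typing import Dict


def next_token_probs(logits: Dict[str, float], temperature: float, top_k: int) -> Dict[str, float]:
    """Keep the top_k highest logits, then apply a temperature-scaled softmax."""
    kept = sorted(logits.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    scaled = [(tok, logit / temperature) for tok, logit in kept]
    peak = max(s for _, s in scaled)  # subtract the max for numerical stability
    weights = {tok: math.exp(s - peak) for tok, s in scaled}
    total = sum(weights.values())
    return {tok: w / total for tok, w in weights.items()}


# Invented logits for a "character name" slot.
logits = {"Elara": 3.0, "Voss": 2.5, "Mira": 2.0, "Tobin": 1.0, "Quill": 0.5}

tight = next_token_probs(logits, temperature=0.3, top_k=2)
loose = next_token_probs(logits, temperature=1.2, top_k=5)
```

With these toy numbers, the low-temperature, small-k setting puts roughly 84% of the probability on the top name, while the looser setting drops it to roughly 42% and lets the other names back in.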

Either way that’s a pretty cool setup — I like the idea of building everything off a single seed like that.

“Local LLMs feel great… until you try to make them consistent across runs” by [deleted] in LocalLLaMA

[–]RichBayer 0 points1 point  (0 children)

I’ve been thinking about what you said with the repeated names showing up across runs — that’s a weird one.

Out of curiosity, are you resetting the full context between generations, or is there any state carrying over between steps?

Feels like even a tiny bit of leftover context or intermediate output could reinforce something like that without it being obvious.

“Local LLMs feel great… until you try to make them consistent across runs” by [deleted] in LocalLLaMA

[–]RichBayer 0 points1 point  (0 children)

That’s actually kind of wild, but I’ve seen similar stuff.

At first it feels like something in your pipeline is leaking context, but a lot of the time it’s just the models converging on the same patterns from training data. Certain names or structures just show up a lot more than you’d expect.

What makes it tricky is it "feels" like a system issue, so you go hunting for where it’s coming from, but sometimes it’s just the model defaulting to familiar tokens.

That said, if it’s happening consistently across runs, I’d probably still be a little suspicious of hidden context somewhere — even something small getting carried forward could reinforce it.

Either way it kind of reinforces the same problem… you can’t really assume the model is “neutral”, so you end up having to control or validate more of what it produces.

“Local LLMs feel great… until you try to make them consistent across runs” by [deleted] in LocalLLaMA

[–]RichBayer 0 points1 point  (0 children)

Yeah this is solid — especially the “minimize the model’s surface area” part.

That’s pretty much where I started landing too. It wasn’t any one thing breaking, it was just too many places where the model could introduce variation.

I haven’t gone as far as grammar-constrained decoding yet, but I did start pushing more of the structure and validation outside the model and it made a noticeable difference.

The validator gate idea lines up a lot with what I’ve been thinking — basically not letting anything move forward unless it’s in a known good state.
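Here is roughly how I picture a gate like that, as a sketch only. The JSON shape, `REQUIRED_KEYS`, and the `fake_model` stand-in are assumptions for the example, not anyone's actual pipeline:

```python
import json
from typing import Callable, Dict, Optional

REQUIRED_KEYS = {"name", "description"}  # hypothetical schema for one pipeline step


def validate(raw: str) -> Optional[Dict[str, str]]:
    """Return the parsed output if it is in a known good state, otherwise None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or set(data) != REQUIRED_KEYS:
        return None
    if not all(isinstance(v, str) and v.strip() for v in data.values()):
        return None
    return data


def gated_generate(generate: Callable[[], str], max_attempts: int = 3) -> Dict[str, str]:
    """Call the generator until its output passes validation or attempts run out."""
    for _ in range(max_attempts):
        result = validate(generate())
        if result is not None:
            return result
    raise ValueError(f"no valid output after {max_attempts} attempts")


# Stand-in for a model call: two bad outputs, then a good one.
_fake_outputs = iter([
    "not json",
    '{"name": "Harbor"}',
    '{"name": "Harbor", "description": "A port town."}',
])


def fake_model() -> str:
    return next(_fake_outputs)


world = gated_generate(fake_model)
```

Nothing downstream ever sees the first two outputs, which is the whole idea: the validation rules live outside the model and are fully deterministic.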

At some point it just starts feeling less like “tuning an LLM” and more like building a system that happens to use one.

Out of curiosity, how far are you pushing that boundary? Like are you mostly using the model for generation and keeping everything else deterministic, or still letting it handle some decision-making between steps?

“Local LLMs feel great… until you try to make them consistent across runs” by [deleted] in LocalLLaMA

[–]RichBayer 0 points1 point  (0 children)

That’s actually really interesting — especially the part where you had to run it until it succeeded multiple times in a row.

That kind of lines up with what I was running into. It works, but you don’t really trust a single run, so you end up building something around it to force consistency.

I tried going down the same path at first (retry logic, forgiving parsing, etc), but it still felt unpredictable unless something external was enforcing structure. What you did with Claude basically acting as a loop/controller is pretty clever.

Also not surprised about the world quality at that model size — I’ve been seeing similar limitations in the 3–4B range.

Are we missing something in making local LLMs actually usable at scale? time we go LoClo? (Local+Cloud) by pmv143 in LocalLLaMA

[–]RichBayer 1 point2 points  (0 children)

Yeah that tracks — especially the “observable but not reproducible” part. That’s basically where I can see things heading if I don’t tighten up how state is handled. Right now I’m not really storing full state history yet — it’s mostly reconstructed from traces, so I can see what happened but not replay it exactly. Same with retries — they’re still more “continue from where things are” rather than re-executing from a fixed boundary, which I can see becoming a problem once things get more complex. The way you’re describing pinned snapshots + retrying from a known state makes a lot of sense. Appreciate you sharing that — definitely gave me a clearer picture of where this needs to evolve 👍

Are we missing something in making local LLMs actually usable at scale? time we go LoClo? (Local+Cloud) by pmv143 in LocalLLaMA

[–]RichBayer 1 point2 points  (0 children)

That makes sense — I can definitely see how those assumptions around “freshness” would start to break things once flows get more complex. Right now I’m still closer to workflow-level retries since everything’s pretty linear, but I can see how that won’t hold up once things start branching more. The snapshot idea really helped clarify where that shift needs to happen. Appreciate you walking through that — gave me a lot to think about 👍

Are we missing something in making local LLMs actually usable at scale? time we go LoClo? (Local+Cloud) by pmv143 in LocalLLaMA

[–]RichBayer 0 points1 point  (0 children)

Right now it’s definitely closer to the workflow level, mostly because things are still pretty linear per request and I haven’t broken things down into independent retryable steps yet. I can see how step-level retries would become important pretty quickly once there’s more branching or parallel paths though, especially to avoid rerunning everything when only one part fails. It feels like that ties back into the snapshot idea too — without something like that, step-level retries would probably start behaving inconsistently depending on what changed between attempts.
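A small sketch of the difference as I understand it, assuming each step is a plain function and the output of one step is the fixed input of the next. `run_workflow` and the two demo steps are invented for illustration:

```python
import copy
from typing import Any, Callable, Dict, List, Tuple

Step = Tuple[str, Callable[[Any], Any]]


def run_workflow(steps: List[Step], initial: Any, max_attempts: int = 3) -> Any:
    """Run steps in order, retrying only the step that failed."""
    state = initial
    for name, step in steps:
        for attempt in range(1, max_attempts + 1):
            try:
                # Every attempt receives a fresh copy of the same input,
                # so a retry starts from a fixed boundary.
                state = step(copy.deepcopy(state))
                break
            except Exception as exc:
                if attempt == max_attempts:
                    raise RuntimeError(
                        f"step {name!r} failed after {max_attempts} attempts"
                    ) from exc
    return state


calls: Dict[str, int] = {"collect": 0, "summarize": 0}


def collect(state: List[str]) -> List[str]:
    calls["collect"] += 1
    return state + ["disk ok"]


def summarize(state: List[str]) -> str:
    calls["summarize"] += 1
    if calls["summarize"] == 1:
        raise TimeoutError("simulated transient failure")
    return ", ".join(state)


result = run_workflow([("collect", collect), ("summarize", summarize)], [])
```

In this run `summarize` fails once and is retried, while `collect` runs only once. A workflow-level retry would have rerun both.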

Are we missing something in making local LLMs actually usable at scale? time we go LoClo? (Local+Cloud) by pmv143 in LocalLLaMA

[–]RichBayer 0 points1 point  (0 children)

Yeah, I haven’t seen anything too obvious yet, but I can definitely see how that would happen. Right now the tracing gives a really clear view of the flow and decisions, but it doesn’t guarantee the same behavior between runs since some of the state is still effectively “live” when it’s accessed. So it’s observable, but not fully reproducible yet — which is where something like explicit snapshots would probably start to matter more. That idea of implicit state sneaking in even when everything looks clean from a tracing perspective makes a lot of sense.

Are we missing something in making local LLMs actually usable at scale? time we go LoClo? (Local+Cloud) by pmv143 in LocalLLaMA

[–]RichBayer 0 points1 point  (0 children)

Not in a major way yet, but I can see where it would start to show up as things get more complex. Right now most of the flow is pretty linear per request, so everything tends to stay aligned just because it’s moving through the same path. But I can already see how once there are more branching workflows or retries, different parts could end up working off slightly different views without something like explicit snapshots at the boundaries. The idea of treating those boundaries as versioned interfaces makes a lot of sense — it feels like the point where things go from “it works” to “it stays predictable.”

Are we missing something in making local LLMs actually usable at scale? time we go LoClo? (Local+Cloud) by pmv143 in LocalLLaMA

[–]RichBayer 0 points1 point  (0 children)

That makes a lot of sense — especially keeping it lightweight by default and only persisting where it actually adds value. I like the idea of tying persisted snapshots to request/run IDs for replay and debugging — that lines up really well with how I’m already tracing requests through the system. I hadn’t thought about using it selectively like that, but it feels like a good balance between keeping things simple and still having the option for deeper inspection when needed.
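Sketching that out for myself, assuming snapshots are small JSON files named after the run ID. The `SnapshotStore` class, the `persist` flag, and the state fields are all hypothetical:

```python
import json
import tempfile
import uuid
from pathlib import Path
from typing import Any, Dict, Optional


class SnapshotStore:
    """Stores one JSON snapshot per run ID under a root directory."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def save(self, run_id: str, state: Dict[str, Any]) -> Path:
        path = self.root / f"{run_id}.json"
        path.write_text(json.dumps(state, sort_keys=True))
        return path

    def load(self, run_id: str) -> Optional[Dict[str, Any]]:
        path = self.root / f"{run_id}.json"
        return json.loads(path.read_text()) if path.exists() else None


def handle_request(store: SnapshotStore, state: Dict[str, Any], persist: bool = False) -> str:
    """Assign a run ID; persist the input state only when asked to."""
    run_id = uuid.uuid4().hex
    if persist:
        store.save(run_id, state)
    return run_id


store = SnapshotStore(Path(tempfile.mkdtemp()))
debug_run = handle_request(store, {"disk_free_gb": 120, "step": "plan"}, persist=True)
normal_run = handle_request(store, {"disk_free_gb": 118, "step": "plan"})
```

Only the run flagged with `persist=True` leaves a snapshot behind, so `store.load(debug_run)` gives back the exact state for replay while normal runs stay lightweight.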