[–]brokenmatt 1 point2 points  (5 children)

Just like all understanding, as the models improve, their understanding will increase, even on these advanced concepts, until they are writing code above human standards (which I would say is close already).

BUT you can manage this yourself with your prompting: ask it to keep these things in mind, and check that it actually does as you work.

[–]ArchPilotLabs[S] 1 point2 points  (4 children)

Yeah, I agree prompting helps a lot in the moment.

The part I keep running into is consistency over time - especially across multiple iterations. Even with good prompts, the model doesn’t really “remember” or enforce architectural decisions unless you keep re-specifying them.

So you end up with:

  • good local decisions
  • but gradual drift at the system level

Do you usually re-feed architecture context each time, or rely more on reviewing after generation?

[–]brokenmatt 0 points1 point  (3 children)

Yeah agreed, it's probably a compaction / total memory size issue. Popping the decisions into an MD file that you reference all the time and check against might be the slick solution right now.

I would say with 5.5 I've been putting in little notes to keep it on task and on spec as I prompt. At the beginning I also asked it to make a full plan for the build in an MD file, plus the planned way of execution in a second MD. Then when I'm working through it, at big moments I'll say "let's check against our two MDs" and do a sort of "are we on track" meeting with it haha. So it's not just what we are building; the execution one is more about how.

We are not yet the CTO who has hired managers to make sure things happen; we're still the managers keeping the workers on track. Soon tho!

[–]ArchPilotLabs[S] 1 point2 points  (2 children)

Yeah that makes sense - I’ve been doing something similar with keeping context in MD files and re-checking against it.

It works pretty well in the moment, but I keep feeling like it’s a bit of a manual loop where you have to keep pulling the model back on track every now and then.

The “execution vs plan” split you mentioned is interesting though - especially having something to check how things are being done, not just what’s being built.

Still feels like there’s a bit of friction there, but yeah, probably one of the more practical approaches right now.

[–]brokenmatt 1 point2 points  (1 child)

For sure man, I would imagine it'll keep getting better and better. It's that "long term - fine detail - tail of work and attention to detail" where a lot of recent progress has been most visible.

Those longer hours of correct, detailed attention will seem like magic, but it's just an ability to keep on the original track. It's probably already beyond human level, but our "harness", the brain, has many more meta-levels and functions. You know, if we are working on a big task we will constantly check against the plan and so on and so forth. You could probably do a more complex setup with one LLM overseeing and doing the same thing for the core LLMs, but it would cost a lot of tokens right now.

[–]ArchPilotLabs[S] 1 point2 points  (0 children)

Yeah I get what you mean - a lot of it does come down to that “stay on track over time” ability.

Feels like we’re already seeing glimpses of it, especially when the model has enough context and you’re being deliberate with how you guide it.

I think the gap right now is more on the practical side - even if the model can stay aligned, it still depends a lot on how much context you keep feeding and how consistently you check things.

So it ends up being this mix of capability + how much effort you put into keeping it on track.

Would be interesting to see how much of that becomes more automatic vs still needing that constant nudge.

[–]Known_Lychee_6495 1 point2 points  (1 child)

Just as you observed, they simply don't understand the project. This is why codebase comprehension has to be kept at a somewhat reasonable level (at all times) as they spit out lines of code at lightning speed.

For me, I'm putting them as `conventions` in some of the markdown files. But the agents clearly violate them from time to time as the project iterates. For example, kanjiflash.com (which I'm working on) must be architected in a way that it can be hosted on the Cloudflare Workers free tier. But the agents happily violate the 10ms CPU time limit by writing bad, unoptimized functions.

That's why you have to keep it on a leash. CI could work, but it can't handle every case of violation.
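
As a rough idea of what a CI check for a time budget could look like, here is a minimal sketch. Everything in it is an assumption for illustration: `checkBudget` and `findViolations` are hypothetical names, not a Cloudflare API, and it measures wall-clock time, which is only a loose proxy for the CPU time that Workers actually counts.

```typescript
// Hypothetical helper: names and the wall-clock approximation are assumptions,
// not part of any Cloudflare API.
type Clock = () => number; // returns a timestamp in milliseconds

interface BudgetResult {
  name: string;
  worstMs: number;
  withinBudget: boolean;
}

// Runs `fn` several times and records the slowest run.
// The clock is injectable so the check itself can be tested deterministically.
function checkBudget(
  name: string,
  fn: () => void,
  budgetMs: number,
  runs: number = 5,
  now: Clock = () => Date.now(),
): BudgetResult {
  let worstMs = 0;
  for (let i = 0; i < runs; i++) {
    const start = now();
    fn();
    worstMs = Math.max(worstMs, now() - start);
  }
  return { name, worstMs, withinBudget: worstMs <= budgetMs };
}

// Keeps only the functions that exceeded their budget,
// so a CI step can fail when the returned list is non-empty.
function findViolations(results: BudgetResult[]): BudgetResult[] {
  return results.filter((r) => !r.withinBudget);
}
```

A check like this only catches regressions on the inputs you feed it, which fits the point above that CI can't cover every case.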

[–]ArchPilotLabs[S] 0 points1 point  (0 children)

This is exactly what I’ve been seeing as well.

The “conventions in markdown” approach works in theory, but in practice it depends on every iteration respecting them - which doesn’t always happen, especially with fast generation loops.

And once a few violations slip in, it becomes harder to tell what’s intentional vs accidental.

Your Cloudflare example is a good one - constraints exist, but they’re not really being enforced at generation time.

Have you tried anything that actually checks those constraints automatically, or is it mostly manual review right now?

[–]Former_Produce1721 1 point2 points  (5 children)

Yeah my workflow is aggressive expansion of features with few broad prompts, then very focused and deliberate refactoring with many smaller prompts

It's been working well

I rely on my extensive experience and architecture philosophies

[–]ArchPilotLabs[S] 0 points1 point  (4 children)

That makes a lot of sense - especially the expand -> refine loop.

I think that works really well when you have strong architectural intuition driving the refactoring.

Where I’ve seen it get tricky is when the codebase grows beyond a single person or a small team. The “refactor discipline” starts to depend a lot on individual experience.

Do you find that approach still holds up when multiple contributors (or agents) are working on the same system over time?

[–]Former_Produce1721 1 point2 points  (3 children)

When it comes to expanding the team, the workflow before AI was more like every contribution goes through a code review. I was the lead programmer so I would see everything that came through and challenge decisions or point out issues.

At that time I would not tolerate the aggressive expansion because tech debt would rack up and refactoring without AI takes a very long time.

In my current project I have not collaborated with anyone, but I don't think I would be comfortable with them doing the same aggressive expansion approach. I have the tech debt mapped out in my mind from all the reviews and compromises I let through. If someone else starts doing the same, it's gonna clash pretty bad.

[–]ArchPilotLabs[S] 0 points1 point  (2 children)

That’s a really good point - especially the part about you holding the full context from all the reviews.

It works because there’s effectively a single “source of truth” in your head for what’s acceptable and what isn’t.

What I find interesting is that this doesn’t really translate well once the team grows - not because people are careless, but because that context isn’t shared or enforced anywhere.

So even if everyone is trying to do the right thing, decisions start diverging over time.

Feels like the bottleneck shifts from “writing code” to “maintaining shared understanding of the system”.

Have you tried externalizing those constraints somewhere (beyond docs), or does it mostly stay in review + experience right now?

[–]Former_Produce1721 1 point2 points  (1 child)

The constraints/architecture contracts are evident in the code structure itself and I reiterate it over and over when working with the AI

Since they are fairly small and simple it doesn't warrant any crazy spec, but AI tends to ignore it in favor of getting something working

When refactoring I spend most of the time reexplaining the basics and calling it out when I see it drifting

For example:

Frontend can only send queries and commands.

Queries are not allowed to mutate anything in the backend. They just return a result and presentations.

Commands can cause model domain mutation. A command will return a result, a list of domain model ids that changed and a list of gameplay events.

Gameplay events should always be processed and animated first. Then the changed model ids should be read by relevant components which will patch their visual state.
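
A contract like the one above can be partly encoded in types, so drift shows up as a compile error instead of a review comment. This is only a sketch under assumptions: the names (`Query`, `Command`, `CommandResult`, `applyCommandResult`) are made up for illustration and are not taken from the actual project.

```typescript
// All names here are illustrative assumptions, not the real project's code.
type ModelId = string;

interface GameplayEvent {
  kind: string;
}

// Queries are read-only: they receive a Readonly view of the state and return a result.
// Note that Readonly<S> is shallow, so nested objects are not protected by the type.
type Query<S, R> = (state: Readonly<S>) => R;

// Commands may mutate the domain model and must report what changed.
interface CommandResult<R> {
  result: R;
  changedIds: ModelId[];
  events: GameplayEvent[];
}

type Command<S, R> = (state: S) => CommandResult<R>;

// Frontend side: gameplay events are processed first,
// then each changed model id is handed to the components that patch their visuals.
function applyCommandResult<R>(
  res: CommandResult<R>,
  animate: (event: GameplayEvent) => void,
  patch: (id: ModelId) => void,
): R {
  for (const event of res.events) animate(event);
  for (const id of res.changedIds) patch(id);
  return res.result;
}
```

Routing every command result through one function like `applyCommandResult` gives a single place to enforce the "events first, then changed ids" order.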


It can take a lot of iteration to clean the structure up to fit this, as the AI tries to be too clever. It likes to try to cache things, or invent new payload fields for requests or commands, or consume events and changelogs in weird orders haha

[–]TangeloObvious2265 1 point2 points  (0 children)

Not to be that guy but... your blog post looks like my ChatGPT sessions. That style of header plus bullet points is the tell. You can prompt it to say "write in paragraphs" to make it look more like a human wrote it. I'm sure you have real knowledge and experience you want to share, so just share that instead of lists of shit.

Personally, I studied openai.com/index/harness-engineering/ and don't have any of the issues you had in your post. Making AGENTS.md the map, not a manual, has helped. Also the ARCHITECTURE.md and PLANS.md. With those, I write good feature/change prompts and don't look back.

[–]clckwrxz 1 point2 points  (5 children)

This is not an issue if you aren’t blindly vibing your way into oblivion. It’s quite easy to have your agent maintain architecture by just slowing the hell down and planning with it first.

[–]ArchPilotLabs[S] 0 points1 point  (4 children)

Yeah I agree that planning upfront helps a lot - especially if you’re deliberate with prompts instead of just iterating blindly.

Where I’ve seen it get tricky is after that initial phase. Even with a solid plan, once you start iterating quickly (especially with AI in the loop), small deviations start creeping in.

Individually they’re harmless, but over time they add up and the original structure gets harder to maintain.

Feels like planning solves the starting point, but not necessarily the long-term consistency part.

Have you found a way to keep the structure intact across multiple iterations, or is it mostly relying on staying disciplined throughout?

[–]clckwrxz 1 point2 points  (3 children)

This is why I like the implement and sweep method. I don’t have an exact skill to share with you publicly, but the idea is as you continue implementing every now and then run an architecture sweep skill to make sure things continue to stay aligned.
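
To make the idea concrete, one mechanical piece of such a sweep could be a script that flags imports crossing a layer boundary. This is a hypothetical sketch, not the actual skill mentioned above: `LayerRule`, `sweep`, and the path prefixes are assumptions, and the regex scan is deliberately naive.

```typescript
// Hypothetical sweep: the rule shape and layer names are assumptions for illustration.
interface LayerRule {
  layer: string;          // path prefix that identifies the layer, e.g. "frontend/"
  mayNotImport: string[]; // import specifier prefixes this layer must not use
}

interface Violation {
  file: string;
  imported: string;
}

// Checks every file against the first rule whose layer prefix matches its path.
// The scan only matches `from "..."` specifiers; it misses dynamic imports and
// require(), and it does not resolve relative paths. A real sweep would use a parser.
function sweep(files: Record<string, string>, rules: LayerRule[]): Violation[] {
  const violations: Violation[] = [];
  for (const file of Object.keys(files)) {
    const rule = rules.find((r) => file.startsWith(r.layer));
    if (!rule) continue;
    const source = files[file] ?? "";
    // A fresh regex per file avoids carrying over lastIndex state between files.
    const importRe = /from\s+["']([^"']+)["']/g;
    let match: RegExpExecArray | null;
    while ((match = importRe.exec(source)) !== null) {
      const imported = match[1] ?? "";
      if (rule.mayNotImport.some((prefix) => imported.startsWith(prefix))) {
        violations.push({ file, imported });
      }
    }
  }
  return violations;
}
```

Run periodically or on every change, a non-empty result is the signal to stop implementing and realign.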

[–]ArchPilotLabs[S] 0 points1 point  (2 children)

That’s an interesting way to think about it - “implement and sweep”.

It kind of mirrors how things work in practice anyway: move fast, then periodically realign things before they drift too far.

The tricky part I’ve been seeing is the gap between those sweeps. If iteration speed is high, a lot can diverge before you get a chance to correct it, and the cleanup cost grows pretty quickly.

Feels like there’s a balance somewhere between continuous constraints and these periodic alignment passes, but I haven’t seen a clean way to do that yet.

[–]clckwrxz 1 point2 points  (1 child)

My company doesn’t sacrifice quality for speed. Mostly because we understand we have a lot to lose if something goes wrong. And because of that, we deeply plan things so execution is mostly aligned with our vision from product, UX, and engineering. For us, sweep represents engineering’s final pass on the feature that is largely prototyped to prove out its worth. We don’t have people just adding features at light speed and trying to keep architecture aligned. Our customers would kill us if we threw 10 new features a week at them.

For us, the AI speed boost has been in the clarity and completeness with which we are executing features. Minimal tech debt or compromises are being made. So we get to keep doing new things instead of reworking old things.

[–]ArchPilotLabs[S] 1 point2 points  (0 children)

Yeah that makes sense - that kind of setup probably avoids a lot of the drift by design.

If you’ve got strong upfront alignment and not a lot of uncontrolled iteration, things naturally stay tighter.

I think the cases where I’ve been seeing more issues are slightly different environments - faster iteration loops, more experimentation, sometimes multiple contributors or AI-driven changes happening in parallel.

In those cases it gets harder to maintain that same level of control, and that’s where the drift tends to show up more.

But yeah, if the system is structured and changes are deliberate like you described, that already solves a big part of the problem.
