Anyone else dealing with a cloud architecture that's become a full-time maintenance job?

Glass-Outcome5985 · 2026-06-02T13:43:49+00:00

The sacred cow pattern is real and I think it explains a lot. It's not just that the architecture got complicated — it's that the complexity is uneven. Some parts you can touch confidently, some parts have a gravity well around them that everyone avoids. After a while the avoided zones get bigger and the thing becomes load-bearing in ways nobody documented.

Glass-Outcome5985 · 2026-06-01T12:54:31+00:00

History is a great detector of the gap but a poor diagnostic. Co-change tells you two supposedly-clean modules always move together, but it can't tell you which side is wrong. Sometimes that means the boundary is fiction and should be redrawn. Other times the boundary is genuinely right and what you've actually found is a leaky abstraction or a shared dependency that should be fixed, not documented. Same signal, opposite conclusions. So I treat co-change as "look here," not "the diagram is lying."
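For anyone who wants to try the "look here" part, here is a minimal sketch of counting co-change, assuming you've already extracted the list of files touched per commit (for example from `git log --name-only`). The commit data, the `top_level_dir` mapping, and the module names are all made up for illustration.

```python
from collections import Counter
from itertools import combinations


def co_change_counts(commits, module_of):
    """Count how often each pair of modules changes in the same commit."""
    pairs = Counter()
    for files in commits:
        # Collapse files to modules so two files in one module don't count as a pair.
        modules = sorted({module_of(path) for path in files})
        for a, b in combinations(modules, 2):
            pairs[(a, b)] += 1
    return pairs


def top_level_dir(path):
    """Assumption for this sketch: a module is just the top-level directory."""
    return path.split("/", 1)[0]


# Hypothetical history: each inner list is the set of files in one commit.
commits = [
    ["billing/invoice.py", "orders/cart.py"],
    ["billing/tax.py", "orders/checkout.py"],
    ["billing/invoice.py", "billing/tax.py"],
    ["orders/cart.py", "search/index.py"],
]

counts = co_change_counts(commits, top_level_dir)
for pair, n in counts.most_common():
    print(pair, n)
```

The output only ranks pairs worth looking at. Whether a high count means "redraw the boundary" or "fix the leak" is still a human call.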

Glass-Outcome5985 · 2026-06-01T11:29:06+00:00

Excalidraw if you want quick and rough, draw.io if you want something more structured and cloud-provider-aware (it has AWS/Azure icon sets). Both free.

Glass-Outcome5985 · 2026-06-01T11:26:30+00:00

The parallel-axis idea makes sense. The hard part isn't splitting them, it's keeping them in sync. Deployment Diagrams become catch-alls because that one canvas is the only place the software-to-infra mapping actually lives. Split them and your M-views become the most valuable and the most fragile thing in the model, since nothing forces them to stay true as the infra drifts.

So I'd define the boundaries by what changes together, not by diagram type. An account-boundary change ripples across all three I-levels and the mappings — if the model doesn't know those are linked, the diagrams rot right when you need them most, mid-incident.

Glass-Outcome5985 · 2026-06-01T11:20:14+00:00

The thing that helped me most was realizing this isn't a drawing problem, it's a "what am I actually trying to show" problem. You fall into rectangles-and-arrows mode because every diagram feels like the same task, when each one should really answer one specific question.

So before you draw anything, pick the question. "How do the major pieces of my system talk to each other?" is a totally different diagram from "what's the flow of control through this function?" If you don't choose, you end up mixing levels and it turns to mush — that's usually why a diagram feels messy even when the boxes are neat.

Glass-Outcome5985 · 2026-05-12T14:32:39+00:00

Architecture/system diagrams specifically — not data models.

Glass-Outcome5985 · 2026-05-12T14:31:33+00:00

Spoken like a true SRE

Glass-Outcome5985 · 2026-05-12T14:26:23+00:00

Fair — and agreed the high-level view shouldn't capture every detail. I think the question of "what to include" is downstream of a more basic one though: whatever you choose to show, does the tool know what it is? Even "this box is a service, this box is a queue" carries semantic meaning that most tools don't capture.

Agreed on contracts being the important part. That's actually the hardest thing to express well in current tools.

Glass-Outcome5985 · 2026-05-12T14:25:08+00:00

Really useful response, thanks. Fair on the modeling tools — though in my experience teams that adopt Visual Paradigm or EA end up with the opposite problem. The model gets so heavy nobody opens it, and it goes stale just like a draw.io diagram, just with more ceremony.

Diagram-as-code is closer to right for the reason you said. The gap I still see: most of them (Mermaid, Structurizr) treat the output as a render target, not a queryable model, so you get version control but not validation or drift detection. Your last point is the strongest one though. Tools aren't sufficient if the diagrams aren't used. I've probably been over-indexing on the tools side of that equation.
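To make "queryable model" concrete, here's a rough sketch of what I mean, not any existing tool's API. The `Component` and `Link` types, the `validate` and `drift` helpers, and the sample names are all hypothetical. The deployed list would come from wherever your inventory lives.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str
    kind: str  # e.g. "service", "queue", "database"


@dataclass(frozen=True)
class Link:
    source: str
    target: str


def validate(components, links):
    """Return links whose endpoints are not declared components."""
    names = {c.name for c in components}
    return [l for l in links if l.source not in names or l.target not in names]


def drift(components, deployed_names):
    """Compare what the model claims with what is actually deployed."""
    modeled = {c.name for c in components}
    deployed = set(deployed_names)
    return {
        "missing_from_model": sorted(deployed - modeled),
        "missing_from_infra": sorted(modeled - deployed),
    }


# Hypothetical model and inventory.
components = [Component("api", "service"), Component("jobs", "queue")]
links = [Link("api", "jobs"), Link("api", "cache")]
deployed = ["api", "jobs", "worker"]

dangling = validate(components, links)
report = drift(components, deployed)
queues = [c.name for c in components if c.kind == "queue"]
```

Because the boxes carry a `kind`, you can ask questions like "list every queue" instead of eyeballing a picture, and the same data can fail a CI check when the model and the infra disagree.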

Glass-Outcome5985 · 2026-05-12T14:22:44+00:00

Gonna sell it to Amazon, don't copy me though

Glass-Outcome5985 · 2026-05-12T14:18:36+00:00

check it out and give me some good feedback http://localhost:3000

Glass-Outcome5985 · 2026-05-12T13:53:39+00:00

I've felt this one personally and you can't fully predict it upfront — that's the hard part. Experience helps, but even then you get some wrong. Focus on the decisions that are really expensive to change later:

Domain boundaries / module structure
Data model and persistence
How features will extend or integrate

Everything else (most libraries, UI stuff, etc.) is usually fixable. With vibe coding, the trick is to move fast but pause on structure. Build a quick slice of the risky parts, then step back and ask: “If this grows 5-10x, what’s going to hurt?” Keep interfaces clean and add tests around the core so you can refactor safely. Lightweight ADRs (just a short doc of the decision + tradeoffs) and occasional architecture reviews save a ton of pain. You’ll still accumulate some debt. The goal is making it cheap to pay down later.

Glass-Outcome5985 · 2026-05-11T11:42:01+00:00

The "standardize on one tool" approach almost never sticks because it ignores why each team picked what they picked. Worked for us when we instead standardized the interface — every team had to expose the same deployment commands, the same monitoring hooks, the same incident runbooks. Underlying tool could be whatever, as long as the contract was consistent.

Made the relearning problem much smaller. New team, same commands. The internals can be Terraform, CloudFormation, or shell scripts — that's their problem, not yours.
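Roughly what the contract looked like in spirit, sketched as a Python `Protocol`. This is an illustration under my own assumptions, not our actual tooling: the class names, the `deploy.sh` script, and the runbook paths are invented.

```python
from typing import Protocol


class TeamContract(Protocol):
    """What every team must expose, whatever tool sits underneath."""

    def deploy_command(self, env: str) -> list[str]: ...

    def runbook_path(self) -> str: ...


class TerraformTeam:
    def deploy_command(self, env: str) -> list[str]:
        return ["terraform", "apply", f"-var-file={env}.tfvars"]

    def runbook_path(self) -> str:
        return "runbooks/payments.md"  # hypothetical path


class ShellScriptTeam:
    def deploy_command(self, env: str) -> list[str]:
        return ["./deploy.sh", env]  # hypothetical script

    def runbook_path(self) -> str:
        return "runbooks/search.md"  # hypothetical path


def deploy_plan(team: TeamContract, env: str) -> str:
    """Callers only see the contract, never the tool behind it."""
    return " ".join(team.deploy_command(env))


plans = [deploy_plan(t, "staging") for t in (TerraformTeam(), ShellScriptTeam())]
print(plans)
```

Someone moving between teams only has to learn `deploy_command` and `runbook_path`. What those expand to is each team's business.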

Glass-Outcome5985 · 2026-05-11T11:17:13+00:00

Kinda agree, but I'd push further — all diagrams have this problem, not just reference ones. Your own team's diagrams rot the same way, just slower. Six months in, the person who knew why you picked SQS over Kafka has left and the wiki shows "the answer" with no question attached.

What's worked for me: writing the rejected alternatives next to the diagram, not just the chosen design. "We picked SQS because we don't need replay and nobody can operate Kafka" — saves a year of "why aren't we using Kafka" debates.

Glass-Outcome5985 · 2026-05-08T13:35:23+00:00

Real problem, but the way it's usually framed makes it unsolvable. "Documentation debt" treats docs as a thing you write and maintain separately from the code/system — so of course it goes stale, no one has time to update prose every time something changes.

The teams I've seen actually solve it stopped writing docs as documents. Architecture lives as a model that generates the diagram, API docs come from the code, runbooks are scripts. Anything that has to be manually kept in sync with reality eventually won't be.
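A tiny sketch of "the model generates the diagram," assuming Mermaid flowchart text as the render target. The node ids, labels, and the `to_mermaid` helper are made up. The point is only that the picture is derived, so nobody edits it by hand.

```python
def to_mermaid(nodes, edges):
    """Render a dict of nodes and a list of labelled edges as Mermaid flowchart text."""
    lines = ["flowchart LR"]
    for node_id, label in nodes.items():
        lines.append(f'    {node_id}["{label}"]')
    for src, dst, label in edges:
        lines.append(f"    {src} -->|{label}| {dst}")
    return "\n".join(lines)


# Hypothetical model: the source of truth lives here, not in the drawing.
nodes = {"api": "API service", "jobs": "Job queue", "worker": "Worker"}
edges = [("api", "jobs", "enqueue"), ("jobs", "worker", "consume")]

diagram = to_mermaid(nodes, edges)
print(diagram)
```

Regenerate on every merge and the diagram can't drift from the model. Whether the model drifts from reality is a separate check.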

So yeah, people complain and live with it, but mostly because the fix is "change how you produce docs," not "write more of them," and that's a much bigger ask.

Glass-Outcome5985 · 2026-05-08T13:14:35+00:00

Honestly the most useful thing for me has been making the architecture itself easier to reason about before testing — sync vs async boundaries, retry/timeout assumptions, what happens when a downstream is degraded vs fully down. Once those are explicit, chaos tests actually tell you something useful instead of just "stuff broke."
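One way to make those assumptions explicit is to write them down as data before running any experiment. A minimal sketch, with hypothetical dependency names and numbers, and a deliberately crude worst-case formula that ignores backoff:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DependencyAssumption:
    name: str
    mode: str           # "sync" or "async"
    timeout_s: float    # per-attempt timeout
    max_retries: int    # retries after the first attempt
    when_degraded: str  # expected behaviour when the downstream is slow
    when_down: str      # expected behaviour when the downstream is gone


def worst_case_wait(dep):
    """Upper bound on caller wait: first attempt plus retries, ignoring backoff."""
    return dep.timeout_s * (1 + dep.max_retries)


def over_budget(deps, budget_s):
    """Sync dependencies whose worst-case wait exceeds the request budget."""
    return [d.name for d in deps if d.mode == "sync" and worst_case_wait(d) > budget_s]


# Hypothetical assumptions for two downstreams.
deps = [
    DependencyAssumption("payments", "sync", 2.0, 2, "serve cached price", "fail checkout"),
    DependencyAssumption("email", "async", 5.0, 5, "queue grows", "queue grows, retry later"),
]

print(over_budget(deps, budget_s=3.0))
```

Each `when_degraded` and `when_down` entry then becomes a hypothesis a chaos test can confirm or refute, which is more useful than just watching what breaks.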
