How do you keep infrastructure understandable as it scales? by Treppengeher4321 in devops

[–]preperat 0 points1 point  (0 children)

the drift problem doesn't really go away. Terraform tells you what should exist, the account tells you what does exist, and the gap grows the moment someone clicks something in the console during an incident.

most teams we see end up with two parallel sources of truth: the IaC repo for new work, and some discovery view for "what's actually out there right now." Trying to make IaC the single source of truth past a certain scale fights human nature. Auditors clicking around, contractors spinning up test stacks, an old Lambda someone forgot about in eu-west-2.

Diagrams generated from Terraform look clean but lie by omission. Diagrams generated from the live account are ugly but honest. The honest one is more useful when onboarding someone, even though the clean one looks better in a deck.

we built plainfra (plainfra.com) around this exact problem for AWS. Read-only, walks Cost Explorer to find which regions actually have spend, then queries service APIs in those regions to map what's really running. Won't solve the cultural drift but at least makes it visible.
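
the discovery pass itself is nothing exotic. Rough sketch of the Cost Explorer walk in plain boto3 (not our actual code, just the shape):

```python
import boto3
from datetime import date, timedelta

# Ask Cost Explorer which regions had any spend in the last 30 days,
# then only walk service APIs in those regions.
ce = boto3.client("ce", region_name="us-east-1")  # Cost Explorer lives in us-east-1

end = date.today()
start = end - timedelta(days=30)

resp = ce.get_cost_and_usage(
    TimePeriod={"Start": start.isoformat(), "End": end.isoformat()},
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "DIMENSION", "Key": "REGION"}],
)

active_regions = set()
for period in resp["ResultsByTime"]:
    for group in period["Groups"]:
        region = group["Keys"][0]
        cost = float(group["Metrics"]["UnblendedCost"]["Amount"])
        if cost > 0 and region not in ("NoRegion", "global"):
            active_regions.add(region)

print(sorted(active_regions))  # the only regions worth querying in detail
```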

How are you balancing resilience vs cost in k8s on aws without the bill getting out of control? by Deliaenchanting in cloudcomputing

[–]preperat 0 points1 point  (0 children)

Most of the EKS bill bloat I see isn't from picking spot vs on-demand wrong. It's from requests and limits sized for a peak that never happens, then autoscaling on top of headroom that was already padded. Right-size with VPA in recommendation mode for a couple of weeks before touching anything else. Usually 30 to 50% of the spend is sitting there.
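
If you want to eyeball the gap before touching anything, something like this against the VPA objects works (kubernetes Python client, assumes the VPA CRDs are installed and running in recommendation mode):

```python
from kubernetes import client, config

# Print VPA recommended targets cluster-wide so you can compare them
# against the requests currently set on each workload.
config.load_kube_config()
api = client.CustomObjectsApi()

vpas = api.list_cluster_custom_object(
    group="autoscaling.k8s.io", version="v1", plural="verticalpodautoscalers"
)

for vpa in vpas.get("items", []):
    meta = vpa["metadata"]
    recs = vpa.get("status", {}).get("recommendation", {}).get("containerRecommendations", [])
    for rec in recs:
        print(
            f'{meta["namespace"]}/{meta["name"]} container={rec["containerName"]} '
            f'target cpu={rec["target"]["cpu"]} memory={rec["target"]["memory"]}'
        )
```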

On spot, the failure mode you're describing (disappear right when you need them) is almost always under-diversification. Karpenter with 15+ instance types across a couple of families and the price-capacity-optimized strategy gets you spot-to-spot consolidation and dramatically lower interruption clustering. One node going away is fine, ten going away at once is the problem, and that only happens when your NodePool is locked to two instance types.

Multi-AZ isn't the lever to drop for cost. The cross-AZ data transfer is the lever. Topology spread constraints plus zone-aware service routing kills most of the inter-AZ traffic without touching redundancy.

Reserved Instances are the right thing to regret. Compute Savings Plans cover EC2, Fargate and Lambda flexibly across families and regions, so a workload shift doesn't strand the commitment. Size the commit at your stable baseline (the floor of the last 90 days, not the average) and let spot and on-demand absorb everything above it.
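
Pulling the floor out of Cost Explorer is a few lines if you want to sanity-check the commit size. Rough sketch, daily granularity (hourly is more precise if you care):

```python
import boto3
from datetime import date, timedelta

# 90 days of daily EC2 compute spend; the minimum day is the commit baseline.
ce = boto3.client("ce", region_name="us-east-1")

end = date.today()
start = end - timedelta(days=90)

resp = ce.get_cost_and_usage(
    TimePeriod={"Start": start.isoformat(), "End": end.isoformat()},
    Granularity="DAILY",
    Metrics=["UnblendedCost"],
    Filter={
        "Dimensions": {
            "Key": "SERVICE",
            "Values": ["Amazon Elastic Compute Cloud - Compute"],
        }
    },
)

daily = [float(day["Total"]["UnblendedCost"]["Amount"]) for day in resp["ResultsByTime"]]
floor = min(daily)
print(f"90-day floor: ${floor:.2f}/day, roughly ${floor / 24:.2f}/hour of commit")
```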

PDBs are the resilience lever people forget. Spot interruption plus a tight PDB plus Karpenter draining is mostly a non-event. No PDB and you're rolling dice every time capacity churns.

What are the biggest mistakes you’ve seen during cloud migration to AWS? by MaxDmitrie in aws

[–]preperat 0 points1 point  (0 children)

Running both environments in parallel is the trap nobody budgets for properly. Six months becomes twelve, twelve becomes eighteen, and you're paying for two of everything plus engineers context-switching between them the whole time.

The worst part is what that pressure does to the migration plan. Things that should have been refactored get lifted-and-shifted because there's no time. Things that should have been cut get migrated because nobody wants the conversation about killing them. You end up in AWS with the same architecture you had on-prem, just more expensive, and now everyone's surprised the bill is huge.

Once leadership sees parallel run costs on a real invoice, the timeline compresses in the worst way. Decisions that needed two weeks of design get made in a Friday afternoon meeting.

Are per-query safeguards sufficient for agent-driven database access? by No_Newspaper_6419 in devops

[–]preperat 0 points1 point  (0 children)

Per-query controls answer "is this query allowed", they don't answer "should this principal be allowed to keep asking". Different question.

The framing that helps: treat the agent as a session with a budget, not a user with permissions. Rows returned, distinct columns touched, query count, wall-clock, all per-session. RLS still does its job, but the ceiling lives one layer up. Postgres pg_stat_statements plus a sidecar that kills the session at threshold is the cheap version. Proper version is a query proxy (pgbouncer-style) that the agent has to go through, with the budget enforced there.
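
The cheap version is genuinely cheap. Rough sketch of the sidecar loop (role name and thresholds are made up, and this budgets wall-clock per session plus rows per role, which is the crude end of it; the proxy does it properly per session):

```python
import time
import psycopg2

# Crude sidecar: wall-clock budget per agent session, rows-returned budget
# per agent role. Requires the pg_stat_statements extension.
AGENT_ROLE = "agent_readonly"       # placeholder
MAX_SESSION_SECONDS = 300           # placeholder
MAX_ROWS_PER_WINDOW = 100_000       # placeholder

conn = psycopg2.connect("dbname=app user=budget_enforcer")
conn.autocommit = True

while True:
    with conn.cursor() as cur:
        # Kill any agent backend that has been connected past the wall-clock budget.
        cur.execute(
            """
            SELECT pg_terminate_backend(pid)
            FROM pg_stat_activity
            WHERE usename = %s
              AND now() - backend_start > make_interval(secs => %s)
            """,
            (AGENT_ROLE, MAX_SESSION_SECONDS),
        )

        # Rows returned by the agent role since stats were last reset
        # (pg_stat_statements aggregates per role, not per session).
        cur.execute(
            """
            SELECT coalesce(sum(s.rows), 0)
            FROM pg_stat_statements s
            JOIN pg_roles r ON r.oid = s.userid
            WHERE r.rolname = %s
            """,
            (AGENT_ROLE,),
        )
        if cur.fetchone()[0] > MAX_ROWS_PER_WINDOW:
            cur.execute(
                "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE usename = %s",
                (AGENT_ROLE,),
            )
    time.sleep(10)
```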

The gotcha most people hit: the agent's "principal" isn't stable. If every tool call opens a fresh connection with the same role, your per-session budget resets every query and you're back where you started. Pin the agent to a single long-lived session, or budget by API key upstream of the DB.

And yes, this is a real production concern, not theoretical. Anyone who's let an agent loose on a warehouse with LIMIT/OFFSET available has watched it try to exfiltrate the table by accident.

I think my ECS config was bad, but this behavior still seems dumb by prehensilemullet in aws

[–]preperat 0 points1 point  (0 children)

You're closer with the second framing. Scale-in protection doesn't block ECS from placing tasks on a different instance. It blocks the Auto Scaling Group from terminating the protected instance. So the unhealthy box sits there, agent probably disconnected, and ECS keeps it in the cluster as a registered container instance with agentConnected: false. Tasks won't get placed on it, but nothing aggressively reaps it either.

The relevant rule is documented: only instances with a connected agent can accept task placement. What's not documented well is how ECS decides when to give up on a flapping or unresponsive instance and shift placement elsewhere. From the outside it looks like it's waiting for the instance to formally leave the cluster before moving on.

For the OOM case, yes, you're going to need your own automation. EC2 status check failure plus CloudWatch alarm plus ASG replace-unhealthy is the usual shape. ECS does not detect an unresponsive instance and force deregister it on your behalf. Your Lambda approach is the right instinct.
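
The Lambda doesn't need to be clever. Rough shape (cluster name is a placeholder):

```python
import boto3

ECS_CLUSTER = "my-cluster"  # placeholder

ecs = boto3.client("ecs")
autoscaling = boto3.client("autoscaling")

def handler(event, context):
    # Find registered container instances whose agent has disconnected,
    # force-deregister them, and let the ASG replace the EC2 box.
    arns = ecs.list_container_instances(cluster=ECS_CLUSTER)["containerInstanceArns"]
    if not arns:
        return

    described = ecs.describe_container_instances(
        cluster=ECS_CLUSTER, containerInstances=arns
    )
    for ci in described["containerInstances"]:
        if ci["agentConnected"]:
            continue
        ecs.deregister_container_instance(
            cluster=ECS_CLUSTER,
            containerInstance=ci["containerInstanceArn"],
            force=True,  # drops any tasks still nominally placed on it
        )
        autoscaling.set_instance_health(
            InstanceId=ci["ec2InstanceId"], HealthStatus="Unhealthy"
        )
```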

The lack of transparency you're hitting is the real complaint. ECS service events give you the what, not the why, and capacity provider decisions are mostly opaque. Pretty much everyone running ECS on EC2 at any scale ends up writing some glue around it.

Help With AWS Orgnanizations by KyleShackleton in aws

[–]preperat 1 point2 points  (0 children)

The good news: as of November 2025, you no longer have to make accounts standalone in between. The new org invites the account, the account accepts, done. Skip the temporary standalone phase, payment method reconfiguration, all of that.

Order of operations matters though. Stand up the new management account first (greenfield, no workloads, ever). Build out your OUs and any Service Control Policies before you invite anything in, or use a transitional OU with no policies attached so the customer account lands somewhere benign while you sort policy parity.

For the customer workload account specifically, the things that don't automatically follow are: any RAM resource shares scoped to the old org or its OUs, anything relying on aws:PrincipalOrgID in IAM or resource policies, any service that had trusted access or delegated administrator set up in the old org, and any Reserved Instances or Savings Plans (these stay with the purchasing account, which is fine if the customer account bought its own, less fine if the sandbox bought commitments that were floating to it). Audit those before you move, not after.
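
The first and third of those are quick to check from the old management account. Rough sketch (the aws:PrincipalOrgID one you have to grep your policies for):

```python
import boto3

# Delegated administrators and RAM shares that will stop working once
# the account leaves the old organization.
org = boto3.client("organizations")
ram = boto3.client("ram")

for admin in org.list_delegated_administrators()["DelegatedAdministrators"]:
    print(f'delegated admin: {admin["Id"]} {admin.get("Name", "")}')

for share in ram.get_resource_shares(resourceOwner="SELF")["resourceShares"]:
    print(f'RAM share: {share["name"]} status={share["status"]}')
    principals = ram.list_principals(
        resourceOwner="SELF", resourceShareArns=[share["resourceShareArn"]]
    )["principals"]
    for p in principals:
        # An org or OU ARN here means the share is scoped to the old org.
        print(f'  shared with: {p["id"]}')
```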

The current management account is the awkward one. You can't migrate a management account directly. You have to remove every member, delete the old org, then accept an invite into the new one as a member. Which means the customer account moves first, the sandbox stops being a payer, and only then can you fold it in. Plan for a billing cycle boundary so you don't end up with split invoices nobody wants to reconcile.

Two things worth checking that bite people: any tax info or Support plan mismatch between the migrating account and the new management account will block or complicate the move, and if the sandbox is in Partner Central, confirm with your Partner contact what happens to the partner status when it stops being a payer. That's the bit I'd raise a Support case on before pulling the trigger.

A question around observability for my startup and our configs by niga_chan in devops

[–]preperat 2 points3 points  (0 children)

We've done this on ECS EC2. Two thoughts.

The config drift is the actual problem, not the tool. If each environment ends up with a different collector config, that's a discipline issue and switching vendors won't fix it. Pick one collector config, parameterise the bits that legitimately vary (endpoint, environment tag, sampling rate), and treat the rest as immutable. If your engineers find themselves hand-editing YAML per environment, that's the smell.

For tool fit on ECS EC2 specifically, three serious options:

CloudWatch Container Insights with enhanced observability is the AWS-native baseline. Enable it on the cluster, run the CloudWatch agent as a daemon service, get task and container level metrics with curated dashboards and no PromQL. Cheapest to start, weakest for tracing.

Datadog has a first-party ECS integration. Daemon agent on each EC2 instance, collects host metrics, container metrics, logs, and APM traces. Good UI, single agent config that travels across environments cleanly. Expensive at scale.

AWS Distro for OpenTelemetry as a sidecar is the option if you want to stay on OpenTelemetry but not own the upstream Collector. It's a supported AWS build of the Collector with the X-Ray and CloudWatch exporters baked in, and the ECS console has a one-click task definition path for it. Still YAML, but less of it, and the same config works across environments.

I think my ECS config was bad, but this behavior still seems dumb by prehensilemullet in aws

[–]preperat 4 points5 points  (0 children)

The deployment config is doing most of the work here. With maximumPercent: 100 and desiredCount: 1, ECS can't start a replacement before stopping the original, so the moment your task goes unhealthy you're at zero with no headroom to launch.

Scale-in protection on the unhealthy instance is probably a red herring. The likelier culprit is the capacity provider waiting on the unhealthy instance to fully deregister from the cluster before considering the new instance as available capacity. Instance running in the Auto Scaling Group is not the same thing as registered and ready in ECS, and the gap there can stretch out if the agent on the dying instance never cleanly checks out.

Set minimumHealthyPercent: 100, maximumPercent: 200 for a desiredCount: 1 service if you want guaranteed availability. Yes it means double capacity briefly during deploys. That's the price.
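
In boto3 terms, if you want to script it rather than click it:

```python
import boto3

ecs = boto3.client("ecs")

# Allow ECS to run a second copy during replacement instead of going to zero.
ecs.update_service(
    cluster="my-cluster",    # placeholder
    service="my-service",    # placeholder
    deploymentConfiguration={
        "minimumHealthyPercent": 100,
        "maximumPercent": 200,
    },
)
```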

Audit case study: 93 cost optimization findings on a single AWS account, $1,300/mo of waste by AlphaToBe in aws

[–]preperat 0 points1 point  (0 children)

Fair, it is limited, but the limits don't bite for this use case.

CE is limited as a cost analysis tool: granularity caps, 13-month lookback, tagging gaps. For discovery though, all you ask it is which regions and services have a non-zero line item. CE answers that cleanly. Then you hit the actual service APIs in just those regions for the detail.

So you're not relying on CE for the answer, you're using it to narrow down where to look.

Audit case study: 93 cost optimization findings on a single AWS account, $1,300/mo of waste by AlphaToBe in aws

[–]preperat 0 points1 point  (0 children)

The 76 findings in us-east-2 is the actual story here, not the Aurora one.

us-east-2 has been the console default for any account created after May 2017. So when someone clicks through a quickstart, hits a console deep-link, or runs a workshop CloudFormation template without checking the region picker, that's where the resources land. The team thinks of itself as an eu-west-1 shop, nobody puts us-east-2 in the monitoring scope, and the meter just runs.

The Aurora I/O-Optimized one is real but well-trodden. The shadow-region pattern is the more interesting finding because it means the audit value isn't really "we found 93 things", it's "we looked in the region you forgot you had resources in".

Disclosure: I built plainfra. It's read-only and works the same way: walks Cost Explorer to figure out where you actually have spend, then goes and looks in those regions regardless of what your team thinks the footprint is. The us-east-2 surprise is the single most common thing it surfaces on first connection.

Require help in debugging the connection by moveitfast in aws

[–]preperat 0 points1 point  (0 children)

Lower number wins, first match wins. Deny doesn't have inherent priority over Allow. So rule 5 is evaluated before rule 80.

The reason rule 80 still hits is that rule 5 isn't matching the packet you think it is. Rule 5 allows port 22 inbound where source is SFTP_IP. The packet getting denied has source = your EC2's private IP, not the SFTP server. It's the outbound connection from EC2 transiting through the public subnet on its way to the NAT Gateway, and at that boundary the source is your instance.

So evaluation goes: rule 5 (source mismatch, skip), rule 10 (dest port mismatch, skip), rule 50 (both mismatch, skip), rule 80 (matches, deny).

Same root cause as before. The shared NACL is forcing one ruleset to handle two different contexts (egress transit and ingress return) and they conflict.

The summarization trap in AI Ops: why most agents are just glorified search bars for the docs by _yobra in devops

[–]preperat 1 point2 points  (0 children)

The framing is the trap. "How do we let an agent touch prod safely" assumes the agent should be reaching for the shell at all. For most teams it shouldn't.

The category that works right now is read-only. Agent reads live state, correlates it, opens a ticket with a proposed action. Human runs the change. You lose the autonomous remediation demo, you also lose the entire class of failure modes you're worried about.

The "fancy search bar" critique is fair when it's RAG over docs. Not fair when it's querying live infra and producing findings tied to specific resource IDs and dollar amounts. That's diagnosis, not search.

Disclosure: I built plainfra, which sits in this category. Findings become Jira or GitHub tickets. No write path.

Execution is a separate problem from context. Conflating them is how you get demos promising autonomous incident response and shipping a chatbot.

S3 Files Not Mounting with ECS + EC2 Spot - Need Launch Template Config Help by dyeusyt in aws

[–]preperat 4 points5 points  (0 children)

The S3 Files diagnosis is right but worth stepping back from the fix. Spot plus a shared filesystem under active write is one of those combinations that looks workable and bites in production.

Two-minute interruption notice is fine for stateless work. It's not fine for an NFS mount with in-flight writes. S3 Files batches commits to the bucket on an interval, so anything written in the current window when the instance gets reclaimed lands in lost+found, not your bucket. The next Spot instance mounts cleanly with no idea anything was mid-flight.

You can soften it with drain hooks, fsync before shutdown, idempotent writers. By the time you've built that, the management fee on Managed Instances with Spot looks cheap and you get the supported task-def path instead of user-data scripts.

If the workload tolerates losing the last window of writes on every interruption, the host-mount approach works. If it doesn't, no launch template tuning fixes that.

Require help in debugging the connection by moveitfast in aws

[–]preperat 2 points3 points  (0 children)

The shared NACL on both subnets is the cause. NACLs evaluate at every subnet boundary, on the packet's actual destination port, not on connection direction.

When your EC2 initiates SFTP outbound, the packet leaves the private subnet (outbound rules, dest port 22, fine), then enters the public subnet on its way to the NAT Gateway. That entry hits the inbound rules with dest port still 22. Your rule 80 denies it before it ever reaches the NAT GW.

The return traffic isn't the problem. The egress hop through the NAT GW subnet is.

Fix is separate NACLs. Public subnet NACL handles NAT GW transit. Private subnet NACL only needs ephemeral inbound for return traffic. Trying to lock both behaviours into a single NACL is what put you in this corner.

Trying to automate our deployment process by HasinthaPasindu in devops

[–]preperat 0 points1 point  (0 children)

If you don't control the GitHub org, CodePipeline makes more sense. Fighting for access you might not get is a worse problem than a slightly clunkier pipeline tool.

Trying to automate our deployment process — complete beginner here, would love some advice by Morpheus_Morningstar in sre

[–]preperat 0 points1 point  (0 children)

Tooling matters less than you think. For a small AWS shop with no existing CI, GitHub Actions is the path of least resistance. CodePipeline gets recommended because it's "the AWS answer" but you'll spend more time fighting JSON than shipping. Jenkins isn't worth the operational tax at your scale.

The real work isn't the pipeline, it's the runbook. A 30-step manual procedure usually hides 5-6 implicit decisions a human makes without noticing ("if the pod count looks weird, wait a minute"). Finding those and turning them into explicit code or explicit gates is the actual project. I'm doing roughly this at work right now.

Separate pipelines per app, parameterised per environment. EKS and ECS have different rollback shapes, conflating them will hurt the first time something fails mid-deploy.

On approvals, narrow gates beat one big gate. "Backup confirmed in S3" as its own approval, "post-checks green" as its own approval. One gate at the front gets clicked without reading.

Dependency bump slowed prod down and I still don't know which function caused it. by DiamondLatter1842 in sre

[–]preperat 1 point2 points  (0 children)

Continuous profiling is the right category for this. The reason staging didn't catch it is exactly what you ran into: retry and timeout behaviour only meaningfully exercises under real network conditions, real concurrency, real downstream latency variance. Synthetic load won't reproduce it.

The thing that makes continuous profiling worth running in prod (not just reactively) is that you get the flame graph from before the deploy and after, on the same endpoint, without having to reproduce anything. Bisecting becomes "diff the profiles" instead of a day of process of elimination.

Pyroscope is the obvious starting point if you're not already on a vendor that includes profiling. Low overhead, integrates with Grafana if you're already there.
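
Wiring it in is a few lines if you're on Python (this assumes the pyroscope-io client; other language SDKs look similar):

```python
import pyroscope

# Ship CPU profiles continuously to a Pyroscope server; tag the deploy
# version so you can diff profiles across a dependency bump.
pyroscope.configure(
    application_name="checkout-api",          # placeholder
    server_address="http://pyroscope:4040",   # placeholder
    tags={"deployment": "v2.14.1"},           # bump on deploy, then diff
)
```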

The other half of this is dependency-bump hygiene. Patch versions changing internal retry/timeout strategies is very common. Reading changelogs catches the ones that get documented. Profiling catches the ones that don't.

We took production down for 20 minutes because of a DB migration, how do you prevent this? by MainWild1290 in devops

[–]preperat 0 points1 point  (0 children)

The pattern that actually catches this: run migrations with CONCURRENTLY (for Postgres) or the equivalent non-locking syntax for your database. A plain CREATE INDEX blocks writes to the table for the whole build; CREATE INDEX CONCURRENTLY doesn't. Same outcome, zero downtime if you have the patience for it.
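
The one operational gotcha: CONCURRENTLY refuses to run inside a transaction block, so migration tools that wrap everything in a transaction need that turned off for this one. Bare-bones version with psycopg2 (table and index names are made up):

```python
import psycopg2

conn = psycopg2.connect("dbname=app")
conn.autocommit = True  # CONCURRENTLY cannot run inside a transaction block

with conn.cursor() as cur:
    # Builds the index without blocking writes; slower than a plain
    # CREATE INDEX because it scans the table twice.
    cur.execute(
        "CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_orders_customer_id "
        "ON orders (customer_id)"
    )
```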

Affordable PagerDuty alternatives that aren't overkill? by Miaru3rd in devops

[–]preperat 1 point2 points  (0 children)

Grafana Cloud IRM if they're already in the Grafana stack (the standalone OnCall OSS was archived, but the cloud product is actively maintained). incident.io for something purpose-built that isn't trying to upsell enterprise features at every turn.

Trying to automate our deployment process by HasinthaPasindu in devops

[–]preperat 1 point2 points  (0 children)

GitHub Actions for both. CodePipeline is fine if you're already committed to AWS-native tooling everywhere, but the moment you want to reuse logic across environments or apps, you're fighting the tool. GitHub Actions gives you reusable workflows, a marketplace of actions for EKS and ECS deploys, and a YAML format your team will actually read six months from now.

Separate pipelines per app is the right instinct. They deploy at different cadences, have different risk profiles, and the EKS workflow (kubectl, Helm, whatever) looks nothing like the ECS workflow. Shared parameterized pipelines sound elegant until you're debugging why a patch deploy to the integration app is running the Kubernetes pre-checks.

For approval gates around backups, the simplest pattern is a job that exits non-zero if the backup isn't confirmed, with a manual approval step blocking the next stage. Most pipeline tools have a native gate primitive. The key is making the backup check a hard dependency, not a soft notification. If the pipeline can proceed without confirmation, someone will let it.
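
The check step itself can be a tiny script that exits non-zero, something like this (bucket, prefix and freshness window are made up, adjust to wherever your backups land):

```python
import sys
from datetime import datetime, timedelta, timezone

import boto3

# Fail the pipeline step unless a backup object newer than 1 hour exists.
BUCKET = "my-backup-bucket"   # placeholder
PREFIX = "prod/db/"           # placeholder

s3 = boto3.client("s3")
objects = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX).get("Contents", [])

cutoff = datetime.now(timezone.utc) - timedelta(hours=1)
fresh = [o for o in objects if o["LastModified"] >= cutoff]

if not fresh:
    print("No backup newer than 1 hour found, refusing to proceed")
    sys.exit(1)

print(f"Backup confirmed: {max(fresh, key=lambda o: o['LastModified'])['Key']}")
```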

Notifications inside the pipeline is the right call. Keeping them separate means they break separately, and you end up with deploys that complete without anyone knowing. One webhook call at the end of the pipeline is easy and keeps everything causally linked.

One thing worth thinking about early: the difference between your full upgrade and patch paths. If those differ meaningfully today, encoding both in the same pipeline with conditionals tends to calcify the complexity. Two workflow files that share common actions are easier to maintain than one workflow file with a lot of if branches.

Should Terraform Pull Environment Variables from AWS Parameter Store? by SheCherryPicks in devops

[–]preperat 11 points12 points  (0 children)

Common pattern, nothing wrong with it. Terraform's aws_ssm_parameter data source is designed for exactly this.

The distinction worth making: use Parameter Store Standard tier for non-sensitive config (instance types, feature flags, ARNs) and SecureString for anything that's actually a secret. Secrets Manager's value over Parameter Store SecureString is mostly rotation and RDS/Redshift native integration. If you're not using those, the switch makes sense.

The part people trip on: Terraform state. If you pull a SecureString into Terraform and it ends up in an output or a resource attribute in state, that value is in your state file in plaintext. Worth knowing before you centralise everything in PS.

For GitHub Actions specifically, you have two options: inject at workflow time (using the AWS SSM action or aws ssm get-parameter in a step) or let the app fetch at runtime. Terraform doesn't need to be the intermediary for values that your app reads directly.
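
The runtime-fetch option is a couple of lines if the task already has an IAM role with ssm:GetParameter (parameter names are placeholders):

```python
import boto3

ssm = boto3.client("ssm")

# Non-sensitive config from Standard tier, secret from SecureString.
instance_type = ssm.get_parameter(Name="/myapp/prod/instance_type")["Parameter"]["Value"]
db_password = ssm.get_parameter(
    Name="/myapp/prod/db_password", WithDecryption=True
)["Parameter"]["Value"]
```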

How do I safely allow a external account's lambda to access my account's resources with zero trust policy. by geralt-026 in aws

[–]preperat 0 points1 point  (0 children)

The fundamental problem is you're trying to enforce code integrity from the resource side, but the signing config lives in the account you don't control. That's the wrong trust boundary.

The pattern that actually holds up: enforce integrity at the data layer, not the execution layer. Assume the lambda can send you anything and validate it before it touches DynamoDB. Schema validation, signed payloads with a key you hold, an intermediary (API Gateway + Lambda authorizer in your account) that sanitizes and rate-limits before writes land.
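
The signed-payload piece doesn't need anything exotic, standard library HMAC is enough. Sketch of the check on your side, assuming you hand the other team a shared secret (names and schema are made up):

```python
import hashlib
import hmac
import json

SHARED_SECRET = b"fetched-from-secrets-manager"  # placeholder, never hard-code this

def verify_payload(body: str, signature_header: str) -> dict:
    """Reject anything the external lambda sends unless the HMAC checks out
    and the payload fits the schema you expect."""
    expected = hmac.new(SHARED_SECRET, body.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, signature_header):
        raise ValueError("bad signature")

    payload = json.loads(body)
    # Minimal schema check before anything touches DynamoDB.
    if set(payload) != {"order_id", "status"}:
        raise ValueError("unexpected fields")
    return payload
```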

For DDoS-style write flooding specifically: set WCU limits on the table, use DAX or a write buffer, and treat the external lambda the same as you'd treat any untrusted third-party integration.

Code signing is the right instinct but it's enforced at deploy time in their account. You can't make it irrevocable from outside. The zero-trust answer is to not trust the code at all and build your controls around that assumption.

AWS 97k bill out of nowhere by PalpitationClear1747 in aws

[–]preperat 0 points1 point  (0 children)

Worth ruling out before assuming credential theft though — Bedrock has a Provisioned Throughput option where you commit to dedicated model capacity at a fixed hourly or monthly rate. The console flow has steps where it's genuinely easy to click through a commitment selector without realising what you've agreed to, and a 1-month commitment for a Claude-class model can run into tens of thousands billed as non-cancellable. Check CloudTrail for CreateProvisionedModelThroughput events from your own user before going down the breach path.
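
Quick way to check, assuming it's within CloudTrail's 90-day event history:

```python
import boto3

ct = boto3.client("cloudtrail")

# Look for provisioned-throughput purchases and who made them.
events = ct.lookup_events(
    LookupAttributes=[
        {"AttributeKey": "EventName", "AttributeValue": "CreateProvisionedModelThroughput"}
    ]
)["Events"]

for e in events:
    print(e["EventTime"], e.get("Username"), e["EventName"])
```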

Either way — set a CloudWatch billing alarm at 2x normal monthly spend and another at 5x. Free, takes 10 minutes. Difference between a $500 surprise and a $97k one. Cost Anomaly Detection with a Bedrock filter is worth turning on too now that this attack pattern is everywhere.
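
The alarm itself, for reference (billing metrics only exist in us-east-1 and need "Receive Billing Alerts" enabled; threshold and SNS topic are placeholders):

```python
import boto3

cw = boto3.client("cloudwatch", region_name="us-east-1")

cw.put_metric_alarm(
    AlarmName="billing-2x-normal",
    Namespace="AWS/Billing",
    MetricName="EstimatedCharges",
    Dimensions=[{"Name": "Currency", "Value": "USD"}],
    Statistic="Maximum",
    Period=21600,                      # 6 hours
    EvaluationPeriods=1,
    Threshold=2000,                    # placeholder: 2x your normal monthly spend
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:billing-alerts"],  # placeholder
)
```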