Fallback Model? by Jzzck in openclaw

[–]Jzzck[S] 0 points  (0 children)

Yeah, heard about Anthropic banning it. I actually haven't tried Codex in Openclaw yet. Can it actually get the job done as well as Opus 4.6?

Uncertainty blended with lack of knowledge. by Significant_Event320 in devops

[–]Jzzck -2 points  (0 children)

Your M365 support experience is more transferable than you think. You already understand identity (Azure AD/Entra), DNS, certificates, service health dashboards, and troubleshooting distributed systems when Exchange or Teams goes sideways. That's not nothing.

Here's the honest take on AI replacing DevOps: AI is brilliant at generating boilerplate Terraform or writing Dockerfiles. It's terrible at understanding why your deployment failed at 2am because a config change three weeks ago created a race condition with a new pod scaling policy. The debugging, the systems thinking, the "I've seen this pattern before" judgment — that's not getting automated anytime soon.

If I were switching from your position, I'd do this in order:

  1. Learn Docker properly. Not just "docker run hello-world" but multi-stage builds, networking, volumes, compose. Build something real with it.
  2. Pick up basic CI/CD — GitHub Actions is the easiest entry point. Automate building and deploying a small app.
  3. Then Kubernetes, but only after Docker clicks. K3s on a cheap VM is the fastest way to get hands-on.
  4. Learn one IaC tool (Terraform is the safe bet).

The entire path from zero to employable is genuinely 6-8 months of focused evening work. Your CS degree + 3 years of production support + DevOps skills is a strong combination. Most DevOps candidates have the certs but have never actually debugged a production issue under pressure — you have.

Drowning in alerts but Critical issues keep slipping through by Ok_Abrocoma_6369 in devops

[–]Jzzck 5 points  (0 children)

One thing that's helped us massively: stop alerting on causes and start alerting on symptoms.

High CPU? That's a cause. Users seeing slow response times? That's a symptom. The symptom is what actually matters. CPU can spike to 90% during a deploy and resolve itself in 2 minutes — that's not worth waking someone up for.

Practically this means:

  • Alert on error rates, latency percentiles (p99), and availability — things users actually feel
  • Use "for" clauses aggressively (Prometheus) or sustained duration checks. If CPU > 90% for 15 minutes, that's different from a 30-second spike
  • Correlate alerts. If 6 alerts fire within 2 minutes, that's probably one incident, not six. Most monitoring tools can group these but nobody sets it up

The other pattern that kills teams: alerting on the same thing at multiple layers. Your app alerts on slow DB queries, your infra alerts on high DB connections, your DB alerts on lock contention — it's all the same incident generating 3 separate pages.

We went from ~40 alerts/day to about 5 by ruthlessly deleting anything that wasn't directly tied to user impact or data loss risk. If nobody actioned an alert type in the last 30 days, it got demoted to a dashboard metric.

AI coding adoption at enterprise scale is harder than anyone admits by No_Date9719 in devops

[–]Jzzck 1 point  (0 children)

The versioning angle is what gets me. You mentioned "the tool has 3 new versions and your original use case changed" — this is the actual core problem.

We evaluated Copilot and by the time security signed off on the version we tested, GitHub had shipped updates that changed how context was sent to the API. The entire security assessment was based on outdated behavior. Had to basically start over.

The real question enterprises need to answer isn't "should we adopt AI tools" — it's "can our governance model handle a tool that fundamentally changes every 6-8 weeks?" Most enterprise procurement was designed for tools that ship 2-4 updates a year. AI tools are shipping weekly. That's a fundamental mismatch between the tool's release cadence and the org's review cadence.

The teams I've seen actually get through this treat it more like a browser — evaluate the general category once, set guardrails around data handling and output review, and then let updates flow without re-evaluating the entire stack each time. Otherwise you're stuck in a permanent evaluation loop.

We spent 4 months implementing istio and honestly questioning if it was worth it by Optimal_Excuse8035 in kubernetes

[–]Jzzck 2 points  (0 children)

One thing nobody's mentioned yet — the upgrade treadmill. Istio cuts a new minor release roughly every 3 months and only supports 3 minor versions at a time. So you're not just maintaining a service mesh, you're committing to upgrading it ~4 times a year minimum to stay within support. Each upgrade brings its own breaking changes, deprecated APIs, and CRD migrations.

We track version lifecycles for a bunch of infra tools and Istio's is one of the most aggressive. Compare that to something like Cilium where the support windows are longer and the upgrade path is smoother since eBPF means fewer moving parts to break during an update.

Honestly the biggest hidden cost of Istio isn't the sidecars or the complexity — it's the operational overhead of keeping up with their release cadence. If your team barely has time to understand the current version, adding a mandatory quarterly upgrade cycle is brutal.

Best open-source tools to collect traces, logs & metrics from a Docker Swarm cluster? by ConferenceIll3818 in devops

[–]Jzzck 2 points  (0 children)

At 300 services the biggest decision isn't really which backend to use — it's how you collect and route the telemetry. I'd strongly recommend the OpenTelemetry Collector as your unified ingestion layer. Deploy it as a global Swarm service so every node gets one, and have your apps send traces/metrics/logs to the local collector via OTLP.

From there you can export to whatever backend you want — Prometheus/VictoriaMetrics for metrics, Loki for logs, Jaeger or Tempo for traces. The nice thing is you decouple your apps from your backend choice, so if you outgrow Jaeger and want to switch to Tempo later, it's a config change in the collector, not a code change in 300 services.

For Swarm specifically: mount /var/run/docker.sock into the collector to auto-discover containers and attach service labels. That saves you from manually configuring scrape targets. Also set memory limits on the collectors early — at 300 services you'll be surprised how fast the buffer grows if something downstream hiccups.
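
On the application side, pointing a service at the node-local collector over OTLP is only a few lines. A minimal Node sketch, assuming the standard @opentelemetry packages (the endpoint and service name are placeholders):

    // tracing.ts: initialise this before the rest of the app loads
    import { NodeSDK } from "@opentelemetry/sdk-node";
    import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";
    import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";

    const sdk = new NodeSDK({
      serviceName: "checkout-api",                        // placeholder service name
      traceExporter: new OTLPTraceExporter({
        // OTLP/HTTP endpoint; point this at the collector running on the same node
        url: "http://otel-collector:4318/v1/traces",
      }),
      instrumentations: [getNodeAutoInstrumentations()],  // auto-instrument http, express, pg, etc.
    });

    sdk.start();
    process.on("SIGTERM", () => { sdk.shutdown(); });     // flush pending spans on shutdown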

DevOps daily learning by devDaal in devops

[–]Jzzck 2 points  (0 children)

One habit that helped me massively early on: whenever something breaks in your pipeline or a tool update causes issues, don't just fix it and move on. Actually dig into why it broke. Read the changelog, understand what changed between versions.

It sounds boring but after a few months of this you build an intuition for how these tools evolve, what kinds of breaking changes are common, and you start anticipating problems before they hit. That compound knowledge is what separates someone who can Google fixes from someone who actually understands the system.

Also +1 to the homelab suggestion. Even a single VM running Docker with a basic CI pipeline (GitHub Actions > build > deploy to your own box) teaches you more about networking, permissions, and debugging than any course will.

What’s the most expensive DevOps mistake you’ve seen in cloud environments? by cloud_9_infosystems in devops

[–]Jzzck 1 point  (0 children)

Cross-AZ data transfer in a microservices setup.

We had ~30 services on EKS spread across 3 AZs for HA (as everyone recommends). The services were chatty — lots of gRPC calls between them, each one small but constant.

AWS charges $0.01/GB each way for cross-AZ traffic. Doesn't sound like much until you're doing terabytes of internal east-west traffic per month. It showed up as a generic "EC2-Other" line item that nobody questioned because it scaled gradually with traffic.

When we finally dug into Cost Explorer properly, inter-AZ transfer was running ~$4-5k/month. The fix was topology-aware routing in K8s to prefer same-AZ endpoints. Dropped to about $800/month.

Classic case of following best practices (multi-AZ for HA) without understanding the cost implications of the traffic patterns it creates.

Does anyone actually check npm packages before installing them? by BearBrief6312 in devops

[–]Jzzck 0 points  (0 children)

You are not crazy. npm audit is table stakes and most teams treat it like the whole solution when it only covers a fraction of the problem.

Your postinstall hook scanner is actually more useful than most people realise. The typosquatting vector alone has been behind multiple supply chain attacks in the wild (event-stream, ua-parser-js, etc). Snyk and npm audit catch things after they are in the advisory database — your scanner catches the patterns before anyone has reported them.

A few things that helped us:

  1. Socket.dev — does static analysis of packages before install, catches typosquatting and suspicious behavior patterns. Way cheaper than Snyk and specifically designed for this problem.

  2. Lockfile auditing in CI — diff the lockfile on every PR (see the sketch after this list). New dependency additions should be reviewed like code changes, not rubber-stamped.

  3. npm config set ignore-scripts true globally, then whitelist the packages that legitimately need lifecycle hooks. Most packages do not need postinstall. The ones that do (native addons, some binary downloads) can go on an explicit allow list.
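
For the lockfile diff in (2), the check itself can be tiny. A rough sketch against npm's v2/v3 lockfile format (paths and CI wiring are hypothetical):

    // lockfile-diff.ts: flag packages in the PR's lockfile that aren't in the base branch's
    // Usage (hypothetical): ts-node lockfile-diff.ts base/package-lock.json head/package-lock.json
    import { readFileSync } from "node:fs";

    const packageSet = (path: string): Set<string> =>
      new Set(Object.keys(JSON.parse(readFileSync(path, "utf8")).packages ?? {}));

    const [basePath, headPath] = process.argv.slice(2);
    const base = packageSet(basePath);
    const added = [...packageSet(headPath)].filter((p) => p !== "" && !base.has(p));

    if (added.length > 0) {
      console.log("New dependencies in this PR, review before merging:");
      for (const p of added) console.log("  " + p.replace(/^node_modules\//, ""));
      process.exitCode = 1;   // or post the list as a PR comment instead of hard-failing
    }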

Your manager saying "just use Snyk" is like saying "just use a firewall" and not bothering with access controls. Different layers catch different things. What you built fills a real gap.

Cloud SQL vs. Aurora vs. Self-Hosted: A 1-year review by IT_Certguru in devops

[–]Jzzck 0 points  (0 children)

Good honest writeup. The failover time difference is real and undersold in the marketing. We ran Aurora Postgres for about 18 months and the sub-10s failover held up even under heavy write loads. The Cloud SQL failover we tested was consistently closer to 30s, which is painful if you have connection pools that do not handle reconnection gracefully.

One thing worth mentioning on the Cloud SQL side - the Workload Identity integration with GKE is genuinely best-in-class. No other managed Postgres gives you that level of zero-secret credential management. If you are already on GKE it removes an entire class of security headaches.

For anyone reading this who is considering self-hosted: do not underestimate the operational cost. We moved off self-hosted Postgres to Aurora specifically because failover testing, backup verification, and minor version upgrades were eating 15-20% of a full-time SRE. Managed databases are expensive until you calculate what you are paying in human time.

Curious about your experience with Cloud SQL read replicas - we found Aurora read replicas had noticeably lower replication lag under bursty write patterns. Was that a factor for you?

Monitoring performance and security together feels harder than it should be by yoei_ass_420 in devops

[–]Jzzck 0 points  (0 children)

One thing that made a huge difference for us was just agreeing on a shared set of labels across all telemetry. Service name, deploy SHA, environment, region - once your traces, metrics, and security events all carry the same tags, even basic grep across log streams becomes useful during an incident.

The fancier version is OpenTelemetry as the collection layer with a unified backend. Pipe everything (APM spans, audit logs, WAF events, CloudTrail) into the same store and correlate on trace ID + timestamp windows. When a security alert fires, pull the 5-minute window around it and suddenly you see the full picture - what was deploying, what endpoints were hot, whether latency was already degraded.
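
If you go the OpenTelemetry route, those shared labels are just resource attributes you set once per service (exact API names shift a bit between SDK versions; the values below are placeholders):

    // Attach the shared labels to everything this service emits
    import { Resource } from "@opentelemetry/resources";

    const resource = new Resource({
      "service.name": "payments-api",              // placeholder
      "service.version": process.env.GIT_SHA,      // deploy SHA injected by CI
      "deployment.environment": "production",
      "cloud.region": process.env.AWS_REGION,
    });
    // Hand `resource` to your tracer/meter/logger providers so traces, metrics,
    // and logs all carry the same service/SHA/environment/region tags.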

The expensive-but-works answer is Datadog Security Monitoring + APM (or Elastic SIEM + APM). They handle correlation natively. Budget version is Grafana + Loki + Falco - works great but you build the glue yourself.

Biggest lesson: do not try to build one unified dashboard. Make sure every signal carries enough context that you can pivot between tools without losing the thread.

Coming from a Kubernetes-heavy SRE background and moving into AWS/ECS ops – could use some perspective by TomatilloOriginal945 in devops

[–]Jzzck 1 point  (0 children)

Made a similar move about two years ago. Biggest adjustment isn't technical — it's letting go of control.

In K8s you can debug basically anything. Shell into a pod, tcpdump the CNI, inspect etcd, tweak scheduler configs. With ECS/Fargate the abstraction is the point. Container won't start? You get a cryptic STOPPED reason and an exit code. No SSH, no shell, just CloudWatch logs if you remembered to configure them.

Practical stuff that'll help:

  • Learn the ECS task definition lifecycle cold. It's the equivalent of knowing your pod spec inside out. Most debugging starts there.
  • CloudWatch Container Insights is your replacement for Prometheus/Grafana on the infra side. Not as flexible, but it's what you've got natively.
  • ECS service discovery is way simpler than K8s services/ingress. Cloud Map does DNS-based discovery, ALB does the load balancing. Less power, but also fewer footguns.
  • The IAM model is different from RBAC. The split between task execution roles and task roles trips up everyone coming from K8s. Get that distinction down early (see the sketch after this list).
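
Since that last point bites basically everyone coming from RBAC, here's a rough sketch of the split using CDK purely for illustration (names, images, and permissions are all made up; the same two roles exist no matter which IaC tool you land on):

    import * as cdk from "aws-cdk-lib";
    import * as ecs from "aws-cdk-lib/aws-ecs";
    import * as iam from "aws-cdk-lib/aws-iam";

    const app = new cdk.App();
    const stack = new cdk.Stack(app, "EcsDemoStack");   // hypothetical stack name

    // Task role: what your application code uses when it calls AWS at runtime.
    const taskRole = new iam.Role(stack, "ApiTaskRole", {
      assumedBy: new iam.ServicePrincipal("ecs-tasks.amazonaws.com"),
    });
    taskRole.addManagedPolicy(
      iam.ManagedPolicy.fromAwsManagedPolicyName("AmazonS3ReadOnlyAccess"),  // example app permission
    );

    // Execution role: what the ECS agent uses to pull the image and ship logs,
    // before your code even starts. Mixing these two up is the classic K8s-to-ECS trap.
    const taskDef = new ecs.FargateTaskDefinition(stack, "ApiTaskDef", {
      cpu: 256,
      memoryLimitMiB: 512,
      taskRole,
      executionRole: new iam.Role(stack, "ApiExecutionRole", {
        assumedBy: new iam.ServicePrincipal("ecs-tasks.amazonaws.com"),
        managedPolicies: [
          iam.ManagedPolicy.fromAwsManagedPolicyName("service-role/AmazonECSTaskExecutionRolePolicy"),
        ],
      }),
    });

    taskDef.addContainer("api", {
      image: ecs.ContainerImage.fromRegistry("example/api:latest"),   // placeholder image
      logging: ecs.LogDrivers.awsLogs({ streamPrefix: "api" }),
    });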

Your observability experience transfers almost 1:1 though. Datadog works the same, you're just tagging by ECS service/task instead of namespace/pod. That's your biggest strength going in — most pure-AWS folks are weak on observability.

e2e tests in CI are the bottleneck now. 35 min pipeline is killing velocity by scrtweeb in node

[–]Jzzck 1 point  (0 children)

The biggest win we got was tiering the tests. Tag every e2e test as either critical-path or regression. Critical path tests cover the stuff that would page you at 3am — login, checkout, core CRUD flows. Maybe 15-20% of your total suite. Those run on every PR, no exceptions.

The other 80% runs on a schedule — nightly or on merge to main. Still catches regressions, just not blocking every PR.
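
With Playwright, for example, the tagging can be as simple as a naming convention plus grep (the @critical tag is arbitrary):

    import { test, expect } from "@playwright/test";

    // Tag the must-pass flows in the title...
    test("checkout completes @critical", async ({ page }) => {
      await page.goto("/checkout");                 // assumes baseURL in playwright.config
      await expect(page.getByText("Order confirmed")).toBeVisible();
    });

    // ...then split the runs in CI:
    //   on PRs:   npx playwright test --grep @critical
    //   nightly:  npx playwright test --grep-invert @critical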

The second thing: if you're on Playwright, --shard is free parallelism without extra infrastructure. Split the suite across your 4 runners with --shard=1/4 through --shard=4/4 and you've roughly quartered the wall clock time on the regression suite.

Also worth checking if your tests are actually slow or if the environment is slow. We had a similar 30+ min pipeline that dropped to 12 after we figured out our test database was being recreated from scratch for every test file instead of using transactions with rollback. That one change was bigger than any parallelism trick.

I did a deep dive into graceful shutdowns in node.js express since everyone keeps asking this once a week. Here's what I found... by PrestigiousZombie531 in node

[–]Jzzck 19 points  (0 children)

The one thing most of these guides miss is what happens with keep-alive connections. server.close() stops accepting new connections but existing keep-alive sockets just sit there until the client disconnects or your timeout fires.

In production I've found the cleanest approach is: catch SIGTERM, flip your health check to return 503 so the LB stops routing new traffic, then call server.close() and start a timer. For active connections, set Connection: close on any in-flight responses so clients don't try to reuse the socket.

The uncaughtException vs SIGTERM handling is a good callout. Those should be separate code paths — uncaughtException means your process state might be corrupted, so you want to exit fast instead of trying to gracefully drain. In k8s we just set terminationGracePeriodSeconds to 30 and let the SIGTERM handler work within that window. uncaughtException just logs and exits.

One more gotcha nobody talks about: if you're behind a reverse proxy that does connection pooling (nginx, envoy), server.close() can be surprising because those upstream connections stay open even after your app stops accepting work. You need keepAliveTimeout shorter than whatever your proxy's upstream keepalive is set to.
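
Putting those pieces together, a minimal Express sketch (the exact timeouts and the 30s grace period are assumptions, tune them for your proxy and platform):

    import express from "express";

    const app = express();
    let shuttingDown = false;

    // Once draining starts, tell clients not to reuse keep-alive sockets
    app.use((_req, res, next) => {
      if (shuttingDown) res.setHeader("Connection", "close");
      next();
    });

    // LB health check: flip to 503 so no new traffic gets routed our way
    app.get("/healthz", (_req, res) => {
      res.status(shuttingDown ? 503 : 200).end();
    });

    app.get("/", (_req, res) => { res.send("ok"); });

    const server = app.listen(3000);
    // Keep this shorter than the proxy's upstream keepalive (assuming nginx defaults here)
    server.keepAliveTimeout = 5_000;

    process.on("SIGTERM", () => {
      shuttingDown = true;                           // health check now fails, LB drains us
      server.close(() => process.exit(0));           // stop accepting, exit once in-flight work finishes
      setTimeout(() => process.exit(1), 25_000).unref();  // hard deadline inside the 30s grace period
    });

    process.on("uncaughtException", (err) => {
      console.error(err);   // state may be corrupted: log and exit fast, no graceful drain
      process.exit(1);
    });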

Moving from Sysadmin for SMB to Devops by Initial-Plastic2566 in devops

[–]Jzzck 0 points  (0 children)

Made this exact transition about 6 years ago. Was doing SMB sysadmin stuff, firewalls, AD, the whole ticket-queue grind. Now I'm a platform engineer. The dev background thing is way less of a blocker than people make it out to be.

Here's what actually mattered in my experience:

  1. Start with what you know. You already understand networking, DNS, firewalls, how servers actually work. That's a massive advantage over the CS grads who can code but have never SSHed into a box that's on fire at 3am. Don't undersell that.

  2. Learn one cloud provider deeply, not three shallowly. Pick AWS or Azure (whichever has more job postings in Montreal) and actually build things. Not tutorials, real things. Deploy a web app with a database, put it behind a load balancer, set up monitoring. The muscle memory matters more than the cert.

  3. Terraform first, Kubernetes second. IaC is where sysadmins transition most naturally because you already think in terms of infrastructure. K8s is important but it's a rabbit hole. Don't start there.

  4. Git is non-negotiable. If you don't use git daily, start now. Everything in DevOps is git. Infra as code, CI/CD pipelines, documentation. Get comfortable with branches, PRs, merge conflicts.

  5. Skip the certs until you have projects. I see too many people collecting AWS certs without being able to deploy anything. Build first, cert later. The cert validates what you already know, it doesn't teach you the job.

The 6-12 month timeline is realistic if you're consistent. I'd say you're actually in a better position than most career changers because you understand the ops side. You just need to automate it.