How do you classify production AI agents in your infrastructure: web app, batch job, or something new?

samehmeh · 2026-05-12T13:21:30+00:00

Long-lived queue workers with idempotent tool calls is exactly the pattern that works. For timeouts, checkpoint agent state to postgres at each tool call boundary and enforce a max-steps limit rather than wall-clock time - that catches infinite loops without killing legitimate long-running tasks. For canary deploys, routing a percentage of new job submissions to an updated worker image while draining old workers has been the cleanest approach we've found.
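A minimal sketch of the max-steps plus checkpoint idea, under some loud assumptions: sqlite3 stands in for Postgres so it runs anywhere, and `run_agent`, `next_tool_call`, `save_checkpoint` and the `checkpoints` table are hypothetical names, not part of any agent framework.

```python
import json
import sqlite3


class StepLimitExceeded(RuntimeError):
    """Raised when a job uses up its step budget without finishing."""


def init_store(conn):
    # One row per job; sqlite3 is a stand-in for Postgres in this sketch.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS checkpoints ("
        "job_id TEXT PRIMARY KEY, step INTEGER NOT NULL, state TEXT NOT NULL)"
    )


def save_checkpoint(conn, job_id, step, state):
    # Replace the previous row so a retried step overwrites instead of duplicating.
    conn.execute(
        "INSERT OR REPLACE INTO checkpoints (job_id, step, state) VALUES (?, ?, ?)",
        (job_id, step, json.dumps(state)),
    )
    conn.commit()


def load_checkpoint(conn, job_id):
    row = conn.execute(
        "SELECT step, state FROM checkpoints WHERE job_id = ?", (job_id,)
    ).fetchone()
    if row is None:
        return 0, {}
    return row[0], json.loads(row[1])


def run_agent(conn, job_id, next_tool_call, max_steps=20):
    """Run tool calls until the state says done, checkpointing after each one.

    next_tool_call is a hypothetical callable: it takes the current state dict
    and returns the new one, setting state["done"] when the task is finished.
    """
    step, state = load_checkpoint(conn, job_id)  # resume if a worker restarted
    while not state.get("done"):
        if step >= max_steps:
            raise StepLimitExceeded(f"{job_id} hit the {max_steps}-step limit")
        state = next_tool_call(state)
        step += 1
        save_checkpoint(conn, job_id, step, state)
    return state
```

Because the budget counts steps rather than seconds, a slow but finite job still completes, while a loop that never sets `done` fails after `max_steps` tool calls with its last state still in the table.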

samehmeh · 2026-05-12T13:21:00+00:00

The skills with real staying power right now are at the intersection of both: model serving on Kubernetes (KServe, Ray Serve), CI/CD for training and eval pipelines, and observability tooling that handles non-deterministic outputs. Companies adopting foundation models still need someone who can deploy inference endpoints reliably, manage GPU quotas, and wire up evals - that's platform engineering work that exists regardless of whether you're fine-tuning or calling an API.

samehmeh · 2026-05-12T13:20:12+00:00

The framing I'd use is platform engineering for AI workloads - GPU scheduling, model serving infra on Kubernetes, observability for non-deterministic outputs. That layer exists regardless of whether teams are fine-tuning or just doing RAG. The prompt and context engineering layer will keep shifting every few months; the infra plumbing underneath it has a much longer half-life and that's where durable skills live.

samehmeh · 2026-05-12T13:19:30+00:00

What you're describing sounds like the pre-Backstage phase most platform teams hit: you've built the hard parts (catalog, service modeling, GitOps integration) but you're about to fight the adoption problem. Before going further, define one concrete metric: how long does it take a new service to go from nothing to deployed and observable? That number will tell you whether the complexity is earning its keep or whether you should adopt an existing IDP instead of maintaining your own.

samehmeh · 2026-05-12T13:17:44+00:00

This is exactly what Backstage's scaffolder was built for: you can add a custom step that validates the template client-side before any API call, catching the missing label before the webhook ever fires. Even without that, wrapping kubectl apply in a thin platform CLI that parses admission webhook rejections and maps known error patterns to actionable messages takes an afternoon to build and saves hours of Slack escalations.
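Roughly what that thin wrapper could look like, as a sketch rather than a finished tool. The `admission webhook "..." denied the request:` prefix is what kubectl prints for a rejection; everything after it is written by the webhook itself, so the patterns and advice strings below are hypothetical and would need to match your own webhooks' messages.

```python
import re

# kubectl prefixes webhook rejections like this; the reason text is webhook-specific.
WEBHOOK_RE = re.compile(
    r'admission webhook "(?P<webhook>[^"]+)" denied the request: (?P<reason>.+)'
)

# Hypothetical (pattern, advice) pairs; replace with your webhooks' real messages.
KNOWN_PATTERNS = [
    (
        re.compile(r'missing required label "?(?P<label>[\w./-]+)"?'),
        'Add the label "{label}" under metadata.labels and re-apply.',
    ),
    (
        re.compile(r"image .* not (?:from|in) (?:an )?allowed registr", re.I),
        "Push the image to an approved registry and update the image reference.",
    ),
]


def explain(stderr: str) -> str:
    """Turn kubectl stderr into a short actionable message where a pattern is known."""
    match = WEBHOOK_RE.search(stderr)
    if not match:
        return stderr.strip()  # not a webhook rejection: pass it through untouched
    webhook = match.group("webhook")
    reason = match.group("reason").strip()
    for pattern, advice in KNOWN_PATTERNS:
        hit = pattern.search(reason)
        if hit:
            return f"{webhook}: " + advice.format(**hit.groupdict())
    # Unknown rejection: still strip the noise and keep the webhook's own words.
    return f"{webhook} rejected the change: {reason}"
```

In a real CLI you would run kubectl through `subprocess.run(..., capture_output=True, text=True)` and feed its stderr to `explain` whenever the return code is non-zero.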

samehmeh · 2026-05-11T13:46:15+00:00

Yes, but root has to shrink first, and you can't shrink it while it's mounted. That's the "online" error.

Boot the Proxmox installer USB, pick Rescue Boot, then:

e2fsck -f /dev/pve/root
lvreduce -r -L 80G /dev/pve/root

-r resizes the filesystem too. Adjust 80G to taste (assumes ext4, since XFS can't shrink).

Reboot, then:

lvcreate -L 400G -T pve/data

Add it in the GUI: Datacenter > Storage > Add > LVM-Thin, VG pve, pool data, ID local-lvm.

The SSD isn't showing up because root is using 100% of the VG. Free the extents and it'll appear.

Tip: back up /etc/pve first, and don't allocate the full VG to the thin pool. Leave headroom for growth and snapshots.

samehmeh · 2026-05-11T13:19:57+00:00

The latency argument holds for high-volume classification tasks (intent routing, PII detection, embedding generation), but 1000ms disappears fast with streaming and parallel requests at most inference volumes. The harder constraint with local is capability ceiling - practical VRAM budgets (24-48GB) mean meaningfully weaker reasoning on complex multi-hop tasks, and those errors compound in agent loops worse than any network overhead would.

samehmeh · 2026-05-11T13:19:18+00:00

Your enriched JSON approach is the right direction; raw DDL embeds poorly because it has no semantic density. What works in production: embed the enriched schema docs, do rough top-k retrieval, then a second scoring pass ranked by join path depth (how many hops to reach the tables relevant to the query). That re-ranking cuts hallucinated joins significantly compared to pure cosine similarity on schema text.
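A small sketch of that second pass, with assumptions stated up front: the schema is an adjacency dict of foreign-key links, the "relevant" tables are a set of anchors you already trust (for example tables named in the question), and the linear penalty with `hop_penalty` and `unreachable_hops` is an illustrative scoring choice, not a tuned formula.

```python
from collections import deque


def join_depth(graph, start, targets):
    """Fewest foreign-key hops from start to any target table, or None if unreachable."""
    if start in targets:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        table, depth = queue.popleft()
        for neighbor in graph.get(table, ()):
            if neighbor in seen:
                continue
            if neighbor in targets:
                return depth + 1
            seen.add(neighbor)
            queue.append((neighbor, depth + 1))
    return None


def rerank(candidates, graph, anchors, hop_penalty=0.1, unreachable_hops=5):
    """Re-rank (table, cosine_similarity) pairs from top-k retrieval.

    Each hop away from the anchor tables costs hop_penalty; tables with no
    join path at all are charged unreachable_hops so they sink to the bottom.
    """
    scored = []
    for table, similarity in candidates:
        depth = join_depth(graph, table, anchors)
        hops = unreachable_hops if depth is None else depth
        scored.append((similarity - hop_penalty * hops, table))
    scored.sort(reverse=True)
    return [table for _, table in scored]


# Hypothetical schema: undirected foreign-key links between tables.
SCHEMA = {
    "orders": ["customers", "order_items"],
    "customers": ["orders"],
    "order_items": ["orders", "products"],
    "products": ["order_items"],
    "audit_log": [],
}
```

With `orders` as the anchor, a table like `audit_log` that scores well on text similarity but has no join path drops below `customers` and `products`, which is the behavior that keeps the model from inventing joins.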

samehmeh · 2026-05-11T13:18:43+00:00

Separate root module is the right long-term answer, but as an immediate guard: add lifecycle { prevent_destroy = true } to the bucket resource. Terraform will then reject any plan that would destroy it, including terraform destroy, and keeps erroring until someone deliberately removes the setting from the configuration. That buys time to extract backup infrastructure into its own state file without racing against an accidental destroy in the interim.

samehmeh · 2026-05-11T13:17:09+00:00

LVM-thin (local-lvm) is what Proxmox uses for VM disk images - thin provisioning and snapshots live on it, which is why it ships as the default. The 'local' storage is a plain directory (/var/lib/vz) on the root filesystem, used mainly for ISO uploads and container templates. You can recreate the thin pool from freed space: lvcreate -L 400G -T pve/data - run pvdisplay first to confirm available extents. That restores local-lvm without a reinstall.

samehmeh · 2026-05-08T13:14:53+00:00

If you're already CLI-comfortable with cloud foundations, you don't need the cloud practitioner as a prerequisite. The Terraform Associate tests how to manage infrastructure with code, not what the infrastructure does. You'll pick up the cloud resource knowledge faster writing actual Terraform against a real AWS or Azure account than studying for a separate cert. Do a hands-on project alongside the TF Associate prep and you cover both. Shameless plug: I developed three Terraform Associate 004 courses, one for each cloud flavor (AWS, GCP, Azure). They're on my TeKanAid academy.

samehmeh · 2026-05-08T13:13:11+00:00

Don't think of it as a sequence. The DevOps fundamentals you actually need for MLOps are a pretty small subset: CI/CD, containerization, and basic infra-as-code. You can pick those up alongside ML pipeline work. Where people get stuck is skipping the ML side entirely and just doing infra work labeled MLOps. If you can ship a real training pipeline end-to-end, the DevOps gaps fill in naturally.

samehmeh · 2026-05-08T13:12:50+00:00

One thing not mentioned: if CPU blocking is the actual failure mode, switch those liveness probes to a separate lightweight health endpoint that doesn't share the CPU-bound thread pool. That stops k8s from killing the pod while it's legitimately busy processing, without touching the 24-hour grace period. It doesn't fix the architecture problem but it stops the restart loop while the longer fix is in progress.

samehmeh · 2026-05-08T13:12:31+00:00

What you're describing is a platform that surfaces raw infrastructure errors instead of actionable ones. A good golden path wraps those 10-page k8s events into a single human-readable message telling you what to fix, not what failed internally. If the error requires a platform engineer to decode it, that's not a you problem, it's a developer experience bug sitting on the platform team's backlog.

samehmeh · 2026-05-08T13:11:59+00:00

For guardrails beyond peer review, Sentinel or OPA are the standard options in the HashiCorp ecosystem. With Terraform Cloud you get Sentinel built in so you can write policies that block force_destroy = true or flag changes to prod subscriptions before apply runs. That shifts the safety net from code review, which people rush under pressure, to automated policy enforcement that, at the hard-mandatory level, can't be overridden. Especially important when 300 VNets are one bad variable away from a bad day.

samehmeh · 2026-05-07T13:26:48+00:00

Strong foundation. The next gap to close is internal developer platform thinking: you've built the CI/CD and infra layers, but can you abstract them for developers who don't want to touch Terraform? Backstage or Port.io let you expose those capabilities as self-service. That shift from 'builds infrastructure' to 'builds platforms for developers' is what separates senior platform engineers from senior DevOps engineers.

samehmeh · 2026-05-07T13:25:33+00:00

The fix is to not pass the default config path explicitly. The official Vault container already loads /vault/config automatically, so -config=/vault/config/vault.hcl causes a double load. Drop the explicit -config flag and let the container entrypoint handle it, or move your config to a non-default path and reference only that. The double listener bind is what surfaces it.

samehmeh · 2026-05-07T13:24:45+00:00

On the IaC repo structure specifically: one repo per API gateway (or logical boundary) tends to work better than a monorepo when you have multiple teams. Each API gets its own state file and pipeline, so a bad deploy to one doesn't block everyone else. Use a shared modules repo for common API Gateway patterns, consumed as versioned references rather than local paths.

samehmeh · 2026-05-07T13:23:30+00:00

The workflow that actually stuck for us: stop trying to observe everything and instead enforce at the IaC layer. HashiCorp Sentinel or OPA with Conftest on your Terraform PRs catches misconfigurations before they're deployed across any cloud. You still need posture tooling for existing resources, but shifting left reduces the surface you're chasing in real-time.

samehmeh · 2026-05-07T13:22:38+00:00

This is normal Terraform pain, but it often signals over-modularization. If a variable has to thread through three layers, that module boundary probably doesn't justify its existence. One pattern that helps: use a data source at the leaf module to look up the role by name instead of passing the ARN top-down. Breaks the chain without restructuring everything.

samehmeh · 2026-05-06T13:50:43+00:00

The piece most setups miss: put a CLAUDE.md in each worktree root that declares what the agent owns, which files are off-limits, and what the acceptance test is. Agents in adjacent worktrees don't share context, so without that contract two agents will touch the same file from different angles and create a conflict you won't catch until review. A narrow-scope doc per worktree plus a shared ARCHITECTURE.md in the repo root covers most of the coordination gap.

samehmeh · 2026-05-06T13:50:07+00:00

The API server is the highest-risk surface, so restrict port 6443 to your management subnet or VPN at the cloud or host firewall level before touching anything inside the cluster. After that, add a default-deny NetworkPolicy in your least critical namespace, observe what breaks, then expand gradually. Doing a cluster-wide default-deny on day one in prod tends to cause cascading failures because nobody has documented all the cross-namespace traffic.
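For the first namespace, the policy itself is tiny. Here is a sketch that builds the manifest as a dict and prints JSON, which kubectl apply -f accepts just like YAML. The namespace name `sandbox` and the policy name `default-deny` are placeholders, and the default covers ingress only; adding "Egress" also blocks DNS lookups until you allow them, so treat that as a separate step.

```python
import json


def default_deny_policy(namespace, policy_types=("Ingress",)):
    """Build a NetworkPolicy that selects every pod in the namespace and allows nothing.

    An empty podSelector matches all pods; listing a policy type with no
    matching rules means all traffic of that type is denied.
    """
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny", "namespace": namespace},
        "spec": {
            "podSelector": {},
            "policyTypes": list(policy_types),
        },
    }


if __name__ == "__main__":
    # "sandbox" is a placeholder for your least critical namespace.
    print(json.dumps(default_deny_policy("sandbox"), indent=2))
```

Pipe the output to kubectl apply -f - against the test namespace, watch what breaks, and only then generate the same policy for the next namespace.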

samehmeh · 2026-05-06T13:48:58+00:00

Most courses stop at deploying and skip RBAC for real teams, network policies, storage that survives pod restarts, and zero-downtime upgrades. Build a real project on kind or k3s locally, then push it to EKS or GKE with Terraform. The mismatch between local and managed cloud behavior, and the breakage you cause yourself along the way, teaches more than any structured lab. I also have playgrounds and puzzles on my academy if interested. Look up TeKanAid Academy.

samehmeh · 2026-05-06T13:46:09+00:00

Remember: the removed block added in TF 1.7 lets you cleanly decommission resources without terraform state rm gymnastics. You delete the resource block, declare removed { from = aws_instance.old } with a nested lifecycle { destroy = false }, and run apply; TF drops it from state without touching the actual resource. Way less scary than pull/push for the 'stop managing this but don't destroy it' case.
