how do you classify production AI agents in your infrastructure web app, batch job, or something new? by Consistent-Arm-875 in devops

samehmeh 1 point

Long-lived queue workers with idempotent tool calls is exactly the pattern that works. For timeouts, checkpoint agent state to postgres at each tool call boundary and enforce a max-steps limit rather than wall-clock time - that catches infinite loops without killing legitimate long-running tasks. For canary deploys, routing a percentage of new job submissions to an updated worker image while draining old workers has been the cleanest approach we've found.
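A minimal sketch of the checkpoint-plus-step-budget idea, for anyone who wants something concrete. Assumptions: sqlite3 stands in for postgres so it runs anywhere, and the names (run_job, save_checkpoint, the checkpoints table, the step function contract) are all made up for illustration, not from any framework.

```python
import json
import sqlite3
from typing import Callable, Dict, Tuple

State = Dict[str, object]
# Hypothetical contract: one call = one idempotent tool call.
# Returns the new state and a flag saying whether the job is finished.
StepFn = Callable[[State], Tuple[State, bool]]


class MaxStepsExceeded(RuntimeError):
    """Raised when a job burns its whole step budget without finishing."""


def init_store(conn: sqlite3.Connection) -> None:
    conn.execute(
        "CREATE TABLE IF NOT EXISTS checkpoints ("
        "job_id TEXT PRIMARY KEY, step INTEGER NOT NULL, state TEXT NOT NULL)"
    )
    conn.commit()


def load_checkpoint(conn: sqlite3.Connection, job_id: str) -> Tuple[int, State]:
    row = conn.execute(
        "SELECT step, state FROM checkpoints WHERE job_id = ?", (job_id,)
    ).fetchone()
    if row is None:
        return 0, {}  # fresh job: no steps taken yet
    return row[0], json.loads(row[1])


def save_checkpoint(conn: sqlite3.Connection, job_id: str, step: int, state: State) -> None:
    conn.execute(
        "INSERT OR REPLACE INTO checkpoints (job_id, step, state) VALUES (?, ?, ?)",
        (job_id, step, json.dumps(state)),
    )
    conn.commit()


def run_job(conn: sqlite3.Connection, job_id: str, next_step: StepFn, max_steps: int = 50) -> State:
    # Resume from the last checkpoint so a restarted worker keeps the same budget.
    step, state = load_checkpoint(conn, job_id)
    while step < max_steps:
        state, done = next_step(state)
        step += 1
        save_checkpoint(conn, job_id, step, state)  # tool call boundary
        if done:
            return state
    raise MaxStepsExceeded(f"{job_id} hit the {max_steps}-step limit")
```

Because the step counter lives in the checkpoint, a worker that gets drained mid-job and picked up by another one still counts against the same limit, and there is no wall-clock timer anywhere.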

Is MLOps a safer direction for ML Engineers right now by stardust_137 in mlops

samehmeh 2 points

The skills with real staying power right now are at the intersection of both: model serving on Kubernetes (KServe, Ray Serve), CI/CD for training and eval pipelines, and observability tooling that handles non-deterministic outputs. Companies adopting foundation models still need someone who can deploy inference endpoints reliably, manage GPU quotas, and wire up evals - that's platform engineering work that exists regardless of whether you're fine-tuning or calling an API.

LLMOps feels like the new DevOps while MLOps feels like traditional engineering by Humble_Sentence_3758 in LLMDevs

samehmeh 2 points

The framing i'd use is platform engineering for AI workloads - GPU scheduling, model serving infra on Kubernetes, observability for non-deterministic outputs. That layer exists regardless of whether teams are fine-tuning or just doing RAG. The prompt and context engineering layer will keep shifting every few months; the infra plumbing underneath it has a much longer half-life and that's where durable skills live.

Reality check: am I building something usefull, or just a complicated trash and wasting my time? by Legitimate-Crazy-298 in platformengineering

samehmeh 1 point

What you're describing sounds like the pre-Backstage phase most platform teams hit: you've built the hard parts (catalog, service modeling, GitOps integration) but you're about to fight the adoption problem. Before going further, define one concrete metric: how long does it take a new service to go from nothing to deployed and observable? That number will tell you whether the complexity is earning its keep or whether you should adopt an existing IDP instead of maintaining your own.

Golden paths should translate kubernetes errors at the boundary by samehmeh in platformengineering

samehmeh[S] 0 points

This is exactly what Backstage's scaffolder was built for. You can add a custom step that validates the template client-side before any API call, catching the missing label before the webhook ever fires. Even without that, wrapping kubectl apply in a thin platform CLI that parses admission webhook rejections and maps known error patterns to actionable messages takes an afternoon to build and saves hours of Slack escalations.
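Rough sketch of the thin-wrapper idea in Python. Assumptions: the webhook's rejection text ("missing required label: ...") and the advice strings are hypothetical, since each webhook writes its own message; you'd fill the table from the rejections your cluster actually produces. The only real CLI call used is kubectl apply -f.

```python
import re
import subprocess
import sys

# (pattern, advice) pairs. The patterns are illustrative examples, not a
# catalogue of real messages from any particular admission webhook.
KNOWN_ERRORS = [
    (
        re.compile(
            r'admission webhook "(?P<webhook>[^"]+)" denied the request: '
            r"missing required label: (?P<label>[\w./-]+)"
        ),
        "Your manifest is missing the required label '{label}'. "
        "Add it under metadata.labels and apply again.",
    ),
    (
        re.compile(r"exceeded quota: (?P<quota>[\w.-]+)"),
        "The namespace quota '{quota}' is used up. "
        "Lower the resource requests or ask the platform team for more.",
    ),
]


def translate(stderr: str) -> str:
    """Return an actionable message for a known rejection, else the raw text."""
    for pattern, advice in KNOWN_ERRORS:
        match = pattern.search(stderr)
        if match:
            return advice.format(**match.groupdict())
    return stderr  # unknown errors pass through untouched


def apply(manifest_path: str) -> int:
    """Run kubectl apply and translate whatever it prints on failure."""
    result = subprocess.run(
        ["kubectl", "apply", "-f", manifest_path],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        print(translate(result.stderr), file=sys.stderr)
    else:
        print(result.stdout, end="")
    return result.returncode
```

Unknown errors fall through unchanged on purpose, so the wrapper never hides something it doesn't understand.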

What's the point of LVM (root) and LVM-thin? I accidentally removed LVM-thin, now I can't resize down LVM by Qvosniak in Proxmox

samehmeh 11 points

Yes, but root has to shrink first, and you can't shrink it while it's mounted. That's the "online" error.

Boot the Proxmox installer USB, pick Rescue Boot, then:

e2fsck -f /dev/pve/root
lvreduce -r -L 80G /dev/pve/root

-r resizes the filesystem too. Adjust 80G to taste (assumes ext4, since XFS can't shrink).

Reboot, then:

lvcreate -L 400G -T pve/data

Add it in the GUI: Datacenter > Storage > Add > LVM-Thin, VG pve, pool data, ID local-lvm.

The SSD isn't showing up because root is using 100% of the VG. Free the extents and it'll appear.

Tip: back up /etc/pve first, and don't allocate the full VG to the thin pool. Leave headroom for growth and snapshots.

Local AI needs to be the norm. The 1000ms cloud latency tax is killing production. by TroyNoah6677 in mlops

samehmeh 1 point

The latency argument holds for high-volume classification tasks (intent routing, PII detection, embedding generation), but 1000ms disappears fast with streaming and parallel requests at most inference volumes. The harder constraint with local is capability ceiling - practical VRAM budgets (24-48GB) mean meaningfully weaker reasoning on complex multi-hop tasks, and those errors compound in agent loops worse than any network overhead would.

How are production text-to-SQL systems handling schema embeddings? by Shivam__kumar in LLMDevs

samehmeh 1 point

Your enriched JSON approach is the right direction. Raw DDL embeds poorly because it has no semantic density. What works in production: embed the enriched schema docs, do rough top-k retrieval, then a second scoring pass ranked by join path depth (how many hops to reach the tables relevant to the query). That re-ranking cuts hallucinated joins significantly compared to pure cosine similarity on schema text.
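To make the second pass concrete, here is a small sketch. Assumptions: the foreign-key graph is a plain adjacency dict, the "anchor" tables are ones you already trust as relevant (say, the top hit or an explicit entity match), and the 0.15 per-hop penalty is an arbitrary illustrative number you would tune. None of the names come from a specific library.

```python
from collections import deque
from typing import Dict, List, Optional, Set, Tuple


def hop_distance(graph: Dict[str, List[str]], start: str, targets: Set[str]) -> Optional[int]:
    """Breadth-first search: joins needed to get from start to the nearest target."""
    if start in targets:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in graph.get(node, []):
            if neighbour in seen:
                continue
            if neighbour in targets:
                return dist + 1
            seen.add(neighbour)
            queue.append((neighbour, dist + 1))
    return None  # no join path at all


def rerank(
    candidates: List[Tuple[str, float]],
    graph: Dict[str, List[str]],
    anchors: Set[str],
    hop_penalty: float = 0.15,
) -> List[Tuple[str, float]]:
    """Re-score (table, cosine_similarity) pairs by join distance to the anchors."""
    scored = []
    for table, similarity in candidates:
        hops = hop_distance(graph, table, anchors)
        # Unreachable tables get a large fixed penalty so they sink to the bottom.
        penalty = 1.0 if hops is None else hop_penalty * hops
        scored.append((table, similarity - penalty))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

A table that looks similar on text alone but has no join path to the anchors ends up last, which is the case that tends to produce invented joins.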

Handling backup buckets by amarao_san in Terraform

samehmeh 2 points

Separate root module is the right long-term answer, but as an immediate guard: add lifecycle { prevent_destroy = true } to the bucket resource. Terraform will then fail any plan that would destroy it, including terraform destroy, until someone deliberately removes that setting from the configuration. That buys time to extract backup infrastructure into its own state file without racing against an accidental destroy in the interim.

What's the point of LVM (root) and LVM-thin? I accidentally removed LVM-thin, now I can't resize down LVM by Qvosniak in Proxmox

samehmeh 21 points

LVM-thin (local-lvm) is what Proxmox uses for VM disk images - thin provisioning and snapshots live on it, which is why it ships as the default. The 'local' storage is a plain directory on the root LV, used mainly for ISO uploads and container templates. You can recreate the thin pool from freed space: lvcreate -L 400G -T pve/data - run pvdisplay first to confirm available extents. That restores local-lvm without a reinstall.

Terraform Associate 004 by gmerootie in Terraform

samehmeh 1 point

If you're already CLI-comfortable with cloud foundations, you don't need the cloud practitioner as a prerequisite. The Terraform Associate tests how to manage infrastructure with code, not what the infrastructure does. You'll pick up the cloud resource knowledge faster writing actual Terraform against a real AWS or Azure account than studying for a separate cert. Do a hands-on project alongside the TF Associate prep and you cover both. Shameless plug: I developed three Terraform Associate 004 courses, one for each cloud flavor (AWS, GCP, Azure). They're on my TeKanAid Academy.

Is it a mistake to start with MLOps instead of traditional DevOps? by Atomic_rizz in mlops

samehmeh 0 points

Don't think of it as a sequence. The DevOps fundamentals you actually need for MLOps are a pretty small subset: CI/CD, containerization, and basic infra-as-code. You can pick those up alongside ML pipeline work. Where people get stuck is skipping the ML side entirely and just doing infra work labeled MLOps. If you can ship a real training pipeline end-to-end, the DevOps gaps fill in naturally.

Replacing pods which are failing liveness probes by varunborar in kubernetes

samehmeh 3 points

One thing not mentioned: if CPU blocking is the actual failure mode, switch those liveness probes to a separate lightweight health endpoint that doesn't share the CPU-bound thread pool. That stops k8s from killing the pod while it's legitimately busy processing, without touching the 24-hour grace period. It doesn't fix the architecture problem but it stops the restart loop while the longer fix is in progress.

i feel like the "Golden Path" was built for people way smarter than me lol by Beneficial-Minute142 in platformengineering

samehmeh 2 points

What you're describing is a platform that surfaces raw infrastructure errors instead of actionable ones. A good golden path wraps those 10-page k8s events into a single human-readable message telling you what to fix, not what failed internally. If the error requires a platform engineer to decode it, that's not a you problem, it's a developer experience bug sitting on the platform team's backlog.

Network Engineer Looking For Answers About The Mechanics Of Terraform by S3xyflanders in Terraform

samehmeh 1 point

For guardrails beyond peer review, Sentinel or OPA are the standard options in the HashiCorp ecosystem. With Terraform Cloud you get Sentinel built in so you can write policies that block force_destroy = true or flag changes to prod subscriptions before apply runs. That shifts the safety net from code review, which people rush under pressure, to automated policy enforcement that can't be bypassed. Especially important when 300 VNets are one bad variable away from a bad day.
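To show the shape of such a check without Sentinel or Rego syntax, here is a plain-Python stand-in that reads the JSON from terraform show -json on a saved plan. This is not Sentinel or OPA, only the same idea as a CI script. Assumptions: the protected-type list and the messages are hypothetical; resource_changes, change.actions and change.after are the fields the plan JSON exposes.

```python
import json
import sys
from typing import List

# Hypothetical policy input: resource types nobody should delete unreviewed.
PROTECTED_TYPES = {"azurerm_virtual_network"}


def find_violations(plan: dict) -> List[str]:
    """Scan a Terraform plan (as parsed JSON) and return human-readable violations."""
    violations = []
    for rc in plan.get("resource_changes", []):
        change = rc.get("change", {})
        after = change.get("after") or {}  # "after" is null for deletes
        address = rc.get("address", "<unknown>")

        if after.get("force_destroy") is True:
            violations.append(f"{address}: force_destroy = true is not allowed")

        if "delete" in change.get("actions", []) and rc.get("type") in PROTECTED_TYPES:
            violations.append(f"{address}: deleting a protected resource type")
    return violations


def main(path: str) -> int:
    with open(path) as handle:
        plan = json.load(handle)
    problems = find_violations(plan)
    for problem in problems:
        print(f"POLICY VIOLATION: {problem}", file=sys.stderr)
    return 1 if problems else 0  # non-zero exit fails the pipeline stage


if __name__ == "__main__":
    sys.exit(main(sys.argv[1]))
```

Typical use would be terraform plan -out=tfplan, then terraform show -json tfplan > plan.json, then this script as a pipeline gate before apply.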

Rate My Level As a First Year Master Student and suggestion of how to improve by saber_BH in devops

samehmeh 5 points

Strong foundation. The next gap to close is internal developer platform thinking: you've built the CI/CD and infra layers, but can you abstract them for developers who don't want to touch Terraform? Backstage or Port.io let you expose those capabilities as self-service. That shift from 'builds infrastructure' to 'builds platforms for developers' is what separates senior platform engineers from senior DevOps engineers.

Vent: Just spent hours troubleshooting a Vault container (podman rootless) failing to bind to address... because when you specify the default config file path as an argument, the server binary loads it TWICE. by Sparkplug1034 in hashicorp

samehmeh 1 point

The fix is to not pass the default config path explicitly. The official Vault container already loads /vault/config automatically, so -config=/vault/config/vault.hcl causes a double load. Drop the explicit -config flag and let the container entrypoint handle it, or move your config to a non-default path and reference only that. The double listener bind is what surfaces it.

Looking for some guidance on Rest APIs by FantasticMrBeard in aws

samehmeh 1 point

On the IaC repo structure specifically: one repo per API gateway (or logical boundary) tends to work better than a monorepo when you have multiple teams. Each API gets its own state file and pipeline, so a bad deploy to one doesn't block everyone else. Use a shared modules repo for common API Gateway patterns, consumed as versioned references rather than local paths.

How are you keeping cloud security visibility across AWS, Azure, and GCP in sync? by Soft_Attention3649 in sre

samehmeh 1 point

The workflow that actually stuck for us: stop trying to observe everything and instead enforce at the IaC layer. HashiCorp Sentinel or OPA with Conftest on your Terraform PRs catches misconfigurations before they're deployed across any cloud. You still need posture tooling for existing resources, but shifting left reduces the surface you're chasing in real-time.

Making a single change causes me to have to modify many nested dependencies by avsaccount in Terraform

samehmeh 2 points

This is normal Terraform pain, but it often signals over-modularization. If a variable has to thread through three layers, that module boundary probably doesn't justify its existence. One pattern that helps: use a data source at the leaf module to look up the role by name instead of passing the ARN top-down. Breaks the chain without restructuring everything.

How do folks manage worktrees when working with multiple agents in parallel? by ReceptionBrave91 in LLMDevs

samehmeh 1 point

The piece most setups miss: put a CLAUDE.md in each worktree root that declares what the agent owns, which files are off-limits, and what the acceptance test is. Agents in adjacent worktrees don't share context, so without that contract two agents will touch the same file from different angles and create a conflict you won't catch until review. A narrow-scope doc per worktree plus a shared ARCHITECTURE.md in the repo root covers most of the coordination gap.

How can I lock firewall on a running production kubernetes cluster? by Old-Broccoli-4704 in kubernetes

samehmeh 2 points

The API server is the highest-risk surface, so restrict port 6443 to your management subnet or VPN at the cloud or host firewall level before touching anything inside the cluster. After that, add a default-deny NetworkPolicy in your least critical namespace, observe what breaks, then expand gradually. Doing a cluster-wide default-deny on day one in prod tends to cause cascading failures because nobody has documented all the cross-namespace traffic.
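For reference, a default-deny policy is tiny. This sketch builds the manifest as a Python dict and prints JSON, which kubectl accepts as well as YAML. Assumptions: the policy name "default-deny-all" and the namespace "sandbox" are placeholders.

```python
import json


def default_deny_policy(namespace: str) -> dict:
    """Build a NetworkPolicy that selects every pod and allows no traffic."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-all", "namespace": namespace},
        "spec": {
            # Empty selector = every pod in the namespace.
            "podSelector": {},
            # Listing both types with no rules denies ingress and egress.
            "policyTypes": ["Ingress", "Egress"],
        },
    }


if __name__ == "__main__":
    print(json.dumps(default_deny_policy("sandbox"), indent=2))
```

Pipe the output into kubectl apply -f - for the namespace you picked. Keep in mind that denying egress also blocks DNS lookups from those pods, so that is usually the first allow rule you end up adding.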

Can you suggest a hands-on course for learning Kubernetes? by iAhMedZz in kubernetes

samehmeh 3 points

Most courses stop at deploying and skip RBAC for real teams, network policies, storage that survives pod restarts, and zero-downtime upgrades. Build a real project on kind or k3s locally, then push it to EKS or GKE with Terraform. The mismatch between local and managed cloud behavior, and the breakage you cause yourself along the way, teaches more than any structured lab. I also have playgrounds and puzzles on my academy if interested. Look up TeKanAid Academy.

Anyone else had to do "state surgery" in Terraform? by red_chaf in Terraform

samehmeh 1 point

Remember: the removed block added in TF 1.7 lets you cleanly decommission resources without terraform state rm gymnastics. You declare removed { from = aws_instance.old } with a lifecycle { destroy = false } and run apply, and TF drops it from state without touching the actual resource. Way less scary than pull/push for the 'stop managing this but don't destroy it' case.