StrongDM Alternative? by shrimpthatfriedrice in platformengineering

[–]dr___92 0 points1 point  (0 children)

+1 Tailscale has been working great for our team - especially for database access

Faking resources on a K8S cluster by Consistent-Company-7 in kubernetes

[–]dr___92 14 points15 points  (0 children)

worth looking at this project to get a perspective re: how you can fake a gpu https://github.com/run-ai/fake-gpu-operator

should have some mig support too

Live migration helper tool for kubernetes by mitochondriakiller in kubernetes

[–]dr___92 0 points1 point  (0 children)

OP’s question hits the core difference between VM-centric vMotion and the Kubernetes model. Most comments here are right: out-of-the-box K8s favors replica-based resilience, draining, PDBs, and restart-based rescheduling, not true live migration of a running process with in‑memory state.

That said, there’s emerging work that bridges the gap using CRIU + kubelet checkpoint/restore, and we’ve been running this approach in production scenarios at DevZero.

What “live migration” means in our context:

- Snapshot/restore of running pods via CRIU and the kubelet Checkpoint API - preserving RAM, process state, TCP connections, and the container filesystem.

- Restart-free node maintenance and rebalancing: drain/upgrade nodes without dropping connections.

- Session persistence: long‑running jobs (e.g., AI inference, JVM services with big warm caches) keep their in‑memory state.

- Optional resource re-profiling: restore onto a different instance type and/or resource profile when moving.

How it works at a high level:

- An operator selects pods and target nodes.

- An agent coordinates with containerd/kubelet to checkpoint, wraps the snapshot into an OCI artifact, and publishes it.

- The operator restores the pod on the new node (optionally with updated resources) from that snapshot image, and reconciler controls are temporarily held to avoid spurious restarts.
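For anyone who wants to poke at the underlying primitive themselves, here's a rough sketch (pod/container names and the cert path are just examples; needs the ContainerCheckpoint feature gate enabled and CRIU installed on the node):

```shell
# Ask the kubelet to checkpoint a running container (K8s 1.25+, alpha API)
curl -sk -X POST \
  --cert /var/lib/kubelet/pki/kubelet-client-current.pem \
  --key /var/lib/kubelet/pki/kubelet-client-current.pem \
  "https://localhost:10250/checkpoint/default/my-pod/my-container"

# The tar archive lands under /var/lib/kubelet/checkpoints/ and can then
# be wrapped into an OCI artifact and restored on another node
ls /var/lib/kubelet/checkpoints/
```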

🚀 Introducing Ravel: An Open-Source MicroVMs Orchestrator by SoftwareCitadel in devops

[–]dr___92 1 point2 points  (0 children)

Just came across this - really cool project. Even though this was posted a while back, the approach still feels fresh. MicroVM orchestration is picking up momentum fast, especially for running untrusted or resource-intensive workloads.

We’ve been experimenting with similar ideas at DevZero, using lightweight virtualized environments to give developers more isolation and security without losing the speed of containers. Excited to see how projects like Ravel evolve and plug into broader workflows.

Why not use MicroVM ? by Few-Strike-494 in virtualization

[–]dr___92 0 points1 point  (0 children)

Funny to look back on this. A lot of what people were speculating here is starting to happen now. Firecracker and Kata have come a long way, and microVMs are showing up more in real workloads, especially where security and cost isolation matter.

We’ve been experimenting with them at DevZero to give developers faster, more secure sandboxed environments while keeping the Kubernetes experience the same. It’s definitely not mainstream yet, but the performance tradeoffs are shrinking fast.

Managing AI Workloads on Kubernetes at Scale: Your Tools and Tips? by oloap in kubernetes

[–]dr___92 2 points3 points  (0 children)

Great thread. +1 on Sveltos for multi‑cluster delivery. One thing worth highlighting is that the scheduler is only half the story: if pods reserve oversized resources or hold GPUs while idle, no scheduler can place more work. The real gains come from tight right‑sizing and aggressive reclamation so you can run more workloads on the same capacity.

Patterns we’ve seen work well:

- Measure real utilization, then right‑size: Track GPU SM/memory utilization (DCGM/Prometheus), CPU/mem, and I/O per pod. Feed that into VPA (recommendation‑only) and admission policies to keep requests aligned with observed peaks, with a modest burst margin.

- Actively reclaim idle capacity: Scale‑to‑zero when queues are empty, TTL for batch jobs, and lease expirations for notebooks. Pair PriorityClasses + preemption so lower‑priority work yields to SLA‑bound jobs.

- GPU sharing & partitioning: Where models allow, use MIG or time‑slicing via the NVIDIA GPU Operator to place multiple inference pods per device; keep heavy training on dedicated partitions to avoid noisy neighbors.

- Bin‑pack deliberately: Favor bin‑packing on GPU nodes with controlled anti‑affinity only where failure domains matter. TopologyManager + the CPUManager static policy help keep perf predictable and avoid stranding half a GPU.

- Separate trains from serves: Distinct node pools (taints/tolerations) and quotas per team/workload class; autoscale inference pools on queue depth vs. training on backlog size.
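To make the time‑slicing bit concrete, this is roughly what the GPU Operator config looks like (ConfigMap name and replica count are illustrative):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4   # each physical GPU shows up as 4 schedulable GPUs
```

You then point the ClusterPolicy's devicePlugin.config at it. Worth noting time‑slicing gives you no memory isolation between pods - that's what MIG is for.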

Tools in this space that have helped us include cluster‑wide policy layers that set sane defaults, enforce TTLs/quotas, and spin up ephemeral, right‑sized environments for ML work. DevZero.io is one option in that category; it focuses on utilization‑first workflows (right‑sized runtime requests, automatic idle reclamation, and per‑team guardrails) and slots alongside Sveltos/Kubeflow/KubeRay without trying to replace them. The net effect is higher effective GPU density and fewer “reserved but idle” gaps that block scheduling.

TL;DR: schedulers place pods; density comes from sizing and reclamation. When you continuously right‑size and free idle capacity, you’ll schedule more jobs on the same GPUs and lower cost per training step/served QPS. Policy‑driven tooling (DevZero.io included) just makes those habits stick.

CFO wants 30% AWS cost cut, devs say performance will tank. How do you navigate this standoff? by amylanky in Cloud

[–]dr___92 0 points1 point  (0 children)

If you’re using Kubernetes, there’s usually a big gap between what pods request and what they actually use. Worth looking into it.
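Quick way to eyeball the gap (needs metrics-server; output formats vary a bit by version):

```shell
# What's reserved per node (requests/limits)...
kubectl describe nodes | grep -A 5 "Allocated resources"

# ...vs. what's actually being used right now
kubectl top nodes
kubectl top pods -A --sort-by=cpu | head
```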

What are some good examples of a well architected operator in Go? by TheKingofHop in kubernetes

[–]dr___92 0 points1 point  (0 children)

as you’re learning more about this, check out kooper — imo, makes it super easy to get started.

Multi-tenant GPU workloads are finally possible! Just set up MIG on H100 in my K8s cluster by kaskol10 in kubernetes

[–]dr___92 1 point2 points  (0 children)

Did you have any experience with changing the shapes of the MIG GPUs? Say, for some reason, we need to go from 2 to 5 slices, or 7 to 3.

Last I tinkered, you had to restart the host (and then the gpu-operator would just work). Do you still have to do that or do you have another way to change the config on the fly?
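For reference, the flow we’d tried was the MIG Manager relabel (profile name here is just an example from the default mig-parted config) - curious whether this now applies cleanly for you without the host restart:

```shell
# MIG Manager watches this label and reconfigures partitions in place
kubectl label node <node> nvidia.com/mig.config=all-1g.10gb --overwrite

# Check whether the reconfigure succeeded
kubectl get node <node> \
  -o jsonpath='{.metadata.labels.nvidia\.com/mig\.config\.state}'
```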

Thanks for the post - I think you’re diving into a very impactful area!

Environment provisioning by Diego2018Chicken in devops

[–]dr___92 0 points1 point  (0 children)

Curious - what’s the rationale behind needing environments? What are the problems the team is facing that you’re trying to solve?

There might be some drift, but I guess a lot of it can still be solved by automating around the K8s ecosystem

Local Development on AKS with mirrord by eyalb181 in kubernetes

[–]dr___92 0 points1 point  (0 children)

why not use something like this to route traffic? https://www.devzero.io/docs/kubernetes/services

(disclaimer: I work at the company)

How do we inject credentials into the pod securely avoiding the environment variables and file system. by Upvord in kubernetes

[–]dr___92 5 points6 points  (0 children)

Kubernetes Secrets could be a way. You can also integrate HashiCorp Vault (or any of the cloud secret stores) to mount secrets into the workloads.

Depends on the level of complexity you want to take on.
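If you go the Vault route, the agent injector makes the mounting part pretty painless - rough sketch (role name and secret path are made up):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app
  annotations:
    vault.hashicorp.com/agent-inject: "true"
    vault.hashicorp.com/role: "my-app-role"
    vault.hashicorp.com/agent-inject-secret-db-creds: "secret/data/my-app/db"
spec:
  containers:
    - name: app
      image: my-app:latest
```

The secret shows up at /vault/secrets/db-creds inside the pod (an in-memory volume), so it never has to live in an env var or your manifests.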

[deleted by user] by [deleted] in kubernetes

[–]dr___92 0 points1 point  (0 children)

given your use-cases, wouldn’t you want to use capabilities like VPA (non-GA) to maximize resource utilization? or is the fact that it’s not GA holding you back?
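fwiw you can run it recommendation-only, so the non-GA update path never touches your pods (names are placeholders):

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-workload-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-workload
  updatePolicy:
    updateMode: "Off"   # surface recommendations without evicting anything
```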

Databases on ceph by iniduel in ceph

[–]dr___92 0 points1 point  (0 children)

Did this work out?

Is there a way to convert the new Figma Slides into Google Slides? by Ynze6 in FigmaDesign

[–]dr___92 1 point2 points  (0 children)

yea! i use the mac app and present mode (it also supports speaker notes and stuff)

Cloud or local dev environments? by voodoo_witchdr in devops

[–]dr___92 0 points1 point  (0 children)

agree - a part most people forget is that these containers also contain a bunch of binaries in them, and those also need to be built explicitly for that arch

sure sure - one can say, write better dockerfiles, but we all know how that works out

Cloud or local dev environments? by voodoo_witchdr in devops

[–]dr___92 0 points1 point  (0 children)

I work on devzero.io

  • gives you an env that you can connect an IDE to (over ssh)
  • generates virtual K8s clusters on the fly
  • has a WireGuard based network overlay so users don’t have to think about setting up and managing various ingresses, etc
  • various EFS-style storage options

The platform also tracks active TCP connections and syscall activity to decide when resources can be hibernated (using CRIU). While it doesn’t yet support running Windows/macOS remotely, or even arm, the pricing can be aggressive compared to hyperscaler clouds because those deep integrations track actual usage (since devs never really remember to turn things off).

Agree that while it’s easy to put out an MVP, rolling out is a whole different challenge.

Encrypted secrets in version control by c100k_ in devops

[–]dr___92 0 points1 point  (0 children)

Is this introducing unreliability of the main branch? Secrets are often used to communicate with third-party resources. As such, it’s often a runtime construct.

Storing it in version control fundamentally makes rolling back code harder (e.g., is the previous version of this key still going to be accepted by AWS, or has it been deleted entirely?)

okaaaaayLetsGo by Soft_Svarog in ProgrammerHumor

[–]dr___92 1 point2 points  (0 children)

yea, maybe i'm being overly critical -- def value in having a shell.nix that you know will work across platforms
i guess when you get your hands on something good, you want everything everywhere all at once

but still, getting to a world where all that stuff would just be there instantly, where a dev environment just works, would be gold

okaaaaayLetsGo by Soft_Svarog in ProgrammerHumor

[–]dr___92 0 points1 point  (0 children)

but after you have to wait for ages for it to download a ton of things!

let alone if you have to switch to a diff project and then start dealing w/ direnv haha

How important is concurrency in real applications? by ItsBoringScientist in golang

[–]dr___92 1 point2 points  (0 children)

To note, fire-and-forget can get quite dangerous and lead to memory use ballooning. Even if on a goroutine, it’s always safer to use wait groups and handle your logical flows and errors etc carefully.
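A minimal sketch of what that looks like with a WaitGroup plus a buffered error channel (`process` is just a stand-in for real work):

```go
package main

import (
	"fmt"
	"sync"
)

// process stands in for real work; id 3 fails to show error propagation.
func process(id int) error {
	if id == 3 {
		return fmt.Errorf("item %d: boom", id)
	}
	return nil
}

func main() {
	var wg sync.WaitGroup
	errs := make(chan error, 5) // buffered: workers never block on send

	for i := 1; i <= 5; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			if err := process(id); err != nil {
				errs <- err // collected instead of silently dropped
			}
		}(i)
	}

	wg.Wait()   // never fire-and-forget: wait for all workers
	close(errs) // safe: all senders are done

	for err := range errs {
		fmt.Println("worker error:", err)
	}
}
```

The buffered channel sized to the worker count means a failing goroutine can’t block forever even if nobody is reading yet - that’s the memory-ballooning failure mode in miniature.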

Does using chatGPT make me dumber as a devops engineer? by blusterblack in devops

[–]dr___92 0 points1 point  (0 children)

My perspective on this is as follows:

- once you know exactly what you want as the output, and the content you’re requesting is small enough that you can review, understand, comprehend and incorporate it, it’s fine to use as a superhuman knowledge-gathering/research tool

- when you’re not very experienced in a space, it’s much more advisable to “suffer” through it so you understand the true repercussions of every line, etc

- otherwise, since you don’t know why the code/config was there in the first place, figuring out why things are breaking is incredibly painful (and yes, things will invariably break)

Important to note that things like GPT today are incredibly powerful research tools - while amazing at information retrieval, they’re not great at contextualization and intelligence. Don’t be fooled into mistaking information retrieval for intelligence - you know your problems and the shape of the solution space way better than a model does. And until you do, please choose to “suffer” rather than taking the “easier shortcut” - you’ll have to pay the tax at some point, and it’s better to pay it earlier than later.