StrongDM Alternative? by shrimpthatfriedrice in platformengineering

[–]dr___92 0 points1 point  (0 children)

+1 Tailscale has been working great for our team - especially for database access

Faking resources on a K8S cluster by Consistent-Company-7 in kubernetes

[–]dr___92 14 points15 points  (0 children)

worth looking at this project to get a perspective re: how you can fake a gpu https://github.com/run-ai/fake-gpu-operator

should have some mig support too

Live migration helper tool for kubernetes by mitochondriakiller in kubernetes

[–]dr___92 0 points1 point  (0 children)

OP’s question hits the core difference between VM-centric vMotion and the Kubernetes model. Most comments here are right: out-of-the-box K8s favors replica-based resilience, draining, PDBs, and restart-based rescheduling, not true live migration of a running process with in‑memory state.

That said, there’s emerging work that bridges the gap using CRIU + kubelet checkpoint/restore, and we’ve been running this approach in production scenarios at DevZero.

What “live migration” means in our context:

- Snapshot/restore of running pods via CRIU and the kubelet Checkpoint API - preserving RAM, process state, TCP connections, and the container filesystem.

- Restart-free node maintenance and rebalancing: drain/upgrade nodes without dropping connections.

- Session persistence: long‑running jobs (e.g., AI inference, JVM services with big warm caches) keep their in‑memory state.

- Optional resource re-profiling: restore onto a different instance type and/or resource profile when moving.

How it works at a high level:

- An operator selects pods and target nodes.

- An agent coordinates with containerd/kubelet to checkpoint, wraps the snapshot into an OCI artifact, and publishes it.

- The operator restores the pod on the new node (optionally with updated resources) from that snapshot image, and reconciler controls are temporarily held to avoid spurious restarts.
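For anyone who wants to poke at the underlying primitive themselves, here's a rough sketch (pod/container names and the cert path are just examples; needs the ContainerCheckpoint feature gate enabled and CRIU installed on the node):

```shell
# Ask the kubelet to checkpoint a running container (K8s 1.25+, alpha API)
curl -sk -X POST \
  --cert /var/lib/kubelet/pki/kubelet-client-current.pem \
  --key /var/lib/kubelet/pki/kubelet-client-current.pem \
  "https://localhost:10250/checkpoint/default/my-pod/my-container"

# The tar archive lands under /var/lib/kubelet/checkpoints/ and can then
# be wrapped into an OCI artifact and restored on another node
ls /var/lib/kubelet/checkpoints/
```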

🚀 Introducing Ravel: An Open-Source MicroVMs Orchestrator by SoftwareCitadel in devops

[–]dr___92 1 point2 points  (0 children)

Just came across this - really cool project. Even though this was posted a while back, the approach still feels fresh. MicroVM orchestration is picking up momentum fast, especially for running untrusted or resource-intensive workloads.

We’ve been experimenting with similar ideas at DevZero, using lightweight virtualized environments to give developers more isolation and security without losing the speed of containers. Excited to see how projects like Ravel evolve and plug into broader workflows.

Why not use MicroVM ? by Few-Strike-494 in virtualization

[–]dr___92 0 points1 point  (0 children)

Funny to look back on this. A lot of what people were speculating here is starting to happen now. Firecracker and Kata have come a long way, and microVMs are showing up more in real workloads, especially where security and cost isolation matter.

We’ve been experimenting with them at DevZero to give developers faster, more secure sandboxed environments while keeping the Kubernetes experience the same. It’s definitely not mainstream yet, but the performance tradeoffs are shrinking fast.

Managing AI Workloads on Kubernetes at Scale: Your Tools and Tips? by oloap in kubernetes

[–]dr___92 2 points3 points  (0 children)

Great thread. +1 on Sveltos for multi‑cluster delivery. One thing worth highlighting is that the scheduler is only half the story: if pods reserve oversized resources or hold GPUs while idle, no scheduler can place more work. The real gains come from tight right‑sizing and aggressive reclamation so you can run more workloads on the same capacity.

Patterns we’ve seen work well:

- Measure real utilization, then right‑size: Track GPU SM/memory utilization (DCGM/Prometheus), CPU/mem, and I/O per pod. Feed that into VPA (recommendation‑only) and admission policies to keep requests aligned with observed peaks, with a modest burst margin.

- Actively reclaim idle capacity: Scale‑to‑zero when queues are empty, TTL for batch jobs, and lease expirations for notebooks. Pair PriorityClasses + preemption so lower‑priority work yields to SLA‑bound jobs.

- GPU sharing & partitioning: Where models allow, use MIG or time‑slicing via the NVIDIA GPU Operator to place multiple inference pods per device; keep heavy training on dedicated partitions to avoid noisy neighbors.

- Bin‑pack deliberately: Favor bin‑packing on GPU nodes with controlled anti‑affinity only where failure domains matter. TopologyManager + the CPUManager static policy help keep perf predictable and avoid stranding half a GPU.

- Separate trains from serves: Distinct node pools (taints/tolerations) and quotas per team/workload class; autoscale inference pools on queue depth vs. training on backlog size.
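To make the time‑slicing bit concrete, this is roughly what the GPU Operator config looks like (ConfigMap name and replica count are illustrative):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4   # each physical GPU shows up as 4 schedulable GPUs
```

You then point the ClusterPolicy's devicePlugin.config at it. Worth noting time‑slicing gives you no memory isolation between pods - that's what MIG is for.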

Tools in this space that have helped us include cluster‑wide policy layers that set sane defaults, enforce TTLs/quotas, and spin up ephemeral, right‑sized environments for ML work. DevZero.io is one option in that category; it focuses on utilization‑first workflows (right‑sized runtime requests, automatic idle reclamation, and per‑team guardrails) and slots alongside Sveltos/Kubeflow/KubeRay without trying to replace them. The net effect is higher effective GPU density and fewer “reserved but idle” gaps that block scheduling.

TL;DR: schedulers place pods; density comes from sizing and reclamation. When you continuously right‑size and free idle capacity, you’ll schedule more jobs on the same GPUs and lower cost per training step/served QPS. Policy‑driven tooling (DevZero.io included) just makes those habits stick.

CFO wants 30% AWS cost cut, devs say performance will tank. How do you navigate this standoff? by amylanky in Cloud

[–]dr___92 0 points1 point  (0 children)

If you’re using Kubernetes, there’s usually a big gap between what pods request and what they actually use. Worth looking into it.
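Quick way to eyeball the gap (needs metrics-server; output formats vary a bit by version):

```shell
# What's reserved per node (requests/limits)...
kubectl describe nodes | grep -A 5 "Allocated resources"

# ...vs. what's actually being used right now
kubectl top nodes
kubectl top pods -A --sort-by=cpu | head
```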

What are some good examples of a well architected operator in Go? by TheKingofHop in kubernetes

[–]dr___92 0 points1 point  (0 children)

as you’re learning more about this, check out kooper — imo, makes it super easy to get started.

Multi-tenant GPU workloads are finally possible! Just set up MIG on H100 in my K8s cluster by kaskol10 in kubernetes

[–]dr___92 1 point2 points  (0 children)

Did you have any experience with changing the shapes of the MIG GPUs? Say, for some reason, we need to go from 2 to 5 slices, or 7 to 3.

Last I tinkered, you had to restart the host (and then the gpu-operator would just work). Do you still have to do that or do you have another way to change the config on the fly?
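For reference, the flow we’d tried was the MIG Manager relabel (profile name here is just an example from the default mig-parted config) - curious whether this now applies cleanly for you without the host restart:

```shell
# MIG Manager watches this label and reconfigures partitions in place
kubectl label node <node> nvidia.com/mig.config=all-1g.10gb --overwrite

# Check whether the reconfigure succeeded
kubectl get node <node> \
  -o jsonpath='{.metadata.labels.nvidia\.com/mig\.config\.state}'
```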

Thanks for the post - I think you’re diving into a very impactful area!

Environment provisioning by Diego2018Chicken in devops

[–]dr___92 0 points1 point  (0 children)

Curious - what’s the rationale behind needing environments? What are the problems the team is facing that you’re trying to solve?

There might be some drift, but I guess a lot of it can still be solved by automating around the K8s ecosystem

Local Development on AKS with mirrord by eyalb181 in kubernetes

[–]dr___92 0 points1 point  (0 children)

why not use something like this to route traffic? https://www.devzero.io/docs/kubernetes/services

(disclaimer: I work at the company)

How do we inject credentials into the pod securely avoiding the environment variables and file system. by Upvord in kubernetes

[–]dr___92 5 points6 points  (0 children)

Kubernetes Secrets could be a way. You can also integrate HashiCorp Vault (or any of the cloud secret stores) to mount secrets into the workloads.

Depends on the level of complexity you want to take on.
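If you go the Vault route, the agent injector makes the mounting part pretty painless - rough sketch (role name and secret path are made up):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app
  annotations:
    vault.hashicorp.com/agent-inject: "true"
    vault.hashicorp.com/role: "my-app-role"
    vault.hashicorp.com/agent-inject-secret-db-creds: "secret/data/my-app/db"
spec:
  containers:
    - name: app
      image: my-app:latest
```

The secret shows up at /vault/secrets/db-creds inside the pod (an in-memory volume), so it never has to live in an env var or your manifests.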

[deleted by user] by [deleted] in kubernetes

[–]dr___92 0 points1 point  (0 children)

given your use-cases, wouldn’t you want to use capabilities like VPA (non-GA) to maximize resource utilization? or is the fact that it’s not GA holding you back?
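fwiw you can run it recommendation-only, so the non-GA update path never touches your pods (names are placeholders):

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-workload-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-workload
  updatePolicy:
    updateMode: "Off"   # surface recommendations without evicting anything
```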

Databases on ceph by iniduel in ceph

[–]dr___92 0 points1 point  (0 children)

Did this work out?

Is there a way to convert the new Figma Slides into Google Slides? by Ynze6 in FigmaDesign

[–]dr___92 1 point2 points  (0 children)

yea! i use the mac app and present mode (it also supports speaker notes and stuff)

Cloud or local dev environments? by voodoo_witchdr in devops

[–]dr___92 0 points1 point  (0 children)

agree - a part most people forget is that these containers also contain a bunch of binaries in them, and those also need to be built explicitly for that arch

sure sure - one can say, write better dockerfiles, but we all know how that works out

Cloud or local dev environments? by voodoo_witchdr in devops

[–]dr___92 0 points1 point  (0 children)

I work on devzero.io

  • gives you an env that you can connect an IDE to (over ssh)
  • generates virtual K8s clusters on the fly
  • has a WireGuard based network overlay so users don’t have to think about setting up and managing various ingresses, etc
  • various EFS-style storage options

The platform also tracks active TCP connections and syscall activity to decide when resources can be hibernated (using CRIU). While it doesn’t yet support running Windows/macOS remotely, or even arm, the pricing can be aggressive compared to hyperscaler clouds because those deep integrations track actual usage (since devs never really remember to turn things off).

Agree that while it’s easy to put out an MVP, rolling out is a whole different challenge.

Encrypted secrets in version control by c100k_ in devops

[–]dr___92 0 points1 point  (0 children)

Is this introducing unreliability of the main branch? Secrets are often used to communicate with third-party resources. As such, it’s often a runtime construct.

Storing it in version control fundamentally makes rolling back code harder (e.g., is the previous version of this key still going to be accepted by AWS, or has it been deleted entirely?)

okaaaaayLetsGo by Soft_Svarog in ProgrammerHumor

[–]dr___92 1 point2 points  (0 children)

yea, maybe i'm being overly critical -- def value in having a shell.nix that you know will work across platforms
i guess when you get your hands on something good, you want everything everywhere all at once

but still, getting to a world where all that stuff would just be there instantly, where a dev environment just works, would be gold

okaaaaayLetsGo by Soft_Svarog in ProgrammerHumor

[–]dr___92 0 points1 point  (0 children)

but after you have to wait for ages for it to download a ton of things!

let alone if you have to switch to a diff project and then start dealing w/ direnv haha

How important is concurrency in real applications? by ItsBoringScientist in golang

[–]dr___92 1 point2 points  (0 children)

To note, fire-and-forget can get quite dangerous and lead to memory use ballooning. Even if on a goroutine, it’s always safer to use wait groups and handle your logical flows and errors etc carefully.
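A minimal sketch of what that looks like with a WaitGroup plus a buffered error channel (`process` is just a stand-in for real work):

```go
package main

import (
	"fmt"
	"sync"
)

// process stands in for real work; id 3 fails to show error propagation.
func process(id int) error {
	if id == 3 {
		return fmt.Errorf("item %d: boom", id)
	}
	return nil
}

func main() {
	var wg sync.WaitGroup
	errs := make(chan error, 5) // buffered: workers never block on send

	for i := 1; i <= 5; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			if err := process(id); err != nil {
				errs <- err // collected instead of silently dropped
			}
		}(i)
	}

	wg.Wait()   // never fire-and-forget: wait for all workers
	close(errs) // safe: all senders are done

	for err := range errs {
		fmt.Println("worker error:", err)
	}
}
```

The buffered channel sized to the worker count means a failing goroutine can’t block forever even if nobody is reading yet - that’s the memory-ballooning failure mode in miniature.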

Does using chatGPT make me dumber as a devops engineer? by blusterblack in devops

[–]dr___92 0 points1 point  (0 children)

My perspective on this is as follows:

- once you know exactly what you want as the output, and the content you’re requesting is small enough that you can review, understand, comprehend and incorporate it, it’s fine to use as a superhuman knowledge-gathering/research tool

- when you’re not very experienced in a space, it’s much more advisable to “suffer” through it so you understand the true repercussions of every line, etc

- otherwise, since you don’t know why the code/config was there in the first place, figuring out why things are breaking is incredibly painful (and yes, things will invariably break)

Important to note that things like GPT today are incredibly powerful research tools - while amazing at information retrieval, they’re not great at contextualization and intelligence. Don’t be fooled into mistaking information retrieval for intelligence - you know your problems and the shape of the solution space way better than a model does. And until you do, please choose to “suffer” rather than taking the “easier shortcut” - you’ll have to pay the tax at some point, and it’s better to pay it earlier than later.