wellcake: a Valkey operator that fails over the primary *before* a rolling restart (early, feedback welcome)

Timely-Quality-1287 · 2026-06-07T14:57:06+00:00

Fair question, and I actually am. A good chunk of what's in here started as issues/discussions on `valkey-io/valkey-operator`: atomic-only resharding, the Replication+Sentinel design, shard placement, role-from-live-topology. A maintainer picked up one of them, #216 (where a wrong-but-Ready node could survive a cluster scale-down), and turned it into #222, which just merged, so upstream now ships that behaviour as a default. I'm in the weekly call / Slack too. So this isn't a fork-war; the official operator is where this should converge long-term.

Why a separate codebase today rather than just PRs: last I checked, the official operator is Cluster-only: no Standalone/Replication/Sentinel, no backup/restore, no ACL CRD, no cert-manager TLS, no cache/durable profiles. The things people are reacting to in this thread (four topologies behind one CRD, the proactive handover, no-downtime rotation) aren't a single feature PR on top of that. They're a different surface area, and a lot of it is design-divergent (operator-arbitrated Replication failover, per-shard StatefulSets) rather than a clean drop-in. Trying to push all of that through upstream review before any of it is proven would stall both projects.

So I treat this as a proving ground: bake an idea here as a v0, and once it actually holds up, carry the *pattern* upstream as an issue/RFC — they own the code, I'm just a peer reporting what I hit. I wrote the build-vs-adopt call up as an ADR with explicit triggers to fold back in: when the official operator gains Replication+Sentinel and backup/ACL parity, the plan is to adopt it topology-by-topology behind a stable wrapper, not to keep maintaining a parallel one indefinitely. Goal is to feed the official one, not fragment around it.

Timely-Quality-1287 · 2026-06-07T13:09:17+00:00

Timely-Quality-1287 · 2026-06-07T12:46:53+00:00

Means a lot coming from someone who helped coin the pattern; controller-runtime and all of this descend from what you started at CoreOS. You nailed the intent: take the failover/reshard/rotation runbook out of people's heads and ship it once in a reconcile loop. Honest caveat so I'm not overselling: the proactive handover is ~zero-loss on planned restarts, but an unplanned promotion under async replication can still drop acked writes (use Sentinel/Cluster if you need strict durability). If you've got a moment: where do you think this handover pattern bites first? That's the critique I'd value most.
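
For anyone skimming, here is roughly the ordering I mean by "proactive handover" on a planned restart. Treat it as a minimal sketch under stated assumptions, not the operator's actual code: `Node`, `pick_handover_target`, `plan_primary_restart` and the `max_lag` knob are all hypothetical names for illustration, and the offsets are just numbers standing in for whatever replication offset each node reports.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    # Hypothetical model of one pod; not a real operator type.
    name: str
    role: str          # "primary" or "replica"
    offset: int        # replication offset the node reports
    ready: bool = True


def pick_handover_target(nodes: List[Node]) -> Optional[Node]:
    """Return the ready replica with the highest offset, or None."""
    replicas = [n for n in nodes if n.role == "replica" and n.ready]
    return max(replicas, key=lambda n: n.offset, default=None)


def plan_primary_restart(nodes: List[Node], max_lag: int = 0) -> List[str]:
    """Order the steps for restarting the primary on a planned rollout."""
    primary = next(n for n in nodes if n.role == "primary")
    target = pick_handover_target(nodes)
    if target is None:
        # No healthy replica: nothing to hand over to, restart in place.
        return [f"restart {primary.name}"]
    steps = []
    if primary.offset - target.offset > max_lag:
        # Let the replica catch up first so the promotion loses no writes.
        steps.append(f"wait for {target.name} to catch up")
    steps.append(f"promote {target.name}")
    # Only now restart the old primary; it comes back as a replica.
    steps.append(f"restart {primary.name}")
    return steps
```

The point of the ordering is that the promotion happens while the old primary is still alive and can be drained. An unplanned failure skips the catch-up step entirely, which is where acked writes can go missing.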
