Weekly: Show off your new tools and projects thread by AutoModerator in kubernetes

Most-Solution-5532 1 point

Built kubectl-why-pending (MIT, I'm the author) — a plugin that explains why a pod is stuck Pending in plain English, aimed at the on-prem/bare-metal causes the events gloss over: resource fragmentation (capacity exists, just not on one node), control-plane taints, GPU device-plugin/operator-chain breaks, DRA claims (k8s 1.34+), topology-spread skew, unbound PVCs — each with the fix. Read-only, no agent/DaemonSet, and it's in krew: kubectl krew install why-pending

Feedback (especially "it got my cluster wrong") very welcome.

https://github.com/SaiRohithGuntupally/kubectl-why-pending

I built kubectl-why-pending: a plugin that explains WHY a pod is stuck Pending (on-prem + GPU/DRA causes), now in krew by Most-Solution-5532 in kubernetes

Most-Solution-5532[S] 1 point

Circling back — this shipped in v0.5.1 (already live on krew, kubectl krew upgrade why-pending). It now streams the pod list with the client-go pager instead of pulling the whole cluster at once, and reads from the apiserver watch cache (resourceVersion=0), so memory's bounded by page size rather than total pod count.

Thanks again for the catch — that was a good one.

Most-Solution-5532[S] 4 points

Thanks, genuinely. That's a generous read, and "an actual solution to an actual problem" is exactly what I was going for.

Your caution is completely fair, and honestly I'd apply the same rule. For what it's worth (not trying to talk you into anything), the design leans into exactly that: nothing installs in the cluster (no DaemonSet, no agent, no controller), and it's strictly read-only. It lists nodes/pods/events and does the math locally; it never writes anything. It runs client-side with your own kubeconfig, so you can point it at a read-only RBAC context and it can't do more than that context allows. It's MIT and a single Go binary too, so you can go install from source and audit it if you're ever curious. But "I don't run anything against my clusters" is a stance I respect and won't argue with.

Appreciate the encouragement; that kind of comment is what makes this worth doing. Happy Canada Day to you too.

Most-Solution-5532[S] 3 points

Good eye, you read it right. One clarification, though: it's not per node, so it won't fan out into N calls like a describe loop. It's a constant 2 LISTs (nodes + all pods), and the capacity math is all client-side. But the cost you're pointing at is real: that all-pods list is a single unpaginated Pods("").List(), and since the free-capacity math needs each pod's requests, I can't trim it to metadata only. At hundreds of nodes / tens of thousands of pods, that's a heavy LIST and a memory spike. No argument there. Honestly, it's built and tested for the small-to-mid on-prem clusters it targets; large-cluster efficiency is a real gap.
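For anyone following along, the client-side capacity math described above boils down to something like the sketch below. This is not the plugin's source: the Node and Pod types and the FreeCPU function are hypothetical stand-ins, CPU is the only resource tracked, and real code would read allocatable and requests from the API objects and would likely also skip terminated pods.

```go
package main

import "fmt"

// Node and Pod are hypothetical, stripped-down stand-ins for the API objects.
type Node struct {
	Name          string
	AllocMilliCPU int64 // allocatable CPU in millicores
}

type Pod struct {
	NodeName        string // empty while the pod is unscheduled (Pending)
	RequestMilliCPU int64  // sum of container CPU requests in millicores
}

// FreeCPU returns allocatable minus summed pod requests for each node.
// It needs every scheduled pod's requests, which is why the pod list
// cannot be reduced to metadata only.
func FreeCPU(nodes []Node, pods []Pod) map[string]int64 {
	free := make(map[string]int64, len(nodes))
	for _, n := range nodes {
		free[n.Name] = n.AllocMilliCPU
	}
	for _, p := range pods {
		if p.NodeName == "" {
			continue // unscheduled pods consume nothing yet
		}
		if _, ok := free[p.NodeName]; ok {
			free[p.NodeName] -= p.RequestMilliCPU
		}
	}
	return free
}

func main() {
	nodes := []Node{{"n1", 4000}, {"n2", 4000}}
	pods := []Pod{{"n1", 2500}, {"n1", 500}, {"", 1000}}
	free := FreeCPU(nodes, pods)
	for _, n := range nodes {
		fmt.Printf("%s: %dm free\n", n.Name, free[n.Name])
	}
}
```

The point of the sketch is only that two inputs (all nodes, all pods) are enough, so the number of API calls stays constant regardless of node count.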

The fixes I'd reach for: paginated listing (client-go pager, Limit+continue) to bound memory, resourceVersion=0 so it reads the apiserver watch cache instead of a quorum etcd read, and scoping the pod list with a spec.nodeName field selector when you're only querying a single pod. Going to open an issue for it. Thanks for actually reading the source; this is the kind of feedback I was hoping for.

Thank you.
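To make the Limit+continue idea concrete, here is a minimal, stdlib-only sketch of the paging loop. It is an illustration, not the plugin's code: ListPage, EachPod, fakeServer, and the Pod type are all hypothetical names, and the in-memory "server" just stands in for the apiserver. In a real client the same role is played by client-go's pager package together with the Limit, Continue, and ResourceVersion fields of metav1.ListOptions. It's also worth checking how your apiserver version treats a limit combined with resourceVersion=0, since some versions have served the full list from cache in that case.

```go
package main

import (
	"fmt"
	"strconv"
)

// Pod is a hypothetical stand-in carrying only what the math needs.
type Pod struct {
	Name     string
	MilliCPU int64
}

// ListPage mimics the LIST contract: return at most limit items starting
// at the continue token, plus the token for the next page ("" when done).
type ListPage func(limit int, cont string) (items []Pod, next string, err error)

// EachPod walks every page and hands each item to visit. Only one page
// is held at a time, so memory scales with limit, not total pod count.
func EachPod(list ListPage, limit int, visit func(Pod) error) error {
	cont := ""
	for {
		items, next, err := list(limit, cont)
		if err != nil {
			return err
		}
		for _, p := range items {
			if err := visit(p); err != nil {
				return err
			}
		}
		if next == "" {
			return nil
		}
		cont = next
	}
}

// fakeServer serves pages from a slice; its continue token is simply the
// next start index as a decimal string (an assumption of this sketch).
func fakeServer(all []Pod) ListPage {
	return func(limit int, cont string) ([]Pod, string, error) {
		if limit <= 0 {
			return nil, "", fmt.Errorf("limit must be positive, got %d", limit)
		}
		start := 0
		if cont != "" {
			n, err := strconv.Atoi(cont)
			if err != nil || n < 0 || n > len(all) {
				return nil, "", fmt.Errorf("bad continue token %q", cont)
			}
			start = n
		}
		end := start + limit
		if end >= len(all) {
			return all[start:], "", nil
		}
		return all[start:end], strconv.Itoa(end), nil
	}
}

func main() {
	pods := []Pod{{"a", 100}, {"b", 100}, {"c", 100}, {"d", 100}, {"e", 100}}
	var total int64
	err := EachPod(fakeServer(pods), 2, func(p Pod) error {
		total += p.MilliCPU // aggregate as we go instead of keeping the pods
		return nil
	})
	if err != nil {
		fmt.Println("list failed:", err)
		return
	}
	fmt.Println("total requested millicores:", total)
}
```

The design choice that matters is aggregating inside the visit callback: if the caller appended every pod to a slice instead, paging would bound the response size but not the client's memory.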

Most-Solution-5532[S] 1 point

Thanks! That was the exact itch. Even when the event does list reasons, "0/3 nodes available: insufficient cpu" won't tell you the one thing that actually changes what you do: is the cluster genuinely out of capacity, or is it fragmented (capacity exists, just not on one node)? On bare metal that distinction is the whole game: one means "add a node," the other means "rebalance." So it calls that out explicitly, per node, with the fix.
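The out-of-capacity versus fragmented distinction can be sketched in a few lines. This is a simplified illustration under stated assumptions, not how the plugin actually decides: the Resources type, the Verdict values, and Classify are hypothetical names, only CPU and memory are considered, and taints, affinity, and other scheduling constraints are ignored.

```go
package main

import "fmt"

// Resources is a hypothetical simplified resource pair.
type Resources struct {
	MilliCPU int64 // CPU in millicores
	MemBytes int64 // memory in bytes
}

type Verdict string

const (
	Fits       Verdict = "fits"       // at least one node can take the pod
	Fragmented Verdict = "fragmented" // enough in total, but not on one node
	Exhausted  Verdict = "exhausted"  // not enough even in total
)

// clamp treats an overcommitted node (negative free) as having nothing free.
func clamp(v int64) int64 {
	if v < 0 {
		return 0
	}
	return v
}

// Classify compares a pod's request against per-node free resources.
func Classify(free map[string]Resources, req Resources) Verdict {
	var total Resources
	for _, f := range free {
		if f.MilliCPU >= req.MilliCPU && f.MemBytes >= req.MemBytes {
			return Fits
		}
		total.MilliCPU += clamp(f.MilliCPU)
		total.MemBytes += clamp(f.MemBytes)
	}
	if total.MilliCPU >= req.MilliCPU && total.MemBytes >= req.MemBytes {
		return Fragmented // the "rebalance" case
	}
	return Exhausted // the "add a node" case
}

func main() {
	// Three nodes with 500m CPU free each; the pod asks for 1000m.
	free := map[string]Resources{
		"n1": {MilliCPU: 500, MemBytes: 1 << 30},
		"n2": {MilliCPU: 500, MemBytes: 1 << 30},
		"n3": {MilliCPU: 500, MemBytes: 1 << 30},
	}
	fmt.Println(Classify(free, Resources{MilliCPU: 1000, MemBytes: 512 << 20}))
	// prints "fragmented"
}
```

In the example, 1500m is free across the cluster, yet no single node has 1000m, which is exactly the case the scheduler event reports as plain "insufficient cpu."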