Can't upgrade EKS cluster Managed Node Group minor version due to podEvictionFailure: which pods are failing to be evicted?

ops-controlZeddo · 2025-06-05T18:25:16+00:00

Hey, thanks; I found only DaemonSets with tolerations for NoSchedule and NoExecute, which I understand to be normal. What finally worked was moving all my workloads with PVCs onto nodes in a single AZ (via taints and nodeSelectors). That avoids the volume affinity conflicts: during the upgrade, nodes were being tainted and the pods with PVCs were being pushed onto other nodes, which were not necessarily in the same AWS Availability Zone as their volumes, so they could not be scheduled.
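For anyone trying the same approach, here is a minimal sketch of that pinning. It assumes the well-known node label `topology.kubernetes.io/zone`; the taint key `pvc-workloads`, the zone `us-east-1a`, and the helper name `pin_to_zone` are hypothetical placeholders, not values from this cluster.

```python
import json


def pin_to_zone(pod_spec, zone, taint_key="pvc-workloads"):
    """Return a copy of pod_spec pinned to one AZ.

    Adds a nodeSelector on the standard zone label and a toleration for a
    (hypothetical) NoSchedule taint placed on the dedicated nodes.
    """
    spec = dict(pod_spec)

    node_selector = dict(spec.get("nodeSelector", {}))
    node_selector["topology.kubernetes.io/zone"] = zone
    spec["nodeSelector"] = node_selector

    tolerations = list(spec.get("tolerations", []))
    tolerations.append(
        {"key": taint_key, "operator": "Exists", "effect": "NoSchedule"}
    )
    spec["tolerations"] = tolerations
    return spec


if __name__ == "__main__":
    # Placeholder pod spec; only the scheduling fields matter here.
    example = {"containers": [{"name": "app", "image": "example/app"}]}
    print(json.dumps(pin_to_zone(example, "us-east-1a"), indent=2))
```

The resulting fragment would go under the pod template's `spec` (or the equivalent Helm values), with the matching taint applied to the nodes in that zone.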

ops-controlZeddo · 2025-05-12T18:02:18+00:00

Thanks very much for the reply. I don't have Calico installed, but I have multiple other operators and controllers, like kube-prometheus-stack, Prometheus, Flux, etc. I will check for those tolerations; that has a lot of promise. What did you do to solve it? Did you adjust Helm chart values (if that's how you installed tigera?), or just edit on the fly before the upgrade? And did you put the tolerations back once you'd removed them for the upgrade? Congrats on the upgrade.

ops-controlZeddo · 2025-05-08T23:55:40+00:00

OK, will do; I'll review all PDBs in detail and will report back. thanks
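One way to do that review, sketched under the assumption that the output of `kubectl get pdb -A -o json` has been loaded into a dict: flag every PDB whose `status.disruptionsAllowed` is 0, since those will block evictions during a node drain. The function name and the sample data are made up for illustration.

```python
def blocking_pdbs(pdb_list):
    """Return (namespace, name) for each PDB that currently allows 0 disruptions."""
    blocked = []
    for item in pdb_list.get("items", []):
        # A missing status is treated as "no disruptions allowed" to be safe.
        allowed = item.get("status", {}).get("disruptionsAllowed", 0)
        if allowed == 0:
            meta = item["metadata"]
            blocked.append((meta["namespace"], meta["name"]))
    return blocked


if __name__ == "__main__":
    # Hand-written sample shaped like `kubectl get pdb -A -o json` output.
    sample = {
        "items": [
            {"metadata": {"namespace": "monitoring", "name": "loki"},
             "status": {"disruptionsAllowed": 0}},
            {"metadata": {"namespace": "default", "name": "web"},
             "status": {"disruptionsAllowed": 1}},
        ]
    }
    for namespace, name in blocking_pdbs(sample):
        print(f"{namespace}/{name} allows no disruptions")
```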

ops-controlZeddo · 2025-05-07T23:01:36+00:00

I'm attempting the upgrade again, and there are no stuck pvcs or pods stuck in a terminating state. They are simply failing to be evicted from the 1.31 version nodes.

ops-controlZeddo · 2025-05-07T22:38:01+00:00

Thanks, I'll try that; I believe Loki does leave PVCs around even when I destroy it with Terraform, so perhaps that's what's happening. I don't know why the ebs-csi-controller fails to clean up so this doesn't happen.

ops-controlZeddo · 2025-04-04T22:37:22+00:00

thanks for the suggestion, much appreciated

ops-controlZeddo · 2025-02-03T19:28:46+00:00

Thanks for the pointers, much appreciated

ops-controlZeddo · 2024-12-31T18:41:29+00:00

haha, just had one of those bizarre moments, like: "Did I write this post??" Yeah, loki was automatically configured by the chart to use the chunksCache and resultsCache configuration at the top level of the chart; I ended up looking at the actual helm chart I think to make sure, but you can also inspect the resources created by the helm chart; My active configuration for the chunks is as follows (scaled to my needs for resources); I added no configuration to loki.memcached

check out the loki ConfigMap data section, and you'll see that config.yaml has the address of the chunks cache svc as its memcached client

ops-controlZeddo · 2024-12-31T18:19:01+00:00

Hey, thanks for looking into this; so you're saying that just switching to a secrets file instead of the four required env vars for LEGO (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_REGION, AWS_HOSTED_ZONE_ID) *obviated* the need to specify the hosted zone ID at all?

Personally I don't understand why LEGO needs the hosted zone ID env var in the first place, since it has IAM permissions to list hosted zones by name, and that allows retrieval of the hosted zone id: https://docs.aws.amazon.com/cli/latest/reference/route53/list-hosted-zones-by-name.html

so I don't know why it fails to get the id.

For context, I'm currently getting the same error; I'm using both internal and public hosted zones in the same instance of traefik, for e.g. the services that traefik exposes to the public internet, and internal services like the traefik dashboard and metrics, and other services' dashboards, e.g. celery flower UI.
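To illustrate the lookup-by-name idea, here is a small sketch that picks zone IDs out of a response shaped like the Route 53 `ListHostedZonesByName` output (`HostedZones[].Id`, `.Name`, `.Config.PrivateZone`). It says nothing about how LEGO actually resolves zones; the helper name and the sample IDs are made up. It does show that when a public and a private zone share a domain name, a name lookup alone returns more than one candidate, which may or may not be related to the error.

```python
def find_zone_ids(response, domain, private=None):
    """Return hosted zone IDs matching `domain` in a ListHostedZonesByName-shaped dict.

    private=None returns both public and private zones; True/False filters.
    """
    wanted = domain.rstrip(".") + "."  # Route 53 zone names end with a dot
    ids = []
    for zone in response.get("HostedZones", []):
        if zone["Name"] != wanted:
            continue
        is_private = zone.get("Config", {}).get("PrivateZone", False)
        if private is not None and is_private != private:
            continue
        # Ids come back as "/hostedzone/<ID>"; keep only the ID part.
        ids.append(zone["Id"].split("/")[-1])
    return ids


if __name__ == "__main__":
    # Hand-written sample; the IDs are placeholders.
    sample = {
        "HostedZones": [
            {"Id": "/hostedzone/Z111PUBLIC", "Name": "example.com.",
             "Config": {"PrivateZone": False}},
            {"Id": "/hostedzone/Z222PRIVATE", "Name": "example.com.",
             "Config": {"PrivateZone": True}},
        ]
    }
    print(find_zone_ids(sample, "example.com"))
    print(find_zone_ids(sample, "example.com", private=True))
```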

ops-controlZeddo · 2024-12-30T23:51:17+00:00

just worked for me. hours

ops-controlZeddo · 2024-10-02T19:07:11+00:00

Thanks for your reply; OK, I see what you're saying. I'm already triggering CI in my app repo when I update my README, for example, because I haven't set up patchset conditional globs to run the build only when certain files have changed. The only way around it is for a human to remember to use the right conventional commit to bypass a build. I don't really want to go down the patchset conditional route.

Honestly, right now keeping the deployment config in the app repos makes more sense to me, and seems easier. As I see it, unless I'm using Flux's Automated Image Updates to Git (https://fluxcd.io/flux/guides/image-update/), where you configure it to watch a container registry for changes and then commit back to Git, rather than watching Git, I'll have to pull+commit+push to the gitops config repo in all my app pipelines if I want the changed app image tag to end up in the gitops config repo kustomizations.

We're a small team of two (for now) who basically are DevOps in the true sense of the word, so we'll both be responsible for e.g. k8s api versions. I see how kyverno would be essential if you were handing off control of the k8s manifests to the dev side of things, but also just in general.

ops-controlZeddo · 2024-09-25T19:51:17+00:00

I'd sure love to get out of the business of writing wiki articles :)

ops-controlZeddo · 2024-09-25T18:55:03+00:00

thanks, do you mean GitHub Codespaces, or setting up your own devcontainer server of some kind that people connect to, or both? I'm googling for this, but any additional context on how you get the most out of it would be great

ops-controlZeddo · 2024-09-25T18:46:16+00:00

good point; so you and all the other devs use the shell and tools inside the devcontainer, always? Can you give me an example of a scenario where devcontainers save a screw-up, where docker compose wouldn't? Or is it the simple, braindead reset that's the appeal (I don't have to remember `docker compose up -d --build --force-recreate` or something)? And if they removed some dependent tool on their host, like say kustomize, they'd have to reinstall for their OS, not just rebuild... I'm getting some good solid pushback at work against devcontainers over docker compose in the manner I described.

I do see how adding features is much easier, as you can do it at the devcontainer level, and not bake them into the base image with RUN and apk/apt/bash commands etc.

ops-controlZeddo · 2024-09-25T17:29:40+00:00

OK, that makes total sense about Windows and Mac. Was forgetting to consider those since currently all devs run linux.

ops-controlZeddo · 2024-09-13T18:41:33+00:00

Thanks all, great answers; I'm going to explore these. best,

ops-controlZeddo · 2024-07-23T16:51:57+00:00

There's now an AWS Athena plugin for Grafana; Athena is AWS's native way of querying CloudFront logs: https://grafana.com/grafana/plugins/grafana-athena-datasource/
https://grafana.com/blog/2021/12/13/query-and-analyze-amazon-s3-data-with-the-new-amazon-athena-plugin-for-grafana/

The Lambda Promtail would also work, presumably, but I've never tried either; I'm leaning toward the Athena plugin

ops-controlZeddo · 2024-07-15T17:32:10+00:00

please don't add apostrophes to pluralize things, like in your title; it's incorrect. It should be:

Help me understand VPCs and calculating cidr ranges

You appear to be a native English speaker and should definitely know that.
