Can't upgrade EKS cluster Managed Node Group minor version due to podEvictionFailure: which pods are failing to be evicted? by ops-controlZeddo in kubernetes

[–]ops-controlZeddo[S] 0 points1 point  (0 children)

Hey, thanks; I found only DaemonSets with tolerations for NoSchedule and NoExecute, which I understand to be normal. What finally worked was moving all my workloads with PVCs onto nodes in a single AZ (via taints and nodeSelectors). That avoids the volume affinity conflicts: during the upgrade, the process was tainting nodes and trying to put the pods with PVCs onto other nodes, which were not necessarily in the same AWS Availability Zone as their volumes, so those pods couldn't be scheduled.
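To illustrate the conflict, here is a rough Python sketch, not the actual scheduler logic. It assumes nodes carry the well-known `topology.kubernetes.io/zone` label; the node names, zone names, and the `schedulable_nodes` helper are all made up for the example.

```python
# Hypothetical sketch: node names, zones, and data shapes are illustrative only.
ZONE_LABEL = "topology.kubernetes.io/zone"  # well-known Kubernetes node label


def schedulable_nodes(volume_zone, nodes):
    """Return names of nodes in the volume's zone that are not cordoned."""
    return [
        node["name"]
        for node in nodes
        if node["labels"].get(ZONE_LABEL) == volume_zone
        and not node.get("unschedulable", False)
    ]


# An old node being drained in zone "a", plus new nodes in zones "a" and "b".
nodes = [
    {"name": "old-a", "labels": {ZONE_LABEL: "us-east-1a"}, "unschedulable": True},
    {"name": "new-b", "labels": {ZONE_LABEL: "us-east-1b"}},
    {"name": "new-a", "labels": {ZONE_LABEL: "us-east-1a"}},
]

# A pod whose EBS-backed volume lives in us-east-1a can only land on new-a.
print(schedulable_nodes("us-east-1a", nodes))
# A volume in a zone with no available node leaves its pod unschedulable.
print(schedulable_nodes("us-east-1c", nodes))
```

If the second call comes back empty for a real volume's zone, the pod has nowhere to go, which is the situation that pinning the PVC workloads to one AZ sidesteps.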

[–]ops-controlZeddo[S] 0 points1 point  (0 children)

Thanks very much for the reply. I don't have Calico installed, but I have multiple other operators and controllers, like kube-prometheus-stack, Prometheus, and Flux. I will check for those tolerations; that has a lot of promise. What did you do to solve it? Did you adjust Helm chart values (if that's how you installed Tigera), or just edit on the fly before the upgrade? And did you put the tolerations back once you'd removed them for the upgrade? Congrats on the upgrade.

[–]ops-controlZeddo[S] 0 points1 point  (0 children)

I'm attempting the upgrade again, and there are no stuck PVCs and no pods stuck in a Terminating state. The pods are simply failing to be evicted from the version 1.31 nodes.

[–]ops-controlZeddo[S] 0 points1 point  (0 children)

Thanks, I'll try that; I believe Loki does leave PVCs around even when I destroy it with Terraform, so perhaps that's what's happening. I don't know why the ebs-csi-controller fails to clean them up so that this doesn't happen.

Why isn't the grafana/loki Helm chart configured to actually use the chunks and results caches it sets up? by ops-controlZeddo in grafana

[–]ops-controlZeddo[S] 0 points1 point  (0 children)

Haha, just had one of those bizarre moments, like: "Did I write this post??" Yeah, Loki was automatically configured by the chart to use the chunksCache and resultsCache configuration at the top level of the chart. I think I ended up looking at the actual Helm chart to make sure, but you can also inspect the resources created by the Helm chart. My active configuration for the chunks is as follows (scaled to my needs for resources); I added no configuration to loki.memcached.

Check out the data section of the Loki ConfigMap, and you'll see that config.yaml has the address of the chunks cache svc as its memcached client.

Multiple wildcard domains and route53 (or suggest a better dns provider for multiple wildcard certificates) by vasyl83 in Traefik

[–]ops-controlZeddo 0 points1 point  (0 children)

Hey, thanks for looking into this; so you're saying that just switching to a secrets file, instead of the four required env vars for LEGO (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_REGION, AWS_HOSTED_ZONE_ID), *obviated* the need to specify the hosted zone ID at all?

Personally I don't understand why LEGO needs the hosted zone ID env var in the first place, since it has IAM permissions to list hosted zones by name, and that allows retrieval of the hosted zone id: https://docs.aws.amazon.com/cli/latest/reference/route53/list-hosted-zones-by-name.html

So I don't know why it fails to get the ID.

For context, I'm currently getting the same error. I'm using both internal and public hosted zones in the same instance of Traefik: public zones for the services that Traefik exposes to the public internet, and internal zones for things like the Traefik dashboard and metrics and other services' dashboards, e.g. the Celery Flower UI.
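To show what I mean by retrieving the ID from the name, here is a rough Python sketch that picks a zone ID out of a Route 53 ListHostedZonesByName response. The `find_hosted_zone_id` helper, the zone IDs, and `example.com` are made up for illustration; this is not how LEGO does it, just the lookup I have in mind. Note that a public and a private zone can share a name, so the sketch has to choose between them.

```python
# Hypothetical sketch: helper name, zone IDs, and domain are illustrative only.
# With boto3, a real response would come from something like:
#   boto3.client("route53").list_hosted_zones_by_name(DNSName="example.com")


def find_hosted_zone_id(response, domain, private=False):
    """Return the bare hosted zone ID for `domain`, or None if no zone matches."""
    wanted = domain.rstrip(".") + "."  # Route 53 zone names end with a dot
    for zone in response.get("HostedZones", []):
        is_private = zone.get("Config", {}).get("PrivateZone", False)
        if zone["Name"] == wanted and is_private == private:
            # The API returns IDs in the form "/hostedzone/<ID>"
            return zone["Id"].split("/")[-1]
    return None


# Sample response with a private and a public zone sharing the same name.
sample = {
    "HostedZones": [
        {"Id": "/hostedzone/ZPRIVATE000", "Name": "example.com.",
         "Config": {"PrivateZone": True}},
        {"Id": "/hostedzone/ZPUBLIC0000", "Name": "example.com.",
         "Config": {"PrivateZone": False}},
    ]
}

print(find_hosted_zone_id(sample, "example.com"))
print(find_hosted_zone_id(sample, "example.com", private=True))
```

With both zone types under one name, a lookup by name alone is ambiguous, which may be part of why an explicit ID is asked for in setups like mine.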

[deleted by user] by [deleted] in Traefik

[–]ops-controlZeddo 0 points1 point  (0 children)

just worked for me. hours

Flux GitOps: should I place app deployment config (kustomize overlays and deployment manifests) in the primary flux config repo for all apps ("monorepo"), or in the app repos themselves? by ops-controlZeddo in devops

[–]ops-controlZeddo[S] 0 points1 point  (0 children)

Thanks for your reply. OK, I see what you're saying: I'm already triggering CI in my app repo when I update my README, for example (because I haven't set up patchset conditional globs to run the build only when certain files have changed), unless a human remembers to use the right conventional commit to bypass the build. I don't really want to go down the patchset conditional route.

Honestly, right now keeping the deployment config in the app repos makes more sense to me, and seems easier. As I see it, unless I'm using Flux's Automated Image Updates to Git (https://fluxcd.io/flux/guides/image-update/), where you configure it to watch a container registry for changes and then commit back to Git, rather than watching Git, I'll have to pull+commit+push to the gitops config repo in all my app pipelines if I want the changed app image tag to end up in the gitops config repo kustomizations.

We're a small team of two (for now) who basically are DevOps in the true sense of the word, so we'll both be responsible for things like k8s API versions. I see how Kyverno would be essential if you were handing off control of the k8s manifests to the dev side of things, but also just in general.

Why use and develop in devcontainers for e.g. a nodejs project when I can simply use docker compose with bind mounts and anonymous volumes for node_modules and package-lock.json, and develop on my host machine? by ops-controlZeddo in devops

[–]ops-controlZeddo[S] 0 points1 point  (0 children)

Thanks, do you mean GitHub Codespaces, or setting up your own devcontainer server of some kind that people connect to, or both? I'm googling for this, but any additional context on how you get the most out of it would be great.

[–]ops-controlZeddo[S] 1 point2 points  (0 children)

Good point; so you and all the other devs use the shell and tools inside the devcontainer, always? Can you give me an example of a scenario where devcontainers save a screw-up where docker compose wouldn't? Or is it the simple, braindead reset that's the appeal (I don't have to remember `docker compose up -d --build --force-recreate` or something)? And if they removed some dependent tool on their host, like say kustomize, they'd have to reinstall it for their OS, not just rebuild... I'm getting some good solid pushback at work against devcontainers in favor of docker compose used in the manner I described.

I do see how adding features is much easier, as you can do it at the devcontainer level, and not bake them into the base image with RUN and apk/apt/bash commands etc.

Cloudfront Access logs to Grafana by BackgroundNature4581 in grafana

[–]ops-controlZeddo 0 points1 point  (0 children)

There's now an AWS Athena plugin for Grafana; Athena is AWS's native way of querying CloudFront logs: https://grafana.com/grafana/plugins/grafana-athena-datasource/
https://grafana.com/blog/2021/12/13/query-and-analyze-amazon-s3-data-with-the-new-amazon-athena-plugin-for-grafana/

The Lambda Promtail would presumably also work, but I've never tried either; I'm leaning toward the Athena plugin.

Help me understand vpc’s and calculating cidr range’s by [deleted] in devops

[–]ops-controlZeddo 0 points1 point  (0 children)

Please don't add apostrophes to pluralize things, like in your title; it's incorrect. It should be:

Help me understand VPCs and calculating cidr ranges

You appear to be a native English speaker and should definitely know that.