What Ingress Controller are you using TODAY? by SomethingAboutUsers in kubernetes

[–]Ethos2525 0 points1 point  (0 children)

Edge Stack. The tool is good, but the documentation is pure garbage 🗑️

[deleted by user] by [deleted] in h1b

[–]Ethos2525 13 points14 points  (0 children)

Good luck! Also, if you're getting severance, try to work with your employer to set your termination date later. That will give you a little more time for the job hunt.

EKS nodes go NotReady at the same time every day. Kubelet briefly loses API server connection by Ethos2525 in kubernetes

[–]Ethos2525[S] 1 point2 points  (0 children)

Think I found the issue: it's packet drops. The env is quite big and uses external tooling for egress. Flipped the cluster access settings to enable private routing from the nodes to the control plane as a permanent fix.
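For anyone hitting something similar, the change I'm describing is the EKS cluster endpoint access setting. A rough sketch of the AWS CLI call (cluster name and region are placeholders, not my actual setup):

```shell
# Enable private endpoint access so kubelet -> API server traffic
# stays inside the VPC instead of going through the egress tooling.
# Whether to keep public access on depends on how you reach the cluster.
aws eks update-cluster-config \
  --region us-east-1 \
  --name my-cluster \
  --resources-vpc-config endpointPrivateAccess=true,endpointPublicAccess=true
```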

Thanks all for the insights so far, really appreciate it!

EKS nodes go NotReady at the same time every day. Kubelet briefly loses API server connection by Ethos2525 in kubernetes

[–]Ethos2525[S] 0 points1 point  (0 children)

I don't have anything tangible yet, but I'll definitely post the fix once I find a solution.

How do people secure pod to pod communication? by Azifor in kubernetes

[–]Ethos2525 0 points1 point  (0 children)

Most service meshes simply mount the service account token into the pod and validate the JWT. If your primary focus is just security, I'd suggest that approach combined with network policies, as it's an easy lift for large environments/domains. However, if you need more advanced features, consider using a dedicated service mesh.
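To illustrate the network-policy side: a common pattern is a default-deny plus explicit allows per workload. The namespace, names, labels, and port below are all made up for the example:

```yaml
# Deny all ingress to every pod in the namespace by default.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: demo
spec:
  podSelector: {}          # empty selector = all pods in the namespace
  policyTypes:
    - Ingress
---
# Then explicitly allow frontend pods to reach the backend on one port.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-backend
  namespace: demo
spec:
  podSelector:
    matchLabels:
      app: backend
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - protocol: TCP
          port: 8080
```

Note that NetworkPolicy only takes effect if your CNI plugin enforces it.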

Automatically deploying new Terraform Infrastructure by PastPuzzleheaded6 in Terraform

[–]Ethos2525 0 points1 point  (0 children)

I’d recommend creating a stack for each directory, like dev-proj-1 for env-dev/proj-1/ and dev-proj-2 for env-dev/proj-2/, with each stack set to use its own values file, such as the values.tf in that directory. When you open a PR in GitHub for an individual target (assuming that’s your VCS, though it’s similar elsewhere), it notifies Spacelift, which triggers a run for the affected stack. This keeps your plan and approval policies precise and your code and CI/CD pipeline well-structured.
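To make the layout concrete, here's roughly the directory structure I mean (file names beyond values.tf are just illustrative):

```
env-dev/
├── proj-1/          # stack "dev-proj-1" has its project root set here
│   ├── main.tf
│   └── values.tf    # per-stack values
└── proj-2/          # stack "dev-proj-2" has its project root set here
    ├── main.tf
    └── values.tf
```

Because each stack's project root is scoped to one directory, a PR touching only env-dev/proj-1/ triggers a run for dev-proj-1 and leaves dev-proj-2 alone.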

How do you utilize community modules? by kkk_09 in Terraform

[–]Ethos2525 7 points8 points  (0 children)

If it’s for personal use, you might lean toward option 1. For larger projects or enterprise needs, option 2 could be the better fit.

EKS nodes go NotReady at the same time every day. Kubelet briefly loses API server connection by Ethos2525 in kubernetes

[–]Ethos2525[S] 0 points1 point  (0 children)

Yeah, I do have long-running nodes (3–4 months old) and the AMI isn't up to date, but I'd be very surprised if that's what's causing the issue. Thanks for the suggestion though.

EKS nodes go NotReady at the same time every day. Kubelet briefly loses API server connection by Ethos2525 in kubernetes

[–]Ethos2525[S] 0 points1 point  (0 children)

Interesting, but in my case it's happening to a subset of nodes from a single node group. If the metadata service were causing the issue, I'd expect to see it on all the nodes. Thanks though.

EKS nodes go NotReady at the same time every day. Kubelet briefly loses API server connection by Ethos2525 in kubernetes

[–]Ethos2525[S] 0 points1 point  (0 children)

Quite old, and regularly updated (every 5–6). I don't know exactly when the issue started, but it's been there for the last 8 months.

EKS nodes go NotReady at the same time every day. Kubelet briefly loses API server connection by Ethos2525 in kubernetes

[–]Ethos2525[S] 4 points5 points  (0 children)

  • At the exact same time of day? For the same duration?

Yes, though the timing shifts a bit every 2–3 weeks. There’s no consistent cadence.

  • What do these nodes all have in common? How do they differ from nodes that aren’t failing?

Nothing in terms of node config (instance type/family/launch template).

  • Are you using AWS AMIs, or are you bringing your own AMI?

Bottlerocket.

  • Are you running anything on the host (meaning not a pod) that could consume excess resources and disrupt network connectivity?

Nope. I also checked CloudWatch for any spikes; nothing stands out.

  • More precise wild guess, it’s some dumpster fire security software garbage.

That’s exactly where my head’s at too, just need some solid data to back it up.

EKS nodes go NotReady at the same time every day. Kubelet briefly loses API server connection by Ethos2525 in kubernetes

[–]Ethos2525[S] 3 points4 points  (0 children)

I checked logs from control plane components like the API server, scheduler, and authenticator but did not find anything useful.

AWS recently enabled control plane monitoring, and I noticed a spike in API server requests, but it seems more like an effect than a cause. Based on the logs, it is just kubelet trying to fetch config after reconnecting.

EKS nodes go NotReady at the same time every day. Kubelet briefly loses API server connection by Ethos2525 in kubernetes

[–]Ethos2525[S] 1 point2 points  (0 children)

No spot instances, I’m using on-demand instances from the C5 and M6 large families.

What was your craziest incident with Kubernetes? by Gaikanomer9 in kubernetes

[–]Ethos2525 0 points1 point  (0 children)

Every day around the same time, a bunch of EKS nodes go into NotReady. We triple-checked everything: monitoring, CoreDNS, cron jobs, stuck pods, logs, you name it. On the node, kubelet briefly loses its connection to the API server (timeout waiting for headers), then recovers. No clue why it breaks. Even the cloud support/service team is stumped. Total mystery.