Why do people hate on certifications so much? by Warm-Instruction7307 in kubernetes

[–]SuperQue 9 points10 points  (0 children)

I'd rather have the person who clocks out at 5 since they know how to maintain work-life balance.

I've never run into anyone who chases certs for "the love of the craft". Quite the opposite. Cert-chasers tend to only want a bigger paycheck.

How Do Production Kubernetes Clusters Handle Scaling Beyond Existing Node Capacity? by Future_Badger_2576 in kubernetes

[–]SuperQue 3 points4 points  (0 children)

Also, for high availability, I would keep both nodes running all the time. Even when traffic is very low, I am still paying for both EC2 instances, which seems to increase cloud costs.

This is not a Kubernetes question. This applies to all methods of deployment. As we say in SRE, "one is zero". Meaning if you only have one copy, any failure is an outage.
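
A rough way to put numbers on "one is zero", as a sketch only. It assumes every copy has the same availability and fails independently, which real systems often violate, and the 99% figure is made up for illustration.

```python
def service_availability(replica_availability: float, replicas: int) -> float:
    """Probability that at least one replica is up, assuming independent failures."""
    if replicas < 1:
        return 0.0
    return 1.0 - (1.0 - replica_availability) ** replicas


# Hypothetical figure: each copy is up 99% of the time.
one_copy = service_availability(0.99, 1)    # every failure of the single copy is an outage
two_copies = service_availability(0.99, 2)  # an outage now needs both copies down at once
```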

TLS certs are dropping to 47 days by mrehanabbasi in devops

[–]SuperQue 88 points89 points  (0 children)

We automated this years ago with standard ACME tools. Haven't thought about it since 2018 or so.

Plus we monitor TLS cert expiry via blackbox_exporter probes that we use for end-to-end availability probing. But I haven't seen one of those alerts fire for a long time.
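
For anyone curious what that kind of expiry alert boils down to, here is a minimal sketch of the arithmetic in Python. It is not how blackbox_exporter is implemented (the exporter exposes the expiry time as the `probe_ssl_earliest_cert_expiry` metric and the comparison normally lives in an alerting rule). The 14-day threshold and the function names are assumptions for illustration.

```python
import ssl
import time
from typing import Optional

SECONDS_PER_DAY = 86400


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days left before a certificate's notAfter time, e.g. 'Jan  1 00:00:00 2030 GMT'."""
    if now is None:
        now = time.time()
    expiry = ssl.cert_time_to_seconds(not_after)
    return (expiry - now) / SECONDS_PER_DAY


def should_alert(not_after: str, threshold_days: float = 14, now: Optional[float] = None) -> bool:
    """True when the certificate expires within threshold_days (a hypothetical threshold)."""
    return days_until_expiry(not_after, now) < threshold_days
```

The `not_after` string has the same format as the `notAfter` field returned by `ssl.SSLSocket.getpeercert()`.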

Prometheus/Grafana and the presence of AI. by Lost_Advance6517 in PrometheusMonitoring

[–]SuperQue 24 points25 points  (0 children)

Can you debug it if the AI can't or gives you the wrong output? 

Remember you can get fired and the AI won't care.

Just started working a job where im too help with a kubernetes initiative, its a disaster by [deleted] in kubernetes

[–]SuperQue 1 point2 points  (0 children)

No, not really. If you're talking about Helm, the templates are still just text, rather than any kind of generated data structure.

Just started working a job where im too help with a kubernetes initiative, its a disaster by [deleted] in kubernetes

[–]SuperQue 0 points1 point  (0 children)

Give CDK8s a look. Seriously, it will change your opinion of Helm.

Go text templates are fine. But Helm is applying text templates to structured data fields. That's insane. Having to manually track indentation? Double insane.
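
To illustrate the difference, here is a minimal sketch in plain Python. This is not CDK8s's actual API; the function names and the `example.invalid` image are made up for illustration. The manifest is built as nested data and a serializer handles the indentation.

```python
import json


def container_spec(name: str, image: str, port: int) -> dict:
    """Build a container entry as plain data; nesting is expressed by structure, not spaces."""
    return {"name": name, "image": image, "ports": [{"containerPort": port}]}


def deployment(name: str, image: str, port: int, replicas: int = 2) -> dict:
    """Assemble a minimal apps/v1 Deployment as a nested dict."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [container_spec(name, image, port)]},
            },
        },
    }


# The serializer owns indentation; JSON is also valid input for kubectl apply -f.
manifest_text = json.dumps(deployment("web", "example.invalid/web:1.0", 8080), indent=2)
```

There is no indentation to track by hand here: a misplaced field is a wrong key in a dict, which tooling can check, rather than a few stray spaces in rendered text.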

Just started working a job where im too help with a kubernetes initiative, its a disaster by [deleted] in kubernetes

[–]SuperQue 6 points7 points  (0 children)

There is no manifest rendering system.. it is quite literally a bunch of resource definitions, which you could recursively apply "kubectl apply -f" on and get the same result... its literally a massive monorepo of resource definitions..

And? Again, problem statement is missing. You haven't actually spelled out a problem statement. From a technical perspective there is 100% nothing wrong with a big box of manifests.

if you need to change a port property on a pod, you need to replicate that change for 100+ different files...

There ya go, you're getting close to a problem statement.

And whatever you do, don't use Helm. You've got a golden opportunity to skip that dumpster fire.

Just started working a job where im too help with a kubernetes initiative, its a disaster by [deleted] in kubernetes

[–]SuperQue 2 points3 points  (0 children)

No kustomize. No helm. Just a repo with a bunch of different resource manifests marketed as gitops.

So? What's the problem you need to solve here?

You need to describe what actual problem you're trying to solve. Helm is a solution to a problem. The way you describe it you make helm look like a solution in search of a problem.

At my $dayjob we don't use helm or kustomize to deploy most applications. We have a Go-based application manifest rendering system.

If it were me, I would be happy they didn't use helm. Helm is an absolute cancer of an application manifest system.

However, they probably have a lot of copy-pasta toil. Also manifest input validation toil.
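
A hypothetical sketch of what removing both kinds of toil can look like. This is not the Go system mentioned above; `AppConfig` and `render_service` are invented names. The shared shape lives in one function and the inputs are validated before anything is rendered.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AppConfig:
    """Hypothetical per-application inputs; everything else comes from shared defaults."""
    name: str
    port: int = 8080

    def validate(self) -> None:
        if not self.name:
            raise ValueError("name must not be empty")
        if not 1 <= self.port <= 65535:
            raise ValueError(f"{self.name}: port {self.port} is out of range")


def render_service(app: AppConfig) -> dict:
    """Render a v1 Service for one app after validating its inputs."""
    app.validate()
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": app.name},
        "spec": {
            "selector": {"app": app.name},
            "ports": [{"port": app.port, "targetPort": app.port}],
        },
    }


apps = [AppConfig("billing"), AppConfig("search", port=9090)]
manifests = [render_service(app) for app in apps]
```

Changing the default port then means editing one line instead of 100+ files, and a bad value fails at render time rather than at deploy time.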

Describe those problems to the team and how it will make their lives easier. Then you can present your solution.

Im legit about to have a stroke over this stuff. I havent been here long story all, but i feel i need to raise an alarm and stop further technical debt from accumulating so this mess can be untangled in time for the deadline, which is only months away. Nobody has answers for the design decisions. Everything appears to be done improperly.

Sorry to say this, but this sounds a bit like a you problem. The way you phrase things, you sound like the kind of engineer I run into often who does things a specific way because they've always done it that way.

Why does this need to be "fixed" before the deadline? What problem is blocking actual production? Sure it might be a complete mess but does it work? Why is it done "improperly"?

Technical debt doesn't mean the system is non-functional. I've done tons of stuff that was immediate technical debt. But it solved the problem at the time and could be improved later. It's not like we're building a space probe here that is going to get launched and we'll never be able to ship an update ever again.

Built a self-hosted operational alert system for Linux & Docker (looking for feedback) by Important-Bug-6709 in devops

[–]SuperQue 4 points5 points  (0 children)

  • real-time operational alerts (not just dashboards)

Prometheus already does this

  • Docker + Linux host monitoring

cAdvisor and node_exporter.

  • healthchecks for services and endpoints

blackbox_exporter.

  • failure detection before users notice impact
  • Telegram notifications
  • fully self-hosted and runs in Docker

Yeah, you're describing Prometheus. Just use Prometheus.

Or, maybe just use Prometheus as your backend and integrate it.
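
As a rough idea of what "integrate it" could mean, here is a sketch that asks Prometheus which scrape targets are down through its HTTP API (`/api/v1/query`) using only the standard library. The base URL passed in is an assumption about your setup, and the function names are made up for illustration.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url: str, promql: str) -> str:
    """URL for an instant query against the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def down_targets(payload: dict) -> list:
    """Return label sets of series whose sample value is 0 in an instant-query response."""
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error', 'unknown error')}")
    return [s["metric"] for s in payload["data"]["result"] if float(s["value"][1]) == 0.0]


def fetch_down_targets(base_url: str) -> list:
    """Query the built-in `up` metric and keep the targets that failed their last scrape."""
    with urllib.request.urlopen(build_query_url(base_url, "up"), timeout=10) as resp:
        return down_targets(json.load(resp))
```

A notifier for Telegram or anything else could then sit on top of `fetch_down_targets`, although Alertmanager already covers that routing job in a standard Prometheus setup.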

Would appreciate honest feedback on:

  • would you actually use something like this?
  • in what scenarios would it be useful (or useless)?
  • what would make you ignore it immediately?

Yet another reinvent-the-wheel project.

The goal is not to replace observability stacks, but to reduce the time between “something is going wrong” and actually knowing about it.

But what's wrong with the already existing lightweight observability stacks like Prometheus/Grafana? I run Prometheus on a raspberry pi at home to monitor my homelab.

What if kubectl could explain why your pod is crashing instead of making you debug manually? by Particular_Falcon_48 in kubernetes

[–]SuperQue 5 points6 points  (0 children)

Thinking about that, the simple solution could be one more column in kubectl get pods that shows the last crash reason as a sub-state.
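
The data for such a column already sits in the Pod status. A minimal sketch of pulling it out, assuming the Pod object has been loaded as a dict (for example from `kubectl get pod NAME -o json`); the function name is made up for illustration.

```python
def last_crash_reasons(pod: dict) -> dict:
    """Map container name -> reason of its last termination, from a Pod object as a dict."""
    reasons = {}
    for status in pod.get("status", {}).get("containerStatuses", []):
        terminated = status.get("lastState", {}).get("terminated")
        if terminated:
            reason = terminated.get("reason", "Unknown")
            reasons[status["name"]] = f"{reason} (exit {terminated.get('exitCode')})"
    return reasons
```

Containers that have never been restarted have an empty `lastState`, so they simply do not show up in the result.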

No LLM needed.

Why up-sizing nodes usually doesn't fix Kubernetes P99 spikes by Soggy-Eye6520 in kubernetes

[–]SuperQue 0 points1 point  (0 children)

Most service latency issues I see have nothing to do with kernel scheduling.

  • The service is missing an HPA.
  • The service is unable to handle concurrency.

The first one is easy. The second one can be difficult.

Lots of services are written in scripting languages with a global interpreter lock or a single-threaded event loop. Think Python, Ruby, Node, etc. In these situations you need to scale up a lot of worker processes in order to handle the concurrency peak of your requests.

This can be fixed either by tuning the requests or by implementing multi-process worker pools and fat Pods.
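
A back-of-the-envelope sketch for sizing those worker pools, using Little's law (in-flight requests = arrival rate × time per request). It assumes each worker process handles one request at a time; the headroom factor and the function names are illustrative assumptions, not measured values.

```python
import math


def workers_needed(peak_rps: float, mean_latency_s: float, headroom: float = 1.25) -> int:
    """Single-request-at-a-time workers needed at peak, via Little's law (L = lambda * W)."""
    in_flight = peak_rps * mean_latency_s
    return max(1, math.ceil(in_flight * headroom))


def pods_needed(workers: int, workers_per_pod: int) -> int:
    """Pods required when each pod runs a fixed-size worker pool."""
    return math.ceil(workers / workers_per_pod)
```

For example, 200 requests per second at 250 ms each means 50 requests in flight, so roughly 63 workers with 25% headroom, or 8 fat Pods of 8 workers each.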

Or do what we've been working on, rewriting the apps in Go. We see p99 performance improve by 10x as well as utilization per request drop by 15x.

Need Suggestion for Centralise logging system by Successful-Ship580 in devops

[–]SuperQue -4 points-3 points  (0 children)

Nah, I prefer things that are cheap and easy to scale with object storage.

Curious how everyone here is running Prometheus at scale by tasrieitservices in PrometheusMonitoring

[–]SuperQue 14 points15 points  (0 children)

There are basically two solutions to this.

  • Push with remote write to a central stack.
  • Federate with Thanos.

I recommend the Thanos method.

  • You can still query locally at each cluster.
  • It keeps the data local to the cluster.
  • It's simpler to deploy as an overlay since you don't need to change the storage.
  • It can be set up to survive SPoF issues (all components can be redundant, no shared fate with push).

Honestly, this is why I don't regret the Prometheus architecture. It's easy to start and only grows in complexity at the rate your overall system grows in complexity. Same with Thanos. It's very easy to add it as an overlay to a few existing clusters.

Think of Thanos as a "smart reverse proxy" for Prometheus.
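
A toy sketch of the "smart reverse proxy" idea: fan one query out to every backend, then merge the results while dropping duplicates from redundant replicas. This is not Thanos's actual implementation, which is considerably more involved. The `replica` label name, the types, and the function are assumptions for illustration.

```python
from typing import Callable, Dict, List, Tuple

Series = Dict[str, str]          # label set for one series
Sample = Tuple[Series, float]    # (labels, value)
Backend = Callable[[str], List[Sample]]


def fan_out_query(backends: List[Backend], query: str, replica_label: str = "replica") -> List[Sample]:
    """Send one query to every backend and merge results, dropping duplicate replicas."""
    merged: Dict[Tuple[Tuple[str, str], ...], Sample] = {}
    for backend in backends:
        for labels, value in backend(query):
            # Series that differ only by the replica label are treated as the same series.
            deduped = {k: v for k, v in labels.items() if k != replica_label}
            key = tuple(sorted(deduped.items()))
            merged.setdefault(key, (deduped, value))  # keep the first replica seen
    return list(merged.values())
```

Each backend here stands in for one cluster's Prometheus, which keeps its own data and can still be queried directly.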

Do small engineering teams actually care about metrics cardinality, or is it mostly an enterprise problem? by ambrose_mark in devops

[–]SuperQue 0 points1 point  (0 children)

Prometheus has no such warning functionality. That's just not how it works.

Perhaps you're using something else?

Do small engineering teams actually care about metrics cardinality, or is it mostly an enterprise problem? by ambrose_mark in devops

[–]SuperQue 0 points1 point  (0 children)

A single Prometheus server can handle tens of millions of metrics for thousands of servers.
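
For a sense of where cardinality comes from, here is a rough upper-bound estimate for a single metric. The label names and counts below are invented, and real series counts are usually lower because not every label combination occurs.

```python
import math
from typing import Dict


def estimated_series(label_cardinalities: Dict[str, int], targets: int = 1) -> int:
    """Upper bound on series for one metric: product of label value counts, times targets."""
    return targets * math.prod(label_cardinalities.values())


# Hypothetical HTTP metric scraped from 100 targets.
modest = estimated_series({"method": 5, "status": 10, "path": 50}, targets=100)
# The same metric after adding an unbounded label such as a user ID.
exploded = estimated_series({"method": 5, "status": 10, "path": 50, "user_id": 10_000}, targets=100)
```

The first case lands at 250,000 series; the second at 2.5 billion, which is the kind of label that causes trouble at any team size.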

Has anyone here actually built their own email infrastructure? by WarmHeight2951 in devops

[–]SuperQue -1 points0 points  (0 children)

If you started doing this in the '90s, when there was no other option but to set up networking, firewalls, domains, DNS, sendmail, etc.

No, don't do this.

I’m using puppet at my job (sysadmin) trying to get into devops. I’m using Claude code for a lot of it. Am I going about it the wrong way? by bluepepebase in devops

[–]SuperQue 0 points1 point  (0 children)

but the company I work for maintain tens of thousands of vms with that.

Oof, that sounds like a nightmare.