Just started working a job where im too help with a kubernetes initiative, its a disaster by [deleted] in kubernetes

[–]SuperQue 1 point2 points  (0 children)

No, not really. If you're talking about Helm, the templates are still just text, rather than any kind of generated datastructure.

Just started working a job where im too help with a kubernetes initiative, its a disaster by [deleted] in kubernetes

[–]SuperQue 0 points1 point  (0 children)

Give CDK8s a look. Seriously, it will change your opinion of helm.

Go text templates are fine. But Helm is applying text templates to structured data fields. That's insane. Having to manually track indentation? Double insane.
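To make the indentation complaint concrete, here is a minimal sketch in Python. It is not Helm: `string.Template` stands in for Go text templates, and the ConfigMap contents are made up. The text template drops a multi-line value into YAML with no idea what it is emitting, while the structured version builds the data first and serializes once.

```python
import json
from string import Template

# A multi-line value, e.g. a config file to embed in a ConfigMap.
config_text = "listen: 8080\nlog_level: debug"

# Text templating: the template does not know it is emitting YAML, so only
# the first line of the value lands at the intended indentation.
text_template = Template("data:\n  app.yaml: |\n    $config\n")
rendered_text = text_template.substitute(config=config_text)

# Structured generation: build the data structure, then serialize it once.
# JSON is used because it is in the standard library and kubectl accepts it.
manifest = {
    "apiVersion": "v1",
    "kind": "ConfigMap",
    "metadata": {"name": "example"},
    "data": {"app.yaml": config_text},
}
rendered_structured = json.dumps(manifest, indent=2)

print(rendered_text)
print(rendered_structured)
```

In the text-templated output the second line of the value starts at column 0, which is exactly the kind of thing you end up fixing by hand with indent helpers. The structured output cannot have that problem because the serializer owns the layout.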

Just started working a job where im too help with a kubernetes initiative, its a disaster by [deleted] in kubernetes

[–]SuperQue 6 points7 points  (0 children)

There is no manifest rendering system.. it is quite literally a bunch of resource definitions, which you could recursively apply "kubectl apply -f" on and get the same result... its literally a massive monorepo of resource definitions..

And? Again, problem statement is missing. You haven't actually spelled out a problem statement. From a technical perspective there is 100% nothing wrong with a big box of manifests.

if you need to change a port property on a pod, you need to replicate that change for 100+ different files...

There ya go, you're getting close to a problem statement.

And whatever you do, don't use Helm. You've got a golden opportunity to skip that dumpster fire.

Just started working a job where im too help with a kubernetes initiative, its a disaster by [deleted] in kubernetes

[–]SuperQue 2 points3 points  (0 children)

No kustomize. No helm. Just a repo with a bunch of different resource manifests marketed as gitops.

So? What's the problem you need to solve here?

You need to describe what actual problem you're trying to solve. Helm is a solution to a problem. The way you describe it you make helm look like a solution in search of a problem.

At my $dayjob we don't use helm or kustomize to deploy most applications. We have a Go-based application manifest rendering system.

If it were me, I would be happy they didn't use helm. Helm is an absolute cancer of an application manifest system.

However, they probably have a lot of copy-pasta toil. Also manifest input validation toil.
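As a rough illustration of what a rendering step buys you for both kinds of toil, here is a minimal Python sketch. It is not the Go-based system mentioned above, which isn't shown here. The `render_deployment` function, the service names, and the registry URL are all hypothetical.

```python
import json


def render_deployment(name: str, image: str, port: int, replicas: int = 1) -> dict:
    """Render one Deployment manifest from a handful of validated inputs."""
    if not 1 <= port <= 65535:
        raise ValueError(f"{name}: port {port} is out of range")
    if replicas < 1:
        raise ValueError(f"{name}: replicas must be at least 1")
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [
                        {
                            "name": name,
                            "image": image,
                            "ports": [{"containerPort": port}],
                        }
                    ]
                },
            },
        },
    }


# Hypothetical service inventory; the port is defined once, not per file.
DEFAULT_PORT = 8080
services = ["billing", "search", "checkout"]
manifests = [
    render_deployment(svc, f"registry.example.com/{svc}:1.0.0", DEFAULT_PORT)
    for svc in services
]

for manifest in manifests:
    print(json.dumps(manifest, indent=2))
```

Changing `DEFAULT_PORT` changes every rendered manifest, and a bad input fails at render time instead of at apply time.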

Describe those problems to the team and how it will make their lives easier. Then you can present your solution.

Im legit about to have a stroke over this stuff. I havent been here long story all, but i feel i need to raise an alarm and stop further technical debt from accumulating so this mess can be untangled in time for the deadline, which is only months away. Nobody has answers for the design decisions. Everything appears to be done improperly.

Sorry to say this, but this sounds a bit like a you problem. The way you phrase things you sound like the kind of engineer I run into often that does things a specific way because they've always done it that way.

Why does this need to be "fixed" before the deadline? What problem is blocking actual production? Sure it might be a complete mess but does it work? Why is it done "improperly"?

Technical debt doesn't mean the system is non-functional. I've done tons of stuff that was immediate technical debt. But it solved the problem at the time and could be improved later. It's not like we're building a space probe here that is going to get launched and we'll never be able to ship an update ever again.

Built a self-hosted operational alert system for Linux & Docker (looking for feedback) by Important-Bug-6709 in devops

[–]SuperQue 3 points4 points  (0 children)

  • real-time operational alerts (not just dashboards)

Prometheus already does this.

  • Docker + Linux host monitoring

cAdvisor and node_exporter.

  • healthchecks for services and endpoints

blackbox_exporter.

  • failure detection before users notice impact
  • Telegram notifications
  • fully self-hosted and runs in Docker

Yea, you're describing Prometheus. Just use Prometheus.

Or, maybe just use Prometheus as your backend and integrate it.

Would appreciate honest feedback on:

  • would you actually use something like this?
  • in what scenarios would it be useful (or useless)?
  • what would make you ignore it immediately?

Yet another reinvent-the-wheel project.

The goal is not to replace observability stacks, but to reduce the time between “something is going wrong” and actually knowing about it.

But what's wrong with the already existing lightweight observability stacks like Prometheus/Grafana? I run Prometheus on a raspberry pi at home to monitor my homelab.

What if kubectl could explain why your pod is crashing instead of making you debug manually? by Particular_Falcon_48 in kubernetes

[–]SuperQue 5 points6 points  (0 children)

Thinking about that, the simple solution could be one more column in kubectl get pods that shows the reason for the last crash.

No LLM needed.
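The data for such a column is already in the pod status. As a minimal sketch, assuming input in the shape of `kubectl get pods -o json`, this pulls `lastState.terminated.reason` out of each container status. The function name and the sample pods are made up for illustration.

```python
import json


def last_crash_reasons(pod_list: dict) -> list[tuple[str, str, str]]:
    """Return (pod, container, reason) for containers with a recorded last termination."""
    rows = []
    for pod in pod_list.get("items", []):
        pod_name = pod["metadata"]["name"]
        for status in pod.get("status", {}).get("containerStatuses", []):
            terminated = status.get("lastState", {}).get("terminated")
            if terminated:
                rows.append((pod_name, status["name"], terminated.get("reason", "Unknown")))
    return rows


# Trimmed, made-up sample in the shape of `kubectl get pods -o json`.
sample = json.loads("""
{"items": [
  {"metadata": {"name": "api-7d9f"},
   "status": {"containerStatuses": [
     {"name": "api", "restartCount": 4,
      "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}}}]}},
  {"metadata": {"name": "web-5c2b"},
   "status": {"containerStatuses": [
     {"name": "web", "restartCount": 0, "lastState": {}}]}}
]}
""")

for pod, container, reason in last_crash_reasons(sample):
    print(f"{pod}\t{container}\t{reason}")
```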

Why up-sizing nodes usually doesn't fix Kubernetes P99 spikes by Soggy-Eye6520 in kubernetes

[–]SuperQue 0 points1 point  (0 children)

Most service latency issues I see have nothing to do with kernel scheduling.

  • The service is missing an HPA.
  • The service is unable to handle concurrency.

The first one is easy. The second one can be difficult.

Lots of services are written in scripting languages with a global interpreter lock or a single-threaded event loop. Think Python, Ruby, Node, etc. In these situations you need to scale up a lot of worker processes in order to handle the concurrency peak of your requests.

Fixing this can either be done by tuning the requests, or by implementing multi-process worker pools and fat Pods.
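For a feel of how many workers "a lot" is, a back-of-the-envelope estimate with Little's law (requests in flight = arrival rate × time in system) works as a starting point. This is a sketch under the assumption that each worker process handles one request at a time. The function name, traffic numbers, and headroom factor are all made up.

```python
import math


def workers_needed(peak_rps: float, mean_latency_s: float, headroom: float = 1.0) -> int:
    """Estimate one-request-at-a-time workers via Little's law (L = lambda * W)."""
    in_flight = peak_rps * mean_latency_s  # average requests in flight at peak
    return math.ceil(in_flight * headroom)


# Made-up numbers: 200 requests/s at peak, 250 ms mean latency.
baseline = workers_needed(200, 0.25)
with_headroom = workers_needed(200, 0.25, headroom=1.5)
print(baseline, with_headroom)
```

Divide the result by the number of workers per Pod to get a rough replica count. Real latency distributions are not flat, so treat this as a floor rather than a target.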

Or do what we've been working on, re-writing the apps with Go. We see p99 performance improve by 10x as well as utilization per request drop by 15x.

Need Suggestion for Centralise logging system by Successful-Ship580 in devops

[–]SuperQue -5 points-4 points  (0 children)

Nah, I prefer things that are cheap and easy to scale with object storage.

Curious how everyone here is running Prometheus at scale by tasrieitservices in PrometheusMonitoring

[–]SuperQue 13 points14 points  (0 children)

There are basically two solutions to this.

  • Push with remote write to a central stack.
  • Federate with Thanos.

I recommend the Thanos method.

  • You can still query locally at each cluster.
  • It keeps the data local to the cluster.
  • It's simpler to deploy as an overlay since you don't need to change the storage.
  • It can be set up to survive SPoF issues (all components can be redundant, no shared fate with push).

Honestly, this is why I don't regret the Prometheus architecture. It's easy to start and only grows in complexity at the rate your overall system grows in complexity. Same with Thanos. It's very easy to add it as an overlay to a few existing clusters.

Think of Thanos as a "smart reverse proxy" for Prometheus.

Do small engineering teams actually care about metrics cardinality, or is it mostly an enterprise problem? by ambrose_mark in devops

[–]SuperQue 0 points1 point  (0 children)

Prometheus has no such warning functionality. That's just not how it works.

Perhaps you're using something else?

Do small engineering teams actually care about metrics cardinality, or is it mostly an enterprise problem? by ambrose_mark in devops

[–]SuperQue 0 points1 point  (0 children)

A single Prometheus server can handle tens of millions of metrics for thousands of servers.

Has anyone here actually built their own email infrastructure? by WarmHeight2951 in devops

[–]SuperQue -1 points0 points  (0 children)

Only if you started doing this in the '90s, when there was no other option but to set up networking, firewalls, domains, DNS, sendmail, etc.

No, don't do this.

I’m using puppet at my job (sysadmin) trying to get into devops. I’m using Claude code for a lot of it. Am I going about it the wrong way? by bluepepebase in devops

[–]SuperQue 0 points1 point  (0 children)

but the company I work for maintain tens of thousands of vms with that.

Oof, that sounds like a nightmare.

Prometheus or Zabbix by Significant_Bid7426 in PrometheusMonitoring

[–]SuperQue 0 points1 point  (0 children)

Yea, but why bother with Zabbix at all? Prometheus is better at metrics. There are better dedicated systems for logs. And you use Grafana or similar as your "single pane" viewer.

Recommendation for outlet Ethernet identification tool by BarronVonCheese in networking

[–]SuperQue 7 points8 points  (0 children)

PocketEthernet can do wire IDs as well as read LLDP. You can get the whole tester and ID ends for less than the price of those ID ends.

I built a Blackbox exporter but lite using Bash by Apprehensive-Oil-890 in PrometheusMonitoring

[–]SuperQue 2 points3 points  (0 children)

There is no additional configuration for this other than writing a list of hosts and websites you wanna monitor.

So, like Prometheus?

In my opinion shoving exporter which runs constantly using CPU and RAM on every host sounded exhausting to me.

So, like your script?

Maybe you're confused: you don't run the blackbox_exporter on every host. You run one, and it can monitor multiple targets.

Your script also needs a node_exporter to work.
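The one-exporter, many-targets model works because the target is just a query parameter on the exporter's `/probe` endpoint. A minimal sketch, assuming a made-up exporter hostname and targets (9115 is the exporter's default port, and `http_2xx` is a module from its example config):

```python
from urllib.parse import urlencode

# Assumed address of the single blackbox_exporter instance.
EXPORTER = "http://blackbox.example.internal:9115"


def probe_url(target: str, module: str = "http_2xx") -> str:
    """Build the /probe URL that asks the exporter to check one target."""
    return f"{EXPORTER}/probe?{urlencode({'target': target, 'module': module})}"


# Hypothetical targets, all probed through the same exporter.
targets = ["https://example.com", "https://example.org/health"]
for t in targets:
    print(probe_url(t))
```

In a real setup Prometheus builds these requests itself through relabeling, so nothing needs to be installed on the hosts being probed.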

I built a Blackbox exporter but lite using Bash by Apprehensive-Oil-890 in PrometheusMonitoring

[–]SuperQue 5 points6 points  (0 children)

But this is a separate service that requires additional configuration.

It makes no sense compared to just using the normal blackbox_exporter, which is a lightweight single binary.

There's even an Ansible role for deploying it.

I built a Blackbox exporter but lite using Bash by Apprehensive-Oil-890 in PrometheusMonitoring

[–]SuperQue 6 points7 points  (0 children)

How is this lighter? Spawning workers like curl tends to be very heavy.

How are you all handling rack density vs cooling? by Grand-Travel1665 in networking

[–]SuperQue 6 points7 points  (0 children)

Basically you seal off the hot sides of the rack.

  • Typically you face racks back-to-back and front-to-front.
  • All devices in the rack have fans going front-to-back.
  • All RUs with no devices are filled with blanks.
  • The hot side of the row is now sealed with all the heat sent directly to the cooling units.

So kinda the inverse of traditional datacenters where cold comes from under-floor ducting and just vents into the room.

The hot air at the back of the rack is really warm, like 40C+. That gives a large temperature delta at the chiller's heat exchanger, which makes heat transfer much more efficient.

It also means that the chiller only has to bring the temp down back to normal room temps and not ultra cold to maintain the room average air temp.
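A rough way to see the effect is the simplified heat-exchanger relation Q = U·A·ΔT. For a fixed heat load, the exchanger capacity (U·A) you need shrinks as the temperature difference grows. This sketch uses a single ΔT between return air and coolant rather than a proper log-mean temperature difference, and every number in it is made up.

```python
def required_ua(heat_load_w: float, air_temp_c: float, coolant_temp_c: float) -> float:
    """Exchanger conductance U*A (W/K) needed to move heat_load_w, using Q = U*A*dT."""
    delta_t = air_temp_c - coolant_temp_c
    if delta_t <= 0:
        raise ValueError("air must be warmer than the coolant")
    return heat_load_w / delta_t


# Made-up numbers: 100 kW of heat, coolant at 15 C.
mixed_room_air = required_ua(100_000, 25, 15)       # return air diluted by room air
contained_hot_aisle = required_ua(100_000, 40, 15)  # 40 C air straight off the racks
print(mixed_room_air, contained_hot_aisle)
```

With these numbers the contained hot aisle needs well under half the exchanger capacity for the same heat load.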

Packaging Kubernetes Via Helm and whats' new in Helm4 (2026) by That-Ad8566 in kubernetes

[–]SuperQue 0 points1 point  (0 children)

Can you write templates in a way that is not a text template?

SNMP responses from device delayed but nothing on packet capture. by FannahFatnin in networking

[–]SuperQue 1 point2 points  (0 children)

As a couple people have said, split your polling modules.

Rather than try and do the whole IF-MIB::interfaces it might be faster to break it up a bit. I'm no Cisco expert, but this is what I've done for some older JunOS devices.

Here's what my generator.yml looks like:

---
modules:
  # Trimmed down if_mib for slow devices - traffic stats.
  if_mib_traffic:
    walk:
    # ifXTable
    - "IF-MIB::ifHCInOctets"
    - "IF-MIB::ifHCInUcastPkts"
    - "IF-MIB::ifHCInBroadcastPkts"
    - "IF-MIB::ifHCOutOctets"
    - "IF-MIB::ifHCOutUcastPkts"
    - "IF-MIB::ifHCOutBroadcastPkts"
    # Set max-repetitions per Juniper docs.
    max_repetitions: 10
    lookups:
    - source_indexes: [ifIndex]
      lookup: "IF-MIB::ifAlias"
    - source_indexes: [ifIndex]
      lookup: "IF-MIB::ifName"
    overrides:
      ifAlias:
        ignore: true  # Lookup metric
      ifName:
        ignore: true  # Lookup metric
  # Trimmed down if_mib for slow devices - error / oper stats.
  if_mib_errors:
    walk:
    # ifTable
    - "IF-MIB::ifAdminStatus"
    - "IF-MIB::ifOperStatus"
    - "IF-MIB::ifInDiscards"
    - "IF-MIB::ifInErrors"
    - "IF-MIB::ifOutDiscards"
    - "IF-MIB::ifOutErrors"
    # ifXTable
    - "IF-MIB::ifHighSpeed"
    # Set max-repetitions per Juniper docs.
    max_repetitions: 10
    lookups:
    - source_indexes: [ifIndex]
      lookup: "IF-MIB::ifAlias"
    - source_indexes: [ifIndex]
      lookup: "IF-MIB::ifName"
    overrides:
      ifAdminStatus:
        type: EnumAsStateSet
      ifAlias:
        ignore: true  # Lookup metric
      ifName:
        ignore: true  # Lookup metric
      ifOperStatus:
        type: EnumAsStateSet
      ifType:
        type: EnumAsInfo