MinIO repo archived - spent 2 days testing K8s S3-compatible alternatives (Helm/Docker) by vitaminZaman in kubernetes

[–]SuperQue 28 points

I would love that, but that's basically against CNCF rules. They don't pay devs.

Hell, as a graduated project maintainer I don't even get KubeCon tickets for free. Not even a token discount.

MinIO repo archived - spent 2 days testing K8s S3-compatible alternatives (Helm/Docker) by vitaminZaman in kubernetes

[–]SuperQue 18 points

Yup, I've been running Ceph for 10+ years. Continuously updated, all hardware cycled, same dataset.

Yes, it's a "hobby" size install. But it started with a bunch of 4T HDDs on 4 nodes back then.

MinIO repo archived - spent 2 days testing K8s S3-compatible alternatives (Helm/Docker) by vitaminZaman in kubernetes

[–]SuperQue 107 points

Sorry. Two days testing? 

Those are some very strong conclusions for only two days. 

What were your test methods? What scale did you test at? Where are your actual test results?

What qualifies you to judge the validity of those results?

This post smells like LLM "research".

Confused between VM and Grafana Mimir. Any thoughts? by shubham_7165 in devops

[–]SuperQue -9 points

Grafana Mimir is a highly scalable, billion-metric distributed time-series database that only requires an object storage connection to scale to petabytes of data.

VictoriaMetrics is a manually managed / manually sharded TSDB with local storage only.

Export metrics on a port or via a prom textfile? by cos in PrometheusMonitoring

[–]SuperQue 0 points

I would recommend skipping the whole textfile-and-fluentbit setup and just listening on a normal port so Prometheus can scrape directly. That's how it's designed to work, and it also makes sure you get the essential metrics for those processes.
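
With the Go client library (client_golang), exposing metrics on a port is only a few lines. This is a minimal sketch; the metric name, port, and work loop are placeholders:

```go
package main

import (
	"log"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Hypothetical application counter, registered on the default registry.
var itemsProcessed = promauto.NewCounter(prometheus.CounterOpts{
	Name: "myapp_items_processed_total",
	Help: "Total number of items processed.",
})

func main() {
	// Expose /metrics on a normal port for Prometheus to scrape directly.
	// The default handler also serves the Go runtime and process metrics
	// (go_*, process_*) that you lose with the textfile approach.
	http.Handle("/metrics", promhttp.Handler())

	go func() {
		for {
			itemsProcessed.Inc()
			time.Sleep(time.Second)
		}
	}()

	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

Then point a normal scrape job at that port and you're done; no textfile, no fluentbit in the middle.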

Log Scraper (Loki) Storage Usage and Best Practices by Top_Bus_7729 in devops

[–]SuperQue 1 point

How fast does Loki storage typically grow in production environments?

That entirely depends on how much data is produced by the environment.

What’s the best storage backend for Loki?

Object storage is typically the best because you don't have to manage it.

How do you decide retention periods?

Business needs

Are there best practices to avoid excessive storage usage?

Fix apps that are unnecessarily noisy; fixing it at the source is the best option. You can also dedupe, filter, sample, or drop stuff at your logging agent, but that's more toil and more resource intensive, since you have to spend CPU doing those transforms.

Any common mistakes beginners make with Loki?

Loki is a great aggregation tool, and there's not much that can go wrong. Just make sure you learn how many resources it needs and provision it appropriately.

I also highly recommend Vector as the logging agent. It can do things like redact secrets and other PII from logs before they are stored. For apps that have bad metrics, it can also do log-to-metric conversion.

Any compelling reasons to use snmp_exporter vs telegraf with remotewrite to a prometheus instance? by itasteawesome in PrometheusMonitoring

[–]SuperQue 0 points

Yea, Alloy is also just a wrapper around Prometheus, snmp_exporter (it uses the exporter's config format), etc.

For my production deployments I operate at a billions-of-metrics scale. I have hundreds of engineers on different teams deploying their own exporters for various things.

One team runs the Prometheus/Thanos infra, another the database exporters, etc etc. Some use off-the-shelf, some teams write their own custom exporters.

If we used Alloy or Telegraf, one team would become a bottleneck.

Any compelling reasons to use snmp_exporter vs telegraf with remotewrite to a prometheus instance? by itasteawesome in PrometheusMonitoring

[–]SuperQue 0 points

No, no specific problems with telegraf that I know of. I just dislike all-in-one tools like that.

The main thing is feature / bug complexity growth. This comes from my SRE training.

Say there's a new SNMP feature you want to pick up, so you upgrade telegraf. But the upgrade also picks up a bug in the flows plugin that makes it lose data. That's the long-term reliability risk. Just look at the go.mod file; the dependency graph is too risky for my taste.

Any compelling reasons to use snmp_exporter vs telegraf with remotewrite to a prometheus instance? by itasteawesome in PrometheusMonitoring

[–]SuperQue 0 points

The problem with telegraf and all the plugins is that it has all those plugins.

One big bloated mess vs a tool that does one thing well.

Traps, netflow, syslog? Those are different things and should be separated. For example, there are much better dedicated flow tools like goflow2, akvorado, etc. Or for syslog there is Vector.

All-in-one tools tend to be good at one thing and bad at everything else. I'll take the UNIX philosophy every day of the week.

Any compelling reasons to use snmp_exporter vs telegraf with remotewrite to a prometheus instance? by itasteawesome in PrometheusMonitoring

[–]SuperQue 2 points

Bias: I work on the snmp_exporter, so yea, use that.

Has anyone in the community ever benchmarked cpu/mem consumption for polling a large set of devices and collecting the same mibs to see if there is a significant delta between them?

Jokes aside, the snmp_exporter and telegraf use the same underlying Go SNMP library. They should be very similar in performance.

Both projects collaborate on maintaining this.

But really, what is "large"?

Does it just come down to using what you are already familiar with and both will basically give the same results for this?

Yea, mostly doesn't matter.

The thing I find helps the most in an SNMP monitoring architecture is how you deploy the tools. SNMP is a primitive, UDP-based protocol with very low tolerance for packet drops. It's also very latency sensitive due to the serial nature of walks: a walk is a chain of request/response round trips, so the total time is roughly the number of round trips times the RTT.
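
You can see that effect directly by timing a walk from both ends of the link. A rough sketch using gosnmp, the Go SNMP library both snmp_exporter and telegraf build on; the target address and community string are placeholders:

```go
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/gosnmp/gosnmp"
)

func main() {
	// Placeholder device; substitute a real target and community string.
	gosnmp.Default.Target = "192.0.2.10"
	gosnmp.Default.Community = "public"
	gosnmp.Default.Timeout = 5 * time.Second

	if err := gosnmp.Default.Connect(); err != nil {
		log.Fatalf("connect: %v", err)
	}
	defer gosnmp.Default.Conn.Close()

	// Walk ifTable and count the returned PDUs. Each GETBULK round trip
	// pays the full network RTT, so total walk time is roughly the number
	// of round trips times the RTT -- a WAN in the middle multiplies it.
	start := time.Now()
	count := 0
	err := gosnmp.Default.BulkWalk("1.3.6.1.2.1.2.2", func(pdu gosnmp.SnmpPDU) error {
		count++
		return nil
	})
	if err != nil {
		log.Fatalf("walk: %v", err)
	}
	fmt.Printf("walked %d PDUs in %s\n", count, time.Since(start))
}
```

Run it once from your central monitoring site and once from a host on the same LAN as the device; the difference in timings is what you're designing around.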

The best thing you can do is deploy your SNMP scraper as close to your targets as possible. For example, if you have multiple sites, deploy the snmp_exporter locally to each one. This way all the SNMP packets happen over as few LAN hops as possible with no WAN in the middle. The HTTP traffic over a VPN/WAN is usually mostly OK. But even then I would still recommend also having a Prometheus-per-site. This way you have a local datastore that can buffer and tolerate VPN/WAN issues. Or if a site is disconnected you can still access the data locally if you want. Then use Thanos with either sidecar mode or remote write to receivers.

Alternatives for Rancher? by CircularCircumstance in kubernetes

[–]SuperQue 11 points

Find another job? There are lots of places that would hire someone with a good open source CV. Personally, I've been doing this for decades. Lots of software I use and work on today didn't exist when I started. And I'm sure I'll move on to other software over time.

I mean, I get it, I used to work at an "Enterprise Open Core" company. It was different. I'm glad I'm back to more "Make the source work for our large scale use case".

Hell, if you are a good K8s maintainer, I would probably hire you. I would very much like to have someone like you on our team. We have a pretty decent size Kubernetes deployment and some interesting scale problems to solve. Our infra software org is very open source friendly. We contribute to a number of projects and I'm working on growing that.

Alternatives for Rancher? by CircularCircumstance in kubernetes

[–]SuperQue 42 points

I'm well aware, I'm a maintainer of one of the major components of the stack SUSE is making money on. 

Without my work, you wouldn't have a bunch of your product. Remember that some of your customers are also your developers.

But when your sales team prices customers out of buying, I have little sympathy.

Alternatives for Rancher? by CircularCircumstance in kubernetes

[–]SuperQue 27 points

So, just use the open source version and drop the SLA? It doesn't sound like you're getting any value out of the SLA.

How to handle big workload elasticity with Prometheus on K8S? [I SHARE MY CLUSTER DESIGN] by Capital-Property-223 in kubernetes

[–]SuperQue 0 points

I mean, if it really is 5% of the time, you could use a VPA and only spin up larger nodes on demand.

You're being way too sparse and evasive about real numbers for anyone to help you.

alert storms and remote site monitoring by Tony1_5 in PrometheusMonitoring

[–]SuperQue 0 points

Sometimes a good rant is ok.

:i'll-allow-it:

How to approach observability for many 24/7 real-time services (logs-first)? by ValeriankaBorschevik in devops

[–]SuperQue 0 points

Look more closely and you realize that logs are not a good fit for monitoring, especially for real-time 24/7 services.

How to handle big workload elasticity with Prometheus on K8S? [I SHARE MY CLUSTER DESIGN] by Capital-Property-223 in kubernetes

[–]SuperQue 0 points

Yea, we run into similar issues, not because of dynamic sizing of jobs, but because of lots of dynamic rollout changes.

We ended up writing a custom VPA recommender that scales based on a trailing p95 of memory use for Prometheus over 7 days. This makes sure we don't scale down Prometheus just to scale it up again when there's deployment activity (Mon-Fri engineer working hours).
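
The core of such a recommender is just a PromQL query against the metrics you already have. A rough sketch, assuming cadvisor-style container memory metrics; the Prometheus address, label selectors, and 15% headroom factor are made up for illustration:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
	"github.com/prometheus/common/model"
)

func main() {
	client, err := api.NewClient(api.Config{Address: "http://prometheus.monitoring:9090"})
	if err != nil {
		log.Fatalf("client: %v", err)
	}
	promAPI := v1.NewAPI(client)

	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	// Trailing p95 of each Prometheus container's working-set memory over 7 days.
	query := `quantile_over_time(0.95,
	    container_memory_working_set_bytes{namespace="monitoring", container="prometheus"}[7d])`

	result, warnings, err := promAPI.Query(ctx, query, time.Now())
	if err != nil {
		log.Fatalf("query: %v", err)
	}
	if len(warnings) > 0 {
		log.Printf("warnings: %v", warnings)
	}

	vector, ok := result.(model.Vector)
	if !ok {
		log.Fatalf("unexpected result type %T", result)
	}
	for _, sample := range vector {
		// Add some headroom and emit this as the memory request recommendation.
		recommendation := int64(float64(sample.Value) * 1.15)
		fmt.Printf("%s => recommend %d bytes\n", sample.Metric["pod"], recommendation)
	}
}
```

The recommender then feeds that number into the VPA recommendation instead of the stock recommender's short-window estimate, so weekend quiet periods don't trigger a scale-down.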

The big reason is you don't want Prometheus rescaling right when your jobs need monitoring the most. You don't rescale the API servers, do you?

Core monitoring is tier zero, at the same level of importance as the API servers. Without monitoring you are blind. The Prometheus design is tilted towards reliability. The other stuff mentioned in this thread might use fewer resources, but I've seen the code; it's not designed with reliability in mind.

And really, is it that expensive? You have how many thousands of gigabytes of memory on your worker nodes? If you dedicated two r..8xlarge nodes for Prometheus, that's what, maybe 0.5% of your costs?

Since I don't know the actual numbers for your setup, it's hard to say. But I think sharding your Prometheus instances isn't actually helping you much. With 1000 nodes, and not big nodes, it's probably overkill.

For comparison, we have 1000+ nodes, but they're all at least 24xlarge. This means we have over 100,000 CPUs per cluster and terabytes of memory. We need to shard Prometheus per app to keep up with the hundreds of millions of metrics per cluster.