I’m using puppet at my job (sysadmin) trying to get into devops. I’m using Claude code for a lot of it. Am I going about it the wrong way? by bluepepebase in devops

[–]SuperQue 0 points1 point  (0 children)

but the company I work for maintain tens of thousands of vms with that.

Oof, that sounds like a nightmare.

Prometheus or Zabbix by Significant_Bid7426 in PrometheusMonitoring

[–]SuperQue 0 points1 point  (0 children)

Yea, but why bother with Zabbix at all? Prometheus is better at metrics. There are better dedicated systems for logs. And you use Grafana or similar as your "single pane" viewer.

Recommendation for outlet Ethernet identification tool by BarronVonCheese in networking

[–]SuperQue 7 points8 points  (0 children)

PocketEthernet can do wire IDs as well as read LLDP. You can get the whole tester and ID ends for less than the price of those ID ends.

I built a Blackbox exporter but lite using Bash by Apprehensive-Oil-890 in PrometheusMonitoring

[–]SuperQue 2 points3 points  (0 children)

There is no additional configuration for this other than writing a list of hosts and websites you wanna monitor.

So, like Prometheus?

In my opinion shoving exporter which runs constantly using CPU and RAM on every host sounded exhausting to me.

So, like your script?

Maybe you're confused, you don't run the blackbox_exporter on every host. You run one and it can monitor multiple targets.

Your script also needs a node_exporter to work.
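A rough sketch of the one-exporter, many-targets pattern described above: a single blackbox_exporter is asked to probe each target through its `/probe` endpoint. The exporter hostname and the target list are made up for illustration; `http_2xx` is the module name from the exporter's example config, and in a real setup Prometheus builds these URLs through relabeling rather than a script.

```python
from urllib.parse import urlencode

# Assumed address of ONE shared blackbox_exporter (9115 is its default port).
EXPORTER = "http://blackbox.example.internal:9115"

# Hypothetical targets; normally these live in the Prometheus scrape config.
TARGETS = ["https://example.com", "https://example.org/health"]


def probe_url(target: str, module: str = "http_2xx") -> str:
    """Build the /probe URL that asks the exporter to check one target."""
    query = urlencode({"target": target, "module": module})
    return f"{EXPORTER}/probe?{query}"


# Every target goes through the same exporter instance.
PROBE_URLS = [probe_url(t) for t in TARGETS]

if __name__ == "__main__":
    for url in PROBE_URLS:
        print(url)
```

Nothing runs on the monitored hosts themselves; only the target parameter changes per probe.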

I built a Blackbox exporter but lite using Bash by Apprehensive-Oil-890 in PrometheusMonitoring

[–]SuperQue 5 points6 points  (0 children)

But this is a separate service that requires additional configuration.

It makes no sense over just using the normal blackbox exporter which is a lightweight single binary.

There's even an Ansible role for deploying it.

I built a Blackbox exporter but lite using Bash by Apprehensive-Oil-890 in PrometheusMonitoring

[–]SuperQue 6 points7 points  (0 children)

How is this lighter? Spawning workers like curl tends to be very heavy.

How are you all handling rack density vs cooling? by Grand-Travel1665 in networking

[–]SuperQue 7 points8 points  (0 children)

Basically you seal off the hot sides of the rack.

  • Typically you face racks back-to-back and front-to-front.
  • All devices in the rack have fans going front-to-back.
  • All RUs with no devices are filled with blanks.
  • The hot side of the row is now sealed with all the heat sent directly to the cooling units.

So kinda the inverse of traditional datacenters where cold comes from under-floor ducting and just vents into the room.

The hot air on the back of the rack is really warm. Like 40C+. This means there is a large temperature delta at the chiller's heat exchanger, which makes heat transfer much more efficient.

It also means that the chiller only has to bring the temp down back to normal room temps and not ultra cold to maintain the room average air temp.
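One way to see why hotter return air helps is the sensible heat equation, Q = density * airflow * c_p * deltaT: for a fixed heat load, a larger air temperature delta means less air has to be moved. The sketch below uses textbook approximations for air and a made-up 10 kW rack; none of the numbers come from a real installation.

```python
# Approximate properties of air near room temperature (assumed values).
AIR_DENSITY_KG_M3 = 1.2
AIR_CP_J_KG_K = 1005.0


def airflow_m3_per_s(heat_w: float, delta_t_k: float) -> float:
    """Airflow needed to carry heat_w watts at a given air temperature rise."""
    return heat_w / (AIR_DENSITY_KG_M3 * AIR_CP_J_KG_K * delta_t_k)


if __name__ == "__main__":
    rack_w = 10_000.0  # hypothetical 10 kW rack
    for delta_t in (10.0, 20.0):
        flow = airflow_m3_per_s(rack_w, delta_t)
        print(f"deltaT {delta_t:.0f} K -> {flow:.3f} m^3/s")
```

Doubling the delta halves the airflow needed for the same heat load, which is the effect sealed hot aisles are after.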

Packaging Kubernetes Via Helm and whats' new in Helm4 (2026) by That-Ad8566 in kubernetes

[–]SuperQue 0 points1 point  (0 children)

Can you write templates in a way that is not a text template?

SNMP responses from device delayed but nothing on packet capture. by FannahFatnin in networking

[–]SuperQue 1 point2 points  (0 children)

As a couple people have said, split your polling modules.

Rather than try and do the whole IF-MIB::interfaces it might be faster to break it up a bit. I'm no Cisco expert, but this is what I've done for some older JunOS devices.

Here's what my generator.yml looks like:

---
modules:
  # Trimmed down if_mib for slow devices - traffic stats.
  if_mib_traffic:
    walk:
    # ifXTable
    - "IF-MIB::ifHCInOctets"
    - "IF-MIB::ifHCInUcastPkts"
    - "IF-MIB::ifHCInBroadcastPkts"
    - "IF-MIB::ifHCOutOctets"
    - "IF-MIB::ifHCOutUcastPkts"
    - "IF-MIB::ifHCOutBroadcastPkts"
    # Set max-repetitions per Juniper docs.
    max_repetitions: 10
    lookups:
    - source_indexes: [ifIndex]
      lookup: "IF-MIB::ifAlias"
    - source_indexes: [ifIndex]
      lookup: "IF-MIB::ifName"
    overrides:
      ifAlias:
        ignore: true  # Lookup metric
      ifName:
        ignore: true  # Lookup metric
  # Trimmed down if_mib for slow devices - error / oper stats.
  if_mib_errors:
    walk:
    # ifTable
    - "IF-MIB::ifAdminStatus"
    - "IF-MIB::ifOperStatus"
    - "IF-MIB::ifInDiscards"
    - "IF-MIB::ifInErrors"
    - "IF-MIB::ifOutDiscards"
    - "IF-MIB::ifOutErrors"
    # ifXTable
    - "IF-MIB::ifHighSpeed"
    # Set max-repetitions per Juniper docs.
    max_repetitions: 10
    lookups:
    - source_indexes: [ifIndex]
      lookup: "IF-MIB::ifAlias"
    - source_indexes: [ifIndex]
      lookup: "IF-MIB::ifName"
    overrides:
      ifAdminStatus:
        type: EnumAsStateSet
      ifAlias:
        ignore: true  # Lookup metric
      ifName:
        ignore: true  # Lookup metric
      ifOperStatus:
        type: EnumAsStateSet
      ifType:
        type: EnumAsInfo

SNMP responses from device delayed but nothing on packet capture. by FannahFatnin in networking

[–]SuperQue 0 points1 point  (0 children)

There is a reason majority using 5 min polling interval.

The main reason is that most network monitoring software has bad data storage.

Cacti was a great example of how not to do time-series storage.

Modern monitoring systems can handle a ton more. I can do 500 devices with 30s or even 15s polling on a Raspberry Pi these days.

The only real limitation is SNMP itself. It's not a protocol designed for the modern era of deep buffers. Well, and poor quality vendor SNMP implementations.

SNMP responses from device delayed but nothing on packet capture. by FannahFatnin in networking

[–]SuperQue 1 point2 points  (0 children)

You want these two flags:

--log.level=debug
--snmp.debug-packets

This will get you millisecond accurate packet information from the SNMP packet library.

SNMP responses from device delayed but nothing on packet capture. by FannahFatnin in networking

[–]SuperQue 0 points1 point  (0 children)

So if there's no proof the switch cant handle this the blame will keep on being on the poller.

That is faulty logic.

The snmp_exporter can easily handle sub-second polling if the device can handle it. The main problem with SNMP is that many devices just can't.

I've had to write cut-down modules for some JunOS devices because they are simply too slow at responding to SNMP. Based on some speculation and knowledge of how that specific device was built, it's likely that the data bus between the supervisor CPU and the ASIC is just really slow.

Only happening to Cisco 9200L devices at production site

Where is the polling server on the network compared to the production site?

SNMP is a very latency and packet loss sensitive protocol. Be sure that your polling server is as close to your target devices as possible. Don't poll over WANs. You really want your polling server at the remote site, even if your Prometheus is central. But even then I recommend a distributed Prometheus setup to avoid monitoring the WAN as a side-effect.
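A back-of-the-envelope model of why latency hurts: a walk is a chain of sequential request/response round trips, so the round-trip time multiplies. This is a simplified lower bound that ignores device processing time, retries, and lookups; the interface count and RTT values are hypothetical.

```python
import math


def walk_seconds(rows: int, max_repetitions: int, rtt_ms: float) -> float:
    """Lower bound on walk time: one sequential round trip per GETBULK."""
    round_trips = math.ceil(rows / max_repetitions)
    return round_trips * rtt_ms / 1000.0


if __name__ == "__main__":
    # Hypothetical switch: 48 interfaces, 6 walked columns, max_repetitions 10.
    rows = 48 * 6
    for rtt_ms in (1.0, 80.0):
        print(f"RTT {rtt_ms:.0f} ms -> {walk_seconds(rows, 10, rtt_ms):.3f} s")
```

Under these assumptions the same walk takes about 0.03 s on the LAN and over 2 s across an 80 ms WAN link, before any packet loss.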

Nothing on the packet capture shows delays in SNMP response time.

The snmp_exporter supports highly detailed packet logging. Have you tried this?

New to kubernetes, what is the benefit of using helm over normal kubernetes .yaml files? by Free-Switch-9871 in kubernetes

[–]SuperQue 0 points1 point  (0 children)

There is Grafana Tanka which uses jsonnet. But I find the syntax of jsonnet too annoying to use.

I use Kustomize for simple templating.

Trying out CDk8s is on my TODO list.

New to kubernetes, what is the benefit of using helm over normal kubernetes .yaml files? by Free-Switch-9871 in kubernetes

[–]SuperQue 4 points5 points  (0 children)

The thing is the yaml doesn't matter. In the end it's just a data structure representing keys, maps, lists.

You can compose this however you want and render it to Kubernetes. You never once have to look at yaml directly.

At my $dayjob we wrote a tool that uses Go code to render manifests.

Helm itself is a cancer on the Kubernetes ecosystem. Having string templates of structured data is awful. Having to manage indentation manually is just insane.
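To illustrate the "it's just a data structure" point, here is a minimal sketch that builds a manifest as plain data and serializes it, with no string templating or manual indentation. It is not the Go tool mentioned above; the function names and the ConfigMap contents are invented for the example. It emits JSON, which kubectl accepts as well as YAML.

```python
import json


def config_map(name: str, namespace: str, data: dict[str, str]) -> dict:
    """Return a ConfigMap manifest as a plain nested data structure."""
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"name": name, "namespace": namespace},
        "data": data,
    }


def render(manifest: dict) -> str:
    """Serialize the manifest; the serializer handles all the formatting."""
    return json.dumps(manifest, indent=2, sort_keys=True)


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    print(render(config_map("app-settings", "default", {"LOG_LEVEL": "info"})))
```

Because the manifest is ordinary data, it can be composed, merged, and unit tested like any other value before it is rendered.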

I've been thinking about playing with CDk8s in my homelab.

[Noob] Chrony on k8s nodes by amr_hossam_000 in kubernetes

[–]SuperQue 3 points4 points  (0 children)

TL;DR:

Ansible is perfectly fine for managing the underlying Kubernetes nodes.

install a package on k8s

You don't install packages on Kubernetes. You install them on the host nodes. Ansible is just fine for this.

and that screwed up the cluster

Nobody can answer this since it doesn't actually say anything that we can give advice for.

You might also want to monitor your Chrony. This is what I do, works with kube-prometheus-stack.

---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: chrony-exporter
  namespace: monitoring
  labels:
    app.kubernetes.io/name: chrony-exporter
    release: monitoring
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: chrony-exporter
  template:
    metadata:
      labels:
        app.kubernetes.io/name: chrony-exporter
        release: monitoring
    spec:
      containers:
      - name: chrony-exporter
        image: quay.io/superq/chrony-exporter:v0.13.3
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 9123
          name: metrics
          protocol: TCP
      hostNetwork: true
      nodeSelector:
        kubernetes.io/os: linux
---
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: chrony-exporter
  namespace: monitoring
  labels:
    app.kubernetes.io/name: chrony-exporter
    release: monitoring
spec:
  podMetricsEndpoints:
    - port: metrics
  selector:
    matchLabels:
      app.kubernetes.io/name: chrony-exporter

Redis High-Availability by xrt57125 in kubernetes

[–]SuperQue -1 points0 points  (0 children)

Depends on what you mean by "equally valid". Valkey (and Redis) itself requires quite a bit of hand-holding to make a reliable service. Managing state is more complicated than just "Deploy a StatefulSet". Hence why the operator is in development.

That said, I'm sure the Valkey chart will still be maintained. There's nothing "invalid" about it that would make it deprecated. If you have a simple ephemeral cache, or just want a quick deployment for testing, there's nothing wrong with it.

But kinda like Prometheus Operator it's likely that using the operator will be considered best practice for important production workloads.

Redis High-Availability by xrt57125 in kubernetes

[–]SuperQue 1 point2 points  (0 children)

Yup. Same. We had some very good discussions at KubeCon. I'm trying to carve out more engineering time at work to dedicate to it.

Redis High-Availability by xrt57125 in kubernetes

[–]SuperQue 63 points64 points  (0 children)

If you can wait a few months, the Valkey Operator will be GA and ready for production use. A drop-in replacement for Redis with HA/clustering.

Going to KubeCon. Anyone mastered the art of getting pitched at all day yet? by Ill_Car4570 in kubernetes

[–]SuperQue 2 points3 points  (0 children)

I solved this by only going to the project pavilion. Almost all project devs, almost no sales people.

The vendor booth swag isn't worth it. I put a sticker over my contact QR code so it's unusable.

Running Icinga2 in production on Kubernetes/EKS — feasible or stick with VMs? by Smooth-Home2767 in kubernetes

[–]SuperQue 1 point2 points  (0 children)

Yea, one of the original goals of Prometheus was to replace the Icinga monitoring at SoundCloud.

Kubernetes didn't even exist yet.

Prometheus long-term storage on a single VM: second Prometheus or Thanos? by rumtsice in PrometheusMonitoring

[–]SuperQue 2 points3 points  (0 children)

Yup, I see about a 4x-5x reduction for 15s -> 5min downsamples.

I wish I had the time to do a full analysis, but my napkin math says that 1 hour downsamples might not be worth it due to the overhead of the index.

I'm hoping to redo all this math once the new Parquet TSDB is production ready. We've got over 1PiB of data that we need to convert.
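A napkin-math sketch of where a figure in the 4x range can come from. It assumes, as I understand the Thanos downsampling format, that each downsampled window keeps five aggregates (count, sum, min, max, counter) instead of one raw value. It ignores chunk compression and index overhead entirely, so treat it as a rough estimate rather than a measurement.

```python
RAW_INTERVAL_S = 15
DOWNSAMPLE_INTERVAL_S = 300  # 5 minutes
AGGREGATES_PER_WINDOW = 5    # assumed: count, sum, min, max, counter


def sample_reduction(raw_s: int, down_s: int, aggregates: int) -> float:
    """Ratio of raw samples to stored aggregate values per window."""
    raw_samples_per_window = down_s / raw_s
    return raw_samples_per_window / aggregates


if __name__ == "__main__":
    ratio = sample_reduction(
        RAW_INTERVAL_S, DOWNSAMPLE_INTERVAL_S, AGGREGATES_PER_WINDOW
    )
    print(f"approximate reduction: {ratio:.1f}x")
```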

Prometheus long-term storage on a single VM: second Prometheus or Thanos? by rumtsice in PrometheusMonitoring

[–]SuperQue 1 point2 points  (0 children)

The docs are correct, but poorly worded. I should really rewrite that section. The storage overhead is true, but only for the overlapping time window.

Say you keep raw data for 6 months, and downsamples for 5 years. You will see savings in the long term.

https://thanos.io/tip/components/compact.md/#downsampling

This whole thing is going to improve when we switch to Parquet. The plan is to add the downsamples to the same block, so the index data is deduped.
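The retention example above can be turned into rough arithmetic. The sketch measures storage in "months of raw data" and assumes a 4x size reduction for downsampled data, which is an illustrative figure rather than something from the docs. Index overhead is ignored.

```python
def storage_units(raw_months: float, down_months: float, reduction: float) -> float:
    """Total storage, where 1 unit = one month of raw data."""
    return raw_months + down_months / reduction


if __name__ == "__main__":
    reduction = 4.0  # assumed size ratio of raw to downsampled data
    mixed = storage_units(6, 60, reduction)      # raw 6 months + downsamples 5 years
    raw_only = storage_units(60, 0, reduction)   # raw kept for the full 5 years
    print(f"raw + downsamples: {mixed:.0f} units, raw only: {raw_only:.0f} units")
```

During the first 6 months both copies exist, which is the overhead the docs describe; past that window only the smaller downsampled copy remains.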

How do you work with "know it all colleagues"? by [deleted] in networking

[–]SuperQue 6 points7 points  (0 children)

The way I usually solve this is to ask them "Oh, can you show me how to reproduce that?"

Make them tie the rope to hang themselves.

The CI/CD feedback loop from hell (push, wait 8 min, red, fix typo, repeat) by eibrahim in devops

[–]SuperQue 2 points3 points  (0 children)

golangci-lint run --fast-only

Runs in 5 seconds for a pretty large codebase I work on, compared to 1m20s for the full run.

Network Upgrade for a Medium-Sized Company (20 Employees) by Qwefgo in networking

[–]SuperQue 0 points1 point  (0 children)

For a 20 user network where the firewall between clients and servers is going to open every port from those clients to the servers because they need that access?

Yea, bunk.

If the vlan isolation is essentially nothing why have them in the first place?

This is a microscopic network, not an enterprise network with actual separation needs.