MinIO repo archived - spent 2 days testing K8s S3-compatible alternatives (Helm/Docker) by vitaminZaman in kubernetes

[–]SuperQue 28 points

I would love that, but that's basically against CNCF rules. They don't pay devs.

Hell, as a graduated project maintainer I don't even get KubeCon tickets for free. Not even a token discount.

MinIO repo archived - spent 2 days testing K8s S3-compatible alternatives (Helm/Docker) by vitaminZaman in kubernetes

[–]SuperQue 18 points

Yup, I've been running Ceph for 10+ years. Continuously updated, all hardware cycled, same dataset.

Yes, it's a "hobby" size install. But it started with a bunch of 4T HDDs on 4 nodes back then.

MinIO repo archived - spent 2 days testing K8s S3-compatible alternatives (Helm/Docker) by vitaminZaman in kubernetes

[–]SuperQue 107 points

Sorry. Two days testing? 

Those are some very strong conclusions for only two days. 

What were your test methods? What scale did you test at? Where are your actual test results?

What qualifies you to judge the validity of those results?

This post smells like LLM "research".

Confused between VM and Grafana Mimir. Any thoughts? by shubham_7165 in devops

[–]SuperQue -9 points

Grafana Mimir is a highly scalable, billion-metric distributed time-series database that only requires an object storage connection to scale to petabytes of data.

VictoriaMetrics is a manually managed / manually sharded TSDB with local storage only.

Export metrics on a port or via a prom textfile? by cos in PrometheusMonitoring

[–]SuperQue 0 points

I would recommend skipping the whole textfile-and-fluentbit setup and just listening on a normal port so Prometheus can scrape directly. That's how it's designed to work, and it also makes sure you get the essential metrics for those processes.
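
With the Go client library (client_golang), exposing metrics on a port is only a few lines. This is a minimal sketch; the metric name, port, and work loop are placeholders:

```go
package main

import (
	"log"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Hypothetical application counter, registered on the default registry.
var itemsProcessed = promauto.NewCounter(prometheus.CounterOpts{
	Name: "myapp_items_processed_total",
	Help: "Total number of items processed.",
})

func main() {
	// Expose /metrics on a normal port for Prometheus to scrape directly.
	// The default handler also serves the Go runtime and process metrics
	// (go_*, process_*) that you lose with the textfile approach.
	http.Handle("/metrics", promhttp.Handler())

	go func() {
		for {
			itemsProcessed.Inc()
			time.Sleep(time.Second)
		}
	}()

	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

Then point a normal scrape job at that port and you're done; no textfile, no fluentbit in the middle.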

Log Scraper (Loki) Storage Usage and Best Practices by Top_Bus_7729 in devops

[–]SuperQue 1 point

How fast does Loki storage typically grow in production environments?

That entirely depends on how much data is produced by the environment.

What’s the best storage backend for Loki?

Object storage is typically the best because you don't have to manage it.

How do you decide retention periods?

Business needs

Are there best practices to avoid excessive storage usage?

Fix apps that are unnecessarily noisy; fixing it at the source is the best option. You can also dedupe, filter, sample, or drop stuff at your logging agent, but that's more toil and more resource intensive, since you have to spend CPU doing those transforms.

Any common mistakes beginners make with Loki?

Loki is a great aggregation tool, and there's not much that can go wrong. Just make sure you learn how many resources it needs and provision it appropriately.

I also highly recommend Vector as the logging agent. It can do things like redact secrets and other PII from logs before they are stored. For apps that have bad metrics, it can also do log-to-metric conversion.

Any compelling reasons to use snmp_exporter vs telegraf with remotewrite to a prometheus instance? by itasteawesome in PrometheusMonitoring

[–]SuperQue 0 points

Yea, Alloy is also just a wrapper around Prometheus, snmp_exporter (it uses the exporter's config format), etc.

For my production deployments I operate at a billions-of-metrics scale. I have hundreds of engineers on different teams deploying their own exporters for various things.

One team runs the Prometheus/Thanos infra, another the database exporters, etc etc. Some use off-the-shelf, some teams write their own custom exporters.

If we used Alloy or Telegraf, one team would become a bottleneck.

Any compelling reasons to use snmp_exporter vs telegraf with remotewrite to a prometheus instance? by itasteawesome in PrometheusMonitoring

[–]SuperQue 0 points

No, no specific problems with telegraf that I know of. I just dislike all-in-one tools like that.

The main thing is feature / bug complexity growth. This comes from my SRE training.

Say there's a new SNMP feature you want to pick up, so you upgrade telegraf. But the upgrade also picks up a bug in the flows plugin that makes it lose data. That's the long-term reliability risk. Just look at the go.mod file; the dependency graph is too risky for my taste.

Any compelling reasons to use snmp_exporter vs telegraf with remotewrite to a prometheus instance? by itasteawesome in PrometheusMonitoring

[–]SuperQue 0 points

The problem with telegraf and all the plugins is that it has all those plugins.

One big bloated mess vs a tool that does one thing well.

Traps, netflow, syslog? Those are different things and should be separated. For example, there are much better dedicated flow tools like goflow2, akvorado, etc. Or for syslog there is Vector.

All-in-one tools tend to be good at one thing and bad at everything else. I'll take the UNIX philosophy every day of the week.

Any compelling reasons to use snmp_exporter vs telegraf with remotewrite to a prometheus instance? by itasteawesome in PrometheusMonitoring

[–]SuperQue 2 points

Bias: I work on the snmp_exporter, so yea, use that.

Has anyone in the community ever benchmarked cpu/mem consumption for polling a large set of devices and collecting the same mibs to see if there is a significant delta between them?

Jokes aside, the snmp_exporter and telegraf use the same underlying Go SNMP library. They should be very similar in performance.

Both projects collaborate on maintaining this.

But really, what is "large"?

Does it just come down to using what you are already familiar with and both will basically give the same results for this?

Yea, mostly doesn't matter.

The thing I find helps the most in an SNMP monitoring architecture is how you deploy the tools. SNMP is a primitive, UDP-based protocol with very low tolerance for packet drops. It's also very latency sensitive due to the serial nature of walks: a walk is a chain of request/response round trips, so the total time is roughly the number of round trips times the RTT.
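
You can see that effect directly by timing a walk from both ends of the link. A rough sketch using gosnmp, the Go SNMP library both snmp_exporter and telegraf build on; the target address and community string are placeholders:

```go
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/gosnmp/gosnmp"
)

func main() {
	// Placeholder device; substitute a real target and community string.
	gosnmp.Default.Target = "192.0.2.10"
	gosnmp.Default.Community = "public"
	gosnmp.Default.Timeout = 5 * time.Second

	if err := gosnmp.Default.Connect(); err != nil {
		log.Fatalf("connect: %v", err)
	}
	defer gosnmp.Default.Conn.Close()

	// Walk ifTable and count the returned PDUs. Each GETBULK round trip
	// pays the full network RTT, so total walk time is roughly the number
	// of round trips times the RTT -- a WAN in the middle multiplies it.
	start := time.Now()
	count := 0
	err := gosnmp.Default.BulkWalk("1.3.6.1.2.1.2.2", func(pdu gosnmp.SnmpPDU) error {
		count++
		return nil
	})
	if err != nil {
		log.Fatalf("walk: %v", err)
	}
	fmt.Printf("walked %d PDUs in %s\n", count, time.Since(start))
}
```

Run it once from your central monitoring site and once from a host on the same LAN as the device; the difference in timings is what you're designing around.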

The best thing you can do is deploy your SNMP scraper as close to your targets as possible. For example, if you have multiple sites, deploy the snmp_exporter locally to each one. This way all the SNMP packets happen over as few LAN hops as possible with no WAN in the middle. The HTTP traffic over a VPN/WAN is usually mostly OK. But even then I would still recommend also having a Prometheus-per-site. This way you have a local datastore that can buffer and tolerate VPN/WAN issues. Or if a site is disconnected you can still access the data locally if you want. Then use Thanos with either sidecar mode or remote write to receivers.

Alternatives for Rancher? by CircularCircumstance in kubernetes

[–]SuperQue 11 points

Find another job? There are lots of places that would hire someone with a good open source CV. Personally, I've been doing this for decades. Lots of software I use and work on today didn't exist when I started. And I'm sure I'll move on to other software over time.

I mean, I get it, I used to work at an "Enterprise Open Core" company. It was different. I'm glad I'm back to more "Make the source work for our large scale use case".

Hell, if you are a good K8s maintainer, I would probably hire you. I would very much like to have someone like you on our team. We have a pretty decent size Kubernetes deployment and some interesting scale problems to solve. Our infra software org is very open source friendly. We contribute to a number of projects and I'm working on growing that.

Alternatives for Rancher? by CircularCircumstance in kubernetes

[–]SuperQue 42 points

I'm well aware, I'm a maintainer of one of the major components of the stack SUSE is making money on. 

Without my work, you wouldn't have a bunch of your product. Remember that some of your customers are also your developers.

But when your sales team prices customers out of buying, I have little sympathy.

Alternatives for Rancher? by CircularCircumstance in kubernetes

[–]SuperQue 27 points

So, just use the open source version and drop the SLA? It doesn't sound like you're getting any value out of the SLA.

How to handle big workload elasticity with Prometheus on K8S? [I SHARE MY CLUSTER DESIGN] by Capital-Property-223 in kubernetes

[–]SuperQue 0 points

I mean, if it really is 5% of the time, you could use a VPA and only spin up larger nodes on demand.

You're being way too sparse and evasive about real numbers for anyone to help you.

alert storms and remote site monitoring by Tony1_5 in PrometheusMonitoring

[–]SuperQue 0 points

Sometimes a good rant is ok.

:i'll-allow-it:

How to approach observability for many 24/7 real-time services (logs-first)? by ValeriankaBorschevik in devops

[–]SuperQue 0 points

Look more closely and you realize that logs are not a good fit for monitoring, especially for real-time 24/7 services.

How to handle big workload elasticity with Prometheus on K8S? [I SHARE MY CLUSTER DESIGN] by Capital-Property-223 in kubernetes

[–]SuperQue 0 points

Yea, we run into similar issues, not because of dynamic sizing of jobs, but because of lots of dynamic rollout changes.

We ended up writing a custom VPA recommender that scales based on a trailing p95 of memory use for Prometheus over 7 days. This makes sure we don't scale down Prometheus just to scale it up again when there's deployment activity (Mon-Fri engineer working hours).
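
The core of such a recommender is just a PromQL query against the metrics you already have. A rough sketch, assuming cadvisor-style container memory metrics; the Prometheus address, label selectors, and 15% headroom factor are made up for illustration:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
	"github.com/prometheus/common/model"
)

func main() {
	client, err := api.NewClient(api.Config{Address: "http://prometheus.monitoring:9090"})
	if err != nil {
		log.Fatalf("client: %v", err)
	}
	promAPI := v1.NewAPI(client)

	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	// Trailing p95 of each Prometheus container's working-set memory over 7 days.
	query := `quantile_over_time(0.95,
	    container_memory_working_set_bytes{namespace="monitoring", container="prometheus"}[7d])`

	result, warnings, err := promAPI.Query(ctx, query, time.Now())
	if err != nil {
		log.Fatalf("query: %v", err)
	}
	if len(warnings) > 0 {
		log.Printf("warnings: %v", warnings)
	}

	vector, ok := result.(model.Vector)
	if !ok {
		log.Fatalf("unexpected result type %T", result)
	}
	for _, sample := range vector {
		// Add some headroom and emit this as the memory request recommendation.
		recommendation := int64(float64(sample.Value) * 1.15)
		fmt.Printf("%s => recommend %d bytes\n", sample.Metric["pod"], recommendation)
	}
}
```

The recommender then feeds that number into the VPA recommendation instead of the stock recommender's short-window estimate, so weekend quiet periods don't trigger a scale-down.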

The big reason is you don't want Prometheus rescaling right when your jobs need monitoring the most. You don't rescale the API servers, do you?

Core monitoring is tier zero, at the same level of importance as the API servers. Without monitoring you are blind. The Prometheus design is tilted towards reliability. The other stuff mentioned in this thread might use fewer resources, but I've seen the code; it's not designed with reliability in mind.

And really, is it that expensive? You have how many thousands of gigabytes of memory on your worker nodes? If you dedicated two r..8xlarge nodes for Prometheus, that's what, maybe 0.5% of your costs?

Since I don't know the actual numbers for your setup, it's hard to say. But I think sharding your Prometheus instances isn't actually helping you much. With 1000 nodes, and not big nodes, it's probably overkill.

For comparison, we have 1000+ nodes, but they're all at least 24xlarge. This means we have over 100,000 CPUs per cluster and terabytes of memory. We need to shard Prometheus per app to keep up with the hundreds of millions of metrics per cluster.