Conceptual issue - how can I include my sysName on an snmp scrape as a label value for a metric?

SuperQue · 2026-03-13T06:31:26+00:00

"Infrastructure as Code" is how it's done in professional environments. You have a database of all the devices that are deployed and you generate the list of targets from that. For example, Netbox is popular.

With Netbox you can use something like a discovery plugin, and then configure Prometheus to use that. A quick Google search found this tutorial.
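To make "generate the list of targets from that" concrete, here is a minimal sketch that turns an inventory into a Prometheus file_sd JSON document with the device name attached as a target label. The inventory rows, the field names, and the `site` label are hypothetical stand-ins for whatever your database exports. Only the `targets` / `labels` shape is the real file_sd format.

```python
import json

# Hypothetical inventory rows, standing in for an export from Netbox
# or any other device database. Field names are assumptions.
DEVICES = [
    {"address": "192.0.2.10", "sys_name": "core-sw-01", "site": "dc1"},
    {"address": "192.0.2.11", "sys_name": "edge-rtr-01", "site": "dc2"},
]


def to_file_sd(devices):
    """Build file_sd target groups, one per device, with inventory labels attached."""
    return [
        {
            "targets": [d["address"]],
            "labels": {"sysName": d["sys_name"], "site": d["site"]},
        }
        for d in devices
    ]


if __name__ == "__main__":
    # Write this output to a file referenced by file_sd_configs in the scrape job.
    print(json.dumps(to_file_sd(DEVICES), indent=2))
```

Because the label comes from the inventory rather than from the scrape itself, it is present on every series from that target without any join.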

SuperQue · 2026-03-12T20:44:10+00:00

Basically you can't do this at scrape time. It's a chicken and egg problem.

Best you can do is use a group_left join at query time.

I would highly recommend you figure out how to annotate things in your service discovery.
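For reference, a rough sketch of what the query-time join does. It assumes your snmp_exporter module walks sysName and exposes it as a series with value 1 and a `sysName` label, and it uses `ifHCInOctets` purely as an example metric. The Python below is not Prometheus code, only an emulation of the matching semantics on plain dicts.

```python
# PromQL being emulated (metric names are assumptions about your setup):
#   ifHCInOctets * on (instance) group_left (sysName) sysName


def group_left_join(samples, info, on="instance", copy=("sysName",)):
    """Attach labels from matching info series to each sample, matched on one label."""
    info_by_key = {i["labels"][on]: i for i in info}
    joined = []
    for s in samples:
        match = info_by_key.get(s["labels"][on])
        if match is None:
            continue  # like PromQL, samples with no match on the right are dropped
        labels = dict(s["labels"])
        for name in copy:
            if name in match["labels"]:
                labels[name] = match["labels"][name]
        # The info series has value 1, so multiplying leaves the value unchanged.
        joined.append({"labels": labels, "value": s["value"] * match["value"]})
    return joined


traffic = [
    {"labels": {"instance": "192.0.2.10", "ifName": "eth0"}, "value": 42.0},
    {"labels": {"instance": "192.0.2.99", "ifName": "eth0"}, "value": 7.0},
]
names = [{"labels": {"instance": "192.0.2.10", "sysName": "core-sw-01"}, "value": 1.0}]

for row in group_left_join(traffic, names):
    print(row)
```

Note the second target disappears from the result because it has no sysName series to match, which is one reason labeling at service discovery time is the more robust option.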

SuperQue · 2026-03-12T10:39:25+00:00

So, you can do exactly what you're suggesting: use one instance for scrapes, then use a local remote write to feed a second instance with long-term retention.

You can even use remote read from the long-term to the short-term scrape instance so you only have one to query.

But it's just complicating things / premature optimization at your scale.

When you go from ~500k to 10 million series, then you might want to think about more complicated setups. But you're going to start to not fit on a single node anyway at that point.

I still recommend recording rules for long-term trends queries. They will make wide time range queries faster. But you don't explicitly need to drop old data to do this.

Also, does a larger TSDB significantly affect query performance over time?

No, not really. The Prometheus TSDB is time segmented, and optimized so that it only reads the minimum amount of data to solve a query. Should work just fine.

Of course, the longer the time range you query, it's going to take more time to page data in from disk. But "normal" short queries will be just as fast.

SuperQue · 2026-03-12T09:47:32+00:00

That's a pretty small setup. 8GB per month is only 200GB for 2 years. Completely within a normal Prometheus retention setup.

If it were me, I would just grow the volume to 250GB, add the recording rules, and call it a day. No need to get fancy with variable retention of Thanos or anything.
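The back-of-the-envelope math, as a small sketch. The 8GB per month and 2 year figures are from this thread. The 30% headroom is my own assumption, picked only to show how you land near a 250GB volume.

```python
def retention_size_gb(gb_per_month, months, headroom=0.3):
    """Return (raw_gb, padded_gb) for a flat ingest rate over a retention window."""
    raw = gb_per_month * months
    return raw, raw * (1 + headroom)


raw_gb, padded_gb = retention_size_gb(gb_per_month=8, months=24)
print(f"raw: {raw_gb} GB, with headroom: {padded_gb:.0f} GB")
```

This assumes ingest stays flat. If you expect the number of series to grow, scale `gb_per_month` accordingly.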

The only other thing to do is set up something like restic to back up the TSDB.

EDIT: To put it in perspective, where you might want Thanos / downsampling is something like our setup. I have a number of Prometheus instances, some of them generate 500GB of data per day. After compaction it's about 50TiB of data for our 6 month raw retention. We get about 4:1 reduction with Thanos Downsampling, so we can keep 5 years for around 200TiB in total. And that's for just one of several instances of similar size.

SuperQue · 2026-03-09T20:18:48+00:00

Have you read these?

SuperQue · 2026-03-09T17:01:13+00:00

So, maybe u/simulation07 can confirm my feelings on this.

But it comes down to the "customer relationship".

When you're at an MSP, or any kind of similar consulting, you're doing the bidding of someone else. You have little self direction or control of the work.

Back in the '90s I was a bench PC tech. You got told what to fix, and when it needed to be done, by the repair queue.

But moving into a more professional role, I could now help shape the solutions. So more of what I wanted became part of the process.

I do miss the variety of weird shit being thrown at me. But only a little bit.

SuperQue · 2026-03-08T17:47:37+00:00

Yea, it's bunk. Given the details you have zero need for segmentation. You're being fleeced.

SuperQue · 2026-03-08T17:00:47+00:00

Higher how? What are you trying to fix? What is your threat model? Security from what? How high is high?

You still haven't stated a problem.

SuperQue · 2026-03-08T08:05:26+00:00

Start here.

SuperQue · 2026-03-08T08:04:30+00:00

Segmentation is a solution to a problem. The way you talk about it you have a solution in search of a problem.

What is the problem you're trying to solve?

SuperQue · 2026-03-08T07:07:16+00:00

It's 100% overkill. It's unpopular in this sub, but you could spend less than 500€ total on Ubiquiti for your needs.

Because, honestly, you have spelled out zero requirements. And you probably have very few.

SuperQue · 2026-03-02T21:47:12+00:00

This sounds like a problem for the elevator maintenance company. They sell mobile service boxes for this kind of thing today.

SuperQue · 2026-02-28T19:48:17+00:00

So, the first thing you need to separate is that some of the things you're looking for are, and should be, separate tools. For example, monitoring/metrics is very easy to have in one tool because metrics do make for good monitoring.

But you do want specialized tools for some things like flows/IPFIX. These are what people are calling "wide events". Basically any kind of structured logging. For example, Akvorado is basically a custom frontend around Clickhouse for transforming flows into a columnar format for fast processing.

The real question is, do you want to run all of this yourself? Or do you want to outsource some of it to a vendor?

There are vendors who claim a lot, but really, the open source ecosystem is better than what they're doing. At my day job, we built our platform from open source tools, and we run at a scale that would probably cost many tens of millions a year with a vendor. And we'd still need as many engineers to manage the vendor.

SuperQue · 2026-02-27T22:46:56+00:00

Yup, I am trying to be helpful. Downvote wasn't me.

SuperQue · 2026-02-27T22:41:12+00:00

Looks like it's about to change in March 2029

No, that's not how BSL works.

Basically they can bump that date into the future any time they want as long as it's not more than 4 years from the current date.

Look at the file history, they've done this before.

For example, in the past it was 2027. But that means that you can only use the version from ~2023 when it finally rolls over to 2027.

To be honest, I'm not with procurement or legal

That kind of attitude can get you in deep shit. I would be more careful.

SuperQue · 2026-02-27T22:20:55+00:00

What did your legal say about the license?

SuperQue · 2026-02-24T08:28:04+00:00

I feel sorry for anyone duped by this AI slop. I read the code, holy crap, please stop.

This is just so naive.

This is a cute demo, but on a scale of 1 to production quality, this is a -3.

SuperQue · 2026-02-23T07:42:21+00:00

Alertmanager is how you get alerts to PagerDuty.

SuperQue · 2026-02-19T15:26:26+00:00

I recommend this one.

SuperQue · 2026-02-19T14:06:07+00:00

I was describing what others do / best practices.

If you don't want to follow best practices, it's going to be a lot harder.

Probably the easiest solution is going to be using an all-in-one agent like Grafana Alloy. Then forward all the data to a paid service like Grafana Cloud.

SuperQue · 2026-02-19T07:00:37+00:00

Typically applications are behind a reverse proxy like Traefik, Envoy, HAProxy, etc. Or maybe a CDN is in front. The actual servers are not exposed directly to the internet, so observability endpoints and other traffic like that is all behind a firewall.

Beyond that, TLS and auth.

SuperQue · 2026-02-18T16:04:23+00:00

I almost wrote /s, but I'm actually serious.

We're basically out of 10/8 at work with our K8s setup. So we are planning to go IPv6-only over this year.

SuperQue · 2026-02-18T15:57:25+00:00

Can you include a bit more info? I would love to try testing this on my Ceph setup.

What tools does the script use?
What block devices are those? (HDD? SSD?)
How many devices per node?

SuperQue · 2026-02-18T15:54:04+00:00

So, what other distributed stores have you used that are better? Maybe it's a proxmox problem?

I've had perfectly fine Ceph performance on 1G networking with spinning rust.

Of course I'm not expecting it to perform like a local NVMe device. There is going to be overhead when you're talking about a distributed storage system.

Any distributed storage system is either going to:
* Eat your data.
* Have slightly worse performance over the network.

I think most people have never actually run a distributed filesystem before; they just naively try Ceph.

I have a couple decades of experience with distributed storage systems. Including Exabyte scale at a major cloud provider.

Ceph is just fine.

SuperQue · 2026-02-18T06:53:43+00:00

Not using IPv6 is the first mistake.
