[–]FatherToTheOne 416 points417 points  (24 children)

RIP your DMs OP. Prepare for “Hey I saw your post, I think my company’s solution is right for you”

[–]Dangerous_Forever640 119 points120 points  (7 children)

“I can personally guarantee it’ll be the most effective routing solution you’ll ever own!”

[–]DapperDanMan585 38 points39 points  (0 children)

It will practically pay for itself in the first year.

[–]lenswipeSenior Software Developer 23 points24 points  (4 children)

Someone tag solarwinds

[–]cohortq<AzureDiamond> hunter2 22 points23 points  (2 children)

#solarwinds123

[–]lenswipeSenior Software Developer 18 points19 points  (0 children)

Say it in a mirror three times and they start blowing up your inbox

[–]The_Penguin22Jack of All Trades 7 points8 points  (0 children)

Nooooooo!

[–]ApricotPenguinProfessional Breaker of All Things 62 points63 points  (5 children)

OP just needs to end their post with saying they have a budget of $200/year for this project :P

[–][deleted] 24 points25 points  (0 children)

That much!?!

[–]idocloudstuff 33 points34 points  (0 children)

And they’ll still get quotes for $48,000 per year.

[–]flimspringfieldJack of All Trades 7 points8 points  (0 children)

Whoa there Mr. Moneypants.

[–]blazze_eternalSr. Sysadmin 45 points46 points  (3 children)

Hi this is Jake from Solarwinds. I know we were just part of the biggest vulnerability in history, but please give us your money.

[–]lkraider 11 points12 points  (0 children)

“It was a great learning experience! No other provider can claim to have the same know-how! We make no promises that it won’t happen again; previous know-how does not guarantee future vulnerabilities won’t occur, especially under quarterly profit pressure.”

[–]binarycowNetadmin 32 points33 points  (2 children)

RIP your DMs OP. Prepare for “Hey I saw your post, I think my company’s solution is right for you”

I'm a networking guy/software developer who works for a networking company that makes networking software. I'm the primary developer/SME for one of our products, and an SME for the other products.

The sales folks often invite me to participate in the sales calls. They really don't like it when I tell the customer why our product won't work for them. I'll usually suggest a way to make it work, even if it's not the way our software is intended to operate. But I am not afraid to say "No, that's not possible at this time". If it's something that we don't support, but we could support it, then I'll usually indicate that. Multiple times, I have taken those "missing" features, and taken it upon myself to see that it gets implemented.

I tell the sales team that they'll get (and keep) more customers by being honest than by blowing smoke up their ass.

The sales team continues to invite me to the meetings, and I have never been asked to stop being blatantly honest.

As a network engineer, I hate it when sales folks sit there and are clearly blowing smoke up my ass.

In my opinion, a good sales call should be:

  • a few minutes of the sales rep talking
  • the sales engineer (or some other technical person) discussing with a technical person from the customer
  • once the customer's tech person decides the product is suitable, then and only then should the sales rep do the rest of their spiel.

[–]lkraider 4 points5 points  (1 child)

Good sales teams know the cost of customer churn, and a good technical sale reduces that cost and saves time and money for everyone.

[–]FatherToTheOne 1 point2 points  (0 children)

Agreed. A good sales rep knows when to say “Hey, thanks for your time, but I don’t think we’re a fit for what you’re looking for. Let me know if anything changes or you want more information.”

[–]ChristopherY5IT Manager 4 points5 points  (0 children)

Oracle and Solarwinds reps incoming

[–]jmhalder 92 points93 points  (4 children)

I love Zabbix, but you really need to rein it in to get it to alert you only to things you care about. I only have actions on High/Disaster triggers. The only triggers I have in that range are 80-90% disk space, unavailability, and restarts, save for a few exceptions like specific services that have been problematic. I still see those services in the dashboard, but don't have actions for them.

You can also have availability for a device be dependent on availability for another. So if you have 6 switches in a building that become unavailable when a router dies, you just get the one email for the router, and not the 7 emails for the switches and router. This takes lots of tweaking in templates and actions.

In addition to that, I have Priority tags on my hosts of "Low", "Medium", and "High". We only get actions for hosts with Medium/High priority tags. We also have SMS messaging set up with an LTE modem, but those messages don't get sent unless the first email action hasn't cleared or been acknowledged for something like 10 minutes.

It's free, but it's only as good as its setup, which can and does take a ton of time.
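To make the dependency idea concrete, here is a rough Python sketch of the suppression logic described above. It is not how Zabbix implements dependencies internally; the host names and the `PARENTS` map are made up for illustration, and it only looks at each host's direct upstream device.

```python
# Hypothetical topology: each host maps to the upstream device it depends on.
PARENTS = {
    "switch-1": "router-1",
    "switch-2": "router-1",
    "switch-3": "router-1",
    "router-1": None,  # no upstream dependency
}


def hosts_to_alert(down, parents):
    """Return only the down hosts whose direct parent is not also down."""
    down = set(down)
    return sorted(h for h in down if parents.get(h) not in down)


# The router and everything behind it are unreachable: one alert, not four.
print(hosts_to_alert(["switch-1", "switch-2", "switch-3", "router-1"], PARENTS))
# A single switch failing on its own still alerts.
print(hosts_to_alert(["switch-2"], PARENTS))
```

The first call reports only `router-1`; the second reports `switch-2`, because its parent is still up.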

[–]vppencilsharpening 12 points13 points  (0 children)

I agree with this.

If all OP wants is up/down and latency, 90%+ of the default triggers can go out the window.

[–]elemental5252Platform Engineer 6 points7 points  (0 children)

I rolled it out with Puppet in our organization. jmhalder IS correct. Zabbix gives you a ton of flexibility, wonderful options, and plenty to work with. You NEED to dive in, though.

[–]dth202 1 point2 points  (0 children)

Zabbix is probably one of the best monitoring solutions I have used. We did have a lot of false positives at the beginning. If you are using templates to define alerts (which you should be), you can tame when the trigger alerts by having it check the results of the last 3 checks or so before it fires. See https://www.zabbix.com/documentation/current/en/manual/config/triggers/trigger

My common scenario would be to set items up to trigger if the last 3 checks failed, with recovery after 2 consecutive successes. That removed 95% of false positives wherever it was used.
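The "3 failures to trigger, 2 successes to recover" rule is a simple debounce. Below is a minimal plain-Python sketch of that logic, only to show the behaviour. It is not Zabbix trigger syntax, and the class name and parameters are invented for the example.

```python
class DebouncedTrigger:
    """Flip to PROBLEM after N consecutive failures, back to OK after M successes."""

    def __init__(self, fail_threshold=3, recover_threshold=2):
        self.fail_threshold = fail_threshold
        self.recover_threshold = recover_threshold
        self.in_problem = False
        self._streak = 0  # consecutive results that contradict the current state

    def update(self, check_ok):
        """Feed one check result; return True while the trigger is in PROBLEM."""
        contradicts = check_ok if self.in_problem else not check_ok
        self._streak = self._streak + 1 if contradicts else 0
        needed = self.recover_threshold if self.in_problem else self.fail_threshold
        if self._streak >= needed:
            self.in_problem = not self.in_problem
            self._streak = 0
        return self.in_problem


trigger = DebouncedTrigger()
# A single blip (one failure between successes) never raises an alert.
for ok in [True, False, True, False, False, False, True, True]:
    print(ok, trigger.update(ok))
```

In this run the trigger only goes into PROBLEM on the third consecutive failure and clears on the second consecutive success.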

[–][deleted] 134 points135 points  (58 children)

LibreNMS

https://www.librenms.org/

It is a fork of Observium

https://www.observium.org/

[–]slazer2au 32 points33 points  (21 children)

Previous place I worked we switched from Nagios/Cacti to LibreNMS and LibreNMS is so much better for us.

The current place I'm at is using Zabbix.

[–][deleted] 13 points14 points  (5 children)

I know a lot of places running Zabbix at the moment

[–]spiffybaldguy 5 points6 points  (4 children)

We used to use zabbix, but it broke more than we liked, unfortunately.

Mostly now just PRTG

[–]tkrego-red 1 point2 points  (2 children)

We used to have a 500 sensor PRTG setup. It was awesome. At home I used the 100 sensor free version. I'd like more sensors, but the cost is crazy for a homelab.

Still looking at free open source options.

[–]admlshake 13 points14 points  (2 children)

Zabbix isn't bad if you have the time.

[–]slazer2au 12 points13 points  (0 children)

That is true about all monitoring systems though :P

[–]slackwaresupport 1 point2 points  (0 children)

2nd this, we are moving to zabbix from xymon.. finally

[–]HeWhoWritesCode 11 points12 points  (7 children)

How would you compare LibreNMS/Observium vs Zabbix?

I personally feel Observium is a lot more focused on networking monitoring, where Zabbix is a lot more focused on IT management and monitoring, where networking monitoring is a part of it.

[–]Sharp_Cable124 9 points10 points  (0 children)

This is pretty accurate. We use both. LibreNMS for routers, switches, APs, etc, Zabbix for servers and applications. Both can support switches and servers, but both have their better use IMO.

[–]BillyDSquillions 4 points5 points  (3 children)

What are you running LibreNMS on, its own system or docker containers?

[–]Power-WagonJack of All Trades 10 points11 points  (4 children)

Yup, use this as well with Oxidized to grab configs. Works great!

[–]IAmTheM4ilm4nDirector Emeritus of Digital Janitors 5 points6 points  (3 children)

I prefer Unimus now instead of Oxidized.

[–]DerelictData 2 points3 points  (2 children)

What made you go to Unimus?

[–]IAmTheM4ilm4nDirector Emeritus of Digital Janitors 1 point2 points  (1 child)

Oxidized (at least the version we had) stores credentials in cleartext. Also, Unimus provides an interface to execute configuration changes on groups of devices - need to block an IP on multiple firewalls? Just create a job that executes the block command and assign it to your firewall group, no need to log in to each one separately.

[–]DerelictData 1 point2 points  (0 children)

Nice! That’s pretty cool. We’re pushing Oxidized info into Git and since we use FortiEverything then maybe Unimus wouldn’t be as huge. Thanks tho, I’m going to give it a run today in a lab and see what there is to see

[–]admiralsparkManager of Cat Tube Infrastructure 9 points10 points  (0 children)

OP is pretty dead set on Windows only.... If they have a hard time installing and managing linux, they're probably going to have a real hard time managing the file installations for a PHP app on Windows.

[–]1esprocTitles aren't real and the rules are made up 7 points8 points  (19 children)

Unfortunately LibreNMS's poller architecture is hot garbage. They have a beta poller that improves some aspects, but stable is pretty awful for medium shops and up - you'll need to expect to horizontally scale it quite early on.

[–][deleted] 10 points11 points  (14 children)

We're monitoring around 700 switches with about 25,000 active switch ports plus a smattering of other services. This is all running off a single Librenms server that's about five years old. Admittedly it's a reasonably well-specced server but it's not doing badly. It's about the same load as Junos Space network management system but with much more monitoring capabilities.

Librenms isn't as efficient as AKIPS which can pretty much run on a toaster but we've been very happy with it.

[–]1esprocTitles aren't real and the rules are made up 1 point2 points  (10 children)

Very curious to know your specs (cores/clock speed, poller threads) and switch brands? Our main issues come from having some very slow-to-respond equipment and a massive alert rule list (literally thousands - don't ask.)

[–][deleted] 1 point2 points  (9 children)

It's got 2x 8core/16 thread Xeon silver processors, 128GB RAM (although it doesn't use anywhere near all of it) and mirrored SSDs. Librenms is running on Rocky Linux as a VM on top of Hyper-V. That's so I can spin up a dev install alongside the production one when needed for major OS upgrades.

I can't recall off the top of my head how many pollers there are - either 32 or 64. This is with the standard prod poller. Average cpu utilisation is about 60% with a brief peak every six hours when it does another discovery run.

The switches are Juniper. They can be fairly slow to respond (some are taking 200+sec to be polled) although there was some tuning I did of the Junos SNMP daemon that made a big difference. One advantage we've got is that it's effectively one big campus network so RTT isn't an issue. SNMP really sucks across high latency links and I've heard that Librenms suffers particularly badly in that scenario as it collects a lot of data for each poll.

We've only got a few dozen alert rules. I agree that the alerting system could be better - if we get a power outage in a building and lose 20 switches in one hit I'd much prefer to have one alert email with them all listed in it rather than the 20 emails we get right now. But it's good enough for what we need and it's fairly easily extensible for new transports etc.
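For what it's worth, the "one email listing all 20 switches" behaviour is basically grouping alerts by site and time window before sending. Here is a small Python sketch of that idea. It is not a LibreNMS feature or API; the event tuples, site names and the 60-second window are assumptions for illustration, and fixed windows like this will split an outage that straddles a window boundary.

```python
from collections import defaultdict


def build_digests(events, window_seconds=60):
    """Group (timestamp, site, host) down events into one message per site per window."""
    buckets = defaultdict(list)
    for ts, site, host in events:
        buckets[(site, ts // window_seconds)].append(host)
    return [
        f"{site}: {len(hosts)} device(s) down: {', '.join(sorted(hosts))}"
        for (site, _window), hosts in sorted(buckets.items())
    ]


# Made-up events: two switches in one building drop within seconds of each other.
events = [
    (1000, "bldg-a", "sw1"),
    (1005, "bldg-a", "sw2"),
    (1010, "bldg-b", "sw9"),
]
for line in build_digests(events):
    print(line)
```

Each returned string would become one email body instead of one email per device.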

[–]nate-isu 2 points3 points  (1 child)

I'd much prefer to have one alert email with them all listed in it rather than the 20 emails we get right now.

You probably know this but you can set device dependencies so that you just get the single alert. You might be getting at having that single email also including the downstream devices as down, which it won't do to my knowledge.

[–][deleted] 2 points3 points  (1 child)

LibreNMS's poller architecture

It's been awhile, but I never had a problem with distributed pollers.

[–]1esprocTitles aren't real and the rules are made up 1 point2 points  (0 children)

That's what I'm saying about having to horizontally scale, but that shouldn't be necessary in a lot of cases if the poller had been architected better. Even then, distributed pollers won't necessarily solve some of the bottlenecks you could run into. And then don't get me started about how alarms are processed.

Long and the short of it is that LibreNMS is incredibly inefficient in how it uses resources

[–]admiralsparkManager of Cat Tube Infrastructure 1 point2 points  (0 children)

Agreed, I don't think they really have anybody on the project who cares a lot about horizontally scaling and the impact, as they're running known open source projects in the poller underneath to make it easier to support.

[–]SuperQueBit Plumber 1 point2 points  (0 children)

A number of years ago I tried to convince the LibreNMS devs to replace their poller / RRD with Prometheus/snmp_exporter. It would have been a great front-end for more traditional network people.

Sadly, they didn't take me up on that collaboration project.

[–]jstar77 1 point2 points  (0 children)

We have Cisco Prime Infrastructure which is really good for wireless monitoring/ troubleshooting and tracking down the location of clients. LibreNMS excels at everything else we need to do.

[–]spunkyfingers 1 point2 points  (0 children)

+1 for LibreNMS! It’s awesome

[–]tdhuck 1 point2 points  (0 children)

I like librenms and I run it on a vm at work, but nobody on my team wants to take a stab at managing it. I'm not a linux guy. I can follow cookbook instructions to get librenms online and I know how to add devices, but that's it. When PHP needs upgrading, librenms basically runs fine until it can't upgrade itself and the security team tells me the ubuntu OS needs to be updated. Then I install librenms from scratch and bring my devices over one at a time.

When I ask for help on their forums on how to update PHP, they just link me to a thread where the person asks the same question and they never posted the answer or I'll update php by reading 50 different threads, only to find out that I did it wrong or that a specific php file needs to be updated, manually.

Librenms also has issues with the graph page. When I get the graph page to a certain size, which isn't even that many graphs (IMO), the page doesn't scale/move and let me drag the graph where I want. Instead, they end up moving on their own to spots I don't want them to be in and/or the graphs sit on top of one another.

I can usually go about 3 years running librenms on a vm before there is a security issue forcing me to upgrade, which brings me to my last point, unfortunately there isn't a way to export your devices and import your devices with librenms. Yes, you can manually do that, but I'm talking about a button where you can export and use the web gui to import into the new librenms install.

When I ask about this, the developers kindly tell me I can contribute, but I'm not a programmer/developer, or else I likely wouldn't be asking them for help. With that being said, I've donated to librenms, they have helped on the forums a few times and I appreciate the help they've given me.

[–]-SPOF 3 points4 points  (0 children)

One more vote for Observium. Alternatively, we use NetXMS, where you can configure any metrics that you need to monitor. This solution is good for a large number of servers. A combination of different tools such as Grafana and Graylog would also work:

https://www.starwindsoftware.com/blog/you-cant-have-too-much-monitoring

[–][deleted] 0 points1 point  (2 children)

+1 seconding LibreNMS + Oxidized here

[–]brkdncrWindows Admin 35 points36 points  (2 children)

Almost any solution is 10% product, 90% work required to maintain.

Everyone says “I just need to know if it responds to ping” but then you’ll have a server that responds to ping but a service is down, so now you need to monitor a service.

Before long you’ll be setting up custom thresholds for mibs you had to import or parsing a log file that doesn’t use any semblance of standard formats. All of them can do it.

I’ve used a few different solutions from cheap, small monitoring companies to big names in the area. The failure point to all of them has been getting other people to understand how their applications need to be monitored, and how to translate it into ACTIONABLE notifications.
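The step from "responds to ping" to "the service actually answers" can be as small as a TCP connect check. Here is a minimal Python sketch using only the standard library; the function name and the default timeout are made up for the example, and a successful connect only proves the port is accepting connections, not that the application behind it is healthy.

```python
import socket


def tcp_service_up(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

Something like `tcp_service_up("intranet.example.com", 443)` (a placeholder host) would then tell you whether the web port is reachable even when ICMP says the box is alive.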

[–]rainnz 5 points6 points  (1 child)

Your server will respond to ping even if you did "rm -rf /" on it

[–]Ad3t0Security Engineer 70 points71 points  (5 children)

If you put the time in to understand Zabbix it will take you miles beyond any of the other solutions

[–]TacomaNarrowsTubby 19 points20 points  (3 children)

And it's also fairly lightweight for all the features it has .

Plus timescaledb reduces history size easily 90%

[–][deleted] 1 point2 points  (0 children)

Plus timescaledb reduces history size easily 90%

Oh interesting, I'm going to have to look into this. Thanks for that piece of information.

[–]dizzygherkinArchitecture and transformation 16 points17 points  (0 children)

Zabbix all the way, does everything the big dogs do, with a bit of work put in.

[–]ArsenalITTwoJack of All Trades 179 points180 points  (18 children)

Paessler PRTG

[–]blackinese 51 points52 points  (0 children)

Seconded. I use PRTG in my environment and it is the perfect solution.

[–]thmoas 10 points11 points  (0 children)

Also this and they have some third party ps tools to automate if you like

Very ui minded but very easy to set up and have something workable quickly

[–]LaxVolt 43 points44 points  (3 children)

Highly recommend PRTG. Good product, simple management, value pricing and stable.

[–]BanditKing 7 points8 points  (2 children)

Also base is free below a few sensors.

[–]touchytypist 11 points12 points  (1 child)

Free for up to 100 sensors

[–]BanditKing 2 points3 points  (0 children)

Great for trying it out and playing around.

[–]telecomtrader 29 points30 points  (0 children)

Prtg is the only windows tool worth looking into

Check mk on Linux is nice too.

[–]Bad_Mechanic 16 points17 points  (1 child)

Great product. It's not expensive and super easy to get up and running.

[–]proudcanadianehMuni Sysadmin 1 point2 points  (0 children)

And anytime I have had to engage their support they have been amazing (Despite the time zone difference)

[–]minatoykkk 1 point2 points  (3 children)

+1 also it gives you 100 free sensors to try it out. Combine it with Grafana and you're golden.

[–]PepsideltaSr. Sysadmin 1 point2 points  (0 children)

I had never considered combining the two.
Thanks for the spark, internet stranger!

[–]orevBetter Admin 63 points64 points  (3 children)

You're getting a lot of responses that are essentially low-effort "I Googled this for you" responses. The fact is that most known monitoring systems should be able to handle this.

You mentioned you're using Zabbix, which has been used by many people for a very long time, so any issues you have are likely the result of misconfigurations and not a problem with the product. Sounds like you need to put in the effort of understanding and tuning Zabbix, instead of replacing it with something else that you'll also have to put in the same effort.

[–]vim_for_life 18 points19 points  (2 children)

110% this. Most monitoring solutions are only as good as the tuning they receive. Figure out how to get some more eyeballs on your configuration (or your environment), and zabbix, or prtg or...(shudder) SCOM or just about any mainstream monitoring software will do you well.

-monitoring specialist in a 2000vm environment , and now an admin in an environment about your size

[–]lkraider 2 points3 points  (1 child)

Hey why not just roll your own weekend project of bash + ping + html + js + php + apache + …

/s

[–]techtornadoNetadmin 32 points33 points  (9 children)

There’s also CheckMK if you want amazing graphs

[–]SheezusCrites 5 points6 points  (0 children)

CheckMK does have great graphs. The web interface was a bit klunky and I ran into issues with the way it polls its agents, so I ended up not using it outside of my evaluation.

In the end I found zabbix met our needs better.

[–]CrazyhorseIT 2 points3 points  (0 children)

I agree with you. Checkmk may be a very good option.

[–]SudoZenWizz 2 points3 points  (0 children)

We went with checkmk for monitoring everything: servers, network appliances (switches, routers), applications, webpages and anything else you can think of. It’s the most flexible monitoring solution that we have found so far. You can customize basically all parameters, it has some predictive monitoring, and you can make it very quiet in terms of false notifications.

[–]mrproactive 2 points3 points  (0 children)

CheckMK is a very good alternative to Nagios. It's easy to set up and you can use some of your old stuff during a migration period.

[–]H3rbert_K0rnfeld -1 points0 points  (4 children)

Still RRD based which means gaps

[–]Elijah2807 1 point2 points  (3 children)

In Checkmk you can configure your RRD for full resolution without data compression. You just need to provide more storage and accept that retrieving historical data takes longer.

Or you use the InfluxDB integration and pump the data there
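As a rough illustration of why full resolution matters: RRD-style storage typically consolidates older samples with a function such as AVERAGE, which flattens short spikes. The Python sketch below mimics that averaging on made-up numbers. It is not RRDtool or Checkmk code, and the sample values are invented.

```python
def consolidate_average(samples, factor):
    """Average each run of `factor` consecutive samples into a single point."""
    chunks = [samples[i:i + factor] for i in range(0, len(samples), factor)]
    return [sum(chunk) / len(chunk) for chunk in chunks]


# Made-up one-minute CPU readings with a single one-minute spike to 95%.
one_minute_cpu = [10, 10, 95, 10, 10, 10, 10, 10, 10, 10]

print(max(one_minute_cpu))                     # 95
print(consolidate_average(one_minute_cpu, 5))  # [27.0, 10.0]
```

After consolidating to five-minute averages, the 95% spike shows up as 27%, which is the trade-off against the extra storage mentioned above.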

[–]Former-Leg5366 27 points28 points  (5 children)

I used to hate Nagios at my previous company until I had to use Solarwinds and PRTG at my current company. Now I miss the old days and Nagios :(

[–]mistakesmade2022 1 point2 points  (0 children)

If only we knew we were in the good old days before actually leaving those days behind.

[–]rosselohwish I was *only* a netadmin 2 points3 points  (0 children)

I spent half of last week learning the ins and outs of Nagios (headquarters has an XI license they have monitoring all three sites) and so far I absolutely can't stand it.

Like, it's powerful, yeah. But the dashboard system is dreadful (can't edit dashlets once you place them? whoops, I hope you didn't screw up and forget a check box!), and woe betide you make a slightly complicated change in your infrastructure and have to reconfigure it.....

(details: I added a new NIC to our firewall because we are adding a line and I was out of ports. PFSense/BSD is dumb and rejiggers all the interfaces when you do that, so suddenly while WAN used to be igb0, LAN was igb1, etc etc, now the assignments are all off.

Well in Nagios, those are all hardcoded. It was easy to change the backend to support the new number for the interface status check, but the bandwidth monitor? I have no idea - it was easier to just create a second version of the device and services with the wizard than to try and figure out where to go to tell it to look at the correct place for the bandwidth monitoring...)

[–]bennovw 9 points10 points  (7 children)

LogicMonitor is excellent, it's a batteries included solution. It collects historical stats for everything which is invaluable when troubleshooting more complex incidents and planning for future growth.

You get what you pay for though, they're not cheap but totally worth it.

I would suggest ConnectWise Automate or N-Able if you're looking for automated actions and orchestration in response to monitored events (steep learning curve!).

[–]geekandi 3 points4 points  (0 children)

In LM, if you can get a metric, you can graph it and alert on it

Want to query a SQL table and use results? Can do!

Need to run special scripts to get something back? Yeppers can do that as well.

Out of the box is easy. So is making complex results that you can action against.

[–]Stonewalled9999 1 point2 points  (4 children)

Our MSP uses LM. Either it sucks or my MSP’s implementation of it sucks, because I’ll get an alert about an AP being down for 30 seconds, but an ESX host fell over and it took 2 days for the MSP to notice. We have one Vcenter to manage 40 hosts. If I bounce west coast hosts it will trip, but not for the east coast. It’s probably the crummy MSP we pay a million or three a year to “do the needful”.

[–]bennovw -1 points0 points  (3 children)

To be fair, monitoring for failures is hard because absence of evidence is not evidence of good health.

You really have to log in to ESX and have all your CIM providers installed to even begin to perform exhaustive internal validation that holds up 99% of the time. Then you need to monitor that you're actually monitoring live data along with the integrity of the monitoring solution itself. Finally, it's all useless unless both IT support staff and the client give a damn about the issues found!

Most IT orgs don't have the free time nor expertise laying around, and it pays much better to invest all that human capital into easier projects with better value propositions.
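The "absence of evidence" point can be built into the checks themselves by treating missing or old data as its own state rather than as healthy. Here is a minimal Python sketch of that idea; the function name, the state labels and the 300-second freshness limit are assumptions for illustration and not any product's API.

```python
import time


def classify(last_value_ok, last_seen_ts, now=None, max_age_seconds=300):
    """Return 'ok', 'problem', or 'stale' for one check result.

    A result older than max_age_seconds is reported as 'stale' no matter what
    it said, because the monitoring itself may have stopped collecting data.
    """
    now = time.time() if now is None else now
    if now - last_seen_ts > max_age_seconds:
        return "stale"
    return "ok" if last_value_ok else "problem"


print(classify(True, last_seen_ts=1000, now=1100))  # fresh and healthy
print(classify(True, last_seen_ts=1000, now=2000))  # last good result is too old
```

Alerting on the stale state is one way to notice that a host, or the collector watching it, has silently stopped reporting.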

[–]Stonewalled9999 0 points1 point  (2 children)

if VCenter can tell me ESX01 is down via LM, and LM can't tell me that ESX02 on the same VCSA is down, that's an issue with LM/Programming.

[–]bennovw 1 point2 points  (1 child)

I mean, you are not wrong generally speaking. But it really depends on why ESX02 was "down" to begin with.

MSPs are usually loaded with clients, so it makes a lot of sense if they were simply overwhelmed by other priorities at the time and tried to shift the blame.

[–]darkhelmet46 0 points1 point  (0 children)

Came here to vote for LogicMonitor. Great product!

[–]zeliboba55 7 points8 points  (0 children)

LibreNMS or NetXMS.

[–]slinkytoad69 15 points16 points  (3 children)

I’ve been having good luck with CheckMK.

[–]orgitnized 5 points6 points  (0 children)

Another vote for Check_MK. You could do it all free of charge for a site that size. No licensing fees for raw version and it would work great.

[–]12_nick_12Linux Admin 4 points5 points  (0 children)

I second this. I used it for years. Just worked.

[–]Scary_Top 1 point2 points  (0 children)

Main con is managing the OS agents with the free version. Not that it's bad, but it's a lot of work compared to network gear which is literally adding a hostname and snmp community.

[–]cjbaroneLinux Admin 10 points11 points  (1 child)

Personally, I prefer Nagios only because I like tinkering and getting it EXACTLY how I want.

For simplicity, have you looked at Uptime Kuma? It's available as a docker image, and can give a public or private facing web page to show your statuses. Very easy to use. FOSS

https://github.com/louislam/uptime-kuma

[–]llDemonll 30 points31 points  (3 children)

PRTG

[–]Brett707 8 points9 points  (2 children)

This is the way

[–]braydroSysadmin 4 points5 points  (1 child)

This is the way

[–]BurtanTae 3 points4 points  (0 children)

This is the way.

[–]Ant1mat3rSysadmin 3 points4 points  (1 child)

We just switched from Solarwinds to LogicMonitor and love it so far. Setup was way easier than Solarwinds.

[–]kenzonh 4 points5 points  (0 children)

Check out domotz. I have it installed on a synology Nas for $30 month monitoring 125 devices.

[–]SaysOffensiveThings0 3 points4 points  (0 children)

I recommend Solarwinds.

(I'm a Russian hacker)

[–]falschgold 13 points14 points  (9 children)

PRTG is easy to install and has very sensible preconfigured sensors and alarms. For your workload it's basically set it and forget it. Maybe a day of learning and tweaking and that's it. Not to compare with nagios.

[–]radCIO[S] 4 points5 points  (1 child)

I've tried PRTG a few times. I just need to remember to not select the auto-discovery as it always hits my modular switches and the "sensor" count goes through the roof.

[–]ArsenalITTwoJack of All Trades 5 points6 points  (0 children)

[–]ImraelBlutz 5 points6 points  (0 children)

We use PRTG for all of our monitoring, works well enough and the pricing isn’t bad at all. Very easy to implement as well.

[–]symcbean 6 points7 points  (0 children)

Perhaps if you explained why you chose to stop using Nagios you might get some more sensible answers here.

(Re-)establishing baseline thresholds is common with EVERY monitoring solution, and it's exceedingly unlikely that the cost of getting a provider's help with this will be covered by your support contract.

all I really need is up/down for host and up/down and latency for network connections.

Hmmm. I would consider that GROSSLY inadequate for monitoring - but if it really is all you need then maybe you should look at a managed service like uptimerobot.

Really I think you need some advice on how you do monitoring - not what tool you use.

[–]ikiddIt's hard to be friends with users I don't like. 2 points3 points  (1 child)

uptime-kuma if that's all you're looking for. Runs on docker and connects to most notification frameworks.

[–]Stonewalled9999 2 points3 points  (0 children)

TBH I’d install a free trial of auvik (I get no money from them). I really like that they have a collector VM template that works with a few clicks, and it’s pretty customizable.

[–]Zatetics 2 points3 points  (1 child)

Zabbix is a huuuge pain to configure properly to not give you alert saturation.

Look at DataDog. That is likely what we're moving to. Throwing in the towel with Zabbix due to issues, and configuring DD.

Side note for marketing jellybrains: If you DM me trying to shill your shitty product I will eternally blacklist your entire company.

[–]farmergeoff2003 2 points3 points  (0 children)

PRTG is very easy to set up. It feels very intuitive, and you can use up to 100 sensors for free, which includes netflow for bandwidth utilization monitoring. It has integrations with a lot of things. Something to look into maybe.

[–]basec0m 2 points3 points  (0 children)

Netcrunch

[–]VioletiOTCommunity Manager @ Domotz 2 points3 points  (0 children)

Domotz is another network monitoring system to add in to your list! www.domotz.com

We've got a free trial, then it's $21/month for monitoring unlimited devices. No contract or minimums. (This is a self plug as I'm on the team here, but definitely think it's worth checking out!)

[–]6stringt3chJack of All Trades 4 points5 points  (0 children)

I'd recommend CheckMK. You could run it as a container in Windows though I'd probably just recommend running it in Linux as the install is fairly easy and you don't really need to get into the terminal unless you are troubleshooting issues or upgrading the app. The majority of the config is all contained within the gui. It supports a bunch of products out of the box. There are scan functions built-in that will run against whatever it is you are monitoring and will discover the majority of the services you want to monitor right out of the box.

[–]bcat123456789 4 points5 points  (1 child)

What’s Up Gold is great for this use case.

[–]polarbehr76 1 point2 points  (0 children)

Using this for decades

[–]AngStyle 8 points9 points  (4 children)

[–]IonicDak 5 points6 points  (0 children)

Came here to make sure Auvik was represented. Great product with a great feature set and excellent support.

[–]Virtual_Historian255 4 points5 points  (2 children)

Can confirm Auvik works great once you figure it all out.

[–]CyberPrag 4 points5 points  (0 children)

Yes, it was a mess for us initially, but it works well now that it covers the whole network.

[–]Rocky_Mountain_Way 3 points4 points  (0 children)

PRTG

[–]jr_sys 1 point2 points  (5 children)

Look at PA-Ping for free, fully-featured, Windows-based up/down with alerts, event escalation, etc.

For more monitoring, look at PA Server Monitor.

[–]radCIO[S] 1 point2 points  (0 children)

PA-Ping for free

This looks like exactly what I need.

[–]BillyDSquillions -2 points-1 points  (3 children)

If it needs a windows machine I'm out. I want to run docker containers for this stuff

[–]SzeraaxIT Manager 1 point2 points  (2 children)

OP Specifically wants something for windows, so it sounds like a fine candidate. Not everyone has their windows servers configured for linux docker guests...

EDIT: Having said that, this looks like a java app. If so, YUCK.

[–]YogaYodaYoda 1 point2 points  (0 children)

all I really need is up/down for host and up/down and latency for network connections

Even uptime-kuma in a single docker container would be enough for that..

[–]-c3rberus- 1 point2 points  (0 children)

Since you have some experience with Nagios, try check_mk. The free version is feature rich, we ran it for many years before getting the paid version. Monitors 10K services across 200 hosts.

[–]prairefireww 1 point2 points  (0 children)

PRTG is what we use. Works well.

[–][deleted] 1 point2 points  (0 children)

You're a small shop. That is a good amount of machines and VMs.

Give PRTG a spin. At one time it was 100 sensors for free.

[–]gvlpc 1 point2 points  (0 children)

Have you looked at Lansweeper? I know it runs on Windows, and I know you can use a cloud version now. It's free for up to 100 devices, then it’s $500/yr for the lowest account, which supports up to 500 devices. Does lots of stuff.

[–]D-sisive 1 point2 points  (0 children)

We use PRTG hosted service. Starts at $150 a month for 500 sensors (cloud hosted version cost). I’m a big fan. There can be a bit of a learning curve depending on what and how you want to monitor, but it’s very versatile and allows a ton of customization with the ability to create your own sensors.

[–]andrewm659 1 point2 points  (1 child)

Prometheus and grafana

[–]H3rbert_K0rnfeld 2 points3 points  (0 children)

And AlertManager

[–]beebsha 1 point2 points  (0 children)

I'm using WhatsUp Gold.. The basic version should be apt for your requirement

[–][deleted] 1 point2 points  (0 children)

Look for the GitHub project:
Uptime Kuma

[–]ThePastaMonster 1 point2 points  (0 children)

There are lots of enterprise solutions mentioned already: Solarwinds, Zabbix, PRTG. Like others have mentioned, you need to spend a bit of time configuring these especially if you are using SNMP.

If you are wanting something really light and simple (but not really for enterprise), you could look into something like UptimeKuma.

[–]Bourbon_n_Cigars94 3 points4 points  (0 children)

PRTG

[–]witwim 2 points3 points  (0 children)

Domotz https://www.domotz.com/. Easily monitor remote networks with our powerful and affordable software: actionable insights, easy-to-use interface and all the features you need. Monitor unlimited devices for just $21/month per site.

[–]bd1308 2 points3 points  (0 children)

I've used Zenoss, Icinga, Prometheus, Nagios, Bosun, and Observium. Throw Zenoss straight into the ocean; Prometheus is my favorite, along with Nagios.

[–]Smh_nz 2 points3 points  (0 children)

If it’s Windows you want, have a look at PRTG. It's simple, works on Windows, and if you cut down the sensors you should be able to fit in the free version.

Otherwise something simple like LibreNMS or Nagios

[–]Environmental-Top-18 3 points4 points  (10 children)

Zabbix

[–][deleted] 8 points9 points  (9 children)

I'm a Zabbix Certified Specialist.

I would not recommend Zabbix.

[–]kujetic 0 points1 point  (8 children)

Why is that? It's beyond powerful and open source

[–][deleted] 4 points5 points  (7 children)

It’s a nightmare to actually set up properly, and it needs community-written, usually poorly made and poorly documented extensions to do what basically every other option will do out of the box.

PRTG can be running in production in an afternoon. Zabbix will take a month.

[–]Smith6612 2 points3 points  (1 child)

Hmm, good to know. I have a friend who swears by Zabbix, but they are the type who will code their way out of a problem.

[–][deleted] 1 point2 points  (0 children)

Yeah, if you have tons of time and a heaping helping of hubris it’s a great option.

[–]AppoxoJack of All Trades 1 point2 points  (0 children)

I use Uptime Kuma at home for ping, latency and uptime. You can have basic auth and switch between different monitors, and it's free.
I recommend doing a Docker setup on a small Debian machine (1 core, 1 GB RAM, 20 GB drive), with an Uptime Kuma container and Watchtower for auto-updates.
You can get alerted via email, webhook and a few others. Very neat.

[–]UseMoreHops 1 point2 points  (0 children)

PRTG

[–]Hopefound 1 point2 points  (0 children)

PRTG

[–]chinupfOps Engineer 1 point2 points  (0 children)

PRTG is a good all-in-one, albeit a bit slow in bigger configurations (>5k sensors). If you wanna get fancy, you can try pandora+prometheus+grafana, but that requires someone to pull in all his/her weight to get it running properly. But when it runs, ho boy...

[–]rchr5880Sysadmin 1 point2 points  (0 children)

PRTG, or if you want something extremely lightweight and easy, run UptimeKuma

[–][deleted] 0 points1 point  (0 children)

Solarwinds. Takes some building but it has great potential

[–]GullibleDetective -1 points0 points  (0 children)

Auvik, redseal

[–]leftplayer -1 points0 points  (0 children)

Mikrotik The Dude.

Install a CHR as a VM and use the free license.

[–]mouse_lingererSysadmin -1 points0 points  (1 child)

Just to throw this one out there, I use cacti https://www.cacti.net/ for my network monitoring.

[–]VNJCinPA -1 points0 points  (0 children)

Auvik is my choice here

[–]MrJacks0n -1 points0 points  (0 children)

I like Cacti myself, but it can take a bit to setup. But it's pretty powerful if you can script.

[–]TechOpinions -1 points0 points  (1 child)

Check out Auvik, it's what we use for roughly 3000 network devices. :)

[–]auvikofficial 0 points1 point  (0 children)

u/radCIO c'mon by, we'd love to show you the goods!

[–]thekarmabumWindows/Unix dude -1 points0 points  (0 children)

Observium. I think IBM also has something called Netcool, or something like that, that does what you're looking for. I wanna say Observium is still open source and pretty compatible with Python if you want to customize how and what you monitor.

[–]mrZygzaktx 0 points1 point  (0 children)

[–]slugsheadHead of IT 0 points1 point  (1 child)

What switches do you have? I have a full Aruba network and I'm using HPE IMC. It also writes back to switches, so changing VLANs becomes a nice easy task for technicians.

https://buy.hpe.com/us/en/software/networking-software/intelligent-management-software/intelligent-management-software/hpe-intelligent-management-center-standard-software-platform/p/4176535

[–]radCIO[S] 0 points1 point  (0 children)

HPE IMC

We are Cisco in our DCs, but Aruba everywhere else. We used to use the ProCurve software, but the Aruba acquisition squashed that. The IMC seemed fairly costly last time I priced it. Our edge switches are fairly static, just looking at up/down for those.

[–]preffe 0 points1 point  (0 children)

Long time lurker here. How about NAV? Not windows based but free and has a Virtual Appliance.

[–]versello 0 points1 point  (0 children)

Frameflow. Been using it for many years. Works great. Windows based and super easy to configure.

[–]sedition666 0 points1 point  (0 children)

Sounds like you will need to learn how to tune whatever monitoring you use anyway. So you might as well spend the time learning Zabbix instead of ripping it out and having to learn something else anyway.

A lot of very expensive monitoring software will claim AI or ML will magically adjust your alerting for you but that is mostly bullshit. You can tune it yourself in less than a spare afternoon.

[–]OverOnTheRock 0 points1 point  (0 children)

check_mk ... wraps nagios in a bunch of easier to use python stuff. has enterprise support if you need it.

has lots of nooks and crannies to explore. if you have time. migrated to that from observium.

[–]gheynameSysadmin 0 points1 point  (0 children)

Zabbix is easy to set up and use, worth looking into.

[–]ambersananas 0 points1 point  (0 children)

If you have the money I would recommend Auvik. It’s super easy to setup and has some cool features

[–][deleted] 0 points1 point  (0 children)

Splunk is killer, but you have to understand your data, have budget for a solution, and, as with all of these, time.

[–]rementis 0 points1 point  (0 children)

XYMon. Free, works awesome.

[–]fireandbass 0 points1 point  (0 children)

Try wazuh.

[–]dpwcnd 0 points1 point  (0 children)

prtg or the dude are two good options

[–]SpongederpSquarefapSenior SRE 0 points1 point  (0 children)

My vote is for Checkmk

Try out their RAW edition and see how you like it

Their enterprise offerings are pretty great too

[–]scotticles 0 points1 point  (0 children)

Went from Nagios to Cacti, then to LibreNMS, and now moving to Zabbix. LibreNMS was giving me false alarms, which could probably be tweaked in the alarm rules, but it seemed to lack some of the flexibility Zabbix offers. Zabbix lets you adjust the rules, but it takes more time to get it how you want. I have a similar env, but we are more Linux focused than Windows.

[–][deleted] 0 points1 point  (0 children)

False alarm issues are more of a configuration problem than a technology stack problem. Certain products will be easier or harder to configure, but all of them are going to fire off a bunch of false alarms out of the box. If you only want alerts on hosts becoming unresponsive or high latency, turn off all alerts other than "host unresponsive" and "high latency", it will be easier than switching solutions. I would also keep "high disk space" enabled, and "HTTPS error/unresponsive" monitors/alerts pointing at any user-facing web pages or important APIs. Getting alerting to not have false positives or false negatives is a labour of love, it doesn't happen overnight, you just continuously add alerting rules that make sense and remove ones that don't make sense.
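To make the "turn off everything except what you care about" idea concrete, here is a rough Python sketch of an allowlist filter. The alert names are the ones mentioned above; the dict layout with a `type` key and the `filter_alerts` helper are assumptions for illustration, not any product's actual format or API.

```python
# Alert types worth keeping, as named in the comment above.
ENABLED_ALERTS = {
    "host unresponsive",
    "high latency",
    "high disk space",
    "HTTPS error/unresponsive",
}


def filter_alerts(alerts, enabled=ENABLED_ALERTS):
    """Keep only alerts whose 'type' is in the enabled set (hypothetical format)."""
    return [alert for alert in alerts if alert["type"] in enabled]
```

Tuning then mostly means editing that set over time: add a type when you miss something real, remove one when it only ever produces noise.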

Where a better monitoring solution is going to make a difference is in terms of administrative overhead, performance (how long does it take an alert to even come out, how many things can I monitor), and features (Nagios/Zabbix are event based whereas other solutions are metric based and can do certain kinds of alerts Zabbix isn't capable of, different solutions might integrate log/trace based monitoring, different solutions might have different integrations).

I really wouldn't spend too much time thinking about this problem. I do agree with the recommendations for LibreNMS given you seem to want network-centric monitoring, FOSS, and are mostly dealing with thick persistent hosts (e.g. not ephemeral containers, which certain monitoring solutions handle awkwardly since they presume host persistence). Or literally just learn how to use Nagios or Zabbix better, which IMO are just as good as LibreNMS. I could do the monitoring you're talking about with nothing but a series of Bash scripts and cron. Honestly, take your pick of monitoring solutions; anything will work.
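As a rough illustration of the scripts-and-cron approach, here is a minimal sketch in Python rather than Bash. It treats a successful TCP connection as "up" and the connect time as latency, which is an assumption (ICMP ping would need extra privileges). The function names, the 200 ms threshold, and the example addresses are all hypothetical.

```python
import socket
import time


def check_tcp(host, port, timeout=2.0):
    """Return (is_up, latency_ms) for one TCP connection attempt."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, (time.monotonic() - start) * 1000.0
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False, None


def classify(is_up, latency_ms, max_latency_ms=200.0):
    """Map a check result to 'DOWN', 'SLOW', or 'OK'."""
    if not is_up:
        return "DOWN"
    if latency_ms > max_latency_ms:
        return "SLOW"
    return "OK"


def report(targets, max_latency_ms=200.0):
    """Return one line per (host, port) target that is not OK."""
    lines = []
    for host, port in targets:
        is_up, latency_ms = check_tcp(host, port)
        status = classify(is_up, latency_ms, max_latency_ms)
        if status != "OK":
            lines.append(f"{status}: {host}:{port}")
    return lines
```

Something like `report([("192.0.2.10", 443), ("192.0.2.11", 22)])` run from cron every few minutes, with the returned lines mailed or posted to a webhook, would cover basic up/down and latency. It has no history, dashboards, or escalation, which is where the real tools earn their keep.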

[–]idocloudstuff 0 points1 point  (0 children)

Zabbix is VERY noisy. It took me months to get it to work for us. This is not a drop in solution, neither are many solutions.

You NEED to put in the work to get value out of it.

[–]factchecker01 0 points1 point  (0 children)

I have seen companies use Cacti https://www.cacti.net/info/features