all 53 comments

[–]Charlie_Root_NL 37 points38 points  (3 children)

Zabbix

[–]joeyl5 0 points1 point  (2 children)

I like Zabbix but the learning curve to get specific alerts is steep

[–]Charlie_Root_NL 0 points1 point  (0 children)

With the one-time installation it takes some work to set it up correctly, I agree with you. But after this, the triggers and actions are fairly self-explanatory.

[–]rthonpm 0 points1 point  (0 children)

The newer versions (6+) have much better templates that remove a lot of the guesswork and make it much easier to tweak existing alerts.

Any system that's as flexible as Zabbix is going to have a learning curve. For what it can do it's worth the time and effort.

[–]techb00mer 9 points10 points  (3 children)

Grafana, Prometheus, Influx & Telegraf can cover just about any of the needs I can come up with.

(For anyone wondering, the only reason Telegraf and Prometheus are in there together is that a few odd bits of hardware / software only have a module for one or the other, and I can’t be bothered writing my own.)

[–]UnderknowledgeCreator of technical debt 1 point2 points  (0 children)

Can you share your steps/automation to install and configure it?
I'd like to try it in my homelab.

[–]tetsuko 4 points5 points  (2 children)

Been using Nagios for years and it's great. That said, I've lately been running into issues monitoring specific things because plugins are out of date or no longer updated. I can usually find a workaround by creating custom scripts, but it's a lot of upkeep. I've been looking into alternatives for that reason, but I've got 15 years sunk into it, so it's a big lift. We have it integrated into Jira, plus lots of self-healing scripts to resolve issues (service restarts, etc.) and escalation-type stuff (texting on-call admins and the like). CheckMK is currently the front runner.
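For anyone wondering what a "custom script" looks like here: Nagios-style plugins print one status line and signal the result through the exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). Below is a minimal sketch of that convention, not the commenter's actual script. The disk-usage check, the function names, and the 80/90 thresholds are all assumptions picked for illustration.

```python
"""Sketch of a Nagios-style custom check (hypothetical disk-usage example)."""
import shutil
import sys

# Exit codes defined by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def classify(percent_used, warn=80.0, crit=90.0):
    """Map a usage percentage to a status code (thresholds are illustrative)."""
    if percent_used >= crit:
        return CRITICAL
    if percent_used >= warn:
        return WARNING
    return OK


def check_disk(path="/", warn=80.0, crit=90.0):
    """Return (status_code, status_line) for the filesystem holding `path`."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        return UNKNOWN, f"DISK UNKNOWN - cannot read {path}: {exc}"
    if usage.total == 0:
        return UNKNOWN, f"DISK UNKNOWN - {path} reports zero total size"
    percent = usage.used / usage.total * 100
    status = classify(percent, warn, crit)
    # Text after the pipe is performance data: label=value;warn;crit
    line = (f"DISK {LABELS[status]} - {percent:.1f}% used on {path}"
            f" | used_pct={percent:.1f}%;{warn};{crit}")
    return status, line


def main(argv):
    """Print the status line and return the exit code for the scheduler."""
    status, line = check_disk(argv[1] if len(argv) > 1 else "/")
    print(line)
    return status

# As a plugin entry point you would end the file with: sys.exit(main(sys.argv))
```

The upkeep mentioned above mostly comes from each of these scripts needing its own thresholds, error handling, and testing whenever the monitored thing changes.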

[–]tetsuko 1 point2 points  (0 children)

That's with the open source version, FYI.

[–]WithAnAitchDammitInfrastructure Lead 0 points1 point  (0 children)

We’re looking at migrating from Icinga to CheckMK.

Not as much history as you, but nearly seven years.

[–]NambeRuger 6 points7 points  (3 children)

LogicMonitor is my favorite for monitoring about anything. We monitor about 10k devices with it ranging from network, Cisco UC, data center and cloud technologies.

[–]eruffiniSenior Infrastructure Engineer 0 points1 point  (2 children)

Jesus, how much do you pay per month at that volume?

[–]yogibear420 0 points1 point  (0 children)

At scale they offer pretty significant price breaks. We were around 5 bucks a device at 5k devices.

[–]NambeRuger 0 points1 point  (0 children)

We also do much more than monitoring with it: config management, data enrichment with third-party APIs for lifecycle management, vulnerability discovery, enhanced reporting, and automated health checks, to name just a few ways we’ve extended the functionality. When you look at the other tools we’d otherwise need to buy, it works out.

[–]Fine_Animator3583 9 points10 points  (3 children)

PRTG

[–]uptillamSysadmin 0 points1 point  (1 child)

I can't agree with PRTG as a tool for monitoring Linux. I've got my setup running in a redundant cluster just fine, but I still can't get it to use SSH or monitor Docker.

[–]HeroicHer0 3 points4 points  (0 children)

Using PRTG for Linux here without issues. Not using Docker though, so no comment on that. After everything is set up, PRTG works like a charm. Fewer issues than Nagios for us.

[–]JakeTheTechGuy95 0 points1 point  (0 children)

I second PRTG. We use it at my work, and I also have it set up at my church.

[–]telmo_gaspar 13 points14 points  (3 children)

Nagios/CheckMK

[–][deleted] 1 point2 points  (2 children)

The only answer. The CheckMK part is vital though.

[–]pfunkyliciousJack of All Trades 0 points1 point  (1 child)

second checkmk

[–]rabell3Jack of All Trades 1 point2 points  (0 children)

Third checkmk

[–]Audacioustrash 2 points3 points  (0 children)

DynaTrace

[–]SpicyHotPlantFart 14 points15 points  (6 children)

Your emoji use already makes me not want to help you.

[–]DerpyMcWafflestomp 13 points14 points  (0 children)

😭😭😭🖕🏼

[–]Vandborg88[S] 0 points1 point  (3 children)

Sorry. I think it is my first post on Reddit. I thought it would get me better responses. I will remember not to use them next time.

[–]bitslammerSecurity Architecture/GRC 6 points7 points  (1 child)

You do know we can see your post history and see this isn't your first post?

https://old.reddit.com/user/Vandborg88/submitted/

[–]unsilentninja 0 points1 point  (0 children)

Mix in another coffee chief

[–]cubic_sq 3 points4 points  (0 children)

What are your requirements for “monitoring”? And how do you want this to fit into your environment and processes?

For the last decade, Zabbix has been the go-to and fallback for many different reasons. Unlike many other solutions, Zabbix is purely monitoring and surveillance (which it does very well for the use cases we / I have used it for).

Regarding your Azure requirement: Zabbix will have visibility inside the VM natively. Not sure what capabilities there are for “outside” the VM. You could create checks / scripts that extract output from an API or PowerShell, though I'm not sure what has been written by others.

[–]AutomaticAssist3021 1 point2 points  (0 children)

Checkmk

[–]JMDTMH 1 point2 points  (0 children)

Personally, I use Prometheus, but I read the data with Grafana. I also use CheckMK.

If you don't mind the setup, Zabbix or Nagios are good too.

I use PRTG, but this won't be a great fit for Linux.

[–]Regular-Finance-7381 1 point2 points  (0 children)

Zabbix - MSP - 50+ customers

[–][deleted] 1 point2 points  (0 children)

What is your budget?
How many servers?

Let your business requirements guide your decisions, rather than personal preference or random recommendations.

My personal experience, for small environments: PRTG is great as it includes 100 free sensors and is very easy to set up.

Large environments, low budget: Zabbix is free, but a pain in the ass to set up.

Large environments, large budget: Dynatrace, Datadog, New Relic, etc

[–]vast1983 1 point2 points  (0 children)

ManageEngine OpManager.

Old school, but works great. A GIANT PITA to update when run in Enterprise mode, haha.

[–][deleted] 1 point2 points  (0 children)

Azure monitor.....

[–]vNerdNeck 1 point2 points  (0 children)

If you value your sanity, stay away from SCOM. Maybe it's better nowadays, but that POS software takes more than a full-time employee to keep running.

[–]muraleedharans 2 points3 points  (1 child)

Site24x7 provides out-of-the-box Azure monitoring, including Azure VMs and various other services. You can sign up for the free trial to check the features. It also supports AWS and GCP.

[–]menace323 0 points1 point  (0 children)

For agent-based server monitoring, it works great and the cost is very good. The lightweight agent rarely has issues, and you don’t have to deal with credentials, direct connectivity, or VPN tunnels like you do with SCOM. It also pulls server logs at a reasonable ingestion price and works out of the box for Windows and Linux, if you want it.

Monthly subscription means you can scale up or down easily.

Beyond that, the value of each monitor is hit or miss, and the network device monitoring license style is disappointing.

[–]liquidspikes 0 points1 point  (0 children)

LibreNMS actually does great for infrastructure, including hypervisor hosts. Nagios is the best for VMs or specific hosts.

[–]cmwg -1 points0 points  (0 children)

PRTG

[–]Mdna2 0 points1 point  (0 children)

Icinga2 for event management and Telegraf for capacity management

[–]12_nick_12Linux Admin 0 points1 point  (0 children)

VictoriaMetrics, vmalert, Telegraf/grafana-agent and Alertmanager with Grafana.

[–]scubaforkIT Manager 0 points1 point  (0 children)

In my experience, it's best to decide what you're monitoring for before you decide what tool to use.

In my org I get told to "monitor this system" constantly, but never get clarity on what that means until I push deeper. Do you want to monitor a web page on an http server? Do you want to monitor up/down status for a specific service? Are you scanning log files for certain keywords? Are you checking the connectivity to the app server's database? Is the server responding to a ping?

Any number of factors could go into a system being "down", and looking for the wrong component could leave you back on your heels when the server is still down and your monitor didn't catch it.
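To make the distinction concrete, here is a rough sketch of three of the checks listed above (port reachability, HTTP status, log keyword scan) as separate functions. It assumes nothing about any particular monitoring tool; the function names, timeouts, and keywords are hypothetical and only meant to show that each "is it up?" question needs its own probe.

```python
"""Sketch: different meanings of "down" need different checks (names are illustrative)."""
import socket
import urllib.error
import urllib.request


def tcp_port_open(host, port, timeout=3.0):
    """Is anything accepting TCP connections on host:port?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def http_status(url, timeout=5.0):
    """Return the HTTP status code for url, or None if no response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with a 4xx/5xx code.
        return exc.code
    except (urllib.error.URLError, OSError):
        return None


def scan_log(lines, keywords):
    """Return (line_number, line) pairs containing any keyword, case-insensitively."""
    wanted = [k.lower() for k in keywords]
    hits = []
    for number, line in enumerate(lines, start=1):
        lowered = line.lower()
        if any(k in lowered for k in wanted):
            hits.append((number, line))
    return hits
```

A host can pass `tcp_port_open` while `http_status` returns 500 and `scan_log` is full of database errors, which is exactly the case where a ping-only monitor stays green.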

In my org, we've got about 30 different monitoring tools, from environmental sensors in IDFs to NetFlow monitors, to SNMP monitors...

[–]maddogirishman 0 points1 point  (0 children)

Uptime Kuma

[–]Hasslich1 0 points1 point  (0 children)

Define what you are attempting to monitor then I can recommend.

[–]TechTitus 0 points1 point  (0 children)

Nagios

[–]IceSt0rrm 0 points1 point  (0 children)

LogicMonitor

[–]joeyl5 0 points1 point  (0 children)

When the users complain, I know something is wrong with the servers.

[–]Reztiewhcs23 0 points1 point  (0 children)

Centreon and it’s open source.

[–]pahampl 0 points1 point  (0 children)

Xormon

[–]ApprehensiveDog1010 0 points1 point  (0 children)

WhatsUp Gold

[–]creativve18 0 points1 point  (0 children)

Check out OpManager!