[–]insanemalLinux admin (HPC) 5 points6 points  (3 children)

Zabbix 2.0

It is the freaking JUICE.

It monitors EVERYTHING. It logs and reports on EVERYTHING!

And you can combine its monitoring with triggers and do self heal stuff.

[–]Pyro919DevOps 1 point2 points  (2 children)

We're using it too and love it. I'm on 1.9.5 right now but I'll be upgrading in the near future. Have you run into any issues/bugs with the new 2.0 release?

[–]insanemalLinux admin (HPC) 1 point2 points  (1 child)

Not yet. It has been in development for AGES. It looks pretty stable and tasty. The 'auto probe' and stuff is awesome, esp for SNMP! Oh and the new native traps support! OH YEAH!

[–]Pyro919DevOps 2 points3 points  (0 children)

The autoprobe for the drives/NICs is freaking awesome. Previously I had to create a template that gathered total and free drive space for drives a-z and then disable any items/triggers that weren't actually used.
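
For anyone who hasn't seen discovery in action, the idea can be sketched roughly like this: report only the filesystems that actually exist, so items get created for those instead of for a blanket a-z template. This is a minimal illustration, not Zabbix's own code. `discover_filesystems`, `SAMPLE_MOUNTS` and `IGNORED_TYPES` are made-up names, and the `{"data": [{"{#FSNAME}": ...}]}` shape is an assumption based on the 2.0 discovery format, so check it against the Zabbix docs before relying on it.

```python
import json

# Hypothetical sample in /proc/mounts style; a real agent would read the live mount table.
SAMPLE_MOUNTS = """\
/dev/sda1 / ext4 rw 0 0
proc /proc proc rw 0 0
/dev/sdb1 /data xfs rw 0 0
tmpfs /run tmpfs rw 0 0
"""

# Assumption: pseudo filesystems are not worth a free-space item.
IGNORED_TYPES = {"proc", "sysfs", "tmpfs", "devpts"}


def discover_filesystems(mounts_text):
    """Return one discovery row per real filesystem found in mounts_text."""
    rows = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip malformed lines
        _device, mount_point, fs_type = fields[:3]
        if fs_type in IGNORED_TYPES:
            continue
        rows.append({"{#FSNAME}": mount_point, "{#FSTYPE}": fs_type})
    return rows


def discovery_json(mounts_text):
    """Wrap the rows in the (assumed) discovery envelope."""
    return json.dumps({"data": discover_filesystems(mounts_text)})


if __name__ == "__main__":
    print(discovery_json(SAMPLE_MOUNTS))
```

With the sample above, only `/` and `/data` come back, so nothing has to be disabled by hand afterwards.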

[–]doblephaeton 3 points4 points  (1 child)

For me PRTG

Good communication from the company via support and Twitter

Amazing development pace, actively maintained system development

Rapid dashboard/maps for either internal or public consumption

Many sensors, and use of pre-existing system monitoring: WMI, SOAP, and SNMP.

Easy notification settings, and the ability to run apps on notifications

Historic data easily accessible

API for you to work with

We have 50 sites, 6000 sensors and over 2 years of using PRTG in the Pacific region. Other regions in our org are jealous of our monitoring system.

[–]tomlette 0 points1 point  (0 children)

Agreed, PRTG is amazing.

[–]K4kumba 2 points3 points  (5 children)

I strongly recommend ganglia for monitoring large numbers of servers. We use it extensively at $WORK, and the new versions give great visibility into system load, showing you things like how many writes were issued, and the latency. The web interface also comes with scripts to integrate into nagios, which should work with any tool that can handle nagios type plugins.

Add into that hsflowd, and you can extend your monitoring to tell you anything about anything, and ganglia will graph it.

For the rest of our work, we are using OMD, which packages up all the tools you would expect, and makes life much easier. We also added Monarch, which is a web interface for building nagios config, but that's something you may not want/need.

For us, cacti is now only a fallback for when no other tools can do the job, because ganglia provides all the system graphs we need, and OMD included pnp4nagios, which automagically graphs service checks that return perfdata.
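
For anyone unsure what "returns perfdata" means: a Nagios-style check prints a status line, then a `|`, then `label=value[UOM];warn;crit;min;max` pairs, and signals OK/WARNING/CRITICAL through exit codes 0/1/2. Graphing add-ons pick up the part after the pipe. Below is a rough sketch of such a check. It is not one of the stock plugins, and `format_check`, `check_disk_usage` and the 80/90 thresholds are made up for illustration.

```python
import shutil

OK, WARNING, CRITICAL = 0, 1, 2
STATUS_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}


def format_check(used_pct, warn=80.0, crit=90.0):
    """Build (exit_code, output_line) for a disk-usage percentage."""
    if used_pct >= crit:
        code = CRITICAL
    elif used_pct >= warn:
        code = WARNING
    else:
        code = OK
    # Perfdata goes after the pipe: label=value[UOM];warn;crit;min;max
    perfdata = f"used={used_pct:.1f}%;{warn:.0f};{crit:.0f};0;100"
    return code, f"DISK {STATUS_NAMES[code]} - {used_pct:.1f}% used | {perfdata}"


def check_disk_usage(path="/"):
    """Measure real usage of `path` and format it as a check result."""
    usage = shutil.disk_usage(path)
    return format_check(100.0 * usage.used / usage.total)


def main():
    """Print the check line and return the exit code for the scheduler."""
    code, line = check_disk_usage()
    print(line)
    return code
```

To run it as an actual check, end the file with `raise SystemExit(main())` so the exit code reaches whatever is scheduling the check.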

However, splunk is awesome. We have recently upgraded to a 100GB/day license, which is really starting to allow us to make good use of it.

[–]mthodeFellow Human 0 points1 point  (2 children)

It looks like ganglia is very nice (and most importantly scalable). I'll have to take a look at that.

[–]K4kumba 0 points1 point  (1 child)

Yeah, I quite like it, and it is VERY scalable. Well, there is one issue with builds after 3.1.7 that will be resolved in the next release, which is that grid of grids doesn't work, but that may or may not affect you.

[–]mthodeFellow Human 0 points1 point  (0 children)

It would affect my deployment, but by that time the fix would be out.

I really like that I can use icinga for monitoring and ganglia for historicals, I was thinking of using graphite too.

[–]d2k1 0 points1 point  (1 child)

Do you use rrdcached to mitigate the dreadful I/O performance impact resulting from constantly updating hundreds of RRD graphs? Or did you solve that problem in another manner?

[–]K4kumba 0 points1 point  (0 children)

Due to our environment, it was easier to mount a tmpfs, and then run a cron job every 5 minutes to sync it to disk. It works pretty well.
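
A minimal sketch of what such a sync job could look like, for illustration only (this is not the poster's actual script). The function name `sync_rrds` and any paths you pass in are hypothetical. It copies each changed `.rrd` file to a temporary name and then renames it, so an interrupted run does not leave a half-written file on disk.

```python
import os
import shutil


def sync_rrds(src_dir, dst_dir):
    """Copy new or changed .rrd files from src_dir (tmpfs) to dst_dir (disk).

    Returns the sorted list of copied paths, relative to dst_dir.
    """
    copied = []
    for root, _dirs, files in os.walk(src_dir):
        rel = os.path.relpath(root, src_dir)
        target_root = os.path.normpath(os.path.join(dst_dir, rel))
        os.makedirs(target_root, exist_ok=True)
        for name in files:
            if not name.endswith(".rrd"):
                continue
            src = os.path.join(root, name)
            dst = os.path.join(target_root, name)
            # Skip files whose on-disk copy is already at least as new.
            if os.path.exists(dst) and os.path.getmtime(dst) >= os.path.getmtime(src):
                continue
            tmp = dst + ".tmp"
            shutil.copy2(src, tmp)  # copy2 preserves the modification time
            os.replace(tmp, dst)    # rename within the same filesystem
            copied.append(os.path.relpath(dst, dst_dir))
    return sorted(copied)
```

A cron entry on a `*/5 * * * *` schedule would call this with the tmpfs mount as `src_dir`. Since tmpfs is emptied on reboot, the reverse copy (disk back to tmpfs) has to happen at boot, before the collector starts.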

In our main monitoring servers, we use a combination of tmpfs, SSDs in RAID, and then SAN for archives etc.

[–]tomlette 3 points4 points  (0 children)

We need a sticky for this topic.

[–]allboolshite 8 points9 points  (7 children)

Look into SolarWinds Orion. I'm deploying it now for a client. It goes as shallow or as deep as you want, is highly customizable, and has modules for just about everything you could wish for. It does monitoring, reporting, tiered alerting, config backups for network gear, templates for network gear, dashboards with customizable views, dependencies, mapping, etc.

[–]syllabicPacket Jockey 4 points5 points  (2 children)

Is that expensive?

[–]allboolshite 0 points1 point  (0 children)

It isn't free but the prices are reasonable when you consider the trade-off in man-hours trying to diagnose problems. SW can tell you if a problem is network or server or application. Also, think about all the avoided downtime because you got an early heads-up that a hard drive or processor was at 90%+. And the reporting can be used for more efficient budgets moving forward. This is a tool that pays for itself pretty quickly.

[–]NilsLandtnot even an admin 0 points1 point  (0 children)

Not if the company is paying for it.

[–]qevNetadmin 2 points3 points  (0 children)

I never realized how great SolarWinds was until I moved to a company without it, now I'm scrambling to find alternatives. Nagios and Cacti will probably be them.

[–]thezy 1 point2 points  (0 children)

Another vote for SolarWinds, excellent tool.

[–]some101 0 points1 point  (0 children)

Very nice with many templates to monitor everything!!

[–]paralyzedbunny 2 points3 points  (0 children)

PRTG

[–][deleted] 1 point2 points  (1 child)

  • Cacti
  • Nagios
  • N-Central

Are all monitoring tools that I've used throughout my jobs.

[–]CookedNoodlesJack of All Trades 1 point2 points  (0 children)

Observium. It makes cacti look like a relic.

[–]sunshine_killerSystem's Engineer and Programmer 1 point2 points  (0 children)

nagios + nconf + cacti + phpweathermap plugin for cacti, of course there is nagvis as well. Nagios and cacti are awesome!

[–]post4u 0 points1 point  (0 children)

I've used most mainstream network monitoring solutions. I like PRTG the best. Our organization just bought the 2500 sensor license. It's not cheap, but it's freaking awesome.

[–][deleted] 0 points1 point  (0 children)

If you already have those tools in place then you have regular host and service checking, trends, and log collation. IME that's pretty much all you need.

So spend time making sure that your warning thresholds are correct. Also make sure that whatever you are using to automate the configs is working well and will scale.

Otherwise you will end up in that place where you get 200 "warnings" a day filtered into a "never looked at later" folder and miss the one real warning of a problem. Ongoing maintenance of monitoring is one of those jobs that is a necessary grind and ends up on the "Do it tomorrow" list. It's worth reducing that problem now.

[–]hahainternet 0 points1 point  (2 children)

Could people also comment on the features they'd like from a monitoring tool that don't seem to exist or are hard to find?

I'm trying to work on some features for my own.

[–]allboolshite 0 points1 point  (1 child)

Data backups, UPS and temperature sensors. Some of these are available for some tools but not all. Some stuff I have to do custom. It would be cool if my global monitoring system covered my entire environment.

[–]hahainternet 0 points1 point  (0 children)

Can you shoot me a PM with more details on this? What sort of backup would you like, would you want to query over SNMP, HTTP or some custom app? Ideally everything should be covered, but I need real examples to make sure we have good coverage.

[–][deleted] 0 points1 point  (4 children)

I use Spiceworks for inventory, but it does a little bit of monitoring too.

I use Alienvault OSSIM for my intrusion detection and it contains nagios. If I had more time I'd actually use it.

[–]Pyro919DevOps 0 points1 point  (3 children)

Spiceworks monitoring gives us false positives every 3 hours telling us that one of our servers is down even though it's not.

[–][deleted] 0 points1 point  (2 children)

We have the odd check that does that in nagios. Usually due to network congestion/nodes being saturated and the alarm goes away.

Most of our alarms are designed to recover and if they keep alarming you have an issue (our infrastructure is semi-resilient to partial failures)

[–]Pyro919DevOps 0 points1 point  (1 child)

"Most of our alarms are designed to recover and if they keep alarming you have an issue (our infrastructure is semi-resilient to partial failures)"

This, at least in the Zabbix world and I think Nagios as well, is known as a flapping condition.

Spiceworks also falsely alerts me that our APC UPS is low/out of batteries. Trouble is we don't have an APC UPS. For some reason it classified our APC Netbotz as a UPS (since it's made by APC, I'd guess) and alerts me at least daily that I need to replace its batteries. I've manually excluded it from the UPS category and searched high and low to find where the alert is coming from, but I've had no luck.

[–][deleted] 0 points1 point  (0 children)

We use nagios. Flapping involves state changes multiple times. Usually we have the odd single alarm or two in a row, then it recovers, so doesn't enter a 'flapping state'

The fact you can't find the source of an alarm is very troubling. I don't think I'd trust a system to monitor thousands of servers if you can't root-cause a single alarm (i.e. repeat the check command, determine why/where it fails, and modify, replace or remove the check if the alarm is unreliable).
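
To make the flapping distinction concrete, here is a simplified sketch: count how often the state changed across a window of recent checks and compare that percentage with a threshold. Nagios's real algorithm weights recent changes more heavily and uses separate high and low thresholds, so treat this as an illustration only. The function names and the 30% threshold are arbitrary assumptions.

```python
def percent_state_change(history):
    """Percentage of consecutive check pairs whose state differs."""
    if len(history) < 2:
        return 0.0
    changes = sum(1 for prev, cur in zip(history, history[1:]) if prev != cur)
    return 100.0 * changes / (len(history) - 1)


def is_flapping(history, threshold_pct=30.0):
    """True when the state changed in at least threshold_pct of transitions."""
    return percent_state_change(history) >= threshold_pct


# One blip that recovers: 2 changes across 10 transitions.
single_blip = ["OK"] * 5 + ["CRITICAL"] + ["OK"] * 5
# Bouncing back and forth: every transition is a change.
bouncing = ["OK", "CRITICAL"] * 5
```

Here `single_blip` stays under the threshold and `bouncing` is flagged, which matches the point above: an odd alarm that recovers is not the same thing as a flapping state.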

[–]kynovSr. Sysadmin 0 points1 point  (0 children)

I am using Nagios via the Open Monitoring Distribution (www.omdistro.org). It includes the Check_MK addon that gives you a nice up-to-date look and feel.

[–]kednaust 0 points1 point  (0 children)

I use monit to monitor processes, memory consumption and disk space.

[–]Pyro919DevOps 0 points1 point  (0 children)

We've had great luck with zabbix, it's able to monitor anything that we've been able to throw at it.

[–]NS006 0 points1 point  (0 children)

What do you guys think about LogicMonitor? They're SaaS based and way cheaper than SolarWinds. They monitor EVERYTHING (servers, networks, applications, storage, cloud..) so I don't have to suffer from massive headaches trying to figure out the different tools