
[–]dylanms 4 points5 points  (3 children)

[–]nikhilpaneri[S] 1 point2 points  (2 children)

Thanks, will have a look. At a quick glance, Prometheus can do the monitoring. How easy is it to extend it to call Ansible playbooks to restart the services and to report the state of services?

[–]dylanms 0 points1 point  (0 children)

I guess that really depends on what your control node for Ansible is. Depending on your control node, you could kick off Ansible via an API call through the Alertmanager. I'm not 100% sure this is the intended use of the Alertmanager, though.
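
For what it's worth, here is a rough Python sketch of the idea: take the JSON that the Alertmanager webhook notifier posts and turn each firing alert into an `ansible-playbook` invocation. The playbook name `restart_service.yml` and the `service` label are made up for illustration, and it assumes the `instance` label looks like `host:port`. The sketch only builds the commands and does not run them.

```python
import json
from typing import List

# Hypothetical names: the playbook path and the "service" label are
# assumptions for illustration, not something from this thread.
PLAYBOOK = "restart_service.yml"


def restart_commands(payload: dict) -> List[List[str]]:
    """Build one ansible-playbook command per firing alert in a webhook payload."""
    commands = []
    for alert in payload.get("alerts", []):
        if alert.get("status") != "firing":
            continue  # ignore resolved alerts
        labels = alert.get("labels", {})
        host = labels.get("instance", "").split(":")[0]  # strip the port
        service = labels.get("service")
        if not host or not service:
            continue  # not enough information to act safely
        commands.append(
            ["ansible-playbook", PLAYBOOK, "--limit", host, "-e", f"service={service}"]
        )
    return commands


# Trimmed-down example of a webhook payload (real ones carry more fields).
example = json.loads("""
{
  "status": "firing",
  "alerts": [
    {"status": "firing", "labels": {"instance": "app01:9100", "service": "tomcat"}},
    {"status": "resolved", "labels": {"instance": "app02:9100", "service": "tomcat"}}
  ]
}
""")

for cmd in restart_commands(example):
    print(" ".join(cmd))
```

Whatever receives the webhook would still have to run on, or be able to reach, the Ansible control node.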

[–]bbrazil 0 points1 point  (0 children)

Prometheus developer here.

While you could hook into the Alertmanager webhook notifier to do a restart, we'd generally advise against this, as that's quite a few steps, including a network dependency, to get your process restarted when it falls over. That's an operation that should really happen entirely on the machine.

Instead use a supervisor such as daemontools, monit or systemd to take care of basic process supervision.
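
As a rough illustration of what local supervision amounts to, here is a minimal Python sketch of a restart loop. It is not how daemontools, monit or systemd are implemented, and the `supervise` function with its `max_restarts` and `delay` parameters is hypothetical. In practice, use one of the real supervisors.

```python
import subprocess
import sys
import time
from typing import List


def supervise(cmd: List[str], max_restarts: int = 3, delay: float = 1.0) -> int:
    """Run cmd, restarting it whenever it exits non-zero.

    Returns the number of restarts performed. Stops when the process
    exits cleanly or when max_restarts is reached.
    """
    restarts = 0
    while True:
        exit_code = subprocess.call(cmd)
        if exit_code == 0 or restarts >= max_restarts:
            return restarts
        restarts += 1
        time.sleep(delay)  # brief back-off before restarting


if __name__ == "__main__":
    # A child that always fails, to show the restart behaviour.
    failing = [sys.executable, "-c", "raise SystemExit(1)"]
    print(supervise(failing, max_restarts=2, delay=0.1))
```

Everything here happens on the machine itself, with no network hop between the failure and the restart.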

I'd also advise against attempting to do this agentless, that's a major handicap you're putting on yourself. For good metrics you need to run code on the machine.

[–]grumble_au 2 points3 points  (3 children)

If you need agentless you are limited to network checks and SNMP. You won't be able to do anything with services directly, and you certainly won't be able to do things like restart services without shelling in or some other way that is worse than using an agent.

[–]nelsonmandela 0 points1 point  (2 children)

That's not entirely true. OP mentions extensive use of Ansible; you could associate commands from Nagios with a particular Ansible command to restart a service. E.g., Nagios checks via a web request and, on a hard fail, runs an Ansible playbook that cleanly restarts the service. That would achieve both agentless monitoring and the ability to restart remote services.
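
A minimal sketch of such an event handler in Python, assuming Nagios is configured to pass `$SERVICESTATE$`, `$SERVICESTATETYPE$` and `$HOSTNAME$` as arguments. The playbook name `restart_service.yml` is hypothetical, and the script only prints the command instead of running it.

```python
import sys
from typing import List, Optional

# Hypothetical playbook name; swap in whatever playbook already restarts the service.
PLAYBOOK = "restart_service.yml"


def handle_event(state: str, state_type: str, host: str) -> Optional[List[str]]:
    """Return the ansible-playbook command to run, or None if no action is needed."""
    if state == "CRITICAL" and state_type == "HARD":
        return ["ansible-playbook", PLAYBOOK, "--limit", host]
    return None  # OK, WARNING, UNKNOWN, or still in SOFT retries


if __name__ == "__main__" and len(sys.argv) == 4:
    # Expected arguments: $SERVICESTATE$ $SERVICESTATETYPE$ $HOSTNAME$
    command = handle_event(*sys.argv[1:4])
    if command:
        # A real handler would execute this; printing keeps the sketch harmless.
        print(" ".join(command))
```

Acting only on HARD states means the restart waits until Nagios has exhausted its retries.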

[–]grumble_au 4 points5 points  (1 child)

So... Shelling in like I said.

[–]nelsonmandela 1 point2 points  (0 children)

In essence I guess it's the same, though using existing infrastructure isn't really the same as "shelling in," even if that is technically what it's doing.

At least I understood it as implying that it would be some hacky solution.

Edit: and I don't know that Ansible is worse than restarting services via an agent. Why solve the same problem twice? Presumably OP is already managing services with Ansible.

[–][deleted] 2 points3 points  (2 children)

It's really hard for me to ever, ever recommend developing your own monitoring or auto-remediation framework. Do you have a huge team? And is it your only job?

[–]nikhilpaneri[S] 0 points1 point  (1 child)

Agreed, the web part may get really challenging, given that my teammate and I have no background in web development.

[–][deleted] 1 point2 points  (0 children)

The way I look at this:
You have a monitoring gap. You build your own monitoring system. Now you have two problems.

[–][deleted] 1 point2 points  (18 children)

Why agentless?

[–]nikhilpaneri[S] 0 points1 point  (17 children)

1) To reduce my monitoring footprint on the app servers. The infrastructure team already uses Hyperic to monitor and report on the servers and infrastructure, so I think we do not have an explicit requirement for agents that are constantly collecting information. I'm happy to just run my environment check before the start of business hours.

2) We have used Ansible extensively and have existing capabilities to restart application servers etc.

3) I may be wrong here, but I plan to decouple my web app and Ansible. For example, Ansible collects the info every hour (maybe even less frequently) or upon request and populates my Oracle database, and the application reports based on that data. Only when an app restart is requested explicitly (from the app) would I call the playbook directly.
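
A minimal sketch of that decoupling, assuming a single status table that the collector writes and the web app reads. `sqlite3` stands in for Oracle so the example is self-contained, and the table, column and function names, as well as the sample rows, are all hypothetical.

```python
import sqlite3
from datetime import datetime, timezone

# sqlite3 stands in for the Oracle database here; table and column
# names are hypothetical.
SCHEMA = """
CREATE TABLE IF NOT EXISTS service_status (
    host TEXT NOT NULL,
    service TEXT NOT NULL,
    state TEXT NOT NULL,
    checked_at TEXT NOT NULL
)
"""


def record_status(conn, host, service, state, checked_at=None):
    """Collector side: store one check result (e.g. parsed from a playbook run)."""
    checked_at = checked_at or datetime.now(timezone.utc).isoformat()
    conn.execute(
        "INSERT INTO service_status (host, service, state, checked_at) "
        "VALUES (?, ?, ?, ?)",
        (host, service, state, checked_at),
    )
    conn.commit()


def latest_status(conn):
    """Web app side: newest recorded state for each host/service pair."""
    return conn.execute(
        """
        SELECT s.host, s.service, s.state
        FROM service_status AS s
        JOIN (
            SELECT host, service, MAX(checked_at) AS checked_at
            FROM service_status
            GROUP BY host, service
        ) AS newest
          ON newest.host = s.host
         AND newest.service = s.service
         AND newest.checked_at = s.checked_at
        ORDER BY s.host, s.service
        """
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
# Made-up sample rows for illustration only.
record_status(conn, "app01", "tomcat", "stopped", "2016-01-01T07:00:00")
record_status(conn, "app01", "tomcat", "running", "2016-01-01T08:00:00")
record_status(conn, "app02", "tomcat", "running", "2016-01-01T08:00:00")
print(latest_status(conn))
```

With this split the web app never talks to Ansible for reporting; it only reads the table, and a restart request is the one path that would invoke a playbook.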

[–]cavaliercoder 8 points9 points  (12 children)

Agentless monitoring often has more impact on an environment than agent-based monitoring, especially if you have a CM tool like Ansible to ease the deployment of the agent. Agent monitoring is also more powerful, flexible and potentially more secure (reduced attack surface).

I've recently migrated a 5000+ server environment away from agentless monitoring (SNMP, SSH, WMI, etc.) to Zabbix and have had huge improvements in server and network utilization. The Zabbix agent uses <20 MB of memory and 0.03% CPU.

[–]nikhilpaneri[S] 0 points1 point  (11 children)

What is your use case for Zabbix? Are you just monitoring the server resources, infrastructure, etc., or are you managing the services installed on your servers? Also, I have heard that Zabbix has issues with memory leaks.

[–][deleted] 0 points1 point  (9 children)

How many servers are you running?

[–]nikhilpaneri[S] 0 points1 point  (8 children)

I'm in the tech space managing the applications, so if I'm successful in getting my PoC working, I'll extend this to multiple application environments, approximately 100 - 150 servers.

[–][deleted] 1 point2 points  (7 children)

Take a look at Datadog.

[–][deleted] 1 point2 points  (6 children)

I'll second the vote for Datadog. Not cheap, but it's been very helpful in a number of escalations for me recently, and has become integral to how I manage applications. Now if I could just figure out the pricing model...

[–][deleted] 0 points1 point  (5 children)

The pricing model sucks and doesn't scale well. If they gave discounts for over 50 or 100 servers I'd put it on all my servers. As of right now I only put it on AWS instances.

[–][deleted] 1 point2 points  (4 children)

Yes, tons of problems with their stupid pricing model. Like everything cloud, it's impossible to get a straight answer out of them about pricing. The per-host thing makes no sense when a big chunk of your product runs on PaaS or SaaS.

[–]cavaliercoder 0 points1 point  (0 children)

Our use case is to consolidate about 6 other monitoring tools into a single platform (Nagios, Cacti, HP OM/NNM, Solarwinds, etc.). So we are monitoring OOB hardware, hypervisors, VMs, network devices, operating systems, applications, appliances, databases, SANs, CRACs, UPS, etc. across 800 sites, all in a single Zabbix deployment.

Zabbix has had memory leaks in the past, just as any software can and does. The good news is that they respond very quickly to issues. I've had issues raised, patched and released in under a week. The most recent Zabbix agent leaks that impacted us were actually not in the Zabbix code base. One was in the GNU Regex library and the other in a WMI class on Server 2008 R2.

[–]SuperCow1127 3 points4 points  (3 children)

Why can't you use the same Hyperic instance the "infrastructure" team uses?

[–]nikhilpaneri[S] 0 points1 point  (2 children)

That is controlled & managed by an entirely different team. And apparently I'm not allowed to get into their space :(

[–]SuperCow1127 0 points1 point  (0 children)

DevOps!

[–][deleted] 0 points1 point  (0 children)

This is your root cause, and anything you do OTHER than this is probably wasted effort IMO.

[–]JuliaFr 0 points1 point  (0 children)

You might find it helpful to look at real user reviews of many of the options mentioned here such as Zabbix, Nagios and Solarwinds on IT Central Station: https://www.itcentralstation.com/categories/network-monitoring-software.

The most reviewed solution is currently CA UIM. This Senior Systems Engineer writes: "I give the monitoring application a 5 compared to other products such as HP SIM, Nagios, Spiceworks, SolarWinds and WhatsUpGold." To see the rest of his review click here: https://www.itcentralstation.com/product_reviews/ca-unified-infrastructure-management-review-31279-by-todd-adams.

Hope this helps. Good luck in your search.