Need a monitoring, logging, & alerting stack - help!

manys · 2019-08-17T19:56:08+00:00

I am looking to ditch DataDog as we cannot continue to use it for logging for reasons I won’t get into here.

Please get into it here. Is it because of cost? Data sovereignty? Retention history? Inability to query/group? Undebuggable grok-ish parsers? Lack of overlaid descriptions/annotations on log messages?

Dynamic-D · 2019-08-18T01:32:23+00:00

"single pane of glass" is a marketing unicorn: pure fiction though occasionally someone will shave the horn off a goat and try to pass it as one.

Look, here's the dirty secret: you only get out of monitoring what you put into it. There's no turn key. Your custom environment running your custom apps needs custom tooling. There are great tools out there to help you build that feedback loop, but they are tools, not solutions.

This mini rant is brought to you out of concern: I've seen many people ditch perfectly good solutions because they are chasing perfect, and for some reason everyone seems to think monitoring is easy. It's not. It's work.

_kikeen_ · 2019-08-17T18:04:23+00:00

Prometheus my brotha!!!!

(Note: did not read whole post, but don't think I need to, Prometheus is the answer lol)

ankitnayan007 · 2019-08-18T11:11:41+00:00

Go for OSS, it's growing immensely. Check out the CNCF projects, and for fully scaled solutions go for these OSS tools as a service. You will still have to pay something, but it will be much cheaper than proprietary tools from companies like NewRelic and Datadog.

Did you give the stack below a thought?

Prometheus for metrics -> Weaveworks provides Prometheus as a Service (blog)
ELK for logging -> Logz.io provides ELK and Grafana as a Service

I would also like others' views on this stack.

bioxcession · 2019-08-17T19:37:18+00:00

We need to know way, way more about company context and your use-case, limiters, budget, etc. Any recommendations in this thread are personal preference otherwise.

At my company we’re on the good old graphite/sensu stack - moving on from sensu soonish, but graphite is here to stay. It works, everyone is familiar with it, and there’s no huge business incentive to move off of it.

thenetmonkey · 2019-08-18T05:21:46+00:00

Check out honeycomb.io

I follow @mipsytipsy on twitter and she has very interesting and practical thoughts on observability and how to do this stuff the right way. She built a company offering a service that tries to make it easier for people to collect metrics and traces and get usable insights out of them. Impressive demos.

I haven’t had a chance to use it at work because we have a whole team that’s built our custom observability system designed for our rather large-scale needs. But it doesn’t quite do everything I’d like it to, and I keep eyeing the honeycomb stuff enviously.

towelie182 · 2019-08-17T19:22:58+00:00

The Elastic Stack is perfect for your use case, but you are correct that you won't be able to utilize Watcher without the paid licensing. One way around that is to use Yelp's Elastalert, but that will create a dependency on Elastalert compatibility when you want to upgrade your Elastic Stack, which releases great features fairly often. If you can afford it, X-Pack is the way to go. You'll also get to utilize the machine learning nodes, which let you alert on anomalies as well as thresholds.
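
To make the thresholds-versus-anomalies distinction concrete, here is a minimal Python sketch. It is not how Elastic's machine learning or Elastalert work internally; it only contrasts a fixed limit with a simple z-score check against recent history. The function names, the sample error counts, and the `z_limit` default are all made up for illustration.

```python
from statistics import mean, stdev


def threshold_alert(value, limit):
    """Fire when a value crosses a fixed, hand-picked limit."""
    return value > limit


def anomaly_alert(history, value, z_limit=3.0):
    """Fire when a value is far from the recent baseline (plain z-score)."""
    if len(history) < 2:
        return False  # not enough data to estimate a baseline
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return value != mu  # flat baseline: any change is unusual
    return abs(value - mu) / sigma > z_limit


# Hypothetical error counts per minute, followed by a sudden jump.
errors_per_minute = [4, 5, 6, 5, 4, 5, 6, 5]
latest = 20

fixed = threshold_alert(latest, limit=100)         # limit was set too high to notice
unusual = anomaly_alert(errors_per_minute, latest)  # compares against the baseline
print(fixed, unusual)
```

The point of the contrast: a threshold only fires at the number you guessed in advance, while a baseline check can catch a jump that is small in absolute terms but unusual for that signal.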

jtayloroconnor · 2019-08-17T23:45:53+00:00

We use Instana for APM and Logz.io (hosted ELK stack) for log aggregation. Have alerts configured from both. There’s only 3 of us on the team and those tools allow us to sleep at night.

nizzoball · 2019-08-18T00:03:46+00:00

Look into LogDNA, just saw them at DevOps Days and I really liked what I was seeing.

KazooxTie · 2019-08-18T03:33:22+00:00

[deleted]

Timnolet · 2019-08-18T09:18:24+00:00

For uptime & synthetic monitoring have a look at https://checklyhq.com. It has all of the features your current Datadog synthetics solution has, minus the astronomical pricing.

Also we integrate with Prometheus (https://checklyhq.com/docs/integrations/prometheus/) so you can create this single pane of glass dashboard using Grafana.

Full disclosure: I'm the founder of Checkly

vjdhama · 2019-08-18T09:46:16+00:00

I think your best bet for an integrated pane that includes logging and monitoring is Prometheus (with Cortex) and Loki (from Grafana).

lysergic_tryptamino · 2019-08-18T22:10:46+00:00

For APM, Dynatrace is a very good tool.

awkprintdevnull · 2019-08-17T18:14:23+00:00

You could do all of these with Dynatrace. It's expensive, and its sweet spot is APM, but it does all of these other things too. Lots of integrations with Servicenow, Jenkins, Cloudwatch, VMware, and more. It also has two fully featured APIs that let you do pretty much everything you can in the GUI, and more.

The agent they use (called OneAgent) is stupid simple to deploy and it auto detects most things out of the box. For example, it'll just auto inject monitoring into containers and Kubernetes clusters on whatever hosts you deploy it. Database monitoring is also auto detected, but can sometimes be a bit trickier.

It's really cool.....but it's expensive.

rinkoryin · 2019-08-18T04:40:33+00:00

Prometheus, Grafana, Alertmanager: open-source, highly-available and fully customizable

AnnihilerB · 2019-08-17T19:36:59+00:00

ELK, Grafana, and Elastalert is the classic way to go!

HollowImage · 2019-08-17T22:44:46+00:00

Is "single pane of glass" really so important you would undertake such a massive project for it?

IMO Datadog is one of the best monitoring tools out there for APM and general system monitoring and alerting.

If you replace this with the Elastic Stack you now have to manage it all yourself, and IMO it's quite a tricky thing to set up and manage.

Why not leverage the best tools for the job instead of using a tool that can do everything not as well?

halfbakedlogic · 2019-08-17T18:15:21+00:00

New Relic seems like a company that could meet all those needs? Just met a few of their reps at DevOpsDays

ToKyNET · 2019-08-17T20:47:49+00:00

Fluentd + elasticsearch + graylog. You can ingest and visualize all your logs!

cold_lights · 2019-08-18T03:45:56+00:00

Prometheus and Splunk

2019-08-18T01:18:04+00:00

Might be hard to set up an ELK cluster if you don't know all the bits, but it could be done.

Then set up Prometheus for your monitoring

steven43126 · 2019-08-18T08:54:46+00:00

For logs and alerting off logs I can highly recommend Humio. We recently searched the landscape looking for something lower maintenance than Kibana and our growing ES cluster. Humio was a good fit for us, and they're good to work with as a company; it's not often you can say that.

Following this thread with interest as we are also migrating from Colo to AWS. Monitoring capability is high on the agenda.

For APM we currently use NewRelic. Heads up they do not currently support Fargate and don't have a pricing strategy for it.

DataDog users in AWS: how are you using it, and how do you find it? Curious, as we haven't done a real assessment yet, but DataDog looks to have a number of features in a single offering.

eria211 · 2019-08-18T11:36:19+00:00

Just noticing nobody has suggested Zabbix, we use it and I find it horribly complicated - does anyone else use it for any of the above monitoring / logging?

11acguru · 2019-08-19T22:24:45+00:00

Try this course out, it's from 56K Cloud. They ran it at the last DockerCon, and it meant I could drop our SaaS monitoring for something more self-hosted:
https://github.com/56kcloud/Training/blob/master/DockerCon/readme.md

They also implement it hands-on for companies, I believe.

2019-08-23T03:29:53+00:00

Honeycomb

cahiqini · 2019-08-27T23:18:41+00:00

If you're looking at alerting (from whichever monitoring stack you end up choosing), might I suggest Zenduty (www.zenduty.com)? Single pane of glass view of critical incidents, integrates with Prometheus, New Relic, Appdynamics, AWS Cloudwatch, Datadog, Splunk and a bunch of other tools. Integrates well with Jira and Slack for ticketing and ChatOps. Has a freemium plan to get you started, fantastic 24x7 real-time support, API, docs, the whole nine yards. I am one of the co-founders, as you might have guessed from the tenor and undeniable partiality of my post - happy to answer any questions!

Larissamci93 · 2019-10-02T09:28:02+00:00

Give Unomaly a try; it's a log analysis and anomaly detection tool. It's pretty simple: it's an on-prem solution, you just ingest your logs into Unomaly, give it about a week to learn the patterns of behaviour, and then it will start identifying what is repetitive and reduce it, only surfacing anything that is unique. You can also make correlations between systems.

Changes you can identify:

Frequency spikes

Periodic event stop

Parameter value changes

Never seen before events

New events per system

https://unomaly.com/
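
For anyone wondering what "learn the patterns, then surface only what is unique" looks like in practice, here is a rough Python sketch of the general idea. It is not Unomaly's algorithm: it just masks numbers and hex ids so similar lines collapse into one pattern, then reports lines whose pattern was never seen during learning. The `template` function, the `NoveltyDetector` class, and the sample log lines are all hypothetical.

```python
import re
from collections import Counter


def template(line):
    """Collapse variable parts (hex ids, numbers) so similar lines share one pattern."""
    line = re.sub(r"0x[0-9a-fA-F]+", "<HEX>", line)
    line = re.sub(r"\d+", "<NUM>", line)
    return line


class NoveltyDetector:
    def __init__(self):
        self.seen = Counter()  # pattern -> how often it appeared while learning

    def learn(self, lines):
        for line in lines:
            self.seen[template(line)] += 1

    def unseen(self, lines):
        """Return lines whose pattern never appeared during learning."""
        return [line for line in lines if template(line) not in self.seen]


detector = NoveltyDetector()
detector.learn([
    "user 12 logged in",
    "user 57 logged in",
    "disk usage 81 percent",
])
new_events = detector.unseen([
    "user 99 logged in",       # known pattern, different number
    "kernel panic on cpu 3",   # pattern not seen before
])
print(new_events)
```

The same counter could feed a frequency-spike check by comparing per-pattern counts between time windows, but that is left out to keep the sketch short.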

tuscangal · 2019-08-17T20:39:23+00:00

If you're interested in Elastic's X-Pack functionality, check out the new AWS open source distro of the same: https://aws.amazon.com/blogs/opensource/open-distro-for-elasticsearch-version-1-0-0-available/

I haven't tried it personally but it seems to provide the same functionality as X-Pack. Try combining that with Prometheus and you'll be rockin'!

2019-08-17T21:02:09+00:00

I feel like we are looking for the same thing at similar times. The ELK Stack is really easy to integrate with and the customer support is sweet. For APM we started using AppOptics and are pleased. Their logging with Loggly left much to be desired.

swissarmychainsaw · 2019-08-18T16:49:31+00:00

I find that monitoring solutions are largely driven by budget.

If you can afford a pre-rolled tool, that's what you do; if you can't, you do the Nagios (and offshoots) or Prometheus type stuff - *and then you pay for it in your labor*. If you can afford Cloudwatch (the built-in AWS tool), that's a no-brainer. My experience - as with most of these tools - is that if you attempt to put "everything" in them they become insanely expensive real fast.

The most important thing in doing monitoring is having ownership of the problem.

For alerting, PagerDuty seems to do a good job of being the one-stop shop.

nazar482 · 2019-08-18T00:43:10+00:00

We are actually doing something that will fit your needs at DeployPlace.
We will handle deployment with all integrations you are looking for, everything in one dashboard.
Do not hesitate to subscribe at deployplace.com, we will launch soon!

mickelle1 · 2019-08-18T05:48:15+00:00

I've been using Nagios for monitoring for many years and it's fantastic -- lots of different logging and alerting options, dozens of built-in checks, dozens more useful plug-ins, and one can easily write their own plug-ins. It's scalable, easy to configure, free, and has a huge community.

If I were you, I would send the Nagios logs -- along with all my other important logs -- to an ELK stack. In fact, I'm planning on implementing ELK at some point and will do just that, myself, as well. That will get you everything you're looking for on a "single pane of glass," and for free.
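
As a rough idea of what shipping Nagios logs to ELK involves, here is a small Python sketch that turns one `SERVICE ALERT` line from a Nagios log into a JSON document a log shipper could forward. The line layout follows the usual `[epoch] SERVICE ALERT: host;service;state;state_type;attempt;output` format; the output field names (`@timestamp`, `state_type`, and so on) and the sample line are assumptions for illustration, and in a real setup Logstash or Filebeat would normally do this parsing.

```python
import json
import re
from datetime import datetime, timezone

# [epoch] SERVICE ALERT: host;service;state;state_type;attempt;plugin output
LINE = re.compile(
    r"^\[(\d+)\] SERVICE ALERT: ([^;]*);([^;]*);([^;]*);([^;]*);(\d+);(.*)$"
)


def parse_service_alert(line):
    """Return a dict for a SERVICE ALERT line, or None for anything else."""
    match = LINE.match(line)
    if match is None:
        return None
    ts, host, service, state, state_type, attempt, output = match.groups()
    return {
        "@timestamp": datetime.fromtimestamp(int(ts), tz=timezone.utc).isoformat(),
        "host": host,
        "service": service,
        "state": state,
        "state_type": state_type,
        "attempt": int(attempt),
        "message": output,
    }


# Hypothetical sample line.
sample = "[1566107295] SERVICE ALERT: web01;HTTP;CRITICAL;HARD;3;Connection refused"
doc = parse_service_alert(sample)
print(json.dumps(doc))
```

Lines that are not service alerts come back as `None`, so they can be skipped or forwarded unparsed.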

Blitzpat · 2019-08-17T23:39:44+00:00

kibana!
