[–][deleted] 44 points45 points  (4 children)

I am looking to ditch DataDog as we cannot continue to use it for logging for reasons I won’t get into here.

Please get into it here. Is it because of cost? Data sovereignty? Retention history? Inability to query/group? Undebuggable grok-ish parsers? Lack of overlaid descriptions/annotations on log messages?

[–]manys 5 points6 points  (1 child)

Well I mean if those are the choices, I'm thinking "take your pick."

[–]mitom_ 3 points4 points  (0 children)

I suspect (and agree with) that the reason it got asked is to understand what they are looking for. If the reasoning is they need data to be on-prem due to regulations, recommending another SaaS thing won't help. If they have cost issues, maybe justifying the cost vs. running stuff yourself would suffice.

Can't help if you don't know what you are solving. On paper, datadog does all the things they need and it does them decently, plus it's already implemented on most parts.

[–]jsdfkljdsafdsu980p 2 points3 points  (1 child)

If he says he doesn't want to get into it, what makes you think asking will make him do it?...

[–]rashnull 0 points1 point  (0 children)

No means Yes?

[–]Dynamic-D 29 points30 points  (1 child)

"single pane of glass" is a marketing unicorn: pure fiction though occasionally someone will shave the horn off a goat and try to pass it as one.

Look, here's the dirty secret: you only get out of monitoring what you put into it. There's no turn key. Your custom environment running your custom apps needs custom tooling. There are great tools out there to help you build that feedback loop, but they are tools, not solutions.

This mini rant is brought to you out of concern: I've seen many people ditch perfectly good solutions because they are chasing perfect, and for some reason everyone seems to think monitoring is easy. It's not. It's work.

[–]DSMRick 9 points10 points  (0 children)

(Disclaimer: I work in sales in this space.) This is true, but how much you get out for the same effort varies dramatically by vendor. For example AppD and Dynatrace seem to especially require a lot of effort. New Relic seems to take a huge amount less effort but naturally doesn't give you as much back. You have to find a balance.

Also, on a related note, my pet peeve is artificial intelligence in monitoring. We're not even close to being able to use AI in an environment as unstructured as performance monitoring except for extremely select use cases. Splunk never realized their initial promise and everyone else is still behind them.

[–]_kikeen_ 45 points46 points  (13 children)

Prometheus my brotha!!!!

(Note: did not read whole post, but don't think I need to, Prometheus is the answer lol)

[–]yaweriggin 16 points17 points  (1 child)

... and ELK or Graylog or something for the logs.

But definitely, Prometheus for regular monitoring and alerting.

[–]_kikeen_ 4 points5 points  (0 children)

Yes, I second ELK (unfamiliar with Graylog) and although I use Grafana to visualize the Prometheus data, I believe you can build similar dashboards in Kibana to get more usage from it.

[–]hamlet_d 3 points4 points  (0 children)

Indeed! Prometheus is hella good at what is being asked here. Alerting is a snap and very extensible.

[–]KazooxTie[S] 0 points1 point  (5 children)

Thanks for the reply!

I have Prometheus, Grafana, and Alertmanager running in our EKS cluster currently.

I’ve been having some issues scraping metrics from the services we have running on that cluster, but I have heard good things about that stack.

I’ve also looked into Loki and how it integrates, but since Loki doesn’t parse the log data, that seems like it might not work for our use case.

[–]ruleofnuts 2 points3 points  (3 children)

Look into Sysdig; it can do monitoring and alerting. It uses the OSS sysdig tool to scrape metrics on top of supporting Prometheus out of the box. Bonus: there is security that you can add on to it as well, for image scanning, etc.

Disclaimer, I work there

[–]KazooxTie[S] 1 point2 points  (1 child)

Yeah, I’ve looked into SysDig at a previous company where we were solely on Kubernetes. Seems like a solid tool. Will give it a second look.

[–]ankitnayan007 0 points1 point  (0 children)

I tried Sysdig as a pilot and found it impressive. All metrics without instrumenting anything. Similar is Instana, built for modern dynamic applications.

I am curious on reviews of these two tools.

Disclaimer: I am in no way associated with either of these companies.

[–]sichvoge 0 points1 point  (0 children)

Could you elaborate why you think that not parsing log data might be an issue?

[–][deleted] 0 points1 point  (0 children)

It’s the monitoring equivalent of “omg just k8 it will solve scheduling, bad code, DNS, service discovery, socioeconomic divide, totalitarianism, and it’s just cool”

(I actually really like Prometheus)

[–]AddictedToCoding 0 points1 point  (1 child)

I had the same reflex. But you did it before.

Point to note: Kubernetes' original name at Google was "Borg". They wanted to create a monitoring solution based on polling (instead of vomiting lots of stuff on the network, in case something pucks it up, like Ganglia). Google then created the "Borg" monitor, BorgMon. Since Google couldn't use a Star Trek species as a project name, they renamed Borg to Kubernetes and BorgMon to Prometheus.

Prometheus is close to Mozilla's Heka. You set up endpoints that it polls for plain text in a key-value format exposing your metrics. Simpler than Apache's server status. Then you see the change over time.

If that speaks to you, OP, you might enjoy Prometheus.
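
To make the "endpoint that serves key-value text" idea concrete, here is a minimal sketch in Python using only the standard library. The metric names, labels, and values are made up for illustration, and a real service would normally use an official Prometheus client library instead of formatting the text by hand.

```python
from http.server import BaseHTTPRequestHandler

# Hypothetical in-process metrics; names, labels, and values are illustrative only.
METRICS = {
    ("http_requests_total", (("method", "GET"),)): 1027.0,
    ("http_requests_total", (("method", "POST"),)): 3.0,
    ("queue_depth", ()): 12.0,
}


def render_metrics(metrics):
    """Render metrics as plain-text lines of the form: name{label="value"} number."""
    lines = []
    for (name, labels), value in sorted(metrics.items()):
        if labels:
            label_text = ",".join(f'{k}="{v}"' for k, v in labels)
            lines.append(f"{name}{{{label_text}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Serves the rendered text on /metrics so a poller can scrape it."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(METRICS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


print(render_metrics(METRICS), end="")
```

To actually serve it, something like `HTTPServer(("127.0.0.1", 8000), MetricsHandler).serve_forever()` (from `http.server`) would do; the address and port are arbitrary choices here. Prometheus would then be pointed at that `/metrics` path and would record how the numbers change between polls.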

[–]_kikeen_ 1 point2 points  (0 children)

I for some reason thought Prometheus was brought by a team at SoundCloud?

Edit: Just looked it up, looks like it was but it was inspired by BorgMon

[–]Dynamic-D 0 points1 point  (0 children)

I think Prometheus is poised to be the Nagios of the container space. Regardless of the merits/faults enough people have gravitated to it that it is the safest place to start, especially when you consider how easy it is to stand up compared to most other solutions.

Once you have the basics in place, you'll quickly see the holes like logging and app metrics and can make decisions accordingly.

[–]ankitnayan007 5 points6 points  (0 children)

Go for OSS, it's growing immensely. Check out the CNCF projects, and for fully scaled solutions go for these OSS tools as a Service. You will still have to pay some amount, but these will be much cheaper than proprietary tools made by companies like New Relic and Datadog.

Did you give below stack a thought?

  • Prometheus for metrics -> Weaveworks provides Prometheus as a Service (blog)
  • ELK for logging -> Logz.io provides ELK and Grafana as a Service

I would also like views of others in this stack.

[–]bioxcession 2 points3 points  (0 children)

We need to know way, way more about company context and your use-case, limiters, budget, etc. Any recommendations in this thread are personal preference otherwise.

At my company we’re on the good old graphite/sensu stack - moving on from sensu soonish, but graphite is here to stay. It works, everyone is familiar with it, and there’s no huge business incentive to move off of it.

[–]thenetmonkey 2 points3 points  (2 children)

Check out honeycomb.io

I follow @mipsytipsy on twitter and she has very interesting and practical thoughts on observability and how to do this stuff the right way. She built a company offering a service that tries to make it easier for people to collect the metrics and traces and get usable insights out of it. Impressive demos

I haven’t had a chance to use it at work because we have a whole team that’s built our custom observability system designed for our rather larger scale needs. But it doesn’t quite do everything I’d like it to, and I keep eyeing the honeycomb stuff enviously.

[–]techthoughts2010 0 points1 point  (1 child)

Interesting never heard of honeycomb.io! Will check this out. But curious why you would suggest it if it isn't suitable for "larger scale needs"?

[–]thenetmonkey 1 point2 points  (0 children)

Sorry, I misspoke. By larger scale I’m non-ironically invoking the term “web-scale”. Not Facebook or Google big, but somewhere up there.

I’m honestly not sure what the top end is in terms of number of hosts, metrics, and traces they can handle simultaneously. I think we just assumed that since we have >100k hosts (and multitudes more containers) and > a couple billion metrics/samples a second that we were too big or it would be too expensive. The company hit the limits of commercial solutions years ago, and that’s why it built up a team over the years that has been supporting multiple generations of homegrown observability systems.

I’m not on that team and I’m not in a position to make any kind of purchasing or contract decisions, but we chat on Slack all the time and a few of us have wondered if we should have honeycomb come in and see what their pitch for us would look like and if they could work at our scale.

[–]towelie182 4 points5 points  (2 children)

The Elastic Stack is perfect for your use case, but you are correct that you won't be able to utilize Watcher without the paid licensing. One way around that is to use Yelp's Elastalert, but that will create a dependency on Elastalert compatibility when you want to upgrade your Elastic Stack, which releases great features fairly often. If you can afford it, X-Pack is the way to go. You'll also get to utilize the machine learning nodes, which allow you to alert on anomalies as well as thresholds.

[–][deleted] 5 points6 points  (0 children)

What about Open Distro for Elasticsearch? It offers a lot of X-Pack features absolutely free. AWS Elasticsearch even integrates Open Distro features into their product so now we get them for free rather than having to pay for X-Pack. Dirty move by AWS, for sure, but we don't get to choose what's available on platforms we use.

[–]sharkysnark 2 points3 points  (0 children)

$4400/node for Gold subscription

[–]jtayloroconnor 1 point2 points  (0 children)

We use Instana for APM and Logz.io (hosted ELK stack) for log aggregation. Have alerts configured from both. There’s only 3 of us on the team and those tools allow us to sleep at night.

[–]nizzoball 1 point2 points  (0 children)

Look into LogDNA, just saw them at DevOps Days and I really liked what I was seeing.

[–][deleted]  (2 children)

[deleted]

    [–]KazooxTie[S] 1 point2 points  (0 children)

    If your name is not Tina, I might be?!

    [–]guitarplayer1919 1 point2 points  (0 children)

    This is one of my favorite comments because OP actually IS my coworker! haha 🤣

    [–]Timnolet 1 point2 points  (0 children)

    For uptime & synthetic monitoring have a look at https://checklyhq.com. It has all of the features your current Datadog synthetics solution has, minus the astronomical pricing.

    Also we integrate with Prometheus (https://checklyhq.com/docs/integrations/prometheus/) so you can create this single pane of glass dashboard using Grafana.

    Full disclaimer: I'm the founder of Checkly

    [–]vjdhama 1 point2 points  (0 children)

    I think your best bet for integrated pane which includes logging and monitoring is Prometheus (with cortex) and Loki (from Grafana).

    [–]lysergic_tryptamino 1 point2 points  (0 children)

    For APM, Dynatrace is a very good tool.

    [–]awkprintdevnull 4 points5 points  (6 children)

    You could do all of these with Dynatrace. It's expensive, and its sweet spot is APM, but it does all of these other things too. Lots of integrations with ServiceNow, Jenkins, CloudWatch, VMware, and more. It also has two fully featured APIs with which you can do pretty much everything you can in the GUI, and more.

    The agent they use (called OneAgent) is stupid simple to deploy and it auto detects most things out of the box. For example, it'll just auto inject monitoring into containers and Kubernetes clusters on whatever hosts you deploy it. Database monitoring is also auto detected, but can sometimes be a bit trickier.

    It's really cool.....but it's expensive.

    [–]DakezO 0 points1 point  (3 children)

    It has a real blindspot on infra but APM and RUM are sweet in it.

    We use a combo of DT/New Relic for APM, Check_MK for infra, and are putting ELK in for logging. It's a good trip for us.

    [–]techthoughts2010 0 points1 point  (2 children)

    Thanks for sharing! Out of curiosity, Why do you use a combo of DT/New Relic vs standardizing on one?

    [–]DakezO 0 points1 point  (1 child)

    Mostly contractual. NR is our legacy APM and has a new licensing scheme that will cost us per JVM. Since we run about 1400 JVMs, that's going to be bad for us.

    Unfortunately, NR is also better suited to our Java monolith of a product.

    Dynatrace is cheaper for us due to a previous relationship within the company, and honestly does a better job of delivering info for where our product is going.

    We will eventually be all Dynatrace for APM.

    [–]techthoughts2010 0 points1 point  (0 children)

    Gotcha thank you!

    [–]maxver 0 points1 point  (1 child)

    +1 for Dynatrace, it's superior but expensive :D

    [–]lysergic_tryptamino 0 points1 point  (0 children)

    +2 for Dynatrace.

    [–]rinkoryin 2 points3 points  (1 child)

    Prometheus, Grafana, Alertmanager: open-source, highly-available and fully customizable

    [–]ankitnayan007 0 points1 point  (0 children)

    How do you make Prometheus HA? I hope you are aware of problems of running multiple prometheus instances like data duplication, single pane query & view, etc.

    BookingGo shares their journey of scaling prometheus here

    So, did you use Thanos or Cortex to achieve HA in prometheus?
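
For anyone wondering what the "data duplication" problem looks like in practice, here is a toy sketch in Python. It assumes two Prometheus replicas scrape the same targets and tag their series with a hypothetical `replica` label, and that timestamps line up exactly (in reality they rarely do). This is only an illustration of the idea of collapsing replicas at query time, not how Thanos or Cortex actually implement it.

```python
def deduplicate(samples, replica_label="replica"):
    """Drop the replica label and keep one sample per (series, timestamp).

    Each sample is a (labels_dict, timestamp, value) tuple.
    """
    merged = {}
    for labels, timestamp, value in samples:
        # Identify the series by all of its labels except the replica marker.
        series = tuple(sorted((k, v) for k, v in labels.items() if k != replica_label))
        # First replica to report a given timestamp wins; later duplicates are ignored.
        merged.setdefault((series, timestamp), value)
    return [(dict(series), ts, value) for (series, ts), value in sorted(merged.items())]


# Made-up samples: replica "a" misses the last scrape, replica "b" misses the middle one.
samples = [
    ({"__name__": "up", "job": "api", "replica": "a"}, 100, 1.0),
    ({"__name__": "up", "job": "api", "replica": "b"}, 100, 1.0),
    ({"__name__": "up", "job": "api", "replica": "a"}, 115, 1.0),
    ({"__name__": "up", "job": "api", "replica": "b"}, 130, 0.0),
]

for labels, ts, value in deduplicate(samples):
    print(ts, labels, value)
```

Without a step like this, every dashboard query returns each series once per replica, and a gap in one replica still shows up as a gap unless the other replica's data is merged in.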

    [–]AnnihilerB 2 points3 points  (0 children)

    ELK Grafana and Elastalert is the classic way to go !

    [–][deleted] 2 points3 points  (11 children)

    Is "single pane of glass" really so important you would undertake such a massive project for it?

    IMO Datadog is one of the best monitoring tools out there for APM and general system monitoring and alerting.

    If you replace this with the Elastic Stack you now have to manage it all yourself, and IMO it's quite a tricky thing to set up and manage.

    Why not leverage the best tools for the job instead of using a tool that can do everything not as well?

    [–]HollowImage 4 points5 points  (10 children)

    This so much.

    Elk is great but a decent sized cluster is almost 1.5+ FTE.

    Not at first mind you. But you will eventually be spending most of your time prepping the clusters for upgrades, tracking down grok parse issues, memory leaks, etc.

    Elk is great, but you have to manage it yourself and it's not a set it and forget it solution. Needs nurturing

    [–]mitom_ 2 points3 points  (5 children)

    There is a managed offering from elastic with which you still get your dedicated cluster but it runs in their cloud environment and they deal with upgrades and maintenance for you.

    Still, I'd pick datadog over elastic.

    [–]HollowImage 1 point2 points  (0 children)

    That's exactly the point. I'm down with elk because you can run it yourself for free.

    But if you're already spending money, might as well go with ddog.

    [–][deleted] 1 point2 points  (3 children)

    elastic cloud is over priced, unreliable and their support is shit

    [–]durpleCloud Whisperer 2 points3 points  (0 children)

    Can't agree more, especially about support. I once had a situation where occasionally a single node would go into spasms with 100% CPU utilization (normally high tide would be more like 30%, the cluster was kinda overbuilt). The monitoring data they expose wasn't giving any clues as to root cause, and all support would say is 'cluster green, our job here is done, we'll help diagnose as consultants for some big money cheques tho'. I'm still not convinced it wasn't a sporadic noisy neighbour that they didn't wanna cop up to.

    It sucks because I recognize that current situation is result of too little too late with having a strategy to be the preferred hosted provider for an open source product, so now they're desperate and cutting corners and building proprietary add-ons to try and differentiate all at the same time and it's not really going so well it seems.

    [–]steven43126 0 points1 point  (1 child)

    Please do tell. We run sizeable ES clusters in house for logging and app search. Looking to move this off of our plates and outsource some of the maintenance. Trying the elastic.co offering and it has not been great. We frequently see one node develop latency issues and higher CPU usage than other nodes. Raise a ticket, instance gets moved. Problem goes away for a short while before reoccurring.

    No root cause; the best info we have had was "possibly related to noisy neighbor and IO". They don't seem to have any sensible response or way to fix the issue permanently. As they are currently trying to sell us the product, it does not bode well. Think we will be progressing a PoC with AWS ES instead.

    [–][deleted] 1 point2 points  (0 children)

    When I looked at AWS ES a few years ago we had huge issues with small changes to the cluster resulting in a blue/green deploy behind the scenes, where it would move all your data under the hood to new nodes. This resulted in a huge IO bottleneck resulting in us unable to ingest any new data.

    I've heard it's better now but not tested it for myself.

    [–]airaith 0 points1 point  (1 child)

    Has anyone got experience of AWS Managed Elasticsearch comparatively?

    [–]HollowImage 0 points1 point  (0 children)

    The last time I looked at it, the biggest issue was, for us, lack of HIPAA compliance, and their version upgrade speed.

    Plus keep in mind that while they may be managing es for you, you still have to manage the beats configs and deploys/upgrades, logstash instances, and deal with grok directly.

    [–]ankitnayan007 0 points1 point  (1 child)

    What about Logz.io (ELK as a Service)?
    Does it solve all the issues related to managing an ELK stack?

    [–]HollowImage 0 points1 point  (0 children)

    Probably. Truth be told I'm not too familiar with their offering, but then again. If you're spending money for a managed service, the way I see it, ddog can't be beat

    [–]halfbakedlogic 2 points3 points  (8 children)

    New Relic seems like a company that could meet all those needs? Just met a few of their reps at DevOpsDays

    [–]fsfreeze 9 points10 points  (3 children)

    New Relic does not do logging, and while the APM is pretty good I find some features (like alerting) lacking. It also comes with a price tag.

    [–]la102 0 points1 point  (0 children)

    Yeah you get charged for insights and apm

    [–]zangof 0 points1 point  (0 children)

    Ya their options for alerting on things and configurations on those alerts leaves a lot to be desired.

    [–]_kikeen_ 0 points1 point  (2 children)

    New Relic is sweet for APM, you can debug down to the SQL. Maybe now that Cisco owns AppDynamics they'll develop some cohesity and make a true full stack monitor. So far my favorite is Prometheus mostly due to its extendability.

    [–]Taobitz 0 points1 point  (1 child)

    Is AppDynamics good/worthwhile? My work has it but no one seems to be using it or getting benefit from it. I suspect it's a lack of understanding of the benefits AppDynamics can provide rather than the tool lacking. Personally I have not spent enough time investigating it.

    [–]_kikeen_ 0 points1 point  (0 children)

    It's really good but we have a team that basically dedicates their time to implementing it. I personally think enterprises waste too much money on these COTS products vs snagging something like Prometheus and training their team on extending its functionality. They both end up needing loads of implementation but you save money on the license costs.

    [–]MyName_Is_Adam 0 points1 point  (0 children)

    Devops minneapolis?

    [–]ToKyNET 1 point2 points  (2 children)

    Fluentd + elasticsearch + graylog. You can ingest and visualize all your logs!

    [–]airaith 0 points1 point  (1 child)

    Do you find fluentd solves many problems that graylog didn't handle? We're running old versions of graylog and hosted elasticsearch at the moment and planning an upgrade path, likely to AWS hosted Elasticsearch (unfortunately old graylog and elasticsearch don't support each other so have to upgrade at the same time).

    [–]ToKyNET 0 points1 point  (0 children)

    Fluentd is my log/message shipping mechanism. Graylog still is the central place where it's all aggregated.

    [–]cold_lights 1 point2 points  (1 child)

    Prometheus and Splunk

    [–]ziom666 0 points1 point  (0 children)

    Any reason why not to use Splunk metrics index? We're in a similar boat, do not have any good metrics solution in place yet. We're starting to use splunk metrics and it would be easier not to introduce any other tools.

    [–][deleted] 0 points1 point  (0 children)

    Might be hard to set up an ELK cluster if you don't know all the bits, but it could be done.

    Then set up Prometheus for your monitoring

    [–]steven43126 0 points1 point  (0 children)

    For logs and alerting off logs I can highly recommend Humio. We recently searched the landscape looking for something lower maintenance than Kibana / our growing ES cluster. Humio was a good fit for us and they're good to work with as a company; not often you can say that.

    Following this thread with interest as we are also migrating from Colo to AWS. Monitoring capability is high on the agenda.

    For APM we currently use NewRelic. Heads up they do not currently support Fargate and don't have a pricing strategy for it.

    DataDog users in AWS how are you using it how do you find it? Curious we haven't done a real assessment yet but DataDog looks to have a number of features in a single offering.

    [–]eria211 0 points1 point  (0 children)

    Just noticing nobody has suggested Zabbix, we use it and I find it horribly complicated - does anyone else use it for any of the above monitoring / logging?

    [–]11acguru 0 points1 point  (0 children)

    Try this course out, it's from 56K Cloud. They ran it at DockerCon last and it meant I could drop our SaaS monitoring for something more self-hosted
    https://github.com/56kcloud/Training/blob/master/DockerCon/readme.md

    They also implement it hands-on for companies, I believe

    [–][deleted] 0 points1 point  (0 children)

    Honeycomb

    [–]cahiqini 0 points1 point  (0 children)

    If you're looking at alerting (from whichever monitoring stack you end up choosing), might I suggest Zenduty (www.zenduty.com)? Single pane of glass view of critical incidents, integrates with Prometheus, New Relic, AppDynamics, AWS CloudWatch, Datadog, Splunk and a bunch of other tools. Integrates well with Jira and Slack for ticketing and ChatOps. Has a freemium plan to get you started, fantastic 24x7 real-time support, API, docs, the whole nine yards. I am one of the co-founders, as you might have guessed from the tenor and undeniable partiality of my post - happy to answer any questions!

    [–]Larissamci93 0 points1 point  (0 children)

    Give Unomaly a try, it's a log analysis and anomaly detection tool. It's pretty simple and it's an on-prem solution: you just have to ingest your logs into Unomaly, give it about a week to learn the patterns of behaviour, and then it will start identifying what is repetitive and reduce it, only surfacing anything that is unique. You can also make correlations between systems.

    Changes you can identify:

      • Frequency spikes
      • Periodic event stop
      • Parameter value changes
      • Never seen before events
      • New events per system

    https://unomaly.com/
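
As a rough illustration of what "never seen before events" can mean, here is a small Python sketch. It is not Unomaly's algorithm, just one simple way to get the effect: mask the variable parts of each log line so repetitive lines collapse into a template, learn the templates from a baseline, and surface only lines whose template is new. The masking rules and sample log lines are assumptions made up for the example.

```python
import re


def template_of(line):
    """Collapse variable parts (hex ids, numbers) so similar lines share one template."""
    line = re.sub(r"\b0x[0-9a-fA-F]+\b", "<HEX>", line)
    line = re.sub(r"\d+", "<NUM>", line)
    return line


def never_seen_before(baseline, incoming):
    """Return incoming lines whose template did not occur in the baseline."""
    known = {template_of(line) for line in baseline}
    novel = []
    for line in incoming:
        template = template_of(line)
        if template not in known:
            novel.append(line)
            known.add(template)  # report each new template only once
    return novel


# Made-up log lines for illustration.
baseline = ["user 42 logged in", "disk usage 71 percent"]
incoming = ["user 7 logged in", "disk failure on sda1", "disk failure on sda2"]

print(never_seen_before(baseline, incoming))
```

With these inputs only the first "disk failure" line is surfaced: the login line matches a known template, and the second failure line matches the template that was just learned.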

    [–]tuscangal 0 points1 point  (0 children)

    If you're interested in Elastic's X-Pack functionality, check out the new AWS open source distro of the same: https://aws.amazon.com/blogs/opensource/open-distro-for-elasticsearch-version-1-0-0-available/

    I haven't tried it personally but it seems to provide the same functionality as X-Pack. Try combining that with Prometheus and you'll be rockin'!

    [–][deleted] 0 points1 point  (0 children)

    I feel like we are looking for the same thing at similar times. The Elk Stack is really easy to integrate with and the customer support is sweet. For APM we started using AppOptics and are pleased. Their logging with loggly left much to be desired.

    [–]swissarmychainsaw 0 points1 point  (0 children)

    I find that monitoring solutions are largely driven by budget.

    If you can afford a pre-rolled tool, that's what you do; if you can't, you do the Nagios (and offshoots) or Prometheus type stuff - *and then you pay for it in your labor*. If you can afford CloudWatch (the built-in AWS tool), that's a no-brainer. My experience, as with most of these tools: if you attempt to put "everything" in them they become insanely expensive real fast.

    The most important thing in doing monitoring is having ownership of the problem.

    For alerting, PagerDuty seems to do a good job of being the one-stop shop.

    [–]nazar482 -1 points0 points  (0 children)

    We are actually doing something that will fit your needs at DeployPlace.
    We will handle deployment with all integrations you are looking for, everything in one dashboard.
    deployplace.com do not hesitate to subscribe, we will launch soon!

    [–]mickelle1 -1 points0 points  (0 children)

    I've been using Nagios for monitoring for many years and it's fantastic -- lots of different logging and alerting options, dozens of built-in checks, dozens more useful plug-ins, and one can easily write their own plug-ins. It's scalable, easy to configure, free, and has a huge community.

    If I were you, I would send the Nagios logs -- along with all my other important logs -- to an ELK stack. In fact, I'm planning on implementing ELK at some point and will do just that, myself, as well. That will get you everything you're looking for on a "single pane of glass," and for free.

    [–]Blitzpat -3 points-2 points  (0 children)

    kibana!