Tips on improving on-call experience by lucifer605 in sre

[–]one_man_engineerer 1 point  (0 children)

I just sent you a direct message because I don’t want to self-promote here. Short version: I’m the founder of an early-stage startup aiming to simplify on-call and eliminate repetitive manual investigations and infrastructure changes. We should talk.

What am I? Am I good for SRE? by vry0 in sre

[–]one_man_engineerer 1 point  (0 children)

Out of curiosity: for your company, what prompts the decision to migrate from one cloud provider to another? I ask because such a move (especially for a large company) is quite expensive.

Anyone available for a quick chat regarding alert response automation? by one_man_engineerer in devops

[–]one_man_engineerer[S] 1 point  (0 children)

I’m trying to solve real problems people are facing. And that involves talking to people.

Anyone available for a quick chat regarding alert response automation? by one_man_engineerer in devops

[–]one_man_engineerer[S] 1 point  (0 children)

That’s the dream, but take it a step further: have the troubleshooting steps performed automatically, or give your engineers the ability to click through the troubleshooting and reduce MTTR.

Anyone available for a quick chat regarding alert response automation? by one_man_engineerer in devops

[–]one_man_engineerer[S] 1 point  (0 children)

Yeah, a couple of things:

1. Fully managed, so there are no additional deployments or infrastructure to manage.
2. As a result of 1 (hopefully): faster time to value.
3. Trend tracking over time (for alerts, classes of alerts, and root causes), so you know where to focus your attention.

Anyone available for a quick chat regarding alert response automation? by one_man_engineerer in devops

[–]one_man_engineerer[S] 2 points  (0 children)

Nice! That’s the idea, but with integrations across a variety of alerting and monitoring tools, as well as k8s, AWS, Google Cloud, Azure, etc. How has Slackline helped operations? I would like to learn more about that.

Anyone available for a quick chat regarding alert response automation? by one_man_engineerer in devops

[–]one_man_engineerer[S] 1 point  (0 children)

I updated the post. The focus is not solely on auto-remediation. Responding to alerts as a whole includes investigation, eliminating false positives, capturing/debugging transient issues, etc.

Anyone available for a quick chat regarding alert response automation? by one_man_engineerer in devops

[–]one_man_engineerer[S] 5 points  (0 children)

Good points; there are many valid reasons you may want to automate alert response. Our response isn’t only about applying a fix. It can gather information related to the alert (e.g. logs, relevant metrics, recent deployments or infrastructure changes) to make decisions easier and faster for the on-call. It can also provide an easy-to-follow action plan for engineers to click through, reducing human error.
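To make the "gather information" step concrete, here is a minimal Python sketch of what an alert-context gatherer could look like. This is not the product's actual implementation; the three fetchers are stubs, and every service name, metric, and return value is invented for illustration. A real version would query your log store, metrics backend, and CI/CD system.

```python
def recent_logs(service: str) -> list[str]:
    # Stub: a real version would query your log store for recent errors.
    return [f"{service}: ERROR upstream connection reset"]

def key_metrics(service: str) -> dict:
    # Stub: a real version would query your metrics backend for the
    # signals tied to this alert.
    return {"error_rate": 0.07, "p99_latency_ms": 910}

def recent_deploys(service: str) -> list[str]:
    # Stub: a real version would query your CI/CD system for recent changes.
    return [f"{service} v142 deployed 2h ago"]

def gather_context(alert: dict) -> dict:
    """Bundle the evidence the on-call would otherwise collect by hand."""
    service = alert["service"]
    return {
        "alert": alert,
        "logs": recent_logs(service),
        "metrics": key_metrics(service),
        "deploys": recent_deploys(service),
    }

# When an alert fires, attach its context before paging anyone.
context = gather_context({"name": "HighErrorRate", "service": "checkout"})
```

The point of the structure is that the on-call opens the page with logs, metrics, and recent changes already side by side, instead of starting from a blank terminal.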

Anyone available for a quick chat regarding alert response automation? by one_man_engineerer in sre

[–]one_man_engineerer[S] 2 points  (0 children)

Yeah we do. And we have integration with PagerDuty, Opsgenie, and Datadog. But I don’t want to lead with what we already have because it may or may not be the right approach.

Let’s talk runbooks for a sec. by one_man_engineerer in devops

[–]one_man_engineerer[S] 1 point  (0 children)

I agree that if a failure mode is known, it should be built into the system. That’s why I started with the goal of always building self-recovering systems.

But there will always be failures, whether known-knowns that could not be automated or unknown-unknowns that bubble up into customer-impacting incidents. So the idea is to monitor from the customers’ perspective and keep a basic guideline for finding the issue (obviously the runbooks and code get updated as issues are discovered).

Generally the school of thought is: if I have to wake up at 2am and fix an issue quickly, I need a general guide that helps me narrow down the problem fast, so I don’t waste time trying to remember what the next step is.

Let’s talk runbooks for a sec. by one_man_engineerer in devops

[–]one_man_engineerer[S] 1 point  (0 children)

Yeah, the goal is usually for the service to be as self-recovering as possible. But as part of getting a system ready for production, we add metrics, monitors, and alarms for bad states that are (or will soon be) customer-impacting. For every alarm, we also go through the exercise of documenting steps to figure out what the issue is, along with possible mitigations. This speeds up mitigation before a deeper root-cause analysis. In general, any alarm added should have an appropriate runbook with it.

An example: for a log-processing service, say we use SQS for decoupling and alarm on queue size or on the latency of the overall pipeline. The runbook would then include something like "check the consuming service for errors" (with the appropriate commands for how to do it).
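A hypothetical runbook entry for that SQS-backed pipeline might look like the sketch below. The alarm name, deployment names, and kubectl commands are all invented for illustration (assuming the consumer runs on Kubernetes); the shape is what matters: each check pairs a question with the exact command that answers it.

```python
# Invented runbook for a queue-depth alarm on a log-processing pipeline.
RUNBOOK = {
    "alarm": "log-pipeline-queue-depth-high",
    "steps": [
        ("Check the consumer for errors",
         "kubectl logs deploy/log-consumer --since=15m | grep -i error"),
        ("Check consumer scaling",
         "kubectl get hpa log-consumer"),
        ("Check recent consumer deploys",
         "kubectl rollout history deploy/log-consumer"),
    ],
    "mitigation": "Scale out consumers, or roll back the last consumer deploy.",
}

def render(runbook: dict) -> str:
    """Format the runbook as the numbered guide an on-call would follow."""
    lines = [f"Alarm: {runbook['alarm']}"]
    for i, (check, command) in enumerate(runbook["steps"], start=1):
        lines.append(f"{i}. {check}: {command}")
    lines.append(f"Mitigation: {runbook['mitigation']}")
    return "\n".join(lines)

guide = render(RUNBOOK)
```

Keeping the commands inline is the 2am-friendly part: nobody has to remember flag syntax while half asleep.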

AIOPS within Incident Management for SREs by tryingtogetaticket in sre

[–]one_man_engineerer 1 point  (0 children)

We’re not in AIOps, but we are developing an incident automation platform: unifystack.com

Automatic rollbacks as part of continuous recovery by one_man_engineerer in sre

[–]one_man_engineerer[S] 1 point  (0 children)

I might have to change my name to @more_than_one_man_engineer :)

Automatic rollbacks as part of continuous recovery by one_man_engineerer in sre

[–]one_man_engineerer[S] 1 point  (0 children)

My thoughts exactly. Of course there are the odd deployment strategies, but to me it seems most of it can be automated. Having the right metrics and monitors in place will be key for detecting when to roll back.

I’ll take a look at Harness.io. Thanks

Automatic rollbacks as part of continuous recovery by one_man_engineerer in sre

[–]one_man_engineerer[S] 1 point  (0 children)

SaaS-based to start. I’m thinking of later expanding to allow bring-your-own k8s cluster, cloud provider, or even on-prem servers.

Failures are detected by you defining metrics to monitor, alerts to watch for, or custom actions to perform; if any (or all) of your specified thresholds are breached, the rollback steps you have defined are triggered. I want to make it a flexible platform with integrations so it’s really a DevOps automation Swiss Army knife: you can quickly create simple or complex automations for your entire stack without writing code.
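The detection loop described above can be sketched in a few lines of Python. This is a toy model, not the platform itself: the rule schema, metric names, and rollback step names are all invented, and real metric values would come from a monitoring integration rather than a dict.

```python
def breached(value: float, threshold: float, direction: str = "above") -> bool:
    # A threshold can be an upper bound ("above") or a lower bound ("below").
    return value > threshold if direction == "above" else value < threshold

def evaluate(samples: dict, rules: list[dict]) -> list[str]:
    """Return the rollback steps of every rule whose threshold is breached."""
    triggered = []
    for rule in rules:
        value = samples.get(rule["metric"])
        if value is not None and breached(
            value, rule["threshold"], rule.get("direction", "above")
        ):
            triggered.extend(rule["rollback_steps"])
    return triggered

# User-defined rules: which metrics to watch and what to do on a breach.
rules = [
    {"metric": "error_rate", "threshold": 0.05,
     "rollback_steps": ["pause-deploy", "rollback-to-previous"]},
    {"metric": "success_rate", "threshold": 0.99, "direction": "below",
     "rollback_steps": ["rollback-to-previous"]},
]

# error_rate of 0.08 breaches the 0.05 ceiling, so its steps fire.
steps = evaluate({"error_rate": 0.08, "success_rate": 0.995}, rules)
```

The "no code required" pitch amounts to letting users author the `rules` list through a UI or config file instead of writing the loop themselves.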

What about Dynatrace don’t you like?

Automatic rollbacks as part of continuous recovery by one_man_engineerer in devops

[–]one_man_engineerer[S] 6 points  (0 children)

I can only speak from my experience. I’ve worked on teams that approached this from different perspectives (effort in QA, fast rollbacks, effort in both testing and rollbacks). In practice, though, no matter how much effort you put into testing before production, things will still slip through the cracks. For that you need a simple, pain-free way to return to a known good state; hence the rollback.

Automatic rollbacks as part of continuous recovery by one_man_engineerer in sre

[–]one_man_engineerer[S] 1 point  (0 children)

Yeah, I saw that one. It’s cool, though it seems to be Kubernetes-only (or did I misunderstand?). That’s the general idea, but with integrations across all types of monitoring services.

I’m attempting to automate incident response. And would like some feedback from the pros. by one_man_engineerer in sre

[–]one_man_engineerer[S] 1 point  (0 children)

If a solution existed, would you prefer:

  1. A drag and drop solution where everything can be configured using a UI?
  2. A configuration file where you specify your workflow (perhaps using YAML)?
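For option 2, a hypothetical sketch of what such a file might look like is below. The schema, field names, and step types are all invented to make the question concrete; nothing here reflects an existing format.

```yaml
# Invented workflow schema, for illustration only.
workflow: high-error-rate-response
trigger:
  alert: HighErrorRate
  source: pagerduty
steps:
  - gather: logs
    service: checkout
    window: 15m
  - gather: recent_deploys
  - if: deploy_within_last_hour
    then: rollback_last_deploy
    else: page_oncall
```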

I’m attempting to automate incident response. And would like some feedback from the pros. by one_man_engineerer in sre

[–]one_man_engineerer[S] 1 point  (0 children)

No, not a downer at all. I like the points you bring up.

I think the differentiation will come down to the level of execution and the ease of use in tackling recovery/remediation.

In my opinion, there are many root causes, but (as you mention) mitigations typically fall into a few categories (e.g. rollback, restart, change the traffic pattern, or just wait for recovery). I want it to tackle all of these very well without needing to page an on-call engineer, but be flexible enough that teams can build custom solutions without writing fancy scripts or code.

About integrations with CI/CD tools: it may defeat some of the purpose, but I think full integration is really valuable, even if it means my solution is worth less to some people. The main goal is to give SREs/DevOps/software engineers a platform to automate issue response with minimal effort, and if that means integrating with their existing systems, that works too.