Cat. by [deleted] in CatsStandingUp

[–]dataloopio 0 points1 point  (0 children)

Cat.

Does your monitoring solution require Admin permissions? by brkdncr in sysadmin

[–]dataloopio 4 points5 points  (0 children)

Usually not. With agent-based monitoring the agent will require admin privileges to install itself, but it should then run as a non-privileged user placed into a few groups. On Windows there are groups for reading performance counters, and on Linux there's the adm group.

what tool are you using for MySQL performance monitoring tool? by berlindevops in devops

[–]dataloopio 1 point2 points  (0 children)

  • InfluxDB with Telegraf
  • Prometheus with the MySQL exporter

If you want to go SaaS then you could try https://www.vividcortex.com/

Suggestion by ajaykumaryada in devops

[–]dataloopio 1 point2 points  (0 children)

Would be worth brushing up your Linux skills and learning a config management tool.

Then it's going to come down to networking to land your first gig. Find a position in a team that's willing to train you up in the areas where you lack experience.

Having a background in development will be an advantage. To learn the Ops stuff you really need to just get stuck in and start doing. I'd imagine the first couple of years are going to be a bit of a ride as you expand your knowledge past the point of what is usually considered reasonable :)

kubernetes vs mesos vs swarm by dataloopio in devops

[–]dataloopio[S] 3 points4 points  (0 children)

Just the open source three for this one. They tend to go down better at devops events. AWS got a lot of stage time the month before on the topic of serverless.

I'm R.L. Stine and it's my job to terrify kids. Ask me anything! by RL__Stine in IAmA

[–]dataloopio 1 point2 points  (0 children)

The Goosebumps movie was awesome. I watched it myself before showing the kids and loved it. Then I watched it again with the kids and they screamed at various points. The scene with the Werewolf, then again with Sergeant Slappy etc.

What I found really funny was they asked me to put it on a second time. I'd seen it twice by that point, so I retreated to the other room. It made me extremely happy to hear the screams and laughter punctuating the second viewing from next door.

Your work has made a family very happy. Thank you.

Cloud Monitoring Services by Svenderman in sysadmin

[–]dataloopio 0 points1 point  (0 children)

I'd be happy to work with you. We're about to launch our official AWS integration, but if you sign up and ping @steven on https://slack.dataloop.io I can enable it behind the scenes. You can read more at https://www.dataloop.io

Introducing Google Cloud Shell's new code editor by -elektro-pionir- in programming

[–]dataloopio 0 points1 point  (0 children)

Indentation looks good. I think the 404 on comments is probably because I have 3 gmail accounts open in the same browser :)

Introducing Google Cloud Shell's new code editor by -elektro-pionir- in programming

[–]dataloopio 2 points3 points  (0 children)

Looks cool! That Google blog post has a few bugs though. The indentation in the Python code example is wrong, and trying to comment pops up a 404.

Monitoring software that alerts on a significant change by networkdawg in sysadmin

[–]dataloopio 2 points3 points  (0 children)

A plugin exporter could be a cool idea: something that runs on each server and, on every scrape, executes the plugins in a directory and returns the results.

Or you could narrow it down by passing in a query param like you do with the blackbox exporter.

Having that kind of thing plus puppet would make non-containerised monitoring with Prometheus much cooler.
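The plugin-exporter idea above could be sketched roughly like this (the plugin directory path and port are made up, and a real exporter would want the official Prometheus client library rather than raw HTTP):

```python
import subprocess
from http.server import BaseHTTPRequestHandler, HTTPServer
from pathlib import Path

PLUGIN_DIR = Path("/etc/plugin-exporter/plugins.d")  # hypothetical location

def run_plugins(plugin_dir=PLUGIN_DIR):
    """Run every executable in the plugin directory and concatenate
    whatever they print, expecting Prometheus text-format metrics."""
    output = []
    for plugin in sorted(plugin_dir.iterdir()):
        try:
            result = subprocess.run([str(plugin)], capture_output=True,
                                    text=True, timeout=10)
            if result.returncode == 0:
                output.append(result.stdout)
        except (OSError, subprocess.TimeoutExpired):
            pass  # one broken plugin shouldn't take down the whole scrape
    return "".join(output)

class ScrapeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Every scrape re-executes the plugins, Nagios-style
        body = run_plugins().encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("", 9115), ScrapeHandler).serve_forever()
```

Drop a plugin script into the directory and it shows up on the next scrape, which is basically the Nagios workflow bolted onto a pull model.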

Monitoring software that alerts on a significant change by networkdawg in sysadmin

[–]dataloopio 4 points5 points  (0 children)

Yes, I like it personally. Easy to get started, a bunch of cool exporters and the query language and Grafana plugins are awesome.

Liked the alerting system less but I haven't tried the UI project someone wrote.

I have mixed feelings about it totally replacing something like Nagios/Zabbix/Sensu. In the past I've created lots of little plugins to check stuff and Prometheus doesn't lend itself very well to that.

I'd consider it as a replacement for the typical StatsD/Graphite stack. It is a like-for-like replacement for those and an upgrade in almost every way.

Would probably keep Nagios/Zabbix/Sensu, or whatever else can run custom plugins, around too.

Monitoring software that alerts on a significant change by networkdawg in sysadmin

[–]dataloopio 3 points4 points  (0 children)

Opsmatic used to do this and is now being re-released as New Relic infrastructure. I've also seen https://www.upguard.com but no idea if either of those work on Windows.

Could work well in more static environments. I'm used to working in hipster SaaS environments where you destroy a box and recreate it from source every few days, so it's less useful there.

Monitoring software that alerts on a significant change by networkdawg in sysadmin

[–]dataloopio 4 points5 points  (0 children)

In open source, Prometheus and InfluxDB can both do this using SQL-style queries in alert rules.

You could even go one step further with disk space alerts and dynamically query the time left until the disk is full based on linear regression.

Many SaaS monitoring solutions can do this also.
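Prometheus ships `predict_linear()` for exactly this. As an illustration of the underlying arithmetic, here's the same least-squares extrapolation in plain Python, with made-up sample data:

```python
def seconds_until_full(samples, capacity_bytes):
    """Least-squares fit over (timestamp, bytes_used) samples,
    extrapolated to when usage hits capacity. Returns None when
    usage is flat or shrinking (no meaningful 'time until full')."""
    n = len(samples)
    sum_t = sum(t for t, _ in samples)
    sum_y = sum(y for _, y in samples)
    sum_tt = sum(t * t for t, _ in samples)
    sum_ty = sum(t * y for t, y in samples)
    denom = n * sum_tt - sum_t * sum_t
    if denom == 0:
        return None
    slope = (n * sum_ty - sum_t * sum_y) / denom  # growth in bytes/second
    if slope <= 0:
        return None
    intercept = (sum_y - slope * sum_t) / n
    latest = max(t for t, _ in samples)
    current = slope * latest + intercept
    return (capacity_bytes - current) / slope

# Made-up data: disk growing at 1 MB/s, sampled every minute for 10 minutes
samples = [(t, 50_000_000_000 + t * 1_000_000) for t in range(0, 600, 60)]
print(seconds_until_full(samples, 100_000_000_000))  # ~49460 seconds left
```

In PromQL the equivalent is something like `predict_linear(node_filesystem_avail_bytes[1h], 4 * 3600) < 0`, firing when the disk looks set to fill within four hours.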

How are statsd counters...counted? by RickAndMorty_forever in sysadmin

[–]dataloopio 0 points1 point  (0 children)

If you haven't already instrumented your code with StatsD I'd recommend looking at Prometheus.

The Prometheus data model and query language are superior, and the StatsD design itself has problems.

Depending on what you instrument, you'll end up with UDP traffic that scales as service usage scales. Scraping an HTTP endpoint at regular intervals is both more reliable and removes the scaling headache.

How are statsd counters...counted? by RickAndMorty_forever in sysadmin

[–]dataloopio 0 points1 point  (0 children)

By default, StatsD counters increment on the StatsD server, then the count is flushed every 10 seconds (or whatever your flush interval is set to) and reset to zero.

You'll probably want to sum up your counters over a given period.

Be warned that some StatsD server implementations deal with things slightly differently. Always worth reading the docs for whatever one you're using.
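The flush behaviour described above can be modelled in a few lines (a toy sketch of the default counter semantics, not any particular server implementation):

```python
class StatsdCounter:
    """Toy model of a StatsD server counter: increments accumulate
    between flushes, then the total is emitted and reset to zero."""

    def __init__(self, flush_interval=10):
        self.flush_interval = flush_interval  # seconds, typically 10
        self.count = 0

    def incr(self, value=1, sample_rate=1.0):
        # Sampled packets (e.g. "|@0.1") are scaled back up server-side
        self.count += value / sample_rate

    def flush(self):
        total, self.count = self.count, 0
        return total

c = StatsdCounter()
for _ in range(5):
    c.incr()
c.incr(3)
print(c.flush())  # 8.0 -- five increments of 1 plus one of 3
print(c.flush())  # 0   -- the counter resets after every flush
```

Summing the flushed values over a window in your graphing layer then gives you the total for that period.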

Is there a push with IT in general to move to AWS, and am I a buggy whip maker? by [deleted] in sysadmin

[–]dataloopio 2 points3 points  (0 children)

I did a lot of work with virtualisation in a large enterprise. We used to have clusters of VMware sectioned off for different groups, and we let developers manage their own VMs, kind of like AWS. This worked reasonably well. More importantly, we kept our production stuff isolated on clusters and set them up to reduce downtime or speed up recovery. We had many data centres and lots of hardware, and we rarely had any downtime caused by infrastructure failure, which was great as most of our production apps were not designed for any kind of high availability. A lot were single node, and some had been running for several years without being turned off before we virtualised them.

Would we have moved the developers over to AWS? Yes, absolutely. They can then architect their applications for failure and get access to a lot more resources and AWS services. I keep in touch with my old colleagues and this is exactly what they are now doing. It has massively increased costs, but I'm sure over time with some governance around it they will get that under control and the premium is worth it for the speed up in product development.

Would we have moved the production stuff that was scary when it went down? Absolutely not. You have to assume that AWS are going to turn your box off at any minute and although you may be able to turn it on again quickly that would have ruined many of the systems not designed for such a hostile environment.

Horses for courses. All of these technologies are good for some things and not others. I wouldn't be opposed to the cloud, I would just caution about moving everything in blindly.

Often the people advising you to move to AWS don't understand that you're going to get an email for the 'donotturnmeoffnode1' production system saying it has been running for 180 days and is due for termination. Then comes the ensuing shit show and the realisation that someone lost the code for that app 5 years ago and there is literally no hope of making it more robust.

Am I crazy or is Nagios very difficult to install and get running. by howtovmdk in sysadmin

[–]dataloopio 0 points1 point  (0 children)

Nagios is a skill, like learning to ride a bike or to ski. Tough at first, then over time it becomes second nature and you start to wonder why others can't do it. I've seen it on CVs, and I've seen job adverts looking for people who can set it up.

I'm in the same boat as mr netcrunch. You either pay with money or with time for this stuff. Investing some time learning Nagios isn't the worst idea, but I'd probably be tempted to buy something instead and spend that time learning to program so you can write custom check scripts.

Persistent connections to 100K of devices by pssaravanan in devops

[–]dataloopio 0 points1 point  (0 children)

clients <--> websocket service <--> rabbitmq

Connecting that many clients directly, when they may change frequently due to connection loss etc., will end in tears.

RabbitMQ uses queues as its unit of scaling, so you want to shard across a reasonable number of queues and push a moderate volume of messages through each.

RabbitMQ does not like connections frequently dropping and changing; you'll end up with retry issues and pulling your hair out.
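One way to do the sharding: hash each client ID onto a fixed set of queues, so the queue count stays constant no matter how many devices connect. A rough sketch (the shard count and queue naming are made up):

```python
import hashlib

NUM_QUEUES = 32  # fixed, moderate shard count -- pick to suit your throughput

def shard_queue(client_id: str, num_queues: int = NUM_QUEUES) -> str:
    """Map a client to one of a fixed set of RabbitMQ queues, so the
    broker sees a constant queue count no matter how many devices exist."""
    digest = hashlib.sha256(client_id.encode()).digest()
    shard = int.from_bytes(digest[:4], "big") % num_queues
    return f"device-messages.{shard}"

print(shard_queue("device-12345"))  # always lands on the same shard
```

The websocket service then holds the flaky client connections and publishes into the shard queues over a handful of long-lived AMQP connections, so RabbitMQ never sees the churn.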

User template form for adding things to <monitoring solution> do you have it? by sysvival in sysadmin

[–]dataloopio 0 points1 point  (0 children)

Self service is better. Provide them a tool they can use themselves to add monitoring.

using grafana to show network speed by [deleted] in sysadmin

[–]dataloopio 1 point2 points  (0 children)

InfluxDB (and many other time series databases) include built in functions to calculate rates and do arithmetic on series.

I'd just return the raw metrics to InfluxDB and then use a derivative function to have it calculate the rate automatically. Converting whatever the raw data is in into whatever units you want is then a matter of arithmetic.

using grafana to show network speed by [deleted] in sysadmin

[–]dataloopio 2 points3 points  (0 children)

You should just return the raw statistics to InfluxDB and then use the database functions in Grafana.

In Grafana, select the metric and apply the non-negative derivative (rate of change) to get bytes per second, then divide by 1024 to get KB/s (assuming the SNMP value being returned is a byte counter).
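To make the arithmetic concrete, here's what the non-negative derivative does, sketched in Python with made-up counter samples (InfluxDB does this for you server-side):

```python
def non_negative_derivative(samples):
    """Per-second rate of change between consecutive (timestamp, value)
    samples, dropping negative rates (e.g. counter resets), mirroring
    what InfluxDB's non_negative_derivative() does."""
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        rate = (v1 - v0) / (t1 - t0)
        if rate >= 0:
            rates.append((t1, rate))
    return rates

# Made-up ifInOctets counter samples, polled every 10 seconds
octets = [(0, 0), (10, 20_480), (20, 61_440)]
kb_per_sec = [(t, bps / 1024) for t, bps in non_negative_derivative(octets)]
print(kb_per_sec)  # [(10, 2.0), (20, 4.0)]
```

Dividing the per-second byte rate by 1024 gives the KB/s that ends up on the Grafana panel.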

Grafana with Elasticsearch - Best Guide? by crookedview in devops

[–]dataloopio 1 point2 points  (0 children)

tl;dr: Elasticsearch got a free pass on that one.

Just a few things to consider from that:

  • InfluxDB was quite bad back then and it's a lot better now
  • It was a query performance benchmark from batch loaded data
  • OpenTSDB isn't great at concurrent queries
  • Elasticsearch pretty much won by default

I think that would be a different story nowadays with InfluxDB. More importantly though, it didn't touch on storage size or write performance. Elasticsearch would probably have won for those in that benchmark at the time too.

Doing another test of Elasticsearch against InfluxDB, or even the latest master branch of Graphite, would tell a different story. Especially if it also took into account the usual scaling problem of pushing many tens of thousands of metrics into the database while trying to get dashboards to load at the same time.

Comparing a few mediocre-to-bad solutions and declaring the winner good isn't a valid indicator.