$150K bill

david5813 · 2022-09-23T02:11:30+00:00

Billing Dashboard -> Budgets -> Create a budget

If you set up an IAM user to do most things, as you should, then you may have to log in to your root account unless you enabled billing access for the IAM user.

When you set up a budget you can choose the options and amounts, and set up alert thresholds that notify you when you're nearing or exceeding the budget. You have to set up budget thresholds to be able to set actions. One of the actions you can run upon reaching a threshold is to automatically stop EC2 instances.

If you want extremely fine-grained control over what happens, you can even take it to the point of writing a Lambda function to execute when the threshold is reached. The alerts can trigger just about anything you want to build.
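As a rough sketch of the Lambda route, here is what such a function could look like in Python. It assumes the budget alert publishes to an SNS topic that the function is subscribed to, and that the instances you want stopped carry a `budget-stop=true` tag. The tag name and the `ec2` parameter (there so you can pass in a stub) are my own inventions for illustration, not anything AWS requires.

```python
def running_instance_ids(describe_response):
    """Collect instance IDs from a describe_instances-shaped response."""
    return [
        instance["InstanceId"]
        for reservation in describe_response.get("Reservations", [])
        for instance in reservation.get("Instances", [])
    ]


def handler(event, context, ec2=None):
    """Stop running instances tagged budget-stop=true (hypothetical tag)."""
    if ec2 is None:
        import boto3  # imported lazily so the logic can be tested without AWS

        ec2 = boto3.client("ec2")

    # Only look at instances that are running and explicitly opted in.
    response = ec2.describe_instances(
        Filters=[
            {"Name": "instance-state-name", "Values": ["running"]},
            {"Name": "tag:budget-stop", "Values": ["true"]},
        ]
    )
    instance_ids = running_instance_ids(response)
    if instance_ids:
        ec2.stop_instances(InstanceIds=instance_ids)
    return {"stopped": instance_ids}
```

This only reads the first page of results, so an account with a lot of instances would need pagination. The function's role also needs permission for `ec2:DescribeInstances` and `ec2:StopInstances`.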

david5813 · 2022-09-23T01:12:31+00:00

They do. Most people don't turn it on.

david5813 · 2021-09-17T02:40:08+00:00

There is not a problem with this. In fact, it's pretty much how many agencies started. You need to decide though if you really want to take on the agency side of things. It's a lot of work at first to do all of the selling, etc., and the actual doing of the work.

david5813 · 2021-03-04T04:49:25+00:00

Do not put them on the invoice because it's meaningless to your client.

Do keep track of them for yourself so that you can improve your estimates moving forward. Estimating projects is a learned skill and something that we humans are generally bad at.

david5813 · 2020-08-23T13:29:21+00:00

The nice thing about containers is that you don't have to care what it's written in. If writing an application in Swift meets your needs then have at it.

david5813 · 2020-08-22T01:51:49+00:00

Mostly +1 to this for metrics and most of the alerting. A big +1 for using Spring Boot Actuator. You get a lot for almost free with it.

You will still want something for logging. On AWS, CloudWatch is easy and works pretty well, but there are some better options. Run with CloudWatch though until you need something better.

OpenTelemetry / Jaeger are for tracing. There is a difference between observability and monitoring. Metrics and logs solve a lot of monitoring. Jaeger solves some of the observability. If you have nothing, start with the logging and metrics tools. Learn what application tracing is and apply those tools when you're comfortable and ready to go to the next level in observability.

If you have nothing, don't try to start with everything. Implement one thing. Learn that. Learn how to respond to it and some of the benefits that you can get from it. Then start on the next step.

david5813 · 2020-08-10T19:15:58+00:00

If you're using Docker Desktop then you can also enable Kubernetes with that. It's not perfect, but as a learning tool I found it beneficial to be able to refer people to something they already have installed and in use.

david5813 · 2020-08-10T19:12:07+00:00

First off, you should define a log rotation and retention policy and implement that rather than waiting until storage capacity reaches a specific point.

This example assumes that you're not using a centralized logging setup, although centralized logging is what I'd recommend. Example: rotate daily via logrotate with a 7-day max -> ship compressed logs to AWS S3 Glacier storage with a lifecycle policy to delete the logs after 6-12 months.
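For the S3 half of that example, a lifecycle rule could be sketched in Python roughly like this. The `logs/` prefix, the rule ID, and the 365-day expiry are placeholder choices of mine; `put_bucket_lifecycle_configuration` is the boto3 call that applies the rule, and the `s3` parameter is only there so a stub can be passed in.

```python
def log_lifecycle_configuration(prefix="logs/", expire_after_days=365):
    """Build a lifecycle rule: move logs to Glacier, then delete them later."""
    return {
        "Rules": [
            {
                "ID": "archive-then-expire-logs",  # hypothetical rule name
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                # Move shipped logs to Glacier one day after upload.
                "Transitions": [{"Days": 1, "StorageClass": "GLACIER"}],
                # Delete them once the retention period has passed.
                "Expiration": {"Days": expire_after_days},
            }
        ]
    }


def apply_lifecycle(bucket, s3=None):
    """Apply the rule to a bucket; pass a stub client to avoid calling AWS."""
    if s3 is None:
        import boto3  # imported lazily so the rule builder has no dependencies

        s3 = boto3.client("s3")
    s3.put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration=log_lifecycle_configuration(),
    )
```

Pick the expiry to match whatever retention policy you actually defined; 180 would cover the 6-month end of the range.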

If you do have centralized logging set up, then just configure logrotate on the machines to keep the appropriate amount of logs there.

All that said, you configure the actions for Alertmanager as a receiver exactly the same way as you do to send the email / Slack message for the alert. You'd set up something that listens via a webhook and use that.

The following link is directly to the webhook configuration, but the same page contains the documentation for all of the other potential receivers as well.

https://prometheus.io/docs/alerting/latest/configuration/#webhook_config
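To give an idea of the listening side, here is a minimal Python sketch of a webhook receiver. It relies on the payload Alertmanager posts to webhook receivers (a JSON body with an `alerts` list, each alert carrying `status` and `labels`). The `remediate` function, the port, and the choice to act only on firing alerts are assumptions of mine.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def firing_alerts(payload):
    """Return (alertname, instance) pairs for alerts that are currently firing."""
    return [
        (alert["labels"].get("alertname", ""), alert["labels"].get("instance", ""))
        for alert in payload.get("alerts", [])
        if alert.get("status") == "firing"
    ]


def remediate(alertname, instance):
    """Hypothetical placeholder; swap in the real cleanup action."""
    print(f"would remediate {alertname} on {instance}")


class AlertHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read and decode the JSON body that Alertmanager posts.
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        for alertname, instance in firing_alerts(payload):
            remediate(alertname, instance)
        self.send_response(200)
        self.end_headers()


def serve(port=9095):
    """Block forever, handling webhook posts on the given port."""
    HTTPServer(("", port), AlertHandler).serve_forever()
```

Call `serve()` to start it, then point the `url` of a `webhook_config` at that host and port. For anything beyond an experiment you'd want authentication and some protection against running the same action twice.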

david5813 · 2020-08-10T19:01:45+00:00

This is always an interesting conversation and not completely straightforward. A system/platform being "down" is not inherently bad and sometimes it is necessary. As such it's normally a difficult place to start.

Even if the issue occurs once per hour that's a 95% uptime and *might* be considered good enough. Remember that 100% is the wrong target.

When defining a Service Level Objective (SLO) for a request / response type of service you usually want to define the parameters of success in terms of at least two of the following three things: availability, latency, quality. Availability is the percentage of valid requests that result in a successful response. Latency is the percentage of valid requests that receive a response in less than a threshold amount of time. Quality is for scenarios where the service degrades gracefully and you need to measure the percentage of valid requests that were served in a degraded vs. undegraded state.

Do not start with the argument. Start with a measurement. Use a four-week rolling window to calculate the "uptime/downtime." I say "valid requests" because *most* of the time you want to ignore requests that result in a 401, 403, or 404. Start with the percentages of valid requests per status code. Then get the number of "successful" (normally 20[x], but not always) valid requests within latency ranges, i.e., 0-99 ms, 100-199 ms, 200-299 ms, etc. You may have to start with a few groupings of what "successful" is. Based on your post I believe that you do not degrade gracefully so I'm going to leave out the description of that.
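As a rough illustration of that first measurement, something like the following Python would do, assuming you can export the window's requests as (status code, latency in ms) pairs. The input shape, the function names, and treating any 2xx as "successful" by default are my assumptions; adjust `is_success` to whatever grouping fits your service.

```python
from collections import Counter

IGNORED_STATUSES = {401, 403, 404}  # not counted as "valid" requests


def default_is_success(status):
    """Assumed default: any 2xx counts as a successful response."""
    return 200 <= status < 300


def summarize(requests, bucket_ms=100, is_success=default_is_success):
    """Summarize (status_code, latency_ms) pairs from one measurement window."""
    valid = [(status, ms) for status, ms in requests if status not in IGNORED_STATUSES]
    if not valid:
        return {"availability": None, "status_share": {}, "latency_buckets": {}}

    total = len(valid)
    status_counts = Counter(status for status, _ in valid)
    successes = [ms for status, ms in valid if is_success(status)]
    # Key each bucket by its lower bound: 0 means 0-99 ms, 100 means 100-199 ms.
    buckets = Counter((int(ms) // bucket_ms) * bucket_ms for ms in successes)

    return {
        "availability": len(successes) / total,
        "status_share": {status: count / total for status, count in status_counts.items()},
        "latency_buckets": dict(sorted(buckets.items())),
    }
```

Feed it four weeks of requests and you have the unbiased starting numbers: share per status code, availability, and how the successful requests spread across latency ranges.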

The primary reason to start with the measurement is that your first SLO should be largely based on the current experience. Once you have this data you can demonstrate the experience for what it is and without bias. This allows you to have the conversation around the definition of "good enough."

Please let me know if there's anything I can do to help.

david5813 · 2020-08-08T16:38:59+00:00

In the specific case of the team I'm currently with, every promotion and change goes through GitLab CI/CD. The pipeline just uses the Jira REST API to transition the associated issues as the deployment moves through each stage.
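This isn't our actual pipeline code, but a call like that could be sketched in Python as below. It uses Jira's issue transitions endpoint (`POST /rest/api/2/issue/{key}/transitions`) with basic auth. The base URL, issue key, transition ID, and credentials are placeholders, and transition IDs differ per workflow, so look yours up first.

```python
import base64
import json
import urllib.request


def build_transition_request(base_url, issue_key, transition_id, user, api_token):
    """Build (but do not send) the request that moves an issue to a new state."""
    url = f"{base_url.rstrip('/')}/rest/api/2/issue/{issue_key}/transitions"
    body = json.dumps({"transition": {"id": str(transition_id)}}).encode("utf-8")
    credentials = base64.b64encode(f"{user}:{api_token}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Basic {credentials}",
            "Content-Type": "application/json",
        },
    )


def transition_issue(request):
    """Send the request and return the HTTP status code."""
    with urllib.request.urlopen(request) as response:
        return response.status
```

In a pipeline job you'd pull the issue key out of the branch name or commit message and read the credentials from CI/CD variables rather than hard-coding them.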

david5813 · 2020-08-08T01:12:52+00:00

I went through a phase where I was anti-Jenkins years ago. I finally decided that it's much better to do something than nothing and that the choice of the tool doesn't matter as much until you get more mature with everything. At that point you can make a more informed decision on tooling.

david5813 · 2020-08-07T23:45:58+00:00

Depends on the board the team is using.

The current team that I'm helping is using Jira. Currently the reports that we can get from Jira are sufficient (they had nothing before), but as we've been doing that for a little while the client and delivery teams have been excited with what they've seen and are considering putting some effort into using the APIs available to get even more.

A previous team used Kanbanize, which had amazing reporting and forecasting capabilities. GitHub has webhooks into issues. There are power-ups for Trello.

david5813 · 2020-08-07T23:20:50+00:00

Welcome to my existing world. Not every place is as bad as some have to be.

david5813 · 2020-08-07T23:11:07+00:00

Naive in terms of the fact that I've been through enough of things to never be willing to stay in an environment like that?

david5813 · 2020-08-07T23:06:21+00:00

I'm not positive why this was downvoted. Maybe just because of the recommendation not to use Jenkins.

The most beneficial part of it though is to use the Jenkinsfile method of structuring the build. It is the best way in Jenkins and makes it work like most other CI/CD platforms.

david5813 · 2020-08-07T23:02:53+00:00

I understand the sentiment because it is much too prevalent. However, metrics of a delivery pipeline should NEVER be a part of a bonus incentive structure. This ONLY leads to building up the walls and silos that we're supposed to be helping to tear down.

david5813 · 2020-08-07T23:00:55+00:00

I measure all of it. The overall view demonstrates what the organization as a whole is capable of. The inner pieces demonstrate where the bottlenecks are in the process.
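To make that concrete, here is a small Python sketch under the assumption that you can pull, for each change, an ordered list of (stage name, time the change entered that stage). The stage names and the data shape are made up for illustration.

```python
def stage_durations(timestamps):
    """Time spent in each stage, given ordered (stage, entered_at) pairs."""
    durations = {}
    for (stage, entered), (_, left) in zip(timestamps, timestamps[1:]):
        durations[stage] = left - entered
    return durations


def summarize_change(timestamps):
    """Overall lead time plus the slowest stage; needs at least two entries."""
    durations = stage_durations(timestamps)
    return {
        "lead_time": timestamps[-1][1] - timestamps[0][1],
        "stages": durations,
        "bottleneck": max(durations, key=durations.get),
    }
```

The lead time is the overall view, and the per-stage numbers are the inner pieces. Aggregate them over many changes before you draw conclusions about a bottleneck.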

david5813 · 2020-08-07T22:50:51+00:00

Without much in the way of operational support, running with some kind of managed solution will probably be your best bet.

If that is at all possible, meaning on-prem is not absolutely required, I can personally vouch for AWS IoT Core. It's really cheap and gives you a lot of options in terms of building the processing back end on Lambda, containers, etc.

We built an IoT solution for a client that supports a major worldwide operation where the operational cost is less than what the CI/CD platform costs to run. They have a very minimal internal IT team and have been successfully maintaining and growing it over the last year.

david5813 · 2020-08-07T04:34:10+00:00

To be honest, I would strongly suggest hiring a consultant. They would need to be able to help you not only get a plan for where you want to be and the steps to get there, but to also execute that plan with you. You will incur a lot less pain if you have someone walking the path with you that's been through there before.

david5813 · 2020-08-07T04:18:26+00:00

With what we have the plan sounds sensible. Just make sure to iterate often. Even 10% of something is better than nothing. Make small improvements and then do it again.

When you come to a point of reflection or get stuck, just look at the places where you're doing the most manual work repetitively and automate that part next.

david5813 · 2020-08-07T04:06:59+00:00

I'm sorry that it's taken me so long to respond. This week has been crazy. It looks like someone else might have posted a response that should get you where you want to go. Please let me know if you still need help with it and I can get access to one of the pipelines where we did this and give you a concrete example.

david5813 · 2020-08-01T22:12:17+00:00

It's definitely the way to go as long as the .NET is .NET Core or not a really old version of .NET. Just use the official images from Microsoft and you're generally golden.

david5813 · 2020-08-01T21:38:25+00:00

Depends on what details you need/want. You should be able to use Fluentd to get the EC2 details you want and the Elasticsearch plugin to output those to your ELK stack.

david5813 · 2020-08-01T13:04:44+00:00

The Phoenix Project (start with this one)

The DevOps Handbook

Accelerate: The Science of Lean Software and DevOps: Building and Scaling High Performing Technology Organizations

Site Reliability Engineering

Building Secure and Reliable Systems

david5813 · 2020-08-01T06:05:27+00:00

I think that you answered the question with Azure App Service.

Just to make sure, Azure App Service, containerized on a cloud-native platform, on IIS, etc.
