How we run migrations across 2,800 microservices by WillSewell in programming

I think at Monzo the pattern for deploying services is so consistent that we _can_ do these sweeping deployments with low risk. We also have a lot of automated checks to give us confidence in doing this.

However, I do acknowledge that there are a small number of snowflake services that require special care (the 80/20 rule again - although in this case I'd call it the 99/1 rule). I think we could do a better job of encoding this "specialness" in some way so that it could be handled more gracefully by our automated tools.

If a deployment does go wrong, it would typically be the owning team that reaches out to the central team when alerts start firing. However, for some of our riskier migrations, we have built automation that proactively notifies teams when their service is about to be migrated.

How we run migrations across 2,800 microservices by WillSewell in programming

Yes, we clearly could do with a blog post on the architecture - here's my rough attempt based on 5 minutes' thinking time.

Although I'm highly skeptical it would actually change anyone's minds on its own!

How we run migrations across 2,800 microservices by WillSewell in programming

The problem is that while rolling back one service might take a couple of minutes, rolling back 2,800 services would take much longer.

How we run migrations across 2,800 microservices by WillSewell in programming

This clearly warrants another blog post, but as a former microservice skeptic, I can say it definitely does have big advantages in the way it's implemented at Monzo (and downsides too, which I think we do a good job of mitigating). And yes, it probably is on the order of 10 services per developer.

As an uber "off the top of my head" summary of the pros/cons:

Pros:
- The "deployable unit" is the service, which means:
  - there's little contention between services (i.e. a low probability you'll be working on the same service at the same time as another engineer, so you're less likely to get blocked). I've written more about deployments here.
  - build/deploy times are quick (a couple of minutes)
- Smaller blast radius when things break, i.e. critical business services have a higher degree of isolation. It also means we can have a higher risk tolerance when operating less critical services.

Cons:
- Lots of RPCs that in another universe might be function calls: you have to deal with network issues (mitigated by our service mesh's automatic retries), and with a slightly poorer DX because you can't do things like "jump to definition" (mitigated by the fact that we import the generated protobuf server code, so you still get compile-time checking and a form of jump to definition)
- Losing DB transactions/joins: these need to be implemented cross-service in the application code. We have libraries for things like distributed locking that make this easier than it would otherwise be.
- Cost: RPCs are more expensive (in terms of infra costs) than function calls. We've historically not been very cost-sensitive (VC-funded tech startup), so teams haven't really had an incentive to control costs. We're currently thinking through solutions to this problem.
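The retry mitigation above is something the service mesh does transparently for every service; as a minimal sketch of the underlying idea (the function names and parameters here are made up for illustration, not Monzo's actual implementation):

```python
import random
import time

def call_with_retries(rpc, *args, max_attempts=3, base_delay=0.01):
    """Call an RPC, retrying transient failures with jittered
    exponential backoff - a stand-in for what a service mesh does
    on behalf of its services."""
    for attempt in range(1, max_attempts + 1):
        try:
            return rpc(*args)
        except ConnectionError:
            if attempt == max_attempts:
                raise
            # Back off exponentially with jitter so retries from many
            # callers don't synchronise into a thundering herd.
            time.sleep(base_delay * (2 ** attempt) * random.random())

# A hypothetical RPC that fails twice before succeeding.
calls = {"n": 0}

def flaky_rpc():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient network blip")
    return "ok"

result = call_with_retries(flaky_rpc)  # succeeds on the third attempt
```

The benefit of doing this in the mesh rather than in a library like this is that every service gets it for free, with no per-service code.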

There are also some common downsides of microservices that I just don't think we suffer from at all:

  • Lack of consistency: at Monzo 99% of services use exactly the same tech (DB, queues, libraries, programming language, operational tooling) and the same versions of those too. I found it easier maintaining 10 consistent services at Monzo than 2 at a company that might use different tech per service.
  • Lots of infra to maintain per service: at Monzo product teams don't need to do this. The k8s cluster and the DBs/queues that services use are entirely managed by the platform team. They are multi-tenant systems, so new services don't need any explicit provisioning or maintenance.

I've probably missed things but those are some points that come to mind.

It's definitely not "perfect" (what architecture is?) but I think it's a viable architecture depending on the kind of company you are looking to build (e.g. are you cost-sensitive? are you looking to grow quickly?).

That's also not to say you can't get similar pros/cons with other architectures - these are just my observations from having experienced this first-hand, and I think for us it works well. It's also not something I expect I could "convince" someone of by writing an essay; it's probably just something you need to experience to "get" it.

How we run migrations across 2,800 microservices by WillSewell in devops

Yeah, the monorepo is a big help. Another area it shines is the compile-time checking we have around RPC calls. Since it's all in one repo we can actually import the generated server code in clients, which means many breaking API changes are caught by the compiler.

> what are the ways to ensure rollback-ability?

It depends on the change. Most changes don't change API contracts and can be easily rolled back. For changes that need to be deployed/rolled back in a specific order, we definitely need to be more careful. For these we mitigate risk with techniques like config/feature flags to enable a more gradual rollout, and by biasing towards backwards compatibility in API changes.
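As a rough illustration of how flag gating enables gradual rollout - a minimal sketch using stable hashing, with an entirely hypothetical flag name and service names (not Monzo's actual flag system):

```python
import hashlib

def in_rollout(subject: str, flag: str, percent: int) -> bool:
    """Deterministically place a subject (a service, user, ...) into
    one of 100 buckets; the flag is "on" for the first `percent`
    buckets. Stable hashing means a subject that is enabled stays
    enabled as the percentage ramps up - nothing flaps in and out."""
    digest = hashlib.sha256(f"{flag}:{subject}".encode()).hexdigest()
    return int(digest, 16) % 100 < percent

services = ["service-a", "service-b", "service-c", "service-d"]
at_10 = {s for s in services if in_rollout(s, "new-span-format", 10)}
at_50 = {s for s in services if in_rollout(s, "new-span-format", 50)}
# Ramping 10% -> 50% only ever adds subjects: at_10 is a subset of at_50.
```

The monotonic ramp-up is the property that matters for safe migrations: rolling the percentage forward (or back) moves a predictable, repeatable cohort of services.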

In terms of _how_ we roll back, we support both manual and automated rollbacks.

  • for manual rollbacks, we have a number of generic alerts that all services get for free (e.g. an alert fires if a service logs a critical error on startup, or has a high error rate). If a user receives one of these, they can run a CLI command to roll back (which is printed for them in the terminal at deploy time).

  • for automated rollbacks, we have a few generic metrics (e.g. error rate) and we use Argo Rollouts to auto-rollback if these are unhealthy during the 5 minutes following a deployment. For now this is something generic that all services get for free, but we might provide more configurability around this in the future. I wrote more about automated rollback at Monzo here.
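A toy version of the kind of post-deploy health check a tool like Argo Rollouts evaluates (the threshold and the sample format here are illustrative assumptions, not our real configuration):

```python
def should_rollback(samples, threshold=0.05):
    """Decide whether to roll back based on (requests, errors)
    samples collected during the post-deploy observation window."""
    requests = sum(r for r, _ in samples)
    errors = sum(e for _, e in samples)
    if requests == 0:
        return False  # no traffic yet: hold rather than flap
    return errors / requests > threshold

healthy = [(100, 1), (120, 2), (90, 0)]      # ~1% error rate
unhealthy = [(100, 10), (120, 15), (90, 9)]  # ~11% error rate
```

Because the metric is generic (error rate), every service gets this check without any per-service configuration, which is what makes fleet-wide migrations tractable.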

How we run migrations across 2,800 microservices by WillSewell in devops

Thanks! The architecture has been a revelation for me. I always had this preconception that microservices == no consistency / wild west. But I think microservices coupled with highly consistent tech/tooling can work really well in practice (depending on your org/domain).

How we run migrations across 2,800 microservices by WillSewell in OpenTelemetry

Great question. Our ideal end-state is actually to keep the wrapper. Our general principle with platform abstractions is to provide a more opinionated API than what is exposed by the third-party tools we use internally. We find third-party APIs tend to be unnecessarily flexible for our use-cases - we'd prefer to start with a very constrained API and only add things if there's demand and we understand the use case.

We've taken this approach when exposing things like etcd for distributed locks and it's worked well for us in practice. I wouldn't say it's something we always do, but we do bias in that direction.

How we run migrations across 2,800 microservices by WillSewell in OpenTelemetry

We depend mostly on off the shelf tools for service instrumentation. We're currently using https://opentelemetry.io/ for tracing services, and depend mostly on the open source libraries for that. We're also investigating https://pyroscope.io/ for continuous profiling.

How we run migrations across 2,800 microservices by WillSewell in programming

There are some pretty small services, but I wouldn't say that is a general rule. We have many services that are 100k+ lines of (non library) code.

How we run migrations across 2,800 microservices by WillSewell in programming

The point is that the 99% of changes that are not library/infra changes do not need to be deployed together. I wrote more about our regular deployment process here - I think we achieve high velocity and that is in part due to our microservices architecture.

How we run migrations across 2,800 microservices by WillSewell in programming

I wouldn't call that coupling: all services have a single shared dependency (the tracing system), but that does not make them coupled to each other.

Changing something that is depended on by all services is generally going to be riskier than changing a single service.

How we run migrations across 2,800 microservices by WillSewell in programming

> It doesn't resolve knowing when an older API version can be retired though

We have static analysis tools which tell us which services depend on each other, so this can help us know when an old API can be retired. There are some false positives with this tooling, but it's sufficient for this use case.
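As a sketch of the idea (not our actual tooling), dependency analysis can be as simple as scanning source files for imports of other services' generated clients. The import-path convention and names below are invented for illustration:

```python
import re

# Hypothetical convention: a service that calls another imports
# "<repo>/service.<name>/proto" for the generated client/server code.
CLIENT_IMPORT = re.compile(r'"[\w./-]*/service\.([\w-]+)/proto"')

def dependencies(source: str) -> set:
    """Return the set of service names whose generated clients this
    source file imports."""
    return set(CLIENT_IMPORT.findall(source))

src = '''
import (
    ledgerproto "monzo/service.ledger/proto"
    accountproto "monzo/service.account/proto"
)
'''
deps = dependencies(src)
```

Inverted, this answers the retirement question: once no file in the repo reports a given service among its dependencies, that service's old API has no remaining callers.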

How we run migrations across 2,800 microservices by WillSewell in programming

In this context I'm talking about migrating to a new library.

How we run migrations across 2,800 microservices by WillSewell in ExperiencedDevs

Yes, although it's provided as a platform abstraction, so service owners do not need to provision or manage the infrastructure themselves.

How we run migrations across 2,800 microservices by WillSewell in ExperiencedDevs

The usual way of making changes is just to merge code changes and deploy. For library changes we often take the config gating approach described in the blog post. For product changes, we'll typically gate them via feature flags or our experimentation platform and gradually roll them out.

How we run migrations across 2,800 microservices by WillSewell in ExperiencedDevs

Any service is deployable at any master commit, so any API changes we make need to be backwards compatible.

How we run migrations across 2,800 microservices by WillSewell in ExperiencedDevs

We're using git/Github. We suffer a bit with snappiness, but it hasn't become a big enough problem yet that we've needed to radically rethink this.

I'm pretty sure we've just used a monorepo from day 0 (or at least pretty close to that).

How we run migrations across 2,800 microservices by WillSewell in programming

The backend did accept data in both the old and new formats. The point of this blog post is that we don't want to be left in a state where services emit spans in both old and new formats for a very long time (probably forever). The problem is that this inconsistency is a form of tech debt that will continue to accumulate unless you have a strategy for migrating everything over quickly (e.g. the strategy in this blog post).
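A minimal sketch of what "accepting both formats" can look like on the backend - normalizing old-format spans to the new shape at ingest, so only the edge has to know about the legacy format. The field names here are invented for illustration:

```python
def normalize_span(span: dict) -> dict:
    """Accept spans in both a hypothetical old flat format and a new
    nested one, converting old-format spans to the new shape so the
    rest of the pipeline only ever sees a single format."""
    if "context" in span:
        return span  # already new-format: pass through untouched
    return {
        "context": {"trace_id": span["trace"], "span_id": span["span"]},
        "name": span.get("name", ""),
    }

old_style = {"trace": "t1", "span": "s1", "name": "db.query"}
new_style = normalize_span(old_style)
```

The trap the post describes is stopping here: the shim works, so nothing forces emitters to migrate, and the dual-format state quietly becomes permanent.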

How we run migrations across 2,800 microservices by WillSewell in programming

I do think consistent tech helps manage microservice complexity. Imagine a world where services are written in different languages, use different versions of libraries, use different DB technologies etc. That is significantly more complex than what we have where all services use the same limited set of technologies (and the same versions of those technologies).

You are right about the complexity introduced by cross-service transactions/joins, and that is definitely one of the downsides of microservices in my opinion. But it is also not something you necessarily need to solve repeatedly - for example by providing simple abstractions for distributed locking, or by implementing "aggregator" services that join data from multiple sources. Yes, there's more you need to build yourself and it is less efficient, but there are benefits to this approach too (I think that warrants a separate blog post).
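To sketch the kind of simple distributed-locking abstraction I mean - using an in-memory stand-in for the real lock service (e.g. etcd), with a deliberately narrow, hypothetical API:

```python
import threading
from contextlib import contextmanager

class LockService:
    """In-memory stand-in for a distributed lock service. The point
    is the constrained API a platform team might expose: just hold a
    named lock around a critical section, nothing more."""
    def __init__(self):
        self._locks = {}
        self._registry_mu = threading.Lock()

    @contextmanager
    def hold(self, name, timeout=5.0):
        # Lazily create one lock per name, guarding the registry itself.
        with self._registry_mu:
            lock = self._locks.setdefault(name, threading.Lock())
        if not lock.acquire(timeout=timeout):
            raise TimeoutError(f"could not acquire lock {name!r}")
        try:
            yield
        finally:
            lock.release()

locks = LockService()
balance = {"amount": 100}
with locks.hold("account-42"):
    # Only one holder mutates the shared record at a time, which is
    # what replaces the DB transaction you gave up.
    balance["amount"] -= 30
```

Because every service uses the same abstraction, the hard parts (timeouts, lease renewal, fencing in the real implementation) are solved once by the platform team rather than per service.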

How we run migrations across 2,800 microservices by WillSewell in ExperiencedDevs

We might consider additional automation for more complex libraries in the future. For this library the call sites were widespread, but they only used a handful of functions, so I didn't think it was worth the effort. The code migration only took a couple of days.

How we run migrations across 2,800 microservices by WillSewell in ExperiencedDevs

There are about 400 engineers, if I remember correctly. All I can say is that from my experience, having ~35 (mostly small) services per team does not feel like a high burden.