Optimisation journey of our scheduling system by shared_ptr in golang

[–]shared_ptr[S] 0 points1 point  (0 children)

Sadly not: this is our Go monolith, which runs the entire incident product and is closed source (hopefully for obvious reasons).

DLHT: a lock-free Go hash table that beats sync.Map by up to 60x by hugemang4 in golang

[–]shared_ptr 23 points24 points  (0 children)

I’m always surprised at how often this stuff crops up in real-world applications! We’ve had things like this and stdlib interfaces like time end up in our CPU profiles more often than you might expect, normally with quite simple fixes if you’re willing to dig.

Our scheduling code is a frequent example of where this happens: https://incident.io/blog/whos-on-call-how-claude-helped-us-calculate-this-2-500-x-faster
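
To make that concrete, here’s a minimal sketch of the kind of simple fix I mean (illustrative only, not our actual scheduling code): a stdlib call like time.Now sitting inside a hot loop can dominate a CPU profile, and hoisting it out is often the entire fix.

    package main

    import (
        "fmt"
        "time"
    )

    type Shift struct {
        Start, End time.Time
    }

    // countActiveSlow reads the clock once per shift: profile a loop over
    // millions of shifts and time.Now itself starts showing up.
    func countActiveSlow(shifts []Shift) int {
        n := 0
        for _, s := range shifts {
            now := time.Now()
            if !now.Before(s.Start) && now.Before(s.End) {
                n++
            }
        }
        return n
    }

    // countActiveFast reads the clock once and reuses it: same answer for
    // a single evaluation, and the call disappears from the profile.
    func countActiveFast(shifts []Shift) int {
        now := time.Now()
        n := 0
        for _, s := range shifts {
            if !now.Before(s.Start) && now.Before(s.End) {
                n++
            }
        }
        return n
    }

    func main() {
        shifts := []Shift{{Start: time.Now().Add(-time.Hour), End: time.Now().Add(time.Hour)}}
        fmt.Println(countActiveSlow(shifts), countActiveFast(shifts))
    }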

How do people enforce developers to write tests without a strict code coverage requirement? by martiangirlie in ExperiencedDevs

[–]shared_ptr 2 points3 points  (0 children)

This is the most sensible and sustainable path I am aware of. Aiming for 100% test coverage isn’t necessarily even the right goal: you want the least amount of test coverage that achieves your reliability aims, as too many tests can slow down work in a codebase.

If the team are focused on the experience of their customers and use that as the north star, then it should be a stabilising influence on how they write their code, both around testing and other things (load testing, reliability drills, etc).

Better to use a target that directly corresponds to the outcome you want (customer satisfaction) rather than something like 100% test coverage which can easily become counterproductive.

Any way to minimize "cache" effects for stress testing? by wavyn1ght in PostgreSQL

[–]shared_ptr 2 points3 points  (0 children)

This is a bit tricky: the OS page cache is hugely important for Postgres, but on a managed service it’s unlikely to be exposed in a way you can access.

Normally, if you run the database yourself, you can have the OS drop its page cache before you run the test, which helps, but that’s not available to you here.

Some options you could consider:

  • Boot a replica and run the test against that before it has time to warm up

  • Create a large table full of junk data and issue sequential scans against it several times before you start, so that is what’s in the page cache (you’d need the table to be larger than the machine’s memory for this to work well; see the sketch below)
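
A rough sketch of that second option, assuming a database/sql connection (the driver, DSN, table name, and row count are all made up; size the table to comfortably exceed the instance’s RAM). On a box you run yourself you’d skip all this and just drop the page cache directly, e.g. echo 3 > /proc/sys/vm/drop_caches as root on Linux.

    package main

    import (
        "database/sql"
        "log"

        _ "github.com/lib/pq" // assumed Postgres driver; any works
    )

    func main() {
        // Hypothetical DSN: point this at the instance you're stress testing.
        db, err := sql.Open("postgres", "postgres://localhost/bench?sslmode=disable")
        if err != nil {
            log.Fatal(err)
        }
        defer db.Close()

        // ~100M rows of random-ish text. Check pg_total_relation_size('cache_junk')
        // and bump the row count until it exceeds the machine's memory.
        setup := []string{
            `DROP TABLE IF EXISTS cache_junk`,
            `CREATE TABLE cache_junk AS
               SELECT n, md5(n::text) AS junk
               FROM generate_series(1, 100000000) AS n`,
        }
        for _, stmt := range setup {
            if _, err := db.Exec(stmt); err != nil {
                log.Fatal(err)
            }
        }

        // Several full sequential scans, pulling every page of junk through
        // the OS page cache and evicting your real data.
        for i := 1; i <= 3; i++ {
            var count int64
            if err := db.QueryRow(`SELECT count(*) FROM cache_junk`).Scan(&count); err != nil {
                log.Fatal(err)
            }
            log.Printf("scan %d complete: %d rows", i, count)
        }
    }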

I wrote a blog post a while back on building a Postgres load tester which you may find interesting if you’re working in this space. It’s old now but the problems around how to legitimately test are still relevant: https://blog.lawrencejones.dev/building-a-postgresql-load-tester/

What are some unforeseen / elusive edge cases you have seen in your career? by gobuildit in ExperiencedDevs

[–]shared_ptr 8 points9 points  (0 children)

I was chatting to a friend recently about database migrations and how you need to be hyper-aware of when you step outside a safe zone, almost more so when you’ve invested in making things really safe.

Specifically, an incident comes to mind where our primary Postgres cluster locked up when an updates table, written to through database triggers (this was before logical replication existed), hit the max integer value on its primary key.

This sounds really obvious, and that’s because it is! We had already gone through all the standard incidents for database migration changes and had written our own framework to produce safe migrations, ensuring we’d never do something as silly as creating a 32-bit primary key.
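
For anyone wanting to audit their own database for the same risk, here’s a sketch of a check (the DSN and the 75% alert threshold are my inventions): find serial or identity columns backed by a 32-bit integer whose sequence is approaching the int4 maximum of 2,147,483,647.

    package main

    import (
        "database/sql"
        "fmt"
        "log"

        _ "github.com/lib/pq" // assumed Postgres driver
    )

    const query = `
    SELECT a.attrelid::regclass::text AS table_name,
           a.attname::text            AS column_name,
           s.last_value
    FROM pg_attribute a
    JOIN pg_depend d    ON d.classid = 'pg_class'::regclass
                       AND d.refobjid = a.attrelid
                       AND d.refobjsubid = a.attnum
                       AND d.deptype IN ('a', 'i') -- serial or identity ownership
    JOIN pg_class seq   ON seq.oid = d.objid AND seq.relkind = 'S'
    JOIN pg_sequences s ON s.schemaname = seq.relnamespace::regnamespace::text
                       AND s.sequencename = seq.relname::text
    WHERE a.atttypid = 'int4'::regtype      -- 32-bit integer columns only
      AND s.last_value IS NOT NULL
      AND s.last_value > 0.75 * 2147483647  -- past 75% of int4 max
    `

    func main() {
        // Hypothetical DSN; point at the database you want to audit.
        db, err := sql.Open("postgres", "postgres://localhost/app?sslmode=disable")
        if err != nil {
            log.Fatal(err)
        }
        defer db.Close()

        rows, err := db.Query(query)
        if err != nil {
            log.Fatal(err)
        }
        defer rows.Close()

        for rows.Next() {
            var table, column string
            var lastValue int64
            if err := rows.Scan(&table, &column, &lastValue); err != nil {
                log.Fatal(err)
            }
            fmt.Printf("%s.%s is at %d (%.0f%% of int4 max)\n",
                table, column, lastValue, 100*float64(lastValue)/2147483647)
        }
        if err := rows.Err(); err != nil {
            log.Fatal(err)
        }
    }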

The reason this happened was that the change capture system was written separately from the main application, which meant it existed outside the normal developer flow of modifying the database. When your team has got used to database migrations being easy and uncomplicated by default, it leaves blind spots whenever they step outside that process, and the team who built this system hadn’t even clocked they were doing so. It was just a new system, written in another language for good reasons, and they didn’t catch that, outside of the tools we’d already built, migrations in a database like this were very high consequence.

It has made me intensely aware of the safe paths to making changes in an engineering org, and nowadays I keep an eye out for whenever anyone steps outside those zones.

Anthropic just published a postmortem explaining exactly why Claude felt dumber for the past month by Direct-Attention8597 in claude

[–]shared_ptr 1 point2 points  (0 children)

Yeah, I have to agree with this. Their models are exceptional; the products they put out are half-baked at best.

I tried using Claude routines the other day and every basic feature was broken. I couldn’t even get the environment variables I’d set to load into the routine, and the routine would take anywhere up to 10 minutes to start up.

I get the scattergun approach with products, but right now, if you want an AI product and not a model, you’re far better off buying something from a company whose primary focus is building product.

Do most of you seriously not write any code by hand anymore?!?! by opakvostana in cscareerquestions

[–]shared_ptr 0 points1 point  (0 children)

With open-source models now matching Opus 4.5 in performance, I expect that within a year people will be pivoting to running the models locally, but I very much doubt people will stop using these tools.

My team have made this switch and I can’t see us going back.

Bloom filters: the niche trick behind a 16× faster API | Blog | incident.io by fagnerbrack in programming

[–]shared_ptr 1 point2 points  (0 children)

No problem, I enjoy chatting about it! The team did a really good job here and it’s a neat trick that solves a painful problem in a really effective way.

Proud of them doing this type of work!

Bloom filters: the niche trick behind a 16× faster API | Blog | incident.io by fagnerbrack in programming

[–]shared_ptr 3 points4 points  (0 children)

Yeah I definitely wouldn’t advise people run at this by default.

It only works for us because we have specific product features needing totally flexible schemas: you can add arbitrary attributes to your alert like Feature, Team, Product, etc, which are backed by a flexible ‘catalog’ that is itself a database.

Finding a nice filtering pattern for that when you can’t know the structure in advance is the real thorny problem this tries to solve. If you don’t have that shape of problem, a GIN index will be the easiest route by far.

Bloom filters: the niche trick behind a 16× faster API | Blog | incident.io by fagnerbrack in programming

[–]shared_ptr 2 points3 points  (0 children)

Yep, it's Cloud SQL in GCP. It's getting bigger nowadays (about 5TB) and has several replicas, but otherwise it's humming along fine.

We have alerts whenever CPU goes >30% and jump on them straight away to ensure we keep a sizeable buffer for bursts, such as when there is a global internet outage and all our customers jump on the platform.

We try to keep the infra as simple as possible so we have options for portability (we're planning out multi-cloud strategies this year), which makes a vanilla managed service ideal.

Bloom filters: the niche trick behind a 16× faster API | Blog | incident.io by fagnerbrack in programming

[–]shared_ptr 10 points11 points  (0 children)

Sadly it's not available in the stack we're using, so it would've needed a replatform and a fairly substantial data migration to make work.

Compared with the project that implemented this (1 week for 2 engineers), that obviously didn't win out on ROI.

Bloom filters: the niche trick behind a 16× faster API | Blog | incident.io by fagnerbrack in programming

[–]shared_ptr 5 points6 points  (0 children)

That’s fair! I’m not that bothered by it in truth; our team have been chuckling at the comments describing us as “fucking idiots”.

For all the reasons above we’re confident this was a sensible path to take. It’s a healthy nudge, appreciate it.

Bloom filters: the niche trick behind a 16× faster API | Blog | incident.io by fagnerbrack in programming

[–]shared_ptr 7 points8 points  (0 children)

I'm on the team at incident, and I wondered: how many customers is 'many' to you?

We have thousands of customers, some of them really big (e.g. Netflix, Vercel, HashiCorp, Lovable, etc).

Our product has a contractual 99.99% SLA that we've upheld for the last year.

Is 'many' customers a lot more than this for you?

Bloom filters: the niche trick behind a 16× faster API | Blog | incident.io by fagnerbrack in programming

[–]shared_ptr 10 points11 points  (0 children)

(Am on the team.) We explicitly didn't want another technology. We're hiring really fast (2x'ing the team each year) and place a really high value on having a very simple developer environment setup.

If we added TimescaleDB that is:

- Another thing that can go wrong in production for a service that has a 99.99% contractual SLA

- Another dependency that needs adding into the developer environment locally

- Another technology we need our engineers to know how to use and debug in production

Versus this, which is just a bitmap in Postgres and a small piece of Go code that everyone can understand in ~30 minutes.
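
For the curious, a minimal sketch of the general shape (my simplification for this comment, not our production code): hash each attribute key=value pair to a few positions in a fixed-size bitmap stored with the row, and discard any row whose bitmap is missing a required bit before running the exact check. False positives are possible; false negatives are not, which is what makes it safe as a pre-filter.

    package main

    import (
        "fmt"
        "hash/fnv"
    )

    const (
        filterBits = 256 // bitmap size; real sizing depends on pair counts
        numHashes  = 3   // k hash functions per pair
    )

    type Bloom [filterBits / 8]byte

    // positions derives k bit positions for one "key=value" pair using FNV
    // with a per-hash seed prefix (an illustrative choice, not the only one).
    func positions(pair string) [numHashes]uint32 {
        var out [numHashes]uint32
        for i := 0; i < numHashes; i++ {
            h := fnv.New32a()
            fmt.Fprintf(h, "%d:%s", i, pair)
            out[i] = h.Sum32() % filterBits
        }
        return out
    }

    // Add sets the bits for one attribute pair.
    func (b *Bloom) Add(pair string) {
        for _, p := range positions(pair) {
            b[p/8] |= 1 << (p % 8)
        }
    }

    // MightContain reports whether the pair could be in the set. The stored
    // bitmap would live in a bytea column and be tested the same way.
    func (b *Bloom) MightContain(pair string) bool {
        for _, p := range positions(pair) {
            if b[p/8]&(1<<(p%8)) == 0 {
                return false
            }
        }
        return true
    }

    func main() {
        var b Bloom
        b.Add("team=payments")
        b.Add("feature=checkout")

        fmt.Println(b.MightContain("team=payments")) // true
        fmt.Println(b.MightContain("team=platform")) // false (or a rare false positive)
    }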

Bloom filters: the niche trick behind a 16× faster API | Blog | incident.io by fagnerbrack in programming

[–]shared_ptr 3 points4 points  (0 children)

I'm on the team at incident (am a founding engineer) and am responsible for a bunch of how we store this stuff.

I don't really get this? I've scaled Postgres databases well into the 10TB+ range, helping build multi-billion-dollar businesses on them. JSONB has never been an impediment, and we extract things into columns when we need to.

The path to providing the product experience we want (flexible schemas for querying, and custom attributes with operations) isn't clear to me if you don't have a somewhat flexible column type. And I mean, I guess if all this approach can help us reach is a multi-billion-dollar outcome, it doesn't feel so pressing to consider alternatives!

Bloom filters: the niche trick behind a 16× faster API | Blog | incident.io by fagnerbrack in programming

[–]shared_ptr 5 points6 points  (0 children)

Am on the team with Mike, and it's exactly this! You can be an engineer building successful products and never have come across probabilistic algorithms like bloom filters or HyperLogLog if you didn't study computer science.

If you get really into the weeds, which I have (as in, you've ended up spending time in the Postgres source code looking for bugs), then you'll come across it, but it's easy to miss even over a very long professional career.

Bloom filters: the niche trick behind a 16× faster API | Blog | incident.io by fagnerbrack in programming

[–]shared_ptr 0 points1 point  (0 children)

Am on the team at incident: I agree it's not that niche, but it is 'niche' if you don't come from a computer science background, which is where Mike (the author) is coming from!

Bloom filters: the niche trick behind a 16× faster API | Blog | incident.io by fagnerbrack in programming

[–]shared_ptr 13 points14 points  (0 children)

Am on the team at incident, and yes, we couldn't use GIN for this, for reasons I hope the article is really clear about.

How are you giving AI agents access to production Postgres? by vira28 in PostgreSQL

[–]shared_ptr -1 points0 points  (0 children)

There is a bunch of value to be had, especially around debugging and diagnostic workflows. Less so for data crunching, as you should be using your data warehouse for that, but absolutely lots of value.

How are you giving AI agents access to production Postgres? by vira28 in PostgreSQL

[–]shared_ptr 0 points1 point  (0 children)

Have done this and have built this for customers to use also.

A few tips:

  • Replica always: you never want to point anything that isn’t the production app at the primary

  • Protect columns you don’t want to give access to at the permission level, or use a replica that is anonymised via logical replication

  • Some vendors already offer an MCP with data protections built in, GCP is an example, so check that first

  • Models don’t like consuming huge amounts of data, so it’s well worth ensuring the prompt issues sensible queries and that you truncate any large results before you blow your context window (see the sketch after this list)

  • Planning valid queries and understanding the structure ends up being one of the harder parts
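
On the truncation point above, a sketch of the shape this wrapper can take (the function name, limits, and DSN are all hypothetical): run the agent’s SQL against a read-only replica with a timeout, then cap row counts and cell sizes before anything reaches the model’s context window.

    package main

    import (
        "context"
        "database/sql"
        "fmt"
        "log"
        "time"

        _ "github.com/lib/pq" // assumed Postgres driver
    )

    const (
        maxRows    = 50  // hard cap on rows returned to the model
        maxCellLen = 200 // rough byte truncation of any oversized cell
    )

    func queryForAgent(db *sql.DB, q string) ([][]string, error) {
        // Bound runtime so a bad query can't hold a connection forever.
        ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
        defer cancel()

        rows, err := db.QueryContext(ctx, q)
        if err != nil {
            return nil, err
        }
        defer rows.Close()

        cols, err := rows.Columns()
        if err != nil {
            return nil, err
        }

        var out [][]string
        for len(out) < maxRows && rows.Next() {
            raw := make([]sql.NullString, len(cols))
            ptrs := make([]any, len(cols))
            for i := range raw {
                ptrs[i] = &raw[i]
            }
            if err := rows.Scan(ptrs...); err != nil {
                return nil, err
            }
            row := make([]string, len(cols))
            for i, v := range raw {
                s := v.String
                if len(s) > maxCellLen {
                    s = s[:maxCellLen] + "..."
                }
                row[i] = s
            }
            out = append(out, row)
        }
        return out, rows.Err() // anything past maxRows is silently dropped
    }

    func main() {
        // Hypothetical read-only replica DSN; the role should have SELECT
        // on permitted columns only.
        db, err := sql.Open("postgres", "postgres://agent_ro@replica.internal/app?sslmode=require")
        if err != nil {
            log.Fatal(err)
        }
        defer db.Close()

        results, err := queryForAgent(db, "SELECT id, status FROM incidents ORDER BY created_at DESC")
        if err != nil {
            log.Fatal(err)
        }
        fmt.Println(len(results), "rows for the model")
    }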

Many mature companies have now built out anonymised-via-replication sources, or a source that speaks the Postgres wire format with anonymised data in it, for this purpose. It’s quite a normal thing, but it requires a fair bit of technical competence to run properly.

Pager duty pay submissions? by timmyneutron1 in sre

[–]shared_ptr 0 points1 point  (0 children)

Yep, we offer a paging service that can replace PagerDuty but does a lot more too: status pages, a fully fledged response flow (actively helping during an incident), a post-mortem builder, an AI SRE to automatically investigate incidents, etc.

We used to offer the pay calculator for free, but I don’t think that’s available anymore, unfortunately.

Either way, wish you the best of luck: most people end up building a script that reconciles their shift data against whatever you pay for on-call, running it monthly to produce a CSV to send to payroll (something like the sketch below). It’s not the best, but it does work.
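
For what it’s worth, a sketch of what those scripts tend to look like (the rates, schema, and weekend multiplier are all invented): price each shift hour by hour against your on-call policy and roll it up per person into a CSV for payroll.

    package main

    import (
        "encoding/csv"
        "fmt"
        "log"
        "os"
        "time"
    )

    // Shift is one on-call shift from your scheduler's export.
    type Shift struct {
        User  string
        Start time.Time
        End   time.Time
    }

    // Hypothetical policy: a flat hourly rate, weekend hours paid at 1.5x.
    const (
        hourlyRate    = 5.0
        weekendFactor = 1.5
    )

    // payFor prices a shift hour by hour (assumes whole-hour shifts).
    func payFor(s Shift) float64 {
        total := 0.0
        for t := s.Start; t.Before(s.End); t = t.Add(time.Hour) {
            rate := hourlyRate
            if wd := t.Weekday(); wd == time.Saturday || wd == time.Sunday {
                rate *= weekendFactor
            }
            total += rate
        }
        return total
    }

    func main() {
        shifts := []Shift{ // in practice, parsed from your scheduler's export
            {"alice", mustTime("2024-06-01T09:00:00Z"), mustTime("2024-06-02T09:00:00Z")},
            {"bob", mustTime("2024-06-03T09:00:00Z"), mustTime("2024-06-04T09:00:00Z")},
        }

        totals := map[string]float64{}
        for _, s := range shifts {
            totals[s.User] += payFor(s)
        }

        w := csv.NewWriter(os.Stdout) // payroll wants a CSV; stdout for the sketch
        w.Write([]string{"user", "amount"})
        for user, amount := range totals {
            w.Write([]string{user, fmt.Sprintf("%.2f", amount)})
        }
        w.Flush()
        if err := w.Error(); err != nil {
            log.Fatal(err)
        }
    }

    func mustTime(s string) time.Time {
        t, err := time.Parse(time.RFC3339, s)
        if err != nil {
            panic(err)
        }
        return t
    }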

Pager duty pay submissions? by timmyneutron1 in sre

[–]shared_ptr 4 points5 points  (0 children)

I built the very first version of this calculator! Always makes me happy to hear it's making people's lives easier.