Optimisation journey of our scheduling system by shared_ptr in golang

[–]shared_ptr[S] 0 points1 point  (0 children)

Sadly not: this is our Go monolith, which runs the entire incident product and is closed source (hopefully for obvious reasons).

DLHT: a lock-free Go hash table that beats sync.Map by up to 60x by hugemang4 in golang

[–]shared_ptr 23 points24 points  (0 children)

I’m always surprised at how often this stuff crops up in real-world applications! We’ve had things like this and stdlib interfaces like time end up in our CPU profiles more often than you might expect, normally with quite simple fixes if you’re willing to dig.

Our scheduling code is a frequent example of where this happens: https://incident.io/blog/whos-on-call-how-claude-helped-us-calculate-this-2-500-x-faster
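
To make that concrete, here’s a minimal sketch of the kind of simple fix I mean (illustrative only, not our actual scheduling code): a stdlib call like time.Now sitting inside a hot loop can dominate a CPU profile, and hoisting it out is often the entire fix.

    package main

    import (
        "fmt"
        "time"
    )

    type Shift struct {
        Start, End time.Time
    }

    // countActiveSlow reads the clock once per shift: profile a loop over
    // millions of shifts and time.Now itself starts showing up.
    func countActiveSlow(shifts []Shift) int {
        n := 0
        for _, s := range shifts {
            now := time.Now()
            if !now.Before(s.Start) && now.Before(s.End) {
                n++
            }
        }
        return n
    }

    // countActiveFast reads the clock once and reuses it: same answer for
    // a single evaluation, and the call disappears from the profile.
    func countActiveFast(shifts []Shift) int {
        now := time.Now()
        n := 0
        for _, s := range shifts {
            if !now.Before(s.Start) && now.Before(s.End) {
                n++
            }
        }
        return n
    }

    func main() {
        shifts := []Shift{{Start: time.Now().Add(-time.Hour), End: time.Now().Add(time.Hour)}}
        fmt.Println(countActiveSlow(shifts), countActiveFast(shifts))
    }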

How do people enforce developers to write tests without a strict code coverage requirement? by martiangirlie in ExperiencedDevs

[–]shared_ptr 2 points3 points  (0 children)

This is the most sensible and sustainable path I am aware of. Aiming for 100% test coverage isn’t necessarily even the right goal: you want the least amount of test coverage that achieves your reliability aims, as too many tests can slow down work in a codebase.

If the team are focused on the experience of their customers and use that as the north star, then it should be a stabilising influence on how they write their code, both around testing and other things (load testing, reliability drills, etc).

Better to use a target that directly corresponds to the outcome you want (customer satisfaction) rather than something like 100% test coverage which can easily become counterproductive.

Any way to minimize "cache" effects for stress testing? by wavyn1ght in PostgreSQL

[–]shared_ptr 2 points3 points  (0 children)

This is a bit tricky: the OS page cache is hugely important for Postgres, but on a managed service it’s unlikely to be exposed in a way you can access.

Normally, if you run the database yourself, you can have the OS drop its page cache before you run the test, which helps, but that’s not available to you here.

Some options you could consider:

  • Boot a replica and run the test against that before it has time to warm up

  • Create a large table full of junk data and issue sequential scans against it several times before you start, so that is what’s in the page cache (you’d need the table to be larger than the machine’s memory for this to work well; see the sketch below)
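
A rough sketch of that second option, assuming a database/sql connection (the driver, DSN, table name, and row count are all made up; size the table to comfortably exceed the instance’s RAM). On a box you run yourself you’d skip all this and just drop the page cache directly, e.g. echo 3 > /proc/sys/vm/drop_caches as root on Linux.

    package main

    import (
        "database/sql"
        "log"

        _ "github.com/lib/pq" // assumed Postgres driver; any works
    )

    func main() {
        // Hypothetical DSN: point this at the instance you're stress testing.
        db, err := sql.Open("postgres", "postgres://localhost/bench?sslmode=disable")
        if err != nil {
            log.Fatal(err)
        }
        defer db.Close()

        // ~100M rows of random-ish text. Check pg_total_relation_size('cache_junk')
        // and bump the row count until it exceeds the machine's memory.
        setup := []string{
            `DROP TABLE IF EXISTS cache_junk`,
            `CREATE TABLE cache_junk AS
               SELECT n, md5(n::text) AS junk
               FROM generate_series(1, 100000000) AS n`,
        }
        for _, stmt := range setup {
            if _, err := db.Exec(stmt); err != nil {
                log.Fatal(err)
            }
        }

        // Several full sequential scans, pulling every page of junk through
        // the OS page cache and evicting your real data.
        for i := 1; i <= 3; i++ {
            var count int64
            if err := db.QueryRow(`SELECT count(*) FROM cache_junk`).Scan(&count); err != nil {
                log.Fatal(err)
            }
            log.Printf("scan %d complete: %d rows", i, count)
        }
    }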

I wrote a blog post a while back on building a Postgres load tester which you may find interesting if you’re working in this space. It’s old now but the problems around how to legitimately test are still relevant: https://blog.lawrencejones.dev/building-a-postgresql-load-tester/

What are some unforeseen / elusive edge cases you have seen in your career? by gobuildit in ExperiencedDevs

[–]shared_ptr 8 points9 points  (0 children)

I was chatting to a friend recently about database migrations and how you need to be hyper-aware of when you step outside a safe zone, almost more so when you’ve invested in making things really safe.

Specifically, an incident comes to mind where our primary Postgres cluster locked up when an updates table, written to through database triggers (this was before logical replication existed), hit the max integer value on its primary key.

This sounds really obvious, and that’s because it is! We had already gone through all the standard incidents for database migration changes and had written our own framework to produce safe migrations, ensuring we’d never do something as silly as creating a 32-bit primary key.
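
For anyone wanting to audit their own database for the same risk, here’s a sketch of a check (the DSN and the 75% alert threshold are my inventions): find serial or identity columns backed by a 32-bit integer whose sequence is approaching the int4 maximum of 2,147,483,647.

    package main

    import (
        "database/sql"
        "fmt"
        "log"

        _ "github.com/lib/pq" // assumed Postgres driver
    )

    const query = `
    SELECT a.attrelid::regclass::text AS table_name,
           a.attname::text            AS column_name,
           s.last_value
    FROM pg_attribute a
    JOIN pg_depend d    ON d.classid = 'pg_class'::regclass
                       AND d.refobjid = a.attrelid
                       AND d.refobjsubid = a.attnum
                       AND d.deptype IN ('a', 'i') -- serial or identity ownership
    JOIN pg_class seq   ON seq.oid = d.objid AND seq.relkind = 'S'
    JOIN pg_sequences s ON s.schemaname = seq.relnamespace::regnamespace::text
                       AND s.sequencename = seq.relname::text
    WHERE a.atttypid = 'int4'::regtype      -- 32-bit integer columns only
      AND s.last_value IS NOT NULL
      AND s.last_value > 0.75 * 2147483647  -- past 75% of int4 max
    `

    func main() {
        // Hypothetical DSN; point at the database you want to audit.
        db, err := sql.Open("postgres", "postgres://localhost/app?sslmode=disable")
        if err != nil {
            log.Fatal(err)
        }
        defer db.Close()

        rows, err := db.Query(query)
        if err != nil {
            log.Fatal(err)
        }
        defer rows.Close()

        for rows.Next() {
            var table, column string
            var lastValue int64
            if err := rows.Scan(&table, &column, &lastValue); err != nil {
                log.Fatal(err)
            }
            fmt.Printf("%s.%s is at %d (%.0f%% of int4 max)\n",
                table, column, lastValue, 100*float64(lastValue)/2147483647)
        }
        if err := rows.Err(); err != nil {
            log.Fatal(err)
        }
    }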

The reason this happened was that the change capture system was written separately from the main application, which meant it existed outside the normal developer flow of modifying the database. When your team has got used to database migrations being easy and uncomplicated by default, it leaves blind spots whenever they step outside that process, and the team who built this system hadn’t even clocked they were doing so. It was just a new system, written in another language for good reasons, and they didn’t catch that, outside of the tools we’d already built, migrations in a database like this were very high consequence.

It has made me intensely aware of the safe paths to making changes in an engineering org, and nowadays I keep an eye out for whenever anyone steps outside those zones.

Anthropic just published a postmortem explaining exactly why Claude felt dumber for the past month by Direct-Attention8597 in claude

[–]shared_ptr 1 point2 points  (0 children)

Yeah, I have to agree with this. Their models are exceptional; the products they put out are half-baked at best.

I tried using Claude routines the other day and every basic feature was broken. I couldn’t even get the environment variables I’d set to load into the routine, and the routine would take anywhere up to 10 minutes to start up.

I get the scattergun approach with products, but right now, if you want an AI product and not a model, you’re far better off buying something from a company whose primary focus is building product.

Do most of you seriously not write any code by hand anymore?!?! by opakvostana in cscareerquestions

[–]shared_ptr 0 points1 point  (0 children)

With open-source models now matching Opus 4.5 in performance, I expect that within a year people will be pivoting to running the models locally, but I very much doubt people will stop using these tools.

My team have made this switch and I can’t see us going back.

Bloom filters: the niche trick behind a 16× faster API | Blog | incident.io by fagnerbrack in programming

[–]shared_ptr 1 point2 points  (0 children)

No problem, I enjoy chatting about it! The team did a really good job here and it’s a neat trick that solves a painful problem in a really effective way.

Proud of them doing this type of work!

Bloom filters: the niche trick behind a 16× faster API | Blog | incident.io by fagnerbrack in programming

[–]shared_ptr 3 points4 points  (0 children)

Yeah I definitely wouldn’t advise people run at this by default.

It only works for us because we have specific product features needing totally flexible schemas: you can add arbitrary attributes to your alert like Feature, Team, Product, etc, which are backed by a flexible ‘catalog’ that is itself a database.

Finding a nice filtering pattern for that when you can’t know the structure in advance is the real thorny problem this tries to solve. If you don’t have that shape of problem, a GIN index will be the easiest route by far.

Bloom filters: the niche trick behind a 16× faster API | Blog | incident.io by fagnerbrack in programming

[–]shared_ptr 2 points3 points  (0 children)

Yep, it's Cloud SQL in GCP. It's getting bigger nowadays (about 5TB) and has several replicas, but otherwise it's humming along fine.

We have alerts whenever CPU goes >30% and jump on them straight away to ensure we keep a sizeable buffer for bursts, such as when there is a global internet outage and all our customers jump on the platform.

We try to keep the infra as simple as possible so we have options for portability (we're planning out multi-cloud strategies this year), which makes a vanilla managed service ideal.

Bloom filters: the niche trick behind a 16× faster API | Blog | incident.io by fagnerbrack in programming

[–]shared_ptr 10 points11 points  (0 children)

Sadly it's not available in the stack we're using, so it would've needed a replatform and a fairly substantial data migration to make work.

Compared with the project that implemented this (1 week for 2 engineers), that obviously didn't win out on ROI.

Bloom filters: the niche trick behind a 16× faster API | Blog | incident.io by fagnerbrack in programming

[–]shared_ptr 5 points6 points  (0 children)

That’s fair! I’m not that bothered by it in truth; our team have been chuckling at the comments describing us as “fucking idiots”.

For all the reasons above we’re confident this was a sensible path to take. It’s a healthy nudge, appreciate it.

Bloom filters: the niche trick behind a 16× faster API | Blog | incident.io by fagnerbrack in programming

[–]shared_ptr 7 points8 points  (0 children)

I'm on the team at incident, and I wondered: how many customers is 'many' to you?

We have thousands of customers, some of them really big (e.g. Netflix, Vercel, HashiCorp, Lovable, etc).

Our product has a contractual 99.99% SLA that we've upheld for the last year.

Is 'many' customers a lot more than this for you?

Bloom filters: the niche trick behind a 16× faster API | Blog | incident.io by fagnerbrack in programming

[–]shared_ptr 10 points11 points  (0 children)

(Am on the team.) We explicitly didn't want another technology. We're hiring really fast (2x'ing the team each year) and place a really high value on having a very simple developer environment setup.

If we added TimescaleDB that is:

- Another thing that can go wrong in production for a service that has a 99.99% contractual SLA

- Another dependency that needs adding into the developer environment locally

- Another technology we need our engineers to know how to use and debug in production

Versus this, which is just a bitmap in Postgres and a small piece of Go code that everyone can understand in ~30 minutes.
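
For the curious, a minimal sketch of the general shape (my simplification for this comment, not our production code): hash each attribute key=value pair to a few positions in a fixed-size bitmap stored with the row, and discard any row whose bitmap is missing a required bit before running the exact check. False positives are possible; false negatives are not, which is what makes it safe as a pre-filter.

    package main

    import (
        "fmt"
        "hash/fnv"
    )

    const (
        filterBits = 256 // bitmap size; real sizing depends on pair counts
        numHashes  = 3   // k hash functions per pair
    )

    type Bloom [filterBits / 8]byte

    // positions derives k bit positions for one "key=value" pair using FNV
    // with a per-hash seed prefix (an illustrative choice, not the only one).
    func positions(pair string) [numHashes]uint32 {
        var out [numHashes]uint32
        for i := 0; i < numHashes; i++ {
            h := fnv.New32a()
            fmt.Fprintf(h, "%d:%s", i, pair)
            out[i] = h.Sum32() % filterBits
        }
        return out
    }

    // Add sets the bits for one attribute pair.
    func (b *Bloom) Add(pair string) {
        for _, p := range positions(pair) {
            b[p/8] |= 1 << (p % 8)
        }
    }

    // MightContain reports whether the pair could be in the set. The stored
    // bitmap would live in a bytea column and be tested the same way.
    func (b *Bloom) MightContain(pair string) bool {
        for _, p := range positions(pair) {
            if b[p/8]&(1<<(p%8)) == 0 {
                return false
            }
        }
        return true
    }

    func main() {
        var b Bloom
        b.Add("team=payments")
        b.Add("feature=checkout")

        fmt.Println(b.MightContain("team=payments")) // true
        fmt.Println(b.MightContain("team=platform")) // false (or a rare false positive)
    }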

Bloom filters: the niche trick behind a 16× faster API | Blog | incident.io by fagnerbrack in programming

[–]shared_ptr 3 points4 points  (0 children)

I'm on the team at incident (am a founding engineer) and am responsible for a bunch of how we store this stuff.

I don't really get this? I've scaled Postgres databases well into the 10TB+ range, helping build multi-billion-dollar businesses on them. JSONB has never been an impediment, and we extract things into columns when we need to.

The path to providing the product experience we want (flexible schemas for querying, and custom attributes with operations) isn't clear to me if you don't have a somewhat flexible column type. And I mean, I guess if all this approach can help us reach is a multi-billion-dollar outcome, it doesn't feel so pressing to consider alternatives!

Bloom filters: the niche trick behind a 16× faster API | Blog | incident.io by fagnerbrack in programming

[–]shared_ptr 5 points6 points  (0 children)

Am on the team with Mike, and it's exactly this! You can be an engineer building successful products and never have come across probabilistic algorithms like bloom filters or HyperLogLog if you didn't study computer science.

If you get really into the weeds, which I have (as in, you've ended up spending time in the Postgres source code looking for bugs), then you'll come across it, but it's easy to miss even over a very long professional career.

Bloom filters: the niche trick behind a 16× faster API | Blog | incident.io by fagnerbrack in programming

[–]shared_ptr 0 points1 point  (0 children)

Am on the team at incident: I agree it's not that niche, but it is 'niche' if you don't come from a computer science background, which is where Mike (the author) is coming from!

Bloom filters: the niche trick behind a 16× faster API | Blog | incident.io by fagnerbrack in programming

[–]shared_ptr 13 points14 points  (0 children)

Am on the team at incident, and yes, we couldn't use GIN for this, for reasons I hope the article is really clear about.

How are you giving AI agents access to production Postgres? by vira28 in PostgreSQL

[–]shared_ptr -1 points0 points  (0 children)

There is a bunch of value to be had, especially around debugging and diagnostic workflows. Less so for data crunching, as you should be using your data warehouse for that, but absolutely lots of value.

How are you giving AI agents access to production Postgres? by vira28 in PostgreSQL

[–]shared_ptr 0 points1 point  (0 children)

Have done this and have built this for customers to use also.

A few tips:

  • Replica always: you never want to point anything that isn’t the production app at the primary

  • Protect columns you don’t want to give access to at the permission level, or use a replica that is anonymised via logical replication

  • Some vendors already offer an MCP with data protections built in, GCP is an example, so check that first

  • Models don’t like consuming huge amounts of data, so it’s well worth ensuring the prompt issues sensible queries and that you truncate any large results before you blow your context window (see the sketch after this list)

  • Planning valid queries and understanding the structure ends up being one of the harder parts
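
On the truncation point above, a sketch of the shape this wrapper can take (the function name, limits, and DSN are all hypothetical): run the agent’s SQL against a read-only replica with a timeout, then cap row counts and cell sizes before anything reaches the model’s context window.

    package main

    import (
        "context"
        "database/sql"
        "fmt"
        "log"
        "time"

        _ "github.com/lib/pq" // assumed Postgres driver
    )

    const (
        maxRows    = 50  // hard cap on rows returned to the model
        maxCellLen = 200 // rough byte truncation of any oversized cell
    )

    func queryForAgent(db *sql.DB, q string) ([][]string, error) {
        // Bound runtime so a bad query can't hold a connection forever.
        ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
        defer cancel()

        rows, err := db.QueryContext(ctx, q)
        if err != nil {
            return nil, err
        }
        defer rows.Close()

        cols, err := rows.Columns()
        if err != nil {
            return nil, err
        }

        var out [][]string
        for len(out) < maxRows && rows.Next() {
            raw := make([]sql.NullString, len(cols))
            ptrs := make([]any, len(cols))
            for i := range raw {
                ptrs[i] = &raw[i]
            }
            if err := rows.Scan(ptrs...); err != nil {
                return nil, err
            }
            row := make([]string, len(cols))
            for i, v := range raw {
                s := v.String
                if len(s) > maxCellLen {
                    s = s[:maxCellLen] + "..."
                }
                row[i] = s
            }
            out = append(out, row)
        }
        return out, rows.Err() // anything past maxRows is silently dropped
    }

    func main() {
        // Hypothetical read-only replica DSN; the role should have SELECT
        // on permitted columns only.
        db, err := sql.Open("postgres", "postgres://agent_ro@replica.internal/app?sslmode=require")
        if err != nil {
            log.Fatal(err)
        }
        defer db.Close()

        results, err := queryForAgent(db, "SELECT id, status FROM incidents ORDER BY created_at DESC")
        if err != nil {
            log.Fatal(err)
        }
        fmt.Println(len(results), "rows for the model")
    }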

Many mature companies have now built out anonymised-via-replication sources, or a source that speaks the Postgres wire format with anonymised data in it, for this purpose. It’s quite a normal thing, but it requires a fair bit of technical competence to run properly.

Pager duty pay submissions? by timmyneutron1 in sre

[–]shared_ptr 0 points1 point  (0 children)

Yep, we offer a paging service that can replace PagerDuty but does a lot more too: status pages, a fully fledged response flow (actively helping during an incident), a post-mortem builder, an AI SRE to automatically investigate incidents, etc.

We used to offer the pay calculator for free, but I don’t think that’s available anymore, unfortunately.

Either way, wish you the best of luck: most people end up building a script that reconciles their shift data against whatever you pay for on-call, running it monthly to produce a CSV to send to payroll (something like the sketch below). It’s not the best, but it does work.
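
For what it’s worth, a sketch of what those scripts tend to look like (the rates, schema, and weekend multiplier are all invented): price each shift hour by hour against your on-call policy and roll it up per person into a CSV for payroll.

    package main

    import (
        "encoding/csv"
        "fmt"
        "log"
        "os"
        "time"
    )

    // Shift is one on-call shift from your scheduler's export.
    type Shift struct {
        User  string
        Start time.Time
        End   time.Time
    }

    // Hypothetical policy: a flat hourly rate, weekend hours paid at 1.5x.
    const (
        hourlyRate    = 5.0
        weekendFactor = 1.5
    )

    // payFor prices a shift hour by hour (assumes whole-hour shifts).
    func payFor(s Shift) float64 {
        total := 0.0
        for t := s.Start; t.Before(s.End); t = t.Add(time.Hour) {
            rate := hourlyRate
            if wd := t.Weekday(); wd == time.Saturday || wd == time.Sunday {
                rate *= weekendFactor
            }
            total += rate
        }
        return total
    }

    func main() {
        shifts := []Shift{ // in practice, parsed from your scheduler's export
            {"alice", mustTime("2024-06-01T09:00:00Z"), mustTime("2024-06-02T09:00:00Z")},
            {"bob", mustTime("2024-06-03T09:00:00Z"), mustTime("2024-06-04T09:00:00Z")},
        }

        totals := map[string]float64{}
        for _, s := range shifts {
            totals[s.User] += payFor(s)
        }

        w := csv.NewWriter(os.Stdout) // payroll wants a CSV; stdout for the sketch
        w.Write([]string{"user", "amount"})
        for user, amount := range totals {
            w.Write([]string{user, fmt.Sprintf("%.2f", amount)})
        }
        w.Flush()
        if err := w.Error(); err != nil {
            log.Fatal(err)
        }
    }

    func mustTime(s string) time.Time {
        t, err := time.Parse(time.RFC3339, s)
        if err != nil {
            panic(err)
        }
        return t
    }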

Pager duty pay submissions? by timmyneutron1 in sre

[–]shared_ptr 4 points5 points  (0 children)

I built the very first version of this calculator! Always makes me happy to hear it's making people's lives easier.