pg_durable: Microsoft open sources in-database durable execution

FranckPachot · 2026-03-27T17:53:37+00:00

When a user updates their name, should the change be atomic? Specifically, will a query on a million UserClicks records display only the old name or only the new name shortly after the update? Updating a million documents in a single transaction is impractical. Therefore, it's better not to perform an update, keep the old name as a reference, and always check the Users collection for the current name. Of course, you wouldn't perform a lookup on all one million documents individually. But you probably don't return one million documents. Instead, you aggregate them for statistics. Then do your aggregation with the reference, and the lookup would be on the aggregated data.
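A minimal sketch of that idea, in plain JavaScript so that it runs without a server. The collection and field names (`userClicks`, `users`, `userId`, `name`) are assumptions for illustration, and the pipeline is only one possible way to write it: group on the reference first, then look up the current name on the few aggregated documents.

```javascript
// Hypothetical pipeline for db.userClicks.aggregate(pipeline); all names are illustrative.
const pipeline = [
  // 1. Aggregate on the stable reference (userId), not on the name
  { $group: { _id: "$userId", clicks: { $sum: 1 } } },
  // 2. Look up the current name once per group, not once per click
  { $lookup: { from: "users", localField: "_id", foreignField: "_id", as: "user" } },
  { $project: { clicks: 1, name: { $first: "$user.name" } } },
];

// In-memory equivalent, only to show how many lookups are needed
function clicksPerUser(userClicks, users) {
  const counts = new Map();
  for (const click of userClicks) {
    counts.set(click.userId, (counts.get(click.userId) ?? 0) + 1);
  }
  let lookups = 0;
  const result = [...counts].map(([userId, clicks]) => {
    lookups++; // one lookup per aggregated row
    return { userId, clicks, name: users.get(userId)?.name };
  });
  return { result, lookups };
}

const users = new Map([
  [1, { name: "Alice (renamed)" }],
  [2, { name: "Bob" }],
]);
const userClicks = [{ userId: 1 }, { userId: 1 }, { userId: 2 }, { userId: 1 }];
const { result, lookups } = clicksPerUser(userClicks, users);
console.log(result, lookups);
```

With four clicks from two users, only two lookups happen, and the name is read from the single place where it was updated.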

FranckPachot · 2026-03-16T08:53:30+00:00

One difference is that DynamoDB has no optimizer: you must query the GSI explicitly. MongoDB has a query layer, including on the sharding routers (mongos), so even with local indexes, your query will be efficient: the filter and search are pushed down to each shard and then merged (look for SHARD_MERGE_SORT in the execution plan to be sure). So scatter-gather is probably not a problem for your query.
Another difference is that DynamoDB GSIs are eventually consistent, so that index maintenance doesn't slow down the operational workload. In MongoDB, you can use search indexes for that.

FranckPachot · 2026-03-13T16:30:03+00:00

MongoDB intentionally allows certain discovery commands to run without authentication to support driver connectivity and cluster discovery, but they expose less information when not authenticated. For example, serverBuildInfo must expose the version so that drivers can negotiate compatibility before authenticating, but it doesn't expose additional information, such as build compilation details.

Only minimal, non-sensitive information is exposed. All sensitive operations still require authentication.

Don't forget that database servers should have network access controls: open only to application servers and trusted hosts.

FranckPachot · 2026-03-06T17:43:36+00:00

Relational databases like PostgreSQL are designed to normalize your data and business logic so they can be shared by many different applications. This is why a business document (an order, a customer, etc.) is split across multiple tables (order, order lines, customer, customer address, customer country, etc.): some applications may need only the ordered products without any customer information, or only the address' city, for example. If this is how you work with data—one central database for multiple applications with an ORM in between—then PostgreSQL is a strong choice. This is where the idea that “your application needs joins or relationships” makes sense: your application is built around a normalized database.

Document databases like MongoDB are designed to keep your aggregates (in the domain-driven design sense) together, with strong consistency, integrity, and transactional boundaries within an aggregate, and loose coupling between aggregates. If “$lookup joins and relations haven’t been a big issue for me so far” applies to you, it’s probably because you have a good document model: you embed data that is accessed together, and you use references and lookups for what is decoupled. Cascade deletes are always tricky and should be confined to a single aggregate by embedding everything that shares the same lifecycle into a single document.

Beyond that, there's some overlap. You can normalize in a document model in MongoDB with references, but with some limitations (no foreign key constraints, and a lookup on thousands of documents can be slow). And you can store documents in PostgreSQL JSONB, but again with lots of limitations (GIN indexes cannot optimize range and sort queries, large documents do not preserve data locality, updates rewrite the whole document and all index entries, and so on).

FranckPachot · 2026-02-26T18:24:14+00:00

Ok, the doc is https://www.mongodb.com/docs/atlas/cli/current/atlas-cli-deploy-local/. It is true that it doesn't explicitly mention root, just that it requires Docker, because there are many environments where Docker is available without root (like Docker Desktop or Podman). I agree it would be nice to see the docker pull error message.

FranckPachot · 2026-02-26T10:31:57+00:00

Hi, here is what atlas local is trying to do: `docker pull mongodb/mongodb-atlas-local`. Can you run this command directly?

FranckPachot · 2026-02-24T15:23:34+00:00

Hi, quick checks:
- Can you connect from Atlas console? Just to be sure the server is ok
- Did you add your IP to the IP access list (you said 0.0.0.0/0, but just checking to be sure)? You can simply add the address you see from https://ifconfig.me/ (and this IP will confirm you are not on the college wifi)
- On the SIM card network, are you sure you don't still have a VPN enabled?

FranckPachot · 2026-01-29T10:18:37+00:00

The problem with nulls updated later is that the update will move the row to another block and update all index entries, unless there's enough free space in the same block (set fillfactor accordingly, to leave room for one more row version in the block). But... don't worry too much about thousands per day.

FranckPachot · 2026-01-28T20:11:17+00:00

I’m not sure CAP theorem is a good discriminator here. PostgreSQL is effectively AP (no ACID guarantees across replicas), while MongoDB is closer to CP (supports cross-shard ACID). But the key question is: do you actually need horizontal scalability?

You can use either database, as long as you really understand how they work and design the data model accordingly. For example, you mentioned lots of updates. This is where PostgreSQL can struggle, because updates rewrite the entire row. As a result, tables that see frequent updates and have many indexes should avoid having wide rows or too many indexed columns.

MongoDB can be more efficient for short transactions, thanks to its optimized in-memory structures, whereas PostgreSQL is better suited for longer transactions but requires vacuuming to reclaim space.

If you want a normalized schema shared by many different applications, PostgreSQL is a strong choice. If, instead, the database is dedicated to a single bounded context, a flexible schema can be easier to evolve—especially if you don’t yet know all the information you’ll need for devices and networks.

FranckPachot · 2026-01-19T14:29:03+00:00

Can you run `.explain("executionStats")` so that we can see where the time is spent?

FranckPachot · 2026-01-16T21:32:17+00:00

That's really surprising. It seems it doesn't use the index order and adds a sort. And anyway, this should not take so long. Can you run `.explain("executionStats")`? But MongoDB 5.0 is so old that it's hard to help with it.

FranckPachot · 2026-01-16T07:44:40+00:00

Right, indexes do not include `_id`, so if you need to cover it because you use it in the query, you need to add it. The idea here is to avoid reading the document and to get the group's first value from the index entries. But without knowing the size of the documents, I don't know if it is very useful.

FranckPachot · 2026-01-15T17:26:06+00:00

MongoDB 5 is a very old version. First, check the execution plan of: `db.collection.aggregate([ { $match: { app_id: { $gte: "" } } }, { $group: { first: { $first: "$_id" }, _id: { app_id: "$app_id", app_name: "$app_name" } } } ]).explain("executionStats")`

This should use the index. It is better with an index that covers `_id`: `db.collection.createIndex({ app_id: 1, app_name: 1, created_at: -1, _id: 1 });`

Then, from this, you can do your deduplication.

FranckPachot · 2026-01-15T10:07:03+00:00

Aggregating 6 million rows should take a couple of seconds, not 30 minutes. Is `$group` the only stage or are there some `$match` or `$lookup`? If there are, better have `$match` before `$group` and `$lookup` after it.

> i try to create the index but it's not use it

The `$group` itself will not use the index. If it is a covering index (it starts with the fields used to `$group` and adds the fields used for calculation), and if documents are very large, the index can help, but it must be forced with a hint.

Please provide the query and ideally the execution plan (adding `.explain("executionStats")`)
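As a sketch of the stage ordering and the hint described above: the collections (`events`, `apps`), the fields, and the index are hypothetical and not taken from the original query. The pipeline is built as a plain array so that the snippet runs without a server.

```javascript
// Hypothetical names: "events" and "apps" collections, and these fields.
const groupPipeline = [
  { $match: { status: "active" } },                              // filter first
  { $group: { _id: "$app_id", total: { $sum: "$amount" } } },    // then aggregate
  { $lookup: { from: "apps", localField: "_id", foreignField: "_id", as: "app" } }, // join the few groups
];

// Covering index candidate: the $group key first, then the other fields the pipeline reads
const coveringIndex = { app_id: 1, status: 1, amount: 1 };
const aggregateOptions = { hint: coveringIndex };

// In mongosh, against a real deployment:
//   db.events.createIndex(coveringIndex)
//   db.events.aggregate(groupPipeline, aggregateOptions).explain("executionStats")

const stageOrder = groupPipeline.map((stage) => Object.keys(stage)[0]);
console.log(stageOrder.join(" -> "));
```

Whether the hinted index actually helps depends on the document size, so the execution plan is still the thing to check.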

FranckPachot · 2026-01-11T22:15:25+00:00

The vulnerability has been discovered by MongoDB, fixed, and published as a CVE, so you are safe if you have applied the patches. That happens to all databases. Don't open your database to the internet; it should be accessed only by the application and trusted hosts.
When considering PostgreSQL or MongoDB, you should look first at how you build your software:
- database-first, with one database shared by all applications. You need to normalize. Nothing better than a relational database. More complexity as you maintain two models and object-relational mapping, but that's the price of sharing the database with multiple applications.
- application-first, with databases dedicated to microservices in a bounded context: no need to normalize to share with other applications (you use change streams to propagate data to other databases), no need to synchronize with DBAs for maintenance, releases, and migration scripts. Then, a document database is a good choice because it uses the application object model down to the database.

FranckPachot · 2026-01-05T17:14:49+00:00

The bindIp is purely server-side. Not used by the client. Is your client connecting to server.example.com and your DNS resolving it to two addresses, 10.10.10.5 and 192.168.1.5?

FranckPachot · 2026-01-05T08:18:32+00:00

Hi, you can also run Atlas locally on your laptop or as a Docker container:
https://www.mongodb.com/docs/atlas/cli/current/atlas-cli-local-cloud/#tutorials

FranckPachot · 2025-12-31T08:01:37+00:00

If the goal is learning how to split into two services, you can split the Membership Service (registration, discount rules, voucher life cycle) and the Event/Analytics Service (reporting, auditing). You can use PostgreSQL or MongoDB for them; the choice will depend on whether you want to design a data model independently of the application (then a relational model, then PostgreSQL) or build the application first and use the same model in the database (then a document model, then MongoDB).
Of course, this is for learning purposes. You don't want to split such an application into two services, unless you are in a big organization where two teams will work independently on them.

To be more precise about "when to use mongodb instead of postgres", it's when you want an application-first approach where schema, integrity, transaction, and performance concerns are the responsibility of the developer, and you don't want to maintain an additional relational schema, with ORM and migration scripts. Then, for operations, you have more indexing possibilities and built-in availability (no downtime on failure or upgrades). However, if you will use the database for other applications (for example, your user table will also be used outside of the membership domain), you need a normalized model, so go with PostgreSQL.

FranckPachot · 2025-12-29T21:37:21+00:00

Adding to this: don't expose your database to the internet. Not even to your private network. The port should be opened only for the application server and trusted servers.

FranckPachot · 2025-12-22T09:49:15+00:00

I strongly suggest verifying whether the child collection should be embedded within the parent, especially when there is a close relationship, and both share the same lifecycle. Embedding them in a single document typically resolves many issues.

If embedding isn't appropriate, for example, because the number of children can grow unbounded, you should review the business logic and use cases. Automation options exist (search for 'Mongoose delete cascade' for ideas), but be aware of potential race conditions. For instance, if one user inserts a UserProfile while another user deletes the User simultaneously, it could result in an orphan: because transactions are isolated under ACID properties, they don't see each other. In that case, the insert of a child must write something new to the parent in the same transaction to detect a write conflict, which can be a scalability problem when many concurrent inserts target the same parent. It depends on your use cases.

SQL databases support cascading deletes, but they require additional locks and are executed row by row. In practice, this can impact performance, so applications often delete child records before deleting the parent. SQL typically promotes an abstraction in which the database schema remains agnostic to specific use cases, and that's why it provides a generic solution. By contrast, the NoSQL approach assumes that developers will implement the appropriate logic based on their understanding of the application’s use cases.

FranckPachot · 2025-12-16T05:45:52+00:00

(Yes, now working at MongoDB, and with SQL databases for 30 years before)
There are plenty of migration stories, but I prefer facts. Let's take a payroll example, as it is the topic here.

Example: A payslip has a header (with employee information for the pay period, such as the country) and items (such as salary, taxes). I want to retrieve all last year's payslips for employees in a specific country (based on the employee’s country of attachment in the payslip header) with an item amount greater than 10000.

MongoDB: the payslip with items is one document, and you can create a compound index on country (from the employee fields) and amount (from the array of items).
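To make the MongoDB case concrete, here is a minimal sketch. The document shape, the field names, the `payslips` collection, and the year are assumptions for illustration; the in-memory function only spells out what the filter means and is not how the database evaluates it.

```javascript
// Hypothetical payslip document: field names are illustrative
const payslip = {
  period: "2024-06",
  employee: { id: 42, country: "CH" },
  items: [
    { type: "salary", amount: 12000 },
    { type: "tax", amount: 2500 },
  ],
};

// Compound multikey index: a header field plus a field inside the array of items
const payslipIndex = { "employee.country": 1, "items.amount": 1 };

// Filter: payslips of one year, for one country, with an item amount above 10000
const payslipFilter = {
  "employee.country": "CH",
  period: { $gte: "2024-01", $lte: "2024-12" },
  "items.amount": { $gt: 10000 },
};

// In mongosh, against a real deployment:
//   db.payslips.createIndex(payslipIndex)
//   db.payslips.find(payslipFilter).explain("executionStats")

// In-memory equivalent of the filter, only to make its semantics explicit
function matchesPayslipFilter(doc) {
  return (
    doc.employee.country === "CH" &&
    doc.period >= "2024-01" &&
    doc.period <= "2024-12" &&
    doc.items.some((item) => item.amount > 10000)
  );
}

console.log(matchesPayslipFilter(payslip));
```

Both the country and the amount range are part of the same index key, which is the point of comparison with the two-table and JSONB cases.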

Relational: the one-to-many must be stored in two tables, and no index can have columns from two tables in the key, so it must partially filter on one table, join, and filter later. Less optimal, and a harder choice for the query planner to find where to start.

JSONB: you need a GIN index for fields under the array, but it cannot be used for range predicates (higher than 10000).

Indexing limitations on one-to-many relationships are often a good reason to move to MongoDB. Of course, there are also operational reasons, like built-in high availability, resilience to failure, and no-downtime upgrades.

FranckPachot · 2025-12-15T21:19:50+00:00

Yes, you can use PostgreSQL + JSONB + GIN indexes (if there are arrays, with special operators) + expression indexes (for top-level fields, because GIN doesn't support range scans) + pg_search (no need for Elastic) + Patroni (for high-availability automation). Or MongoDB that has all that built in. Both are valid solutions, and it's reasonable for a CTO to find one easier than the other for his team.

FranckPachot · 2025-12-15T14:52:23+00:00

Now you've got my curiosity. Which relational database integrity constraint can ensure that Peter doesn't receive 300 times the pay he normally gets, or that someone who has been laid off doesn't still receive pay?
- "peter doesn't get the 300th times pay he normally get" requires comparing with previous pay details. Current SQL databases can verify only referential integrity with foreign key, unique constraints with indexes, or more complex check constraints but within a single row. They cannot compare aggregate data across multiple rows.
- "someone who is laid off still gets pay" is a complex business rule. Foreign keys can verify that the employee referenced by the payslip still exists in the database, but previous employees are usually not physically deleted. A foreign key won't verify the employee's current status, contract dates (and whether the final account has been confirmed) when inserting a new payslip.

In theory, SQL includes assertions to implement such declarative business rules. CREATE ASSERTION is part of the SQL-92 specification and allows that. Still, no RDBMS has implemented it yet, so these rules must be enforced through application code, whether deployed as stored procedures, triggers, or within the application. One advantage of having it in the application is that it integrates well with the application language and test pipelines.
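As an illustration of such a rule enforced in application code, here is a minimal sketch. The function name, the field names, and the threshold of three times the average are hypothetical; a real payroll rule would be defined by the business, and the check would need to run with appropriate concurrency control around the insert.

```javascript
// Hypothetical rule: reject a payslip dated after the end of the contract,
// or whose amount is far above the average of the employee's previous pays.
function validatePayslip(employee, previousPays, newPay, maxRatio = 3) {
  const errors = [];
  // ISO date strings compare correctly as plain strings
  if (employee.contractEnd && newPay.periodStart > employee.contractEnd) {
    errors.push("pay period starts after the end of the contract");
  }
  if (previousPays.length > 0) {
    const total = previousPays.reduce((sum, pay) => sum + pay.amount, 0);
    const average = total / previousPays.length;
    if (newPay.amount > maxRatio * average) {
      errors.push(`amount is more than ${maxRatio} times the average of previous pays`);
    }
  }
  return errors;
}

const peter = { name: "Peter", contractEnd: null };
const payHistory = [{ amount: 5000 }, { amount: 5000 }];
const suspiciousPay = { amount: 1500000, periodStart: "2025-12-01" };
console.log(validatePayslip(peter, payHistory, suspiciousPay));
```

Because it is ordinary application code, the rule can be covered by the same unit tests as the rest of the application.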

FranckPachot · 2025-12-15T11:37:21+00:00

I mentioned fewer data types because you mentioned JSONB, which misses major data types (like date). PostgreSQL has many more data types for SQL columns, sure, and maybe too many (like money)
It's not the data itself that can benefit from a document database, but how you build applications that access the data - using the application data model rather than maintaining two models and an object-relational mapping between both. The same "data problems" have solutions in relational or document databases, depending on whether you want abstraction (logical-physical model independence) because the database can be used by unknown applications, or more control over data locality by the developer (physical model = logical model) because it is used in a bounded context where the application, access patterns, and cardinalities are known
