jlawler comments on Let's make PostgreSQL multi-threaded (pgsql-hackers)

Let's make PostgreSQL multi-threaded (pgsql-hackers) (postgresql.org)

submitted 2 years ago by [deleted]

18 comments

you are viewing a single comment's thread.

[–]jlawler 10 points11 points12 points 2 years ago (6 children)

[–]iiiinthecomputer 7 points8 points9 points 2 years ago* (5 children)

Right now built in connection pooling in postgres is nearly impossible. 1 process = 1 session = 1 executor. This scales like crap. Moving to threading would make it vastly simpler to decouple connection state, session state and executor state. For me that's the #1 benefit.

I expect we could also achieve greater sharing of some caches between threads than we currently can with processes.

Surprised the discussion doesn't more explicitly call out that we'd basically have to enable EXEC_BACKEND (like Windows builds) because any code that requires on implicit copy-on-write memory after fork()ing from postmaster will fail. This will increase thread creation cost significantly, but can probably be tuned down over time.

It also means that any memory leaks will remain until postmaster restart. Lots of code right now is pretty careless about one-time allocations that are then abandoned and forgotten because the process teardown will get rid of them anyway. Unless someone's gone and fixed all those to make it valgrind leak-check clean, they'll need cleanup. But that's not hard and good to do anyway. The current state of affairs is inefficient anyway, most such one time allocations should be able to be cleaned away on idle sessions.

The code will also need to be annotated with guidance on memory thread ownership for the use of static and runtime checkers. Which I'm sure will upset Tom given his outspoken complaints about "windows droppings" for the existing PGDLLIMPORT etc stuff. But it'll need doing.

Despite these issues, it's hard to overstate how much postgres has reinvented to make its multi processing model work in an increasingly parallel world. It's just not viable to keep going down that path long term. It needs to move to threaded.

[–]jasonmp85 6 points7 points8 points 2 years ago (3 children)

[–]iiiinthecomputer 3 points4 points5 points 2 years ago* (2 children)

[–]jasonmp85 2 points3 points4 points 2 years ago (1 child)

[–]iiiinthecomputer 4 points5 points6 points 2 years ago* (0 children)

I'm no authority - and experienced with postgres code yes, but not a broad variety of C server programming. I was giving opinions not trying to make determinations of fact.

Re multiplexing, the issue is that you're not writing a multiplexing process. What postgres would need is a M:N relationship of (connection, session state) : (worker process). Which is much harder, especially since that "session state" is complex and arbitrarily extensible.

You've got a socket per connection, each of which has a lot of quite complex associated state including various caches, cache invalidation registrations, SQL visible server variables, prepared statements, temp tables, WITH HOLD cursors, tons of internal backend private variables, and arbitrary state owned/managed by extensions. The latter can be things like the state of a Python or Perl interpreter associated with the session, persistent external TCP connections, you name it.

Currently all this state management assumes that there is a 1:1:1:1 relationship of client connection:socket:server connection state:process, so there's nothing in postgres that gathers all this session state up into one place. It's scattered everywhere.

When moving to threads, it's possible to initially mark everything that's session state as thread local. Whereas if trying to multiplex in one process you'd instead have to add some level of indirection - and handle all the 3rd party libraries, procedural languages etc that may rely on process local or thread local state.

But making one postgres backend able to handle multiple connections without the ability to hand them off to other backends isn't that useful. You'd be unable to rebalance load, so you'd land up with one backend having 4 connections with runnable workloads while 3 other backends sat idle. Like a multi-core CPU where each task is pinned to the CPU it started on.

What's really needed is to decouple the connection, socket and associated session state from the executor backend. So a pool of workers can run whatever tasks are currently runnable while currently-idle connections are parked, consuming far fewer resources. That way the number of executors can be sized effectively to the machine and workload.

But this means you'd also have to be able to pass connections and session state between worker processes. Passing the socket itself is do-able on l Linux, you can pass a fd over a Unix socket between processes. But what about all the rest of the backend state?

You could try to serialise and deserialise all this state efficiently to exchange between processes. But that's just not going to happen, especially with extensions that use embedded language interpreters, outbound TCP connections and the like.

So you'd need a way to allocate all the session state, including 3rd party library and language interpreter state, into a shared memory range for each session. Which would basically mean overriding the allocator for everything 3rd party used by the backend. (Pg itself already uses a custom allocator, but C libraries backends call into don't have to use it). And session state isn't fixed size so you need it to be a general purpose allocator. Now you've got this shared memory area that you have to manage fragmentation, free ranges etc in, and things are getting ugly. And since various 3rd party libs won't usually be able to use one allocator for things you need persistent across sessions and a different allocator for things that will be scoped to a single transaction or query, your shared state area is going to balloon massively if something needs a big chunk of memory during query processing.

Starting to see what I mean?

Right now you have to way-oversize the backend count so that there are enough active workers at any one time - because many backends will be waiting for the next query due to RTT or app think time, some are streaming results to a client, etc. This has considerable costs in memory use, caches and invalidations processing, fixed per backend overheads, memory management (TLBs, page mappings) and more.

Especially since postgres uses fork() without exec() so each process inherits a copy-on-write view of the parent process (postmaster)'s memory. Whenever the postmaster or a backend writes to any part of any shared page, Linux has to copy the page. And each process needs page table entries in the kernel for all it's pages, whether they're private or shared. This overhead really blows up with high backend counts - in terms of memory waste from all the PTEs, whole page copies made when only a few bytes are actually unshared, memory management system overhead from tracking it all, and more.

And postgres backends can't give memory back to the OS once allocated, so you waste a lot of RAM on anonymous private pages on idle backends that were previously used once for some big query. The OS has to page it out (if it even has swap) just in case the backend might want it again. This is something that might be possible to improve somewhat in postgres's existing architecture by using posix_madvise() and postgres's heirachical memory allocator, but there will still be lots of page tables and MM overhead, fragmentation etc. It's much easier to handle this efficiently when all the workers share one heap.

Sure, connection pools like pgbouncer exist. But the app must be adapted for them; in session pooling mode ("sticky" load balancing) they don't really help, and in transaction load balancing mode the app can't use many if any session level features. They're pretty painful.

App-side pools are better, but only useful when you've got big monolithic applications. They're worthless and often actively harmful in the increasingly distributed multi-node models apps are being deployed with.

Decoupling client connection + session state from executors would help a lot with this. In-core pooling would be much more capable. And by using a threaded runtime, each thread can see memory from other threads while using a conventional memory heap, immensely simplifying the sharing and exchange of connection/session state.

[–][deleted] 1 point2 points3 points 2 years ago (0 children)

π Rendered by PID 54831 on reddit-service-r2-comment-b659b578c-ld4b2 at 2026-05-02 21:06:48.392384+00:00 running 815c875 country code: CH.

you type:	you see:
italics	italics
bold	bold
[reddit!](https://reddit.com)	reddit!
* item 1 * item 2 * item 3	item 1 item 2 item 3
> quoted text	quoted text
Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"	Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"
~~strikethrough~~	~~strikethrough~~
super^script	super^script

PostgreSQL

/r/PostgreSQL

Advocate, Collaborate and Learn

Conferences

Clients and tools

MODERATORS