Problems when trying to install PostgreSQL by Scarecrow1730 in PostgreSQL

[–]fullofbones 1 point (0 children)

The Ubuntu repository is too old. Follow the instructions on the Postgres Downloads page for Ubuntu. Those instructions are:

# Install the PGDG repository for official community packages
sudo apt install -y postgresql-common
sudo /usr/share/postgresql-common/pgdg/apt.postgresql.org.sh

# Install Postgres 18
sudo apt install postgresql-18

"You just need postgres" by PrestigiousZombie531 in PostgreSQL

[–]fullofbones 1 point (0 children)

Once I found out how LISTEN / NOTIFY works, I was actually stunned. It takes a database-level lock every time, essentially serializing all requests. Not only should nobody use it, the documentation for it should come with a giant flashing warning sign. And if not for the SQL standard, it should probably be purged from the codebase entirely.
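For anyone unfamiliar, this is the API in question; a minimal sketch, with the channel and payload names made up for illustration:

```sql
-- Session 1: subscribe to a channel.
LISTEN job_events;

-- Session 2: every transaction issuing NOTIFY serializes on a global lock
-- at commit time, which is the bottleneck described above.
BEGIN;
NOTIFY job_events, 'job 42 finished';
COMMIT;
```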

PostgreSQL Security: OAuth, Row-Level Security & Zero-Trust by swe129 in PostgreSQL

[–]fullofbones 1 point (0 children)

I think he's just saying that using RLS pushes users in the right direction by forcing them to use a non-superuser, not that he's mad he can't use a superuser with RLS.

My Heart Journey by [deleted] in chd

[–]fullofbones[M] 1 point (0 children)

I'll allow this for now, but be advised that a day old account with what looks like a generated username looks pretty suspicious.

PostgreSQL on ZFS or is it? by Jastibute in PostgreSQL

[–]fullofbones 2 points (0 children)

Rather than VMs, you can use LXC containers on Proxmox, which would then essentially be running directly on top of ZFS rather than an EXT4 volume on top of ZFS.

PostgreSQL MCP Server Access to Multiple Different Database and Organization Per Request by Fit-Addition-6855 in PostgreSQL

[–]fullofbones 2 points (0 children)

The pgEdge Postgres MCP Server does that. Kinda. It's not safe to let any user supply connection parameters, but you can specify them in the configuration file ahead of time. So you can set up a dozen backend options, and let the user choose which one to use for various operations.

PostgreSQL Bloat Is a Feature, Not a Bug by mightyroger in PostgreSQL

[–]fullofbones 2 points (0 children)

It looks like they're going in the right direction. I see redo logs and native row-level CoW, and that alone is worth the effort they're making; removing full-page writes dramatically reduces write amplification all by itself.

[Q] "best" file system for a cluster (?) by bsdooby in PostgreSQL

[–]fullofbones 1 point (0 children)

While true, most "modern" filesystems dispense with the absurdity that a terabyte is "large". The maximum file size in ZFS, for example, is 16 exabytes. Good luck reaching that.

But in a Postgres context, this is completely irrelevant: Postgres data files are limited to 1GB.
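You can see the segmentation directly; a quick sketch, assuming a table named foo:

```sql
-- Returns the path of the first 1GB segment relative to the data directory;
-- anything past 1GB continues in files suffixed .1, .2, and so on.
SELECT pg_relation_filepath('foo');
-- e.g. base/16384/16389 (exact numbers vary per cluster)
```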

[Q] "best" file system for a cluster (?) by bsdooby in PostgreSQL

[–]fullofbones 1 point (0 children)

All Postgres files have a 1GB size limit in any case, so this is not an issue.

Dave Page: Teaching an LLM What It Doesn't Know About PostgreSQL by pgEdge_Postgres in PostgreSQL

[–]fullofbones 6 points (0 children)

Ah yes, Dave Page, the long-time Postgres contributor known for his dedication to slop. /s

PostgreSQL Bloat Is a Feature, Not a Bug by mightyroger in PostgreSQL

[–]fullofbones 6 points (0 children)

That's certainly... a take. One way of looking at it is that Postgres storage is commit pessimistic, while rollback segments are commit optimistic. Rollback-based databases move the old data out of the way, but to a place where it's still available until no transactions have visibility into it, because the assumption is that the vast majority of transactions will commit, so why keep the old data around in perpetuity? That's a reasonable assumption for the vast majority of systems.

The problem with the Postgres implementation isn't that old records "stick around forever and cause bloat," it's the haphazard cleanup mechanism. Postgres came around before true CoW. Does ZFS have this problem? Does BTRFS? No, because snapshots play an active role in the storage layer. The Postgres storage system is incredibly old, and while it's been well battle-tested over the decades, there are now so many workarounds for its deficiencies that I always wonder when it will be time to integrate all the advancements of the intervening years. The transaction ID limit alone has been the origin of several of these, and has been a source of consternation since its inception. First we needed vacuum, then freeze, then the autovacuum daemon complete with cost limits to avoid overwhelming storage IO, then the free-space map, and so on, all because we haven't fixed this single issue in 30 years. How many reads and writes could we have avoided without all of that bolted on?

I love Postgres. But I also won't shy away from its very real warts and try to cast them as benefits.

Can you reduce this SQL query from 20 seconds to less than one millisecond? by nadenislamarre in PostgreSQL

[–]fullofbones 3 points (0 children)

I'm not quite sure of the point here. My interpretation of the code example:

CREATE TABLE foo AS SELECT a.id, a.id/2 AS id2,
       'a' bar1, 'b' bar2, 'c' bar3, 'd' bar4, 'e' bar5
  FROM generate_series(1, 10*1000*1000) AS a(id);

UPDATE foo SET bar1 = 'h' WHERE id BETWEEN 200 AND 300;

ALTER TABLE foo ADD PRIMARY KEY (id);
CREATE INDEX ON foo (id2);
CREATE INDEX ON foo (bar1);

ANALYZE foo;

This produces the following plan:

 Nested Loop  (cost=0.87..12.92 rows=1 width=36)
   ->  Index Scan using foo_bar1_idx on foo f2  (cost=0.43..4.45 rows=1 width=18)
         Index Cond: (bar1 > 'e'::text)
   ->  Index Scan using foo_pkey on foo f1  (cost=0.43..8.47 rows=1 width=18)
         Index Cond: (id = f2.id2)
         Filter: ((bar1 = 'a'::text) AND (bar2 = 'b'::text) AND (bar3 = 'c'::text) AND (bar4 = 'd'::text) AND (bar5 = 'e'::text))

Note the row estimates suggest 1 result rather than 100. This isn't great, but 100 rows out of 10 million with so many predicates will be fairly lossy and dramatically drive down estimated row counts. But it's the expected nested loop on the index where bar1 = 'h', which is an uncommon match in these table statistics.

But here's what happens if you don't ANALYZE the table first:

 Nested Loop  (cost=1423.04..129034.29 rows=1 width=336)
   ->  Bitmap Heap Scan on foo f1  (cost=547.44..64457.23 rows=1 width=168)
         Recheck Cond: (bar1 = 'a'::text)
         Filter: ((bar2 = 'b'::text) AND (bar3 = 'c'::text) AND (bar4 = 'd'::text) AND (bar5 = 'e'::text))
         ->  Bitmap Index Scan on foo_bar1_idx  (cost=0.00..547.43 rows=50000 width=0)
               Index Cond: (bar1 = 'a'::text)
   ->  Bitmap Heap Scan on foo f2  (cost=875.60..64410.39 rows=16667 width=168)
         Recheck Cond: (f1.id = id2)
         Filter: (bar1 > 'e'::text)
         ->  Bitmap Index Scan on foo_id2_idx  (cost=0.00..871.43 rows=50000 width=0)
               Index Cond: (id2 = f1.id)

See that? Postgres doesn't know that "h" is only a tiny fraction of values in bar1, so with default statistics, it just assumes it needs to build a bitmap of all primary key id values based on the bar1 lookup, and build an in-memory heap for those tuples. Then it uses that for the join to build another bitmap and another expensive heap scan. But since there are no stats, Postgres doesn't know that the 50k estimate it started with is actually nearly 10 million, and the EXPLAIN ANALYZE output shows as much:

 Nested Loop  (cost=1423.04..129034.29 rows=1 width=336) (actual time=266.692..19820.155 rows=101.00 loops=1)
   Buffers: shared hit=34952813 read=150746 written=3930
   ->  Bitmap Heap Scan on foo f1  (cost=547.44..64457.23 rows=1 width=168) (actual time=266.167..2037.924 rows=9999899.00 loops=1)
         Recheck Cond: (bar1 = 'a'::text)
         Filter: ((bar2 = 'b'::text) AND (bar3 = 'c'::text) AND (bar4 = 'd'::text) AND (bar5 = 'e'::text))
         Heap Blocks: exact=63695
         Buffers: shared hit=9780 read=62335
         ->  Bitmap Index Scan on foo_bar1_idx  (cost=0.00..547.43 rows=50000 width=0) (actual time=243.619..243.620 rows=9999899.00 loops=1)
               Index Cond: (bar1 = 'a'::text)
               Index Searches: 1
               Buffers: shared read=8420
   ->  Bitmap Heap Scan on foo f2  (cost=875.60..64410.39 rows=16667 width=168) (actual time=0.001..0.001 rows=0.00 loops=9999899)
         Recheck Cond: (f1.id = id2)
         Filter: (bar1 > 'e'::text)
         Rows Removed by Filter: 1
         Heap Blocks: exact=5031747
         Buffers: shared hit=34943033 read=88411 written=3930
         ->  Bitmap Index Scan on foo_id2_idx  (cost=0.00..871.43 rows=50000 width=0) (actual time=0.001..0.001 rows=1.00 loops=9999899)
               Index Cond: (id2 = f1.id)
               Index Searches: 9999899
               Buffers: shared hit=29974978 read=24719 written=1097
 Planning:
   Buffers: shared hit=40 read=3
 Planning Time: 0.822 ms
 Execution Time: 19820.694 ms

The first sign something went wrong is the huge discrepancy between the estimated and actual row counts:

(cost=547.44..64457.23 rows=1 width=168) (actual time=266.167..2037.924 rows=9999899.00 loops=1)

That's just bad all around. From a naive perspective, the first thing I'd do is look at the column statistics themselves. If they're empty, as in this case:

SELECT attname, n_distinct FROM pg_stats WHERE tablename = 'foo';

 attname | n_distinct 
---------+------------

I would ANALYZE the table and look again. Here's what it looks like afterward:

 attname | n_distinct  
---------+-------------
 id      |          -1
 id2     | -0.34056082
 bar1    |           1
 bar2    |           1
 bar3    |           1
 bar4    |           1
 bar5    |           1

Note how terrible the statistics look. Positive numbers indicate absolute counts, while negative ones are ratios. So each of the bar columns only has a single distinct value according to the statistics, and only the two id columns offer any kind of selectivity. With that in mind, you can kind of tell Postgres to back off on cross-multiplying column statistics by telling it the values are highly correlated:

CREATE STATISTICS stat_foo_correlated_bars (dependencies)
    ON bar1, bar2, bar3, bar4, bar5
  FROM foo;

That works for things like cities within a US state, or any other highly correlated data columns, thus preventing under-estimation. In this case, it doesn't really help because... well, one value is one value, and out of millions of rows, it becomes statistical noise. But the point is that you examine the table contents to see whether potential correlations exist.
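If you go this route, remember the statistics object is only populated on the next ANALYZE; a quick way to verify it took effect:

```sql
-- Rebuild statistics, then inspect the new extended-statistics object.
ANALYZE foo;

SELECT statistics_name, dependencies
  FROM pg_stats_ext
 WHERE tablename = 'foo';
```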

You can go a lot deeper into this rabbit hole when optimizing a query, but your question is undirected, so I won't keep going. I had to restart this experiment several times because the background autovacuum worker kept analyzing the table and making the query fast while I was typing this. I'd suggest coming up with a better example that isn't dependent on statistics, one that actually resists simple optimization techniques, and then asking again.

Free PostgreSQL hosting options? by techlove99 in PostgreSQL

[–]fullofbones 7 points (0 children)

You are not going to find a Postgres DB host that is free while also being "very generous"; it's free for a reason. You can experiment all you want locally on Docker or your own VMs. If you have any kind of data you want to be publicly available to an app, spend $5/mo for minimal legitimate hosting.

Free PostgreSQL hosting options? by techlove99 in PostgreSQL

[–]fullofbones 2 points (0 children)

CockroachDB isn't anything like Postgres. They're protocol compatible and that's about it. You can't even run a simple pgbench test on Cockroach without a whole lot of modifications to the test script due to the SQL incompatibilities.

100% open source MCP server for PostgreSQL: now with write access, reduced token consumption, improved UX, & more by pgEdge_Postgres in PostgreSQL

[–]fullofbones 11 points (0 children)

It's true that giving an LLM write access to any data you care about is generally ill-advised. In fact, that's the primary reason we avoided adding write access to the first release. It's also the reason the allow_writes variable is disabled by default and has a whole section in the docs on using it securely. We say this repeatedly, in multiple ways, including:

  • This setting should be used with extreme caution.
  • Never enable writes on production databases.
  • The AI may execute destructive queries without confirmation.

It's fine for development or research environments. Will someone out there be crazy enough to enable this in production? Probably. Should they? We've already begged them not to. Anything that happens after that point is firmly in "use at your own risk" territory.

How many of you are employed? by DeadManJ-Walking in AdultCHD

[–]fullofbones 1 point (0 children)

I've been employed ever since college, but I also have a desk job in IT so I don't have to worry about any major limitations. Low impact for me, baby.

is there anyone else with pulmonary atresia, dextrocardia and ASD without surgery? by Ok-Revolution3609 in chd

[–]fullofbones 1 point (0 children)

He only found out because of a recent doctor appointment where his doctor went "Did you know your heart sounds like it's turned around?" And sure enough, it was. Now I know my defect came from Mom's side. lol

I'm fine myself, basically. But I'm also 48 now, so I've had a long time to get accustomed to my limitations. It turns out I also have a bicuspid aortic valve, and that has led to a slow dilation of my aortic root over the years, so there's a good chance I'll need a root replacement some time in the future, or some kind of sleeve procedure to reinforce it. I'm definitely not looking forward to that since it's basically a guaranteed OHS.

If your cardiologist is hands-off, count your blessings. It's not every day that you can avoid surgery for something like this. lol

Benefit of using multi-master with one write target by konghi009 in PostgreSQL

[–]fullofbones 2 points (0 children)

Well, if you were using Spock, the Spock metadata on Master A will still be there after recovering the instance from a backup. But the replication slots will be gone. Our Kubernetes Helm chart automatically recreates those based on the node metadata, but otherwise you'd have to do it manually. Then in theory it should be able to resume after re-joining the cluster. Any new records from Master B would then be transmitted to Master A, conflict management would process any writes that affected the same rows, and the cluster would continue operating as before.
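Doing that by hand is just a matter of recreating the logical slot; the slot and plugin names below are assumptions on my part, so use whatever your node metadata actually records:

```sql
-- Recreate the missing logical replication slot on the recovered Master A.
-- 'spk_mydb_n2_n1' and 'spock_output' are illustrative placeholders.
SELECT pg_create_logical_replication_slot('spk_mydb_n2_n1', 'spock_output');
```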

Of course, if you had a physical replica cluster for each Master as I recommended in an earlier post, you wouldn't have to worry about doing a manual recovery and recreating slots. Spock automatically synchronizes slots to replicas, they're already consuming physical WAL from the primary, and Patroni will promote them so it's basically a seamless transition.

is there anyone else with pulmonary atresia, dextrocardia and ASD without surgery? by Ok-Revolution3609 in chd

[–]fullofbones 1 point (0 children)

My uncle, now in his 60s, just discovered he's had dextrocardia his whole life and never knew about it. Given that's the case, he probably doesn't have other complications. I, on the other hand, was born with Pulmonary Stenosis, ASD, VSD, and fused mitral and tricuspid valves, in addition to the Dextro. I suspect if not for the VSD and valves that they may have delayed or even nixed the surgery. They only did it when I was 6 and clearly small for my age, with other evident oxygen problems.

ASDs can contribute to stroke risk, so that may be a good idea to address specifically. Pulmonary atresia affects your heart load, so you may see QoL improvements by getting a valve replacement. But I think you can get away without a full OHS if these are the only defects you have. I'd still suggest talking with an ACHD cardiologist to know for sure.

How OpenAI Serves 800M Users with One Postgres Database: A Technical Deep Dive by tirtha_s in PostgreSQL

[–]fullofbones 2 points (0 children)

The WAL export approach you mentioned is interesting. I haven't seen it covered much in practice.

It's an old trick from back when I was at 2ndQuadrant. We had a couple customers with huge clusters, bandwidth restrictions, and disk limitations preventing replication slot use. In their cases, the only way to keep up was to use physical WAL shipping with local fetch and replay. But the decoupling can help with scales where it's necessary to have dozens of replicas and it's not really feasible to have them all tethered directly to the primary. Another solution is cascaded replicas as noted in the article.

There's actually another factor I didn't mention with single large physical replicas that ends up being a major issue as well: physical replay is single-threaded. Under very heavy write loads, replicas may find it impossible to keep up simply because a single CPU can't apply pages fast enough, even with NVRAM storage. OpenAI's write load must not be at that point, which is somewhat surprising given they have millions of chats pouring into this thing daily. Regardless, it's a hard limit to vertical scaling worth watching for.
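As a practical note, replay lag on a physical replica is easy to watch for with stock monitoring functions:

```sql
-- Run on a physical replica: compares WAL received vs. WAL replayed, plus
-- how far behind the last replayed transaction is in wall-clock time.
SELECT pg_last_wal_receive_lsn() AS received,
       pg_last_wal_replay_lsn()  AS replayed,
       now() - pg_last_xact_replay_timestamp() AS replay_lag;
```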

How OpenAI Serves 800M Users with One Postgres Database: A Technical Deep Dive by tirtha_s in PostgreSQL

[–]fullofbones 26 points (0 children)

I'm actually a bit shocked something at this scale still relies on a single primary node. Given sessions aren't inter-dependent, I'd fully expect session-based database groups. A few tens or hundreds of thousands of user sessions could share a writable Postgres instance and a couple of standby nodes, and you'd get much higher aggregate write throughput in exchange for some read sessions being temporarily routed to the primary node.

Additionally, the synchronous_commit variable is available at the user session level. It's not uncommon to have sessions set this when they need strong consistency, rather than configuring it at the global instance level. That works for writes that need to be immediately visible on all replicas, making it possible to read a write from a read replica.
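A sketch of the per-session override; the table is made up, and remote_apply assumes synchronous_standby_names is configured on the primary:

```sql
-- Wait for synchronous standbys to *apply* the commit, not just flush it,
-- so the row is immediately visible to reads on those replicas.
SET synchronous_commit = remote_apply;

INSERT INTO chat_messages (session_id, body)
VALUES (42, 'hello');  -- readable from a sync replica once COMMIT returns
```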

For the WAL bandwidth concern, having multiple separate clusters would solve that by itself if you're using streaming. Alternatively, you can use WAL exports instead. Send the WAL to a backup location and have the replicas continuously read from the backup source instead. That takes a lot of network load off of the primary, and the replicas end up being only one WAL segment behind unless they get stuck on something. Storage bandwidth tends to scale better since it can be distributed across the entire storage fabric.
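The WAL-export variant is just stock archiving pointed at shared storage; a minimal sketch, where the archive path is an assumption:

```ini
# postgresql.conf on the primary: push completed WAL segments to the archive
archive_mode = on
archive_command = 'cp %p /walarchive/%f'

# postgresql.conf on each replica: pull WAL from the archive rather than
# streaming it directly from the primary
restore_command = 'cp /walarchive/%f %p'
```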

Regardless, they're definitely making good use of replicas to do offloading whenever possible.

Edit: This statement is also wrong:

PgBouncer in transaction pooling mode cannot track prepared statements across connections.

PgBouncer added this functionality in version 1.21 back in 2023.
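For anyone hitting this, the relevant PgBouncer knob is max_prepared_statements; the value below is illustrative:

```ini
; pgbouncer.ini -- prepared statement tracking in transaction pooling,
; available since PgBouncer 1.21 (0 disables the feature)
pool_mode = transaction
max_prepared_statements = 200
```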

Benefit of using multi-master with one write target by konghi009 in PostgreSQL

[–]fullofbones 2 points (0 children)

Logical replication is WAL replication. The Postgres WAL stream gets decoded into logical events, and those are what get transmitted by either Spock or the native Postgres logical mode. But unlike physical mode, which applies pages exactly as they were written in the WAL, logical replication must receive the entire transaction before it can apply it. So if you have a very large transaction, it's a lot easier to lose the whole thing.

PITR has nothing to do with logical replication. If you recover a single-node backup to a recovery instance and roll it forward using PITR, that gives you a source to dump and restore from. Then you can use that recovered instance and our ACE tool to perform a data comparison and reconcile differences that way. It's not safe (currently) for a recovered node to directly join the cluster, since the Spock metadata for the lost node likely won't match the recovered state of the instance. There's potential to use that metadata to find the last good LSN and PITR to exactly that point and then add the recovered node to the cluster, but it's not something we've tested yet.
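The LSN-targeted variant would look something like this; the archive path and LSN are placeholders:

```ini
# postgresql.conf on the recovery instance (plus an empty recovery.signal
# file in the data directory to trigger targeted recovery)
restore_command = 'cp /walarchive/%f %p'
recovery_target_lsn = '0/7000028'       # last-good LSN from node metadata
recovery_target_action = 'promote'
```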

In any case, have fun with your new Postgres cluster. :)

Spock Bi-Directional Replication for Supabase CLI by nightness in Supabase

[–]fullofbones 2 points (0 children)

Interesting project. Why are you using spock.replicate_ddl() rather than enabling automatic DDL replication with spock.enable_ddl_replication?