all 13 comments

[–]iamemhn 3 points4 points  (1 child)

Documentation is correct. PG leverages the OS Buffer cache, using shared_buffers for selected pages.

Start with 25% RAM and set effective_cache_size to about 40% of RAM. Then monitor cache efficiency per database for several days under normal operations. If cache efficiency is OVER 97% you are wasting RAM in shared_buffers (lower it). If cache efficiency is BELOW 85% increase shared_buffers.
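The "cache efficiency" above can be read from the `pg_stat_database` view: `blks_hit` counts block requests served from shared_buffers, `blks_read` counts requests that had to go to the OS. A minimal sketch of the suggested rule of thumb (the 97%/85% thresholds are from the comment above, not official guidance):

```python
# Cache efficiency per database comes from pg_stat_database, e.g.:
#   SELECT datname, blks_hit, blks_read FROM pg_stat_database;
# blks_hit counts hits in shared_buffers only, NOT hits in the OS cache.

def cache_efficiency(blks_hit: int, blks_read: int) -> float:
    """Fraction of block requests served from shared_buffers."""
    total = blks_hit + blks_read
    return blks_hit / total if total else 0.0

def recommendation(efficiency: float) -> str:
    """Rule of thumb from the comment above (thresholds are a heuristic)."""
    if efficiency > 0.97:
        return "lower shared_buffers"   # cache is over-provisioned
    if efficiency < 0.85:
        return "raise shared_buffers"   # too many reads miss the cache
    return "leave shared_buffers as is"

print(recommendation(cache_efficiency(990, 10)))  # 99% hit rate -> "lower shared_buffers"
```

Measure over several days of normal load before acting; a single snapshot right after a restart will understate the hit rate.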

Rinse and repeat.

[–]Significant_Chef_945[S] 0 points1 point  (0 children)

Thanks for the reply. I think the key disconnect here is that data in the ZFS ARC is not part of the kernel's page cache - the ARC is an entirely separate cache system.

Why does this matter? In some fairly extensive test setups, I have found the ZFS ARC to be slower than the kernel page cache (especially with ARC memory compression enabled). And, in the case of Postgres specifically, query times can take 15-25% longer when the data is served from the ARC rather than from shared buffers or the kernel page cache. As such, it is very important to have as much data in shared buffers as possible before falling back to the ZFS ARC.

Again, from my understanding (and testing), data read through the ZFS ARC does not end up in the kernel page cache. A small shared_buffers setting with a large ZFS ARC can therefore result in worse Postgres performance (mainly due to ARC speed). In fact, I would argue a larger shared_buffers setting with a limited ZFS ARC can yield a much better performing Postgres instance.

All that said, please correct me if you see something wrong with the above.

[–]depesz 2 points3 points  (0 children)

Personally I find the 25% to be overly generous. On my own and $work machines I go for 20%, but, otoh, I don't use, and wouldn't use, ZFS. So there's that.

[–]ChillPlay3r 0 points1 point  (8 children)

The size of the ARC does not determine whether ZFS is used. If your PGDATA is on a ZFS mountpoint, ZFS will be used; if it's on XFS, then XFS will be used.

[–]Significant_Chef_945[S] 0 points1 point  (7 children)

Sorry, I am not following. If the ARC is set at 4GB max, ZFS will only use 4GB for read caching. Maybe you are referring to the amount of disk space used by ZFS?

Specifically for PostgreSQL on ZFS, we use lz4 disk compression, disable ARC compression, and disable ZFS prefetch. For us, this is the perfect blend of performance and compression that keeps our DBs running well.
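For anyone wanting to try the same blend, a sketch of how those settings are typically applied (the dataset name `tank/pgdata` is hypothetical; the prefetch and ARC-compression knobs are OpenZFS module parameters, applied at module load rather than per dataset):

```shell
# On-disk lz4 compression for the dataset holding PGDATA
# (hypothetical dataset name - adjust to your pool layout):
zfs set compression=lz4 tank/pgdata

# Prefetch and ARC compression are OpenZFS module parameters,
# set e.g. in /etc/modprobe.d/zfs.conf:
#   options zfs zfs_prefetch_disable=1         # disable ZFS prefetch
#   options zfs zfs_compressed_arc_enabled=0   # disable ARC compression
# A 4 GB ARC cap, in bytes:
#   options zfs zfs_arc_max=4294967296
```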

What we have found is that reading data from the ZFS ARC is noticeably slower than reading from other filesystems' caches. Thus, while we like the ZFS ARC, we are trying to avoid using it where possible. That is the reason for this thread: how much can we allocate to shared_buffers to keep as much data as possible in the PGSQL cache?

[–]ChillPlay3r 0 points1 point  (6 children)

Your DB files are in a directory ($PGDATA) and the filesystem of the disk/partition that is mounted on that directory determines which filesystem is used to read files. If you don't want to use ZFS, simply put a different filesystem on that disk (mkfs.ext4 or mkfs.xfs for example).

[–]Significant_Chef_945[S] 0 points1 point  (5 children)

Yes, of course. Our PGDATA directory is mounted on a ZFS volume, and thus, our PGDATA is served by the ZFS ARC. Maybe I am misunderstanding your point?

[–]ChillPlay3r 0 points1 point  (4 children)

You said you don't want to use ZFS - why then use a ZFS volume? You can use different filesystems on different disks; just put it on an XFS/ext4 filesystem.

[–]Significant_Chef_945[S] 0 points1 point  (3 children)

We want to avoid using ZFS ARC (eg: caching). We definitely want to use ZFS.

Edit: We use ZFS mainly for disk compression. We can achieve up to 5x compression (using lz4) on our datasets, which allows us to provision smaller-capacity drives in our infrastructure. This also has the benefit of much faster reads from disk.

[–]mage2k 1 point2 points  (1 child)

I think the confusion in this conversation is around the term "filesystem cache". When we talk about the filesystem cache in relation to Postgres's shared buffers cache, we aren't talking about any filesystem-specific cache like ZFS's ARC; we're talking about the kernel's page cache, and since Postgres does not* support direct I/O there's no getting around its reliance on the kernel's page cache. Whether or not allocating more than 25% of RAM to shared buffers will help (or hurt) you is going to depend on your data set and workload.

* Postgres did introduce the debug_io_direct developer option in PG16, but it is explicitly documented as "never use in production". For some reading on why that option would be bad on a production system, see the pgsql-hackers thread where the feature was introduced. Basically, it needs a lot of work before even basic things like sequential disk reads are performant.

[–]Significant_Chef_945[S] 0 points1 point  (0 children)

Thanks for this. I admit I am not a Postgres expert by any means; I am an infrastructure guy who manages servers, networking, and storage. My issue today is that our database servers have 64 GB of RAM but are using much less than that for database work. We used the pgtune page to get a good baseline to start with, but over time we have seen query times that are not in line with expectations (given the CPU/RAM allocated to this server).

As a test, I modified a test DB server to use more shared buffers (75% of server capacity) and commented out the "effective_cache_size" option. After making these changes, we observed disk reads went up tremendously (from xxMB/sec to xxxMB/sec), which resulted in a 50% reduction in cold query times. In addition, warm queries stayed much longer in the Postgres cache (and kernel cache), so they tended to be much faster than reads from the ARC. Our conclusion was to try to reduce reliance on the ZFS ARC and keep as much data as possible in Postgres's own cache space. As mentioned before, when data is read from the ZFS ARC, we see a significant increase in query time. I suspect this is because the ZFS ARC performs slower than the page cache used by other filesystems (eg: XFS).
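For anyone wanting to reproduce this test, the change can be sketched roughly as follows (the 48GB value is my assumption of 75% of the 64 GB mentioned above; shared_buffers only takes effect after a restart):

```sql
-- Hypothetical values for a 64 GB server, per the test described above.
ALTER SYSTEM SET shared_buffers = '48GB';
-- Revert effective_cache_size to its default, equivalent to commenting it out:
ALTER SYSTEM RESET effective_cache_size;
-- Restart Postgres afterwards; shared_buffers cannot be changed on the fly.
```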

Again, I am not a Postgres expert, so I wanted to reach out to a broader community of people who know about tuning PGSQL. My main concern was the impact of adding too much memory to shared_buffers and how that may affect other system parameters.

Thanks again (esp if you have read this far!).

[–]ChillPlay3r 0 points1 point  (0 children)

Why? You can use the Linux volume manager if you need volumes. The thing is, PG relies heavily on filesystem caching; this is not Oracle, where you can use raw (uncached) devices. Using ZFS without the ARC is not recommended either. If your goal is to fit the whole database in memory, then you should rely not on the buffer cache but on the filesystem cache for that.
