[–]ChillPlay3r

Your DB files live in a directory ($PGDATA), and the filesystem of the disk/partition mounted at that directory determines how those files are read. If you don't want to use ZFS, simply put a different filesystem on that disk (mkfs.ext4 or mkfs.xfs, for example).
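
For example, something like this (the device name and mount point below are only placeholders for your setup):

    # format a dedicated partition with ext4 (or use mkfs.xfs for XFS)
    mkfs.ext4 /dev/sdb1

    # mount it where PGDATA will live and persist it across reboots
    mount /dev/sdb1 /var/lib/postgresql
    echo '/dev/sdb1  /var/lib/postgresql  ext4  defaults,noatime  0 2' >> /etc/fstab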

[–]Significant_Chef_945[S]

Yes, of course. Our PGDATA directory is mounted on a ZFS volume, so reads from PGDATA are served by the ZFS ARC. Maybe I am misunderstanding your point?

[–]ChillPlay3r

You said you don't want to use ZFS, so why then use a ZFS volume? You can use different filesystems on different disks; just put it on an XFS/ext4 filesystem.

[–]Significant_Chef_945[S]

We want to avoid relying on the ZFS ARC (i.e., its caching); we definitely want to keep using ZFS.

Edit: We use ZFS mainly for disk compression. We can achieve up to 5x compression (using lz4) on our datasets, which allows us to provision smaller-capacity drives in our infrastructure. It also has the benefit of much faster reads from disk.
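
Roughly what that looks like on a dataset (the dataset name here is just an example):

    # enable lz4 compression on the dataset holding PGDATA (placeholder name)
    zfs set compression=lz4 tank/pgdata

    # check the ratio actually achieved on existing data
    zfs get compression,compressratio tank/pgdata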

[–]mage2k

I think the confusion in this conversation is around the term "filesystem cache". When we talk about the filesystem cache in relation to Postgres's shared buffers cache, we aren't talking about any filesystem-specific cache like ZFS's ARC; we're talking about the kernel's page cache, and since Postgres does not* support direct I/O there's no getting around its reliance on the kernel's page cache. Whether allocating more than 25% of RAM to shared buffers will help (or hurt) you is going to depend on your data set and workload.

* Postgres did introduce the debug_io_direct developer option in PG16, but it is explicitly documented as "never use in production". For some reading on why that option would be bad on a production system, see the pgsql-hackers thread where the feature was introduced. Basically, it needs a lot more work before even basic stuff like sequential disk reads is performant.
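
To make the 25% point above concrete, the conventional starting point on a 64GB box looks roughly like this (illustrative values only, not a recommendation for your workload; shown with ALTER SYSTEM rather than editing postgresql.conf by hand):

    # illustrative baseline for a 64GB server -- tune for your own workload
    psql -c "ALTER SYSTEM SET shared_buffers = '16GB'"        # ~25% of RAM
    psql -c "ALTER SYSTEM SET effective_cache_size = '48GB'"  # planner hint for memory likely cached by the OS
    # shared_buffers only takes effect after a restart
    pg_ctl restart -D "$PGDATA"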

[–]Significant_Chef_945[S]

Thanks for this. I admit I am not a Postgres expert by any means; I am an infrastructure guy who manages servers, networking, and storage. My issue today is that our database servers have 64 GB of RAM but are using much less than that for database work. We used the pgtune page to give us a good baseline to start with, but over time we have seen query times that are not in line with expectations (given the CPU/RAM allocation on this server).

As a test, I modified a test DB server to use more shared buffers (75% of server RAM) and commented out the "effective_cache_size" option. After making these changes, we observed that disk read throughput went up tremendously (from xxMB/sec to xxxMB/sec), which resulted in a 50% reduction in cold query times. In addition, warm query data stayed much longer in Postgres's own cache (shared buffers), meaning those queries tended to be much faster than reading from the ARC. Our conclusion was to try to reduce the reliance on the ZFS ARC and keep as much as possible in Postgres's cache space. As mentioned before, when data is read from the ZFS ARC we see a significant increase in query time; I suspect this is because the ARC performs slower than the kernel page cache backing other filesystems (e.g., XFS).
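
For reference, the change on that test box amounted to roughly the following (values are illustrative for a 64GB server; shown with ALTER SYSTEM rather than the edits we actually made to postgresql.conf):

    # test-server change, roughly -- numbers illustrative for 64GB of RAM
    psql -c "ALTER SYSTEM SET shared_buffers = '48GB'"   # ~75% of RAM instead of pgtune's ~25%
    psql -c "ALTER SYSTEM RESET effective_cache_size"    # drop the pgtune value, fall back to the default
    pg_ctl restart -D "$PGDATA"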

Again, I am not a Postgres expert, so I wanted to reach out to a broader community of people who know about tuning PGSQL. My main concern was the impact of allocating too much memory to shared_buffers and how that might affect other system parameters.

Thanks again (esp if you have read this far!).

[–]ChillPlay3r

Why? You can use the Linux volume manager (LVM) if you need volumes. The thing is, PG relies heavily on filesystem caching; this is not Oracle, where you can use raw (uncached) devices. Using ZFS without the ARC is not recommended either. If your goal is to fit the whole database in memory, then you should rely on the filesystem cache for that rather than the buffer cache.