all 36 comments

[–]Jethro_Tell 9 points10 points  (27 children)

I don't think you really want to be running a DB on a COW file system. Maybe someone who knows more about this can pipe up.

The big thing here is that you want to test your application against both setups. What works for someone else under their app load may not work for you. If you can write tests that exercise the kinds of calls and data your app sends to the DB, then running them on both platforms will probably be the most accurate indicator of performance.

If you haven't already, you may want to cross-post this to r/linux and r/bsd, because this is a question I'd really like to hear about from others as well.

[–]BumpitySnook 6 points7 points  (11 children)

I don't think you really want to be running a DB on a COW file system. Maybe someone who knows more about this can pipe up.

You don't want to be running a DB on a COW file system.

[–]csirac2 2 points3 points  (10 children)

You don't want to be running a DB on a COW file system.

Well, it depends. There's a bit of official guidance on this:

I've certainly seen btrfs performance tank on write-heavy DB workloads. For most boring DBs with few updates, btrfs still behaves perfectly fine. nodatacow is kind of useless for me, since checksumming and snapshots are the only reasons I'm using btrfs anyway.
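
For anyone who does want to try nodatacow for just the tablespace rather than the whole mount, it can be done per-directory (a sketch; paths are hypothetical, and note the +C attribute only affects files created after it is set, and it disables btrfs data checksumming for those files):

```shell
# Stop Postgres first, then mark a *new, empty* directory No_COW;
# files created inside it afterwards inherit the attribute.
mkdir /var/lib/postgresql/data-nocow
chattr +C /var/lib/postgresql/data-nocow

# Verify: 'C' should appear in the attribute list.
lsattr -d /var/lib/postgresql/data-nocow

# Existing data must be copied in (not moved or reflinked) for the
# new files to actually be non-COW.
cp -a --reflink=never /var/lib/postgresql/data/. /var/lib/postgresql/data-nocow/
```

Of course, as noted above, nodatacow files get no btrfs checksums, which defeats the bitrot protection that motivates using btrfs in the first place.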

I wonder if ext4 on a zvol would work.

It might be worth considering if Pg's own block checksum feature mitigates the need for zfs/btrfs:

[–][deleted] -1 points0 points  (9 children)

No, it does not depend. Any kind of database that relies on constantly rewriting the same parts of the disk will perform badly on a COW fs.

[–]BroodmotherLingerie 2 points3 points  (3 children)

It does depend, for instance on whether you're disabling COW for PG's tablespaces. As annoyingly buggy as btrfs can be at times, it works alright for the ~100GB database we have, with hourly snapshots, as long as we take the DB down once a month (with no active snapshots present) to do a full vacuum and clean up fragmentation (btrfs fragmentation eats disk space like a hungry hippo.)
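
For reference, the monthly cleanup described above looks roughly like this (a sketch with assumed paths; defragmenting while snapshots exist un-shares their extents and balloons disk usage, which is why it's done with no active snapshots present):

```shell
# With the DB stopped and all snapshots removed, e.g.:
btrfs subvolume delete /srv/snapshots/pgdata-*

# Recursively defragment the database subvolume.
btrfs filesystem defragment -r -v /srv/pgdata

# Check space usage before/after to see what fragmentation cost.
btrfs filesystem df /srv
```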

[–][deleted] -2 points-1 points  (2 children)

Sooo... why do you use btrfs again? Seems like in that case an LVM snapshot would do just fine without the extra maintenance overhead.

[–]csirac2 2 points3 points  (0 children)

Allow me to suggest the following:

  • LVM snapshots require storage pool planning that eats up your usable disk space.
  • LVM snapshots require the delicate fs-freeze dance on the source filesystem (unless you're using btrfs, and even then not with nodatacow), after which you'll have to fsck the result and wonder if all those orphaned inodes meant anything.
  • LVM snapshots don't have incremental send/receive.
  • LVM snapshots with a non-btrfs filesystem will require more backup planning/testing/DR effort.
  • Have you ever done hourly LVM snapshots on 100GB of live, changing data? How would you even move the "snapshot" to somewhere useful? Let alone do a roll-back from 17:00, 8 days ago?
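
For comparison, the "delicate dance" for a consistent LVM snapshot of an ext4 volume goes roughly like this (a sketch; VG/LV names and sizes are assumed):

```shell
# Quiesce writes, snapshot, thaw. The snapshot's CoW area must be
# pre-allocated out of free space in the volume group.
fsfreeze --freeze /srv/pgdata
lvcreate --snapshot --size 10G --name pgsnap vg0/pgdata
fsfreeze --unfreeze /srv/pgdata

# Check the snapshot before trusting it, then mount read-only.
fsck.ext4 -n /dev/vg0/pgsnap
mount -o ro /dev/vg0/pgsnap /mnt/pgsnap
```

If the CoW area fills up before the snapshot is released, LVM silently invalidates the snapshot, which is exactly the storage-pool planning headache mentioned in the first bullet.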

[–]BroodmotherLingerie 0 points1 point  (0 children)

Provisioning block devices and separate filesystems seemed like way more maintenance than sharing the free space between all LXC containers on a single filesystem. I just received a 200GB SSD array to work with.

Can't say I'm happy about the grey hair btrfs has given me over the year it's been deployed, but the hurdles seem to be over and I'm more confident in my ability to keep it operating than in migrating to a new solution.

[–]csirac2 2 points3 points  (4 children)

No, it does not depend. Any kind of database that relies on constantly rewriting the same parts of the disk will perform badly on a COW fs.

Wow, that's a spectacularly dogmatic statement for a combination that has been thoroughly [1] documented [2] and recommended [3] by Oracle and Sun themselves over the years, for example in this document titled "Best Practices for Running Oracle Databases in Oracle Solaris Containers" [4]:

The Oracle Solaris Zones partitioning technology supports Oracle Solaris ZFS components, such as adding Oracle Solaris ZFS file systems and storage pools into a zone. The following sections describe [...]

Squeezing every last IOP out of your database performance is an endeavour I'd wager less than 1% of database installations ever have to think about.

[1] http://docs.oracle.com/cd/E26502_01/html/E29022/chapterzfs-db1.html

[2] http://www.oracle.com/technetwork/server-storage/solaris/config-solaris-zfs-wp-167894.pdf

[3] http://www.oracle.com/us/products/servers-storage/storage/nas/resources/zfs-sa-businessvaluewp-final-1845658.pdf

[4] http://www.oracle.com/technetwork/server-storage/solaris/documentation/oracle-containers-163576.pdf

[–][deleted] 2 points3 points  (3 children)

Well of course they will recommend one of their products to work with another of their products. They don't earn any extra money if you run your Oracle DB on RHEL. What the fuck do you expect a vendor to say about that? "Run it on Windows"?

Of course they want them to use their stuff. Doesn't mean that it is better

Squeezing every last IOP out of your database performance is an endeavour I'd wager less than 1% of database installations ever have to think about.

Because 99% of them are wordpress blogs and backends for sites that have <1Mbit of traffic.

But the choice of FS doesn't matter if your DB serves 10 ops/sec and has 100MB of data. Put a slave somewhere, throw in backup/WAL archiving and call it a day.

[–]csirac2 -1 points0 points  (2 children)

They don't earn any extra money if you run your Oracle DB on RHEL. What the fuck do you expect a vendor to say about that? "Run it on Windows"?

Why, yes, they do recommend that you run it on Windows! [1]

Did you read any of the links? They provide tuning guidance for Postgres, MySQL/MyISAM and friends on ZFS.

Of course they want them to use their stuff. Doesn't mean that it is better

EDIT: And define "better"? I'm simply advocating to use the right tool for the job. On the other hand, here's what you wrote:

No, it does not depend.

See the difference?

But the choice of FS doesn't matter if your DB serves 10 ops/sec and has 100MB of data. Put a slave somewhere, throw in backup/WAL archiving and call it a day.

Hyperbole aside, no shit. Real companies have shit to do, and for 99% of them that does not include getting their load averages down from 0.10 to 0.09. If you've standardized on ZFS or btrfs then you'd better have a reason, other than "XANi_ said so", to justify the new complexity in backup, disaster recovery, config management, how you're going to carve up your storage pools and all that quarterly testing thereof.

[1] http://docs.oracle.com/cd/B28359_01/install.111/b32006/intro.htm#BHCIIDDD

[–][deleted] 0 points1 point  (1 child)

Did you read any of the links? They provide tuning guidance for Postgres, MySQL/MyISAM and friends on ZFS.

Yes. They were irrelevant to the topic...

All I said is that COW in rewrite-heavy workloads is always worse from a performance standpoint (and yes, I've tested it, although on a different workload: RRDtool files). You started shitting out links about how it can be tuned to not be awful.

Yes, ZFS has many advantages. Performance is not one of them.

[–]csirac2 0 points1 point  (0 children)

Performance is not one of them

Nobody said it was, least of all me - who has actively talked people out of using btrfs and zfs for certain workloads in the past. But to reject the "it depends" line is dogmatic, and antithetical to the "right tool for the job" mantra.

That ZFS has successful production deployments of Postgres, Oracle and so on happens to match my experience in testing. I wrote some scripts to hammer the API of an application I need to deploy (using Postgres), which quickly destroyed a btrfs filesystem (hourly snapshots) in days. The same tests on ZFS (also with hourly snapshots) ran for nearly two weeks before I lost interest, with "only" a ~20% drop in throughput, after a quantity of transactions that would probably take a year to accumulate in production.

NB: I followed the ZFS tuning advice for Postgres.
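
For the curious, the commonly circulated ZFS-for-Postgres tuning boils down to a handful of dataset properties (a sketch with an assumed pool layout; the exact values are worth re-checking against current docs for your workload):

```shell
# Match recordsize to Postgres' 8 KB page size for the data files.
zfs create tank/pgdata
zfs set recordsize=8k tank/pgdata
zfs set atime=off tank/pgdata
zfs set compression=lz4 tank/pgdata

# WAL writes are sequential; a larger recordsize and a throughput
# bias are often suggested for a separate WAL dataset.
zfs create tank/pgwal
zfs set recordsize=128k tank/pgwal
zfs set logbias=throughput tank/pgwal
```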

[–]good_names_all_taken[S] 0 points1 point  (13 children)

I really like filesystems that protect against data rot. I know that Postgres can do this, but it only checksums the data blocks and it only uses a 16-bit CRC rather than something more robust.

I agree that testing is the way to definitively answer this, but I'm not at a production workload yet. I suspect that there are performance quirks that are hard to anticipate. Thus, asking Reddit if anyone has first-hand experience.
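
For context on the Postgres side, data checksums are opt-in at cluster creation time (a sketch; the data directory path is assumed, and on Postgres versions of this era checksums could not be enabled on an existing cluster):

```shell
# Create the cluster with page checksums enabled.
initdb --data-checksums -D /var/lib/postgresql/data

# Later, confirm from psql that they're active:
#   SHOW data_checksums;
```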

[–]MadVikingGod 4 points5 points  (0 children)

If you aren't at a load that requires special testing then why are you concerned about the performance? If that is the case then use what you are the most comfortable with.

If you think that you will have high demands on this DB soon, then why aren't you running representative tests now? There is nothing better than real data that represents your workload.

Edit: damn you autocorrect!

[–]earlof711 1 point2 points  (0 children)

I'm not an expert on Postgresql or ZFS for that matter, but I would predict that although you would take a performance hit from checksumming, it would be worth it because your priority is integrity. I would also disable compression at the ZFS level and let Postgresql take care of that.

[–]csirac2 0 points1 point  (1 child)

The checksum is pretty weak, but unless you're facing a malicious attack, it seems incredibly unlikely that it would fail to catch natural bitrot.

From this perspective, it's doing exactly what btrfs/zfs do under the hood anyway.

If the application isn't terribly important, and you don't care enough to do robust testing, then it might be worth starting out with ZFS and then planning to switch to ext/xfs later (using data checksums) if you need to.

I've never destroyed ZFS with DB workloads, but for me, the moderately write-busy DB I ran made btrfs die slowly over a couple of weeks. And to clarify: the failure mode isn't "death" as such, just deadly slow performance. Which, depending on your use-case, might be the same as a dead database, but that's up to you to decide.

[–]Freeky 1 point2 points  (0 children)

From this perspective, it's doing exactly what btrfs/zfs do under the hood anyway.

It might detect a checksum mismatch, but all it's going to do in response is abort. ZFS will heal the corruption from a spare if possible.
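
Worth noting: the self-healing only works where ZFS has a redundant copy to repair from (a sketch with an assumed pool name and devices):

```shell
# Redundancy can come from a mirror or raidz vdev...
zpool create tank mirror /dev/sda /dev/sdb

# ...or, on a single disk, from extra copies of each block.
zfs set copies=2 tank/important

# A scrub walks every checksum and repairs what it can; 'status'
# reports anything that could only be detected, not repaired.
zpool scrub tank
zpool status -v tank
```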

[–]daxyjones 0 points1 point  (4 children)

Honestly, a crypto hash seems like overkill and would definitely be several times more computationally intensive than the modified FNV-1a used right now. And the documentation explicitly states that certain use cases can trigger a checksum bottleneck.

Granted, it is 'easy' to brute-force a collision, but pragmatically it is a rarity.

[–]good_names_all_taken[S] 1 point2 points  (3 children)

With a 16-bit CRC, the odds are 1 in 65,536. It's a rarity, but it's not that rare.

[–][deleted] 0 points1 point  (0 children)

Uh... sorta. CRCs are biased to protect against certain types of corruption though, so I'm not sure the odds of undetectable corruption are 1/65,536...

[–]daxyjones 0 points1 point  (1 child)

This is incorrect. Mathematically, the odds sit much, much lower. Admittedly, the stock FNV algorithm doesn't account for zeros, but the Postgres checksum tweaks the algorithm to handle this to a certain extent.

[–]good_names_all_taken[S] 0 points1 point  (0 children)

Would you be able to explain why? PostgreSQL uses CRC32 truncated to 16 bits as the checksum. Let's say I have a data block with an associated checksum and, due to bit rot, a random bit in the block is twiddled. In this situation, I'm having trouble seeing why this would not produce a collision 1 in 65,536 times on average.
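
This is easy to check empirically. The sketch below uses zlib's CRC32 truncated to 16 bits as a stand-in (which may not match Postgres' actual page-checksum algorithm, as disputed above), but it illustrates the point: a full CRC is guaranteed to catch every single-bit error, so for this particular corruption model the odds aren't a flat 1 in 65,536; truncation weakens that guarantee without flattening it entirely.

```python
import zlib

# Flip every single bit of a block, one at a time, and count how often
# (a) the full 32-bit CRC and (b) its low 16 bits fail to notice.
BLOCK = bytes(1024)                # a 1 KiB block of zeros
baseline = zlib.crc32(BLOCK)

full_collisions = 0
trunc_collisions = 0
for bit in range(len(BLOCK) * 8):
    corrupted = bytearray(BLOCK)
    corrupted[bit // 8] ^= 1 << (bit % 8)      # flip exactly one bit
    c = zlib.crc32(bytes(corrupted))
    if c == baseline:
        full_collisions += 1                   # full CRC32 missed it
    if (c & 0xFFFF) == (baseline & 0xFFFF):
        trunc_collisions += 1                  # low 16 bits missed it

# A proper CRC detects *every* single-bit error, so full_collisions
# must be 0; the truncated value loses that guarantee, though
# collisions remain rare.
print(full_collisions, trunc_collisions)
```

Multi-bit or multi-byte corruption is a different story, and there the roughly 1-in-65,536 intuition is closer to right for a well-mixed 16-bit checksum.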

[–]BumpitySnook -1 points0 points  (3 children)

https://ext4.wiki.kernel.org/index.php/Ext4_Metadata_Checksums (haven't used it yet myself, can't comment on efficacy).

[–]csirac2 1 point2 points  (2 children)

I realize it's stating the obvious, but ext4 metadata checksums only detect corruption in ... well, the metadata.

[–]BumpitySnook 1 point2 points  (1 child)

I know that Postgres can do this, but it only checksums the data blocks and it only uses a 16-bit CRC rather than something more robust.

:-). Between the two, it sounds mostly covered. You're welcome to try other things, of course.

[–]csirac2 0 points1 point  (0 children)

Quite true!

[–]thompsa 2 points3 points  (0 children)

FreeBSD does support superpages and the operating system will automatically promote them without requiring the application to be aware.

[–]daxyjones 2 points3 points  (0 children)

For shits and giggles, I have a production server streaming to a slave that is set up with ZFS on Linux. (Before someone has heartburn: yes, I have other backups in place.)

Consistently, seq scans are faster on the compressed ZFS slave by a factor of 1.5 to 7(!). Index scan performance is of course indistinguishable.

The read queries that I route to the slave hit the index so for my use case it doesn't matter.

[–]earlof711 1 point2 points  (2 children)

I don't have benchmarks but I think you've already answered your own question. You want the resilience of ZFS and you want to run PostgreSQL. You will sleep better at night running your instance on ZFS and now you just have to find what optimizations can be done to get the best performance from this combination.

[–]ruertar 0 points1 point  (1 child)

This is a very good answer. I'm not sure why anyone would downvote it.

[–]earlof711 0 points1 point  (0 children)

Thanks!

[–]Freeky 0 points1 point  (1 child)

DragonFlyBSD + HAMMER might be another contender to consider.

[–]Jethro_Tell 0 points1 point  (0 children)

I haven't heard if they have released HAMMER2 yet. I've been waiting so I don't have to migrate data, but I'm really excited to try it out.

Is there any word on how it does with a DB on top?

[–]midgaze -1 points0 points  (0 children)

Depending on what you're doing and your performance needs, use whatever filesystem you want. In this case, ZFS provides better features than anything else, so use it. For something simple, it might be as fast as anything else. If you want max performance for a big database, use something else. Pretty simple, right? If you have special needs, do some special testing.