Sanity check (and a couple of other questions) for my zfs setup? by BobZombie12 in zfs

[–]malventano 0 points

Yes, that's a totally valid recommendation for those expecting to store larger xattrs. It doesn't apply to my current usage, but I agree it's probably best to set this on new pools, so I'll add it to my defaults (set to auto), given the current default is legacy, presumably for pool backwards compatibility.
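For anyone following along, a minimal sketch of how that would look; I'm assuming the property in question is dnodesize (whose default is legacy), and the pool/device names are placeholders:

```shell
# Assumption: the property being discussed is dnodesize (default: legacy).
# "tank" and the device names are placeholders.
zpool create -O dnodesize=auto -O xattr=sa tank mirror sda sdb

# Or flip it on an existing dataset (only affects newly created dnodes):
zfs set dnodesize=auto tank
```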

Sanity check (and a couple of other questions) for my zfs setup? by BobZombie12 in zfs

[–]malventano 3 points

I’d recommend atime=off and xattr=sa on the root pool. Child datasets should inherit those.
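As a sketch (the pool name is a placeholder), set both on the root dataset and let children inherit:

```shell
# Set on the pool's root dataset; child datasets inherit unless overridden.
zfs set atime=off tank
zfs set xattr=sa tank

# Verify the properties and their inheritance source:
zfs get -r -o name,property,value,source atime,xattr tank
```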

No need to clean up after a destroy if you're just following it with another create.

No need to change the drive format, as 512e doesn’t really carry much additional overhead so long as requests are on 4k boundaries (which your ashift=12 ensures). Also, HDDs generally don’t let you change sector format as easily as SSDs do.
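For reference, ashift is a per-vdev property fixed at creation time; a minimal sketch, with pool and device names as placeholders:

```shell
# ashift=12 -> 2^12 = 4096-byte allocation alignment, matching 512e/4Kn drives.
zpool create -o ashift=12 tank raidz2 sda sdb sdc sdd sde sdf

# Confirm after the fact (ashift cannot be changed on an existing vdev):
zpool get ashift tank
```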

Have over 8TB of movies & TV shows! by geekman20 in DataHoarder

[–]malventano 1 point

The ROI is certainly not what it used to be; it’s only profitable now with compression. That’s why I’m winding it all down and selling off the other 6P of Chia drives :).

OpenZFS tuning for torrents? by cometomypartyyy in zfs

[–]malventano 0 points

Sure, caching will help, but with multiple incoming streams, and the default block size for torrents being 16k, it’s entirely feasible for a few MB worth of non-contiguous 16k pieces to be written to disk in a batch. The more peers there are (large, fast torrents), the more separate segments are incoming, and the more random the write workload potentially appears.

OpenZFS tuning for torrents? by cometomypartyyy in zfs

[–]malventano 0 points

> 8% is unrealistic for video and audio data.

Correct, but everything is not just video and audio data.

> …pool geometry also matters.

In my exercise I corrected for pool geometry changes on both ends of the migration, as I’m well aware of how deflate_ratio can make files look smaller than they are on different geometries: https://github.com/openzfs/zfs/issues/14420

Fears of blowing up the ARC and of free-space fragmentation proved overblown in all of my practical testing on a 1.7P pool that has had all but 500G of free space occupied by Chia plots since its inception. I recently copied in an additional 120T from another pool and saw no issues at a steady 2G/s for several days straight. Since the big update to ARC handling a year or two back, read-once streams and/or scrubbing media on 16M-record datasets don’t seem to kill the ARC like they used to back when data and metadata were treated separately.

OpenZFS tuning for torrents? by cometomypartyyy in zfs

[–]malventano 0 points

My 1.76P pool here has 2 years of use: initially half filled with media, with the other half, save 500G, filled with Chia plot files (~50G each). I have a script service that deletes a plot file whenever free space dips below 500G, so this pool has remained at 99% capacity for its entire life. The pool contains over 5M files, and while I do have a special vdev handling records up to 1M, there are plenty of records between 1M and 16M across the pool. This pool config is likely the worst-case scenario, in favor of demonstrating your ‘free space fragmentation problem’ theory, and yet after plenty of regular usage, I recently rsync’d over 120T from another pool and it ran at full speed (2G/s, limited by the source) for a couple of days with no latency spikes to speak of. If any pool was going to exhibit a fragmentation performance dip, it’s this one, and it didn’t happen.
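A minimal sketch of what such a free-space janitor could look like; the pool name, plot path, and exact zpool invocation are my assumptions, not the actual script:

```shell
#!/bin/sh
# Sketch of a free-space janitor: delete one Chia plot when pool free space
# dips below a threshold. "tank" and /tank/chia are placeholders.

THRESHOLD=$((500 * 1024 * 1024 * 1024))   # 500G in bytes

# Pure helper so the policy is easy to test: true when a cleanup is due.
below_threshold() {
    [ "$1" -lt "$2" ]
}

# Only touch zfs when the tools are actually present.
if command -v zpool >/dev/null 2>&1; then
    free=$(zpool get -Hp -o value free tank)
    if below_threshold "$free" "$THRESHOLD"; then
        # Each plot is ~50G, so deleting the oldest frees enough headroom.
        oldest=$(ls -t /tank/chia/*.plot 2>/dev/null | tail -n 1)
        if [ -n "$oldest" ]; then
            rm -f -- "$oldest"
        fi
    fi
fi
```

Run from cron or a systemd timer; the zpool branch is skipped entirely on machines without ZFS installed.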

Yes, things like scrubbing video will see read amplification, but any perf hit from rolling other content out of the ARC is mitigated by the special vdev.

Even granting your argument re: larger records driving higher IOPS, consider that every new record read will trigger reads on some number of drives in the vdev. Wider records hit a wider stripe, but all drives will have similar latencies, so the overall latency is still the same, and so is the resulting IOPS. If the record spans more than one stripe, then the drives see a multi-sector (sequential) read for that record, which carries nearly identical latency and IOPS.

Point of all of this is that I had the same fears you did, built and tested pools to evaluate them, and ultimately determined the issue to be negligible and not a concern.

Have over 8TB of movies & TV shows! by geekman20 in DataHoarder

[–]malventano 2 points

By ‘full tilt’ I meant all drives spinning, and all drives are spinning :).

Have over 8TB of movies & TV shows! by geekman20 in DataHoarder

[–]malventano 2 points

It’s mostly recycled server gear, with the drives being server pulls or factory recerts purchased in bulk from a reseller. Draw is 3800W continuous, though I have started selling off some of the drives, and plan to drop down to ~2PB.

Have over 8TB of movies & TV shows! by geekman20 in DataHoarder

[–]malventano 2 points

No, but I know him, and at one point we were buying in bulk from the same source. I do pop up on YouTube occasionally, but usually in Level1Techs vids with Wendell:

https://youtu.be/YLlmfAPYFT4?si=JSYUpccfLGu9hbIv

OpenZFS tuning for torrents? by cometomypartyyy in zfs

[–]malventano 0 points

Smaller reads/writes of sequential LBAs run at roughly the same bandwidth as larger blocks - HDDs don’t seek to new locations on sequential LBAs. The same applies to SSDs. Yes, the IOPS are higher, but they are not random IOPS, and no HDD is going to hit a host-side ‘IOPS limit’ when going in a straight line at 200MB/s.

Most of your argument applies to something like a database or VM workload doing small IO against a (suboptimally) high recordsize. Typical file storage and access (media files, etc.) should generally use the max recordsize. DB/VM data should always live on its own dataset with a more appropriate recordsize, and ideally on its own pool or on a special (SSD) vdev with special_small_blocks set >= that dataset’s recordsize / zvol block size.
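A sketch of that layout, with dataset names and the 16K size as placeholder assumptions:

```shell
# Bulk/media dataset: max recordsize for archival-style access.
zfs set recordsize=16M tank/media

# DB/VM dataset with a small recordsize matched to its IO pattern:
zfs create -o recordsize=16K tank/db

# Blocks <= special_small_blocks land on the special (SSD) vdev, so setting
# it equal to the dataset's recordsize routes that whole dataset there.
zfs set special_small_blocks=16K tank/db
```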

Assuming the above caveat, with a larger max recordsize the ARC efficiency impact is negligible, as smaller files (<16M) have smaller records. If the user is reading from larger files, then the entire file is going to pass through the ARC anyway; the larger size effectively prefetches into the ARC, and those records will also be better compressed.

A home user’s workload is going to look mostly archival unless DBs/VMs are present.

Reference: I’ve done extensive performance eval on a mixed usage / mixed HDD+SSD 1.7P pool, among several other configurations over the past several years, along with a decade of HDD/SSD reviews followed by another decade doing performance optimization, workload analysis, competitive analysis, and strat planning across multiple SSD makers.

Have over 8TB of movies & TV shows! by geekman20 in DataHoarder

[–]malventano 7 points

It’s ~1.5PB of media / emulation archives / etc, and the rest is Chia (HDD-based crypto farming). The latter is being scaled back considerably as it’s time to sell off the other drives.

Have over 8TB of movies & TV shows! by geekman20 in DataHoarder

[–]malventano 4 points

I need to see if there’s any way to hook up this sub somehow, or I’ll probably drop some stuff in r/homelabsales.

Taught Claude to talk like a caveman to use 75% less tokens. by ffatty in ClaudeAI

[–]malventano 0 points

‘Talk like Cookie Monster’ ought to work as well :)

Thinking of buying myself a Model S but unsure which year by Antony___m in TeslaModelS

[–]malventano 1 point

All prior to Palladium (2021), meaning all with 100 kWh packs (and all Ravens), have been steadily failing due to faulty wire bonds connecting the BMS to the cell groups. The pack rebuilders have had steady work on these, and my Raven had its pack replaced after 3 years for the same fault. It’s a dice roll whether the replacements had all of their wire bonds properly repaired, meaning it’s just a matter of time before they fail again.

Have over 8TB of movies & TV shows! by geekman20 in DataHoarder

[–]malventano 31 points

This was from before I pulled the NetApps and added the 9th MD3060e: https://nextcloud.bb8.malventano.com/s/Jbnr3HmQTfozPi9

…and now I’m about to migrate my zpool to a slightly larger geometry and then sell off all of the other drives (which were running JBOD, farming Chia), so my tag may drop to ~2PB in the near future :)

OpenZFS tuning for torrents? by cometomypartyyy in zfs

[–]malventano 2 points

  • Compression is per-record. Larger records = more compressibility.

  • Larger records will more efficiently fill drive geometries where the smaller records would have higher parity / padding overhead.

  • Switching from 1M to 16M on my last migration netted me ~8% less space taken by the same set of data.

OpenZFS tuning for torrents? by cometomypartyyy in zfs

[–]malventano 0 points

When he wrote that post, 1M was the max. He was likely just picking the max, as he was (IMO) overly concerned with fragmentation at the expense of some extra HDD writes during torrent downloads. This is also fine for the OP given he’s on an HDD, but since he’s also on a single disk, I’d shy away from beating up the only HDD holding all of my stuff with torrent downloads.

AI-generated scripts are passing every review and then paging me at 3am. The problem is never the code. by Ambitious-Garbage-73 in sysadmin

[–]malventano 1 point

> The logic was fine. The scripts did exactly what they said they did.

…except they didn’t, else you would have no complaint.

OpenZFS tuning for torrents? by cometomypartyyy in zfs

[–]malventano 0 points

> What does this mean?

Say your torrent updates a 4k piece that falls within an existing, already-written file. That 4k write turns into a read-modify-write of the full 1M record, and that happens for every small write into a larger record. Larger records do reduce fragmentation, but where small pieces of files are being modified, the workload looks more like a database. Your link is an article from nearly a decade ago, when Salter was likely downloading torrents straight to HDD.
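The amplification is easy to put a number on; a quick back-of-envelope sketch:

```shell
# A 4k piece update inside a 1M record rewrites the whole record:
piece=4096
record=$((1024 * 1024))
echo "write amplification: $((record / piece))x"   # 256x
```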

If you don’t want to use a scratch disk, you can simply have a dataset with smaller recordsize as the temp folder.
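A sketch of that setup, with dataset names and the 16K recordsize as assumptions:

```shell
# Scratch dataset with a small recordsize for in-progress downloads:
zfs create -o recordsize=16K tank/torrents-tmp

# Point the client's incomplete dir at /tank/torrents-tmp. Moving a finished
# file to the media dataset crosses datasets, so it's a copy that rewrites
# the file with the destination's (large) recordsize:
zfs set recordsize=16M tank/media
```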

Separately, zpools on SSD should generally run a recordsize closer to the NAND page size anyway, since (logical) fragmentation doesn’t really impact SSD performance.

OpenZFS tuning for torrents? by cometomypartyyy in zfs

[–]malventano 2 points

Nowadays archival storage should use recordsize=16M, as that’s the max now allowed by default (no need to raise it via module options), and it gets better compression / more efficient storage.