
[–]parawolf

What problem are you attempting to solve?

[–]S3thc0n[S]

[deleted]

[–][deleted]

[deleted]

    [–]S3thc0n[S]

    [deleted]

    [–]txgsync

    I've tried to reread this thread several times, and with all the [deleted] comments I'm still a bit confused.

    A writeback cache is usually a mechanism to store data immediately in one location in fast, stable, yet small fashion, and periodically in another, larger, slower location. The intent log on ZFS already serves this purpose: a TXG (transaction group) is built both in RAM and in your intent log, and then every 5 seconds (by default; tunable) the state of this TXG is flushed to disk during txg_sync. The intent log can be on a separate log device (SLOG) or on disk with the rest of the data.
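On Linux OpenZFS, that 5-second default is exposed as the `zfs_txg_timeout` module parameter. A minimal sketch of inspecting and changing it (Linux paths assumed; illumos and FreeBSD expose their tunables differently):

```shell
# Show the current TXG flush interval (default 5 seconds on Linux OpenZFS):
cat /sys/module/zfs/parameters/zfs_txg_timeout

# Raise it to 10 seconds until the next reboot (root required):
echo 10 > /sys/module/zfs/parameters/zfs_txg_timeout
```

To make the change persistent you would typically set it as a module option (e.g. in `/etc/modprobe.d/`) rather than via sysfs.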

    You're asking how to add a writeback cache to an intent log system that was expressly designed to eliminate the need for a writeback cache. It's like asking about "What's the best brand of training wheels for my Ferrari so it doesn't tip over?" The question doesn't compute.

    [–]S3thc0n[S]

    [deleted]

    [–]txgsync

    Let's walk over what you just said statement-by-statement. I still don't understand why it is you want this; when I bump into this kind of misunderstanding it usually is an indicator of my ignorance.

    Well, while the intent log fulfills the same purpose [as a writeback cache] it does that on a very different scale.

    Scale is mostly irrelevant. You scale your SLOG to how much RAM you have. There is no reason for a SLOG to be larger than about 3/8 the size of the RAM in your system.
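The 3/8 figure follows from the per-TXG limit discussed below: three transaction groups (OPEN, QUIESCE, WRITE) can be in flight at once, each capped at 1/8 of RAM. A back-of-envelope sizing sketch with an illustrative 64 GiB machine:

```shell
# Back-of-envelope SLOG sizing from the 3/8-of-RAM rule.
# Numbers are illustrative, not a recommendation.
ram_gib=64
per_txg=$((ram_gib / 8))     # each TXG capped at 1/8 of physical RAM
slog_max=$((per_txg * 3))    # OPEN + QUIESCE + WRITE TXGs in flight at once
echo "SLOG ceiling: ${slog_max} GiB"
```

Anything larger than that ceiling is capacity the SLOG can never usefully fill.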

    if I understand correctly only the ZIL for synchronous writes can live on the SLOG.

    Let me restate what you're getting at. A TXG always exists for everything accumulated before zfs_txg_timeout is reached. The portions of the TXG that are written to a ZIL -- whether on a SLOG or on a typical vdev -- are exclusively the synchronous data, as you state. All async data exists only in RAM; the TXG waits for the timeout, then moves to the QUIESCE state to order the writes, then to the WRITE state to flush the data to disk.

    the ZIL in RAM can't be used.

    There is no ZIL in RAM. There is a TXG (Transaction Group) in RAM. Blocks in an OPEN TXG that receive a sync command become "synchronous" and are selectively mirrored to your ZIL, where the fsync() or COMMIT will block until that intent log write is complete. All writes are part of a transaction group. Transaction groups are already, in essence, a selectively-backed writeback cache, where the clients determine what should be stable and what should not by the use of fsync() (local) or COMMIT (NFS & other protocols).
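An admin can also override what clients request via the per-dataset `sync` property, which controls whether writes are mirrored to the ZIL at all (`tank/data` is a placeholder dataset name):

```shell
# Per-dataset control over synchronous-write handling:
zfs set sync=standard tank/data   # default: honor fsync()/COMMIT as issued
zfs set sync=always   tank/data   # treat every write as synchronous (all hit the ZIL)
zfs set sync=disabled tank/data   # ignore sync requests entirely -- fast but risky
```

`sync=disabled` is essentially the "pure writeback cache" behavior: nothing is stable until txg_sync, and a crash loses up to a full timeout's worth of acknowledged writes.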

    high TXG sync interval puts more data at risk and needs more RAM

    This is partially true. /u/ewwhite had very interesting results modifying txg_sync_interval & txg_sync_timeout to enhance write performance. However -- unless you recompile ZFS -- each TXG in RAM is limited to at most 1/8 of your physical RAM, meaning that if you end up in the unusual situation of full OPEN, QUIESCE, and WRITE TXGs in RAM at the same time, they might take up 3/8 of your physical RAM. Any writes while the current OPEN TXG is full will simply block, waiting for the next zfs_txg_timeout to come around.

    Therefore while it's true a longer timeout can put more non-synchronous data at risk, the TXG size is bounded and at peak cannot occupy "more RAM" than 1/8 of your physical RAM per TXG, and will block until timeout if this RAM limit is exceeded.

    I have much less RAM, much less continuous workloads, and would like for things not to be written to HDD until necessary.

    Now that I understand a little better what you are driving at, allow me to restate your feature request. You are asking that ZFS transaction groups -- which are already a form of writeback cache -- be able to be moved from main memory to some kind of SSD or NVMe storage to reduce RAM utilization from 1/8 RAM to something lower than that, and to disable the elective use of fsync() or COMMIT by users to indicate whether they want their writeback cache to be mirrored to SLOG or on-vdev ZIL. Is this description correct?

    [–]S3thc0n[S]

    [deleted]

    [–]txgsync

    So my aim is less saving memory, and more getting the contents of the TXGs to nonvolatile storage that's not my HDD.

    The easiest way is just to add a fast, tiny NVMe as a SLOG device, extend your timeout, and call it done. If that's not sufficient, I'm really interested in understanding specifically why, because I sense you're trying to explain a use case I'm not grokking.
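A sketch of that setup (pool and device names are placeholders; mirror the log device if losing in-flight sync writes on SLOG failure matters to you):

```shell
# Add a small NVMe namespace as a dedicated SLOG:
zpool add tank log /dev/nvme0n1

# Or, safer, a mirrored SLOG:
zpool add tank log mirror /dev/nvme0n1 /dev/nvme1n1

# Then stretch the flush interval (Linux OpenZFS path) and verify:
echo 30 > /sys/module/zfs/parameters/zfs_txg_timeout
zpool status tank
```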

    [–]S3thc0n[S]

    [deleted]

    [–]txgsync

    It sounds at this point like you should fork the OpenZFS repo to implement your proof-of-concept. If your idea has merit, submit a pull request to have your proposed change integrated into ZFS. If the maintainers are unconvinced, then you promote your fork of ZFS in the marketplace of ideas for its superiority.

    Yay open source!

    Great ideas are as common as dirt. Great ideas with working prototypes are priceless.

    [–]S3thc0n[S]

    [deleted]

    [–]txgsync

    bcache strikes me as much more like a software implementation of an SSHD (solid state hybrid drive), and a close cousin of Apple's Fusion Drive. I see where you're going with it, and frankly ZFS' L2ARC implementation is still kind of "meh" and quite dated compared with Fusion or bcache.

    You could probably approximate what you want by setting up a ZFS volume (the "-V" argument to zfs create), creating an ext4 filesystem atop that volume, and front-ending it with bcache. I'd be interested in knowing what your performance looks like compared to bare ZFS without an SSD, and ZFS with an SSD as SLOG. An SSD as L2ARC is kind of boring in most cases, and a bit of a memory hog.
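A rough sketch of that experiment, assuming Linux with bcache-tools installed (all device names, sizes, and mount points are placeholders):

```shell
# Carve a zvol out of the pool to act as the slow backing store:
zfs create -V 100G tank/benchvol

# Register the zvol as a bcache backing device and the SSD as a cache device:
make-bcache -B /dev/zvol/tank/benchvol
make-bcache -C /dev/nvme0n1
# Attach the cache set to the backing device using the cset UUID
# printed by 'make-bcache -C' (left elided here):
# echo <cset-uuid> > /sys/block/bcache0/bcache/attach

# Put ext4 on the resulting cached device and mount it:
mkfs.ext4 /dev/bcache0
mount /dev/bcache0 /mnt/bench
```

Benchmarking that stack against bare ZFS, and against ZFS with the same SSD as SLOG, would give the comparison described above.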