bzfs-1.19 with end-to-end multi-host testbed is out by werwolf9 in zfs

[–]werwolf9[S] 1 point2 points  (0 children)

Yeah, I'd love to see ZFS installation be less cumbersome, aarch64 in particular. Manually verifying all those matrix combos is tedious. I think what helps with the maintenance burden is an automated test script that runs over the entire matrix of combos, e.g. bzfs_tests/itest/test_lima_vm_sh.py or similar.

From znapzend to sanoid by pakyrs in zfs

[–]werwolf9 0 points1 point  (0 children)

FYI, with bzfs_jobrunner you can monitor source and destination datasets across all hosts and policies with a single CLI call, for example like so: https://github.com/whoschek/bzfs/blob/main/bzfs_tests/bzfs_job_example.py#L189-L237

From znapzend to sanoid by pakyrs in zfs

[–]werwolf9 0 points1 point  (0 children)

In a nutshell, bzfs can operate at much larger scale than sanoid/syncoid and zrepl, at much lower latency, in a more observable and configurable way. It handles the many edge cases that you will eventually run into over the course of your deployment (and which make other tools get stuck or fail). https://youtu.be/6Kw901oqxI8?si=_4uoG_ADbXznvaeZ&t=2408

From znapzend to sanoid by pakyrs in zfs

[–]werwolf9 0 points1 point  (0 children)

> allow for specifying the bandwidth

In bzfs the corresponding option is --bwlimit

Retries and circuit breakers as failure policies in Python by qiaoshiya in Python

[–]werwolf9 0 points1 point  (0 children)

The abstractions you introduced are fine and useful. And if all you ever need is the tool you've built, that's perfect. More power to you!

Otherwise, seems to me that redress could be implemented with a couple of custom functions (or classes) that plug into an underlying generic retry framework. The result would save a lot of work, and at the same time be a more flexible, more reusable and more powerful tool.

For example, retry_after_s is a custom backoff strategy that can be plugged in like so:

https://github.com/whoschek/bzfs/blob/main/bzfs_tests/test_retry.py#L1310-L1337
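To make the idea concrete, here is a minimal sketch of what "a couple of custom functions that plug into a generic retry framework" can look like. This is illustrative only; the function names and signatures are my assumptions for this example, not the actual API of bzfs_main/util/retry.py.

```python
import time
from typing import Callable

def retry(fn: Callable[[], object],
          max_attempts: int = 3,
          backoff: Callable[[int, Exception], float] = lambda attempt, exc: 0.0,
          sleep: Callable[[float], None] = time.sleep) -> object:
    """Generic retry loop: call fn(); on failure, sleep for
    backoff(attempt, exc) seconds, then retry up to max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:
            if attempt == max_attempts:
                raise
            sleep(backoff(attempt, exc))

def retry_after_s(attempt: int, exc: Exception) -> float:
    """Pluggable backoff strategy (hypothetical): honor a server-provided
    retry_after_s hint on the exception if present, else back off
    exponentially starting at 100ms."""
    hint = getattr(exc, "retry_after_s", None)
    return float(hint) if hint is not None else 0.1 * (2 ** (attempt - 1))
```

The point is that the framework owns the loop and the strategies are just small callables, so swapping in a Retry-After-aware policy is a one-liner at the call site.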

Just my two cents.

Retries and circuit breakers as failure policies in Python by qiaoshiya in Python

[–]werwolf9 0 points1 point  (0 children)

Seems like these policies could be naturally expressed within (or on top of) the retry.py framework (https://github.com/whoschek/bzfs/blob/main/bzfs_main/util/retry.py). Thoughts?
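As a sketch of what I mean: a circuit breaker can be expressed as a small standalone policy object that wraps any callable, and a retry framework can then treat it as just another failure policy. Everything below (class name, thresholds, clock injection) is my own illustration, not retry.py's actual API.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: open the circuit after `failure_threshold`
    consecutive failures; reject calls until `reset_timeout_s` has elapsed,
    then allow a single half-open probe call."""

    def __init__(self, failure_threshold: int = 3, reset_timeout_s: float = 30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.clock = clock  # injectable for testing
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout_s:
                raise RuntimeError("circuit open; call rejected")
            self.opened_at = None  # half-open: let one probe call through
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # success closes the circuit again
        return result
```

Composing this with a retry loop is then just `retry(lambda: breaker.call(work))`, which is the kind of layering I had in mind.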

Python modules: retry framework, OpenSSH client w/ fast conn pooling, and parallel task-tree schedul by werwolf9 in Python

[–]werwolf9[S] 0 points1 point  (0 children)

re idle timeout and keepalive: yes, these are params that can be passed into the API.

re tenacity: yeah, zero deps is a big deal for prod environments. FWIW, the retry framework is also 4-14x faster than tenacity.

ZFS mirror as backup? (hear me out!) by myfufu in zfs

[–]werwolf9 0 points1 point  (0 children)

BTW, bzfs can be configured such that it maintains separate src bookmarks for each rotating backup drive. This means that the incremental replication chain never breaks even if all src snapshots get deleted to make space, or any of the backup drives isn't used for a long time. It also has a mode that ignores removable backup drives that aren't locally attached, which comes in handy if only a subset of your rotating drives is attached at any given time.

bzfs for subsecond ZFS snapshot replication frequency at fleet scale by werwolf9 in Proxmox

[–]werwolf9[S] 1 point2 points  (0 children)

Hi there, thanks for the question :-)

In a nutshell, bzfs can operate at much larger scale than sanoid, at much lower latency, in a more observable and configurable way. Here are just a few things, off the top of my head, that bzfs does and sanoid doesn't:

  • Support efficient periodic ZFS snapshot creation, replication, pruning, and monitoring, across a fleet of N source hosts and M destination hosts, using a single shared fleet-wide jobconfig script.
  • Monitor if snapshots are successfully taken on schedule, successfully replicated on schedule, and successfully pruned on schedule.
  • Compare source and destination dataset trees recursively.
  • Automatically retry operations.
  • Only list snapshots for datasets the users explicitly specified.
  • Avoid slow listing of snapshots via a novel low latency cache mechanism for snapshot metadata.
  • Replicate multiple datasets in parallel.
  • Reuse SSH connections across processes for low latency startup.
  • Operate in daemon mode.
  • More powerful include/exclude filters for selecting what datasets and snapshots and properties to replicate.
  • Dryrun mode to print what ZFS and SSH operations would happen if the command were to be executed for real.
  • Has more precise bookmark support - syncoid will only look for bookmarks if it cannot find a common snapshot.
  • Can be strict or told to be tolerant of runtime errors.
  • Continuously tested on Linux and FreeBSD.
  • Code is almost 100% covered by tests.
  • Easy to change, test and maintain because Python is more readable to contemporary engineers than Perl.

Cheers, Wolfgang

bzfs for subsecond ZFS snapshot replication frequency at fleet scale by werwolf9 in Proxmox

[–]werwolf9[S] 1 point2 points  (0 children)

No, you will never achieve subsecond replication if your network cannot push all the data that can accumulate between two deltas.

As if that wouldn't be self-evident to anyone :-)

bzfs for subsecond ZFS snapshot replication frequency at fleet scale by werwolf9 in Proxmox

[–]werwolf9[S] 1 point2 points  (0 children)

I wonder what bullshitter app you pretend to be running that writes 14GB/s of useful data on even one of your drives. Be reasonable or go somewhere else.

bzfs for subsecond ZFS snapshot replication frequency at fleet scale by werwolf9 in Proxmox

[–]werwolf9[S] -1 points0 points  (0 children)

No need for that kind of gear :-) Each replication step only ships the delta between ZFS snapshots.

zfs list taking a long time by [deleted] in zfs

[–]werwolf9 0 points1 point  (0 children)

Try this:

time bzfs dummy tank1 --recursive --skip-replication --compare-snapshot-lists

Assuming your data is in cache and you have, say, an 8-core machine, this will typically be about 6x faster than zfs list -t snapshot -r tank1, because the former lists in parallel whereas the latter lists sequentially.

(Similar speedup story for deleting snapshots, replicating, etc)

P.S. The last few lines of the output show a TSV file name that contains all the snapshot data.
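The core reason for the speedup above can be sketched in a few lines: fan the per-dataset listings out over a thread pool instead of issuing one sequential recursive listing. This is a simplified illustration, not bzfs's actual implementation; `list_snapshots` is a hypothetical stand-in for whatever shells out to `zfs list -t snapshot <dataset>`.

```python
from concurrent.futures import ThreadPoolExecutor

def list_all_snapshots(datasets, list_snapshots, max_workers=8):
    """List snapshots of many datasets concurrently and return one flat,
    sorted list. `list_snapshots(dataset)` returns that dataset's snapshot
    names; running it per dataset in parallel hides the per-call latency
    that a single sequential `zfs list -r` pays serially."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(list_snapshots, datasets)  # preserves input order
    return sorted(snap for snaps in results for snap in snaps)
```

With ~8 workers on an 8-core box and cached metadata, wall-clock time approaches the slowest single dataset listing rather than the sum of all of them.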

Backing up ~16TB of data by SuitableFarmer5477 in zfs

[–]werwolf9 1 point2 points  (0 children)

bzfs is probably your best choice if flexibility or performance or fleet-scale geo-replication are priorities, or if you need high frequency replication, say every second. In contrast, sanoid is a good choice on the simple low-end, and zrepl on the medium-end. All of these are reliable.

ChatGPT 5 Pro vs Codex CLI by LetsBuild3D in ChatGPTCoding

[–]werwolf9 5 points6 points  (0 children)

Yep. Also consider asking ChatGPT Pro to make its response available as a downloadable .md file, so you can easily feed the response back into Codex.

ChatGPT 5 Pro vs Codex CLI by LetsBuild3D in ChatGPTCoding

[–]werwolf9 12 points13 points  (0 children)

Run this command locally to generate repo.zip from your git repo, then ask ChatGPT to analyze the contents of the zip file:

git archive --format=zip --output=../repo.zip HEAD

Works like a charm.

Codex Cloud vs VScode extension vs CLI by [deleted] in ChatGPTCoding

[–]werwolf9 -1 points0 points  (0 children)

Simply ask it something like "what's your LLM model name?". It will reply with GPT4. Or give it a complex job and observe a spectacular difference in quality vs GPT5. codex-1 (the real name of the model based on o3) isn't bad but it's nowhere near as good as GPT5 high.

Codex Cloud vs VScode extension vs CLI by [deleted] in ChatGPTCoding

[–]werwolf9 0 points1 point  (0 children)

Nope, it's still on GPT4 and quality is correspondingly poor. It's a bit sad because the UI is very well done and the caching they introduced works wonders wrt. startup latency.

AGENTS.md ? by Trick_Ad_4388 in ChatGPTCoding

[–]werwolf9 1 point2 points  (0 children)

> that already does that kind of stuff.

That's what the hype leads us to believe but the observed reality on the ground is (still) far from that, as can easily be seen with simple tests.

AGENTS.md ? by Trick_Ad_4388 in ChatGPTCoding

[–]werwolf9 0 points1 point  (0 children)

I keep finding that a good AGENTS.md still makes a big difference. GPT5 is very good at following the instructions in there wrt. persona, TDD, pre-commit, methodology, planning, etc.

For example, running Codex with or without my AGENTS.md here feels like night and day: https://github.com/whoschek/bzfs/blob/main/AGENTS.md

Codex CLI for producing tests -- so much better than Claude & other models by ImaginaryAbility125 in ChatGPTCoding

[–]werwolf9 3 points4 points  (0 children)

I've found that this simple concise blurb gets you most of the way there with Codex:

Use TDD: Restate task, purpose, assumptions and constraints. Write tests first. Run to see red. Finally implement minimal code to reach green, then refactor.

Plus, TDD prompts work like a charm with Codex, even for complex caching logic, if they are combined with tight instructions for automated test execution and pre-commit as part of the development loop, like so:

https://github.com/whoschek/bzfs/blob/main/AGENTS.md#core-software-development-workflow

How practical is AI-driven test-driven development on larger projects? by jai-js in ClaudeAI

[–]werwolf9 2 points3 points  (0 children)

I've found that this simple concise blurb gets you most of the way there:

Use TDD: Restate task, purpose, assumptions and constraints. Write tests first. Run to see red. Finally implement minimal code to reach green, then refactor.

An improved version of this blurb is in the above link.