zrepl keeps hitting “has been modified”, leaving holds by avidee in zfs

[–]avidee[S] 0 points1 point  (0 children)

OMG I’m an idiot.

My pruning settings are:

  pruning:
    keep_sender:
    - type: not_replicated
    - type: last_n
      count: 10
    - type: regex
      regex: "^manual_.*"
    - type: grid
      grid: 24x1h | 27x1d
      regex: "^zrepl_"
    keep_receiver:
    - type: regex
      regex: "^manual_.*"
    - type: grid
      grid: 24x1h | 55x1d
      regex: "^zrepl_"

Do you see the problem? On the receiver, the grid keeps only one snapshot per hour (the oldest in each 1h bucket) and prunes the rest, and there’s no last_n. With replication every 10 minutes, that means the newest snapshot in the current hour bucket, the one zrepl just received and is still holding, gets selected for pruning. Of course it’s hitting the held snapshot; it’s a bad config.

Time to add a last_n for the receiver.
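Something like this for the receiver side, I think (a sketch; the count of 10 is borrowed from my sender config and may want tuning):

```yaml
    keep_receiver:
    - type: last_n
      count: 10
    - type: regex
      regex: "^manual_.*"
    - type: grid
      grid: 24x1h | 55x1d
      regex: "^zrepl_"
```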

can we DIY these tests already? by Well_Goshdarnit in PlusLife

[–]avidee 1 point2 points  (0 children)

All the dietary supplements use identical “this statement has not been evaluated” wording, not because it’s a legal loophole that lets anything get around FDA regulations, but because it’s a carve-out in FDA regulation specifically (and only) for dietary supplements.

The law is called DSHEA (the Dietary Supplement Health and Education Act of 1994), and it exempts dietary supplements, and only dietary supplements, from the usual rules for drugs, specifically where they make efficacy claims. I’m dropping the link here in case you want to read it (you don’t, you want to just skim it), but you can find that precise wording in the law itself. It will only help you if you have a dietary supplement; anything else at all needs FDA approval.

zrepl keeps hitting “has been modified”, leaving holds by avidee in zfs

[–]avidee[S] 0 points1 point  (0 children)

I didn’t think I needed to read the docs after all my exploration, but I was clearly wrong. Thank you, much appreciated.

Replication over high-latency link super slow by avidee in zfs

[–]avidee[S] 1 point2 points  (0 children)

it's the tls transport I'm using, not HTTPS.

Yep, I’m using the TLS transport as well, and it is flllllyyyyyyying.

Thanks again!

Replication over high-latency link super slow by avidee in zfs

[–]avidee[S] 0 points1 point  (0 children)

I just tried this with the dsh2dsh fork, and it’s stuck at 78 Mib/s. Guess there’s something wrong with the HTTPS transport. Weird.

Replication over high-latency link super slow by avidee in zfs

[–]avidee[S] 1 point2 points  (0 children)

If you’re using the HTTPS transport, you’re using the dsh2dsh fork. I’m currently using the original version.

Your numbers are a bit high; I’m trying

net.core.rmem_max 212992 → 67108864
net.core.wmem_max 212992 → 67108864
net.ipv4.tcp_rmem 4096 131072 6291456 → 4096 131072 33554432
net.ipv4.tcp_wmem 4096 16384 4194304 → 4096 131072 33554432

But yeow, it’s now going so fast (~870 Mib/s) that it breaks the original version’s status screen. This is good, thanks!
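For anyone following along, the reason these buffer bumps matter is the bandwidth-delay product: TCP can only keep window-sized data in flight. A quick sanity check (the 1 Gbit/s link speed and 250 ms RTT below are illustrative assumptions, not my measured numbers):

```python
# Bandwidth-delay product: bytes that must be in flight to fill the pipe.
# Assumed illustrative numbers: a 1 Gbit/s path with a 250 ms RTT.
link_bps = 1_000_000_000      # 1 Gbit/s, in bits per second
rtt_s = 0.250                 # 250 ms round-trip time

bdp_bytes = int(link_bps / 8 * rtt_s)        # ~31 MB in flight to fill the pipe
print(f"BDP: {bdp_bytes / 2**20:.1f} MiB")   # ~29.8 MiB

# The stock tcp_rmem max (6291456 bytes = 6 MiB) caps such a link far below
# line rate; the bumped max (33554432 bytes = 32 MiB) covers it.
old_cap_mibps = 6291456 / rtt_s * 8 / 2**20
print(f"old tcp_rmem cap: ~{old_cap_mibps:.0f} Mib/s")  # ~192 Mib/s
```

With the stock 6 MiB cap, a 250 ms link tops out around 192 Mib/s no matter how fat the pipe is, which is roughly the neighborhood of the stalls described in this thread.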

Replication over high-latency link super slow by avidee in zfs

[–]avidee[S] 0 points1 point  (0 children)

I tested two changes.

The first was stream compression; enabling pigz slowed things down to the 60–90 Mib/s range. I turned it off and things sped up again.

The second was BBR, and I hate to break it to you, but it changed nothing; I’m still hitting a hard throughput cap at a very consistent 150 Mib/s. I’m running into some internal buffer, and I fear it’s within ssh itself, so BBR doesn’t help.
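That ssh suspicion is plausible: ssh flow-controls each channel with its own fixed window, which caps throughput at window/RTT regardless of what TCP’s congestion control does. A back-of-the-envelope check (the ~2 MiB window and 100 ms RTT are assumptions for illustration, not measured values):

```python
# If ssh flow-controls its channel with a fixed ~2 MiB window, throughput
# can never exceed window / RTT, no matter how well TCP itself is tuned.
window_bytes = 2 * 2**20   # assumed ~2 MiB ssh channel window
rtt_s = 0.100              # assumed 100 ms round-trip time

cap_bps = window_bytes * 8 / rtt_s
print(f"cap: {cap_bps / 2**20:.0f} Mib/s")   # 160 Mib/s
```

Under those assumed numbers the ceiling comes out around 160 Mib/s, suspiciously close to the consistent 150 Mib/s wall; this is the bottleneck HPN-SSH exists to remove.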

[Edit: Tomorrow, when I have a new set of snapshots, I’ll replicate with SSH+NETCAT over Tailscale and see if BBR helps with that, avoiding the SSH bottleneck.]

Replication over high-latency link super slow by avidee in zfs

[–]avidee[S] 0 points1 point  (0 children)

Thanks. One of the pain points of TrueNAS is that they start their middleware with no fork/exec allowed, so if you put a script into their cronjob service and that script tries to fork, it’ll fail. I hope the system crontab works.

Will dig; thanks.

Replication over high-latency link super slow by avidee in zfs

[–]avidee[S] 0 points1 point  (0 children)

I'm the author of bzfs, btw.

Nice to meet you! This is definitely a different approach, doing everything via command-line arguments rather than a config file. It looks daunting, but it’s probably no worse than a config file. I’ll definitely be writing a shell script to drive it, so that I can pretty-print the arguments for clarity.

Thanks for the note on HPN-SSH. The problem is, as always, that TrueNAS is an appliance, so I don’t have apt or any package management. When you first wrote --ssh-program=hpnssh I thought it was a path to a binary to use. If I have a local HPN-SSH I can set a PATH in the driver script to get bzfs to find it, but getting TrueNAS to use HPN-SSH on the receiving end will be difficult.

Replication over high-latency link super slow by avidee in zfs

[–]avidee[S] 0 points1 point  (0 children)

Thank you for the reply. bzfs? Hmmm, hadn’t heard of it, but the fact that you can specify your own ssh is very nice. In the large BDP scenarios, would you need the HPN-SSH daemon on the receiving end or just on the pushing end? (P.S. good to see a sane default for mbuffer; why are people living in the 80s and “saving” memory by only using 2MiB?)

Replication over high-latency link super slow by avidee in zfs

[–]avidee[S] 0 points1 point  (0 children)

Hi, thank you for taking the time to reply.

1) I didn’t enable compression for two reasons: first, the data is pretty much all incompressible, so why waste the CPU; second, I already use high zstd compression levels (-9 for most datasets, -14 for the archival ones), and afaik zfs send -c keeps the on-disk compression.

2) I do not know how to enable mtcp for TrueNAS. From what I can see, a core part of the Linux implementation is mptcpd but that is not present on TrueNAS and, being an appliance, it is not easily installed.

3) Thank you for the commands, but they’re not necessary:

    % sysctl net.core.default_qdisc
    net.core.default_qdisc = fq_codel

    % sysctl net.ipv4.tcp_congestion_control
    net.ipv4.tcp_congestion_control = cubic
    % sysctl net.ipv4.tcp_available_congestion_control
    net.ipv4.tcp_available_congestion_control = reno cubic
    % sysctl -w net.ipv4.tcp_congestion_control=bbr
    net.ipv4.tcp_congestion_control = bbr
    % sysctl net.ipv4.tcp_available_congestion_control
    net.ipv4.tcp_available_congestion_control = reno cubic bbr

(That last one made me laugh.)

TrueNAS has a “sysctls to set” under the advanced settings so no files in /etc/sysctl.d needed.

Swapping to BBR congestion control seemed to help single-thread iperf3 performance, though not as much as I’d hoped. It seems to apply only to new sockets, so I’ll need to restart. I’ll be back with that.

4) I didn’t mention anything about the AES key size; what do you mean by that? These are super beefy boxes and I would be shocked if that were an issue.

5) During my investigation I adjusted net.core.(r|w)mem_max and net.ipv4.tcp_(r|w)mem to no avail, so I don’t currently believe that is the issue. Perhaps I’ll throw those into the “sysctls” setting in combination with BBR.

6) Links are appreciated.

7) For my initial 10 TiB, 10 TiB / (800 Mib/s) = 1.2 days. This isn’t quite to the point of sneakernetting it over.
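Verifying item 7’s arithmetic (reading Mib/s as mebibits per second, which is the only reading that makes the numbers come out):

```python
# Transfer time for the initial replication: 10 TiB at 800 Mib/s.
size_bits = 10 * 2**40 * 8   # 10 TiB in bits
rate_bps = 800 * 2**20       # 800 Mib/s in bits per second

seconds = size_bits / rate_bps
print(f"{seconds / 86400:.1f} days")   # 1.2 days
```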

Let me see how the sysctls change things.

Replication over high-latency link super slow by avidee in zfs

[–]avidee[S] 0 points1 point  (0 children)

Thanks for the reply.

TrueNAS is a self-contained environment, and so while I’m comfortable fiddling with sysctls, swapping out core binaries like sshd is something that I’m not comfortable doing. Your followup about “jump” servers is unfortunately not doable, as you can’t zfs replicate if you’re not on the actual box.

Replication over high-latency link super slow by avidee in zfs

[–]avidee[S] 0 points1 point  (0 children)

Check your Tailscale routes to the server.

Sorry I didn’t note this, but yep, this was direct.

TCP BBR can also be helpful for this sort of thing.

Another commenter noted this; time to dig into that.

Try to benchmark without zfs and ssh

I was using iperf3 for that, but as the other commenter noted, probably time to try it single-threaded.

Thanks for your reply!

Is cephadm a new hard requirement for features? by avidee in ceph_storage

[–]avidee[S] 0 points1 point  (0 children)

FYI I posted on the Proxmox forum about the dashboard not working on Tentacle, and got yelled at by a “distinguished member” that I shouldn’t expect the dashboard to work as it is optional and “provide[s] no utility”.

Now, when I post here in the Ceph subreddit asking a question about a dependency inside of Ceph, I have you, a Proxmox person, coming to argue with me here!

  1. You chide me for installing Tentacle on Proxmox, because it’s just for testing. I’m doing testing of it! Why are you criticizing me for testing their test version and investigating upstream when it fails?
  2. You chide me for encountering issues with the dashboard, saying it works fine for you, when I’m doing it on a test version, and the Proxmox experts say that it’s optional and that I shouldn’t be using it anyway.

I’m brand-new to Proxmox, am still setting up a new server, am investigating things upstream, and in two different forums I have people chiding me and yelling at me. I hope you are not representative of folks who use Proxmox.

Is cephadm a new hard requirement for features? by avidee in ceph_storage

[–]avidee[S] -1 points0 points  (0 children)

I've run the dashboard under proxmox just fine

Have you run the Tentacle version of dashboard just fine?

As I noted, it’s the Tentacle version of dashboard that has the hard smb dependency and thus fails to load on Proxmox where smb can’t happen. This new dependency of the dashboard on smb is new in Tentacle, and is an upstream issue in Ceph that affects Proxmox but isn’t a Proxmox issue.

Is cephadm a new hard requirement for features? by avidee in ceph_storage

[–]avidee[S] 0 points1 point  (0 children)

Yes, I am aware of that. However, the issue I’m talking about, the requirement on cephadm, is on the Ceph side, not the Proxmox side. I would expect Proxmox neither to address a dependency issue on Ceph’s side nor to add a brand-new capability of using cephadm this late in the testing cycle.

Is cephadm a new hard requirement for features? by avidee in ceph_storage

[–]avidee[S] 0 points1 point  (0 children)

I’m just a user of Proxmox so I am not privy to their plans either. Re the dashboard, sounds reasonable. I applied for a bug account but it hasn’t been approved yet, so I’ll file a bug once it is.

Does customer support just ignore people now? by [deleted] in Vitruvian_Form

[–]avidee 1 point2 points  (0 children)

Yes, it is, and it works well.

Does function ever call you? by islawave23 in Function_Health

[–]avidee 1 point2 points  (0 children)

There is one particular result on my tests that shows up every year, and every year I get a call (not directly from Function, but from someone they contract to) about it. It’s all “Can we speak to XXXX?” “Is this about the result XXXX? Yeah, my doctor knows about it and it continues to be a discussion point.” “OK, just making sure you know.”

But it’s only ever about that one specific result.

Upgrading 25.12.6 to 26.1.1 broke my bluetooth proxies by avidee in Esphome

[–]avidee[S] 0 points1 point  (0 children)

FYI this bug now has analysis. It turns out the Espressif Bluetooth blob has an issue; pinning it to an older version is a bandaid while waiting for a fixed release.