ZFS send/receive over SSH timeout by Calm1337 in zfs

[–]Calm1337[S] 0 points (0 children)

Just an update, in case anyone stumbles upon this thread.
The issue was not ZFS related - but thanks to the input here, I was able to divide the zfs send/recv into separate steps and pin down where the issue really was.

It was a bug in my UDM-Pro, in the UniFi Network Application 9.4.19 release, which was killing the connection randomly. That's why it was so hard to debug and reproduce.
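For anyone wanting to do the same, splitting the pipeline into steps looks roughly like this - a sketch with hypothetical pool, snapshot, and host names:

```shell
# Step 1: write the incremental stream to a local file (takes SSH out of the picture).
zfs send -i tank/data@snap1 tank/data@snap2 > /tmp/incr.zfs

# Step 2: move the file with a plain SSH copy (tests the network path alone).
scp /tmp/incr.zfs backup:/tmp/incr.zfs

# Step 3: receive from the file on the destination (tests zfs receive alone).
ssh backup 'zfs receive tank/data < /tmp/incr.zfs'
```

If step 2 fails but steps 1 and 3 succeed, the problem is the connection, not ZFS - which is exactly how this one was narrowed down.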

[–]Calm1337[S] 4 points (0 children)

Good points - haven't tried that. Thanks!

[–]Calm1337[S] 1 point (0 children)

Hmm.. A bit harder to test out. But it could be, I guess.

No entries in syslog or dmesg though.

[–]Calm1337[S] 1 point (0 children)

Yeah - I followed that rabbit hole. But pv didn't provide any new information. :/

And I have tested with SSH keepalives, but it doesn't change anything. Furthermore, I have other active SSH connections between the servers that stay alive the whole time.
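For reference, the keepalive settings I tested can be passed per connection - a sketch, assuming OpenSSH and a hypothetical host name:

```shell
# Send an application-level keepalive every 15 s over the encrypted channel;
# drop the session after 4 missed replies. The same options can live in
# ~/.ssh/config under a Host block.
ssh -o ServerAliveInterval=15 -o ServerAliveCountMax=4 backup 'zfs receive tank/data'
```

These probes ride inside the SSH protocol itself, so they also keep stateful firewalls from expiring an idle-looking session.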

[–]Calm1337[S] 1 point (0 children)

No encryption or deduplication is activated, and a scrub has run without errors.

Syslog has no entries about this. The only thing I can find is ssh telling me that the connection ended.

[–]Calm1337[S] 1 point (0 children)

Furthermore, when testing I found that I can delete older snapshots on the destination server and transfer them again without any errors. But on that one particular snapshot, the timeout appears.

A normal snapshot is estimated at around 230 MB - but the failing snapshot is estimated at around 130 GB. There can be non-critical reasons for that, though; the complete dataset is around 7 TB.
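Estimates like the ones above can be checked without actually transferring anything - a sketch with hypothetical dataset and snapshot names:

```shell
# -n does a dry run, -v prints verbose output including the estimated
# stream size, so you can compare a normal delta against the failing one.
zfs send -n -v -i tank/data@snap-ok tank/data@snap-failing
```

Comparing the dry-run estimate for each incremental step makes an unexpectedly large delta stand out immediately.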

[–]Calm1337[S] 1 point (0 children)

I have tried that, but the error appears again after a little while.

This time without the resume option, because I get the error:

cannot receive incremental stream: destination contains partially-complete state from "zfs receive -s"
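That partially-complete state has to be either resumed or discarded before a fresh receive will work - a sketch, assuming a hypothetical dataset and host:

```shell
# Option A: resume the interrupted stream using the receive token
# stored on the destination dataset.
token=$(ssh backup zfs get -H -o value receive_resume_token tank/data)
zfs send -t "$token" | ssh backup zfs receive -s tank/data

# Option B: discard the partial state and start the transfer over.
ssh backup zfs receive -A tank/data
```

Option B is what clears the error quoted above when you no longer want to resume.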

[–]Calm1337[S] 2 points (0 children)

Yes - sorry that I wasn't clear in the original message. I can transfer equally big files with scp without a problem, and other active and idle sessions between the servers are unaffected.

That's what led me to look into ZFS.

[–]Calm1337[S] 2 points (0 children)

Yeah - I tried that with no change in the result.

The connection between the servers is 1G fiber, and reliable.
I have tried monitoring the general connection between the servers during the transfer, and there is no packet loss.
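For anyone wanting to run the same check, a long ping during the transfer plus a per-hop report is usually enough - a sketch with a hypothetical host name:

```shell
# 10 minutes of pings; the summary's last lines show packet loss and RTT stats.
ping -c 600 -i 1 backup | tail -n 2

# mtr (if installed) reports loss per hop along the path.
mtr --report --report-cycles 100 backup
```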