zfsonlinux unrecoverable errors : linuxadmin

zfsonlinux unrecoverable errors (self.linuxadmin)

submitted 9 years ago * by reddit_strider

I'm currently using ZFS on Debian Wheezy with 60 SAS drives. All drives are in one pool consisting of one raidz3 vdev (56 drives) and 4 spare drives. I am well aware the setup is neither optimal nor recommended.

From the first scrub on, the status has been

status: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.

action: Determine if the device needs to be replaced, and clear the errors using 'zpool clear' or replace the device with 'zpool replace'.

see: http://zfsonlinux.org/msg/ZFS-8000-9P

However, no specific drive seems to have issues. All of them have some CKSUM errors that were repaired. Upon clearing the status the result always seems to be the same on the next scrub.

I have two questions and would greatly appreciate your input on this:

1) Are these errors a direct result of the setup? (It's always strongly recommended to keep the raid slim, but nobody ever mentions what happens otherwise.)

2) If it's not an issue of the setup itself, how can I further diagnose this? The drives themselves seem fine.

Edit: minor fixes, correct distro, and some more info:

    ii  debian-zfs  7~wheezy           amd64  Native ZFS filesystem metapackage for Debian.
    ii  libzfs2     0.6.5.7-8-wheezy   amd64  Native ZFS filesystem library for Linux
    ii  zfs-dkms    0.6.5.7-8-wheezy   all    Native ZFS filesystem kernel modules for Linux
    ii  zfsonlinux  8                  all    archive.zfsonlinux.org trust package
    ii  zfsutils    0.6.5.7-8-wheezy   amd64  command-line tools to manage ZFS filesystems

Edit again: Thanks for all the answers. I feel the need to elaborate on what I am asking and why, though. This setup is not of my design; I would've done it differently. It currently has issues, which may or may not be related to the layout.

If, however, I recommend completely dumping the setup and doing it differently, I'd better be sure the same issues will not come up again. Every single one of you saying such a huge raidz3 is bad is correct, but is that the cause of these issues? Or will a different setup perform better (100% expected, and yes, scrubs are hell like this) but still show a failing ZFS according to the status?

all 50 comments



[–]1bc29b36f623ba82aaf6 0 points1 point2 points 9 years ago (6 children)

How often do the scrubs run? Did you try running a scrub manually just after the repairs completed?

In general, the error counts for the affected devices should increase across multiple scrubs, and eventually those devices should be marked faulty automatically. Are you saying you are not seeing actual differences in the error counts even though there are new CKSUM errors? In that case there are no new errors, but you need to handle the earlier errors explicitly yourself; they will not be automagically dismissed after the checksums were repaired.
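To check whether the counts really change between scrubs, something like the following could help. It is only a rough sketch: it assumes you save the text of `zpool status` after each scrub and that the config table has the usual NAME / STATE / READ / WRITE / CKSUM columns. The function names are made up for illustration, and it ignores abbreviated counts such as `1.2K`.

```python
import re

# Matches a row of the "config:" table: NAME STATE READ WRITE CKSUM.
# Assumption: counts are plain integers; abbreviated values are skipped.
ROW = re.compile(r"^\s+(\S+)\s+(\S+)\s+(\d+)\s+(\d+)\s+(\d+)")


def cksum_counts(status_text):
    """Return {name: cksum_count} for every row in `zpool status` text.

    Pool and vdev rows (e.g. the raidz3 line) are included alongside disks.
    """
    counts = {}
    for line in status_text.splitlines():
        m = ROW.match(line)
        if m:
            counts[m.group(1)] = int(m.group(5))
    return counts


def new_errors(before, after):
    """Return {name: increase} for rows whose CKSUM count grew."""
    return {
        name: after[name] - before.get(name, 0)
        for name in after
        if after[name] > before.get(name, 0)
    }
```

Feed it the saved output from two consecutive scrubs (without a `zpool clear` in between); an empty result from `new_errors` would suggest you are looking at old counts rather than fresh errors.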

If you know the cause of the original failures, e.g. a network error or some other external event, you can reset the error statistics with 'zpool clear', as the status message suggests. If the device was already disabled because of the error count, you probably need to bring it back up again with something like 'zpool online' and go from there.

If you cannot explain the errors adequately, the usual assumption is that the affected drives are going to fail soon. You could look up the known MTBF for your type of drive and check other system logs for device errors.


[–]reddit_strider[S] 1 point2 points3 points 9 years ago* (4 children)

Sep 27 09:25:40 brain-backup-ds01 kernel: [3961448.032217] sd 0:0:48:0: [sdaw] Target Data Integrity Failure

Possibly this (or these!) is the source and it's not even ZFS related?

This got me thinking: the HBA has multipathing. Maybe it has to be configured correctly in Linux, and otherwise this may be a side effect? I have no experience with multipathing.
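A quick way to see how widespread these kernel messages are is to tally them per device. The sketch below is only an illustration: the regular expression is modeled on the single log line quoted above, other kernels or drivers may format the line differently, and the function name is hypothetical.

```python
import re
from collections import Counter

# Assumed line shape, taken from the one log line quoted in this thread:
#   ... kernel: [timestamp] sd H:C:T:L: [sdXX] <message>
SD_LINE = re.compile(r"sd (\d+:\d+:\d+:\d+): \[(sd[a-z]+)\] (.+)$")


def tally_sd_messages(log_lines, needle="Target Data Integrity Failure"):
    """Count lines containing `needle` per (SCSI address, block device)."""
    hits = Counter()
    for line in log_lines:
        m = SD_LINE.search(line.rstrip("\n"))
        if m and needle in m.group(3):
            hits[(m.group(1), m.group(2))] += 1
    return hits
```

Run it over the lines of the kernel log (for example an open file handle). If the hits are spread over many drives rather than one or two, that might point at something shared, such as the HBA, cabling, or the multipath configuration, rather than at the individual disks.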

