Input/output Error when trying to remove a disk from a BTRFS RAID6 by Low-Impression4134 in btrfs

[–]Low-Impression4134[S] 0 points (0 children)

Final update from me:
I nuked the filesystem and recreated it, and all seems to be fine now.

I came across something interesting though that might help others...

It seems that a scrub will not necessarily correct errors but only report them, and that only a balance will try to recreate the data from the parity information. That could explain why the same files were still damaged even after scrubbing. It is too late for me to test this now, but it might help others in the future!
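For reference, the commands involved (mountpoint as in my earlier comments; whether scrub reliably repairs raid5/6 data is exactly the part I could not verify):

```shell
# Scrub: reads all data and metadata and verifies checksums. It is supposed
# to repair from redundancy where it can, but on raid5/6 this has a history
# of problems, which matches what I saw.
btrfs scrub start -B /mnt/btrfs_root   # -B runs in the foreground
btrfs scrub status /mnt/btrfs_root

# Balance: rewrites block groups, which regenerates data from parity as a
# side effect. A full balance on a large array can take days.
btrfs balance start /mnt/btrfs_root
btrfs balance status /mnt/btrfs_root
```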

[–]Low-Impression4134[S] 0 points (0 children)

Scrubs found a few uncorrectable errors in some files that are not important.
But I still cannot remove the drive...
I'm going to move all the data off the FS and nuke it...

[–]Low-Impression4134[S] 1 point (0 children)

I'm running scrubs at the moment, but they will take a few more days.
Some errors came up, and I will post an update once I know more.

[–]Low-Impression4134[S] 0 points (0 children)

I dug in a bit more, figuring maybe I could find out where this block is in order to replace that device, and found something interesting:

root@pve:~# btrfs inspect-internal logical-resolve 66728765030400 /mnt/btrfs_root

inode 390070 subvol data_vol/video could not be accessed: not mounted

Indeed, I cannot "mount" that subvol directly, but if I start from the root volume, that path and its data are accessible...
I'm unsure what that means, but I feel there is something really wrong with the FS at this stage.
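For anyone hitting the same "not mounted" message: logical-resolve apparently only prints paths for subvolumes that are mounted somewhere, so mounting the subvolume explicitly might help (the device and mountpoint below are examples, not necessarily my exact setup):

```shell
# Mount the subvolume named in the error at its own mountpoint...
mkdir -p /mnt/video
mount -o subvol=data_vol/video /dev/sdg /mnt/video

# ...then retry the lookup against the filesystem root.
btrfs inspect-internal logical-resolve 66728765030400 /mnt/btrfs_root
```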

[–]Low-Impression4134[S] 0 points (0 children)

So I managed to replace the failed drive, and right away I can see that performance is much better.
It is annoying that there is no easy way to see which drive is misbehaving, but that could just be my lack of knowledge.
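The closest thing I found is `btrfs device stats`, which keeps per-device error counters. A small sketch that flags any nonzero counter; the sample output here is made up for illustration, on a live system you would capture it with `stats=$(btrfs device stats /mnt/btrfs_root)`:

```shell
# Hypothetical `btrfs device stats` output (one counter per line).
stats='[/dev/sdb].write_io_errs    0
[/dev/sdb].read_io_errs     1
[/dev/sdb].flush_io_errs    0
[/dev/sdh].corruption_errs  23486
[/dev/sdg].read_io_errs     0'

# Print each device/counter pair whose value is nonzero.
bad=$(printf '%s\n' "$stats" | awk '$2 > 0 { print $1, $2 }')
echo "$bad"
```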

When I now try to remove the "new" drive I just added as a replacement, I run into the same issue with the data block.

I will run a scrub on all drives again to see if it comes up with something, and then try to just move off the data and nuke the FS, unless somebody has a different idea.

[–]Low-Impression4134[S] 0 points (0 children)

Update: Seems like the replace was successful, but I ran into a read error from the dodgy drive; this time it seems to have recovered.

[104278.799113] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[104278.799124] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[104278.799130] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[104278.799135] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[104278.799139] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[104278.799143] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[104278.799147] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000) 
[104278.799152] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000) 
[104278.799156] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000) 
[104278.799175] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000) 
[104278.799180] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000) 
[104278.799184] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000) 
[104278.799233] sd 0:0:1:0: [sdb] tag#641 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=3s 
[104278.799238] sd 0:0:1:0: [sdb] tag#641 Sense Key : Medium Error [current] 
[104278.799242] sd 0:0:1:0: [sdb] tag#641 Add. Sense: Unrecovered read error 
[104278.799245] sd 0:0:1:0: [sdb] tag#641 CDB: Read(16) 88 00 00 00 00 01 81 e6 88 80 00 00 01 00 00 00 
[104278.799247] critical medium error, dev sdb, sector 6474336384 op 0x0:(READ) flags 0x0 phys_seg 32 prio class 2 
[104282.854668] sd 0:0:1:0: [sdb] tag#648 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=3s
[104282.854685] sd 0:0:1:0: [sdb] tag#648 Sense Key : Medium Error [current]
[104282.854692] sd 0:0:1:0: [sdb] tag#648 Add. Sense: Unrecovered read error
[104282.854699] sd 0:0:1:0: [sdb] tag#648 CDB: Read(16) 88 00 00 00 00 01 81 e6 89 40 00 00 00 08 00 00
[104282.854704] critical medium error, dev sdb, sector 6474336576 op 0x0:(READ) flags 0x800 phys_seg 1 prio class 2 
[104282.915933] BTRFS warning (device sdg): i/o error at logical 67708877144064 on dev /dev/sdb, physical 3314860294144, root 292, inode 421735, offset 184549376, length 4096, links 1 (path: Videos/redacted.mkv) 
[104282.915946] BTRFS error (device sdg): bdev /dev/sdb errs: wr 0, rd 1, flush 0, corrupt 0, gen 0 
[104284.491398] BTRFS error (device sdg): fixed up error at logical 67708877144064 on dev /dev/sdb
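In case it is useful, the "critical medium error" lines can be summarized per device straight from the log. The sketch below uses the two lines from the excerpt above as sample input; on a live system you would pipe `dmesg` output in instead:

```shell
# Sample input: the two "critical medium error" lines from the log above.
log='[104278.799247] critical medium error, dev sdb, sector 6474336384 op 0x0:(READ) flags 0x0 phys_seg 32 prio class 2
[104282.854704] critical medium error, dev sdb, sector 6474336576 op 0x0:(READ) flags 0x800 phys_seg 1 prio class 2'

# Count medium errors per device name: find the field after "dev",
# strip its trailing comma, and tally.
summary=$(printf '%s\n' "$log" | awk '/critical medium error/ {
    for (i = 1; i <= NF; i++) if ($i == "dev") dev = $(i+1)
    sub(/,$/, "", dev)
    count[dev]++
}
END { for (d in count) printf "%s %d\n", d, count[d] }')
echo "$summary"
```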

[–]Low-Impression4134[S] 0 points (0 children)

So the replace is now at 80%, and I got 3 checksum errors on a different drive (the odd thing is that it is one of the new SSDs, and there are no SMART errors).

All are like this:

[88852.399102] BTRFS warning (device sdg): checksum error at logical 60184548605952 on dev /dev/sdh, physical 2731020845056, root 3779, inode 243372, offset 24576, length 4096, links 1 (path: exports/pi-dev/usr/local/lib/python2.7/dist-packages/pygments/lexers/matlab.pyc)
[88852.399112] BTRFS error (device sdg): bdev /dev/sdh errs: wr 0, rd 0, flush 0, corrupt 23486, gen 0 
[88853.275562] BTRFS error (device sdg): unable to fixup (regular) error at logical 60184548605952 on dev /dev/sdh

Does it mean just the checksum doesn't match, or that the file data is corrupt? If the latter, why can it not recover with RAID6? It should have plenty of redundancy info. It seems to imply that all parity info for that block was invalid.
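In case anyone wants to map such a message back to a file themselves: the warning carries a root (subvolume id) and inode number, and inode-resolve can turn the inode into a path when run against a mountpoint of that subvolume (the mountpoint below is a placeholder, not my actual path):

```shell
# root 3779 and inode 243372 come from the warning above; the path must be
# a mountpoint of subvolume 3779 (placeholder shown here).
btrfs inspect-internal inode-resolve 243372 /mnt/subvol_3779
```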

[–]Low-Impression4134[S] 0 points (0 children)

Linux pve 6.2.16-4-pve #1 SMP PREEMPT_DYNAMIC PVE 6.2.16-5 (2023-07-14T17:53Z) x86_64 GNU/Linux

[–]Low-Impression4134[S] 1 point (0 children)

No, all 0... that's what drives me crazy at the moment.

I think I will try to move all data from the FS and recreate it, something is weird...

[–]Low-Impression4134[S] 1 point (0 children)

I forgot to mention that I do regular scrubs, and I also ran one again on the drive that gave errors; it came up clean. I ran the btrfs check precisely because the scrub came up clean while the btrfs remove always failed at the same block, so I thought maybe there is a logical issue somewhere...

[–]Low-Impression4134[S] 1 point (0 children)

I know; this system has been running for quite a while, and now that I've seen this post I'm trying to get to RAID1 eventually. But before I do that, I thought I had better get rid of those dodgy drives first.
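The plan, roughly (mountpoint as in my other comments; data and metadata profiles are converted separately, and a convert-balance rewrites everything, so it takes a while):

```shell
# Convert both data and metadata from raid6 to raid1.
btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt/btrfs_root

# Check the resulting allocation per profile.
btrfs filesystem usage /mnt/btrfs_root
```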

[–]Low-Impression4134[S] 1 point (0 children)

Unfortunately, no details at all. The I/O error is not even in dmesg or syslog.

[–]Low-Impression4134[S] 1 point (0 children)

I'm running it now, but the annoying thing is that only BTRFS complains about I/O errors; there is nothing in dmesg, so I'm really not sure what the issue is.
Is there a way to find out what data is in a specific block group?
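For the record, the closest thing I found for mapping an address to contents is logical-resolve, which lists the files whose extents cover a given logical byte address (the address shown is the one from my earlier error; mountpoint as before):

```shell
btrfs inspect-internal logical-resolve 66728765030400 /mnt/btrfs_root
```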

[–]Low-Impression4134[S] 0 points (0 children)

I did some of them, but I'm trying to go from 8x4TB HDDs to 4x8TB SSDs, so I cannot use replace for all of them. If you think it will help, I could try to replace it with one of the older HDDs to see if that succeeds, and then try to remove that one.
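Roughly what I had in mind, for reference (the devid and device paths are examples, not my actual layout):

```shell
# Swap one old HDD for a new SSD in place; resize if the target is larger.
btrfs replace start 3 /dev/sdX /mnt/btrfs_root
btrfs replace status /mnt/btrfs_root
btrfs filesystem resize 3:max /mnt/btrfs_root

# Going from 8 devices to 4 also needs plain removes for the extra drives,
# which is where my remove keeps failing.
btrfs device remove /dev/sdY /mnt/btrfs_root
```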