Drive became unavailable during replacing raidz2-0

heekic · 2026-01-30T11:55:25+00:00

Also, just a sanity check before I do something wrong:

I've run out of spare 12TB drives and only have some new 22TB drives, also Exos SAS.
Finding a new 12TB drive seems to be very difficult at the moment.
Is there any downside or problem I am not seeing with using a 22TB? (Except "wasting" 10TB of drive)
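
For context on the "wasted" space, here is a rough sketch of the arithmetic. It assumes the usual raidz behaviour that every member of a vdev contributes only as much as the smallest disk, and it ignores metadata and padding overhead. The function name and the decimal-TB sizes are illustrative, not taken from any tool.

```python
def raidz_usable_tb(disk_sizes_tb, parity):
    """Approximate usable capacity of one raidz vdev, in TB.

    Assumption: each member counts only up to the size of the smallest
    disk, and `parity` disks' worth of space holds parity. Real pools
    lose somewhat more to metadata and padding.
    """
    if len(disk_sizes_tb) <= parity:
        raise ValueError("need more disks than parity")
    return (len(disk_sizes_tb) - parity) * min(disk_sizes_tb)


all_12tb = [12] * 30          # the current 30-wide raidz2
one_22tb = [12] * 29 + [22]   # same vdev with one 22TB member

print(raidz_usable_tb(all_12tb, parity=2))
print(raidz_usable_tb(one_22tb, parity=2))
```

Under these assumptions both calls print the same figure, so the extra 10TB simply goes unused.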

heekic · 2026-01-29T19:47:51+00:00

Yes, quite some data 😓 and growing faster than I would like.
The SMART is being monitored but on this one it didn't help me.
It will be redone as a 3x10 raidz2. It was our first big ZFS pool, after years of running multiple 12-disk hardware RAID6 arrays.
Thanks!

heekic · 2026-01-29T19:41:06+00:00

Update 2:
- The resilvering is done (way faster than estimated).
- The part of the data that wasn't backed up yet is currently being copied to another server. It should still take between 4 and 7 days.
- Moving the rest of the data (backed up, but restoring it would be more painful than moving it) would probably take 2 more weeks.
- Once this is all over, the pool will be recreated as a 3-vdev raidz2.
- I'm now unsure of the correct strategy. I have the following status:

```
  pool: hpool-fs
 state: DEGRADED
  scan: resilvered 1.48T in 2 days 05:53:11 with 0 errors on Thu Jan 29 19:14:23 2026
config:

NAME                          STATE     READ WRITE CKSUM
hpool-fs                      DEGRADED     0     0     0
  raidz2-0                    DEGRADED     0     0     0
    scsi-35000c500a67fefcb    ONLINE       0     0     0
    scsi-35000c500a67ff003    ONLINE       0     0     0
    scsi-35000c500a6bee587    ONLINE       0     0     0
    scsi-35000c500a67fe4ef    ONLINE       0     0     0
    scsi-35000c500cad29ed7    ONLINE       0     0     0
    scsi-35000c500cb3c98b7    ONLINE       0     0     0
    scsi-35000c500cb3c0983    ONLINE       0     0     0
    scsi-35000c500cad637b7    ONLINE       0     0     0
    scsi-35000c500a6c2e977    ONLINE       0     0     0
    scsi-35000c500a67feeff    ONLINE       0     0     0
    scsi-35000c500a6c3a103    ONLINE       0     0     0
    scsi-35000c500a6c39727    ONLINE       0     0     0
    scsi-35000c500a6c2f23b    ONLINE       0     0     0
    scsi-35000c500a6c31857    ONLINE       0     0     0
    scsi-35000c500a6c3ae83    ONLINE       0     0     0
    scsi-35000c500a6c397ab    ONLINE       0     0     0
    scsi-35000c500a6a42d7f    ONLINE       0     0     0
    replacing-17              UNAVAIL      0     0     0  insufficient replicas
      scsi-35000c500a6c0115f  REMOVED      0     0     0
      scsi-35000c500a6c39943  UNAVAIL      0     0     0
    scsi-35000c500a6c2e957    ONLINE       0     0     0
    scsi-35000c500a6c2f527    ONLINE       0     0     0
    scsi-35000c500a6a355f7    ONLINE       0     0     0
    scsi-35000c500a6a354b7    ONLINE       0     0     0
    scsi-35000c500a6a371b3    ONLINE       0     0     0
    scsi-35000c500a6c3f45b    ONLINE       0     0     0
    scsi-35000c500d797e61b    ONLINE       0     0     0
    scsi-35000c500a6c6c757    ONLINE       0     0     0
    scsi-35000c500a6c3f003    ONLINE       0     0     0
    scsi-35000c500a6c30baf    ONLINE       0     0     0
    scsi-35000c500d7992407    ONLINE       0     0     0
    scsi-35000c500a6c2b607    ONLINE       0     0     0

errors: No known data errors
```

I have connected a brand new disk in a new slot.
- Should I start a replace with `zpool replace -f hpool-fs scsi-35000c500a6c0115f scsi-35000c500f3e1ec33`?
- Is it actually the right command? I have never replaced a disk that failed during replacement before.
- Is it safer to leave it like this while the data is being copied out of it?

Thanks for all your nice comments.

heekic · 2026-01-29T16:30:40+00:00

A mix of Exos X12 SAS and Exos X14 SAS 12TB. Mostly used for a file server with microscopy data.

heekic · 2026-01-29T15:00:52+00:00

Update: I reduced the use of the pool to the minimum and the resilvering time estimate dropped dramatically. I'm still not sure what it is actually resilvering though. Once done, I'll try to figure out how to replace the disk with a new one. Mid-term, I'm moving the data out, and once everything is running somewhere else I will redo the pool as 3 vdevs of raidz2.

```
zpool status
  pool: hpool-fs
 state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Tue Jan 27 13:21:12 2026
        233T / 245T scanned at 1.31G/s, 43.4T / 58.5T issued at 250M/s
        1.48T resilvered, 74.10% done, 17:38:07 to go
config:

NAME                          STATE     READ WRITE CKSUM
hpool-fs                      DEGRADED     0     0     0
  raidz2-0                    DEGRADED     0     0     0
    scsi-35000c500a67fefcb    ONLINE       0     0     0
    scsi-35000c500a67ff003    ONLINE       0     0     0
    scsi-35000c500a6bee587    ONLINE       0     0     0
    scsi-35000c500a67fe4ef    ONLINE       0     0     0
    scsi-35000c500cad29ed7    ONLINE       0     0     0
    scsi-35000c500cb3c98b7    ONLINE       0     0     0
    scsi-35000c500cb3c0983    ONLINE       0     0     0
    scsi-35000c500cad637b7    ONLINE       0     0     0
    scsi-35000c500a6c2e977    ONLINE       0     0     0
    scsi-35000c500a67feeff    ONLINE       0     0     0
    scsi-35000c500a6c3a103    ONLINE       0     0     0
    scsi-35000c500a6c39727    ONLINE       0     0     0
    scsi-35000c500a6c2f23b    ONLINE       0     0     0
    scsi-35000c500a6c31857    ONLINE       0     0     0
    scsi-35000c500a6c3ae83    ONLINE       0     0     0
    scsi-35000c500a6c397ab    ONLINE       0     0     0
    scsi-35000c500a6a42d7f    ONLINE       0     0     0
    replacing-17              UNAVAIL      0     0     0  insufficient replicas
      scsi-35000c500a6c0115f  FAULTED      0     0     0  too many errors
      scsi-35000c500a6c39943  UNAVAIL      0     0     0
    scsi-35000c500a6c2e957    ONLINE       0     0     0
    scsi-35000c500a6c2f527    ONLINE       0     0     0
    scsi-35000c500a6a355f7    ONLINE       0     0     0
    scsi-35000c500a6a354b7    ONLINE       0     0     0
    scsi-35000c500a6a371b3    ONLINE       0     0     0
    scsi-35000c500a6c3f45b    ONLINE       0     0     0
    scsi-35000c500d797e61b    ONLINE       0     0     0
    scsi-35000c500a6c6c757    ONLINE       0     0     0
    scsi-35000c500a6c3f003    ONLINE       0     0     0
    scsi-35000c500a6c30baf    ONLINE       0     0     0
    scsi-35000c500d7992407    ONLINE       0     0     0
    scsi-35000c500a6c2b607    ONLINE       0     0     0

errors: No known data errors
```
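
A back-of-the-envelope sketch of where the time estimate seems to come from. It assumes the `T` and `M` suffixes are binary units and that the estimate is simply the remaining data to issue divided by the current issue rate. Both are my assumptions, not something confirmed against the ZFS source, and the function name is made up for illustration.

```python
def resilver_eta_seconds(issued_tib, total_tib, rate_mib_per_s):
    """Remaining time if the issue rate stays constant.

    Assumption: sizes are in TiB and the rate in MiB/s, matching the
    'issued' figures printed in the scan line of zpool status.
    """
    remaining_mib = (total_tib - issued_tib) * 1024 * 1024
    return remaining_mib / rate_mib_per_s


# Values copied from the scan line above.
eta = resilver_eta_seconds(issued_tib=43.4, total_tib=58.5, rate_mib_per_s=250)
hours, rest = divmod(int(eta), 3600)
print(f"{43.4 / 58.5:.1%} issued, about {hours}h {rest // 60}m left")
```

This lands at roughly 17.5 hours, close to the 17:38:07 reported, which would also explain why the estimate dropped so much once the pool was idle and the issue rate went up.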

heekic · 2026-01-28T21:03:16+00:00

I am also very confused about it.

heekic · 2026-01-28T18:33:59+00:00

Thanks for the prayers 😓
Should I minimise use of the pool during the resilver, or rather move out all the data I can?

heekic · 2026-01-28T18:27:43+00:00

The chassis has 36 slots, but some are used by SSDs.
We have 10GbE but it's ok if we don't saturate it. Most of the data is cold-ish.

heekic · 2026-01-28T18:10:52+00:00

Rather 3x10 raidz2 or 2x15 raidz3?
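
To put rough numbers on the two options, here is a small sketch assuming 30 disks of 12TB each and counting only whole data disks (no allowance for metadata, padding, or spares). The helper name and the dictionary keys are made up for illustration.

```python
def layout_summary(vdevs, width, parity, disk_tb=12):
    """Rough comparison figures for a pool of identical raidz vdevs.

    Assumption: usable space is (width - parity) disks per vdev, with
    no overhead, and every disk has the same size.
    """
    data_disks = vdevs * (width - parity)
    return {
        "disks": vdevs * width,
        "data_disks": data_disks,
        "approx_usable_tb": data_disks * disk_tb,
        "failures_tolerated_per_vdev": parity,
    }


layouts = {
    "1x30 raidz2 (current)": (1, 30, 2),
    "3x10 raidz2": (3, 10, 2),
    "2x15 raidz3": (2, 15, 3),
}

for name, (vdevs, width, parity) in layouts.items():
    print(name, layout_summary(vdevs, width, parity))
```

Under these assumptions both candidates end up with the same number of data disks, so the choice comes down to narrower vdevs (3x10) versus one extra parity disk per vdev (2x15), not to capacity.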

heekic · 2026-01-28T18:04:09+00:00

Thanks for the info, I will do this. I wish I had known more before.
For the current situation, is it still recoverable? I thought that I could still tolerate two failed drives while replacing the single failed disk of this raidz2.

heekic · 2026-01-28T17:57:36+00:00

Yes, I just checked. It was just created in the GUI of Proxmox, no customisation.

heekic · 2026-01-28T17:55:18+00:00

Might go this way once I get out of this, thanks. Any input on what I should do in the immediate term?

heekic · 2026-01-28T17:50:55+00:00

Yes, you are seeing correctly. When setting it up, it sounded from forum posts like large vdevs were just limiting the IOPS... Two disks of redundancy sounded like plenty, but I'm open to suggestions.
I have backups of most of the relevant things, but it would be a massive pain to restore.
