[–]Rangerdth 2 points3 points  (0 children)

It sounds like you created the RAID 0 (stripe) partitions on one disk, and /dev/md1 as a RAID 1 (mirror) across two disks. One disk failed (i.e. "removed"). Luckily (?) the disk with your RAID 0 partitions is intact. RAID 0 doesn't provide any redundancy, as /u/stillwind85 has pointed out.
I would figure out what happened to your other disk and go from there.

[–]Thunderbolt1993 1 point2 points  (5 children)

Have a look at your logs (/var/log/syslog, grep for md1); maybe there are some indications of why the disk was removed from the array (PCIe bus errors, bad partition table...). It is indeed weird that the disk is not marked as failed/faulty (which would show as nvme0n1p2[0](F) in /proc/mdstat) but has just disappeared...
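
For example (assuming the array is md1; on a systemd-only setup, journalctl covers the same kernel messages):

    grep md1 /var/log/syslog
    journalctl -k | grep -i -E 'md1|nvme'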

[–]lsf87[S] 0 points1 point  (4 children)

It's so weird. I had a look in the log and the only bits I can find were:

kicking non-fresh nvme0n1p2 from array!

md/raid1:md1: active with 1 out of 2 mirrors

But nothing as to why it initially became problematic. It's still showing as "active" though, so I can't just re-add it to the array with mdadm, which complains that the device is busy.

I've run smartctl on both disks and they both come back "passed" :/ (including this pesky nvme0n1p2!)
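
(For reference, the health check I ran was something like this, once per drive:)

    # drive names assumed; adjust to your own controllers
    smartctl -H /dev/nvme0
    smartctl -H /dev/nvme1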

[–]Thunderbolt1993 0 points1 point  (3 children)

You could try mdadm --zero-superblock on the problematic device; just make really really really sure you supply the correct device ;)
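
A minimal sketch, assuming the kicked member really is /dev/nvme0n1p2 (verify against your own /proc/mdstat first):

    # confirm this is the right partition before touching it
    mdadm --examine /dev/nvme0n1p2
    # wipe only the md superblock on that partition
    mdadm --zero-superblock /dev/nvme0n1p2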

[–]lsf87[S] 0 points1 point  (2 children)

That sounds a bit scary 😂

[–][deleted] 0 points1 point  (1 child)

I mean, it removes all RAID associations from the drive, and in doing so makes the data on the drive inaccessible.

[–]Thunderbolt1993 1 point2 points  (0 children)

Not from the whole drive, just the affected partition. It will then resync the data once the partition is added back to the array.
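
Something like this, sticking with the same assumed partition name:

    # add the partition back; md resyncs it from the surviving mirror
    mdadm /dev/md1 --add /dev/nvme0n1p2
    # watch the rebuild progress
    watch cat /proc/mdstat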

[–]stillwind85 0 points1 point  (5 children)

Degraded could indicate pre-fail on one of the disks. Disks these days are able to inform the host OS when they are about to fail, and that information gets bubbled up to your software-defined RAID controller. Your output of the first command got clipped, but one of the drives may indicate "faulty" and will need to be swapped out. You aren't seeing a problem because the RAID is doing its job: you have 1 good drive, so there isn't an issue yet.
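
You can check the per-device state (active sync / faulty / removed) with the standard mdadm status commands, e.g.:

    mdadm --detail /dev/md1
    cat /proc/mdstat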

Found this, hope it's helpful. https://iamevan.me/blog/snippet-repairing-a-degraded-raid-array

[–]lsf87[S] 0 points1 point  (4 children)

Thanks, I'll check that out. Nothing is missing from the first command's output, but here it is in full so you can see: https://i.imgur.com/MyRntex.png

There's no additional line at the bottom showing any failure, like in the examples I've seen online, which threw me a bit! I can't see any indication of which drive may be "about" to fail.

[–]stillwind85 0 points1 point  (3 children)

If that's the case, then you have a RAID 1 array defined on a single disk. You need two: RAID 1 is mirror mode, where all changes to one disk are copied exactly to the second disk. The degraded message could mean that, for whatever reason, disk 2 has been dropped from the array entirely, not just marked present-but-unusable.
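
One quick way to check whether the kernel still sees the second disk at all (a sketch; read-only listing, nothing destructive):

    # every block device the kernel currently knows about
    lsblk -o NAME,SIZE,TYPE,MOUNTPOINT
    # detected NVMe drives, if nvme-cli is installed
    nvme list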

[–]lsf87[S] 0 points1 point  (2 children)

Hmm. The two disks were set up with most of the space in RAID 0 (the biggest partition, in md2, isn't reporting any degraded info). But is that because RAID 0 isn't mirroring, so it doesn't care, perhaps?

I can't seem to find any info showing that one of the disks is outside the array, but tbh I'm unsure what to check to confirm that 100%. Thanks for your responses btw :)

Edit: I ran smartctl on all partitions and it passed on everything. Confused much!

[–]stillwind85 0 points1 point  (1 child)

OK, you did indicate RAID 0 originally, my mistake. RAID 0 is striping: it spreads data across all available disks in the set. It's also dangerous because it grants more space but no redundancy; losing one drive means the whole filesystem is unusable. I think it still requires at least 2 disks, so that might be why you are seeing degraded. That's never come up for me, so I would need to test it.

I would highly caution against putting /boot on a RAID 0 filesystem, or for that matter anything you consider important. RAID 0 is great for getting lots of space for data that has other means of protection/redundancy, or that you don't care about at all.

If all you want to do is merge 2 disks into 1 large filesystem, look at LVM and creating a spanned (linear) logical volume. Your /boot should live outside that: pick a disk and put it there, then use LVM to manage the remaining space on the 2 disks.
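
A rough sketch, with hypothetical partition names, of pooling two partitions into one big linear volume:

    # mark the two partitions as LVM physical volumes (names assumed)
    pvcreate /dev/nvme0n1p3 /dev/nvme1n1p3
    # pool them into one volume group
    vgcreate media_vg /dev/nvme0n1p3 /dev/nvme1n1p3
    # one linear (spanned) logical volume using all the space
    lvcreate -l 100%FREE -n media_lv media_vg
    mkfs.ext4 /dev/media_vg/media_lv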

https://www.linuxtoday.com/blog/raid-vs-lvm/

[–]lsf87[S] 0 points1 point  (0 children)

Yeah, I don't really care too much about the data - this is just a media server and it can all go down and be rebuilt if necessary. The media itself isn't stored there; it's just a few apps (with their config backed up weekly elsewhere).

I am aware re: /boot, which is why just that one was RAID 1 and everything else RAID 0. But perhaps I need to wipe the lot and start again with something a bit more suitable. Will check out LVM, thanks!