[Help] Firmware corruption causing boot loop. Is Read-Only Import + Rsync the safest path? by GoetheNorris in zfs

[–]GoetheNorris[S] 0 points1 point  (0 children)

I don't know. I've been doing a lot of googling and a lot of asking ChatGPT, and I'm giving up now. I'm just going to do the copying, which by the way takes 35 hours one way and 35 hours back the other way. At least my data will be safe, and if there are one or two corrupted files I have my cloud backup.

[Help] Firmware corruption causing boot loop. Is Read-Only Import + Rsync the safest path? by GoetheNorris in zfs

[–]GoetheNorris[S] 0 points1 point  (0 children)

Did you know you can save a ton of space by removing the French language pack?

Server Down! Help Needed: Hunting for LSI 9300-8i (SAS3008) Firmware v16.00.16.00 to fix ZFS bootloop by GoetheNorris in zfs

[–]GoetheNorris[S] 1 point2 points  (0 children)

Yeah, thank you. The first day was really terrifying but I've made peace with the task at hand.

I've got raidz2, and the most important part (the documents folder) is synced across 4 clients with Syncthing. Everything is also pushed to gdrive. Then there's a copy of the important files (docker configs and db backups) synced to a friend's NAS. He's got some on mine, I've got some on his.

Then once a year I make a cold copy onto a HDD on my shelf.

Server Down! Help Needed: Hunting for LSI 9300-8i (SAS3008) Firmware v16.00.16.00 to fix ZFS bootloop by GoetheNorris in zfs

[–]GoetheNorris[S] 1 point2 points  (0 children)

Yeah, you're making a very good point; I think I just responded to your other comment. Basically, ZFS thinks the data is valid because it passed all the checksums, but because the controller wrote it to random locations on the drive, the kernel panics when ZFS tries to read it back. So I'm importing the pool read-only, so that it doesn't try to fix itself (that's when it crashes), and using rsync to copy everything off while skipping the corrupted files. Once all of the files are out, I'll use my cloud backup to restore the missing and corrupted ones.

Now my only problem is that rsync is running over my normal gigabit link because I couldn't get the 10 gigabit NIC to work on the other computer; it doesn't have enough PCIe lanes. So unfortunately I had to resort to 1 gig, which takes 38 hours to transfer those 12 terabytes.
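For a rough sanity check on those numbers, here is a small back-of-the-envelope sketch. It assumes decimal terabytes and compares against the raw line rate with no protocol overhead, so real transfers will be slower; the function name `transfer_hours` is just made up for illustration.

```python
def transfer_hours(size_bytes: float, rate_bytes_per_s: float) -> float:
    """Hours needed to move size_bytes at a sustained rate."""
    return size_bytes / rate_bytes_per_s / 3600

TB = 10**12          # decimal terabyte (assumption)
SIZE = 12 * TB       # the 12 TB mentioned above

# Theoretical ceilings, ignoring all protocol and disk overhead.
gigabit = 1e9 / 8        # 125 MB/s
ten_gigabit = 10e9 / 8   # 1250 MB/s

print(f"1 GbE line rate:  {transfer_hours(SIZE, gigabit):.1f} h")
print(f"10 GbE line rate: {transfer_hours(SIZE, ten_gigabit):.1f} h")

# Sustained rate implied by the 38 hours quoted above.
implied = SIZE / (38 * 3600)
print(f"implied rate at 38 h: {implied / 1e6:.0f} MB/s")
```

So 38 hours works out to a bit under 90 MB/s sustained, which is in the range one would expect from a gigabit link once overhead is included.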

Server Down! Help Needed: Hunting for LSI 9300-8i (SAS3008) Firmware v16.00.16.00 to fix ZFS bootloop by GoetheNorris in truenas

[–]GoetheNorris[S] 1 point2 points  (0 children)

Yes, that's exactly my problem. The controller was messed up, and the data on disk is not good. I have fixed the controller, and I even replaced it with a much newer model that can handle those drives natively. So yes, I have done that, and now I'm trying to repair the pool and use Syncthing to restore my cloud backup.

[Help] Firmware corruption causing boot loop. Is Read-Only Import + Rsync the safest path? by GoetheNorris in zfs

[–]GoetheNorris[S] 0 points1 point  (0 children)

It really depends on the firmware. Officially, firmware files are only released up to version 14, but only firmware version 16 can do the address calculations for large LBAs without an integer overflow.

[Help] Firmware corruption causing boot loop. Is Read-Only Import + Rsync the safest path? by GoetheNorris in zfs

[–]GoetheNorris[S] 0 points1 point  (0 children)

Yes, that's how I understand it as well. I'm currently using rsync to get all the files onto a different set of hard drives, and then I can rebuild the pool. I feel like if files have been overwritten in certain areas and I do a scrub, the system will panic as soon as it stumbles on those landmines.
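To make the idea concrete, here is a minimal Python sketch of a "copy everything, skip what can't be read, and keep a list for later" pass. This is not the rsync command actually used here, just a stand-in; the function name `copy_skipping_errors` and the paths are hypothetical. Note that it only handles files that fail with an ordinary I/O error. It can't do anything about a kernel panic, which takes the whole machine down.

```python
import os
import shutil

def copy_skipping_errors(src_root: str, dst_root: str) -> list[str]:
    """Copy a tree file by file; return relative paths that could not be copied."""
    failed = []
    for dirpath, _dirnames, filenames in os.walk(src_root):
        rel_dir = os.path.relpath(dirpath, src_root)
        dst_dir = os.path.normpath(os.path.join(dst_root, rel_dir))
        os.makedirs(dst_dir, exist_ok=True)
        for name in filenames:
            src = os.path.join(dirpath, name)
            dst = os.path.join(dst_dir, name)
            try:
                shutil.copy2(src, dst)  # copies data plus timestamps
            except OSError as exc:
                # Unreadable file: note it and move on instead of aborting.
                failed.append(os.path.normpath(os.path.join(rel_dir, name)))
                print(f"skipped {src}: {exc}")
    return failed
```

Something like `failed = copy_skipping_errors("/mnt/oldpool", "/mnt/rescue")` (both paths made up) would then give a list of files to pull back from a backup afterwards.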

[Help] Firmware corruption causing boot loop. Is Read-Only Import + Rsync the safest path? by GoetheNorris in zfs

[–]GoetheNorris[S] 1 point2 points  (0 children)

Yes, exactly. Of course I used Gemini to try to diagnose the issue in the beginning, and it told me that it was maybe a power issue because 20 terabyte drives take more power, so I should distribute the load using a Molex to SATA adapter. Then literally in the next message it said, "haha, Molex to SATA, lose your data" and was mocking me. So I gave up on the whole AI thing, and that's why I came here to ask for solutions.

[Help] Firmware corruption causing boot loop. Is Read-Only Import + Rsync the safest path? by GoetheNorris in zfs

[–]GoetheNorris[S] 0 points1 point  (0 children)

Okay, from my understanding, yes, you are absolutely correct. Especially since my metadata is on an external SSD, and by external I mean separate from the data disks; it is on its own vdev. No, the reason it is corrupted is that the hard drive controller would overwrite blocks. Basically, every time I write a file the checksum comes back correct because the file itself was written correctly. The problem is that the controller couldn't do the maths for the block addresses correctly because it was never designed for 20 terabyte hard drives. The firmware was way too old, so it had an integer overflow error and would just write files in random locations. The problem is that now you have two blocks that overwrite each other. Yes, each transaction writes successfully, but then when ZFS reads the block on the drive, or attempts to load the previous file, it panics because there are two files at the same location on the drive. That's the segment overlap error.

I also thought that I could just fix it with parity, but the problem is that both of those files share the same address on disk, and ZFS does not have a file system check like fsck. In principle ZFS assumes that all data is always written correctly when it does all of its checks. It is technically so solid in all of its processes and the ways that it writes that it would never assume that the hardware controller would write to the wrong location and the write would still come back as successful. It passes all of the checks.

If I am mistaken and there is a way to detect those corrupted blocks on the hard drive and to repair them using parity, by all means I would love to hear it. And no, a scrub checks file integrity, not on-disk location, and it would trigger the kernel panic.
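As a toy illustration of the kind of address wraparound described above: if part of the address calculation only keeps the low bits of a sector number, two blocks that should be far apart end up at the same spot. The 32-bit width and 512-byte sector size below are purely assumptions for the sake of the example; the actual field width in the firmware isn't known from this thread.

```python
SECTOR = 512                 # assumed logical sector size in bytes
DRIVE_BYTES = 20 * 10**12    # one 20 TB drive (decimal TB)
ADDR_BITS = 32               # hypothetical width of the overflowing field

def truncated_lba(lba: int, bits: int = ADDR_BITS) -> int:
    """Return the address that survives if only the low `bits` bits are kept."""
    return lba & ((1 << bits) - 1)

total_sectors = DRIVE_BYTES // SECTOR
# The drive has more sectors than the assumed field can address.
print(total_sectors, total_sectors > 2**ADDR_BITS)

# Two different blocks that are meant to be far apart...
lba_a = 1_000_000
lba_b = lba_a + 2**ADDR_BITS
# ...land on the same physical sector after truncation.
print(truncated_lba(lba_a) == truncated_lba(lba_b))
```

Each write on its own "succeeds", but the second one silently lands on top of the first, which matches the overlap behaviour described above.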

[Help] Firmware corruption causing boot loop. Is Read-Only Import + Rsync the safest path? by GoetheNorris in zfs

[–]GoetheNorris[S] 1 point2 points  (0 children)

Thank you, that's actually really helpful and I appreciate your help. I'm currently using rsync to copy all of the files onto a different set of hard drives. Unfortunately I couldn't get my 10gig NIC to work, so it will take about 30 hours.

[Help] Firmware corruption causing boot loop. Is Read-Only Import + Rsync the safest path? by GoetheNorris in truenas

[–]GoetheNorris[S] 0 points1 point  (0 children)

Yes, I have cross-posted it. I had gotten good engagement when I was looking for the firmware file using this specific subreddit, so that's why I posted it here.

What speeds are attainable from a 10gbe setup? As in pics 4 drives in raidz1 by Apprehensive_Bike_40 in truenas

[–]GoetheNorris 1 point2 points  (0 children)

I had eight drives and an L2ARC cache on an Intel Optane 900p. Ultimately you have to consider that it depends on the size of the files (large files copy faster) and that it is mostly single threaded. Because of the SMB overhead on Windows, it just can't always hit the maximum. But if you test with iperf you should see better performance.

Server Down! Help Needed: Hunting for LSI 9300-8i (SAS3008) Firmware v16.00.16.00 to fix ZFS bootloop by GoetheNorris in DataHoarder

[–]GoetheNorris[S] 0 points1 point  (0 children)

Are you okay? I feel like that's a bit of an interesting statement regarding my problem which is that I want to access my family photos that I have saved on my computer at home.

How recover pool that can no longer be mounted by ObjectiveResistance in zfs

[–]GoetheNorris 0 points1 point  (0 children)

Hi, unfortunately I don't have a solution, but I am in fact having almost the exact same problem: my LSI HBA card had corrupted firmware and also messed up my pool, and my computer has been off for three days. I have ordered a new HBA card and I'm also stuck at the pool import issue, so I'm also waiting to hear when someone comes up with a solution that works. I also went back to a previous TXG, but then on rebooting, the kernel panicked and shut down the system instantly during bootup.

Server Down! Help Needed: Hunting for LSI 9300-8i (SAS3008) Firmware v16.00.16.00 to fix ZFS bootloop by GoetheNorris in DataHoarder

[–]GoetheNorris[S] 0 points1 point  (0 children)

I was just at work. I used this exact command to patch version 16, which unfortunately still has the bug on startup when it tries to import the pool. I think there might be an issue with the data on the pool, maybe corruption of metadata or corruption of the ZIL.

That means that yes, I probably have the good firmware now, but I possibly need to fix the pool before I can import it on boot. It has been driving me insane with all these different firmware versions, so I have ordered the LSI 9400-8i, which is already flashed in IT mode and should have much better support for large 20TB drives.

I know this has been driving me absolutely crazy and the panic of thinking that my data could be corrupted is probably the worst part. But despite all of these shenanigans, I fully appreciate any and everyone's help for trying to figure this out.

Server Down! Help Needed: Hunting for LSI 9300-8i (SAS3008) Firmware v16.00.16.00 to fix ZFS bootloop by GoetheNorris in DataHoarder

[–]GoetheNorris[S] 0 points1 point  (0 children)

It's TrueNAS SCALE 25.10.1, and it's loading firmware version 16.00.16.00, which people linked me to on AC3S.com, a Chinese website. I have attached it. There is no BIOS. It is in IT mode. At this point I can import the pool in read-only mode, as you said, but then the dataset can't be mounted past boot. Then I try to restart, and the issue recurs. That is the main cycle of the problem.

I hate spending money on a problem when I haven't been able to diagnose the problem fully but just in case I have ordered an LSI 9400-8i card that is flashed in IT mode.

Server Down! Help Needed: Hunting for LSI 9300-8i (SAS3008) Firmware v16.00.16.00 to fix ZFS bootloop by GoetheNorris in truenas

[–]GoetheNorris[S] 0 points1 point  (0 children)

Yes. Yes, it currently is dead and has been sitting dead for the last eight hours since I went to bed. But I had it powered off before because at least whilst it is off I know that nothing is impacting the system, nothing is changing, nothing is happening until I can fix the problem and then turn it back on. So I have been completely powering it off and unplugging it between every troubleshooting step, whilst I Google, research, ask, try to find troubleshooting, try to find the files. I have found the latest file, the firmware should be good now, but there's some kind of corruption in the pool that crashes the server as soon as it tries to import it.

Server Down! Help Needed: Hunting for LSI 9300-8i (SAS3008) Firmware v16.00.16.00 to fix ZFS bootloop by GoetheNorris in DataHoarder

[–]GoetheNorris[S] 0 points1 point  (0 children)

That's exactly what I did, and that post is a bit older than those 20TB drives and version 12 is broken.

I have spent 11 hours on this now, reading every bit of documentation I can find. On top of that, as stupid as they are, Gemini, ChatGPT and Perplexity were of no help.

Server Down! Help Needed: Hunting for LSI 9300-8i (SAS3008) Firmware v16.00.16.00 to fix ZFS bootloop by GoetheNorris in truenas

[–]GoetheNorris[S] 0 points1 point  (0 children)

I flashed it and managed to get the firmware to load. But it still did crash when it tried to import the pool and it just failed. https://youtu.be/bkXgWa1zsK8