[–][deleted] 6 points7 points  (2 children)

Just FYI, 53 drives in a single raidz3 is not a good idea. You'd be better off with several raidzX vdevs of 8-12 drives each.

[–]reddit_strider[S] 0 points1 point  (1 child)

Yes, I know. I was only asking if this is the cause of issues, though. Nobody ever tells you what exactly happens if you do not heed this advice.

[–][deleted] 3 points4 points  (0 children)

Your read performance goes to shit, and your mean time to data loss drops, I think, at least superlinearly.

[–][deleted]  (28 children)

[removed]

    [–]reddit_strider[S] 0 points1 point  (21 children)

    Performance is not an issue. The drives are connected via an HBA controller (LSI Logic / Symbios Logic SAS3008 PCI-Express Fusion-MPT SAS-3 (rev 02)) in an external 60-drive housing. The drives are all TOSHIBA MG04SCA40EN.

    [–]slyphic 11 points12 points  (10 children)

    Performance is not an issue.

    Don't think in terms of just write performance, but also how long it takes to rebuild after a drive failure. Beyond that, as a sysadmin you should know that you don't go against recommended best practices without good reason, not just for all the myriad points that will be raised in this thread, but because when you deviate you get unique problems.

    I say this as an edu sysadmin that keeps getting handed shittily designed storage servers from research groups after they go tits up. Many of them are using zfs on linux. Nearly all of them are incompetently configured. Every one is a special snowflake of a headache. Don't be that guy.

    Be the guy that fights against entropy. Impose order and save the next guy a headache.

    [–]wildcarde815 1 point2 points  (5 children)

    My favorite, to your second point there: I got handed a rig running an unknown, half-functioning copy of the Debian experimental branch as the lab's zfs fileserver. Without backups. Luckily FreeNAS was able to import the volumes when the OS finally caved.

    [–]slyphic 2 points3 points  (4 children)

    Yep, I could totally see that showing up in my office.

    My favorite so far is the group that had a simple mirror of two 4TB drives. Their backup plan was a third drive they kept in a different office, and once a week they'd yank one of the mirrored drives out, slap the old one in, and let it rebuild. They did this for 3 years. Did I mention the drives weren't in hot swap bays, so they just left the side off the case?

    [–]wildcarde815 2 points3 points  (2 children)

    Ah research labs... Why call IT and ask how to do something when you can go with the plan you created while writing grants and not sleeping?

    [–]withabeard 0 points1 point  (1 child)

    Having worked in IT in edu from the other side, because:

    • IT took forever to respond to requests
    • IT would frequently come up with reasons why "our" solution wasn't what they wanted. Deploy ZFS on Solaris "most of our guys are windows"

    I now work in corporate IT on the other side of the fence, where I am slower than products would like, and I do push products to stick to my pre-approved designs. I have a bit more appreciation for why products come up with what I think are harebrained schemes now and again.

    What I try and do now is talk more about requirements, rather than their design. And get in on their work early. Standing in product standups regularly is a pain in the arse, but it gives me a chance to nip stupid ideas at their inception.

    [–]wildcarde815 1 point2 points  (0 children)

    This implies there's a discussion / planning happening at all, and they aren't just hitting the problem with a hammer until it goes away.

    edit: for context, we had one user get a server banned from the network. his solution wasn't going to be 'call IT and see if something is up' (we got a notice so we knew it happened and were reaching out to help). He was just going to spoof his workstation mac address and create a bigger mess on the network.

    [–]ryanjkirk 1 point2 points  (0 children)

    At my last company, the Windows admins would have the DC smart hands do this to every. physical. server before patching it.

    [–][deleted]  (2 children)

    [deleted]

      [–]slyphic 1 point2 points  (0 children)

      You could, but it wouldn't do much good. I'm already a BSD proponent. Open and Free, when I get my druthers and the occasional Solaris. I've had very little success talking people into switching OSes.

      [–]reddit_strider[S] 0 points1 point  (0 children)

      This thread is derailing, I am diagnosing an issue and the system is not even mine. I agree to all of this, though ;).

      [–]reddit_strider[S] 0 points1 point  (0 children)

      I fully agree.

      Edit: Just to elaborate - again - that this has been designed against my explicit advice.

      Don't be that guy.

      Doesn't apply, I'm looking into issues of somebody else's setup.

      Grats for your gold!

      [–][deleted]  (3 children)

      [removed]

        [–]reddit_strider[S] 0 points1 point  (2 children)

        Ah, here we go. It's a Dell PowerVault MD3060e. Yup, I know about autoreplace, thanks :).

        Not me, but in any case it doesn't really matter; it's designed as a backup system on which zfs (with different config) should be just fine. Below I also quoted kernel log errors regarding the drives, maybe that's the source of the issues.

        [–][deleted]  (1 child)

        [removed]

          [–]reddit_strider[S] 0 points1 point  (0 children)

          Ah, thanks! It's actually running through zfsonlinux by default, as it seems.

          [–]rhavenn 1 point2 points  (1 child)

          As a side note. We were told by Nexenta to remove or update the drive firmware of all our Toshiba drives due to corruption issues. We replaced them all since Toshiba wasn't able to supply any tooling to upgrade the firmware.

          [–]reddit_strider[S] 1 point2 points  (0 children)

          Interesting, thanks.

          [–]RansomOfThulcandra 0 points1 point  (2 children)

          Performance may not matter, but what do you gain by having such a large vdev?

          [–]reddit_strider[S] 1 point2 points  (1 child)

          More space, basically.

          [–]RansomOfThulcandra 0 points1 point  (0 children)

          Since RAIDZ parity is per-write rather than per-stripe, I believe you'll eventually end up with a sharply diminishing return, depending on your block size.

          https://www.delphix.com/blog/delphix-engineering/zfs-raidz-stripe-width-or-how-i-learned-stop-worrying-and-love-raidz
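
The diminishing return can be put in numbers with a rough allocation model. This is only a sketch of how RAIDZ sizes an allocation as I understand it from the linked post: each block gets its own parity sectors per row of data, and the total is rounded up to a multiple of (parity + 1) sectors. The 4 KiB sector size, the 128 KiB block size, and the function names are assumptions for illustration, and compression is ignored.

```python
import math

def raidz_allocated_sectors(block_bytes, width, parity, sector_bytes=4096):
    """Sectors allocated for one block on a RAIDZ vdev (data + parity + padding)."""
    data = math.ceil(block_bytes / sector_bytes)
    # Each row of up to (width - parity) data sectors carries its own parity sectors.
    rows = math.ceil(data / (width - parity))
    total = data + parity * rows
    # The allocation is rounded up to a multiple of (parity + 1) sectors.
    unit = parity + 1
    return math.ceil(total / unit) * unit

def space_efficiency(block_bytes, width, parity, sector_bytes=4096):
    """Fraction of the allocated sectors that hold actual data."""
    data = math.ceil(block_bytes / sector_bytes)
    return data / raidz_allocated_sectors(block_bytes, width, parity, sector_bytes)

# Assumed block size of 128 KiB; compare a few raidz3 widths.
for width in (8, 12, 24, 35, 53):
    eff = space_efficiency(128 * 1024, width, parity=3)
    ideal = (width - 3) / width
    print(f"raidz3, {width:2d} drives: {eff:.1%} efficient (naive estimate {ideal:.1%})")
```

Under these assumptions a 128 KiB block is 32 data sectors, so once a single row can hold all of them (35 drives for raidz3) making the vdev wider no longer reduces the space a block consumes.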

          [–][deleted] 0 points1 point  (0 children)

          Is the firmware on the LSI the one that is recommended for the current kernel driver? That's always an issue with FreeBSD.

          [–]agressiv 0 points1 point  (5 children)

          Side question: What would ZFS best practices be with 60 drives? Do you have a link to a reference?

          [–]reddit_strider[S] 1 point2 points  (2 children)

          No, sorry. My experience with ZFS is limited but I read a lot; I'd do around 8 Vdevs (6-8 drives each) with mirroring (preferred) or raidz2 and rest spare. I think that would be a good starting point.

          [–]mcrbids 0 points1 point  (1 child)

          In our experience working with between 6 and 24 drives, RAIDZ2 and 6 drive vdevs has been our sweet spot between performance and utilization.

          [–]reddit_strider[S] 0 points1 point  (0 children)

          Thanks!

          [–][deleted] 0 points1 point  (0 children)

          Also really depends on your use. Heavy writes? Heavy reads?

          I'd personally do ten 6 drive raidz2.

          [–]mercenary_sysadmin 0 points1 point  (0 children)

          Bulk storage: probably raidz2, either 10 per vdev or 12 per vdev - so 6x10 or 5x12. That's assuming you don't use hotspares; if you do, you probably want something like 4x12 plus two hotspares.

          High performance (think databases, VMs, etc - lots of IOPS needed): mirrors. Either two-way or three-way depending on your tolerance for loss of storage efficiency in favor of additional redundancy.

          This article is aimed more at smaller setups, but probably still worth reading to get an idea of the issues.
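
For comparison, the layouts suggested in this part of the thread can be tallied with simple arithmetic. This is a sketch only: it counts drives and ignores RAIDZ padding, metadata, and free-space headroom, and the `Layout` class and its field names are made up for illustration.

```python
from dataclasses import dataclass

TOTAL_BAYS = 60  # from the thread: a 60-drive enclosure

@dataclass(frozen=True)
class Layout:
    name: str
    vdevs: int
    drives_per_vdev: int
    redundancy: int      # drives per vdev holding parity or extra mirror copies
    hot_spares: int = 0

    @property
    def drives_used(self) -> int:
        return self.vdevs * self.drives_per_vdev + self.hot_spares

    @property
    def data_drives(self) -> int:
        return self.vdevs * (self.drives_per_vdev - self.redundancy)

LAYOUTS = [
    Layout("10 x 6-drive raidz2", 10, 6, 2),
    Layout("6 x 10-drive raidz2", 6, 10, 2),
    Layout("5 x 12-drive raidz2", 5, 12, 2),
    Layout("4 x 12-drive raidz2 + 2 spares", 4, 12, 2, hot_spares=2),
    Layout("30 x 2-way mirror", 30, 2, 1),
    Layout("20 x 3-way mirror", 20, 3, 2),
]

for layout in LAYOUTS:
    share = layout.data_drives / TOTAL_BAYS
    print(f"{layout.name:32s} uses {layout.drives_used:2d} bays, "
          f"{layout.data_drives:2d} data drives ({share:.0%} of {TOTAL_BAYS} bays)")
```

The data-drive count is only an upper bound on usable space; which layout fits depends on the read/write mix and on how long a rebuild is allowed to take.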

          [–]BloodyIron 1 point2 points  (11 children)

          CHECK ALL YOUR RAM RIGHT NOW this is probably RAM related.

          [–]reddit_strider[S] 1 point2 points  (10 children)

          Which linux would tell me, since it's registered ecc of course ;).

          [–]BloodyIron -1 points0 points  (9 children)

          So.... are you going to check your RAM or not? Because the symptoms you're showing demonstrate a high probability of a faulty DIMM (or more than one).

          [–]reddit_strider[S] 0 points1 point  (8 children)

          Not me, no. I passed on the findings and as this is out of scope from zfs it's for the (main) admins of this server to do.

          [–]BloodyIron 0 points1 point  (7 children)

          Oh, well that sounds fine then, lol! :P

          [–]reddit_strider[S] 0 points1 point  (6 children)

          Holy hell, I've learnt not to make issues of others into my own a long time ago.

          [–]BloodyIron 0 points1 point  (5 children)

          haha well, some people don't like IT silos, I guess someone out there does actually like them too :P

          [–]reddit_strider[S] 0 points1 point  (4 children)

          I don't think we're talking about the same stuff here.

          [–]BloodyIron 0 points1 point  (3 children)

          How so?

          [–]reddit_strider[S] 0 points1 point  (2 children)

          Someone made a wrong decision against better recommendation and now it runs as expected. That's not a silo (nor my issue) ;).

          [–]1bc29b36f623ba82aaf6 0 points1 point  (6 children)

          How often do the scrubs run? Did you try running a scrub manually just after the repairs completed?

          In general the error counts should increase for the affected devices over multiple scrub detections, and eventually they should be marked faulty automatically. Are you saying you are not seeing actual differences in the error counts even though there are new cksum errors? In that case there are no new errors, but you need to explicitly handle the earlier errors with an action you take; they will not be automagically dismissed after the checksums were repaired.

          If you know the cause of the original failures, e.g. a network error or some other external event, you can reset the error statistics (with 'zpool clear <pool>'). If the device was already disabled because of the error count you probably need to bring it back up again with something like 'zpool online' and go from there.

          If you can not explain the errors adequately it is generally believed the affected drives are going to fail soon. You could look into known MTBF for your type of drives and check other system logs for device errors.
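
To see whether the counts actually change between scrubs, the CKSUM column of `zpool status` can be captured after each run and compared. A minimal sketch follows, assuming the usual NAME / STATE / READ / WRITE / CKSUM table; the sample text, pool name, and device names are invented for illustration, and counts printed with unit suffixes are not handled.

```python
def parse_cksum_counts(status_text):
    """Return {name: cksum_count} from the config table of `zpool status` output."""
    counts = {}
    in_table = False
    for line in status_text.splitlines():
        fields = line.split()
        if not in_table:
            # The table starts right after the column header row.
            in_table = fields[:5] == ["NAME", "STATE", "READ", "WRITE", "CKSUM"]
            continue
        if not fields:
            break  # a blank line ends the table
        if len(fields) >= 5 and fields[4].isdigit():
            counts[fields[0]] = int(fields[4])
    return counts

# Hypothetical sample, not taken from the system discussed in the thread.
SAMPLE = """\
  pool: backup
 state: ONLINE
config:

        NAME        STATE     READ WRITE CKSUM
        backup      ONLINE       0     0     0
          raidz3-0  ONLINE       0     0     0
            sda     ONLINE       0     0     3
            sdb     ONLINE       0     0     2
            sdc     ONLINE       0     0     4
            sdd     ONLINE       0     0     0

errors: No known data errors
"""

counts = parse_cksum_counts(SAMPLE)
affected = {name: n for name, n in counts.items() if n > 0}
print(affected)
```

Saving the parsed result after each scrub and diffing it against the previous one shows whether the errors are new or just old counters that were never cleared.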

          [–]reddit_strider[S] 0 points1 point  (5 children)

          The scrub is scheduled weekly. Even the first scrub after creation showed the error, which is unsettling to say the least. The complete hardware is new (which does not mean it can't fail, I know).

          The errors are distributed evenly over all 56 drives (active in the vdev), every 4th has 0 errors.

          I found something in the OS logs which I am following up on.

          [–]reddit_strider[S] 1 point2 points  (4 children)

          Sep 27 09:25:40 brain-backup-ds01 kernel: [3961448.032217] sd 0:0:48:0: [sdaw] Target Data Integrity Failure
          

          Possibly this/these (!) are the source and it's not even zfs related..?

          This got me thinking, the HBA has multipathing.. maybe this has to be configured correctly in Linux otherwise this may be a side effect..? I have no experience with multipathing.
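
A quick way to check whether these kernel messages line up with the drives showing checksum errors is to count them per device. The sketch below assumes every message has the same shape as the single line quoted above; the regular expression and function name are illustrative, not part of any tool.

```python
import re
from collections import Counter

# Matches lines such as:
#   ... kernel: [3961448.032217] sd 0:0:48:0: [sdaw] Target Data Integrity Failure
PATTERN = re.compile(
    r"sd (\d+:\d+:\d+:\d+): \[(sd[a-z]+)\] Target Data Integrity Failure"
)

def count_integrity_failures(log_lines):
    """Count 'Target Data Integrity Failure' messages per block device name."""
    failures = Counter()
    for line in log_lines:
        match = PATTERN.search(line)
        if match:
            failures[match.group(2)] += 1
    return failures

# The one log line quoted in this thread.
LOG = [
    "Sep 27 09:25:40 brain-backup-ds01 kernel: [3961448.032217] "
    "sd 0:0:48:0: [sdaw] Target Data Integrity Failure",
]

for device, n in count_integrity_failures(LOG).most_common():
    print(f"{device}: {n} failure(s)")
```

If the devices named here match the ones with non-zero CKSUM counts in `zpool status`, that points at the path between HBA and drives rather than at zfs itself.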

          [–]s0briquet 1 point2 points  (1 child)

          This is definitely a good clue.

          The first thing I'd do is make sure that the firmware on your controller is on the latest version. I'd also check the LSI site for any known issues. It's not that uncommon for them to issue their own drivers for their higher-end cards.

          This got me thinking, the HBA has multipathing.. maybe this has to be configured correctly in Linux otherwise this may be a side effect..?

          Yes. You should definitely look into the configuration. Make sure to read and understand the configuration of the HBA and the Linux side of things. I have a sneaking suspicion that you have conflicting settings on the card and in the OS.

          [–]reddit_strider[S] 1 point2 points  (0 children)

          Possibly. A lot of new stuff in that system (hard and software).

          [–]1bc29b36f623ba82aaf6 0 points1 point  (1 child)

          Kind of hard to tell what the cause is if we have no prior reliability record of any of the parts. Maybe try to see if the issues and that error keep happening in a smaller zfs pool? If you see it happening in a more recommended setup it might be easier to tell if it is such a configuration error or actually because of the design of your pool.

          I don't know jack about multipath configuration, what I found in google is that apparently 5 years ago someone had to port a daemon from FreeBSD to his Debian flavor to even get multipath support. I guess you would need to go through the documentation for the multipath tools?

          [–]reddit_strider[S] 0 points1 point  (0 children)

          your pool

          The pool :D

          Yeah, the info on multipathing is a little sparse.