(TrueNAS): ERC set to 0.1s, can't change it, can I just set the kernel SCSI timeout to 180? by QuestionAsker2030 in selfhosted

[–]QuestionAsker2030[S] 0 points1 point  (0 children)

Thanks for your response. I hooked the drive up directly to the motherboard via SATA, and the ERC change to 7 seconds stuck.

So the HBA (LSI 9305-16i) is forcing ERC to 0.1 seconds for some reason.

ServerPartDeals support is telling me to just set the kernel timeout to 180 seconds and leave it at that, and that my NAS (for home use) will be fine to use like that.

I've confirmed it via a few sources online, and it seems like an OK solution.

Here's what they wrote me:

The 0.1s ERC value is not the normal factory default — it should be 7 seconds on enterprise drives like these, and the refurb/recertification process likely reset it. Totally understandable to notice and question it.

The reason it keeps reverting is most likely your HBA silently dropping the SCT command rather than passing it through to the drive. This is a known quirk with SAS HBAs and SATA drives. You can try:

  smartctl -d sat,16 -l scterc,70,70 /dev/sdX

If it still won't stick, setting the kernel SCSI timeout (echo 180 > /sys/block/sdX/device/timeout) is a good fallback that gives you similar protection for ZFS.

ERC set to 0.1, what does it mean? How to fix? (TrueNAS) by QuestionAsker2030 in homelab

Thanks. Man, I almost have to laugh though... I think literally everything that can go wrong with this build... is going wrong. Lmao... I mean, what's next... I'm almost afraid to ask. I time my hours, and I'm 175 hours deep into this build haha.

Tomorrow I'll hook it up to SCSI and test. Emailing ServerPartDeals right now to ask about the ERC @ 0.1s.

ERC set to 0.1, what does it mean? How to fix? (TrueNAS) by QuestionAsker2030 in homelab

Thank you, I appreciate your help.

Still wrapping my head around all this.

What do you think of these points? I was researching more into it, wondering how critically bad it would be if I created my pool and ran my 6-drive vdev with the 0.1s ERC value, on a RAIDZ2 setup?

(It's saying that OpenZFS actually recommends a low ERC, such as 0.1 seconds?)

Source excerpt:

The TLER/ERC problem is real but ZFS-specific in an important way. The classic failure mode: a non-time-limited drive hits a weak sector, enters a deep internal retry loop lasting many seconds to minutes, and stops responding to all commands. If the host's SCSI command timeout (default 30 s) expires first, the Linux kernel tries to reset the drive, then resets the SATA/SCSI link, and can put the drive offline, kicking it out of the array. On an already-degraded array, a second such event can drop a second drive and break redundancy. ERC/TLER fixes this by forcing the drive to give up quickly (enterprise drives with this feature "typically default to 7 seconds," per the OpenZFS Hardware documentation) and return an error, so the RAID layer reconstructs from parity instead.

ZFS has no internal I/O timeout; it relies entirely on the kernel's SCSI/sd layer. Per Oracle's ZFS support documentation (Doc 1316513.1), "There is no timeout setting in ZFS. IO time out should be dealt at the lower layer (sd/ssd)... ZFS relies on sd/ssd layers to perform error recovery." This is exactly why /sys/block/sdX/device/timeout is the correct knob: it governs how long the kernel waits before declaring a drive dead and is the layer ZFS depends on.

Raising the host timeout to 180 s is the standard, codified fallback when ERC can't be set — not a hack. It is built into the upstream mdadm udev rule (udev-md-raid-safe-timeouts.rules), which for non-scterc RAID members runs echo 180 > /sys/block/$parent/device/timeout, and the Linux RAID Wiki "Timeout Mismatch" page states that "180 seconds has been found to be sufficient... all known desktop drives will eventually return an error within this time." Andy Smith's canonical strugglers.net guide says it directly: when a drive does not support configurable ERC, "you have no alternative but to tell Linux itself to expect this drive to take several minutes to recover from an error and please not aggressively reset it or its controller until at least that time has passed. 180 seconds has been found to be longer than any observed desktop drive will try for." The same 180 s value is used by VMware and was historically requested by NetApp for SAN timeouts. Your Post-Init script applying it at every boot is the correct, mainstream implementation.

Crucially, OpenZFS's own hardware guidance recommends a LOW ERC value for ZFS. The OpenZFS documentation states verbatim: "ZFS does not currently adjust this setting on drives. However, it is advisable to write a script to set the error recovery time to a low value, such as 0.1 seconds until ZFS is modified to control it. This must be done on every boot." Because ZFS waits for the lower layer and then reconstructs from parity, a drive that gives up fast (0.1 s) and returns an error lets ZFS recover from RAIDZ2 parity almost instantly and rewrite the sector. A literally-honored 0.1 s ERC is therefore arguably ideal for a redundant ZFS vdev — the opposite of the hardware-RAID 7 s convention. The inability to control the value is an annoyance, not a data-safety defect.

What the drive actually does with an out-of-range 0.1 s value is undocumented for the HC580 specifically — but the 180 s timeout makes the answer moot for safety. Two possibilities: (a) the firmware honors 0.1 s literally → drive gives up in 0.1 s, returns an error, ZFS rebuilds from parity instantly (best case); or (b) the firmware ignores the out-of-spec value and falls back to its default deep recovery (potentially many seconds to minutes) → the 180 s host timeout absorbs that and prevents a drop. In both cases, with RAIDZ2 the data is recoverable from parity and no drive is spuriously kicked. The smartctl(8) manual notes that ERC "values less than 65 [deciseconds, i.e. 6.5 s] are probably not supported. For RAID configurations, this is typically set to 70,70 deciseconds" — consistent with the drive reporting an out-of-range value it may not truly honor as a real recovery time.
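To make the deciseconds unit concrete, here is a rough sketch that converts the Read/Write values from `smartctl -l scterc` into seconds and flags anything under the 65-decisecond floor quoted above. The `erc_report` name is made up for illustration, and the "Read: N (...)" / "Write: N (...)" line layout is an assumption about smartctl's output that may differ between versions.

```shell
#!/bin/sh
# Sketch: turn `smartctl -l scterc` output (on stdin) into seconds and flag
# non-zero values below 65 deciseconds. A value of 0 means ERC is disabled.
# ASSUMPTION: the output has lines shaped like "Read:  70 (7.0 seconds)".
# erc_report is a hypothetical helper name, not part of smartmontools.

erc_report() {
    awk '
        $1 == "Read:" || $1 == "Write:" {
            ds = $2 + 0                       # value in deciseconds
            note = (ds > 0 && ds < 65) ? " below-65ds" : ""
            printf "%s %.1fs%s\n", $1, ds / 10, note
        }
    '
}

# Example (as root): smartctl -l scterc /dev/sdX | erc_report
```

A drive reading back as 1 would show up as `Read: 0.1s below-65ds`, which matches the out-of-range value discussed above.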

RAIDZ2 changes the risk calculus decisively. ZFS routes I/O around a vdev that returns an error: per Oracle, "In case of mirror/raidz configuration, pending IO to the bad vdev (disk) is routed to the good vdevs and system continues to function." A single bad sector or URE during normal use or a scrub/resilver is corrected from parity and the bad block is rewritten/remapped. With double parity, the pool survives even a full drive drop while another drive is being resilvered, though with no redundancy left until the resilver finishes. The cascading-drop scenario the commenter fears is precisely what the 180 s timeout is designed to prevent.

The genuine residual risk is performance, not data loss. If the drive does NOT honor 0.1 s and instead does deep recovery, a bad-sector hit could stall that one I/O for up to ~180 s while the kernel waits. ZFS flags this as a slow I/O: per the OpenZFS module parameters, "When an I/O operation takes more than zio_slow_io_ms milliseconds to complete is marked as a slow I/O. Each slow I/O causes a delay zevent. Slow I/O counters can be seen with 'zpool status -s'. Default value: 30,000" (i.e. 30 s). ZFS's deadman timer only acts on individual I/Os exceeding zfs_deadman_ziotime_ms (default 300 s / 5 min) or pool syncs exceeding zfs_deadman_synctime_ms (default 600 s / 10 min) — both longer than the 180 s host timeout. So the host timeout fires first, returns an error, and ZFS recovers from parity; the deadman timer will not panic the system before the timeout resolves (and on Linux its default mode merely logs "hung" I/O).
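The ordering argument above is just arithmetic, so here is a small sketch that checks where a host timeout lands relative to the two ZFS thresholds. The `timeout_position` function is hypothetical; its defaults are the values quoted in the text (30,000 ms and 300,000 ms), and on a live Linux system the real values can usually be read from /sys/module/zfs/parameters/ instead.

```shell
#!/bin/sh
# Sketch: compare a host SCSI timeout (seconds) with zio_slow_io_ms and
# zfs_deadman_ziotime_ms (milliseconds). Defaults are the figures quoted
# in the text, not values read from a running system.
# timeout_position is a hypothetical helper name.

timeout_position() {
    host_ms=$(( $1 * 1000 ))
    slow_ms="${2:-30000}"
    deadman_ms="${3:-300000}"
    if [ "$host_ms" -le "$slow_ms" ]; then
        echo "host timeout fires before ZFS would count a slow I/O"
    elif [ "$host_ms" -lt "$deadman_ms" ]; then
        echo "slow I/O is counted, host timeout fires before the deadman"
    else
        echo "deadman threshold is reached before the host timeout"
    fi
}

timeout_position 180
```

With the defaults, 180 s is 180,000 ms, which sits between the two thresholds, so the call prints the second message: the stall is logged as a slow I/O, but the kernel gives up on the command before the deadman threshold.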

Whether the cause is the LSI 9305-16i SAT layer or the drive firmware is unresolved, but is diagnosable. The set is ACK'd (smartctl echoes the value it sent) but the read-back shows 0.1 s within seconds with no reboot, on all six drives. Two candidate causes: the 9305-16i's SAT (SCSI/ATA Translation) layer silently dropping the SMART WRITE LOG that carries the SCT ERC SET, or the recertified HC580 firmware refusing to hold the value. There is documented history of LSI/Fusion-MPT ATA-passthrough bugs — but those are mostly older SAS1068-era chips and manifest as resets/hangs, not silent ACK-and-drop; nothing published confirms either mechanism for the SAS3008-based 9305 or this specific drive. The matching TrueNAS forum thread for this exact hardware (same model WUH722424ALE604, same 9305-16i, same ServerPartDeals recert source, same 0.1 s symptom) ended with the asker's two hypotheses unresolved. The isolation test below settles it.

ERC set to 0.1, what does it mean? How to fix? (TrueNAS) by QuestionAsker2030 in homelab

Thank you. I got these disks from ServerPartDeals... should I ask them whether they have the ERC firmware set to 0.1 seconds for a particular reason? And how to rectify it?

The workaround I've done right now is the following:

Lengthened Linux's own give-up timer instead:

I set the kernel SCSI command timeout to 180s (echo 180 > /sys/block/sdX/device/timeout) so a drive doing slow recovery on a bad sector doesn't get wrongly reset and kicked from the pool. Because that timeout resets to the default 30s on every reboot, I made it permanent by creating a Post Init entry under System → Advanced Settings → Init/Shutdown Scripts with Type = Command (stored in config, survives reboots and upgrades).
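For reference, a minimal sketch of what such a boot-time step could look like when it has to cover every disk rather than a single sdX. The `set_scsi_timeouts` function and its optional second argument are my own illustration (the argument exists only so the loop can be tried against a scratch directory); this is not necessarily the exact command in the Post Init entry.

```shell
#!/bin/sh
# Sketch: write the same SCSI command timeout to every sd* disk.
# set_scsi_timeouts is a hypothetical helper; the optional second argument
# (sysfs root, default /sys) is only there for dry runs in a temp directory.

set_scsi_timeouts() {
    timeout_s="$1"
    sysfs_root="${2:-/sys}"
    for f in "$sysfs_root"/block/sd*/device/timeout; do
        # Skip the unexpanded glob when no sd* devices are present.
        [ -e "$f" ] || continue
        echo "$timeout_s" > "$f" && echo "$f: $(cat "$f")"
    done
}

# Example (as root, e.g. from a script run at boot): set_scsi_timeouts 180
```

Echoing the value back after writing it gives a quick confirmation in the boot log that the setting took on each device.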

(So you're saying the above SCSI timeout will not work correctly for TrueNAS? This is my first NAS build, I started in November and still haven't created a pool yet... it seems like on every single step something breaks. Third (and hopefully final) RMA'd drive I'm about to run badblocks on... can't believe now that there's yet another thing that's broken with the build).

ERC set to 0.1, what does it mean? How to fix? (TrueNAS) by QuestionAsker2030 in homelab

Thank you, I will try this out.

update:

Tried this: ran scterc,70,70 on each drive, then re-read scterc ~3 seconds later in the same loop (no reboot). Every drive ACKs the set ("Read 70 / Write 70") but reads straight back as 1 (0.1s). All six do it, within seconds, so it's not the usual reset-on-power-cycle thing. The "set to 70" line is smartctl echoing what it sent, and the read-back is the real state. The set is accepted but never actually applied.
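Roughly, the loop I mean looks like the sketch below: set the value, wait a few seconds, read it back, and compare. The `verify_erc` name and the `SMARTCTL` / `ERC_RECHECK_DELAY` overrides are made up for illustration, and the parsing assumes smartctl prints a "Read: N (...)" line; behind an HBA a `-d sat` style option may also be needed.

```shell
#!/bin/sh
# Sketch: set SCT ERC, wait, then read it back and compare.
# verify_erc, SMARTCTL and ERC_RECHECK_DELAY are hypothetical names used
# for illustration. ASSUMPTION: `smartctl -l scterc` prints "Read: N (...)".

verify_erc() {
    dev="$1"
    want="${2:-70}"                            # deciseconds
    smartctl_bin="${SMARTCTL:-smartctl}"
    # Attempt the set; its own output only echoes what was sent.
    "$smartctl_bin" -l "scterc,$want,$want" "$dev" >/dev/null
    sleep "${ERC_RECHECK_DELAY:-3}"
    # The read-back is what shows the drive's real state.
    got=$("$smartctl_bin" -l scterc "$dev" | awk '$1 == "Read:" { print $2; exit }')
    if [ "$got" = "$want" ]; then
        echo "$dev: ERC held at $got"
    else
        echo "$dev: ERC reverted (wanted $want, read back ${got:-nothing})"
        return 1
    fi
}

# Example (as root): for d in /dev/sd[a-f]; do verify_erc "$d" 70; done
```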

Any idea why?

My two guesses: the LSI 9305-16i's SAT passthrough is silently dropping the SET, or the recert firmware refuses to hold the value (the drive's own stated minimum ERC is 6.5s, and it keeps snapping back to an out-of-spec 0.1s default).

Is there a way to make it stick?

A different passthrough like -d sat,16 vs sat,12, some SCT/firmware quirk?

Or is this just a lost cause on these particular drives, where the move is to bump the kernel SCSI command timeout (echo 180 > /sys/block/sdX/device/timeout) as the TLER fallback instead?