I have set up an NFS share on Solaris 10. It uses ZFS, which in turn uses an IBM DS400 for back-end storage. On top of that I have Nagios running to monitor it. I got an alarm about the ZFS pool being in a degraded state. I logged into the system and found this in dmesg:
May 17 03:20:20 files DESC: The number of checksum errors associated with a ZFS device
May 17 03:20:20 files exceeded acceptable levels.  Refer to http://sun.com/msg/ZFS-8000-GH for more information.
To see more information I ran
-bash-3.00# zpool status
  pool: rz2pool
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub completed after 5h21m with 0 errors on Wed May 19 08:41:49 2010
config:

        NAME                        STATE     READ WRITE CKSUM
        rz2pool                     DEGRADED     0     0     0
          raidz2                    ONLINE       0     0     0
            c3t21000000D12643DEd0   ONLINE       0     0     0
            c3t21000000D12643DEd1   ONLINE       0     0     0
            c3t21000000D12643DEd2   ONLINE       0     0     0
            c3t21000000D12643DEd3   ONLINE       0     0     0
            c3t21000000D12643DEd4   ONLINE       0     0     0
            c3t21000000D12643DEd5   ONLINE       0     0     0
            c3t21000000D12643DEd6   ONLINE       0     0     0
          raidz2                    DEGRADED     0     0     0
            c3t21000000D12643DEd7   ONLINE       0     0     0
            c3t21000000D12643DEd8   ONLINE       0     0     0
            c3t21000000D12643DEd9   ONLINE       0     0     0
            c3t21000000D12643DEd10  ONLINE       0     0     0
            c3t21000000D12643DEd11  ONLINE       0     0     0
            c3t21000000D12643DEd12  DEGRADED     0     0   234  too many errors
            c3t21000000D12643DEd13  ONLINE       0     0     0

errors: No known data errors
And this is where ZFS is awesome. It may not be the fastest volume manager on the planet, or the smartest, but I trust its integrity (having read the whitepapers on it).
What is really cool here:
- It has detected that the underlying LUN is misbehaving.
- It has marked the LUN as degraded.
- It has saved my data from silent corruption.
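To make the three points above concrete, here is a toy sketch of checksum-verified reads with self-healing. It is not how ZFS is implemented: it assumes a simple two-way mirror (my pool is raidz2, which rebuilds from parity instead), uses SHA-256 purely for convenience, and the `ToyMirror` class and its fields are made up for illustration. The one idea it does share with ZFS is that the checksum is kept apart from the data it protects, so a device that silently returns wrong bytes gets caught.

```python
import hashlib


def checksum(data):
    """Checksum of a block (SHA-256 here; chosen only for convenience)."""
    return hashlib.sha256(data).hexdigest()


class ToyMirror:
    """Hypothetical two-way mirror with checksums stored apart from the data."""

    def __init__(self):
        self.copies = [{}, {}]       # two simulated devices: block id -> bytes
        self.checksums = {}          # block id -> expected checksum
        self.cksum_errors = [0, 0]   # per-device counter, like the CKSUM column

    def write(self, block_id, data):
        for dev in self.copies:
            dev[block_id] = data
        self.checksums[block_id] = checksum(data)

    def read(self, block_id):
        expected = self.checksums[block_id]
        failed = []
        for i, dev in enumerate(self.copies):
            data = dev[block_id]
            if checksum(data) == expected:
                # Good copy found: repair every copy that failed before it.
                for j in failed:
                    self.copies[j][block_id] = data
                return data
            # This device returned data that does not match its checksum.
            self.cksum_errors[i] += 1
            failed.append(i)
        raise IOError("block %r: no copy matches its checksum" % block_id)


pool = ToyMirror()
pool.write("blk0", b"important data")

# Simulate silent corruption: device 0 changes a byte without reporting an error.
pool.copies[0]["blk0"] = b"important dat\x00"

data = pool.read("blk0")
print(data, pool.cksum_errors)
```

The read still returns the correct bytes, the error counter for the bad device goes up, and the bad copy is rewritten from the good one. A volume manager without checksums would have handed back whatever device 0 returned.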
There are not many volume managers out there that do that. I have not lost any data, data integrity is still intact, and I know which disk is about to fail. Kudos and thanks to the ZFS dev team!
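On the monitoring side, a check along these lines could turn zpool status output into a Nagios result. This is only a sketch, not the plugin I actually run: the function names, the regular expression, and the policy (WARNING for anything degraded or with non-zero counters, CRITICAL for FAULTED, UNAVAIL, OFFLINE or REMOVED) are my own assumptions, and it expects plain integer error counters. The sample text is abbreviated from the output above.

```python
import re

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# One row of the config table: NAME STATE READ WRITE CKSUM [note]
# Assumption: counters are plain integers, as in the output above.
ROW = re.compile(r"^\s*(\S+)\s+([A-Z]+)\s+(\d+)\s+(\d+)\s+(\d+)(?:\s+(.*\S))?\s*$")

# Abbreviated sample of the zpool status output shown earlier.
SAMPLE = """\
  pool: rz2pool
 state: DEGRADED
config:

        NAME                        STATE     READ WRITE CKSUM
        rz2pool                     DEGRADED     0     0     0
          raidz2                    DEGRADED     0     0     0
            c3t21000000D12643DEd11  ONLINE       0     0     0
            c3t21000000D12643DEd12  DEGRADED     0     0   234  too many errors
            c3t21000000D12643DEd13  ONLINE       0     0     0

errors: No known data errors
"""


def parse_devices(status_text):
    """Return one dict per row of the config table (pool, vdevs and devices)."""
    devices = []
    for line in status_text.splitlines():
        m = ROW.match(line)
        if m:
            name, state, rd, wr, ck, note = m.groups()
            devices.append({
                "name": name, "state": state,
                "read": int(rd), "write": int(wr), "cksum": int(ck),
                "note": note or "",
            })
    return devices


def check_pool(status_text):
    """Map zpool status text to a (Nagios exit code, message) pair."""
    devices = parse_devices(status_text)
    if not devices:
        return UNKNOWN, "UNKNOWN: no device rows found in zpool status output"
    bad = [d for d in devices
           if d["state"] != "ONLINE" or d["read"] or d["write"] or d["cksum"]]
    if not bad:
        return OK, "OK: all %d entries ONLINE with zero error counters" % len(devices)
    # Assumed policy: degraded is a warning, anything worse is critical.
    worse = ("FAULTED", "UNAVAIL", "OFFLINE", "REMOVED")
    code = CRITICAL if any(d["state"] in worse for d in bad) else WARNING
    label = "CRITICAL" if code == CRITICAL else "WARNING"
    detail = ", ".join("%s %s (cksum=%d)" % (d["name"], d["state"], d["cksum"])
                       for d in bad)
    return code, "%s: %s" % (label, detail)


code, message = check_pool(SAMPLE)
print(message)
```

In a real plugin the text would come from running zpool status on the file server, and the script would exit with the returned code so that Nagios can raise the alarm.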