I have set up an NFS share on Solaris 10. It uses ZFS, which in turn uses an IBM DS400 for backend storage. On top of that I have Nagios running to monitor it. I got an alarm about the ZFS pool being in a degraded state. I logged into the system and found this in dmesg:
May 17 03:20:20 files DESC: The number of checksum errors associated with a ZFS device
May 17 03:20:20 files exceeded acceptable levels. Refer to http://sun.com/msg/ZFS-8000-GH for more information.
To get more information I ran:
-bash-3.00# zpool status
  pool: rz2pool
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error. An
        attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub completed after 5h21m with 0 errors on Wed May 19 08:41:49 2010
config:
        NAME                        STATE     READ WRITE CKSUM
        rz2pool                     DEGRADED     0     0     0
          raidz2                    ONLINE       0     0     0
            c3t21000000D12643DEd0   ONLINE       0     0     0
            c3t21000000D12643DEd1   ONLINE       0     0     0
            c3t21000000D12643DEd2   ONLINE       0     0     0
            c3t21000000D12643DEd3   ONLINE       0     0     0
            c3t21000000D12643DEd4   ONLINE       0     0     0
            c3t21000000D12643DEd5   ONLINE       0     0     0
            c3t21000000D12643DEd6   ONLINE       0     0     0
          raidz2                    DEGRADED     0     0     0
            c3t21000000D12643DEd7   ONLINE       0     0     0
            c3t21000000D12643DEd8   ONLINE       0     0     0
            c3t21000000D12643DEd9   ONLINE       0     0     0
            c3t21000000D12643DEd10  ONLINE       0     0     0
            c3t21000000D12643DEd11  ONLINE       0     0     0
            c3t21000000D12643DEd12  DEGRADED     0     0   234  too many errors
            c3t21000000D12643DEd13  ONLINE       0     0     0
errors: No known data errors
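Since a monitoring check is what raised the alarm, here is a minimal sketch of how a Nagios-style check could pick the failing device out of output like the above. The names (`parse_config`, `check`, `CRITICAL_STATES`) and the severity mapping are assumptions made for illustration, not taken from any real plugin, and the parser assumes plain integer error counters as shown in this output.

```python
import re

# Nagios plugin exit codes: 0 = OK, 1 = WARNING, 2 = CRITICAL.
OK, WARNING, CRITICAL = 0, 1, 2

# Assumption: these device states are treated as critical, anything else
# that is not ONLINE (e.g. DEGRADED) is only a warning.
CRITICAL_STATES = {"FAULTED", "UNAVAIL", "OFFLINE", "REMOVED"}

# A config row looks like: NAME STATE READ WRITE CKSUM [optional note]
ROW = re.compile(r"^\s*(\S+)\s+([A-Z]+)\s+(\d+)\s+(\d+)\s+(\d+)(?:\s+(.*))?$")


def parse_config(status_text):
    """Return one dict per device row found in `zpool status` output."""
    rows = []
    for line in status_text.splitlines():
        m = ROW.match(line)
        if not m:
            continue  # header lines, "pool:", "scrub:", etc. do not match
        name, state, read, write, cksum, note = m.groups()
        rows.append({
            "name": name,
            "state": state,
            "read": int(read),
            "write": int(write),
            "cksum": int(cksum),
            "note": (note or "").strip(),
        })
    return rows


def check(status_text):
    """Return (exit_code, message) in the style of a Nagios plugin."""
    bad = [r for r in parse_config(status_text)
           if r["state"] != "ONLINE" or r["read"] or r["write"] or r["cksum"]]
    if not bad:
        return OK, "OK: all devices ONLINE with zero error counters"
    is_critical = any(r["state"] in CRITICAL_STATES for r in bad)
    code = CRITICAL if is_critical else WARNING
    label = "CRITICAL" if is_critical else "WARNING"
    details = ", ".join(
        "%s %s (cksum=%d)" % (r["name"], r["state"], r["cksum"]) for r in bad
    )
    return code, "%s: %s" % (label, details)


# Abbreviated copy of the config section shown above.
SAMPLE = """\
        NAME                        STATE     READ WRITE CKSUM
        rz2pool                     DEGRADED     0     0     0
          raidz2                    DEGRADED     0     0     0
            c3t21000000D12643DEd11  ONLINE       0     0     0
            c3t21000000D12643DEd12  DEGRADED     0     0   234  too many errors
"""

code, message = check(SAMPLE)
print(code, message)
```

In practice the text would come from running `zpool status` on the file server rather than from a pasted sample, and the script would exit with the returned code so that Nagios can act on it.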
And this is where ZFS is awesome. It may not be the fastest volume manager on the planet, or the smartest, but I trust its integrity (having read the whitepapers on it).
What is really cool here:
- It has detected that the underlying LUN is misbehaving.
- It has marked the LUN as degraded.
- It has saved my data from silent corruption.
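The idea behind those three points can be illustrated with a toy model. This is only a sketch of the principle (a checksum stored apart from the data, verified on every read, and a redundant copy used for repair), not ZFS's actual implementation. It assumes a two-way mirror instead of raidz2 and SHA-256 as the checksum, and the class and method names (`ToyMirror`, `write`, `read`) are hypothetical.

```python
import hashlib


def checksum(data):
    """Checksum of a block; SHA-256 is used here purely for illustration."""
    return hashlib.sha256(data).hexdigest()


class ToyMirror:
    """Two copies of every block, with checksums kept apart from the data."""

    def __init__(self):
        self.copies = [{}, {}]       # block_id -> bytes, one dict per "disk"
        self.sums = {}               # block_id -> expected checksum
        self.cksum_errors = [0, 0]   # per-disk counter, like the CKSUM column

    def write(self, block_id, data):
        for copy in self.copies:
            copy[block_id] = data
        self.sums[block_id] = checksum(data)

    def read(self, block_id):
        expected = self.sums[block_id]
        good = None
        bad_disks = []
        # Verify every copy against the independently stored checksum.
        for disk, copy in enumerate(self.copies):
            if checksum(copy[block_id]) == expected:
                if good is None:
                    good = copy[block_id]
            else:
                self.cksum_errors[disk] += 1
                bad_disks.append(disk)
        if good is None:
            raise IOError("block %r: no copy matches its checksum" % (block_id,))
        # Repair the damaged copies from the verified one.
        for disk in bad_disks:
            self.copies[disk][block_id] = good
        return good


pool = ToyMirror()
pool.write(0, b"important data")
pool.copies[1][0] = b"important dbta"   # simulate silent corruption on disk 1

data = pool.read(0)
print(data, pool.cksum_errors)
```

The read still returns the correct data, the error counter for the misbehaving disk goes up, and the bad copy is rewritten. A device that simply returns wrong bytes without reporting an I/O error would go unnoticed without the separately stored checksum.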
There are not many volume managers out there that do all of this. I have not lost data, data integrity is still in place, and I know which disk is about to fail. Kudos and thanks to the ZFS dev team!