I have set up an NFS share on Solaris 10. It uses ZFS, which in turn uses an IBM DS400 for back-end storage. On top of that I have Nagios running to monitor it. I got an alarm about the ZFS pool being in a degraded state. I logged into the system and found this in the dmesg output:
May 17 03:20:20 files DESC: The number of checksum errors associated with a ZFS device
May 17 03:20:20 files exceeded acceptable levels. Refer to http://sun.com/msg/ZFS-8000-GH for more information.
To get more detail I ran:
-bash-3.00# zpool status
pool: rz2pool
state: DEGRADED
status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
see: http://www.sun.com/msg/ZFS-8000-9P
scrub: scrub completed after 5h21m with 0 errors on Wed May 19 08:41:49 2010
config:
NAME                        STATE     READ WRITE CKSUM
rz2pool                     DEGRADED     0     0     0
  raidz2                    ONLINE       0     0     0
    c3t21000000D12643DEd0   ONLINE       0     0     0
    c3t21000000D12643DEd1   ONLINE       0     0     0
    c3t21000000D12643DEd2   ONLINE       0     0     0
    c3t21000000D12643DEd3   ONLINE       0     0     0
    c3t21000000D12643DEd4   ONLINE       0     0     0
    c3t21000000D12643DEd5   ONLINE       0     0     0
    c3t21000000D12643DEd6   ONLINE       0     0     0
  raidz2                    DEGRADED     0     0     0
    c3t21000000D12643DEd7   ONLINE       0     0     0
    c3t21000000D12643DEd8   ONLINE       0     0     0
    c3t21000000D12643DEd9   ONLINE       0     0     0
    c3t21000000D12643DEd10  ONLINE       0     0     0
    c3t21000000D12643DEd11  ONLINE       0     0     0
    c3t21000000D12643DEd12  DEGRADED     0     0   234  too many errors
    c3t21000000D12643DEd13  ONLINE       0     0     0
errors: No known data errors
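A monitoring check can turn output like this into an alert by reading the per-device state and error counters. What follows is a minimal sketch in Python, not the plugin that raised the alarm above. The names `parse_devices`, `check` and `SAMPLE` are hypothetical, and the mapping of device states to severities is an assumption. Only the exit codes follow the usual Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
# Nagios plugin exit codes (standard convention).
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# States that can appear in the STATE column of `zpool status`.
DEVICE_STATES = {"ONLINE", "DEGRADED", "FAULTED", "OFFLINE", "UNAVAIL", "REMOVED"}

# Shortened copy of the output above, used so the sketch runs on its own.
SAMPLE = """\
  pool: rz2pool
 state: DEGRADED
config:
        NAME                        STATE     READ WRITE CKSUM
        rz2pool                     DEGRADED     0     0     0
          raidz2                    DEGRADED     0     0     0
            c3t21000000D12643DEd11  ONLINE       0     0     0
            c3t21000000D12643DEd12  DEGRADED     0     0   234  too many errors
errors: No known data errors
"""


def parse_devices(status_text):
    """Return (name, state, read, write, cksum) tuples from the config section."""
    devices = []
    for line in status_text.splitlines():
        fields = line.split()
        # Config rows look like: NAME STATE READ WRITE CKSUM [note ...]
        if len(fields) >= 5 and fields[1] in DEVICE_STATES:
            try:
                read, write, cksum = (int(f) for f in fields[2:5])
            except ValueError:
                continue  # abbreviated counters such as "1.2K" are skipped here
            devices.append((fields[0], fields[1], read, write, cksum))
    return devices


def check(status_text):
    """Map the parsed rows to an exit code and a one-line message."""
    devices = parse_devices(status_text)
    if not devices:
        return UNKNOWN, "UNKNOWN: no devices found in zpool status output"
    # A row is suspicious if it is not ONLINE or has any non-zero counter.
    bad = [d for d in devices if d[1] != "ONLINE" or any(d[2:5])]
    if not bad:
        return OK, "OK: %d entries online, no errors" % len(devices)
    # Assumption: treat missing or faulted devices as critical, the rest as warning.
    critical = any(d[1] in ("FAULTED", "UNAVAIL", "REMOVED") for d in bad)
    details = ", ".join("%s %s (cksum=%d)" % (d[0], d[1], d[4]) for d in bad)
    if critical:
        return CRITICAL, "CRITICAL: " + details
    return WARNING, "WARNING: " + details


code, message = check(SAMPLE)
print(message)
```

In a real plugin the text would come from running `zpool status` instead of `SAMPLE`, and the script would finish with `sys.exit(code)` so Nagios sees the result.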
And this is where ZFS is awesome. It may not be the fastest volume manager on the planet, or the smartest, but I trust its integrity (having read the whitepapers on it).
What is really cool here:
- It has detected that the underlying LUN is misbehaving.
- It has marked the LUN as degraded.
- It has saved my data from silent corruption.
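The idea behind those three points can be shown with a toy example: keep a checksum apart from the data, verify it on every read, and fall back to redundant data when a copy fails verification. The sketch below is only an illustration under simplifying assumptions. It uses two plain copies and SHA-256, whereas the pool above is raidz2 and rebuilds from parity, and `write_block` and `read_block` are hypothetical names, not ZFS interfaces.

```python
import hashlib


def write_block(data):
    """Return a mutable copy of the data plus a checksum that is stored separately."""
    return bytearray(data), hashlib.sha256(data).digest()


def read_block(copies, checksum):
    """Return the first copy that verifies, plus the indexes of copies that do not."""
    good = None
    bad = []
    for index, copy in enumerate(copies):
        if hashlib.sha256(copy).digest() == checksum:
            if good is None:
                good = bytes(copy)
        else:
            bad.append(index)  # this copy was silently corrupted
    if good is None:
        raise IOError("all copies failed checksum verification")
    return good, bad


payload = b"important file contents"
primary, checksum = write_block(payload)
mirror = bytearray(payload)

primary[0] ^= 0x01  # simulate a silent bit flip on one device

data, bad = read_block([primary, mirror], checksum)
print(data == payload, bad)
```

The read still returns the correct data, and the list of bad copies plays the role of the CKSUM counter in the `zpool status` output: it says which device returned wrong data even though the device itself reported no read error.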
There are not many volume managers out there that do that. I have not lost data, data integrity is still in place, and I know which disk is about to fail. Kudos and thanks to the ZFS dev team!