Disk failure and ZFS

ZFS is a beautiful filesystem, even when hardware fails, because it was built to deal with such failures. Allow me to demonstrate with a defective disk in a RAID-Z1 pool. As long as only one disk breaks down, the pool stays functional, even while rebuilding.

To start the recovery, we need to locate the bad disk by typing ‘zpool status -x’:

  pool: MyPool
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
        the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-2Q
 scrub: none requested
config:
 
        NAME        STATE     READ WRITE CKSUM
        Ahsay       DEGRADED     0     0     0
          raidz1    DEGRADED     0     0     0
            c1t0d5  ONLINE       0     0     0
            c1t0d6  ONLINE       0     0     0
            c1t0d7  UNAVAIL      0     0     0  cannot open
            c1t0d8  ONLINE       0     0     0
 
errors: No known data errors

The defective disk is c1t0d7. If the disk isn’t damaged, you can try to bring it back online using the command ‘zpool online [pool] [disk]’. In our case that would be ‘zpool online MyPool c1t0d7’.
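On a pool with many disks, picking the bad one out of the listing by eye gets tedious. Here is a minimal sketch of how it could be filtered out, assuming the column layout shown above. The function name list_bad_devices and the list of states it treats as "bad" are my own choices for illustration, not something ZFS provides.

```shell
# Print every device whose STATE column marks it as out of service.
# Reads `zpool status` text on stdin; list_bad_devices is a hypothetical name.
list_bad_devices() {
    awk '
        # The device table starts after the "NAME STATE ..." header line.
        $1 == "NAME" && $2 == "STATE" { in_table = 1; next }
        # Pool and raidz1 rows show DEGRADED, so they are not matched here.
        in_table && $2 ~ /^(UNAVAIL|FAULTED|OFFLINE|REMOVED)$/ { print $1 }
    '
}

# Usage (on a machine with ZFS): zpool status -x | list_bad_devices
```

Fed with the output above, this should print only c1t0d7, since the pool and raidz1 rows are merely DEGRADED.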

If this works, everything is fine again. But if you have to move the disk to another LUN, it gets a new ID. In that case you need to use ‘zpool replace [pool] [old disk] [new disk]’, where the last parameter is the new location of the disk. As I just overwrote the disk using dd, it stays in the same place, so ‘zpool replace MyPool c1t0d7’ is sufficient to integrate it back into the pool, and it starts to rebuild:

  pool: MyPool
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scrub: resilver in progress for 0h0m, 0.00% done, 517h14m to go
config:
 
        NAME              STATE     READ WRITE CKSUM
        Ahsay             DEGRADED     0     0     0
          raidz1          DEGRADED     0     0     0
            c1t0d5        ONLINE       0     0     0
            c1t0d6        ONLINE       0     0     0
            replacing     DEGRADED     0     0     7
              c1t0d7s0/o  FAULTED      0     0     0  corrupted data
              c1t0d7      ONLINE       0     0     0  3.54M resilvered
            c1t0d8        ONLINE       0     0     0
 
errors: No known data errors
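If you want to keep an eye on the rebuild from a script, the percentage can be pulled out of the ‘scrub:’ line. This is only a sketch that matches the wording shown in the output above. Other ZFS versions may format this line differently and would need a different pattern. The function name resilver_percent is made up for this example.

```shell
# Extract the percentage from a "resilver in progress ... N% done" line.
# Reads `zpool status` text on stdin; resilver_percent is a hypothetical name.
resilver_percent() {
    sed -n 's/.*resilver in progress.* \([0-9][0-9.]*\)% done.*/\1/p'
}

# Usage (on a machine with ZFS): zpool status MyPool | resilver_percent
```

It prints nothing when no resilver is running, so an empty result can serve as the signal that the rebuild is no longer in progress.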

Once the rebuild, the so-called ‘resilvering’, has completed, the matter is history and we are done with it: short and sweet. In my opinion it’s almost too easy to be true, as the whole system can stay online and working the entire time.
