
KBA1742 Zpool Showing Errors With Reference 0xffffffffffffffff

 

 

Applicable Delphix Versions

 

Major Release

All Sub Releases

5.2

5.2.2.0, 5.2.2.1, 5.2.3.0

5.1

5.1.0.0, 5.1.1.0, 5.1.2.0, 5.1.3.0, 5.1.4.0, 5.1.5.0, 5.1.5.1, 5.1.6.0, 5.1.7.0, 5.1.8.0, 5.1.8.1, 5.1.9.0

5.0

5.0.1.0, 5.0.1.1, 5.0.2.0, 5.0.2.1, 5.0.2.2, 5.0.2.3, 5.0.3.0, 5.0.3.1, 5.0.4.0, 5.0.4.1, 5.0.5.0, 5.0.5.1, 5.0.5.2, 5.0.5.3, 5.0.5.4

4.3

4.3.1.0, 4.3.2.0, 4.3.2.1, 4.3.3.0, 4.3.4.0, 4.3.4.1, 4.3.5.0

4.2

4.2.0.0, 4.2.0.3, 4.2.1.0, 4.2.1.1, 4.2.2.0, 4.2.2.1, 4.2.3.0, 4.2.4.0, 4.2.5.0, 4.2.5.1

4.1

4.1.0.0, 4.1.2.0, 4.1.3.0, 4.1.3.1, 4.1.3.2, 4.1.4.0, 4.1.5.0, 4.1.6.0

4.0

4.0.0.0, 4.0.0.1, 4.0.1.0, 4.0.2.0, 4.0.3.0, 4.0.4.0, 4.0.5.0, 4.0.6.0, 4.0.6.1

3.2

3.2.0.0, 3.2.1.0, 3.2.2.0, 3.2.2.1, 3.2.3.0, 3.2.4.0, 3.2.4.1, 3.2.4.2, 3.2.5.0, 3.2.5.1, 3.2.6.0, 3.2.7.0, 3.2.7.1

3.1

3.1.0.1, 3.1.1.0, 3.1.2.0, 3.1.2.1, 3.1.3.0, 3.1.3.1, 3.1.3.2, 3.1.4.0, 3.1.5.0, 3.1.6.0

3.0

3.0.0.3, 3.0.0.4, 3.0.1.0, 3.0.1.1, 3.0.1.2, 3.0.1.3, 3.0.2.0, 3.0.2.1, 3.0.3.0, 3.0.3.1, 3.0.4.0, 3.0.4.1, 3.0.5.0, 3.0.6.0, 3.0.6.1

Issue

A critical storage fault can be reported, and checking zpool status -v can show errors that reference a dataset ID of 0xffffffffffffffff.

 

Example:

 

**** Command: zpool status -v ****

pool: domain0
state: DEGRADED
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: http://illumos.org/msg/ZFS-8000-8A
scan: scrub canceled on Tue Mar 27 08:44:01 2018
config:

NAME        STATE     READ WRITE CKSUM
domain0     DEGRADED     0     0 1.75M
  c2t2d0    ONLINE       0     0     0
  c1t1d0    ONLINE       0     0     0
  c4t2d0    ONLINE       0     0     0
  c3t2d0    DEGRADED     0     0 3.49M  too many errors

errors: Permanent errors have been detected in the following files:

<0xffffffffffffffff>:<0x1>

Troubleshooting

To investigate this issue further, check the fmdump -eV output from the support logs for ongoing ZFS errors:

 

Example of error in fmdump -eV:

 

Jul 10 2018 11:53:36.116113373 ereport.fs.zfs.checksum
nvlist version: 0
 class = ereport.fs.zfs.checksum
 ena = 0xdedd850f0f501801
 detector = (embedded nvlist)
 nvlist version: 0
 version = 0x0
 scheme = zfs
 pool = 0xd920afdbe9b2518f
 vdev = 0x43805e3a61fa687c
 (end detector)


 pool = domain0
 pool_guid = 0xd920afdbe9b2518f
 pool_context = 0
 pool_failmode = wait
 vdev_guid = 0x43805e3a61fa687c
 vdev_type = disk
 vdev_path = /dev/dsk/c3t2d0s0
 vdev_devid = id1,sd@n6000c297eab8151b5fe145c240b642f2/a
 parent_guid = 0xd920afdbe9b2518f
 parent_type = root
 zio_err = 50
 zio_offset = 0xdd6799be00
 zio_size = 0xac00
 zio_objset = 0xffffffffffffffff
 zio_object = 0x1
 zio_level = 1
 zio_blkid = 0x1e
 cksum_expected = 0x5506330b636512b5 0xff92901be0e10e55 0x96f9b5cba0c955da 0xe8614069208a6f6a
 cksum_actual = 0x5d299e45d41fe95c 0x637ab193e6132ee3 0x16e0119ea80cb6d7 0xd475e9ba0d9a36f9
 cksum_algorithm = edonr
 __ttl = 0x1
 __tod = 0x5b449e40 0x6ebbfdd

 

The dataset reference 0xffffffffffffffff is used when there is a checksum error in an object that has already been deleted and ZFS is unable to finish the async destroy due to I/O errors.
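Every ongoing checksum ereport should carry this same sentinel objset, which can be verified quickly against the support bundle (a sketch; the filename holding the captured fmdump -eV output will vary):

$ grep zio_objset <fmdump-eV-output-file> | sort | uniq -c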

This can be confirmed by running the following command against the support bundle data:

 

$ grep -i freeing zpool_get_all
domain0 freeing 9.33M default
rpool freeing 0 default

So we can see that the pool is attempting to free 9.33M of space but is being prevented from doing so by the zpool corruption.  The issue may be a single affected block that is holding up the release of all of this space.
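The same figure can also be read live on the engine with zpool get (the zpool_get_all file in the support bundle appears to be the captured output of this command); the values below mirror the example above:

# zpool get freeing domain0
NAME     PROPERTY  VALUE  SOURCE
domain0  freeing   9.33M  default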

Resolution

The resolution requires a remote session on the Delphix Engine so that a flag can be set which allows ZFS to complete the freeing operation by ignoring the I/O errors for these blocks.  The blocks affected by the I/O errors will be leaked from the filesystem, i.e. they will no longer be available for use, but the remaining blocks will be freed.
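Before making any change, the current value of the tunable can be read with a non-destructive mdb session (a sketch; /D prints the variable as a decimal integer):

# echo "zfs_free_leak_on_eio/D" | pfexec mdb -k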

 

On the Delphix Engine, open a privileged shell and enable the flag:

$ pfbash
# echo "zfs_free_leak_on_eio/W 0t1" | pfexec mdb -kw

After setting zfs_free_leak_on_eio to 1, the 'freeing' value for domain0 should start to drop, indicating that the space is being released.
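Progress can be watched from the same shell (a minimal polling sketch; the 10-second interval is arbitrary):

# while :; do zpool get freeing domain0; sleep 10; done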
Once the 'freeing' value has reached 0, set the flag back to its default and clear the pool error counters:

# echo "zfs_free_leak_on_eio/W 0t0" | pfexec mdb -kw
# zpool clear domain0

If the error in zpool status -v has still not cleared, starting and then cancelling a scrub twice should rotate the pool's persistent error logs and discard the stale entry:

# zpool scrub domain0
# zpool scrub -s domain0
# zpool scrub domain0
# zpool scrub -s domain0

 

Then check again to ensure that zpool status -v is clean and that fmdump -eV -t 1h is no longer reporting any ZFS errors.
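For example, on the engine (-t 1h limits the report to events from the last hour):

# zpool status -v domain0
# fmdump -eV -t 1h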

 
