Delphix

TB062 Device Removal May Lead To Delphix Engine Crashes or Data Loss

 

 

Alert Type

Availability, Data Corruption

Impact

In rare instances, following one or more device removal operations, a Delphix Engine may crash and enter a persistent crash loop that prevents the affected system from restarting.

Any Delphix Virtualization jobs running at the time of the crash will be abnormally terminated. This includes, but is not limited to, Refresh, SnapSync, and Replication jobs. Virtual Databases (VDBs) running at the time of a Delphix Engine crash will hang, and may crash.

The issue may result in a protracted and continuous outage requiring Delphix Support intervention. Although unlikely, unrecoverable data corruption or data loss is possible. 

Contributing Factors

The issue can only occur when using one of the following Delphix Engine releases:

Major Release | All Sub Releases
5.3 | 5.3.0.0, 5.3.0.1, 5.3.0.2, 5.3.0.3, 5.3.1.0, 5.3.1.1, 5.3.1.2, 5.3.2.0, 5.3.3.0, 5.3.3.1, 5.3.4.0

This issue does not affect the Delphix Masking Engine. The issue can only occur during or after use of the Delphix Storage Migration feature to remove one or more storage pool devices. The device removal need not have been performed on a 5.3 version; it may have occurred at any time in the past.

The issue will only occur at the time a VDB is being deleted, either through an explicit delete operation, a VDB refresh, or Self-Service operations that perform a delete or refresh.
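The contributing factors above can be expressed as a simple check. The following is a minimal sketch, not a Delphix tool: the function names and the `device_removed` flag are hypothetical, and the version string is assumed to be supplied by the administrator in the dotted form used in this bulletin.

```python
# Sub-releases listed in this bulletin as susceptible to the issue.
AFFECTED_RELEASES = {
    "5.3.0.0", "5.3.0.1", "5.3.0.2", "5.3.0.3",
    "5.3.1.0", "5.3.1.1", "5.3.1.2",
    "5.3.2.0", "5.3.3.0", "5.3.3.1", "5.3.4.0",
}


def is_affected_release(version: str) -> bool:
    """Return True if the version string is one of the listed releases."""
    return version.strip() in AFFECTED_RELEASES


def is_at_risk(version: str, device_removed: bool) -> bool:
    """Hypothetical helper combining both contributing factors.

    device_removed: True if Delphix Storage Migration has ever been used
    to remove a storage pool device on this engine, on any past version.
    """
    return is_affected_release(version) and device_removed


if __name__ == "__main__":
    print(is_at_risk("5.3.4.0", device_removed=True))
    print(is_at_risk("5.3.4.0", device_removed=False))
```

A positive result only indicates exposure. As noted above, the crash itself is triggered when a VDB is deleted or refreshed.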

Symptoms

You may see the following error when navigating with a browser to the Delphix Engine, or when an existing Delphix Admin or Server Setup application session is disrupted by the issue:

Delphix Engine Communication Error

When an Engine crash occurs, Oracle target hosts may show messages in their system log, console, or tty output such as:

NFS server <ip address> not responding

where <ip address> is the network address of the affected Delphix Engine. Attempts to access files under the mount points of filesystems served by the Delphix Engine may hang.
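As an illustration, a target host's log lines could be scanned for this message. This is a minimal sketch under stated assumptions: the pattern follows the wording quoted in this bulletin, the exact message text varies by operating system, and the function name and sample log line are hypothetical.

```python
import re

# Assumed pattern based on the message quoted in this bulletin; real
# NFS client messages differ between operating systems.
NFS_TIMEOUT = re.compile(r"nfs:? server (\S+) not responding", re.IGNORECASE)


def find_nfs_timeouts(lines, engine_ip):
    """Return the log lines reporting engine_ip as an unresponsive NFS server."""
    matches = []
    for line in lines:
        found = NFS_TIMEOUT.search(line)
        if found and found.group(1) == engine_ip:
            matches.append(line)
    return matches


if __name__ == "__main__":
    # Hypothetical sample; in practice, read lines from the host's system log.
    sample = ["Jan  1 00:00:00 dbhost kernel: NFS server 10.0.0.5 not responding"]
    for hit in find_nfs_timeouts(sample, "10.0.0.5"):
        print(hit)
```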

On hypervisor platforms (e.g. VMware’s ESXi) where virtual console access is available, the console will show recurring and continuous crashes and reboots.

Relief/Workaround

Defer use of the Delphix Storage Migration feature. 

If the feature has already been used on one or more of the susceptible Delphix versions, open a support case with Delphix Support for additional screening. Delphix Support can use a support bundle to confirm whether the Delphix Storage Migration feature has been used and to provide a workaround.

Once the issue has occurred, the only remedy is to open a support case with Delphix Support to help recover the system.

Resolution

This issue is fully resolved in Delphix 5.3.5.0 and later releases.

Additional Information

The issue is related to a product defect that may cause a filesystem freelist to become corrupted. Depending on the extent of the corruption, there is risk of customer data being lost or corrupted as a result of the issue.
