Delphix

TB049 Delphix Engine May Restart Unexpectedly After Application of ESXi Patches

 

 

 

Alert Type

Availability

Impact

The Delphix Engine becomes unresponsive and reboots. All jobs running at the time terminate abnormally at the application level, including but not limited to VDB Provision, Refresh, SnapSync, and Replication. VDBs served from the Delphix Engine also suffer an outage. The problem may recur intermittently.

Contributing Factors

This issue may occur when using any version of the Delphix Engine, including but not limited to:

 

Major Release    All Sub Releases

5.2              5.2.2.0, 5.2.3.0

5.1              5.1.0.0, 5.1.1.0, 5.1.2.0, 5.1.3.0, 5.1.4.0, 5.1.5.0, 5.1.5.1, 5.1.6.0, 5.1.7.0, 5.1.8.0, 5.1.8.1, 5.1.9.0

5.0              5.0.1.0, 5.0.1.1, 5.0.2.0, 5.0.2.1, 5.0.2.2, 5.0.2.3, 5.0.3.0, 5.0.3.1, 5.0.4.0, 5.0.4.1, 5.0.5.0, 5.0.5.1, 5.0.5.2, 5.0.5.3, 5.0.5.4

 

The issue has been observed only on VMware ESXi hypervisor versions 5.5 and 6.0, but may occur with other versions of ESXi.

The issue is related to, but not limited to, the application of one or more of the following ESXi patch sets:

VMware ESXi 6.0, Patch Release ESXi600-201808001 (56548)
VMware ESXi 6.0, Patch Release ESXi600-201809001 (57797)
VMware ESXi 6.0, Patch Release ESXi600-201807401-BG (53629)
VMware ESXi 5.5, Patch Release ESXi550-201809001 (57478)

 

The problem may occur under any network load, but is more likely when hardware checksum offload is enabled and the following UDP-based network services are in use:

  • DHCP

  • SNMP

Symptoms

  • You may see

    Delphix Engine Communication Error

    when navigating with a browser to the Delphix Engine or when an existing Delphix Admin or Server Setup application is disrupted by the issue.

  • A Delphix Engine alert and accompanying email may occur with the following text:

    The server is starting up following an unexpected shutdown around <date>.

    or

    The management service is starting up following an unexpected shutdown around <date>.

  • Jobs or Actions running at the time of failure will terminate abnormally. A Delphix Engine alert and accompanying email will be issued with the following text:

    <job_type> for "<object>" failed due to an error during execution: <job_type> for <object> failed due to server restart during execution.

    where <job_type> is the type of job that was running (for example, DB_REFRESH, DB_PROVISION, or DB_SYNC) and <object> identifies the Delphix group and database for which the job was being processed.
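
When reviewing exported alert text or forwarded alert emails, the messages quoted above can be matched mechanically. The following is a minimal sketch in Python, assuming the alerts are available as plain-text lines. The function name find_restart_alerts and the pattern list are hypothetical, and the patterns are taken only from the wording quoted in this article, so the exact text in a given release may differ.

```python
import re

# Patterns built from the alert wording quoted in this article.
# Assumption: the alerts are available as plain-text lines; the exact
# wording may vary between Delphix Engine releases.
RESTART_PATTERNS = [
    re.compile(
        r"The (?:server|management service) is starting up "
        r"following an unexpected shutdown"
    ),
    re.compile(r"failed due to server restart during execution"),
]


def find_restart_alerts(lines):
    """Return the lines that look like unexpected-restart alerts."""
    return [
        line
        for line in lines
        if any(pattern.search(line) for pattern in RESTART_PATTERNS)
    ]
```

Passing a list of alert lines returns only those that mention an unexpected shutdown or a job that failed because of a server restart; everything else is ignored.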

Relief/Workaround

Defer installation of the cited ESXi patch sets or uninstall affected patches if already installed.

Resolution

The issue is fully resolved in Delphix Engine releases 5.2.6.2 and 5.3.0.1, and in all later releases.
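
To check a list of engines against the fixed releases, the version comparison can be sketched as follows. This is an illustration only: the helper names parse_version and has_fix are hypothetical, and the logic assumes that a 5.2.x engine needs at least 5.2.6.2 while any other engine needs at least 5.3.0.1, which is one reading of "and later releases".

```python
# First fixed releases named in the Resolution section of this article.
FIX_5_2 = (5, 2, 6, 2)
FIX_5_3 = (5, 3, 0, 1)


def parse_version(text):
    """Turn a dotted version string such as '5.1.9.0' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def has_fix(version_text):
    """Return True if the given Delphix Engine version includes the fix.

    Assumption: 5.2.x releases are fixed from 5.2.6.2 onward; every other
    release line is fixed from 5.3.0.1 onward (so 5.0.x and 5.1.x are not).
    """
    version = parse_version(version_text)
    if version[:2] == (5, 2):
        return version >= FIX_5_2
    return version >= FIX_5_3
```

For example, has_fix("5.1.9.0") and has_fix("5.2.3.0") are False, while has_fix("5.2.6.2") and has_fix("5.3.0.1") are True.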

Additional Information

Changes in the way ESXi emulates the vmxnet3 device cause the issue. These changes improperly process certain network packets that have hardware checksum offload enabled and are split into multiple fragments when passed to the transmission queue. Network traffic from some UDP/IP-based applications triggers the issue.
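
For background, "hardware checksum offload" means the guest leaves the packet checksum to be filled in by the (here, emulated) network device instead of computing it in software. The sketch below shows the generic Internet checksum (RFC 1071) that such offload replaces. It is illustrative only: it does not reproduce the ESXi defect, it omits the pseudo-header that a real UDP checksum also covers, and the function name internet_checksum is hypothetical.

```python
def internet_checksum(data: bytes) -> int:
    """Compute the 16-bit one's-complement Internet checksum (RFC 1071)."""
    # Pad to an even number of bytes so the data splits into 16-bit words.
    if len(data) % 2:
        data += b"\x00"

    total = 0
    for i in range(0, len(data), 2):
        # Add each big-endian 16-bit word to the running sum.
        total += (data[i] << 8) | data[i + 1]

    # Fold any carry bits above 16 bits back into the low 16 bits.
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)

    # The checksum is the one's complement of the folded sum.
    return ~total & 0xFFFF
```

A receiver verifies a packet by running the same computation over the data with the checksum included; a result of zero means the checksum matches.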

Delphix has reproduced the issue with the cited ESXi versions and patch levels and is testing other ESXi configurations. This article will be updated as more information becomes available.