Delphix

TB015 Unexpected Reboots On Intel® Xeon® E5-2600 v2 Systems

 

 

Alert Type

Availability

Impact

Due to spurious machine faults from the underlying platform, the Delphix Engine may unexpectedly and intermittently reboot on some server platforms. Access to virtual databases (VDBs) may be temporarily suspended, and affected VDBs may hang or crash. Delphix jobs running at the time of the failure, e.g. SnapSync or Provision jobs, may fail when interrupted by a reboot. 

Contributing Factors

The problem is only known to occur on platforms using the Intel® Xeon® E5 v2 Product Family of Processors (Ivy Bridge EP), released in September 2013, and the Intel® Xeon® E7 v2 Product Family of Processors (Ivy Bridge EX), released in Q1 2014.

The problem does not occur on the prior generation of Intel® Xeon® E5 processors (Sandy Bridge EP), nor on any Intel®-based servers manufactured prior to September 2013.

The problem may occur when running any version of VMware ESXi software, or any version of the Delphix Engine.

Operating systems other than the Delphix Engine's DelphixOS (including some Windows variants) running under VMware on Intel® Xeon® E5-2600 v2 are also known to be impacted.

The problem is more likely to occur with increased load on an affected ESXi server, for example if there are more than a few other VM guests active.  

Symptoms

  • Following an unexpected reboot, an alert will be created with the following descriptive text:

    	Unexpected server restart    The server is starting up following an unexpected shutdown around <date>. Contact Delphix Support.

    Note: This message can also occur if the Delphix Engine guest is restarted from the VMware vSphere™ Client.
     

  • Jobs running at the time of the failure may fail with the alert:

    	<job_type> job for "<object>" failed due to server restart during execution 

    where <job_type> is the type of job that was running (for example DB_REFRESH, DB_PROVISION, or DB_SYNC) and <object> identifies the Delphix group and database for which the job was being processed.
     

  • When a reboot occurs, VDBs may experience a temporary suspension of service. SQL Server VDBs may be inaccessible until they are restarted. On affected Oracle target hosts, messages like:

    	NFS server <ip address>  not responding

    may be seen on the console or in the system log, where <ip address> is the network address of the affected Delphix server. 

  • In rare cases, multiple machine faults can result in the affected VM guest being suspended, and the vSphere client may create an alert with the following text:

    	Click OK to restart the virtual machine or Cancel to power off the virtual machine. 

    When a VMware administrator first logs into the affected ESXi server they will be presented with a dialog with this text. Until then, the VM guest will remain in a suspended state. 
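
On an Oracle target host, the NFS messages described above can be counted with a short shell function. This is a minimal sketch rather than a Delphix-supplied tool: the function name count_nfs_timeouts is hypothetical, it assumes GNU grep (for the -w option), and it reads log text from standard input because the location of the system log varies by operating system.

```shell
#!/bin/sh
# count_nfs_timeouts DELPHIX_IP
# Hypothetical helper: reads log lines on standard input and prints how many
# of them mention the given address together with "not responding".
count_nfs_timeouts() {
    # -F treats the address as a fixed string (dots are not wildcards);
    # -w requires a whole-word match, so 192.0.2.1 does not match 192.0.2.10.
    grep -Fw -- "$1" | grep -ci 'not responding'
}

# Usage, with a placeholder address and a placeholder log path:
#   count_nfs_timeouts 192.0.2.10 < /path/to/system/log
```

A non-zero count around the time of the "Unexpected server restart" alert is consistent with the VDB service interruption described in this bulletin, but it does not by itself identify the cause.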

Relief/Workaround

  1. A VMware administrator enables software MMU virtualization for the affected guest machine:
    1. Start the VMware vSphere™ Client
    2. Select the IP address / Name for one of the affected ESXi servers hosting a Delphix Engine
    3. Enter a valid User Name and Password, then select Login
    4. Expand the inventory (in the left panel)  for the affected ESXi server, and select a VMware guest system hosting a Delphix Engine
    5. In the Getting Started tab, select Edit virtual machine settings
    6. On the Virtual Machine Properties dialog, select the Options tab
    7. Select CPU/MMU Virtualization
    8. Select the "Use Intel® VT-x/AMD-V™ for instruction set virtualization and software for MMU virtualization" option
    9. Select OK
  2. Shut down all running VDBs. Log in to the Delphix Admin application.
    For each running VDB:
    1. Expand the VDB panel by selecting (clicking on) it
    2. Select the Shutdown VDB icon (red box)
    3. Select Yes on the dialog asking "Are you sure you want to shutdown this VDB?"
  3. Log in to the Delphix Server Setup application as a user with sysadmin credentials
    1. Select Shutdown Delphix Engine at the top of the Server Setup page
    2. Select reboot
  4. Restart the VDBs. Log in to the Delphix Admin application.
    For each VDB that was stopped in Step 2:
    1. Expand the VDB panel by selecting (clicking on) it
    2. Select the Startup VDB icon (green arrow)
    3. Select Yes on the dialog asking "Are you sure you want to startup this VDB?"

According to VMware recommendations, the use of the software MMU adds an additional 5-10% to the existing memory overhead for affected VM guests. 

Resolution

The Intel® Xeon® Processor Product Family Specification Update documents (see links below) contain additional information about the cause of the problem:

For Intel® Xeon® Processor E5 v2 processors

See the information for erratum CA135, "A MOV to CR3 When EPT is Enabled May Lead to an Unexpected Page Fault or an Incorrect Page Translation", in the "Intel® Xeon® Processor E5 v2 Product Family Specification Update" document.

For Intel® Xeon® Processor E7 v2 processors

See the information for erratum CF124, "A MOV to CR3 When EPT is Enabled May Lead to an Unexpected Page Fault or an Incorrect Page Translation", in the "Intel® Xeon® Processor E7 v2 Product Family Specification Update" document.

 

Contact your server manufacturer for a possible BIOS update containing a fix for this issue. Alternatively, VMware has released a fix in ESXi 5.5 Update 2 (see VMware KB 2073791), so upgrading ESXi is another option. The same KB article also mentions that a fix is available as a vSphere Installation Bundle (VIB) for a subset of earlier ESXi versions. These VIBs can be applied if a fix from the hardware vendor is not available.

Additional Information

Identifying the processor model

The processor model cannot be identified directly from within the Delphix Engine.

A VMware administrator can display the CPU hardware description by two methods:

  1. Using the VMware vSphere™ Client
    1. Using the client, connect and login to the desired ESXi server hosting one or more Delphix Engine guests
    2. Ensure that the ESXi server name is selected in the inventory list in the left panel
    3. Select the Configuration tab
    4. In the Hardware panel, select Processors
  2. Logging into the ESXi server via ssh and entering the following command:

    # vim-cmd hostsvc/hosthardware|grep description

    The output of this command will show the text description of the installed CPU(s):

             description = "Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz", 
             description = "Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz", 

     

Affected processors will be of the form "Intel(R) Xeon(R) CPU E5-xxxx v2 @ <speed>GHz" or "Intel(R) Xeon(R) CPU E7-xxxx v2 @ <speed>GHz".
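
The check against these two forms can be scripted. The following is a minimal sketch under stated assumptions: the function name is_affected_cpu is hypothetical, the pattern assumes a four-digit model number optionally followed by letters (for example E5-2650L), and it relies on a grep that supports the -E and -q options.

```shell
#!/bin/sh
# is_affected_cpu DESCRIPTION
# Hypothetical helper: succeeds (exit status 0) when the CPU description
# matches the "E5-xxxx v2" or "E7-xxxx v2" forms listed in this bulletin.
is_affected_cpu() {
    printf '%s\n' "$1" |
        grep -Eq 'Xeon\(R\) CPU E[57]-[0-9]{4}[A-Z]* v2 @'
}

# Sample description line taken from the command output shown above.
sample='description = "Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz",'
if is_affected_cpu "$sample"; then
    echo "affected"
else
    echo "not affected"
fi
```

On an ESXi host, the same pattern could be applied directly to the command output, for example `vim-cmd hostsvc/hosthardware | grep description | grep -E 'E[57]-[0-9]{4}[A-Z]* v2 @'`. A match only shows that the processor belongs to an affected family, not that the host has experienced the fault.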