TB015 Unexpected Reboots On Intel® Xeon® E5-2600 v2 Systems
Alert Type
Availability
Impact
Due to spurious machine faults from the underlying platform, the Delphix Engine may unexpectedly and intermittently reboot on some server platforms. Access to virtual databases (VDBs) may be temporarily suspended, and affected VDBs may hang or crash. Delphix jobs running at the time of the failure, e.g. SnapSync or Provision jobs, may fail when interrupted by a reboot.
Contributing Factors
The problem is only known to occur on platforms using the Intel® Xeon® E5 v2 Product Family of Processors (Ivy Bridge EP) released in September 2013 and the Intel® Xeon® E7 v2 Product Family of Processors (Ivy Bridge EX) released in Q1 2014.
The problem does not occur on the prior generation of Intel® Xeon® E5 processors (Sandy Bridge EP) nor on any Intel®-based servers manufactured prior to September 2013.
The problem may occur when running any version of VMware ESXi software, or any version of the Delphix Engine.
Operating systems other than the Delphix Engine's DelphixOS (including some Windows variants) running under VMware on Intel® Xeon® E5-2600 v2 are also known to be impacted.
The problem is more likely to occur with increased load on an affected ESXi server, for example if there are more than a few other VM guests active.
Symptoms
- Following an unexpected reboot, an alert will be created with the following descriptive text:
Unexpected server restart The server is starting up following an unexpected shutdown around <date>. Contact Delphix Support.
Note: this message can also occur if the Delphix Engine guest is restarted from the VMware vSphere™ Client.
- Jobs running at the time of the failure may fail with the alert:
<job_type> job for "<object>" failed due to server restart during execution
where <job_type> is the type of job that was running (for example DB_REFRESH, DB_PROVISION, or DB_SYNC) and <object> is the Delphix group and database name for which the job was being processed.
- When a reboot occurs, VDBs may experience a temporary suspension of service. SQL Server VDBs may be inaccessible until they are restarted. On affected Oracle target hosts, messages like:
NFS server <ip address> not responding
may be seen on the console or in the system log, where <ip address> is the network address of the affected Delphix Engine.
- In rare cases, multiple machine faults can result in the affected VM guest being suspended, and the vSphere client may create an alert with the following text:
Click OK to restart the virtual machine or Cancel to power off the virtual machine.
When a VMware administrator first logs in to the affected ESXi server, they will be presented with a dialog containing this text. Until then, the VM guest will remain in a suspended state.
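The NFS symptom described above can be checked for on a target host with a small filter. This is a minimal sketch, not a Delphix tool: the function name `count_nfs_timeouts`, the sample log lines, and the address 192.0.2.10 are hypothetical, and the exact message format and log location vary by operating system, so the function reads log text from standard input rather than assuming a file path.

```shell
#!/bin/sh
# Count log lines reporting that a given NFS server is not responding.
# Reads log text on stdin; $1 is the Delphix Engine address to look for.
# The message format follows the form quoted in this bulletin (assumption:
# the target host logs it in exactly this form).
count_nfs_timeouts() {
    server="$1"
    # -F: match the string literally, so dots in the address are not wildcards
    # -c: print the number of matching lines
    grep -cF "NFS server $server not responding"
}

# Hypothetical sample log lines for illustration only.
sample_log='Jan 10 03:12:01 dbhost NFS server 192.0.2.10 not responding still trying
Jan 10 03:12:05 dbhost NFS server 192.0.2.10 ok
Jan 10 03:15:44 dbhost NFS server 192.0.2.10 not responding still trying'

printf '%s\n' "$sample_log" | count_nfs_timeouts 192.0.2.10
```

A non-zero count whose timestamps line up with an "Unexpected server restart" alert suggests the target host lost its NFS mounts during the reboot.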
Relief/Workaround
- A VMware administrator enables software MMU virtualization for the affected guest machine:
- Start the VMware vSphere™ Client
- Select the IP address / Name for one of the affected ESXi servers hosting a Delphix Engine
- Enter a valid User Name and Password, then select Login
- Expand the inventory (in the left panel) for the affected ESXi server, and select a VMware guest system hosting a Delphix Engine
- In the Getting Started tab, select Edit virtual machine settings
- On the Virtual Machine Properties dialog, select the Options tab
- Select CPU/MMU Virtualization
- Select the "Use Intel® VT-x/AMD-V™ for instruction set virtualization and software for MMU virtualization" option
- Select OK
- Shut down all running VDBs. Log in to the Delphix Admin application.
For each running VDB:
- Expand the VDB panel by selecting (clicking on) it
- Select the Shutdown VDB icon (red box)
- Select Yes on the dialog asking "Are you sure you want to shutdown this VDB?"
- Log in to the Delphix Server Setup application as a user with sysadmin credentials
- Select Shutdown Delphix Engine at the top of the Server Setup page
- Select reboot
- Restart the VDBs. Log in to the Delphix Admin application.
For each VDB that was shut down earlier:
- Expand the VDB panel by selecting (clicking on) it
- Select the Startup VDB icon (green arrow)
- Select Yes on the dialog asking "Are you sure you want to startup this VDB?"
According to VMware recommendations, the use of the software MMU adds an additional 5-10% to the existing memory overhead for affected VM guests.
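As a rough illustration of that figure, the sketch below estimates the resulting overhead range for one guest. It assumes the 5-10% is applied to the guest's current memory overhead; the function name `mmu_overhead_range` and the 400 MB starting value are hypothetical, and the real overhead for a given VM should be taken from vSphere.

```shell
#!/bin/sh
# Estimate the memory overhead range after enabling the software MMU.
# $1 is the current per-VM memory overhead in MB (whole number).
# Assumption: the increase is 5% to 10% of that existing overhead.
mmu_overhead_range() {
    base_mb="$1"
    low=$(( base_mb * 105 / 100 ))    # existing overhead plus 5%
    high=$(( base_mb * 110 / 100 ))   # existing overhead plus 10%
    echo "${low}-${high} MB"
}

# Hypothetical guest with 400 MB of existing overhead.
mmu_overhead_range 400
```

The arithmetic uses integer division, so results are rounded down to whole megabytes.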
Resolution
The Intel® Xeon® Processor Product Family Specification Update documents (see links below) contain additional information about the cause of the problem:
For Intel® Xeon® Processor E5 v2 processors
See erratum CA135, "A MOV to CR3 When EPT is Enabled May Lead to an Unexpected Page Fault or an Incorrect Page Translation", in the "Intel® Xeon® Processor E5 v2 processors Product Family Specification Update" document.
For Intel® Xeon® Processor E7 v2 processors
See erratum CF124, "A MOV to CR3 When EPT is Enabled May Lead to an Unexpected Page Fault or an Incorrect Page Translation", in the "Intel® Xeon® Processor E7 v2 processors Product Family Specification Update" document.
Contact your server manufacturer for a possible BIOS update containing a fix for this issue. Alternatively, VMware has released a fix in ESXi 5.5 Update 2 (see VMware KB 2073791), so upgrading ESXi is another option. The same KB article also mentions a fix available as a vSphere Installation Bundle (VIB) for a subset of earlier ESXi versions. These VIBs can be applied if a fix from the hardware vendor is not available.
Additional Information
Identifying the processor model
It is not possible to directly identify the model of processor used from the Delphix Engine.
A VMware administrator can display the CPU hardware description by two methods:
- Using the VMware vSphere™ Client
- Using the client, connect and login to the desired ESXi server hosting one or more Delphix Engine guests
- Ensure that the ESXi server name is selected in the inventory list in the left panel
- Select the Configuration tab
- In the Hardware panel, select Processors
- Logging into the ESXi server via ssh and entering the following command:
# vim-cmd hostsvc/hosthardware|grep description
The output of this command will show the text description of the installed CPU(s):
description = "Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz",
description = "Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz",
Affected processors will be of the form "Intel(R) Xeon(R) CPU E5-xxxx v2 @ <speed>GHz" or "Intel(R) Xeon(R) CPU E7-xxxx v2 @ <speed>GHz".
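The pattern above can be applied mechanically to a description string. The following is a minimal sketch: the function names `is_affected_cpu` and `check_cpu` are hypothetical, and the regular expression is an assumption derived only from the "E5-xxxx v2" and "E7-xxxx v2" forms given in this bulletin.

```shell
#!/bin/sh
# Return success if a CPU description matches the affected E5/E7 v2 form.
is_affected_cpu() {
    # E[57]-<model> followed by " v2 @", as in "CPU E5-2697 v2 @ 2.70GHz".
    printf '%s\n' "$1" | grep -Eq 'Xeon\(R\) CPU E[57]-[0-9A-Za-z]+ v2 @'
}

# Print a one-word verdict for a description string.
check_cpu() {
    if is_affected_cpu "$1"; then
        echo "affected"
    else
        echo "not affected"
    fi
}

# The first string is taken from the example output above; the second is an
# illustrative prior-generation (Sandy Bridge EP) description.
check_cpu 'Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz'
check_cpu 'Intel(R) Xeon(R) CPU E5-2690 0 @ 2.90GHz'
```

Each description line returned by the vim-cmd command can be passed to `check_cpu` in turn. A host is of concern if any installed CPU is reported as affected.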
Useful Links
Performance Evaluation of Intel EPT Hardware Assist (external link)
Change CPU/MMU Virtualization Settings in the vSphere Web Client (external link)
Overhead Memory on Virtual Machines (external link)
Intel® Xeon® Processor E5-2600 v2 Product Family (external link)
Intel® Xeon® Processor E7 v2 family (external link)
Intel® Xeon® Processor E5 v2 processors Product Family Specification Update (external link)
Intel® Xeon® Processor E7 v2 processors Product Family Specification Update (external link)
HP Support Center Document c04327904: System ROM Update RECOMMENDED to Prevent a BSOD or Kernel Panic in Virtual Machines (external link)