There are two typical sets of symptoms when an Engine becomes unresponsive:
1. The Delphix Engine NFS mounts are not responsive, and all target host IO operations fail. The web interface will not load, and attempts to log in via SSH as any Delphix user are unsuccessful.
2. The Delphix Engine NFS mounts are responsive, and VDB operations are not disrupted. The web interface will not load, and attempts to log in via SSH as any Delphix user are unsuccessful.
In both conditions, the hypervisor still indicates the virtual machine (VM) is running, and ping may return successfully.
In condition 1, a non-maskable interrupt (NMI) may be sent from the hypervisor to cause the Delphix operating system (DxOS) to kernel panic and generate a crash dump. The resulting crash dump can be collected by Delphix Support for further analysis.
If there is no response to the NMI on the VM's console, retry the procedure. The final recourse is to reset or power cycle the system, which will not generate a crash dump and therefore reduces the potential for root cause analysis.
Note that this procedure will not be successful in all cases. Unresponsive VM situations may occur for a variety of reasons related to the guest operating system or to ESX hypervisor issues. The following procedure is a best-effort attempt to collect system state information at the time the VM becomes unresponsive.
In condition 2, a VMware snapshot should be attempted instead of, or prior to, an NMI.
NMI Procedure, VMware ESX 5.x, 6.x
- Log in to ESX via SSH.
- At the prompt, use the command 'esxcli vm process list' to get the list of VMs and record the World ID.
- Once the World ID is obtained, execute the following command to initiate the NMI: "vmdumper <world id> nmi". In the example below, the Delphix Engine name is known to be "example5023", which can be used with "grep" to reduce output for parsing.
Example of esxcli and vmdumper commands
~ # esxcli vm process list | grep -A 10 example
example5023
   World ID: 5678754
   Process ID: 0
   VMX Cartel ID: 5678753
   UUID: 56 4d 67 e6 38 d9 70 27-b9 06 56 4c 77 a9 5b 9d
   Display Name: delphix5023
   Config File: /vmfs/volumes/6c25682a-d47ef09e/dlpx-220.127.116.11-55/dlpx-18.104.22.168-55.vmx
~ # vmdumper 5678754 nmi
Sending NMI to guest...
~ #
No output beyond "Sending NMI to guest..." is expected. The command prompt should typically return within a few seconds.
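The World ID lookup above can also be scripted rather than read by eye. The following is a minimal sketch, not a Delphix or VMware tool: `world_id_for` is a hypothetical helper name, and it assumes the output layout shown in the example (a "World ID:" line preceding the "Display Name:" line for each VM) and that `awk` is available in the shell being used.

```shell
#!/bin/sh
# Hypothetical helper (not part of ESXi): reads `esxcli vm process list`
# output on stdin and prints the World ID of the VM whose Display Name
# equals the first argument.
world_id_for() {
    awk -v name="$1" '
        /^ *World ID:/ { wid = $NF }         # remember the most recent World ID
        /^ *Display Name:/ {
            sub(/^ *Display Name: */, "")    # keep only the name itself
            if ($0 == name) print wid
        }
    '
}

# Demonstration against sample text in the same layout (no ESXi host needed).
sample='example5023
   World ID: 5678754
   Process ID: 0
   VMX Cartel ID: 5678753
   Display Name: delphix5023'
printf '%s\n' "$sample" | world_id_for delphix5023
```

On a host, this might be used as `wid=$(esxcli vm process list | world_id_for delphix5023)` followed by `vmdumper "$wid" nmi`. Confirm the captured World ID before sending the NMI, since it will panic whichever guest it names.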
Other methods are detailed in VMware KB article How to send NMI to Guest OS on ESXi 6.x (2149185).
NMI Procedure, VMware ESX 4
NMIs can only be sent on ESX from the SSH command-line. First, attempt to connect to the ESX system via SSH. If that fails, enable SSH using the following sequence of steps:
- On the ESX system's console: Press ALT-F1 and a console log should be displayed. Type 'unsupported' to access the VMware "Tech Support Mode". The text entered will not be visible.
- A password prompt should appear and the root password may be entered to gain CLI access.
- Edit /etc/inetd.conf, search for the line beginning with #ssh, and uncomment the line
~ # vi /etc/inetd.conf
- Find the inetd process id (pid) using the command ps | grep inetd
Find inetd process id
~ # ps | grep inetd
1541 1541 busybox inetd
- Restart the inetd process with kill -HUP <pid>, using the pid obtained in the previous step
Restart inetd
~ # kill -HUP 1541
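The inetd steps above can be sketched as two small filters. This is an illustrative sketch only: `uncomment_ssh` and `inetd_pid` are hypothetical names, the inetd.conf line in the demonstration is a placeholder rather than the real entry, and it assumes `sed` and `awk` are available in Tech Support Mode and that `ps` output has the column layout shown above.

```shell
#!/bin/sh
# Hypothetical helpers, assuming busybox-style sed and awk are available.

# Uncomment any line beginning with "#ssh" (reads stdin, writes stdout).
uncomment_ssh() {
    sed 's/^#ssh/ssh/'
}

# Print the pid (first column) of the first line whose last field is "inetd".
inetd_pid() {
    awk '$NF == "inetd" { print $1; exit }'
}

# Demonstration with sample text only; nothing on the system is modified.
# The inetd.conf entry below is a placeholder, not the real line.
printf '%s\n' '#ssh stream tcp nowait root <ssh daemon command>' | uncomment_ssh
printf '%s\n' '1541 1541 busybox inetd' | inetd_pid
```

On the host, one cautious approach would be to write `uncomment_ssh < /etc/inetd.conf` to a temporary file, review it, copy it into place, and then run `kill -HUP "$(ps | inetd_pid)"`.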
To send the NMI, complete the following:
- Log in to ESX via SSH
- At the ESX command prompt, use the command vm-support -x to get the list of VMs and note the vmid belonging to the Delphix Engine
"vm-support -x" example
[root@esxserver ~]# vm-support -x
VMware ESX Support Script 1.30
Available worlds to debug:
vmid=4305 RHEL Oracle Source
vmid=4308 Delphix Engine
vmid=4309 RHEL Dev Target
- Once the vmid is collected from the previous step, use vmdumper to generate the NMI:
Generate NMI with "vmdumper"
[root@esxserver ~]# vmdumper 4308 nmi
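As a rough sketch of the vmid lookup above, the following filter picks the vmid for a named VM out of `vm-support -x` output. `vmid_for` is a hypothetical helper name, and it assumes each VM appears on its own line as `vmid=<number>` followed by the VM's label, as in the example.

```shell
#!/bin/sh
# Hypothetical helper: reads `vm-support -x` output on stdin and prints the
# vmid whose label (the text after "vmid=<n>") equals the first argument.
vmid_for() {
    awk -v name="$1" '
        /^[ \t]*vmid=[0-9]+/ {
            id = $1; sub(/^vmid=/, "", id)               # "vmid=4308" -> "4308"
            label = $0
            sub(/^[ \t]*vmid=[0-9]+[ \t]*/, "", label)   # drop the vmid column
            sub(/[ \t]+$/, "", label)                    # drop trailing blanks
            if (label == name) print id
        }
    '
}

# Demonstration against sample text in the same layout (no ESX host needed).
printf '%s\n' \
    'Available worlds to debug:' \
    'vmid=4305 RHEL Oracle Source' \
    'vmid=4308 Delphix Engine' \
    'vmid=4309 RHEL Dev Target' | vmid_for 'Delphix Engine'
```

On a host, the result could feed the NMI step, for example `vmdumper "$(vm-support -x | vmid_for 'Delphix Engine')" nmi`. Check the vmid first, because the NMI will panic whichever guest it is sent to.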
Advice from VMware
VMware has a number of knowledge base articles of their own relating to diagnosing unresponsive VMs and generating NMIs for those systems. The following articles are especially relevant to the processes discussed above.
Monitoring Delphix VM Console during NMI
During the NMI process, it is helpful to observe (and record if possible) the Delphix VM console behavior to confirm the NMI is received, and to observe the DxOS panic and reboot. Details of the output may vary but the DxOS panic indicated with "NMI received" is expected in every instance if the operation is successful. If the console indicates the dump has reached 100%, the desired diagnostic information should be retrievable by Delphix Support.
panic[cpu0]/thread=ffffff000b805c40: NMI received

ffffff000b805aa0 fffffffff791f57f ()
ffffff000b805ad0 unix:av_dispatch_nmivect+34 ()
ffffff000b805ae0 unix:nmiint+152 ()
ffffff000b805bd0 unix:mach_cpu_idle+6 ()
ffffff000b805c00 unix:cpu_idle+11a ()
ffffff000b805c20 unix:idle+a7 ()
ffffff000b805c30 unix:thread_start+8 ()

syncing file systems... done
dumping to /dev/zvol/dsk/rpool/dump, offset 65536, content: kernel
0:03 100% done
100% done: 119831 pages dumped, dump succeeded
rebooting...
If any issue is encountered during the DxOS panic dump, the process may take up to 2 hours to time out. If the "NMI received" message is NOT observed on the console, the Engine may still reboot, but this indicates that the NMI was not registered by the Engine and diagnostic information may not have been generated. If the console is displaying messages and then stops responding for a long period (more than 30 minutes), particularly while syncing file systems or while the number of pages dumped is incrementing, the dump is unlikely to complete. Observe the VM from the hypervisor to determine whether there is enough activity to warrant giving it additional time, or whether to proceed with resetting the VM.
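If the console output has been captured as text, the checks described above can be sketched as a simple classifier. Capturing the console to a text file is an assumption here and is not described in this article, and `nmi_dump_status` is a hypothetical helper name. The sketch only looks for the "NMI received" and "dump succeeded" strings shown in the example console output.

```shell
#!/bin/sh
# Hypothetical helper: classify captured console text read from stdin.
nmi_dump_status() {
    log=$(cat)
    case $log in
        *"NMI received"*"dump succeeded"*) echo "dump succeeded" ;;
        *"NMI received"*)                  echo "NMI received, dump not confirmed" ;;
        *)                                 echo "NMI not observed" ;;
    esac
}

# Demonstration with lines taken from the example console output.
printf '%s\n' \
    'panic[cpu0]/thread=ffffff000b805c40: NMI received' \
    'syncing file systems... done' \
    '100% done: 119831 pages dumped, dump succeeded' | nmi_dump_status
```

A result of "NMI received, dump not confirmed" corresponds to the case where the dump may still be in progress or may have stalled, so the timing guidance above still applies.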
VMware provides instructions on generating a VM snapshot in the following knowledge document:
During this operation, it is imperative that the "Snapshot the virtual machine's memory" option is selected so that the live VM memory state is captured. The VM hang condition may prevent diagnostic information from being written to the Engine filesystems, and without the memory state root cause analysis may not be possible.