
How to Generate a Non-Maskable Interrupt (NMI) in VMware ESX (KBA1129)



There are two typical sets of symptoms when an Engine becomes unresponsive:

1. The Delphix Engine NFS mounts are not responsive, and all target host I/O operations fail. The web interface will not load, and attempts to log in via SSH as any Delphix user are unsuccessful.

2. The Delphix Engine NFS mounts are responsive, and VDB operations are not disrupted. The web interface will not load, and attempts to log in via SSH as any Delphix user are unsuccessful.

In both conditions, the hypervisor still indicates the virtual machine (VM) is running, and ping may return successfully.

In condition 1, a non-maskable interrupt (NMI) may be sent from the hypervisor to cause the Delphix operating system (DxOS) to kernel panic and generate a crash dump. The resulting crash dump can be collected by Delphix Support for further analysis.

If there is no response to the NMI on the VM's console, retry the procedure. The final recourse is to reset or power-cycle the VM, which will not generate a crash dump and therefore reduces the potential for root cause analysis.

Note that this procedure will not be successful in all cases. A VM may become unresponsive for a variety of reasons related to the guest operating system or the ESX hypervisor. The following procedure is a best-effort attempt to collect system state information at the time the VM became unresponsive.

In condition 2, a VMware snapshot should be attempted instead of, or prior to, an NMI.

NMI Procedure, VMware ESX 5.x, 6.x

  1. Log in to ESX via SSH.
  2. At the prompt, run 'esxcli vm process list' to list the running VMs, and record the World ID of the Delphix Engine.

  3. Once the World ID is obtained, run "vmdumper <world id> nmi" to send the NMI. In the example below, the Delphix Engine name is known to be "example5023", so the name can be passed to "grep" to reduce the output to the relevant VM.

    Example of esxcli and vmdumper commands
    ~ # esxcli vm process list | grep -A 10 example
       World ID: 5678754
       Process ID: 0
       VMX Cartel ID: 5678753
       UUID: 56 4d 67 e6 38 d9 70 27-b9 06 56 4c 77 a9 5b 9d
       Display Name: delphix5023
       Config File: /vmfs/volumes/6c25682a-d47ef09e/dlpx-
    ~ # vmdumper 5678754 nmi
    Sending NMI to guest...
    ~ #

    No output beyond "Sending NMI to guest..." is expected.  The command prompt should typically return within a few seconds.

  4. Other methods are detailed in VMware KB article How to send NMI to Guest OS on ESXi 6.x (2149185).
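
The World ID lookup in steps 2 and 3 can also be scripted. The following is a minimal sketch, not a supported tool: it assumes the 'esxcli vm process list' output has the "World ID:" and "Display Name:" fields shown in the example above, and the function name world_id_for is hypothetical.

```shell
# Print the World ID of the VM whose "Display Name" equals $1.
# Reads `esxcli vm process list` output on stdin (assumed field layout
# as in the example above: "World ID:" appears before "Display Name:").
world_id_for() {
    awk -v name="$1" '
        /^[ \t]*World ID:/     { wid = $NF }
        /^[ \t]*Display Name:/ {
            sub(/^[ \t]*Display Name:[ \t]*/, "")
            if ($0 == name) print wid
        }
    '
}

# Hypothetical usage on the ESX host (replace the name with your Engine's):
#   wid=$(esxcli vm process list | world_id_for "example5023")
#   [ -n "$wid" ] && vmdumper "$wid" nmi
```

Matching on the exact display name rather than a substring avoids sending the NMI to a different VM with a similar name.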

NMI Procedure, VMware ESX 4

NMIs can only be sent on ESX from the SSH command-line. First, attempt to connect to the ESX system via SSH. If that fails, enable SSH using the following sequence of steps:

  1. On the ESX system's console: Press ALT-F1 and a console log should be displayed. Type 'unsupported' to access the VMware "Tech Support Mode".
    The text entered will not be visible.  
  2. A password prompt should appear. Enter the root password to gain CLI access.
  3. Edit /etc/inetd.conf with vi, search for the line beginning with #ssh, and uncomment that line.

    ~ # vi /etc/inetd.conf
  4. Identify the inetd process ID (pid) using the command ps | grep inetd.

    Find inetd process id
    ~ # ps | grep inetd
    1541 1541 busybox		inetd
  5. Signal the inetd process to reload its configuration by running kill -HUP <pid> with the pid obtained in the previous step.

    Restart inetd
    ~ # kill -HUP 1541
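
As an alternative to editing the file by hand, steps 3 through 5 can be sketched as two small shell helpers. This is an illustration under assumptions: the function names enable_ssh_line and inetd_pid are hypothetical, the ps output is assumed to look like the example above, and the availability of sed and awk in Tech Support Mode on a given ESX 4 build is not confirmed here.

```shell
# Uncomment every line beginning with "#ssh" in the file given as $1.
# The file is rewritten in place only if sed succeeds.
enable_ssh_line() {
    conf="$1"
    tmp="${conf}.tmp.$$"
    sed 's/^#ssh/ssh/' "$conf" > "$tmp" && cat "$tmp" > "$conf"
    rm -f "$tmp"
}

# Print the pid (first column) of the first `ps` line whose last field
# is "inetd". Reads `ps` output on stdin.
inetd_pid() {
    awk '$NF == "inetd" { print $1; exit }'
}

# Hypothetical usage from Tech Support Mode:
#   enable_ssh_line /etc/inetd.conf
#   kill -HUP "$(ps | inetd_pid)"
```

Selecting the line by its last field, rather than with grep, avoids matching the grep process itself.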

To send the NMI, complete the following:

  1. Log in to ESX via SSH.
  2. At the ESX command prompt, run vm-support -x to list the VMs, and note the vmid belonging to the Delphix Engine.

    "vm-support -x" example
    [root@esxserver ~]#  vm-support -x
    VMware ESX Support Script 1.30
    Available worlds to debug:
    vmid=4305		RHEL Oracle Source
    vmid=4308		Delphix Engine
    vmid=4309		RHEL Dev Target
  3. Once the vmid is collected from the previous step, use vmdumper to generate the NMI:

    Generate NMI with "vmdumper"
    [root@esxserver ~]# vmdumper 4308 nmi 
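
The vmid lookup above can be sketched as a small helper. This assumes the "vmid=<number>" followed by the VM name layout shown in the vm-support -x example; the function name vmid_for is hypothetical.

```shell
# Print the vmid of the VM whose name equals $1.
# Reads `vm-support -x` output on stdin (assumed layout: "vmid=N<tabs>Name").
vmid_for() {
    awk -v name="$1" '
        /^[ \t]*vmid=/ {
            id = $1
            sub(/^vmid=/, "", id)
            label = $0
            sub(/^[ \t]*vmid=[0-9]+[ \t]+/, "", label)
            if (label == name) print id
        }
    '
}

# Hypothetical usage on the ESX 4 host:
#   vmid=$(vm-support -x | vmid_for "Delphix Engine")
#   [ -n "$vmid" ] && vmdumper "$vmid" nmi
```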

Advice from VMware

VMware has a number of knowledge base articles of their own relating to diagnosing unresponsive VMs and generating NMIs for those systems. The following articles are especially relevant to the processes discussed above. 

Monitoring Delphix VM Console during NMI

During the NMI process, it is helpful to observe (and record, if possible) the Delphix VM console to confirm that the NMI is received and to watch the DxOS panic and reboot. Details of the output may vary, but a DxOS panic marked "NMI received" is expected in every instance if the operation is successful. If the console indicates the dump has reached 100%, the desired diagnostic information should be retrievable by Delphix Support.

Example Delphix VM console output
panic[cpu0]/thread=ffffff000b805c40: NMI received
ffffff000b805aa0 fffffffff791f57f ()
ffffff000b805ad0 unix:av_dispatch_nmivect+34 ()
ffffff000b805ae0 unix:nmiint+152 ()
ffffff000b805bd0 unix:mach_cpu_idle+6 ()
ffffff000b805c00 unix:cpu_idle+11a ()
ffffff000b805c20 unix:idle+a7 ()
ffffff000b805c30 unix:thread_start+8 ()
syncing file systems... done
dumping to /dev/zvol/dsk/rpool/dump, offset 65536, content: kernel
 0:03 100% done
100% done: 119831 pages dumped, dump succeeded

If any issue is encountered during the DxOS panic dump, the process may take up to 2 hours to time out. If "NMI received" is NOT observed on the console, the Engine may still reboot, but diagnostic information may not be generated because the Engine did not register the NMI. If the console shows messages and then stops responding for a long period (more than 30 minutes), particularly while syncing file systems or while the number of pages dumped is incrementing, the dump is unlikely to complete. Observe the VM's activity from the hypervisor to decide whether to give it additional time or to proceed with resetting the VM.
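
If the console output has been transcribed or captured to a text file, the checks described above can be sketched as a simple classifier. This is an illustration only: the function name nmi_console_status and the file name console.txt are hypothetical, and the two marker strings are taken from the example console output above.

```shell
# Classify saved console text read from stdin. Prints one of:
#   dump-succeeded  ("NMI received" and "dump succeeded" both seen)
#   nmi-received    (panic seen, but no completed dump yet)
#   no-nmi          (the NMI was not registered by the Engine)
nmi_console_status() {
    awk '
        /NMI received/   { nmi = 1 }
        /dump succeeded/ { ok = 1 }
        END {
            if (nmi && ok) print "dump-succeeded"
            else if (nmi)  print "nmi-received"
            else           print "no-nmi"
        }
    '
}

# Hypothetical usage:
#   nmi_console_status < console.txt
```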

VMware Snapshot

VMware provides instructions on generating a VM snapshot in the following knowledge document:

During this operation, it is imperative that the "Snapshot the virtual machine's memory" option is selected so that the live VM memory state is captured. The hang condition may prevent the Engine from writing diagnostic information to its own filesystems, so without the memory state, root cause analysis may not be possible.
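
For hosts where only the ESX command line is available, a memory-inclusive snapshot can be sketched with vim-cmd. This is an assumption-laden sketch rather than part of the documented procedure: the helper name memory_snapshot_cmd and the snapshot description are hypothetical, and the VM ID expected by vim-cmd is the one listed by 'vim-cmd vmsvc/getallvms', which is not the World ID used by vmdumper. The helper only prints the command so it can be reviewed before being run.

```shell
# Print (do not run) a vim-cmd command that snapshots a VM with its memory.
# $1 = VM ID as listed by `vim-cmd vmsvc/getallvms`, $2 = snapshot name.
memory_snapshot_cmd() {
    # Positional arguments after the name: description,
    # includeMemory=1, quiesced=0.
    printf 'vim-cmd vmsvc/snapshot.create %s "%s" "%s" 1 0\n' \
        "$1" "$2" "pre-NMI diagnostic snapshot"
}

# Hypothetical usage on the ESX host:
#   vim-cmd vmsvc/getallvms          # find the VM ID of the Delphix Engine
#   memory_snapshot_cmd 12 hang-diag # review the printed command, then run it
```

Quiescing is left disabled in this sketch on the assumption that an unresponsive guest cannot cooperate with a quiesce request.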