There are two typical sets of symptoms when an Engine becomes unresponsive:
1. The Delphix Engine NFS mounts are not responsive, and all target host I/O operations fail. The web interface will not load, and SSH login attempts fail for every Delphix Admin or self-service user. No login prompt is presented when an SSH connection is attempted.
2. The Delphix Engine NFS mounts are responsive, and VDB operations are not disrupted. The web interface will not load, and SSH login attempts fail for every Delphix Admin or self-service user. A login prompt is received when an SSH connection is attempted, but the login fails: no password prompt follows the entry of a username.
In both conditions, the hypervisor still indicates the virtual machine (VM) is running, and ping may return successfully. Memory and CPU utilization may be variable, or the VM may indicate no activity.
In condition 1, a non-maskable interrupt (NMI) may be sent from the hypervisor to cause the Delphix operating system (DxOS) to kernel panic and generate a crash dump. The resulting crash dump can be collected by Delphix Support for further analysis.
If there is no response to the NMI on the VM's console, retry the procedure. The final recourse is to reset or power-cycle the system, which will not generate a crash dump and reduces the potential for root cause analysis.
It is important to note that this procedure may not be successful in all cases. A VM may become unresponsive for a variety of reasons related to the guest operating system or the hypervisor. The following procedure is a best-effort attempt to collect system state information at the time the VM became unresponsive.
In condition 2, a Delphix Support user may still be able to log in and offer recovery options. If possible, do not issue the NMI or reboot the Engine until Support has been engaged for further direction.
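The difference between the two conditions can be sketched as a small triage helper. This is a minimal Python sketch, not a Delphix tool: the function names, the use of TCP port 2049 as a stand-in for NFS responsiveness, the use of the SSH identification banner as a stand-in for a login prompt, and the five-second timeout are all assumptions made for illustration. An open port does not prove that NFS I/O is being served, so treat the result as a hint only.

```python
import socket


def tcp_port_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def ssh_banner_received(host, port=22, timeout=5.0):
    """Return True if the SSH server sends its 'SSH-' identification string."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            return sock.recv(64).startswith(b"SSH-")
    except OSError:
        return False


def classify(nfs_responsive, ssh_prompt_seen):
    """Map the two observations onto the conditions described in this article."""
    if not nfs_responsive and not ssh_prompt_seen:
        return "condition 1: an NMI may be sent to generate a crash dump"
    if nfs_responsive and ssh_prompt_seen:
        return "condition 2: engage Delphix Support before sending an NMI or rebooting"
    return "indeterminate: symptoms match neither condition, engage Delphix Support"
```

For a hypothetical engine address, the helper could be called as `classify(tcp_port_open("engine.example.com", 2049), ssh_banner_received("engine.example.com"))`.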
An administrative user with permission to access the Delphix VM console in the Azure Portal is required to perform these actions. This is typically granted by the "Virtual Machine Contributor" (VM Contributor) role.
Additional storage blob container permissions may be required to access the Boot Diagnostics interface.
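One way to confirm the role ahead of an incident is to inspect the account's role assignments. The sketch below is illustrative only. It assumes JSON in the shape produced by `az role assignment list --assignee <user> --all --output json`, relies on the `roleDefinitionName` field in that output, and assumes that the broader Owner and Contributor roles also satisfy the requirement. The helper name is hypothetical, and the check ignores the scope of each assignment and the storage permissions needed for Boot Diagnostics.

```python
import json

# Assumption: these built-in roles are sufficient for serial console access.
SUFFICIENT_ROLES = {"Owner", "Contributor", "Virtual Machine Contributor"}


def has_serial_console_role(assignments_json):
    """Return True if any listed role assignment carries a sufficient role name."""
    assignments = json.loads(assignments_json)
    held = {entry.get("roleDefinitionName") for entry in assignments}
    return bool(held & SUFFICIENT_ROLES)
```

Because scope is not examined, a positive result still needs to be checked against the specific VM in the portal.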
Applicable Delphix Versions
Major Release: 6.0
1. After logging in to the Azure Portal, navigate to "Virtual Machines", then click the VM name in the resulting tab.
2. Scroll to the bottom of the VM tools to locate the "Support + troubleshooting" heading. Click "Boot diagnostics", then click "Serial log" to access the VM serial log history. A "Download serial log" hyperlink should be available to download the current console content before the NMI operation. The serial log is also helpful for monitoring the NMI, the resulting panic, and any startup messages after the restart.
Missing permissions for the storage blob container may result in an error:
Error encountered while getting the screenshot or serial log file from the blob container in storage account <storage account name>. Please make sure you have permissions and fireall (sic) is not blocking access to the Storage account.
3. From "Support + troubleshooting", click "Serial console" to access the VM interactive console. Expect a delay of several seconds while the console connects.
Missing permissions for the admin user may cause the serial console connection to fail with an error:
The serial console connection to the VM encountered an error: 'Forbidden (403) - You do not have the required permissions to use this VM serial console. Please ensure you have at least VM Contributor role permissions.
4. Once connected, the "Send Command" button will be available. Click this button, then click "Send Non-Maskable Interrupt (NMI)".
A final warning will state that the VM will be crashed and restarted for debugging purposes. Click the "Send NMI" button to initiate the process.
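For teams that script their runbooks, steps 2 to 4 have rough Azure CLI counterparts. The sketch below only builds the command lines and runs them when explicitly asked. It assumes that the Azure CLI is installed and logged in, that `az vm boot-diagnostics get-boot-log` is available, and that the `serial-console` CLI extension provides `az serial-console send nmi`. The function names and the resource group and VM names are placeholders. This approach is not part of the portal procedure above, so verify each command against your CLI version before relying on it.

```python
import subprocess


def boot_log_command(resource_group, vm_name):
    """Build the command that fetches the serial (boot) log for a VM."""
    return ["az", "vm", "boot-diagnostics", "get-boot-log",
            "--resource-group", resource_group, "--name", vm_name]


def send_nmi_command(resource_group, vm_name):
    """Build the command that sends an NMI (assumes the serial-console extension)."""
    return ["az", "serial-console", "send", "nmi",
            "--resource-group", resource_group, "--name", vm_name]


def run(command, execute=False):
    """Print the command; run it only when execute=True, returning its stdout."""
    print(" ".join(command))
    if execute:
        result = subprocess.run(command, check=True, capture_output=True, text=True)
        return result.stdout
    return None
```

Keeping `execute=False` as the default makes a dry run the normal behaviour, which suits an operation that deliberately crashes the VM.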
Monitoring Delphix VM Console during NMI
The serial console should then show a VM panic caused by the received NMI.
Example Delphix VM console output, versions <6.0
panic[cpu0]/thread=ffffff003d005c40: NMI received
fffffffffbc18ed0 fffffffffbad559f ()
fffffffffbc18f00 unix:av_dispatch_nmivect+34 ()
fffffffffbc18f10 unix:nmiint+152 ()
ffffff003d005bd0 unix:mach_cpu_idle+6 ()
ffffff003d005c00 unix:cpu_idle+11a ()
ffffff003d005c20 unix:idle+a7 ()
ffffff003d005c30 unix:thread_start+8 ()
dumping to /dev/zvol/dsk/rpool/dump, offset 65536, content: kernel
Example Delphix VM console output, versions >= 6.0
Delphix engine versions 6.0 and later will immediately boot into a diagnostic kernel image in order to write the crash dump to disk. Messages similar to the following will be seen in the console log whilst the crash dump is written to disk. Note that the values and file paths will differ from what is shown below. The Delphix engine will then attempt to reboot normally.
[ 16.646585] kdump-tools: Starting kdump-tools:
[ 16.656098] kdump-tools: * running makedumpfile -c -d 31 --message-level 22 --private-page-filter 0x2F5ABDF11ECAC4E /proc/vmcore /var/crash/202204281528/dump-incomplete
[ 32.066802] kdump-tools: Page filter: 0x2f5abdf11ecac4e
[ 32.075554] kdump-tools: STEP [Checking for memory holes ] : 0.023162 seconds
[ 32.083134] kdump-tools: STEP [Excluding unnecessary pages] : 0.057384 seconds
[ 32.089872] kdump-tools: STEP [Copying data ] : 14.745280 seconds
[ 32.097036] kdump-tools: STEP [Copying data ] : 0.003801 seconds
[ 32.104759] kdump-tools: Original pages : 0x00000000001e23bb
[ 32.111072] kdump-tools: Excluded pages : 0x00000000001af8dd
[ 32.117443] kdump-tools: Pages filled with zero : 0x0000000000032667
[ 32.124092] kdump-tools: Non-private cache pages : 0x000000000003acf5
[ 32.131174] kdump-tools: Private cache pages : 0x00000000000000aa
[ 32.137880] kdump-tools: private filter pages : 0x00000000000565d7
[ 32.144447] kdump-tools: User process data pages : 0x000000000007b5b8
[ 32.152035] kdump-tools: Free pages : 0x0000000000070948
[ 32.159405] kdump-tools: Hwpoison pages : 0x0000000000000000
[ 32.166963] kdump-tools: Offline pages : 0x0000000000000000
[ 32.174569] kdump-tools: Remaining pages : 0x0000000000032ade
[ 32.181013] kdump-tools: (The number of pages is reduced to 10%.)
[ 32.188092] kdump-tools: Memory Hole : 0x00000000000ddc45
[ 32.194236] kdump-tools: --------------------------------------------------
[ 32.200962] kdump-tools: Total pages : 0x00000000002c0000
[ 32.208083] kdump-tools: Cache hit: 514825, miss: 904, hit rate: 99.8%
[ 32.215449] kdump-tools: The dumpfile is saved to /var/crash/202204281528/dump-incomplete.
[ 32.226237] kdump-tools: makedumpfile Completed.
[ 33.430492] kdump-tools: * kdump-tools: saved vmcore in /var/crash/202204281528
[ 38.811794] kdump-tools: * running makedumpfile --dump-dmesg /proc/vmcore /var/crash/202204281528/dmesg.202204281528
[ 38.831262] kdump-tools: The dmesg log is saved to /var/crash/202204281528/dmesg.202204281528.
[ 38.841395] kdump-tools: makedumpfile Completed.
[ 38.847671] kdump-tools: * kdump-tools: saved dmesg content in /var/crash/202204281528
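When reviewing a downloaded serial log, the markers from the two example outputs above can be searched for programmatically. The following is a minimal sketch that assumes the log text resembles those examples. The patterns and the function name are illustrative, and real console output may differ between versions, so a negative result does not prove that no crash dump was written.

```python
import re

# Marker seen in the example output for versions earlier than 6.0.
PANIC_RE = re.compile(r"panic\[cpu\d+\]/thread=[0-9a-f]+: NMI received")
# Marker seen in the example kdump-tools output for versions 6.0 and later.
SAVED_RE = re.compile(r"kdump-tools: saved vmcore in (\S+)")


def summarize_serial_log(text):
    """Report whether an NMI panic was logged and where any vmcore was saved."""
    return {
        "nmi_panic_seen": bool(PANIC_RE.search(text)),
        "vmcore_dirs": SAVED_RE.findall(text),
    }
```

The returned directory is only the path printed by the console. Locating and collecting the dump itself remains a task for Delphix Support.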
Once this process is complete and the VM is accessible again, Delphix Support will log in to the Engine directly to relocate the crash dump file so that it can be collected in a Support log bundle or transferred manually from the VM (for example, via SCP).