How to Generate a Non-Maskable Interrupt (NMI) or Diagnostic Interrupt in AWS (KBA7830)
KBA# 7830
Applicable Delphix Versions
Major Release: 6.0
All Sub Releases: 6.0.2.0, 6.0.2.1, 6.0.3.0, 6.0.3.1, 6.0.4.0, 6.0.4.1, 6.0.4.2, 6.0.5.0, 6.0.6.0, 6.0.6.1, 6.0.7.0, 6.0.8.0, 6.0.8.1
Issue
An unresponsive server is not reachable via NFS, the GUI, SSH, or console login, yet the EC2 console still indicates that the virtual machine (VM) is running, and CPU or memory resources may still be consumed. The server may respond to ping, depending on the nature of the issue.
There are two typical sets of symptoms when an Engine becomes unresponsive:
1. The Delphix Engine NFS mounts are not responsive, and all target host IO operations fail. The web interface will not load, and attempts to log in via SSH as any Delphix Admin or self-service user are unsuccessful. No login prompt is presented when an SSH connection is attempted.
2. The Delphix Engine NFS mounts are responsive, and VDB operations are not disrupted. The web interface will not load, and attempts to log in via SSH as any Delphix Admin or self-service user are unsuccessful. A login prompt is presented when an SSH connection is attempted, but login attempts fail, with no password prompt following the entry of a username.
In both conditions, the hypervisor still indicates the virtual machine (VM) is running, and ping may return successfully. Memory and CPU utilization may be variable, or the VM may indicate no activity.
Should such a condition arise where the system is otherwise unreachable, a non-maskable interrupt (NMI) or diagnostic interrupt may be sent from the hypervisor to cause the Delphix operating system (DxOS) to kernel panic and generate a crash dump. The resulting crash dump can be collected by Delphix Support for further analysis.
If the system does not respond to the interrupt, retry the procedure. The final recourse is to reset or power-cycle the system, which does not generate a crash dump and therefore reduces the potential for root cause analysis.
It is important to note that this procedure will not be successful in all cases. Unresponsive VMs may occur for a variety of reasons related to the guest operating system or other hypervisor issues. The following procedure is a best-effort attempt to collect system state information at the time of the issue.
Prerequisites
An administrative user with permissions to interface with the EC2 instance is required, and the AWS EC2 CLI must be functional, as there was no web interface for this procedure at the time this article was published.
A subset of the AWS EC2 instance types supported by Delphix does not support the diagnostic interrupt feature. This is an AWS EC2 limitation: only AWS Nitro-based instance types (except A1) support this feature. The AWS documentation on diagnostic interrupts discusses this in more detail.
If this process is attempted on an unsupported instance type, the AWS EC2 CLI will indicate this:
An error occurred (UnsupportedOperation) when calling the SendDiagnosticInterrupt operation: This instance type does not support diagnostic interrupts.
Only Delphix Engine versions 6.0.2.0 and later support instance types using the Nitro hypervisor.
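As a quick pre-check of the Nitro requirement above, the hypervisor backing an instance can be queried from the CLI. The following is a minimal sketch, not part of the original procedure: the function name `instance_hypervisor` is hypothetical, and it assumes the configured AWS credentials allow `describe-instances` and `describe-instance-types`. A result of `nitro` does not by itself guarantee support (A1 instances are Nitro-based but excluded), so treat it as a hint only.

```shell
# Hypothetical helper: print the hypervisor (for example "nitro" or "xen")
# that backs the instance type of the given EC2 instance ID.
instance_hypervisor() {
    hv_instance_id="$1"

    # Look up the instance type of the instance.
    hv_instance_type=$(aws ec2 describe-instances \
        --instance-ids "$hv_instance_id" \
        --query "Reservations[0].Instances[0].InstanceType" \
        --output text) || return 1

    # Ask EC2 which hypervisor that instance type runs on.
    aws ec2 describe-instance-types \
        --instance-types "$hv_instance_type" \
        --query "InstanceTypes[0].Hypervisor" \
        --output text
}
```

For example, `instance_hypervisor i-0af0379f93be843e2` would print the hypervisor name for the example instance used later in this article.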
Issuing Diagnostic Interrupt
Locate the instance ID in the EC2 Instances web interface. In the following example, the Engine name is "Delphix1".
Alternatively, the instance ID can be collected via AWS EC2 CLI. In the following example we search for the same instance name:
% aws ec2 describe-instances --filters "Name=tag:Name,Values=Delphix1" --query "Reservations[].Instances[].InstanceId"
- i-0af0379f93be843e2
Once the instance ID is located, the following command can be issued to direct EC2 to send the interrupt to the Engine:
aws ec2 send-diagnostic-interrupt --instance-id <instance ID>
If the command is successful, AWS EC2 CLI will simply return to the command prompt after a few moments.
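The lookup and the interrupt can be combined so that the interrupt is only sent when the name resolves to exactly one instance. This is a sketch under assumptions rather than a Delphix-provided tool: the function name `send_diagnostic_interrupt_by_name` is hypothetical, and it assumes the Engine is identified by its `Name` tag as in the example above.

```shell
# Hypothetical helper: resolve an Engine's Name tag to an instance ID and
# send the diagnostic interrupt only if exactly one instance matches.
send_diagnostic_interrupt_by_name() {
    nmi_engine_name="$1"

    # Same lookup as the describe-instances example, in text output form.
    nmi_ids=$(aws ec2 describe-instances \
        --filters "Name=tag:Name,Values=$nmi_engine_name" \
        --query "Reservations[].Instances[].InstanceId" \
        --output text) || return 1

    # Count the matches; refuse to continue on zero or multiple instances
    # so the interrupt is never sent to the wrong VM.
    nmi_count=$(( $(printf '%s' "$nmi_ids" | wc -w) ))
    if [ "$nmi_count" -ne 1 ]; then
        echo "expected exactly 1 instance named '$nmi_engine_name', found $nmi_count" >&2
        return 1
    fi

    aws ec2 send-diagnostic-interrupt --instance-id "$nmi_ids"
}
```

For example, `send_diagnostic_interrupt_by_name Delphix1` would send the interrupt to the instance found in the lookup above. A non-zero exit status means the lookup or the interrupt call failed and nothing should be assumed about the Engine's state.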
The EC2 instance serial console will indicate OS reboot when the interrupt is received (uptime counter indicated in left column restarts at 0.000000):
During the reboot process, if the panic dump was successful, the kernel crash dump process will start and indicate that a dump is being saved to the Engine root filesystem (the path shown will vary with the date and time the interrupt was issued).
Once the Engine is back online and accessible, a Support log bundle can be collected through the normal interface. The generated dump files, however, will need to be transferred by a Delphix Support engineer via a screen-sharing session for further root cause analysis.
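If the interactive serial console is not at hand, recent console output can also be pulled through the CLI to look for the reboot and dump messages described above. The following is a minimal sketch: the function name `console_tail` and the default of 40 lines are assumptions, and the `--latest` option of `get-console-output` is only supported on Nitro-based instances, which this procedure already requires.

```shell
# Hypothetical helper: print the last N lines (default 40) of the most
# recent console output for the given instance ID.
console_tail() {
    aws ec2 get-console-output \
        --instance-id "$1" \
        --latest \
        --query "Output" \
        --output text | tail -n "${2:-40}"
}
```

For example, `console_tail i-0af0379f93be843e2 100` would show the last 100 lines of console output for the example instance.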