Oracle Virtual Database (VDB) Cannot be Provisioned or Enabled Due to Zombie Processes From Previous Instance (KBA1512)
Issue
Provisioning or enabling an Oracle virtual database (VDB) appears to hang indefinitely without any error being reported to the Delphix Engine. This may be the result of an unclean shutdown during one of the following operations:
- Cancelling a previous Provision job
- Disabling the VDB
In both cases, the canceled provision job or the VDB disable job completes with no visible errors. Further investigation is required to determine whether the instance was shut down uncleanly.
Troubleshooting
An Oracle hang during provisioning can have various causes. To determine whether this is caused by an unclean shutdown of the instance, please work through these troubleshooting steps.
- When viewing the Oracle alert log, you might see a generic ORA-00600 error message from a previous failed instance shutdown.
    Fri Dec 13 13:30:01 2016
    Errors in file /oracle/diag/rdbms/v66/V66/trace/V66_ora_1666.trc (incident=854541):
    ORA-00600: internal error code, arguments: [2116], [900], [], [], [], [], [], [], [], [], [], []
    Incident details in: /oracle/diag/rdbms/v66/V66/incident/incdir_854541/V66_ora_1666_i854541.trc
- Examining the referenced trace file shows I/O wait errors at the very top of the file, such as the following example:
    *** 2016-12-13 13:30:01.999
    Process diagnostic dump for oracle@SOMEHOST (CKPT), OS id=1666, pid: 12, proc_ser: 1, sid: 3, sess_ser: 1
    -------------------------------------------------------------------------------
    os thread scheduling delay history: (sampling every 1.000000 secs)
    0.000000 secs at [ 12:43:02 ]
    NOTE: scheduling delay has not been sampled for 0.851177 secs
    0.000000 secs from [ 12:42:59 - 12:43:03 ], 5 sec avg
    0.000000 secs from [ 12:42:03 - 12:43:03 ], 1 min avg
    0.000000 secs from [ 12:38:04 - 12:43:03 ], 5 min avg
    loadavg : 0.00 0.01 0.01
    Swapinfo : Avail = 61818.32Mb Used = 20049.52Mb Swap free = 41768.80Mb Kernel rsvd = 3801.01Mb Free Mem = 4241.47Mb
    F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME COMD
    3401 S delphix 7498 1 0 152 20 e0000001ed3db980 103199 e0000001f1df3260 13:30:01 ? 0:00 ora_ckpt_V66
    *** 2016-12-13 13:30:30.999
    Short stack dump: ORA-32516: cannot wait for process 'Unix process pid: 1666, image: oracle@SOMEHOST (CKPT)' to finish executing ORADEBUG command 'SHORT_STACK'; wait time exceeds 29950 ms
    current sql: <none>
    Current Wait Stack:
    0: waiting for 'Disk file operations I/O'
    FileOperation=0x2, fileno=0x0, filetype=0x1
    wait_id=4 seq_num=5 snap_id=1
    wait times: snap=5 min 31 sec, exc=5 min 31 sec, total=5 min 31 sec
    wait times: max=infinite, heur=5 min 31 sec
    wait counts: calls=0 os=0 in_wait=1
    iflags=0x5a0
- After cancelling the Provision/Enable job, you find there are one or more zombie Oracle processes associated with the instance.
    # ps -eo state=STATE,ruser=USER,ppid,pid,stime,cmd | grep -E '^STATE|V66'
    STATE USER PPID PID STIME CMD
    Z delphix 1430 1666 06:08 ora_ckpt_V66
    Z delphix 1430 1674 06:08 ora_lgwr_V66
    Z delphix 1120 1423 02:55 ora_ckpt_V66
You will need to use ps with whatever options your operating system requires to report the process state. In the above example from RHEL, the state "Z" indicates a zombie process.
- Any attempt to manually kill these processes with kill -s SIGKILL <PID> fails.
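The zombie check in the third symptom can be narrowed to just the affected rows. The following is a minimal sketch, not part of the Delphix tooling: list_zombies is a hypothetical helper name, and it assumes the Linux ps column order used in the example above (STATE, USER, PPID, PID, STIME, CMD). It prints the PID and PPID of each zombie whose row mentions the instance name.

```shell
# Hypothetical helper: print "PID PPID" for every zombie row that mentions
# the given instance name. Expects ps rows on stdin in the column order
# STATE USER PPID PID STIME CMD (an assumption taken from the example above).
list_zombies() {
    instance="$1"
    # $1 is the state column; the header row never starts with "Z".
    awk -v inst="$instance" '$1 ~ /^Z/ && index($0, inst) > 0 { print $4, $3 }'
}

# Usage on a live host (Linux ps syntax):
#   ps -eo state,ruser,ppid,pid,stime,cmd | list_zombies V66
```

Like the grep in the example, this is a plain substring match, so an instance name that is a prefix of another (for example V66 and V667) will match both.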
If you find all four symptoms, it is highly likely that the shutdown was unclean. In this scenario, "unclean" means that the NFS shares from the Delphix Engine to the Oracle instance were unmounted before all Oracle processes had terminated. This can happen if the shutdown request to Oracle takes an abnormally long time to complete. The Delphix Engine waits X seconds for the shutdown to complete. If nothing is returned in that time, the VDB disable job continues anyway without error and unmounts the NFS shares. This prevents processes that depend on disk I/O from terminating, and the resulting zombie processes prevent subsequent Provision/Enable jobs from running successfully.
Resolution
The zombie processes need manual cleanup. Since they cannot be terminated normally, they must be terminated via their parent process.
- First, identify the parent process ID (PPID) of each zombie process. In this sample, the PPIDs are 1430 and 1120.
    # ps -eo state=STATE,ruser=USER,ppid,pid,stime,cmd | grep -E '^STATE|V66'
    STATE USER PPID PID STIME CMD
    Z delphix 1430 1666 06:08 ora_ckpt_V66
    Z delphix 1430 1674 06:08 ora_lgwr_V66
    Z delphix 1120 1423 02:55 ora_ckpt_V66
- Then send SIGCHLD to the PPID.
    # kill -s SIGCHLD 1430 1120
- Finally, confirm that the processes have actually terminated. If they have not, you can go up a level to the parent process and kill it directly with kill -s SIGKILL <PPID>, but only if you are sure the parent process is safe to kill. In this example, the output of ps confirms that everything was killed successfully.
    # ps -eo state=STATE,ruser=USER,ppid,pid,stime,cmd | grep -E '^STATE|V66'
    STATE USER PPID PID STIME CMD
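The first and last steps above can be scripted when several instances need checking. This is a sketch under assumptions rather than a supported procedure: zombie_parents and no_zombies_left are hypothetical helper names, and both expect ps rows in the column order STATE, USER, PPID, PID, STIME, CMD shown in the examples.

```shell
# Hypothetical helper: print the unique parent PIDs of the zombie rows that
# mention the given instance name, one per line, in ascending order.
zombie_parents() {
    awk -v inst="$1" '$1 ~ /^Z/ && index($0, inst) > 0 { print $3 }' | sort -un
}

# Hypothetical helper: exit status 0 only when no zombie row for the
# instance remains in the ps output read from stdin.
no_zombies_left() {
    # awk exits 0 when a zombie row was found; "!" inverts that status.
    ! awk -v inst="$1" '$1 ~ /^Z/ && index($0, inst) > 0 { found = 1 } END { exit !found }'
}

# Usage on a live host (Linux ps syntax):
#   ps -eo state,ruser,ppid,pid,stime,cmd | zombie_parents V66
#   ps -eo state,ruser,ppid,pid,stime,cmd | no_zombies_left V66 && echo "clean"
```

The helpers only read ps output and never send a signal, so the decision about which parent to signal stays with the operator.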
Do not send SIGCHLD to the parent process if the parent process is:
- init
- systemd
- An OS-level cluster manager like HACMP or PowerHA
- Any other kind of system-wide parent process
Doing so can terminate other processes and cause unexpected system behavior. If the parent process is one of the above, the best way to clear the zombie processes is a graceful system reboot.
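A simple guard can catch the most obvious cases in the list above before any signal is sent. The sketch below is illustrative only: is_system_parent is a hypothetical helper name, and it recognizes just PID 1 and the command names init and systemd. Cluster managers and other site-specific system-wide parents are not detected and still need a manual check.

```shell
# Hypothetical helper: exit status 0 when the given PID and command name look
# like a system-wide parent that must not be signalled. Only PID 1, init and
# systemd are recognized (an assumption for illustration); anything else,
# including cluster managers, must be verified by hand.
is_system_parent() {
    pid="$1"
    name="${2##*/}"   # strip any leading directory from the command name
    [ "$pid" -eq 1 ] && return 0
    case "$name" in
        init|systemd) return 0 ;;
    esac
    return 1
}

# Usage on a live host, for a parent PID taken from the ps output:
#   if is_system_parent 1430 "$(ps -o comm= -p 1430)"; then
#       echo "do not signal this parent; plan a graceful reboot instead"
#   fi
```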