Oracle Virtual Database (VDB) Cannot be Provisioned or Enabled Due to Zombie Processes From Previous Instance (KBA1512)

Provisioning or enabling an Oracle virtual database (VDB) appears to hang indefinitely, and no error is reported to the Delphix Engine. This can be the result of an unclean shutdown during one of the following operations:

Cancelling a previous Provision job
Disabling the VDB

In both cases, the canceled provision job or the VDB disable job completes with no visible errors. Further investigation is required to determine whether the instance was shut down uncleanly.

Troubleshooting

An Oracle hang during provisioning can have various causes. To determine whether this is caused by an unclean shutdown of the instance, please work through these troubleshooting steps.

When viewing the Oracle Alert log, you might see a generic ORA-00600 error message from a previous failed instance shutdown.

```
Fri Dec 13 13:30:01 2016
Errors in file /oracle/diag/rdbms/v66/V66/trace/V66_ora_1666.trc (incident=854541):
ORA-00600: internal error code, arguments: [2116], [900], [], [], [], [], [], [], [], [], [], []
Incident details in: /oracle/diag/rdbms/v66/V66/incident/incdir_854541/V66_ora_1666_i854541.trc
```
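To find the trace files worth examining, the alert log can be scanned for ORA-00600 entries. The following is a minimal sketch, not an official Delphix or Oracle tool: the function name `ora600_trace_files` is hypothetical, and it assumes the alert log uses the "Errors in file ..." line layout shown in the excerpt above.

```shell
# Print the trace file named on the "Errors in file" line that
# immediately precedes each ORA-00600 entry.
# Reads alert log text on stdin. (Hypothetical helper, for illustration.)
ora600_trace_files() {
    awk '
        /^Errors in file / {
            path = $4              # 4th field is the trace file path
            sub(/:$/, "", path)    # drop a trailing colon if no incident id follows
            next
        }
        /^ORA-00600:/ && path != "" { print path }
        { path = "" }              # any other line breaks the pairing
    '
}

# Example use (the alert log location is an assumption for this sample):
#   ora600_trace_files < /oracle/diag/rdbms/v66/V66/trace/alert_V66.log
```

Only entries where the ORA-00600 line directly follows the "Errors in file" line are reported, which matches the sample above but may miss other layouts.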

Examining the referenced trace file shows IO wait errors at the very top of the file, such as the following example:

```
*** 2016-12-13 13:30:01.999
Process diagnostic dump for oracle@SOMEHOST (CKPT), OS id=1666,
pid: 12, proc_ser: 1, sid: 3, sess_ser: 1 
-------------------------------------------------------------------------------
os thread scheduling delay history: (sampling every 1.000000 secs)
  0.000000 secs at [ 12:43:02 ]
    NOTE: scheduling delay has not been sampled for 0.851177 secs
  0.000000 secs from [ 12:42:59 - 12:43:03 ], 5 sec avg
  0.000000 secs from [ 12:42:03 - 12:43:03 ], 1 min avg
  0.000000 secs from [ 12:38:04 - 12:43:03 ], 5 min avg
loadavg : 0.00 0.01 0.01
Swapinfo : 
    Avail = 61818.32Mb Used = 20049.52Mb
    Swap free = 41768.80Mb Kernel rsvd = 3801.01Mb
    Free Mem  = 4241.47Mb 
  F S      UID   PID  PPID  C PRI NI             ADDR   SZ            WCHAN    STIME TTY       TIME COMD
3401 S  delphix  7498     1  0 152 20 e0000001ed3db980 103199 e0000001f1df3260 13:30:01 ?         0:00 ora_ckpt_V66

*** 2016-12-13 13:30:30.999
Short stack dump: ORA-32516: cannot wait for process 'Unix process pid: 1666, image: oracle@SOMEHOST (CKPT)' to finish executing ORADEBUG command 'SHORT_STACK'; wait time exceeds 29950 ms

current sql: <none>
Current Wait Stack:
 0: waiting for 'Disk file operations I/O'
    FileOperation=0x2, fileno=0x0, filetype=0x1
    wait_id=4 seq_num=5 snap_id=1
    wait times: snap=5 min 31 sec, exc=5 min 31 sec, total=5 min 31 sec
    wait times: max=infinite, heur=5 min 31 sec
    wait counts: calls=0 os=0
    in_wait=1 iflags=0x5a0
```
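The wait stack in a trace like the one above can be condensed to the wait event and its total wait time. This is a rough sketch under the assumption that the trace uses the "waiting for '...'" and "wait times: snap=..., total=..." lines exactly as shown in the sample; the function name `wait_stack_summary` is hypothetical.

```shell
# Print "<wait event>: <total wait time>" for each entry in a wait stack.
# Reads trace file text on stdin. (Hypothetical helper, for illustration.)
# \047 is the octal escape for a single quote inside the awk program.
wait_stack_summary() {
    awk '
        /waiting for \047/ {
            split($0, q, "\047")   # q[2] is the text between the quotes
            event = q[2]
        }
        /wait times: snap=/ && event != "" {
            sub(/^.*total=/, "")   # keep only the value after "total="
            print event ": " $0
            event = ""
        }
    '
}

# Example use with the trace file named in the alert log excerpt:
#   wait_stack_summary < /oracle/diag/rdbms/v66/V66/trace/V66_ora_1666.trc
```

A 'Disk file operations I/O' entry with a total of several minutes is the pattern this article describes.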

After cancelling the Provision/Enable job, one or more zombie Oracle processes associated with the instance remain.
```
# ps -eo state=STATE,ruser=USER,ppid,pid,stime,cmd | grep -E '^STATE|V66'
STATE    USER       PPID    PID STIME CMD
Z     delphix       1430   1666 06:08 ora_ckpt_V66
Z     delphix       1430   1674 06:08 ora_lgwr_V66
Z     delphix       1120   1423 02:55 ora_ckpt_V66
```
Use ps with the options that report the process state on your operating system. In the above example from RHEL, the state "Z" indicates a zombie process.
Any attempt to manually kill these processes with kill -s SIGKILL <PID> fails, because a zombie process has already exited and only remains in the process table until its parent reaps it.
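To narrow the ps output to zombies only, the state column can be filtered. The sketch below assumes the Linux procps column syntax (`state=`, `ppid=`, `pid=`, `args=`); other platforms need different ps options. The function name `filter_zombies` is hypothetical.

```shell
# Print "ppid pid command" for zombie processes of one Oracle instance.
# Reads rows of "state ppid pid command" on stdin and keeps those whose
# state starts with Z and whose line mentions the instance name.
# (Hypothetical helper, for illustration.)
filter_zombies() {
    instance="$1"
    awk -v inst="$instance" \
        '$1 ~ /^Z/ && index($0, inst) > 0 { print $2, $3, $4 }'
}

# Example use on Linux:
#   ps -eo state=,ppid=,pid=,args= | filter_zombies V66
```

No output means that no zombie processes were found for that instance name.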

If you find all four symptoms, then it is highly likely that the shutdown was unclean. In this scenario, "unclean" means that the NFS shares that the Delphix Engine provides to the Oracle instance were unmounted before all Oracle processes had terminated. This can happen if the shutdown request to Oracle takes an abnormally long time to complete. The Delphix Engine waits X seconds for the shutdown to complete. If nothing is returned in that time, the VDB disable job continues anyway, reports no error, and unmounts the NFS shares. Processes that are still waiting on disk I/O can then no longer terminate, and the resulting zombie processes prevent subsequent Provision/Enable jobs from running successfully.
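Because the cause described above is an NFS share that disappeared while Oracle processes were still running, it can help to check whether any NFS file system is still mounted for the VDB. The following sketch assumes a Linux host with /proc/mounts; the function name `nfs_mounted_under` and the mount path in the example are hypothetical, so substitute the mount base configured for your environment.

```shell
# Return 0 if at least one NFS file system is mounted at or below the
# given path, 1 otherwise. Reads /proc/mounts-style lines on stdin:
#   device mountpoint fstype options dump pass
# (Hypothetical helper, for illustration.)
nfs_mounted_under() {
    awk -v mp="$1" '
        index($2, mp) == 1 && $3 ~ /^nfs/ { found = 1 }
        END { exit found ? 0 : 1 }
    '
}

# Example use (the path is an assumption, not taken from this article):
#   if nfs_mounted_under /mnt/provision/V66 < /proc/mounts; then
#       echo "NFS shares still mounted"
#   else
#       echo "no NFS shares mounted for this VDB"
#   fi
```

The check is a plain prefix match on the mount point, so a path such as /mnt/provision/V6 would also match /mnt/provision/V66.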

Resolution

The zombie processes need manual cleanup. Since they cannot be terminated normally, they must be terminated via their parent process.

First, identify the parent process ID (PPID) of each zombie process. In this sample, the parent processes have PIDs 1430 and 1120.

```
# ps -eo state=STATE,ruser=USER,ppid,pid,stime,cmd | grep -E '^STATE|V66'
STATE    USER       PPID    PID STIME CMD
Z     delphix       1430   1666 06:08 ora_ckpt_V66
Z     delphix       1430   1674 06:08 ora_lgwr_V66
Z     delphix       1120   1423 02:55 ora_ckpt_V66
```

Then send SIGCHLD to each parent process to prompt it to reap its zombie children.
```
# kill -s SIGCHLD 1430 1120
```
Finally, confirm that the zombie processes are gone. If they are still present, and you are sure the parent process is safe to kill, you can kill the parent directly with kill -s SIGKILL <PPID>. In this example, the output of ps confirms that no processes for the instance remain.
```
# ps -eo state=STATE,ruser=USER,ppid,pid,stime,cmd | grep -E '^STATE|V66'
STATE    USER       PPID    PID STIME CMD
```

Do not send SIGCHLD to the parent process if the parent process is:

init
systemd
An OS-level cluster manager like HACMP or PowerHA
Any other kind of system-wide parent process

Doing so can terminate other processes and cause unexpected system behavior. If the parent process is one of the above, then the best way to terminate the zombie processes is to perform a graceful system reboot.
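The check described above can be done before any signal is sent. The sketch below is illustrative only: the function names are hypothetical, the rows are assumed to be in "state ppid pid command" order, and the test covers only PID 1, init, and systemd. Cluster managers and other system-wide parents still need to be recognized by the administrator.

```shell
# Print the distinct parent PIDs of zombie rows.
# Reads rows of "state ppid pid command" on stdin.
zombie_parents() {
    awk '$1 ~ /^Z/ { print $2 }' | sort -un
}

# Return 0 if the parent looks like a system-wide process that must not
# be signalled, 1 otherwise. $1 = parent PID, $2 = parent command name.
# The list of names is deliberately partial (an assumption of this sketch).
is_system_parent() {
    [ "$1" -eq 1 ] && return 0
    case "$2" in
        init|systemd) return 0 ;;
        *) return 1 ;;
    esac
}

# Example use on Linux; this only prints a verdict and sends no signal:
#   ps -eo state=,ppid=,pid=,args= | zombie_parents | while read -r ppid; do
#       name=$(ps -o comm= -p "$ppid")
#       if is_system_parent "$ppid" "$name"; then
#           echo "do not signal $ppid ($name); plan a graceful reboot"
#       else
#           echo "candidate for SIGCHLD: $ppid ($name)"
#       fi
#   done
```

Keeping the decision separate from the kill command leaves the final step to the operator, who can review each parent before signalling it.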
