Skip to main content
Delphix

Virtual Databases may fail to start following file removal on NFS shares from the Delphix Engine

Applicable Delphix Versions

  • 5.1
  • 5.0
  • 4.3
  • 4.2
  • 4.1
  • 4.0
  • 3.2

Issue

Following a Virtual DataBase (VDB) rewind a database will fail to start if archive logs to support that rewind have been manually removed. Customer visible error messages will vary and may include the following wording "none of the files are in backup ORA-06512".


While the body of this document is focused on manual removal of archive logs it is equally applicable to all other files (e.g. online redo, spfile) shared out via NFS to customer target hosts from the Delphix engine and used by Oracle during startup.
 

Scope

The impact of this issue is limited to VDBs. Only VDB's expose shared NFS filesystems to users.

Symptom

message_command_output (a column from dlpx_job_events) or the debug logs will include text similar to the following:

| RMAN-00571: ===========================================================\r
| RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============\r
| RMAN-00571: ===========================================================\r
| RMAN-03002: failure of recover command at 01/12/2017 15:42:25\r
| RMAN-06053: unable to perform media recovery because of missing log\r
| RMAN-06025: no backup of archived log for thread 2 with sequence 312370 and starting SCN of 14661068836919 found to restore\r
| RMAN-06025: no backup of archived log for thread 2 with sequence 312369 and starting SCN of 14661068551106 found to restore\r
| RMAN-06025: no backup of archived log for thread 2 with sequence 312368 and starting SCN of 14661065836322 found to restore\r
| RMAN-06025: no backup of archived log for thread 1 with sequence 203530 and starting SCN of 14661069060882 found to restore\r
| RMAN-06025: no backup of archived log for thread 1 with sequence 203529 and starting SCN of 14661068536351 found to restore\r

Diagnostic

The key diagnostic for this issue is a delete notice from the Remote Filesystem daemon (RFS), where the filesystem matches the filesystem of the impacted VDB.

Because logs have rolled, it is frequently not possible to match file names from Oracle start-up errors with the same file names being noted as deleted. In this case a pragmatic approach is required: there are logs which have been manually deleted; a VDB on the same filesystem is missing logs - so the correlation makes sense. While its true the evidence is circumstantial, such a link is reasonably compelling.

In situations where RFS shows file deletion and those files are matched in the Oracle startup error message, the evidence that the deletion is responsible for the failure is unequivocal.

The following is an example error. These errors are available in /var/adm/messages on a live system or <bundle>/log/syslog/messages in a supportbundle:

Mar 24 17:07:31 plv-feq-a004 nfssrv: [ID 239897 kern.notice] NOTICE: rfs3: File /domain0/group-39/oracle_db_container-121/oracle_timeflow-2711/archive/arch_1_938271787_1520.dbf deleted by 102 from 010.214.002.038

Steps to triage

1. Using MDS determine the job number, job type and failure mode.

select job_id,job_type,start_time,(update_time-start_time)as elapsed_time,job_state,title from dlpx_jobs where start_time > '2017-04-01 00:00:00' and title like '%myvdb%' order by start_time desc;

2. Extract the error message details using the job_id from step 1.

select job,event_time,job_state,percent_complete,message_details,message_action,message_command_output from dlpx_job_events where job = 'JOB-<number>' and job_state = 'FAILED' ;

3. If the error message is a match for the key symptom, proceed with further steps otherwise this document may not be appropriate.

4. Rule out the following DLPX-47065 VDB "recovery fails when files other than archive logs are detected by Oracle".
This bug has the same basic signature. The difference will be noted in the debug logs which will show an extra unexpected file on the filesystem under the title:

List of Files Which Were Not Cataloged

This will be followed by the error: 

RMAN-07519: Reason: Error while cataloging.

5. Collect and document within the case both the Delphix version and java stack associated with the failure.

6. Determine the container, group and which timeflow is active using the name of the VDB as a search:

select container_id,container_type,name,group_ref,provision_container,timeflow,previous_timeflow from dlpx_container where name like '%myvdb%';

7. For later possible workaround steps, understand if there are other ealier timeflows than the one which is current. Earlier timeflows may preceed file removal (and thereby potentially represent an alternative to a VDB refresh):

select timeflow_id,name,container,parent_timeflow,parent_db_change_id,parent_timestamp from dlpx_timeflow where container like '%-<container_id>';

8. Using the container_id and timeflow_id list the snapshots in the timeflow. In most cases there is more than one. It does not really matter to which snapshot a provision was attempted, because in most cases, removal of archive logs is likely to impact all snapshots in the one timeflow; customers tend to use blunt tools for file removal with an all or nothing approach. Of course the controlling factor here is the number of archive logs which have been removed and the width or range of the SCN's held by those removed archive logs.

If you know the date and time of the snapshot the customer was using, you can find the snapshot_id using the following query:

select snapshot_id, first_change_time, last_change_time, creation_time, timeflow, retention, consistency from dlpx_snapshot where container = 'ORACLE_DB_CONTAINER-11' and last_change_time::text like '%2017-03-17%';

If you do not know the snapshot the customer selected and the job failed, then it is not currently possible to ascertain the snapshot using logging. MDS will only be useful for successful DB_PROVISION jobs and then we don't care and this document is not something you would be looking at. The lack of visibility on this front is covered by the following bug in Jira https://jira.delphix.com/browse/DLPX-51075.

9. List the first and last_change_db_id using the timeflow_id:

select snapshot_id, timeflow, first_db_change_id, last_db_change_id from dlpx_snapshot where snapshot_id = <id_number>;

10. Check for existing logs within the SCN range, using the container_id, timeflow_id and scn as filters:

select log_id,timeflow,instance_num,start_scn,start_time,end_scn,end_time,sequence from orcl_log where container='ORACLE_DB_CONTAINER-<id>' and timeflow='ORACLE_TIMEFLOW-<id>' and <scn from first_db_change_id> between start_scn and end_scn;

11. Determine the filesystem on which the VDB resides using the timeflow_id:

select container_id,owner,filesystem from dlpx_storage_container where owner='ORACLE_TIMEFLOW-<id>';

12. Check for filesystem target-side deletions using grep, such as:

grep rfs messages |grep 'oracle_db_container-121/oracle_timeflow-2711' |wc -l

Cause

Such deletions point to an operation, usually scripted, on the target host. Scripts delete older logs and may be frequently more aggressive than a customer had intended.
Delphix Support should advise customers to use snapshot retention policies as this approach is aware of the logs required to make a snapshot viable.

In seeking out any rogue script, pay attention to the timing of any RFS notice as this may help understand when a script might next run which in turn can help us point the customer to the source of the issue (assuming it is a scheduled script).

Other

Reviewing listings of files within the impacted mount point is likely to be of limited value since it is going to be a reflection of what exists and not what has been removed.

ls -la /mnt/provision/myvdb/archive 

total 225504 
drwxr-x--x 2 oracle dba 10 Mar 10 03:30 . 
drwxr-x--x 7 oracle dba 7 Feb 20 14:00 .. 
-r--r----- 1 oracle dba 57633792 Mar 7 22:01 1_37_944455572.dbf 
-r--r----- 1 oracle dba 45889536 Mar 8 03:30 1_38_944455572.dbf 
-r--r----- 1 oracle dba 7549952 Mar 8 09:15 1_39_944455572.dbf 
-r--r----- 1 oracle dba 58901504 Mar 8 22:10 1_40_944455572.dbf 
-r--r----- 1 oracle dba 42459136 Mar 9 03:30 1_41_944455572.dbf 
-r--r----- 1 oracle dba 7444480 Mar 9 08:55 1_42_944455572.dbf 
-r--r----- 1 oracle dba 58735616 Mar 9 22:05 1_43_944455572.dbf 
-r--r----- 1 oracle dba 45362176 Mar 10 03:30 1_44_944455572.dbf

Workarounds

1. It may be possible to switch the VDB back to an earlier (possibly orginal) timeflow. The idea here is to return the VDB to the state prior to archive log removal. This can be achieved using the Delphix engine CLI, for example:


delphix engine> database
delphix engine database> select ORADB
delphix engine database 'ORADB'> switchTimeflow
delphix engine database 'ORADB' switchTimeflow *> set timeflow='DB_PROVISION@2016-06-19T04:45:15'


2. Use Delphix to refresh the VDB from the dSource. This will mean any changes unique to VDB would be lost. The refresh would of course solve the problem of missing logs as the process will bring over a new copy of files inclusive archive logs from the source database to the VDB and target system.


3. Restore the logs if they are available within the customer RMAN backups.

Related bugs

DLPX-51074 A fault should be thrown by Delphix where VDB archive logs are manually removed
DLPX-51075 VDB provision properties should be logged as a first step of each job
DLPX-47065 VDB recovery fails when files other than archive logs are detected by Oracle
 

Related documentation

How to resolve "Error: Archive logs cannot be retrieved due to deletion"

 

Conditional content (Pro member)