Resolving Fault "Next backup to restore cannot be found" (KBA1719)
KBA
KBA#1719Applicable Delphix Versions
This article applies to all versions of the Delphix Engine.
Issue
In some environments, the Delphix Engine may detect a gap in the SQL Server backup history tables of a Source Database.
When this occurs, the Delphix Engine will raise the following fault:
Title: Next backup to restore cannot be found
Details: The next backup to be restored for dSource "StackOverflow" was not found in the msdb.dbo.backupset table.
User Action: Perform a sync on the dSource to continue pulling new changes from the source database.
Troubleshooting Fault
This fault is raised when the Delphix Engine detects a gap in the SQL Server backup history tables: msdb.dbo.backup*
.
These history tables are the Delphix Engine's source of truth for backup ingestion, during Snapshots and in all Validated Sync modes.
In the following example, metadata for a log backup of the StackOverflow
database was not added to the msdb database due to a blocking issue:
database_name backup_start_date type first_lsn last_lsn physical_device_name StackOverflow 2017-10-27 08:15:01.000 L 286000000875600038 286000000877900001 ..._20171027_081501.trn StackOverflow 2017-10-27 08:30:00.000 L 286000000877900001 286000000878200001 ..._20171027_083000.trn StackOverflow 2017-10-27 08:45:01.000 L 286000000878200001 286000000878500001 ..._20171027_084501.trn StackOverflow 2017-10-27 09:15:00.000 L 286000000878800001 286000000879100001 ..._20171027_091500.trn
The missing log backup is identifiable by the gap between the last_lsn
of the 08:45 backup, and the first_lsn
of the 9:15 backup.
The cause of missing backup history data will vary between environments, but may include:
- Blocking or deadlocks on the backup metadata tables, preventing the SQL Server backup routines from inserting backup information
- Intentional removal of backup information by DBAs to prevent the backup being used (for example, due to a corrupt backup, or the presence of sensitive data)
- Backups were taken on an Availability Group node that has since been removed as a member of the Availability Group
If you need to investigate the cause of this issue, contact your Database Administrators. The Delphix Engine does not manipulate data in SQL Server backup tables, and this behavior is occurring outside the Delphix Engine.
Resolution
There are two possible resolutions to this issue:
- Option 1: Perform a Snapshot of the dSource
- Option 2: Locate and import missing backup metadata
Generally, the recommended approach to resolving this issue is to perform a Snapshot of the dSource.
Ingesting a recent Full or Differential backup will often allow you to "catch up" more quickly with your current log backups, and Validated Sync operations will automatically resume once the snapshot is complete.
Locating and importing missing backup data may be preferable if your Source Database is too large for a Snapshot to be practical, or if you cannot tolerate any "missing" snapshots in your timeflow (for backups that have been skipped).
Option 1: Perform a Snapshot of the dSource
To take a Snapshot of the dSource, use the dSource's Snapshot button. This is available on the dSource Card, or in the Configuration tab, depending on your version.
This will open a dialog and provide several options for re-synchronizing the Source Database:
Once this snapshot is complete, a new Timeflow Card will be created to represent the new backup. Validated Sync operations will resume and new Timeflow Cards will be created for all backups created since the snapshot.
Depending on your backup schedule, it may be necessary to wait for the next full backup to be taken from the source system, or to initiate one manually.
Option 2: Locate and import missing backup metadata
Connect to the affected SQL Server using SQL Server Management Studio. For Availability Group sources, it will be necessary to connect to all nodes of the Availability Group.
Use the backupset
and backupmediafamily
tables to check if metadata about one or more log backups is "missing". The following query can be used, substituting <DATABASE NAME>
as appropriate:
SELECT TOP 100 bs.database_name, bs.backup_start_date, bs.type, bs.first_lsn, bs.last_lsn, bmf.physical_device_name FROM msdb.dbo.backupset bs INNER JOIN msdb.dbo.backupmediafamily bmf ON bs.media_set_id = bmf.media_set_id WHERE database_name IN ('<DATABASE NAME>') AND bs.type = 'L' ORDER BY bs.backup_start_date DESC
This will provide output similar to the following:
database_name backup_start_date type first_lsn last_lsn physical_device_name StackOverflow 2017-10-27 09:15:00.000 L 286000000878800001 286000000879100001 ..._20171027_091500.trn StackOverflow 2017-10-27 08:45:01.000 L 286000000878200001 286000000878500001 ..._20171027_084501.trn StackOverflow 2017-10-27 08:30:00.000 L 286000000877900001 286000000878200001 ..._20171027_083000.trn StackOverflow 2017-10-27 08:15:01.000 L 286000000875600038 286000000877900001 ..._20171027_081501.trn
Missing backupsets can be identified by a gap between the last_lsn
of one backup and the first_lsn
of the next (note that the above results are in descending order, from most recent to earliest). The date of the missing backup should follow the date of the last card in the dSource's Timeflow:
In the case of an Availability Group, run the above query on each of the members of the Availability Group. The server on which the gap is visible is the server on which the following steps should be performed.
Using the physical_device_name
of the other rows, check this directory containing the other backups and look for the missing file.
Once this is done, use the RESTORE VERIFY ONLY ... WITH LOADHISTORY
command to import the metadata from the missing backup to your Source Instance. It is important that the directory name is consistent with the other records in msdb (using a drive letter or Windows file share as appropriate).
RESTORE VERIFYONLY FROM DISK = '<PATH TO BACKUP>' WITH LOADHISTORY;
Clear any faults relating to the dSources.
Finally, Disable and Enable the affected dSources, using the Disable/Enable Slider. This step is necessary because the default behaviour of this fault is to wait until a Snapshot is performed.