Delphix

Using Hypervisor or Storage Snapshots, Clones for Delphix Engine Backup and Disaster Recovery (KBA1769)

 

 


KBA#1769

Applicable Delphix Versions

This article applies to the following versions of the Delphix Engine:

Major Release: All

All Sub Releases: All

Troubleshooting 

Many customers engage Delphix Support or Services to ask questions about implementing snapshot technology for "backup", either at the hypervisor level, or within the storage infrastructure.  This document is intended to address the common concerns, and provide generic Delphix guidance.  Any deviation that's specific to a given hypervisor platform will be noted accordingly.

Why a Snapshot Isn't a Backup 

Although these terms are often used interchangeably, most snapshot implementations are not, by themselves, a true backup.  A snapshot provides a static image representing the state of an object (VM or storage) at a given moment.  However, it is typically not a full copy: it simply captures a point in time, and the underlying technology then tracks the differences between the snapshot time and the current runtime state (often referred to as the delta).  A snapshot CAN nevertheless be used as a source for a backup.

As a result, snapshots on a running VM over time can use a significant amount of extra storage, and should be used carefully since this overhead can ultimately result in space issues for the hypervisor, or for the storage platform.

However, because a snapshot preserves a point in time, it can be used to revert the storage objects to a prior state in the event of logical device corruption, or (as previously mentioned) as a backup source.

For a backup to be useful and effective, it needs to:

- Contain all pertinent data for the object

- Be physically independent from the source, possibly even in a different array, datacenter, etc. (this varies depending on compliance requirements for your organization, business unit, or locale).

Therefore, in most implementations, snapshots should be used as a method to acquire a backup image or copy of a VM, but should not be relied on as the sole means of redundancy.

Concerns 

In all snapshot implementations for a Delphix VM backup, it's critical that all storage device states are captured at the same moment.  Delphix Filesystem (DxFS) stripes all data across all configured disks, and failure to capture the same moment/state for all disk devices allocated is likely to result in object corruption.

This is a challenge in some snapshot implementations: capturing snapshots of individual disk devices ad hoc while the VM is running will likely capture a different point in time for each device, and disk devices restored from such snapshots are unlikely to be usable to recover a Delphix Engine.
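The concern above can be illustrated with a rough sanity check on a set of per-disk snapshots. This is a minimal sketch, not a Delphix or hypervisor tool: the `DiskSnapshot` record, the `check_snapshot_set` function, and the disk names are all hypothetical, and it assumes your snapshot tooling can report a capture time for each disk. Passing the check does not prove that a set is consistent; it only flags sets that obviously miss a disk or span more than one point in time.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class DiskSnapshot:
    """Hypothetical record: one snapshot of one disk device."""
    disk_id: str
    taken_at: datetime


def check_snapshot_set(engine_disks, snapshots, tolerance=timedelta(0)):
    """Return a list of problems; an empty list means no obvious mismatch."""
    problems = []
    by_disk = {s.disk_id: s for s in snapshots}

    # Every disk allocated to the Engine must be present in the set.
    missing = sorted(set(engine_disks) - set(by_disk))
    if missing:
        problems.append(f"no snapshot for disks: {', '.join(missing)}")

    # All capture times must fall within the allowed tolerance of each other.
    times = [s.taken_at for s in by_disk.values()]
    if times and max(times) - min(times) > tolerance:
        spread = max(times) - min(times)
        problems.append(f"snapshots span {spread}, not a single point in time")
    return problems


# Example: two disks snapshotted 40 seconds apart, a third not captured at all.
t0 = datetime(2024, 1, 1, 2, 0, 0)
adhoc = [
    DiskSnapshot("disk0", t0),
    DiskSnapshot("disk1", t0 + timedelta(seconds=40)),
]
for problem in check_snapshot_set(["disk0", "disk1", "disk2"], adhoc):
    print(problem)
```

The default tolerance is zero because the data is striped across all disks; any relaxation of that value would be a site-specific judgment rather than Delphix guidance.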

Additionally, when restoring a VM backup or cloning a VM, some unique information pertaining to the instance naturally becomes duplicated. In most day-to-day operations, the configuration can be altered to address this (for example, by changing the Engine hostname and address when cloning). However, some internal security objects used for challenge-response authentication will also be cloned, which becomes an issue when Delphix Support or Services requires access to the Engine for troubleshooting, etc.  There are methods to address this, with Support intervention.

Snapshots can also introduce performance issues when a large number of snapshots are taken by the hypervisor and not purged regularly, and space issues due to the overhead required to track changes (the delta).

Currently, due to the implementation of VMware tools on DelphixOS (DxOS), calls from the ESX hypervisor to the Delphix filesystems will be unsuccessful, and therefore the consistency of the disk devices supporting our file system cannot be guaranteed with these tools while the OS is running.

Recommendation 

With the above concerns in mind, Delphix generally recommends the following (unless specific guidance for a given hypervisor platform is provided):

- Any snapshot or VM image intended for backup/recovery purposes should be taken while the VM is powered off / shut down. This ensures filesystem consistency across allocated storage volumes, and reduces the time to recovery in most instances.

- Snapshots or other VM images taken from a running Engine may require additional recovery for consistency. Depending on the efforts required, this may be a billable activity, and outside the scope of your Support contract.

- Any VM that is cloned from a snapshot or backup should be isolated from the network until the required configuration changes have been made via the sysadmin interface, to avoid conflicts with existing Engines which may be active at the same time (if applicable).

- Cloning active VMs which have already been configured and used for data ingestion should not be used for any replication scenario.  Rather, Delphix Replication should be leveraged if the end goal is distribution of said data objects.

- Whenever possible, use vendor tools or workflows which allow snapshots of all devices to be taken at the same moment, in parallel.  This has several benefits beyond the consistency mentioned previously: ad-hoc snapshots or copies of individual disk devices often introduce an additional administrative burden.  Consider a scenario where a VM admin needs to snapshot 10 disk devices per Engine, for multiple Engines, and create a clone on demand.  If these snapshots or copies are not logically associated, this introduces additional overhead and complexity.
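The administrative burden described in the last recommendation can be sketched as follows. This is an illustrative example only: the `complete_groups` function, the `(group_id, disk_id)` record shape, and the group and disk names are hypothetical, and it assumes your snapshot tooling lets you label snapshots that were taken together. It shows how a logical association makes it straightforward to find which snapshot sets cover every disk of an Engine.

```python
from collections import defaultdict


def complete_groups(engine_disks, snapshot_records):
    """Return the group labels whose snapshots cover every Engine disk.

    snapshot_records is an iterable of (group_id, disk_id) pairs, a
    hypothetical shape standing in for whatever your tooling reports.
    """
    disks_by_group = defaultdict(set)
    for group_id, disk_id in snapshot_records:
        disks_by_group[group_id].add(disk_id)

    required = set(engine_disks)
    # A group is usable only if it includes a snapshot of every required disk.
    return sorted(g for g, disks in disks_by_group.items() if required <= disks)


# Example: the second group is missing disk2, so it cannot be used for a clone.
records = [
    ("nightly-01", "disk0"), ("nightly-01", "disk1"), ("nightly-01", "disk2"),
    ("nightly-02", "disk0"), ("nightly-02", "disk1"),
]
print(complete_groups(["disk0", "disk1", "disk2"], records))
```

Without a shared label, the same question has to be answered by matching individual snapshots by hand for each of the 10 devices on each Engine.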

Additional Information

Amazon Web Services (AWS)

AMI cloning can be leveraged to address some of the concerns about grouping storage snapshots, etc., by cloning the entire VM and snapshotting all of its storage devices together.  Additional information regarding this practice is included in External Links below.
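As a minimal sketch of requesting such an image programmatically, the helper below calls the EC2 `CreateImage` operation through a client object that is assumed to behave like `boto3.client("ec2")`. The function name `create_engine_image`, the instance ID, and the image name are hypothetical, and this is not an official Delphix procedure. In line with the general recommendation above, you may prefer to stop the instance before creating the image.

```python
def create_engine_image(ec2_client, instance_id, image_name):
    """Request an AMI of the whole instance and return the new image ID.

    ec2_client is assumed to behave like boto3.client("ec2"); the AMI
    includes snapshots of the EBS volumes attached to the instance.
    """
    response = ec2_client.create_image(
        InstanceId=instance_id,
        Name=image_name,
        Description=f"Backup image of {instance_id}",
        # NoReboot=False (the EC2 default) lets EC2 reboot a running instance
        # so that the volumes are snapshotted while data is at rest.
        NoReboot=False,
    )
    return response["ImageId"]


# Usage (requires boto3 and AWS credentials; the values are placeholders):
#   import boto3
#   ec2 = boto3.client("ec2")
#   image_id = create_engine_image(ec2, "i-0123456789abcdef0", "engine-backup")
```

Passing the client in as a parameter keeps the helper independent of how credentials and regions are configured in your environment.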