Skip to main content
Delphix

KBA1704 Resolving VDB Refresh Mounting Timeout Issues Caused by Diskpart Hangs

 

 

Applicable Delphix Versions

 

Major Release

All Sub Releases

5.2 5.2.2.0

5.1

5.1.0.0, 5.1.1.0, 5.1.2.0, 5.1.3.0, 5.1.4.0, 5.1.5.0, 5.1.5.1, 5.1.6.0, 5.1.7.0, 5.1.8.0, 5.1.8.1, 5.1.9.0

5.0

5.0.1.0, 5.0.1.1, 5.0.2.0, 5.0.2.1, 5.0.2.2, 5.0.2.3, 5.0.3.0, 5.0.3.1, 5.0.4.0, 5.0.4.1 ,5.0.5.0, 5.0.5.1, 5.0.5.2, 5.0.5.3, 5.0.5.4

4.3

4.3.1.0, 4.3.2.0, 4.3.2.1, 4.3.3.0, 4.3.4.0, 4.3.4.1, 4.3.5.0

4.2

4.2.0.0, 4.2.0.3, 4.2.1.0, 4.2.1.1, 4.2.2.0, 4.2.2.1, 4.2.3.0, 4.2.4.0 , 4.2.5.0, 4.2.5.1

4.1

4.1.0.0, 4.1.2.0, 4.1.3.0, 4.1.3.1, 4.1.3.2, 4.1.4.0, 4.1.5.0, 4.1.6.0

4.0

4.0.0.0, 4.0.0.1, 4.0.1.0, 4.0.2.0, 4.0.3.0, 4.0.4.0, 4.0.5.0, 4.0.6.0, 4.0.6.1

Issue

This issue occurs when iSCSI mounting/un-mounting operations occur, manifesting mostly during VDB operations such as refreshes, rollbacks, etc. In theory, this can happen to dSources when mounting or unmounting (disable/enable operations). In all cases, the Windows target host, whether standalone or in a failover cluster scenario, has some sort of Veritas disk management software installed, typically for some other application outside of Delphix usage. In most situations, there are physical SQL Server databases running in a mixed bag with Delphix iSCSI drives handling the VDBs.

When running a job on VDB you'll receive this type of error:

event_id               | 222363
job                    | JOB-54152
event_time             | 2018-01-23 21:13:28.471
job_state              | FAILED
percent_complete       | 35
message_code           | exception.windowshost.mount.timedout
message_params         | ["CSPWD00B0005"]
message_details        | Attempt to discover and mount LUNs from the Delphix Engine through the iSCSI initiator on target host "CSPWD00B0005" timed out.
message_action         | Make sure the target host has not exhausted all of its memory. If SQL Server instances are running on the target host, limit the maximum memory for all SQL Server instances to leave sufficient memory (at least 2 GB) for other operations on the host. Check the Windows event log for iSCSI initiator timeout errors.
message_command_output |
event_type             | ERROR

A timeout can be a legitimate problem with the infrastructure, but this article discusses what happens when the reason for a 20 minute timeout is due to the diskpart windows utility hanging on the operation. This utility is used in the mounting process to gather information about the disk used for the iSCSI mount on the VDB experiencing the operation.

Troubleshooting

To check if this is the diskpart "hang" scenario, open a session to the windows host where VDB database files are to be mounted. Open a command prompt or Powershell and run the diskpart command.  If you don't receive a prompt to start entering commands, diskpart is hung.

Open the Task Manager and check the process list under the details tab for diskpart.exe.

Right click the process and select "Analyze wait chain". If you see processes chained off this one, you are likely hitting this problem.

diskpart_chain.png

Notice the executable - vds.exe.  This is the virtual disk service provider.  You'll want to find that process in task manager and check its wait chain:

vds_chain.png

If you see the VxVDS.exe binary on the chain, the issue is likely caused by the Veritas Super VDS Provider. It's essentially a wrapper to the VDS providers, both to Microsoft and Veritas and enables the Veritas provider to handle management of its own disk management software as well as Microsoft disk management software such as diskpart and disk management. For reasons unspecified, the use of this provider can randomly cause the hanging issue and disrupt Delphix VDB operations relevant to iSCSI mounting, manifesting as a timeout error.

This problem can be easily resolved.

Resolution

The resolution is rather simple and is described in this link: VxVDS Information and steps to unregister/register.

In brief, a windows administrator needs to log onto the affected target (or in the case of dSources, staging) host.

Run these commands, following steps under How to unregister VxVDS: "1. Without stop/restart of Storage Agent" (do not run any other commands listed in the document):

cd %VMPATH%
vxvdsdyn.exe /unregserver
vxvds.exe /unregserver
taskkill /f /im vxvds.exe
taskkill /f /im vxvdsdyn.exe
taskkill /f /im vds.exe
vxassist refresh    # this will bring up VDS service without Veritas VDS providers

There is no need to restart the windows host.  These commands perform the unregistering dynamically. Once completed, when diskpart is executed, the Veritas VxVDS provider no longer wraps with diskpart and the VDB operations can resume normally at this point. The change in the VxVDS Provider persists on the Windows host. The Veritas software is unaffected by the unregistering as the procedure simply instructs the VxVDS Super VDS Provider from affecting non-Veritas utilities such as diskpart and disk manager windows disk utilities.

Additional Information

 

External Links

VxVDS Information and steps to unregister/register

 

If you need to, log a case with Veritas support via the TSANet procedure. Provide the windows event/application logs, windows version, wmic qfe output, sysinfo output.  The Veritas technical support engineer will provide a utility they use to collect diagnostics off the target host and package it into a CAB file format. This file needs to be transferred off this host and provided to Veritas per their instructions.

JIRAs of importance:

ESCL-597: Investigate and track Veritas/diskpart issues with Wells Fargo WSFC

Example stack trace:

[2018-01-23 16:13:28,440][ERROR][util.mount.WindowsMounter#reallyMount:280][Worker-61896|JOB-54152|DB_REFRESH(MSSQL_DB_CONTAINER-59)][ACTION-115099] mountLun timed out for CSPWD00B0005
java.util.concurrent.TimeoutException
        at java.util.concurrent.FutureTask.get(FutureTask.java:205)
        at com.delphix.appliance.server.ccc.impl.connectorconnection.C3BaseConnectorConnectionImpl$JobExecutor.execute(C3BaseConnectorConnectionImpl.java:274)
        at com.delphix.appliance.server.ccc.impl.connectorconnection.C3BaseConnectorConnectionImpl.processExecute(C3BaseConnectorConnectionImpl.java:394)
        at com.delphix.appliance.server.ccc.impl.connectorconnection.C3BaseConnectorConnectionImpl.internalExecute(C3BaseConnectorConnectionImpl.java:591)
        at com.delphix.appliance.server.ccc.impl.connectorconnection.C3BaseConnectorConnectionImpl.executeWithTimeout(C3BaseConnectorConnectionImpl.java:565)
        at com.delphix.appliance.server.ccc.impl.connectorconnection.C3WindowsConnectorConnectionImpl.executeWithTimeout(C3WindowsConnectorConnectionImpl.java:35)
        at com.delphix.appliance.server.host.PS1ScriptExecutor.executeWithTimeout(PS1ScriptExecutor.java:95)
        at com.delphix.appliance.server.util.mount.WindowsMounter.executeMountScriptWithLock(WindowsMounter.java:560)
        at com.delphix.appliance.server.util.mount.WindowsMounter.reallyMount(WindowsMounter.java:266)
        at com.delphix.appliance.server.util.mount.WindowsMounter.mountLun(WindowsMounter.java:197)
        at com.delphix.appliance.server.mssql.target.MSSqlBaseTargetDBScriptExecutor.mountLun(MSSqlBaseTargetDBScriptExecutor.java:179)
        at com.delphix.appliance.server.mssql.target.MSSqlVirtualTargetDBScriptExecutor.mountLun(MSSqlVirtualTargetDBScriptExecutor.java:738)
        at com.delphix.appliance.server.mssql.target.MSSqlVirtualTargetDBScriptExecutor.mountInitial(MSSqlVirtualTargetDBScriptExecutor.java:667)
        at com.delphix.appliance.server.mssql.target.vdb.MSSqlVirtualDatabaseProvisionJob.execute(MSSqlVirtualDatabaseProvisionJob.java:123)
        at com.delphix.appliance.server.mssql.target.vdb.MSSqlVirtualDatabaseManagerImpl.provision(MSSqlVirtualDatabaseManagerImpl.java:207)
        at com.delphix.appliance.server.mssql.target.vdb.MSSqlVirtualDatabaseManagerImpl.provision(MSSqlVirtualDatabaseManagerImpl.java:76)
        at com.delphix.appliance.server.db.target.vdb.BaseVirtualDatabaseManagerImpl.provision(BaseVirtualDatabaseManagerImpl.java:229)
        at com.delphix.appliance.server.mssql.executor.MSSqlDatabaseProvider.provision(MSSqlDatabaseProvider.java:690)
        at com.delphix.appliance.server.lifecycle.impl.ReprovisionManagerImpl.lambda$refreshProvisionInternal$18(ReprovisionManagerImpl.java:367)
        at com.delphix.appliance.server.lifecycle.impl.ReprovisionManagerImpl.lambda$refresh$16(ReprovisionManagerImpl.java:150)
        at com.delphix.appliance.server.jcm.impl.JobInstance$3.run(JobInstance.java:317)
        at com.delphix.appliance.server.dco.jobs.JobInterrupter.withThread(JobInterrupter.java:40)