Skip to main content
Delphix

A high volume of Secure Shell connections to a single target host may lead to DB SYNC job failure

A significant number of Secure Shell (SSH) Connections are initiated on a target host when DB_SYNC jobs are started. If the number of SSH connections match or exceed the configured or default limit, one or more jobs will fail associated with "host unavailable errors".

Issue

The following error can occur during policy DB_SYNC:

Description DB_SYNC job for "Target_Container_Name" failed due to an error during execution: The host "[999.99.99.99]" is unavailable. 
Action Make sure host is available and the SSH server/service is running. 
Severity CRITICAL 
Hostname HPUX_Machine 
Target Target_Container_Name
Timestamp 2015-12-31 23:59:59.999

 

Troubleshooting

The following stack found in the debug log is symptomatic of this issue:

[2015-12-22 22:00:04,041][INFO][executor.impl.ExecutorImpl#setupExecutorContext:275][Worker-23988|JOB-11022|DB_SYNC(ORACLE_DB_CONTAINER-421)][] Job JOB-11022 gets context ACTION-20574
[2015-12-22 22:00:04,041][DEBUG][executor.impl.ExecutorImpl#setupExecutorContext:296][Worker-23988|JOB-11022|DB_SYNC(ORACLE_DB_CONTAINER-421)][ACTION-20574] action preExecute action ACTION-20574 domain 1
[2015-12-22 22:00:04,068][INFO][ccc.impl.trilead.TrileadC3ConnectionImpl#logMe:71][pool-26-thread-13][] Encountered User Exception in createConnection(), TrileadC3ConnectionImpl.java:319: exception.host.ssh.connection.failed {172.21.12.45}
[2015-12-22 22:00:04,069][WARN][dhm.impl.UnixHostAttributesRetrieverTask#retrieveAttributes:193][pool-26-thread-13][] Failed to connect to host 172.21.12.45 [172.21.12.45].
com.delphix.appliance.server.dco.events.enums.HostExceptions$SshConnectionFailed: exception.host.ssh.connection.failed {172.21.12.45}
        at com.delphix.appliance.server.ccc.impl.trilead.TrileadC3ConnectionImpl.createConnection(TrileadC3ConnectionImpl.java:319)
        at com.delphix.appliance.server.ccc.impl.trilead.TrileadC3ConnectionImpl.connectUsingPublicKey(TrileadC3ConnectionImpl.java:212)
        at com.delphix.appliance.server.ccc.impl.trilead.TrileadC3ConnectionImpl.connect(TrileadC3ConnectionImpl.java:189)
        at com.delphix.appliance.server.ccc.impl.C3ServiceImpl.getSshConnection(C3ServiceImpl.java:59)
        at com.delphix.appliance.server.ccc.impl.C3ServiceImpl.getSshConnection(C3ServiceImpl.java:81)
        at com.delphix.appliance.server.dhm.impl.UnixHostAttributesRetrieverTask.retrieveAttributes(UnixHostAttributesRetrieverTask.java:188)
        at com.delphix.appliance.server.dhm.impl.UnixHostAttributesRetrieverTask.retrieveAttributes(UnixHostAttributesRetrieverTask.java:56)
        at com.delphix.appliance.server.environmentMonitor.impl.BaseAttributesRetrieverTask.fetchAttributes(BaseAttributesRetrieverTask.java:104)
        at com.delphix.appliance.server.environmentMonitor.impl.BaseAttributesRetrieverTask$1.call(BaseAttributesRetrieverTask.java:210)
        at com.delphix.appliance.server.environmentMonitor.impl.BaseAttributesRetrieverTask$1.call(BaseAttributesRetrieverTask.java:207)
        at com.delphix.session.util.ThreadFuture.call(ThreadFuture.java:107)
        at com.delphix.session.util.ThreadFuture.doRun(ThreadFuture.java:92)
        at com.delphix.session.util.AbstractFuture.run(AbstractFuture.java:253)
        ... 5 more
Caused by: java.io.IOException: There was a problem while connecting to 172.21.12.45:22

Caused by: java.io.IOException: There was a problem while connecting to 172.21.12.45:22
        at com.trilead.ssh2.Connection.connect(Connection.java:798)
        at com.delphix.appliance.server.ccc.impl.trilead.TrileadC3ConnectionImpl.establishConnection(TrileadC3ConnectionImpl.java:334)
        at com.delphix.appliance.server.ccc.impl.trilead.TrileadC3ConnectionImpl.createConnection(TrileadC3ConnectionImpl.java:315)
        ... 17 more
Caused by: java.io.IOException: Premature connection close
        at com.trilead.ssh2.transport.ClientServerHello.readLineRN(ClientServerHello.java:33)
        at com.trilead.ssh2.transport.ClientServerHello.<init>(ClientServerHello.java:68)
        at com.trilead.ssh2.transport.TransportManager.initialize(TransportManager.java:490)
        at com.trilead.ssh2.Connection.connect(Connection.java:737)
        ... 19 more

                        Internal Error: com.delphix.appliance.server.dhm.HostConnectionException: Failed to connect to host
com.delphix.appliance.server.dhm.HostConnectionException: Failed to connect to host
        at com.delphix.appliance.server.dhm.impl.UnixHostAttributesRetrieverTask.retrieveAttributes(UnixHostAttributesRetrieverTask.java:197)
        at com.delphix.appliance.server.dhm.impl.UnixHostAttributesRetrieverTask.retrieveAttributes(UnixHostAttributesRetrieverTask.java:56)
        at com.delphix.appliance.server.environmentMonitor.impl.BaseAttributesRetrieverTask.fetchAttributes(BaseAttributesRetrieverTask.java:104)
        at com.delphix.appliance.server.environmentMonitor.impl.BaseAttributesRetrieverTask$1.call(BaseAttributesRetrieverTask.java:210)
        at com.delphix.appliance.server.environmentMonitor.impl.BaseAttributesRetrieverTask$1.call(BaseAttributesRetrieverTask.java:207)
        at com.delphix.session.util.ThreadFuture.call(ThreadFuture.java:107)
        at com.delphix.session.util.ThreadFuture.doRun(ThreadFuture.java:92)
        at com.delphix.session.util.AbstractFuture.run(AbstractFuture.java:253)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

 Took 19 ms.
[2015-12-22 22:00:04,080][WARN][oracle.database.OracleDatabaseAccessorManager#chooseAvailableInstanceAndHost:370][Worker-23988|JOB-11022|DB_SYNC(ORACLE_DB_CONTAINER-421)][ACTION-20574] Host [172.21.12.45] for database D03SPLEX is unavailable
[2015-12-22 22:00:04,081][INFO][oracle.database.OracleDatabaseAccessorManager#logMe:71][Worker-23988|JOB-11022|DB_SYNC(ORACLE_DB_CONTAINER-421)][ACTION-20574] Encountered User Exception in chooseAvailableInstanceAndHost(), OracleDatabaseAccessorManager.java:372: exception.oracle.accessor.host.unavailable {[172.21.12.45]}

 

 

From bundleminer print out enough jobs to cover the timeframe of the problem. In this case, needed to print the 19-21 jobs happening 12-23 at 5:00 am:

11023       COMPLETED  2015-12-23 05:00:04.181  2015-12-23 05:09:32.141   9.47       Run SnapSync for database "D04D".
11022       FAILED     2015-12-23 05:00:03.867  2015-12-23 05:00:16.739   0.21       Run SnapSync for database "D03SPLEX".
11021       COMPLETED  2015-12-23 05:00:03.647  2015-12-23 05:03:21.936   3.30       Run SnapSync for database "D04C".
11020       COMPLETED  2015-12-23 05:00:03.491  2015-12-23 05:15:27.18    15.39      Run SnapSync for database "D03B".
11019       COMPLETED  2015-12-23 05:00:03.194  2015-12-23 05:01:33.831   1.51       Run SnapSync for database "D02".
11018       COMPLETED  2015-12-23 05:00:03.019  2015-12-23 05:03:10.624   3.13       Run SnapSync for database "D03C".
11017       COMPLETED  2015-12-23 05:00:02.563  2015-12-23 05:14:06.943   14.07      Run SnapSync for database "D03D".
11016       COMPLETED  2015-12-23 05:00:02.355  2015-12-23 05:02:58.455   2.94       Run SnapSync for database "D04".
11015       COMPLETED  2015-12-23 05:00:02.198  2015-12-23 05:00:08.26    0.10       Run SnapSync for database "D06".
11014       COMPLETED  2015-12-23 05:00:02.066  2015-12-23 05:02:08.521   2.11       Run SnapSync for database "D04B".
11013       COMPLETED  2015-12-23 05:00:01.852  2015-12-23 05:00:11.597   0.16       Run SnapSync for database "D11".
11012       COMPLETED  2015-12-23 05:00:01.662  2015-12-23 05:00:03.779   0.04       Run SnapSync for database "D05C".
11011       COMPLETED  2015-12-23 05:00:01.512  2015-12-23 05:01:26.985   1.42       Run SnapSync for database "D02D".
11010       COMPLETED  2015-12-23 05:00:01.34   2015-12-23 05:00:04.624   0.05       Run SnapSync for database "D15".
11009       COMPLETED  2015-12-23 05:00:01.088  2015-12-23 05:00:02.618   0.03       Run SnapSync for database "D05".
11008       COMPLETED  2015-12-23 05:00:00.961  2015-12-23 05:01:44.664   1.73       Run SnapSync for database "D02C".
11007       COMPLETED  2015-12-23 05:00:00.671  2015-12-23 05:02:03.771   2.05       Run SnapSync for database "D03".
11006       COMPLETED  2015-12-23 05:00:00.533  2015-12-23 05:00:03.027   0.04       Run SnapSync for database "D08".
11005       COMPLETED  2015-12-23 05:00:00.277  2015-12-23 05:00:03.024   0.05       Run SnapSync for database "D08A".

 

A pattern associated with the timing of this issue will be evident. In the above example 20 or more simultaneous SnapSync jobs are always associated with a single failing job. The following MDS query may be helpful in triage:

select j.job_id,e.event_id,e.event_time,j.target_name,e.message_code,substr(e.message_details,1,50) as message_details
from dlpx_jobs j
inner join dlpx_job_events e on 'JOB-' || j.job_id = e.job
where j.job_type = 'DB_SYNC'
and j.job_state != 'COMPLETED'
and e.job_state = 'FAILED'
order by j.job_id desc ,e.event_id asc limit 20

-[ RECORD 1 ]---+---------------------------------------------------
job_id          | 11022
event_id        | 107035
event_time      | 2015-12-23 05:00:04.129
target_name     | DEV DBA Shareplex testing. Indefinite./D03SPLEX
message_code    | exception.oracle.accessor.host.unavailable
message_details | The host "[172.21.12.45]" is unavailable.
-[ RECORD 2 ]---+---------------------------------------------------
job_id          | 10788
event_id        | 105112
event_time      | 2015-12-19 05:00:02.681
target_name     | DEV .mainline permanent/D02
message_code    | exception.oracle.accessor.host.unavailable
message_details | The host "[172.21.12.45]" is unavailable.
-[ RECORD 3 ]---+---------------------------------------------------
job_id          | 10665
event_id        | 104192
event_time      | 2015-12-17 05:00:03.517
target_name     | DEV AR1346 Harmonization 01/15/16/D04B
message_code    | exception.oracle.accessor.host.unavailable
message_details | The host "[172.21.12.45]" is unavailable.
-[ RECORD 4 ]---+---------------------------------------------------
job_id          | 10658
event_id        | 104218
event_time      | 2015-12-17 05:00:46.17
target_name     | DEV AR1399 (Giri) 10/31/2015/D02C
message_code    | exception.oracle.accessor.host.unavailable
message_details | The host "[172.21.12.45]" is unavailable.
-[ RECORD 5 ]---+---------------------------------------------------
job_id          | 10621
event_id        | 103883
event_time      | 2015-12-16 05:00:03.854
target_name     | DEV AR1399 (Giri) 10/31/2015/D03C
message_code    | exception.oracle.accessor.host.unavailable
message_details | The host "[172.21.12.45]" is unavailable.
-[ RECORD 6 ]---+---------------------------------------------------

The host affected by this issue may vary or it may always be the same host. That variation will be dependent on whether or not hosts are configured in same way with regard to SSH.

Resolution

In this specific case, the Oracle target was an HP-UX machine. A suspicion of ssh limitation prompted to retrieve the following information:

  • The contents of /opt/ssh/etc/sshd_config
  • Copy of the syslog.log (from syslog.d) - usually found in /var/adm/syslog/syslog.log
    • Details where this file resides are in the /etc/syslog.conf file
    • We are looking for ssh errors during the time of failure
  • To find out ssh version from the hp box:
    • ps -ef | grep sshd
    • /opt/ssh/sbin/sshd -v)

Inspect the output of "ps -ef" with care. Ensure the correct Secure Shell application is identified. Many customers utilise 3rd party tools including SSH and the location of the relevant configuration may be different from what is expected.

In the sshd_config file is this default setting:

#MaxStartups 10:30:100

First item is for number of SSH connections. If the SSH connections occur near simultaneously, there is chance one SSH connection attempt will fail. To avoid this issue and resolve the problem increase the max higher value and un-comment the configuration setting as follows:

MaxStartups 20:30:100

Once the change has been made, restart the SSH daemon and check for the errors associated with that target host over the following days. If this was the problem the failures should stop occurring, at least for that host.

 

Additional Information

In this example the number of simultaneous connections was around 20. You can tune up the maximum number of SSH connections to a suitable value to avoid the issue, but a balance needs to be struck so as to avoid consuming too many resources. Setting a very large value is inappropriate. Alternatively, scheduling staggering DB_SYNC job times may provide resolution and avoid the need to request changes to SSH.

Information about the ssh properties:

http://searchnetworking.techtarget.com/tip/HP-UX-Secure-Shell-Part-5-The-host-configuration-files

  • Was this article helpful?