
Masking Performance, Streams, Threads, Splice Jobs (KBA5217)

 


KBA# 5217

At a Glance

Versions: Applicable Delphix Masking versions: 5.3 (from 5.3.4), 6.x
Description: This page describes how to optimally adjust the Streams, Update Threads, and Multiple Jobs for performance. The KBA also lists how different parameters affect performance. 
Overview: Each configuration, where it is applied, how it affects memory, and how it improves performance (run-time).

 
+----------------+----------------+-------------------------+-------------+-----------------------------------+
| Configuration  | Applies to     | Memory Needed           | Perf. Impr. | Notes                             |
+----------------+----------------+-------------------------+-------------+-----------------------------------+
| Jobs           | Masking Engine | Sum of jobs             | Big         | Multiple jobs at the same time.   |
| Streams        | Masking Engine | Streams x Max Mem/Table | Large       | Multiple tables at the same time. |
| Update Threads | Database       | n/a                     | Slight      | Network and DB capacity are key.  |
+----------------+----------------+-------------------------+-------------+-----------------------------------+
Warning: Update Threads might increase the chance for deadlocks. 

Applicable Delphix Versions

+---------------+---------------------------------------------------------------------------+
| Major Release | All Sub Releases                                                          |
+---------------+---------------------------------------------------------------------------+
| 6.0           | 6.0.0.0, 6.0.1.0, 6.0.1.1, 6.0.2.0, 6.0.2.1, 6.0.3.0, 6.0.3.1, 6.0.4.0,  |
|               | 6.0.4.1, 6.0.4.2, 6.0.5.0, 6.0.6.0, 6.0.6.1, 6.0.7.0, 6.0.8.0, 6.0.8.1,  |
|               | 6.0.9.0, 6.0.10.0, 6.0.10.1, 6.0.11.0, 6.0.12.0, 6.0.12.1                |
+---------------+---------------------------------------------------------------------------+
| 5.3           | 5.3.4.0, 5.3.5.0, 5.3.6.0, 5.3.7.0, 5.3.7.1, 5.3.8.0, 5.3.8.1, 5.3.9.0   |
+---------------+---------------------------------------------------------------------------+

Faster Masking using Splicing, Streams, and Threads

The most common performance tuning options from the Masking Engine are:

  • Splicing: running multiple simultaneous jobs.
  • Increasing Streams (recommended range 1 to 10; maximum 99).
  • Increasing Update Threads:
    • Most databases: recommended range 1 to 4.
    • Some services, for example the Salesforce API: recommended range 1 to 10 (the maximum).

 

Note:

There are some important notes on performance tuning:

  • Tune only error-free masking jobs.
  • When tuning performance, ensure the database itself is optimally tuned:
    • The database should be tuned for masking, that is, for updating a large number of records.
    • This includes network and storage.
    • Be aware of triggers, indexes, and other database configurations such as logs and log switching.

 

Splicing - Multiple Jobs

It is possible to split a job up into multiple jobs (splicing). The benefits of doing this are performance gains and control over the order of execution, so that execution time and memory usage can be optimized.

Note:

It is only possible to apply one masking job to a Masked vDB. The workaround here is to use the API and script the masking execution.

 

How to Job Splice

If the job execution runtime (duration) is an issue, then splicing the job into two or more jobs may help. The best way to do this is to put the large tables, for example, the three largest, in one job and set the streams to 3. Then, group all the remaining smaller tables in a second job. 

This will enable you to optimize the job runtime to the runtime of the largest/slowest job.
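The arithmetic behind splicing can be sketched as follows. This is a simplified model (the engine assigns tables to streams upfront; here we assign them round-robin, which matches the examples below), using the hypothetical durations from Example 1:

```python
# Rough estimator for masking job runtimes (illustration only; real
# runtimes depend on how the engine allocates tables to streams).
# Durations are in minutes, taken from Example 1 below.

def job_runtime(durations, streams):
    """Assign tables to streams round-robin (upfront, not dynamically)
    and return the runtime of the slowest stream."""
    lanes = [0] * streams
    for i, duration in enumerate(durations):
        lanes[i % streams] += duration
    return max(lanes)

large = [45, 40, 39]   # Table_1..Table_3
small = [3] * 10       # T4..T13, average 3 minutes each

# One job, Streams = 1: every table runs back to back.
print(job_runtime(large + small, streams=1))   # 154 minutes

# Spliced: Job 1 masks the three large tables with Streams = 3,
# Job 2 masks the small tables with Streams = 1; jobs run in parallel,
# so the overall runtime is that of the slowest job.
print(max(job_runtime(large, streams=3),
          job_runtime(small, streams=1)))      # 45 minutes
```

The spliced configuration is bounded by the single largest table (45 minutes), which is why adding more streams to the small-table job would not help.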

Example 1 - No Splicing 

In this example, the Rule Set has the following DB tables. The durations are estimates from an earlier execution.

+--------------------+-------+--------------------+
| DB Table           | Size  | Estimated duration |
+--------------------+-------+--------------------+
| Table_1            | Large | 45 minutes         |
| Table_2            | Large | 40 minutes         |
| Table_3            | Large | 39 minutes         |
| 10 tables (T4-T13) | Small | Average 3 minutes  |
+--------------------+-------+--------------------+

Using one job

Using one job and Streams set to 1 will cause each table to run one after the other.

+---------+------------------------------------------------+
| Job 1   | T1 + T2 + T3 + T4 + T5 + ... + T11 + T12 + T13 |
+---------+------------------------------------------------+
| Runtime | <<---           Tot 154 min               -->> |
+---------+------------------------------------------------+

Example 2  - Splice (similar duration)

Using 2 jobs

Using the same Rule Set configuration as in Example 1, we can optimize the execution runtime by configuring this as two jobs and using Streams.

Notice that Streams is only set to 3 on Job 1. It is not needed on Job 2. Optimizing the use of Streams will ensure optimal use of memory and DB connections when running these two jobs.

+-------+---------------------+---------+
| Jobs  | Tables              | Streams |
+-------+---------------------+---------+
| Job 1 | Table_1 to Table_3  | 3       |
| Job 2 | Table_4 to Table_13 | 1       |
+-------+---------------------+---------+

 

The execution will look something like this:

+---------+--------------------------------+
| Job 1   | < Table_1 .................. > |  45 min
|         | < Table_2 ..............>      |  40 min
|         | < Table_3 ..............>      |  39 min
+---------+--------------------------------+
| Job 2   | < T4 + T5 +...+ T13 >          |  30 min
+---------+--------------------------------+
| Runtime | <<---    Tot 45 min       -->> |
+---------+--------------------------------+

Example 3  - Splice (different duration)

In this example, the Rule Set has the following DB tables. The durations are estimates from an earlier execution.

+--------------------+--------+--------------------+
| Table              | Size   | Estimated duration |
+--------------------+--------+--------------------+
| Table_1            | Large  | 45 minutes         |
| Table_2            | Medium | 22 minutes         |
| Table_3            | Medium | 15 minutes         |
| 10 tables (T4-T13) | Small  | Average 3 minutes  |
+--------------------+--------+--------------------+

Using one job

Using one job and Streams set to 1 will cause each table to run one after the other.

+---------+------------------------------------------------+
| Job 1   | T1 + T2 + T3 + T4 + T5 + ... + T11 + T12 + T13 |
+---------+------------------------------------------------+
| Runtime | <<---           Tot 112 min               -->> |
+---------+------------------------------------------------+

Using 3 jobs

Configuring this as three jobs, we can optimize the execution runtime; by keeping Streams = 1, memory usage is also optimized.

+-------+---------------------+---------+
| Jobs  | Tables              | Streams |
+-------+---------------------+---------+
| Job 1 | Table_1             | 1       |
| Job 2 | Table_2 and Table_3 | 1       |
| Job 3 | Table_4 to Table_13 | 1       |
+-------+---------------------+---------+

 

The execution will look something like this (note that we did not need to set Streams = 2 on Job 2, as the job still finishes under 45 minutes; this optimizes execution time and saves memory):

+---------+--------------------------------+
| Job 1   | < Table_1 .................. > |  45 min
+---------+--------------------------------+
| Job 2   | < Table_2   >< Table_3   >     |  37 min
+---------+--------------------------------+
| Job 3   | < T4 + T5 +...+ T13 >          |  30 min
+---------+--------------------------------+
| Runtime | <<---    Tot 45 min       -->> |
+---------+--------------------------------+

Streams

The Streams feature defines how many tables (or files) are processed at the same time.

Since the Masking Engine knows neither the size of each table nor the time needed to mask it, tables are not ordered by size or duration (for example, masking big or long-running tables first). As a result, some streams finish before others. The largest tables may also end up in the same stream, in which case the overall runtime is more or less unchanged.

Setting Streams very high will cause a large number of tables to start masking at the same time. This will cause a high impact on CPU usage, memory requirements, and also use more resources on the masked database.

Recommendations

  • Start with Streams = 1.
  • Check runtime and decide on the best option between splicing and Streams.
  • Optimal values: Streams 1 to 10.
  • Max Streams is 99.
  • If a large number of tables is masked in one job, Streams should be used.

 

Update Threads

The Update Threads feature splits the update of a table into multiple simultaneous threads. This has little impact on the Masking Engine itself, since the threads are processed on the database or service.

Adding more threads makes the Masking Engine push harder, but this usually does not result in higher performance. There are, however, cases where it has a great impact:

  • The database or service has a large capacity and can process multiple requests at the same time.
  • Limits are set per connection on the database or service; each thread then runs at that per-connection maximum, so more threads yield higher total throughput.

Special case - Single Table Splicing

It is possible to splice a single table. To make this effective, the SELECT and UPDATE need to operate on different blocks/pages of the database.

To do this multiple Rule Sets and Jobs are needed, each with a Rule Set Filter using MOD on the Block/Page Number.

Oracle 

This is known to work on Oracle, where the block number can be retrieved with 'DBMS_ROWID.ROWID_BLOCK_NUMBER(ROWID)'.

Example splicing a table into 4 jobs: 

  1. RuleSet_0  - with Filter expression 'MOD(DBMS_ROWID.ROWID_BLOCK_NUMBER(ROWID),4) = 0'
  2. RuleSet_1  - with Filter expression 'MOD(DBMS_ROWID.ROWID_BLOCK_NUMBER(ROWID),4) = 1'
  3. RuleSet_2  - with Filter expression 'MOD(DBMS_ROWID.ROWID_BLOCK_NUMBER(ROWID),4) = 2'
  4. RuleSet_3  - with Filter expression 'MOD(DBMS_ROWID.ROWID_BLOCK_NUMBER(ROWID),4) = 3'
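To see why these filters work, the sketch below (with made-up block numbers; on Oracle the real value comes from DBMS_ROWID.ROWID_BLOCK_NUMBER(ROWID)) shows that MOD places every block in exactly one rule set, so the four jobs together cover the whole table with no overlap:

```python
# Illustration of MOD-based rule set filters partitioning a table.
# Block numbers here are hypothetical; the principle is that
# MOD(block_number, 4) maps each block to exactly one of 4 buckets.

JOBS = 4
block_numbers = range(100, 120)   # hypothetical block numbers

buckets = {j: [] for j in range(JOBS)}
for blk in block_numbers:
    buckets[blk % JOBS].append(blk)   # mirrors MOD(..., 4) = j

# Every block falls into exactly one bucket: no row is masked twice,
# and no row is missed.
all_blocks = sorted(b for blks in buckets.values() for b in blks)
assert all_blocks == list(block_numbers)

for j, blks in buckets.items():
    print(f"RuleSet_{j}: {len(blks)} blocks")
```

Because SELECT and UPDATE in each job touch disjoint sets of blocks, the four jobs avoid contending for the same pages.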

 

This is an advanced technique and to get help with implementation, please contact your CSM and/or Technical Services. 

Impact on the Masked Target

Usually, the bottleneck and the impact on performance is during the update process on the masked target.

Some databases or services might be limited on the number of connections that can connect at the same time.

  • Each masked table uses one connection per Update Thread.
  • Each masking job masks as many tables at the same time as defined by Streams.
  • Multiple jobs can be executed at the same time; the maximum (7) is set in Application Settings.

 

Number of Connections

Number of Connections = sum over all running jobs of (Streams x Update Threads per job)

 

Examples

The table below shows how many concurrent Update Connections there are based on the number of simultaneous jobs running, Streams, and Update Threads used. Even smaller numbers can create a large number of connections. 

+------+---------+---------+-------------+
| Jobs | Streams | Threads | Connections |
+------+---------+---------+-------------+
|    1 |       1 |       1 |           1 |
|    1 |       2 |       1 |           2 |
|    1 |       1 |       2 |           2 |
|    2 |       1 |       1 |           2 |
|    2 |       2 |       2 |           8 |
|    7 |      10 |       4 |         280 |
+------+---------+---------+-------------+
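As a quick check, the formula above reproduces the table. This is a minimal sketch assuming every running job contributes Streams x Update Threads connections:

```python
# Concurrent update connections = sum over running jobs of
# (Streams x Update Threads) for that job.

def connections(jobs):
    """jobs: list of (streams, threads) tuples, one per running job."""
    return sum(streams * threads for streams, threads in jobs)

print(connections([(1, 1)]))           # 1 job, 1 stream, 1 thread   -> 1
print(connections([(1, 1), (1, 1)]))   # 2 jobs, 1 stream, 1 thread  -> 2
print(connections([(2, 2), (2, 2)]))   # 2 jobs, 2 streams, 2 threads -> 8
print(connections([(10, 4)] * 7))      # 7 jobs, 10 streams, 4 threads -> 280
```

Note that jobs with different Streams/Threads settings simply contribute different terms to the sum.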

 

Important:

Some important notes:

  • Not all databases support multi-threading.
  • Multiple Streams are usually better than multiple Threads (as Streams include reading and masking).
  • The biggest improvement is usually from 1 to 2 (Streams or Threads). The returns diminish quickly after that.
  • Testing is needed to determine which configuration works best for a specific database environment.
  • Threads can cause table deadlocks; this is frequent on Sybase and older versions of MS SQL Server.
  • The allocation of tables to streams is decided upfront, not dynamically. One stream might finish faster than the others.
  • The longest-running table(s) and the stream order determine the total execution time. Executing long-running tables in a separate job that starts first (job splicing) can help reduce execution time.

 

Related Articles

The following articles may provide more or related information: