Job and Engine Configuration
This article describes the recommended initial configuration for profiling and masking jobs, and how to tune that configuration afterwards.
Initial Job Configuration Settings
Profiling Job Configuration
Profiling is used to automate assigning algorithms to columns. A few things that are important here:
- There are two types of Profiling: Column Level and Data Level. The type is defined in the Profiling Set.
- Column Level profiling runs a RegEx against the column name in the Inventory.
- Data Level profiling runs a RegEx against a sample set of data in the column.
- The configuration requirements for each type are different.
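As a rough illustration of the difference between the two types (not the engine's actual implementation), Column Level matching looks only at the column name while Data Level matching looks at sampled values. The expressions, sample data, and match threshold below are hypothetical:

```python
import re

# Hypothetical profiling expressions; real Profiling Sets define their own.
COLUMN_LEVEL_EXPR = re.compile(r"(first|last)[_ ]?name", re.IGNORECASE)
DATA_LEVEL_EXPR = re.compile(r"^\d{3}-\d{2}-\d{4}$")  # value shaped like a US SSN

def column_level_match(column_name: str) -> bool:
    """Column Level profiling: the RegEx runs against the column name only."""
    return COLUMN_LEVEL_EXPR.search(column_name) is not None

def data_level_match(sample_rows: list[str], threshold: float = 0.8) -> bool:
    """Data Level profiling: the RegEx runs against a sample of column data."""
    if not sample_rows:
        return False
    hits = sum(1 for value in sample_rows if DATA_LEVEL_EXPR.match(value))
    return hits / len(sample_rows) >= threshold

print(column_level_match("CUST_FIRST_NAME"))             # True
print(data_level_match(["123-45-6789", "987-65-4321"]))  # True
```

This also shows why Data Level profiling needs a running job with database access: it must fetch sample rows, whereas Column Level profiling only needs the Inventory metadata.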
Profiling Column Level only
When profiling at the Column Level only, there is no need to change any job configuration (the Streams and memory settings are not used).
Profiling Data Level
When using Data Level profiling, a job is started and a sample set of data is read from the database. The best initial configuration is:
- Streams: 1
- Min: 1024 MB
- Max: 1024 MB
If large RegExes are used, increase the Min/Max memory.
Once the job is confirmed working, Streams can be increased to improve performance on large profiling jobs; note that the memory will need to be increased as well. The upper limit on Streams depends on the overall system (including the database); as a guideline, do not exceed 10 Streams or 8 GB of Max memory.
Masking Job Configuration
There are three different types of masking jobs - Database, File, and Mainframe. The configurations are fairly similar.
Job configuration - Database
The recommended initial settings for a database masking job are:
- In-Place Masking
- Streams: 1
- Update Threads: 1
- Min: 1024 MB
- Max: 1024 MB
- Commit Size: leave blank (default: 10,000)
- Feedback Size: leave blank (default: 50,000; see the table below)
- Other configurations are for special cases only.
Once the job is confirmed working, Streams can be increased to improve performance on large masking jobs; note that the memory will need to be increased as well. The upper limit on Streams depends on the overall system (including the database); as a guideline, do not exceed 6 Streams. See below for memory.
Increasing Update Threads might also improve performance, depending on the overall system. A guideline is to test with 2 and then 4 threads on one table to see which is faster.
Job configuration - File
The recommended initial settings for a file masking job are:
- On-The-Fly Masking
- Streams: 1
- Min: 2048 MB
- Max: 2048 MB (or larger as needed)
File masking jobs require more memory because the entire row is ingested by the masking job (not just the fields being updated). Patterns in the Rule Set also require memory.
Once the job is confirmed working, Streams can be increased to improve performance on large masking jobs; note that the memory will need to be increased as well. The upper limit on Streams depends on the overall system; as a guideline, do not exceed 6 Streams.
The Feedback Size defines how frequently progress entries are written to the log files. The values below are a guideline; one way to choose the size is that a job's progress entries should preferably fit into one log file.
|Database Size|Max number of records in a table|Feedback Size value|
|---|---|---|
|Small to medium|Up to 10,000,000|50,000 (default)|
|Large|Up to 500,000,000|500,000|
|Very large|Over 500,000,000|5,000,000|
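The guideline table above can be expressed as a small helper. The thresholds are taken directly from the table; the function name is just for illustration:

```python
def feedback_size(max_records_in_table: int) -> int:
    """Pick a Feedback Size from the largest table's record count,
    following the guideline table above."""
    if max_records_in_table <= 10_000_000:   # small to medium database
        return 50_000                        # the default
    if max_records_in_table <= 500_000_000:  # large database
        return 500_000
    return 5_000_000                         # very large database

print(feedback_size(8_000_000))    # 50000
print(feedback_size(750_000_000))  # 5000000
```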
The key factors determining how much memory is needed are the number of columns masked and the masking algorithms used. Other factors are the column type and length and, to some degree, the Update Size. The size of the table (the number of records) has no impact on memory requirements, as the data is processed as a stream.
Min and Max should be set to the same value for optimal performance. Note: if memory is limited or cannot be allocated up front (due to ESX or some unsupported engine configurations), Min can be set lower so that the memory grows gradually.
When should the memory be increased?
- All Masking
- When there is a large number of masked columns per table - more than 10.
- When there is a large number of lookup values in Secure Lookup algorithm.
- When there is a large number of lookup values in Mapping algorithm.
- When large Segmented Mapping algorithms are used.
- When large custom mapplets are used.
- Database Masking
- When the masked column data type and length are large.
- File Masking
- When each row contains a lot of characters. All fields (masked and unmasked) are read into the masking job.
- A large number of Data Level expressions.
- Complex RegEx.
Masking Engine Configuration
The minimum memory configuration for the Masking Engine is 16 GB, however, more memory could be needed.
How much memory is needed depends on:
- The OS (2 GB)
- The Masking Engine (ME) (default: 1 GB)
- For room to move (1 GB)
- Each job executed at the same time (see section above for size).
The sum of these defines the memory used, and it can never exceed the memory available. If it does, the Java process aborts, which restarts the complete stack.
The amount of RAM needed is, therefore: OS + ME + Extra + Sum(Max for each Job running).
Example: 2 GB + 1 GB + 1 GB + (4 GB + 4 GB + 8 GB) = 20 GB
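The calculation above can be checked with a few lines; the 4 GB / 4 GB / 8 GB job sizes are the values from the example:

```python
OS_GB = 2     # operating system
ME_GB = 1     # Masking Engine (default)
EXTRA_GB = 1  # headroom ("room to move")

# Max memory of each job running at the same time (example values)
job_max_gb = [4, 4, 8]

total_gb = OS_GB + ME_GB + EXTRA_GB + sum(job_max_gb)
print(total_gb)  # 20
```

If more jobs run concurrently, their Max values are simply added to the sum; the engine VM must have at least this much RAM.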
Signs the Job or the Engine is incorrectly configured
Signs that the engine is misconfigured are:
- The Masking Engine asks for a username/password unexpectedly.
- This can happen if the Masking Engine restarts.
- The Masking Engine restarts.
- This can happen if the memory runs out on the engine.
- More VM memory is needed.
- The job hangs (Running state).
- This can happen if the memory ran out for the job.
- More job memory is needed.
- The Masking Engine logs contain:
- "OutOfMemoryError" such as "OutOfMemoryError: GC overhead limit exceeded"