Skip to main content

KBA1662 Algorithm - Data Cleansing Algorithm



Applicable Delphix Masking Versions

  • 5.2.x
  • 5.1.x
  • 5.0.x
  • 4.x

At a Glance


This algorithm is used to cleanse data (standardize in DQ terms). It can be used to obfuscate the data.


The algorithm uses lookup values to change a set of terms to another (standard) set of terms. This algorithm is lightweight and fast.


Works with all characters encodings (some bugs are known). Encoding should give the same result. Data is not trimmed and it is Case Sensitive (Value and VALUE are different).

Lookup Pool Size: Up to 500,000

Will only match a full string in a field (not a word (or words) within a string. It is also sensitive to whitespaces.

Large pool size (number of rows in the lookup file) will take a long time to load. 


Creation: Lookup values are loaded when the algorithm is created from a text file.

Modify: The Description and the Lookup Value cab be changed. 

High-Level Overview

The Data Cleansing Algorithm is not frequently used. 

It is a simple algorithm which uses an XML file with encrypted values search and lookup values. These values are imported in the creation of the lookup algorithm. 

It is recommended to limit the number of items in the Secure Lookup to under 500,000 due to load times and memory requirements.   

UI - Creation and Modification 

The algorithms are accessed from the tab Settings and under the submenu Algorithm. A custom algorithm can be created and modified (edit):

  • To create click "Add Algorithm".
  • To modify click the green pen in the "Edit" column.

User Interface

The following popup is accessed when creating and modifying the algorithm:

UI - Algorithm - Data Cleansing Algorithm Docs.png

  1. Algorithm Properties
    • Algorithm Name
    • Description
  2. Lookup Details
    • Lookup File Name
    • Delimiter - the character used in the file to separate search and lookup value


Editable parameters are:

  • Description
  • List of lookup values

Note - To change any other parameter the algorithm needs to be deleted and then recreated. 


Pool Size and Performance

The pool size (number of rows in the lookup file) affects both the creation time and the load time. An estimate is that the increase in load time is 10 to 3 - increase the number of values to load by 3 and it will take 10 times longer. On an average system, it will take 4 min to load 500,000 values. Loading 1,500,00 will take 40 min. 

The graph below shows the load values in milliseconds (y) and in the number of values loaded on the horizontal axis (x). 


Memory Requirements

The ingested values are loaded into RAM when the masking job starts. For large pool sizes, it might be required to increase the Min/Max Memory settings in the Job Configuration. 

Best Practice: Job Configuration on the Masking Engine

Whitespace and Case Sensitivity 

The Data Cleanse Algorithm is sensitive to white-spaces and case. This can be a problem if the data being masking is stored in a CHAR (or NCHAR) column and these values will be padded. 

Whitespace workaround

To solve this, use this workaround:

  1. Open and edit the Rule Set
  2. Add a Custom SQL statement with the following amendment:

Change : ..., maskMeCol, ...

to: ..., LTRIM(RTRIM((maskMeCol)) as maskMeCol, ...


SELECT ID, LTRIM(RTRIM((maskMeCol)) as maskMeCol from myTable;

Case Sensitivity workaround

To solve this, use this workaround:

  1. Open and edit the Rule Set
  2. Add a Custom SQL statement with the following amendment:

Change : ..., maskMeCol, ...

to: ..., UPPER(maskMeCol) as maskMeCol, ...


SELECT ID, UPPER(maskMeCol) as maskMeCol from myTable;

Masked Data Encoded with Non-Standard Code Page

The data to be masked might be encoded and contain characters that are not converted correctly to UTF-8. The masked data is masked as "????". This issue is resolved in version 5.1.

If experiencing this issue please raise a case with Delphix Support. 


Below is an example of mask (cleanse) result using varchar. Note the case and whitespace sensitivity. 

Text file with search lookup values:


Masking (cleanse) result:

| Ref | Source | Cleansed |
| 1   | tst    | test     |
| 2   | Tst    | Test     |
| 3   | TST    | TEST     |
| 4   | tsT    | tsT      |  << Not cleansed >> no search value match
| 5   | TST    | TST      |  << Not cleansed >> trailing whitespace

Additional Information

Delphix Documentation: