Skip to main content

Algorithm: Data Cleansing (KBA1662)



The Data Cleansing Algorithm is not frequently used.  It is a simple algorithm which uses a lookup file with lookup values. These values are imported when you create the Lookup algorithm. 

The recommendation is to limit the number of items in the Secure Lookup to under 500,000 due to load times and memory requirements.   

At a Glance 

Description: This algorithm is used to cleanse data (standardize in DQ terms). It can be used to obfuscate the data. The algorithm uses lookup values to change a set of terms to another (standard) set of terms. This algorithm is lightweight and fast.

1:1 Mapping2

Data Cleanse Lookup Yes  Yes  

1 Referential Integrity - The cleaned value will be the same between job executions as well as tables.
2 1:1 Mapping - The cleaned value will be mapped uniquely to the input value.

Works with all characters encodings (some bugs are known). Encoding should give the same result.
Data is not trimmed and it is Case Sensitive (Value and VALUE are different).
Lookup Pool Size: Up to 500,000
Limitations: Will only match a full string in a field (not a word (or words) within a string.
Sensitive to whitespaces.
Large pool size (number of rows in the lookup file) will take a long time to load. 

Creation: Lookup values are loaded when the algorithm is created from a text file.

Modify: The Description and the Lookup Value cab be changed. 

Creating and Modifying Algorithm  

The algorithms are accessed from the tab Settings and under the submenu Algorithm. A custom algorithm can be created and modified (edit):

  • To create click Add Algorithm.
  • To modify click the Edit icon.

User Interface 

When creating (and modifying) the algorithm, the following popup will display:

  1. Algorithm Properties
    • Algorithm Name
    • Description
  2. Lookup Details
    • Lookup File Name
    • Delimiter - the character used in the file to separate search and lookup value


Editable parameters are:

  • Description
  • List of lookup values


To change any other parameter the algorithm needs to be deleted and then recreated. 


Pool Size and Performance 

The pool size (number of rows in the lookup file) affects both the creation time and the load time. An estimate is that the increase in load time is 10 to 3 - increase the number of values to load by 3 and it will take 10 times longer. On an average system, it will take 4 min to load 500,000 values. Loading 1,500,00 will take 40 min. 

The graph below shows the load values in milliseconds (y) and in the number of values loaded on the horizontal axis (x). 

Memory Requirements 

The ingested values are loaded into RAM when the masking job starts. For large pool sizes, it might be required to increase the Min/Max Memory settings in the Job Configuration. 

White space and Case Sensitivity  

The Data Cleanse Algorithm is sensitive to white-spaces and case. This can be a problem if the data being masking is stored in a CHAR (or NCHAR) column and these values will be padded. 

White space workaround 

To resolve white space issues, use workaround:

  1. Open and edit the Rule Set
  2. Add a Custom SQL statement with the following amendment to the column that need to be trimmed:

To   : LTRIM(RTRIM((maskCol)) as maskCol

SELECT ID, LTRIM(RTRIM((maskCol)) as maskCol from myTable;

Case Sensitivity workaround 

To resolve case sensitivity, use workaround:

  1. Open and edit the Rule Set
  2. Add a Custom SQL statement with the following amendment to the column with case issue:

To   : UPPER((maskCol)) as maskCol

SELECT ID, UPPER(maskCol) as maskCol from myTable;

Masked Data Encoded with Non-Standard Code Page 

The data to be masked might be encoded and contain characters that are not converted correctly to UTF-8. The masked data is masked as "????". This issue is resolved in version 5.1.

If experiencing this issue please raise a case with Delphix Support. 


Below is an example of mask (cleansed) result using varchar. Note the case and whites pace sensitivity. 

Text file with search lookup values:


Masking (cleansed) result:

| Ref | Source | Cleansed |
| 1   | tst    | test     |
| 2   | Tst    | Test     |
| 3   | TST    | TEST     |
| 4   | tsT    | tsT      |  << Not cleansed >> no search value match
| 5   | TST    | TST      |  << Not cleansed >> trailing whitespace



Related Articles

The following articles may provide more information or related information to this article:

Knowledge Base: