Delphix

Algorithm: Tokenization Re-Identification (KBA1527)

 

 

The Tokenization Algorithms are designed for Tokenization/Re-Identification jobs, although they are also frequently used when masking data. This page describes the use of the Tokenization Algorithm in Tokenization/Re-Identification jobs.

For use in Masking, see Algorithm: Tokenization in Masking (KBA1002).

 

Out of the box, there are three Tokenization Algorithms:

  • NAME_TK
  • ACCOUNT_TK
  • SSN_TK

Additional algorithms can be created. Note that these algorithms have no configuration options, and each algorithm definition has its own seed to scramble the data.

This algorithm can be used to link data within the same database, across databases on the same server, or across databases on different servers. To link data on different servers, the encryption key must be shared so that it is the same on every engine involved.

Many customers use this algorithm as a masking algorithm by overriding the algorithm settings in the Inventory Page.

At a Glance  

Description: By default, this algorithm is used with Tokenization/Re-Identification.
It is also frequently used as a Masking Algorithm.

The algorithm has two modes. If the tokenized string is longer than the length defined by the field's Data Type, the algorithm switches from mode 1 (BASE-64 Encoding) to mode 2 (Caesar Cipher):

  1. BASE-64 Encoding
  2. Caesar Cipher

Characteristics:

This algorithm is lightweight, fast and will uniquely scramble the data. 

 
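+---------+------------+---------------------------+-----------------+----------+
| Mode    | Type       | Referential Integrity (1) | 1:1 Mapping (2) | Strength |
+=========+============+===========================+=================+==========+
| Base-64 | Code based | Yes                       | Yes             | Strong   |
| Caesar  | Code based | Yes                       | Yes             | Weak     |
+---------+------------+---------------------------+-----------------+----------+

(1) Referential Integrity - The masked value will be the same across job executions as well as across tables.
(2) 1:1 Mapping - Each input value is mapped to a unique masked value within the masked column.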

Character Encodings:

In: Depending on Mode
Out, mode 1: BASE-64

  • All characters and encodings.

Out, mode 2: Caesar Cipher

  • US-ASCII 7 bit alphabetic characters.
  • Special characters are retained. 
  • Case (upper/lower) is retained.  
Lookup Pool Size: None
Limitations:

The algorithm changes mode, and therefore its output format, depending on the length defined for the masked column.

Caesar Cipher does not work on extended character encodings such as Japanese, nor on binary data. Caesar Cipher is reversible.

The algorithm can only be selected on the Inventory page. 

Creating Algorithm   

To create this algorithm, you need to specify only one field - Algorithm Name.

Fields are:

  1. Properties
    1. Algorithm Name
    2. Description (optional & editable)
Note:

 The optional value (description) is editable.

User Interface  

When creating (and modifying) the algorithm, the following popup will display:

[Figure: Tokenization Algorithm Popup]

Considerations   

There are some considerations when using this algorithm:

  • Low Memory Requirements 
  • The algorithm has two modes
  • Caesar Cipher is not strong 

Low Memory Requirements   

The memory requirements are very small. If a memory-related error occurs, it is usually because too much memory has been allocated.

The algorithm has two modes   

The mode depends on the number of characters defined for the masked column. This can cause issues in the following cases:

  • The masked result depends on the number of characters defined for the column.
    • If the varchar column is too short for the BASE 64 output (fewer than 24 characters for the shortest strings), Caesar will be used.
    • If the BASE 64 output fits in the varchar column, BASE 64 will be used.
    • For varchar, the thresholds for Caesar are:
      • String length 1-15  - Caesar if Data Type length < 24
      • String length 16-31 - Caesar if Data Type length < 44
      • String length 32-47 - Caesar if Data Type length < 64
      • String length 48-63 - Caesar if Data Type length < 88
      • ...
  • When the Data Type is char or nchar
    • The algorithm will always use Caesar Cipher because of the trailing spaces stored with the data. If this causes an issue, the data might need to be trimmed; see Algorithms - Casting Values before Masking.
  • When the Data Type is text
    • The algorithm will always use BASE 64, and no workaround is available.
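
The varchar thresholds listed above follow a regular pattern, which can be sketched in Python. This is a minimal sketch, not the product's implementation: the formula is inferred from the listed thresholds, which are consistent with padding the input to the next 16-byte block and then Base64-encoding the result. The 16-byte block size, the names `token_length` and `expected_mode`, and the one-byte-per-character assumption are all hypothetical.

```python
BLOCK_SIZE = 16  # assumed block size in bytes (inferred from the thresholds, not documented)


def token_length(input_length: int) -> int:
    """Estimate the BASE-64 token length for an input of single-byte characters."""
    # Assumption: padding always adds at least one byte, so the data grows
    # to the next full block (15 -> 16, 16 -> 32, 32 -> 48, ...).
    padded_bytes = (input_length // BLOCK_SIZE + 1) * BLOCK_SIZE
    # Base64 emits 4 output characters for every 3 input bytes, rounded up.
    return 4 * ((padded_bytes + 2) // 3)


def expected_mode(value: str, column_length: int) -> str:
    """Return the mode the listed thresholds predict for a value in a column."""
    if token_length(len(value)) <= column_length:
        return "BASE-64"
    return "Caesar"


if __name__ == "__main__":
    print(expected_mode("Jack", 30))            # token estimate 24 fits in 30
    print(expected_mode("Jack".ljust(30), 30))  # CHAR-style padding to 30 characters
```

Under these assumptions, `expected_mode("Jack", 30)` gives BASE-64, while the same value padded to 30 characters (as a CHAR(30) column would store it) gives Caesar. The estimated token is always longer than its input, so a fully padded value can never fit, which matches the char and nchar behaviour described above.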

Caesar Cipher is not strong  

Caesar Cipher should only be used on non-sensitive data, as this mode is very weak and easy to reverse engineer.

Example  

Below is an example of masked data. The example shows both the short and the long output.

  • Masked Short:
    • Caesar Cipher
    • The output when the tokenized data does not fit within the number of characters in the column.
  • Masked Long:
    • BASE-64 Encoding
    • The output when the tokenized data fits within the number of characters in the column.
+-------------+--------------+--------------------------+
| Input       | Masked Short | Masked Long              |
+=============+==============+==========================+
| NULL        | NULL         | NULL                     |
+-------------+--------------+--------------------------+
|             |              |                          |
+-------------+--------------+--------------------------+
| 1           | 0            | JIEaxAUiS07BJ1b83VK67A== |
|  1          |  0           | RVpBr7jwkzBrsbylJ8RsqA== |  << Leading space
| 2           | 1            | DtEpeeQPGlUiWyTFtX/9Pc== |
| 3           | 2            | Tt0DD/xi8T7/oNQM4zngHe== |
+-------------+--------------+--------------------------+
| 1000        | 999          | Bx9Q08KjWJOjmLRxctjwjc== |
| 1000000     | 999999       | tzwwW7oCpWH4ugQ/0+JG3e== |
+-------------+--------------+--------------------------+
| Anders      | Ylnsfk       | QDWGg040FuM4x3wXI+LZYA== |
| Jack        | Ryiw         | 4rceRCHDfLrgykiUtF+IXc== |
| JACK        | RYIW         | zyXPE8U2j4vAVWuyjiAyNY== |  << Different case
| Mask Data   | Gykw Nypy    | lrKFSutznXgmSGMjfyjLEe== |  
| Mask (Data) | Gykw (Nypy)  | OOieiOzmVfmpjg7/a0iOpA== |  << Special characters
| 0A1b2C      | 9Y0d1I       | AtdPaY3mlys7hVXA/QFEMc== |
+-------------+--------------+--------------------------+
| άλφα        | hehg         | Nf9hicuZwzXlJJ+2agmW+Y== |  << Same result
| βήτα        | hehg         | Nf9hicuZwzXlJJ+2agmW+Y== |  <<  "
| ovol        | hehg         | Nf9hicuZwzXlJJ+2agmW+Y== |  <<  "
| глаголь       | gurgjub      | Lh30YlrZItxBeDHASHn5tA== |
| quyến       | auoel        | bBVHAPX1slfcVz+IdqmtuY== |
+-------------+--------------+--------------------------+
| -           | -            | WrvOd+8a90lUniLNvjzKOY== |  << Caesar pass through
| !@#$%       | !@#$%        | RAEP34kIsLJMNzI8T0O3lY== |  <<  "
| こんばんは    | こんばんは    | 4RbopLsq2CVHll0cMhtxYA== |  <<  "
+-------------+--------------+--------------------------+

Text File Example  

The examples below show the Tokenization algorithm on CSV delimited and Fixed Width text files. Note that with CSV we will always get BASE-64 Encoding.

Input

1234,1234
   4,   4
1  4,1  4
 2 4, 2 4
12 4,12 4
 234, 234

CSV delimited

1234,fLfQlqvzuWPdpH6POQVhDM==
   4,2noZSlGxak1rgQcOpN3Ywc==
1  4,RYZbPyVHtylBlPkyiHNyZM==
 2 4,JVZDzDeukSa7brDr6dCFik==
12 4,IpChIhAL4qSKxgN0gP/H8M==
 234,ze5VXnwxMHw6RbssrqIelU==

Fixed Width

1234,8901
   4,   1
1  4,8  1
 2 4, 9 1
12 4,89 1
 234, 901
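
The Fixed Width output above is consistent with rotating each digit by 7 (modulo 10) while leaving spaces untouched. The sketch below illustrates that kind of fixed-shift substitution and how easily it is reversed. It is an illustration only: the digit shift of 7 is inferred from this one example, the letter shift of 3 and the names `caesar` and `uncaesar` are hypothetical, and the Caesar output in the main example table does not follow a single fixed shift (for instance, A becomes Y in "Anders" but J becomes R in "Jack"), so this code will not reproduce that table.

```python
import string

DIGIT_SHIFT = 7   # inferred from the Fixed Width example (1234 -> 8901)
LETTER_SHIFT = 3  # hypothetical; the product's letter mapping is not documented


def _rotate(ch: str, alphabet: str, shift: int) -> str:
    """Move ch forward by `shift` positions within `alphabet`, wrapping around."""
    return alphabet[(alphabet.index(ch) + shift) % len(alphabet)]


def caesar(text: str, digit_shift: int = DIGIT_SHIFT,
           letter_shift: int = LETTER_SHIFT) -> str:
    """Shift ASCII digits and letters; retain case and pass everything else through."""
    out = []
    for ch in text:
        if ch in string.digits:
            out.append(_rotate(ch, string.digits, digit_shift))
        elif ch in string.ascii_lowercase:
            out.append(_rotate(ch, string.ascii_lowercase, letter_shift))
        elif ch in string.ascii_uppercase:
            out.append(_rotate(ch, string.ascii_uppercase, letter_shift))
        else:
            out.append(ch)  # spaces, punctuation, and non-ASCII are unchanged
    return "".join(out)


def uncaesar(text: str, digit_shift: int = DIGIT_SHIFT,
             letter_shift: int = LETTER_SHIFT) -> str:
    """Reverse caesar() by applying the opposite shifts."""
    return caesar(text, -digit_shift, -letter_shift)


if __name__ == "__main__":
    for value in ["1234", "   4", "1  4", " 2 4", "12 4", " 234"]:
        print(value, "->", caesar(value))
```

Because anyone who knows or guesses the shift can call the inverse, `uncaesar(caesar(x))` returns `x` for every input, which is why this mode is rated Weak.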

Common Errors  

Caesar is not tokenizing (masking) characters outside the alphabetic characters  

The Caesar Cipher does not tokenize characters that cannot be converted to US-ASCII. For example, special characters and Japanese characters pass through un-tokenized (unmasked).

Algorithm is not unique for characters outside the US-ASCII set  

Tokenized words in Russian will be Re-Identified as US-ASCII characters (see the example above). This applies to characters that can be converted to an equivalent US-ASCII character, e.g. Ü and U.

For example, Caesar masks these three strings to the same result:

ovol >> hehg
άλφα >> hehg
βήτα >> hehg
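
The Ü and U case can be illustrated with a short Python sketch. How the product converts characters to US-ASCII is not documented here, so this assumes one common approach, Unicode NFKD decomposition followed by dropping anything that is not ASCII; the name `fold_to_ascii` is hypothetical. It does not explain the Greek collision shown above; it only shows how two distinct inputs can become identical before they are masked.

```python
import unicodedata


def fold_to_ascii(text: str) -> str:
    """Strip accents by decomposing characters and discarding non-ASCII parts."""
    # NFKD splits an accented letter into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding with errors="ignore" drops the combining marks (and any other
    # character that has no ASCII form).
    return decomposed.encode("ascii", "ignore").decode("ascii")


if __name__ == "__main__":
    print(fold_to_ascii("\u00dc") == fold_to_ascii("U"))  # U-umlaut vs plain U
```

Once two inputs fold to the same ASCII string, any cipher applied afterwards produces the same output for both, and Re-Identification can only return the ASCII form.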

Data Type CHAR causes the use of Caesar Cipher  

Data stored in a column of Data Type CHAR (or NCHAR) will be padded with spaces and cause Caesar to be used all the time.

Resolution: To resolve this, cast (trim) the value before masking. See Algorithms - Casting Values before Masking.
