Algorithm: Tokenization in Masking (KBA1002)
Overview
The Tokenization algorithm is normally used in Tokenization/Re-Identification though it has frequently been seen to be used to mask data. This article describes how the Tokenization algorithm works in the Masking Engine.
At a Glance
Description: | By default, this is an algorithm used with Tokenization - Re-Identification. This algorithm is frequently used as a Masking Algorithm. The algorithm has two modes - if the tokenized string is longer than the specified Data Type for the field, the algorithm will switch and use mode 2 (Caesar Cipher):
|
||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Characteristics: |
1 Referential Integrity - The masked value will be the same between job executions as well as tables. |
||||||||||||
Character Encodings: |
In: Depending on Mode
Out, mode 2: Caesar Cipher
|
||||||||||||
Lookup Pool Size: | None | ||||||||||||
Out-of-the-Box: | The Out-of-the-Box algorithms are:
|
||||||||||||
Limitations: |
The algorithm can change behavior depending on the masked data length. Caesar Cipher does not work on extended characters encoding such as Japanese nor on binary data. Caesar Cipher is reversible. The algorithm can only be selected on the Inventory page. |
Creating Algorithm
To create this algorithm, there is only one field that needs to be specified - Algorithm Name.
Fields are:
- Properties
- Algorithm Name
- Description (optional & editable)
Considerations
There are some considerations when using this algorithm:
- Low Memory Requirements
- The algorithm having two modes
- Caesar Cipher is not strong
Low Memory Requirements
The memory requirements are very small. If there is an error, it is usually related to too much memory allocated.
Algorithm Having Two Modes
The mode depends on the number of characters in the masked column. This can cause issues:
- The masked result is dependent on the number of characters defined for the column.
- If the length of the Data Type can NOT hold the length of the BASE-64-AES-128 encryption, then Caesar will be used.
- For varchar the thresholds for Caesar are:
- String length 1-15 - Caesar if Data Type < 24
- String length 16-31 - Caesar if Data Type < 44
- String length 32-47 - Caesar if Data Type < 64
- String length 48-63 - Caesar if Data Type < 88
- ...
- When the Data Type is char and nchar
- This will cause the algorithm to always use Caesar Cipher due to the trailing spaces stored with the algorithm. Data might need to be trimmed if this is causing an issue: Algorithms - Casting Values before Masking
- When the Data Type is text
- This will cause the algorithm to always use BASE 64-AES-128 encryption and no workaround is available.
Caesar Cipher is Not Strong
Caesar Cipher should only be used on insensitive data as the masking algorithm is very weak and easy to reverse engineer.
Examples
Database Example
Below is an example of masked data. The example shows both the long and short response.
- Masked Short:
- Caesar Cipher
- The output if the Tokenized data is not fitting the number of character in the column.
- Masked Long:
- BASE-64-AES-128 encryption Encoding
- The output if the Tokenized data fits within the number of characters in the column.
+-------------+--------------+--------------------------+ | Input | Masked Short | Masked Long | +=============+==============+==========================+ | NULL | NULL | NULL | +-------------+--------------+--------------------------+ | | | | +-------------+--------------+--------------------------+ | 1 | 0 | JIEaxAUiS07BJ1b83VK67A== | | 1 | 0 | RVpBr7jwkzBrsbylJ8RsqA== | << Leading space | 2 | 1 | DtEpeeQPGlUiWyTFtX/9Pc== | | 3 | 2 | Tt0DD/xi8T7/oNQM4zngHe== | +-------------+--------------+--------------------------+ | 1000 | 999 | Bx9Q08KjWJOjmLRxctjwjc== | | 1000000 | 999999 | tzwwW7oCpWH4ugQ/0+JG3e== | +-------------+--------------+--------------------------+ | Anders | Ylnsfk | QDWGg040FuM4x3wXI+LZYA== | | Jack | Ryiw | 4rceRCHDfLrgykiUtF+IXc== | | JACK | RYIW | zyXPE8U2j4vAVWuyjiAyNY== | << Different case | Mask Data | Gykw Nypy | lrKFSutznXgmSGMjfyjLEe== | | Mask (Data) | Gykw (Nypy) | OOieiOzmVfmpjg7/a0iOpA== | << Special characters | 0A1b2C | 9Y0d1I | AtdPaY3mlys7hVXA/QFEMc== | +-------------+--------------+--------------------------+ | άλφα | hehg | Nf9hicuZwzXlJJ+2agmW+Y== | << Same result | βήτα | hehg | Nf9hicuZwzXlJJ+2agmW+Y== | << " | ovol | hehg | Nf9hicuZwzXlJJ+2agmW+Y== | << " | глаголь | gurgjub | Lh30YlrZItxBeDHASHn5tA== | | quyến | auoel | bBVHAPX1slfcVz+IdqmtuY== | +-------------+--------------+--------------------------+ | - | - | WrvOd+8a90lUniLNvjzKOY== | << Caesar pass through | !@#$% | !@#$% | RAEP34kIsLJMNzI8T0O3lY== | << " | こんばんは | こんばんは | 4RbopLsq2CVHll0cMhtxYA== | << " +-------------+--------------+--------------------------+
Text File Example
The examples below show the Tokenization algorithm on CSV delimited and Fixed Width text files. Note that with CSV we will always get BASE-64 Encoding.
Input 1234,1234 4, 4 1 4,1 4 2 4, 2 4 12 4,12 4 234, 234 |
CSV delimited 1234,fLfQlqvzuWPdpH6POQVhDM== 4,2noZSlGxak1rgQcOpN3Ywc== 1 4,RYZbPyVHtylBlPkyiHNyZM== 2 4,JVZDzDeukSa7brDr6dCFik== 12 4,IpChIhAL4qSKxgN0gP/H8M== 234,ze5VXnwxMHw6RbssrqIelU== |
Fixed Width 1234,8901 4, 1 1 4,8 1 2 4, 9 1 12 4,89 1 234, 901 |
Common Errors
Caesar is not tokenizing (masking) characters outside the alphabetic characters
The Caesar Cipher is not tokenizing characters not convertible to a US ASCII. As examples, special characters and Japanese characters will pass through un-tokenized (unmasked).
Algorithm is not unique for characters outside the US-ASCII set
Tokenized words in Russian will be Re-Identified as US-ASCII characters. See example above. This is for characters that can be converted to an equivalent version in the US ASCII - i.e. Ü and U.
Example, these three strings will be masked using Caesar to the same result:
ovol >> hehg άλφα >> hehg βήτα >> hehg |
Data Type CHAR causes the use of Caesar Cipher
Data stored in a column of Data Type CHAR (or NCHAR) will be padded with spaces and cause Caesar to be used all the time.
Resolution: To resolve this - cast (trim) the masked value before. Algorithms - Casting Values before Masking
Related Articles
The following articles may provide more information or related information to this article.
Knowledge Base Article
- Algorithms - Casting Values before Masking
- Best Practice: Job Configuration on the Masking Engine
- Algorithm: Tokenization Re-Identification (KBA1527)
Documentation
- Doc 5.3: Tokenization Algorithm Framework