Skip to main content
Delphix

Masking and Tokenization and Character Sets (KBA8651)

 

 

KBA

KBA# 8651

At a Glance

Description:  This KBA describes possible issues and errors when data is masked to a string with characters that are not valid due to the character set in the file or database column storing the masked value.

This issue can be hard to detect and it can also be silent (as with Oracle and MS SQL Server). 
Applicable Versions: This KBA applies to all known Continuous Compliance Engine versions. The issue is not bound to any version but to how character sets are handled.
Issue 1: Error 1 Character conversion errors.
Issue 2: Error 2  Masked value too large for the column. 
Issue 3: Silent Oracle and SQL Server will convert some characters automatically. This might cause issues with uniqueness when masking and with Re-Identify when using Tokenization. 
Algorithms: All new algorithms (new Framework) are UTF-8 internally on the Continuous Compliance Engine. 
Resolution: Ensure that the Characters defined in the Algorithm are valid for the Character Set on the masked database column or masked file. If not - remove unsupported characters from the algorithm.
Recommendation: It is recommended to create your own algorithm using the built-in framework. The algorithm will then be created based on your specific requirements. The out of the box (OOTB) algorithms are created to demonstrate capabilities. 

For example, the built-in Character Mapping (dlpx-core:CM Alpha-Numeric) has a large set of extended Unicode characters which might not work on all character sets.

 

Applicable Delphix Versions

Click here to view the versions of the Delphix engine to which this article applies
Major Release All Sub Releases
6.0 6.0.0.0, 6.0.1.0, 6.0.1.1, 6.0.2.0, 6.0.2.1, 6.0.3.0, 6.0.3.1, 6.0.4.0, 6.0.4.1, 6.0.4.2, 6.0.5.0, 6.0.6.0, 6.0.6.1, 6.0.7.0, 6.0.8.0, 6.0.8.1, 6.0.9.0, 6.0.10.0, 6.0.10.1, 6.0.11.0, 6.0.12.0, 6.0.12.1, 6.0.13.0, 6.0.13.1, 6.0.14.0, 6.0.15.0

5.3

5.3.0.0, 5.3.0.1, 5.3.0.2, 5.3.0.3, 5.3.1.0, 5.3.1.1, 5.3.1.2, 5.3.2.0, 5.3.3.0, 5.3.3.1, 5.3.4.0, 5.3.5.0, 5.3.6.0, 5.3.7.0, 5.3.7.1, 5.3.8.0, 5.3.8.1, 5.3.9.0

5.2

5.2.2.0, 5.2.2.1, 5.2.3.0, 5.2.4.0, 5.2.5.0, 5.2.5.1, 5.2.6.0, 5.2.6.1

5.1

5.1.0.0, 5.1.1.0, 5.1.2.0, 5.1.3.0, 5.1.4.0, 5.1.5.0, 5.1.5.1, 5.1.6.0, 5.1.7.0, 5.1.8.0, 5.1.8.1, 5.1.9.0, 5.1.10.0

5.0

5.0.1.0, 5.0.1.1, 5.0.2.0, 5.0.2.1, 5.0.2.2, 5.0.2.3, 5.0.3.0, 5.0.3.1, 5.0.4.0, 5.0.4.1, 5.0.5.0, 5.0.5.1, 5.0.5.2, 5.0.5.3, 5.0.5.4

Resolution 

The solution to these issues is changing the algorithm so the algorithm masks to a supported set of characters.

Steps: 

  1. Go to the database and get the Character Set.
  2. Go to the algorithm with the issue and:
    • Remove any unsupported characters; or
    • Recreate the Algorithm ensuring lookup values do not have unsupported characters; or
    • For a Custom Algorithm, please contact the author.

Note

Note:

The Algorithm Key might provide different results on different settings. Changing the algorithm key is NOT a solution to this issue, as it doesn't address the root cause but changes the probability.

 

Root cause

The root cause here is that the masked value contains a character that is not supported. This can come from Character Mappings that includes a character in the Character Group that is not supported. It can also be an unsupported character in a Secure Lookup or Mapping Algorithm.

Since different locals accept different characters the only solution is to create an algorithm that is correct for a specific local. Even US names are not immune from this issue where some names contain unsupported values. To see the extent of this - Google US town names that have diacritics.

Issues

There are three different types of issues identified related to the character set.

  1. Character conversion errors - the masked value cannot be converted into the database/file character set.
  2. Silent conversion - the unsupported character is converted into a supported character which causes referential integrity or re-identification errors. 
  3. Value too long errors - the masked value gets an extra character and the value is too long for the field/column.

Below are some examples of errors. See above for Resolution.

Conversion errors

Sybase is known to throw an error if a character is not matching the character set.

SAP ASE (Sybase)

On Sybase a character conversion issue usually results in the following error: 

[JOB_ID_xx_yy] - DelphixTableUpdate.0 - JZ0BE: BatchUpdateException: Error occurred while executing batch statement: Error converting characters into server's character set. Some character(s) could not be converted.
...
[JOB_ID_xx_yy] - DelphixTableUpdate.0 - Next Exception: SQLState( ZZZZZ) ErrorCode(2402)

Silent conversion

Oracle and SQL Server are known to silently convert the unsupported character to a similar character, a '?', or similar. 

This might mask a value in a way that causes duplicates, referential integrity errors, re-identification errors, or other. The main issue here is that it is silent and the error/issue might be identified long after the data is masked. 

Oracle 

For example (this depends on LNS settings), Oracle might convert the following:

Ū > U
₨ > ?

SQL Server

For SQL Server the following is an example (once again - this depends on code page settings): 

Ū > U
₨ > ?

Truncation errors

In some instances, the new masked character can be converted and contains extra characters (up to 3 characters and 9 bytes have been noted on Oracle). If the data type is exactly defined for the value, this could cause the value to be too large for the column (as seen in the examples below). 

Oracle

[JOB_ID_xx_yy] - DelphixTableUpdate.0 - ORA-12899: value too large for column "..." (actual: x, maximum: y)

DB2

[JOB_ID_xx_yy] - DelphixTableUpdate.0 - Next Exception: SQLState( 22001) ErrorCode(-404)

Troubleshooting

To troubleshoot this issue, begin with the following: 

  1. The column Data Type.
  2. The Character Set.
  3. Check if there are any errors.
  4. Check all characters in the algorithm:
    • See below for known characters in out of the box (OOTB) algorithms.
    • Check characters in the character mapping algorithms.
    • Check source file for Mapping and Secure Lookup algorithms.
    • Contact the author of Custom Algorithms.

Notes 

Oracle

The following statement is found on Oracle's docs site: 

Because Java works internally in Unicode, the client character set is always set to Unicode. 

Source: https://docs.oracle.com/en/database/...nnections.html

Continuous Compliance Engine

Alpha-Numeric Character Mapping OOTB

Alpha-Numeric Character Mapping has the following extended characters defined in the OOTB algorithm: 

àáâãäåæèéêëçìíîïñðšòóôõöőùúûüűýÿžÀÁÂÃÄÅÆÈÉÊËÇÌÍÎÏÑЊÒÓÔÕÖŐÙÚÛÜŰÝŸŽ

 

 


Related Articles

The following articles may provide more information or related information to this article: