File Masking and Enclosure and Escape Characters (KBA8555)
KBA
KBA# 8555At a Glance
Versions: | Applicable Delphix Masking versions: 6.0.3 and onwards. |
---|---|
Enclosure: | If a value needs to be enclosed, the enclosure can be single or multiple characters. The Masking Engine can handle multiple enclosures per field. The normal enclosure is a double quote (" ). The Masking Engine cannot use different characters for opening and closing.
|
Escape Character: | There are two options to escape the Enclosure:
|
Number of Fields: | The output will always have the number of fields defined in the File Format (or File Formats if different Record Types are used). |
Important: | If values are enclosed incorrectly, it is not possible to detect the separators or End of Line (EOL) characters, as these are enclosed. Therefore, it is important to enclose and escape correctly. |
Possible Errors: | Errors with enclosures and escape characters can only be seen in the masked data. On some occasions, the engine can run out of memory trying to find the closing enclosure. java.lang.OutOfMemoryError: Requested array size exceeds VM limit |
Applicable Delphix Versions
- Click here to view the versions of the Delphix engine to which this article applies
-
Major Release All Sub Releases 6.0 6.0.3.0, 6.0.3.1, 6.0.4.0, 6.0.4.1, 6.0.4.2, 6.0.5.0, 6.0.6.0, 6.0.6.1, 6.0.7.0, 6.0.8.0, 6.0.8.1, 6.0.9.0, 6.0.10.0, 6.0.10.1, 6.0.11.0, 6.0.12.0, 6.0.12.1, 6.0.13.0, 6.0.13.1, 6.0.14.0, 6.0.15.0, 6.0.16.0
Using Enclosures and Escaping
Data stored in a delimited plain text file requires special characters to be used as the delimiter and the end-of-record character. The delimiter is usually one character (commonly comma, TAB, pipe, tilde, etc) but can also be multiple characters. For the end-of-record character, this is usually LF for Unix and CR+LF for Windows but can also be custom.
The special thing about these are:
- If these characters appear inside a value in a field, the value needs to be enclosed.
- If the enclosure appears inside a value being enclosed, that enclosure character needs to be escaped.
Possible issues
The possible issues are that there is a delimiter or end-of-record character in the field. This will cause the data to not line up and the wrong data is masked.
Important to Enclose and Escape Correctly
If the delimited plain text file is not using enclose and escape correctly - the following can happen:
- The data will be misaligned which will cause:
- Data loss
- PII leakage
- The engine may crash the job due to:
- Processing errors
- Out of memory
None of the issues above can be fixed by more memory or tweaks. The only solution is the have the correct enclosure and escaping strategy - this strategy needs to be defined where the data is generated and it then needs to match in the masking job.
Enclosure Strategies
The key things here are:
- If enclosures are not used, then any characters that encloses a value will be ignored.
- If enclosures are used, then there needs to be an opening enclosure and a matching closing enclosure.
Examples of Correct and Incorrect Enclosure
Correct | Incorrect |
The following enclosures are valid and are enclosing the value in the field. Enclosures are not needed even if they are defined. ...,abcd,... Opening and closing enclosure used around the whole value: ...,"abcd",... Inside the enclosure there can be a separator: ...,"ab,cd",... or EOL character: ...,"ab cd",... There can be multiple enclosed values in a field: ...,"a"bc"d",... The enclosure can enclose part of a value: ...,"abc"d,... |
If an enclosure has been defined, the enclosure needs to match with opening and a matching closing enclosure. This will be treated as an opening enclosure: ...,ab"cd,... This will be treated as opening, closing, and opening: ...,"ab"cd",... The EOL is outside the enclosure and seen as two records. ...,"ab" "cd",... The EOL is outside the enclosure and the second record has no closing enclosure. ...,"ab" cd",... |
Escape Strategies
There are two different Escape Strategies to ignore the enclosure character:
- Double Enclosure (the enclosure character is duplicated if it is escaped).
- Escape the enclosure character.
Examples of Correct and Incorrect Double Enclosure
Correct | Incorrect |
The following enclosures are valid and are enclosing the value in the field. Enclosures are not needed even if they are defined. ...,abcd,... Opening and closing enclosure used around the whole value: ...,"abcd",... Inside the enclosure there can be a separator: ...,"ab,cd",... or EOL character: ...,"ab cd",... There can be multiple enclosed values in a field: ...,"a"bc"d",... The enclosure can enclose part of a value: ...,"abc"d,... |
The escape and the enclosure needs to match with opening and a matching closing enclosure. This will be treated as an opening enclosure: ...,ab"cd,... This will be treated as opening, closing, and opening: ...,"ab"cd",... The EOL is outside the enclosure. This will be seen as two records. ...,"ab" cd",...
|
Examples of Correct and Incorrect Escape Characters
Correct | Incorrect |
The following enclosures are valid and are enclosing the value in the field. Enclosures are not needed even if they are defined. ...,abcd,... Opening and closing enclosure used around the whole value: ...,ab\"cd,... Escape the enclosure inside the enclosed value: ...,"ab\"cd",... ...,"abcd\"",... or outside the enclosed value: ...,"ab"cd\",... |
The escape character, if defined will be applied to all occurrences of the escaped enclosure. A single enclosure will be seen as an opening enclosure: ...,ab"cd,... The two first are escaped and the last is opening: ...,\"ab\"cd",... The first and last are escaped and the middle is opening: ...,\"ab"cd\",...
|
Example of Memory Error
If the enclosure strategy is not correctly defined and aligned with the data, the data in the enclosure could contain so many characters that the job consumes all available memory -- crashing the job.
java.lang.OutOfMemoryError: Requested array size exceeds VM limit
Troubleshooting
The following sections provide troubleshooting guidance.
Linux/Unix
Finding errors in INFO.log
This listing can return many matches. The following command offers a way to filter on a specific job and execution.
grep -E "E=1|Masking Engine|Caused by|OutOfMemoryError|Job with execution.*submitted|Unable to load the job|WARNING|for job|Text File.*Finished" info.log
Finding errors in INFO.log for specific JOB
This limits the search for errors to a specific job and execution (change xx and yy to the job and execution as required).
grep -E "E=1|Masking Engine|Caused by|OutOfMemoryError|Job with execution.*submitted|Unable to load the job|WARNING|for job|Text File.*Finished" info.log | grep "JOB_ID_xx_yy"
Windows PowerShell
Finding errors in INFO.log
This listing can be long. The following command offers a way to filter on a specific job and execution.
Select-String "E=1|Masking Engine|Caused by|OutOfMemoryError|Job with execution.*submitted|WARNING|for job|Text File.*Finished" info.log
Finding errors in INFO.log for specific JOB
This limits the search for errors to a specific job and execution (change xx and yy to the job and execution as required).
Select-String "E=1|Masking Engine|Caused by|OutOfMemoryError|Job with execution.*submitted|Unable to load the job|WARNING|for job|Text File.*Finished" info.log | Select-String "JOB_ID_xx_yy"
Log Example
Below are some logs and descriptions (slightly modified for readability).
- INFO
- Indicates general activity such as Start Job (submitted) and Job Status.
- Indicates general activity such as Start Job (submitted) and Job Status.
- ERROR
- The Out Of Memory can be seen here.
- The Out Of Memory can be seen here.
- 'Caused by'
- 'Caused by' is always good to look for as it (usually) captures errors (in this case it is listed in the ERROR).
- 'Caused by' is always good to look for as it (usually) captures errors (in this case it is listed in the ERROR).
- 'Masking Engine Version'
- This is logged at the top of each info.log file and each time the engine was rebooted.
- This is a good way to check the version and if there is a second entry, it indicates an engine restart.
grep -E "E=1|Masking Engine|Caused by|OutOfMemoryError|Job with execution.*submitted|Unable to load the job|WARNING|for job|Text File.*Finished" info.log 2021-... Masking Engine Version: 6.0.x.x (...), ... ... 2021-... [...] INFO c.dmsuite.threads.JobExecutorService - Job with execution ID: yy submitted to thread pool 2021-... [...] INFO c.d.d.m.t.RunServiceImpl - [JOB_ID_xx_yy] 2021/... - Text File Input.0 - ERROR (...) : java.lang.OutOfMemoryError: Java heap space 2021-... [...] INFO c.d.d.m.t.RunServiceImpl - [JOB_ID_xx_yy] 2021/... - Text File Output.0 - Finished processing (I=0, O=20544753, R=20544753, W=20544753, U=0, E=0) 2021-... [...] INFO c.d.d.m.t.RunServiceImpl - [JOB_ID_xx_yy] 2021/... - Text File Input.0 - Finished processing (I=20544753, O=20544753, R=0, W=20544753, U=0, E=1) 2021-... [...] ERROR c.dmsuite.threads.JobExecutorService - [JOB_ID_xx_yy] - Execution yy for job xx failed Caused by: com.dmsuite.exception.JobFailedException: Job has failed due to error in running Kettle Caused by: com.dmsuite.exception.JobFailedException: Kettle returned a non zero exit value: 1
Related Articles
The following articles may provide more information or related information to this article: