Delimited File Masking and EOR Enclosures Escapes (KBA8928)
KBA
KBA# 8928At a Glance
Description: | This KBA describes Delimited File Masking configurations. Especially characters (printable and unprintable) for End of Record and Delimiters and Enclosure and Escape strategies. Note: Incorrect configuration, might cause incorrect masked output and formatting. |
||
---|---|---|---|
End of Record: | The End of Record configures the character(s) that defines the end of the record.
|
||
Delimiter: | The Delimiter configures the character(s) that defines the delimitation of values in a record.
The most common delimiter is Comma ( , ) but can be any character.
|
||
Text Enclosure: | The Text Enclosure configures the character(s) that encloses a value.
The most common text enclosure is Double Quotes ( " ) but can be any character.
|
||
Enclosure Escape: | The Escape Enclosure strategy configures how to escape the text enclosure.
|
||
Escape EEC: | The Escape EEC (Escape Enclosure Character) is a flag that enables escaping the escape character that escapes the enclosure.
|
||
More info: | More information about this can be found in the Masking Docs: |
Applicable Delphix Versions
- Click here to view the versions of the Delphix engine to which this article applies
-
Major Release All Sub Releases 6.0 6.0.0.0, 6.0.1.0, 6.0.1.1, 6.0.2.0, 6.0.2.1, 6.0.3.0, 6.0.3.1, 6.0.4.0, 6.0.4.1, 6.0.4.2, 6.0.5.0, 6.0.6.0, 6.0.6.1, 6.0.7.0, 6.0.8.0, 6.0.8.1, 6.0.9.0, 6.0.10.0, 6.0.10.1, 6.0.11.0, 6.0.12.0, 6.0.12.1, 6.0.13.0 5.3
5.3.0.0, 5.3.0.1, 5.3.0.2, 5.3.0.3, 5.3.1.0, 5.3.1.1, 5.3.1.2, 5.3.2.0, 5.3.3.0, 5.3.3.1, 5.3.4.0, 5.3.5.0, 5.3.6.0, 5.3.7.0, 5.3.7.1, 5.3.8.0, 5.3.8.1, 5.3.9.0 5.2
5.2.2.0, 5.2.2.1, 5.2.3.0, 5.2.4.0, 5.2.5.0, 5.2.5.1, 5.2.6.0, 5.2.6.1
5.1
5.1.0.0, 5.1.1.0, 5.1.2.0, 5.1.3.0, 5.1.4.0, 5.1.5.0, 5.1.5.1, 5.1.6.0, 5.1.7.0, 5.1.8.0, 5.1.8.1, 5.1.9.0, 5.1.10.0
5.0
5.0.1.0, 5.0.1.1, 5.0.2.0, 5.0.2.1, 5.0.2.2, 5.0.2.3, 5.0.3.0, 5.0.3.1, 5.0.4.0, 5.0.4.1, 5.0.5.0, 5.0.5.1, 5.0.5.2, 5.0.5.3, 5.0.5.4
Delimited File Configuration
This section describes the different configurations in Delimited File Masking, such as EOR, delimiter, enclosures, and escape characters (strategies).
UI Example
The Format is configured in the Rule Set and to access Special Characters - use the CTRL buttons shown below:
End of Record (EOR)
This defines the character (or sequence of characters) used to define the End of Record.
Option | Hex Code | Comment |
Select one of:
|
Use Hex:
|
This sets the EOR:
|
Custom End of Record
The EOR can be custom and be any character or a sequence of characters (printable or unprintable).
For unprintable characters please see the table in the section below.
An example of a printable EOR sequence is: #n
This means that the two characters ('#' and 'n') will be the EOR.
This will not look like many expect - below is a comma separated example (3 columns, 3 rows):
R1C1_Val,R1C2_Val,R1C3_Val#nR2C1_Val,R2C2_Val,R2C3_Val#nR3C1_Val,R3C2_Val,R3C3_Val
Common error \n or \r\n
A common error is to define '\n' or '\r\n' as the EOR. When '\n' is defined as the EOR this means that the two characters '\' and 'n' are used as an EOR (not LF (0x0A) as commonly thought).
To define LF (0x0a) please use the 'LF terminated (Unix)' or a Custom Control Character as listed in the table below.
Unprintable Control Characters
An unprintable control characters is a characters/code (defined in ASCII) that does not represent a printable symbol. For the EOR, this is usually CR+LF or LF.
The table shows the Ctrl Code (also used in the UI), the Hex Value, a Description, and a Comment.
UI Symbol | Ctrl Character | Description | Comment |
^@ [NUL] ^A [SOH] ^B [STX] ^C [ETX] ^D [EOT] ^E [ENQ] ^F [ACK] ^G [BEL] ^H [BS] ^I [HT] ^J [LF] ^K [VT] ^L [FF] ^M [CR] ^N [SO] ^O [SI] ^P [DLE] ^Q [DC1] ^R [DC2] ^S [DC3] ^T [DC4] ^U [NAK] ^V [SYN] ^W [ETB] ^X [CAN] ^Y [EM] ^Z [SUB] ^[ [ESC] ^\ [FS] ^] [GS] ^^ [RS] ^_ [US] |
$[00] $[01] $[02] $[03] $[04] $[05] $[06] $[07] $[08] $[09] $[0A] $[0B] $[0C] $[0D] $[0E] $[0F] $[10] $[11] $[12] $[13] $[14] $[15] $[16] $[17] $[18] $[19] $[1A] $[1B] $[1C] $[1D] $[1E] $[1F] |
NULL (character) START OF HEADING START OF TEXT END OF TEXT END OF TRANSMISSION ENQUIRY ACKNOWLEDGE BELL (beep) BACKSPACE HORIZONTAL TAB LINE FEED VERTICAL TAB FORM FEED CARRIAGE RETURN SHIFT OUT SHIFT IN DATA LINK ESCAPE DEVICE CONTROL 1 (XON) DEVICE CONTROL 2 DEVICE CONTROL 3 DEVICE CONTROL 4 (XOFF) NEGATIVE ACKNOWLEDGE SYNCHRONOUS IDLE END OF TRANSMISSION BLOCK CANCEL END OF MEDIUM SUBSTITUTE ESCAPE FILE SEPARATOR GROUP SEPARATOR RECORD SEPARATOR UNIT SEPARATOR |
HT = TAB (used in tab-delimited file) LF = Windows EOR (2 of 2). Unix EOR. CR = Windows EOR (1 of 2). |
Delimiter
This defines the character (or characters) used to define the delimiter. The delimiter can be any character or a sequence of characters (printable or unprintable).
File Type | Delimiter | Comment |
Common delimited files are:
|
Use:
|
Notes:
|
Common error: '^t'
A common error is to define '^t' as the tab character. This will in fact define two characters ('^' and 't') as the delimiter.
To define TAB, click the 'Ctrl' and select Horizontal Tab (as listed in the table above).
Unprintable Control Characters
To define unprintable control characters as the delimiter, click the Ctrl button and select the character. These are the same control characters as in the EOR, see above for a list of Control Characters.
Text Enclosure
If a value (string) in the delimited file contains a delimiter or EOR character, what should you do?
This is when the value (string) is enclosed with an enclosure. This enclosure is also configurable on the engine.
Example
The example shows a CSV file with a value ('Jane,Smith') that contains a comma. Note the 3rd example where the value contains a newline, CR+LF (Windows) or LF (Unix).
ID,Full_name 1,John Smith 2,Jane,Smith 3,Scott Smith
Solution
The solution is the enclose the strings with an enclosure (here " ). Note that the enclosure is optional.
ID,Full_name 1,"John Smith" 2,"Jane,Smith" 3,"Scott Smith"
Enclosure Escape Strategy
What if the enclosed value does include the enclosure character?
There is a solution for this, too. It is called enclosure escape strategy.
Escape Strategy | Comment |
Two options:
|
To enable the engine to ignore an enclosure in a value:
|
Example
Following the example above but using a single quote as the enclosure. The surname is now O'connell.
ID,Full_name 1,John O'connell 2,Jane,O'connell 3,Scott O'connell
Solution - Double Enclosure
The solution here is to escape the enclosure by doubling it (writing it twice).
ID,Full_name 1,'John O''connell' 2,'Jane,O''connell' 3,'Scott O''connell'
Solution - Custom
The solution used here is to escape with a character. The character used is / .
ID,Full_name 1,'John O/'connell' 2,'Jane,O/'connell' 3,'Scott O/'connell'
Escape "Enclosure Escape Character"
What if we need to escape the escape character that is used to escape the enclosure? There is a solution for that too.
Escape EEC | Comment |
Only one option:
|
To enable the engine to ignore an enclosure in a value:
|
Example
Following the example above but using using an added ' / ' in one of the values.
ID,Full_name 1,John /O'connell 2,Jane,/O'connell 3,Scott / O'connell
Solution - Escape EEC
In the example below - we have now:
- Enclosed to value with '
- Escaped the enclosure with /
- Escaped the Escape Enclosure Character by doubling the /.
ID,Full_name 1,'John //O/'connell' 2,'Jane,//O/'connell' 3,'Scott // O/'connell'
How to detect
Detecting errors in the input file can be tricky. If the number of masked rows doesn't match the known number of rows in the file this indicates that there is an issue with EOR, enclosures, and escaping.
Some methods:
- The first call is to check the masked row counts and compare that with the known number of rows in the source file.
- Open the file and check (this can be hard on large files)
- Use a CSV validation tool (search on Google) or import the data into Excel.
Use Bash
It is also possible to use bash to quickly scan the data. Below is an example of a basic bash command to check delimiters and enclosures.
Notes:
- This example is not a complete tool and should be used as a starting point.
- It uses variables to make the code easier to read.
- It will not show which row.
How it works:
- The script uses sed to remove the correct delimiters, then counts the number of enclosures, and then removes any row that isn't 2 (starting and ending enclosure).
-- Set variables for delimiter and enclosure (these need to be escaped) del=, encl=\" file=yourfile.txt -- If this returns any rows - it indicates that there might be issues with the data in the file cat "$file" | sed "s/$encl$del$encl//g" | sed "s/[^($encl)]//g" | awk '{ print length }' | grep -v "2"