Delimited File Masking and EOR Enclosures Escapes (KBA8928)
KBA
KBA# 8928At a Glance
Description | This article describes Delimited File Masking configurations, especially characters (printable and non-printable) for End of Record (EOR) and Delimiters and Enclosure and Escape strategies. Warning: Incorrect configuration might cause incorrect masked output and formatting. |
|||
---|---|---|---|---|
Configuration | These configurations need to be correctly set:
|
|||
End of Record | The End of Record (EOR) will likely differ between systems and needs to be defined. Note: A New Line in a field will create a new record (fields with New Lines need to be enclosed or escaped). |
|||
EOR | The End of Record configures the character(s) that defines the end of the record.
|
|||
Delimited Files | A Delimited File uses a character to separate fields. The delimiter needs to be defined. Note: A delimiter in a field will create a new field (values with a delimiter need to be enclosed or escaped). |
|||
Delimiter | The Delimiter configures the character(s) that defines the delimitation of values in a record.
The most common delimiter is Comma ( , ) but can be any character.
|
|||
Enclosure and Escape | If a value contains a delimiter or an EOR, it needs to be Enclosed or Escaped. | |||
Enclosure | The Text Enclosure configures the character(s) that encloses a value.
The most common text enclosure is Double Quotes ( " ) but can be any character.
|
|||
Escape | The Escape Enclosure strategy configures how to escape the text enclosure.
|
|||
Escape EEC | The Escape EEC (Escape Enclosure Character) is a flag that enables escaping the escape character that escapes the enclosure.
|
|||
EDI | There are some special notes around EDI files:
|
|||
More info | For information about investigating errors and the logs: Please also check the Masking Docs: |
Applicable Delphix Versions
- Click here to view the versions of the Delphix engine to which this article applies
-
Date Release Apr 17, 2024 22.0.0.0 Mar 20, 2024 | Apr 2, 2024 21.0.0.0 | 21.0.0.1 Feb 21, 2024 20.0.0.0 Jan 25, 2024 19.0.0.0 Dec 20, 2023 | Jan 10, 2024 18.0.0.0 | 18.0.0.1 Nov 21, 2023 17.0.0.0 Oct 18, 2023 16.0.0.0 Sep 21, 2023 15.0.0.0 Aug 24, 2023 14.0.0.0 Jul 24, 2023 13.0.0.0 Jun 21, 2023 12.0.0.0 May 25, 2023 11.0.0.0 Apr 13, 2023 10.0.0.0 | 10.0.0.1 Mar 13, 2023 | Mar 20, 2023 9.0.0.0 | 9.0.0.1 Feb 13, 2023 8.0.0.0 Jan 12, 2023 7.0.0.0 Releases Prior to 2023 Major Release All Sub Releases 6.0 6.0.7.0, 6.0.8.0, 6.0.8.1, 6.0.9.0, 6.0.10.0, 6.0.10.1, 6.0.11.0, 6.0.12.0, 6.0.12.1, 6.0.13.0, 6.0.13.1, 6.0.14.0, 6.0.15.0, 6.0.16.0, 6.0.17.0, 6.0.17.1, 6.0.17.2
Delimited File Configuration
This section describes the different configurations in Delimited File Masking, such as EOR, delimiter, enclosures, and escape characters (strategies).
UI Example
The Format is configured in the Rule Set. To access Special Characters, use the CTRL buttons shown below:
End of Record (EOR)
This defines the character (or sequence of characters) used to define the End of Record.
Option | Hex Code | Comment |
Select one of:
|
Use Hex:
|
This sets the EOR:
|
Custom End of Record
The EOR can be custom and be any character or a sequence of characters (printable or non-printable).
For non-printable characters please see the table in the section below.
Example
An example of a printable EOR sequence is: #n
This means that the two characters ('#' and 'n') will be the EOR.
This will not look like may be expected - below is a comma-separated example (3 columns, 3 rows):
R1C1_Val,R1C2_Val,R1C3_Val#nR2C1_Val,R2C2_Val,R2C3_Val#nR3C1_Val,R3C2_Val,R3C3_Val
EDI
Another example involves EDI files. These are frequently just separated with '~' (tilde) and '*' as a delimiter. A file will look some things like:
BSN*00*000*19880106*11:12:13~HL*1**S~REF*SI*0000001013~HL*2*1*O~PRF*123456~REF*OQ*12345-C~...
If the EOR contains New Line characters - it is essential to determine if it is a Windows or Unix-based file. In this case, it can either be:
- Tilde with Unix LF: ~$[0A]
- Tilde with Windows CR+LF: ~$[0D]$[0A]
BSN*00*000*19880106*11:12:13~ HL*1**S~REF*SI*0000001013~ HL*2*1*O~ PRF*123456~ REF*OQ*12345-C~ ...
Common error \n or \r\n
A common error is to define the EOR using '\n' or '\r\n'. When '\n' is defined as the EOR, this means the two characters '\' and 'n' are used as an EOR (not LF (0x0A) as commonly thought).
To define LF (0x0A) please use the 'LF terminated (Unix)' or a Custom Control Character as listed in the table below.
Non-printable Control Characters
A non-printable control character is a character/code (defined in ASCII) that does not represent a printable symbol. For the EOR, this is usually CR+LF or LF.
The table shows the Ctrl Code (also used in the UI), the Hex Value, a Description, and a Comment.
UI Symbol | Ctrl Character | Description | Comment |
^@ [NUL] ^A [SOH] ^B [STX] ^C [ETX] ^D [EOT] ^E [ENQ] ^F [ACK] ^G [BEL] ^H [BS] ^I [HT] ^J [LF] ^K [VT] ^L [FF] ^M [CR] ^N [SO] ^O [SI] ^P [DLE] ^Q [DC1] ^R [DC2] ^S [DC3] ^T [DC4] ^U [NAK] ^V [SYN] ^W [ETB] ^X [CAN] ^Y [EM] ^Z [SUB] ^[ [ESC] ^\ [FS] ^] [GS] ^^ [RS] ^_ [US] |
$[00] $[01] $[02] $[03] $[04] $[05] $[06] $[07] $[08] $[09] $[0A] $[0B] $[0C] $[0D] $[0E] $[0F] $[10] $[11] $[12] $[13] $[14] $[15] $[16] $[17] $[18] $[19] $[1A] $[1B] $[1C] $[1D] $[1E] $[1F] |
NULL (character) START OF HEADING START OF TEXT END OF TEXT END OF TRANSMISSION ENQUIRY ACKNOWLEDGE BELL (beep) BACKSPACE HORIZONTAL TAB LINE FEED VERTICAL TAB FORM FEED CARRIAGE RETURN SHIFT OUT SHIFT IN DATA LINK ESCAPE DEVICE CONTROL 1 (XON) DEVICE CONTROL 2 DEVICE CONTROL 3 DEVICE CONTROL 4 (XOFF) NEGATIVE ACKNOWLEDGE SYNCHRONOUS IDLE END OF TRANSMISSION BLOCK CANCEL END OF MEDIUM SUBSTITUTE ESCAPE FILE SEPARATOR GROUP SEPARATOR RECORD SEPARATOR UNIT SEPARATOR |
HT = TAB (used in tab-delimited file) LF = Windows EOR (2 of 2). Unix EOR. CR = Windows EOR (1 of 2). |
Delimiter
This defines the character (or characters) used to define the delimiter. The delimiter can be any character or a sequence of characters (printable or non-printable).
File Type | Delimiter | Comment |
Common delimited files are:
|
Use:
|
Notes:
|
Common error: '^t'
A common error is to define '^t' as the tab character. This will in fact define two characters ('^' and 't') as the delimiter.
To define TAB, click the 'Ctrl' and select Horizontal Tab (as listed in the table above).
Non-printable Control Characters
To define non-printable control characters as the delimiter, click the Ctrl button and select the character. These are the same control characters as in the EOR; see above for a list of Control Characters.
Text Enclosure
This section answer what to do if a value (string) in the delimited file contains a delimiter or EOR character.
This occurs when the value (string) is enclosed with an enclosure. This enclosure is also configurable on the engine.
Example
The example shows a CSV file with a value ('Jane,Smith') that contains a comma. Note the 3rd example where the value contains a newline, CR+LF (Windows) or LF (Unix).
ID,Full_name 1,John Smith 2,Jane,Smith 3,Scott Smith
Solution
The solution is the enclose the strings with an enclosure (here " ). Note that the enclosure is optional.
ID,Full_name 1,"John Smith" 2,"Jane,Smith" 3,"Scott Smith"
Enclosure Escape Strategy
This section answer what to do if the enclosed value does include the enclosure character.
There is a solution for this, also. It is referred to as an enclosure escape strategy.
Escape Strategy | Comment |
Two options:
|
To enable the engine to ignore an enclosure in a value:
|
Example
Following the example above but using a single quote as the enclosure. The surname is now O'connell
.
ID,Full_name 1,John O'connell 2,Jane,O'connell 3,Scott O'connell
Solution - Double Enclosure
The solution here is to escape the enclosure by doubling it (providing it twice).
ID,Full_name 1,'John O''connell' 2,'Jane,O''connell' 3,'Scott O''connell'
Solution - Custom
The solution used here is to escape with a character. The character used is / .
ID,Full_name 1,'John O/'connell' 2,'Jane,O/'connell' 3,'Scott O/'connell'
Escape "Enclosure Escape Character"
This section answer what to do if you need to escape the escape character that is used to escape the enclosure. There is a solution for that too.
Escape EEC | Comment |
Only one option:
|
To enable the engine to ignore an enclosure in a value:
|
Example
Following the example above but using using an added ' / ' in one of the values.
ID,Full_name 1,John /O'connell 2,Jane,/O'connell 3,Scott / O'connell
Solution - Escape EEC
In the example below - we now have:
- Enclosed the value with '
- Escaped the enclosure with /
- Escaped the Escape Enclosure Character by doubling the /.
ID,Full_name 1,'John //O/'connell' 2,'Jane,//O/'connell' 3,'Scott // O/'connell'
How to detect
Detecting errors in the input file can be a challenge. If the number of masked rows does not match the known number of rows in the file, this indicates there is an issue with EOR, enclosures, and/or escaping.
Some methods:
- The first step is to check the masked row counts and compare that with the known number of rows in the source file.
- Open the file and check (this can be hard on large files).
- Use a CSV validation tool (search on Google) or import the data into Excel.
Use Bash
It is also possible to use bash to quickly scan the data. Below is an example of a basic bash command to check delimiters and enclosures.
How it works:
- The script uses sed to remove the correct delimiters, then counts the number of enclosures, and then removes any row that is not 2 (starting and ending enclosure).
-- Set variables for delimiter and enclosure (these need to be escaped) del=, encl=\" file=yourfile.txt -- If this returns any rows - it indicates that there might be issues with the data in the file cat "$file" | sed "s/$encl$del$encl//g" | sed "s/[^($encl)]//g" | awk '{ print length }' | grep -v "2"