Skip to main content
Delphix

Delimited File Masking and EOR Enclosures Escapes (KBA8928)

 

 

KBA

KBA# 8928

At a Glance  

Description: This KBA describes Delimited File Masking configurations. Especially characters (printable and unprintable) for End of Record and Delimiters and Enclosure and Escape strategies. 

Note: Incorrect configuration, might cause incorrect masked output and formatting. 
End of Record: The End of Record configures the character(s) that defines the end of the record. 
 
Options:
  • CR+LF term. (Windows, DOS)
  • LF term. (Unix)
  • Custom
Use characters:
  • Unprintable: 0x0D + 0x0A 
  • Unprintable: 0x0A
  • Custom printable and unprintable (Ctrl).
    • Examples:
      • EDI: ~$[0D]$[0A]
Delimiter: The Delimiter configures the character(s) that defines the delimitation of values in a record.

The most common delimiter is Comma ( , ) but can be any character. 

 
Options:
  • Custom
Characters:
  • Custom printable and unprintable (Ctrl).
    • Examples:
      • Comma:  , 
      • Pipe:  | 
      • Tab:  $[09] 
Text Enclosure: The Text Enclosure configures the character(s) that encloses a value.

The most common text enclosure is Double Quotes ( " ) but can be any character.

 
Options:
  • Custom
Characters:
  • Custom printable.
Enclosure  Escape: The Escape Enclosure strategy configures how to escape the text enclosure.
 
Options:
  • Double Enclosure
  • Custom
Enclosure Escape Strategy:
  • The same character twice.
  • Custom printable.
Escape EEC: The Escape EEC (Escape Enclosure Character) is a flag that enables escaping the escape character that escapes the enclosure.
 
Options:
  • Flag (on/off)
Escape the Escape Enclosure Character:
  • The same character twice.
More info: More information about this can be found in the Masking Docs:

Applicable Delphix Versions

Click here to view the versions of the Delphix engine to which this article applies
Major Release All Sub Releases
6.0 6.0.0.0, 6.0.1.0, 6.0.1.1, 6.0.2.0, 6.0.2.1, 6.0.3.0, 6.0.3.1, 6.0.4.0, 6.0.4.1, 6.0.4.2, 6.0.5.0, 6.0.6.0, 6.0.6.1, 6.0.7.0, 6.0.8.0, 6.0.8.1, 6.0.9.0, 6.0.10.0, 6.0.10.1, 6.0.11.0, 6.0.12.0, 6.0.12.1, 6.0.13.0

5.3

5.3.0.0, 5.3.0.1, 5.3.0.2, 5.3.0.3, 5.3.1.0, 5.3.1.1, 5.3.1.2, 5.3.2.0, 5.3.3.0, 5.3.3.1, 5.3.4.0, 5.3.5.0, 5.3.6.0, 5.3.7.0, 5.3.7.1, 5.3.8.0, 5.3.8.1, 5.3.9.0

5.2

5.2.2.0, 5.2.2.1, 5.2.3.0, 5.2.4.0, 5.2.5.0, 5.2.5.1, 5.2.6.0, 5.2.6.1

5.1

5.1.0.0, 5.1.1.0, 5.1.2.0, 5.1.3.0, 5.1.4.0, 5.1.5.0, 5.1.5.1, 5.1.6.0, 5.1.7.0, 5.1.8.0, 5.1.8.1, 5.1.9.0, 5.1.10.0

5.0

5.0.1.0, 5.0.1.1, 5.0.2.0, 5.0.2.1, 5.0.2.2, 5.0.2.3, 5.0.3.0, 5.0.3.1, 5.0.4.0, 5.0.4.1, 5.0.5.0, 5.0.5.1, 5.0.5.2, 5.0.5.3, 5.0.5.4

Delimited File Configuration

This section describes the different configurations in Delimited File Masking, such as EOR, delimiter, enclosures, and escape characters (strategies).

UI Example 

The Format is configured in the Rule Set and to access Special Characters - use the CTRL buttons shown below:  

UI_Masking_-_CSV_CTRL_Config.png

End of Record (EOR)

This defines the character (or sequence of characters) used to define the End of Record.

Option Hex Code Comment
Select one of:
  • Custom
  • For: 
    • CR+LF terminated (Windows, DOS)
    • LF terminated (Unix)
Use Hex:
  • -
  • -
  • 0x0D+0x0A
  • 0x0A
This sets the EOR:
  • Use Custom to define desired character(s) as the EOR.
  • For these set it to:
    • $[0D]$[0A]
    • $[0A]

Custom End of Record

The EOR can be custom and be any character or a sequence of characters (printable or unprintable).

For unprintable characters please see the table in the section below. 

 

An example of a printable EOR sequence is: #n 

This means that the two characters ('#' and 'n') will be the EOR.

This will not look like many expect - below is a comma separated example (3 columns, 3 rows):

R1C1_Val,R1C2_Val,R1C3_Val#nR2C1_Val,R2C2_Val,R2C3_Val#nR3C1_Val,R3C2_Val,R3C3_Val

Common error \n or \r\n

A common error is to define '\n' or '\r\n' as the EOR. When '\n' is defined as the EOR  this means that the two characters '\' and 'n' are used as an EOR (not LF (0x0A) as commonly thought).

To define LF (0x0a) please use the 'LF terminated (Unix)' or a Custom Control Character as listed in the table below. 

Note

Note:

'\n' is frequently used in text to indicate newline (note that this is an escaped encoding, part of the text, and not the LF control character).

 

Unprintable Control Characters 

An unprintable control characters is a characters/code (defined in ASCII) that does not represent a printable symbol. For the EOR, this is usually CR+LF or LF. 

The table shows the Ctrl Code (also used in the UI), the Hex Value, a Description, and a Comment.

UI Symbol Ctrl Character Description  Comment
^@ [NUL]
^A [SOH]
^B [STX]
^C [ETX]
^D [EOT]
^E [ENQ]
^F [ACK]
^G [BEL]
^H [BS]
^I [HT]
^J [LF]
^K [VT]
^L [FF]
^M [CR]
^N [SO]
^O [SI]
^P [DLE]
^Q [DC1]
^R [DC2]
^S [DC3]
^T [DC4]
^U [NAK]
^V [SYN]
^W [ETB]
^X [CAN]
^Y [EM]
^Z [SUB]
^[ [ESC]
^\ [FS]
^] [GS]
^^ [RS]
^_ [US]
$[00]
$[01]
$[02]
$[03]
$[04]
$[05]
$[06]
$[07]
$[08]
$[09]
$[0A]
$[0B]
$[0C]
$[0D]
$[0E]
$[0F]
$[10]
$[11]
$[12]
$[13]
$[14]
$[15]
$[16]
$[17]
$[18]
$[19]
$[1A]
$[1B]
$[1C]
$[1D]
$[1E]
$[1F]
NULL (character)
START OF HEADING
START OF TEXT
END OF TEXT
END OF TRANSMISSION
ENQUIRY
ACKNOWLEDGE
BELL (beep)
BACKSPACE
HORIZONTAL TAB
LINE FEED
VERTICAL TAB
FORM FEED
CARRIAGE RETURN
SHIFT OUT
SHIFT IN
DATA LINK ESCAPE
DEVICE CONTROL 1 (XON)
DEVICE CONTROL 2
DEVICE CONTROL 3
DEVICE CONTROL 4 (XOFF)
NEGATIVE ACKNOWLEDGE
SYNCHRONOUS IDLE
END OF TRANSMISSION BLOCK
CANCEL
END OF MEDIUM
SUBSTITUTE
ESCAPE
FILE SEPARATOR
GROUP SEPARATOR
RECORD SEPARATOR
UNIT SEPARATOR










HT = TAB (used in tab-delimited file)
LF = Windows EOR (2 of 2). Unix EOR.


CR = Windows EOR (1 of 2).

 

Delimiter

This defines the character (or characters) used to define the delimiter. The delimiter can be any character or a sequence of characters (printable or unprintable).

File Type Delimiter Comment
Common delimited files are: 
  • Comma Separated Values (CSV)
  • Pipe Delimited File
  • Tab Delimited File
Use:
  •  , 
  •  | 
  •  $[09] 
Notes:
  • Printable character: comma.
  • Printable character: pipe.
  • Click the button 'Ctrl' and select ^I [HT]  Horizontal Tab.

Common error: '^t'

A common error is to define '^t' as the tab character. This will in fact define two characters ('^' and 't') as the delimiter. 

To define TAB, click the 'Ctrl' and select Horizontal Tab (as listed in the table above). 

 

Note

Note:

Do not use '^t'. It is frequently used in text data. This is an escaped encoding in the text (not a tab character).

 

Unprintable Control Characters 

To define unprintable control characters as the delimiter, click the Ctrl button and select the character. These are the same control characters as in the EOR, see above for a list of Control Characters.  

Text Enclosure

If a value (string) in the delimited file contains a delimiter or EOR character, what should you do?

This is when the value (string) is enclosed with an enclosure. This enclosure is also configurable on the engine.

Example

The example shows a CSV file with a value ('Jane,Smith') that contains a comma. Note the 3rd example where the value contains a newline, CR+LF (Windows) or LF (Unix).

ID,Full_name
1,John Smith
2,Jane,Smith
3,Scott
Smith

Solution

The solution is the enclose the strings with an enclosure (here " ). Note that the enclosure is optional.

ID,Full_name
1,"John Smith"
2,"Jane,Smith"
3,"Scott
Smith"

Enclosure Escape Strategy

What if the enclosed value does include the enclosure character?

There is a solution for this, too. It is called enclosure escape strategy.

Escape Strategy Comment
Two options: 
  • Double Enclosure
  • Custom
To enable the engine to ignore an enclosure in a value:
  • Use the enclosure twice. 
  • Use a special escape character (printable only).

Example

Following the example above but using a single quote as the enclosure. The surname is now O'connell. 

ID,Full_name
1,John O'connell
2,Jane,O'connell
3,Scott
O'connell

Solution - Double Enclosure

The solution here is to escape the enclosure by doubling it (writing it twice).

ID,Full_name
1,'John O''connell'
2,'Jane,O''connell'
3,'Scott
O''connell'

Solution - Custom 

The solution used here is to escape with a character. The character used is / .

ID,Full_name
1,'John O/'connell'
2,'Jane,O/'connell'
3,'Scott
O/'connell'

Escape "Enclosure Escape Character"

What if we need to escape the escape character that is used to escape the enclosure? There is a solution for that too.

 

Escape EEC Comment
Only one option: 
  • 'Double Escape'
To enable the engine to ignore an enclosure in a value:
  • Use the escape EEC twice. 

Example

Following the example above but using using an added ' / ' in one of the values. 

ID,Full_name
1,John /O'connell
2,Jane,/O'connell
3,Scott /
O'connell

Solution - Escape EEC 

In the example below - we have now:

  1. Enclosed to value with '
  2. Escaped the enclosure with /
  3. Escaped the Escape Enclosure Character by doubling the /. 
ID,Full_name
1,'John //O/'connell'
2,'Jane,//O/'connell'
3,'Scott //
O/'connell'

How to detect

Detecting errors in the input file can be tricky. If the number of masked rows doesn't match the known number of rows in the file this indicates that there is an issue with EOR, enclosures, and escaping. 

Some methods:

  • The first call is to check the masked row counts and compare that with the known number of rows in the source file.
  • Open the file and check (this can be hard on large files)
  • Use a CSV validation tool (search on Google) or import the data into Excel.

 

Use Bash

It is also possible to use bash to quickly scan the data. Below is an example of a basic bash command to check delimiters and enclosures.

Notes:

  • This example is not a complete tool and should be used as a starting point. 
  • It uses variables to make the code easier to read.
  • It will not show which row.

How it works:

  • The script uses sed to remove the correct delimiters, then counts the number of enclosures, and then removes any row that isn't 2 (starting and ending enclosure).
-- Set variables for delimiter and enclosure (these need to be escaped)
del=,
encl=\"
file=yourfile.txt

-- If this returns any rows - it indicates that there might be issues with the data in the file
cat "$file" | sed "s/$encl$del$encl//g" | sed "s/[^($encl)]//g" |  awk '{ print length }' | grep -v "2"