Skip to main content
Delphix

Delimited File Masking and EOR Enclosures Escapes (KBA8928)

 

 

KBA

KBA# 8928

At a Glance  

Description This article describes Delimited File Masking configurations, especially characters (printable and non-printable) for End of Record (EOR) and Delimiters and Enclosure and Escape strategies. 

Warning: Incorrect configuration might cause incorrect masked output and formatting. 
Configuration These configurations need to be correctly set:
  • in the masked file and
  • in the Rule Set / File Format
End of Record The End of Record (EOR) will likely differ between systems and needs to be defined.
Note: A New Line in a field will create a new record (fields with New Lines need to be enclosed or escaped). 
  EOR The End of Record configures the character(s) that defines the end of the record. 
 
Options:
  • CR+LF term (Windows, DOS)
  • LF term (Unix)
  • Custom
Use characters:
  • Non-printable: $[0D]$[0A]
  • Non-printable$[0A]
  • Custom printable and non-printable (Ctrl).
    • Examples: ~$[0D]$[0A]
Delimited Files A Delimited File uses a character to separate fields. The delimiter needs to be defined.
Note: A delimiter in a field will create a new field (values with a delimiter need to be enclosed or escaped). 
  Delimiter The Delimiter configures the character(s) that defines the delimitation of values in a record.

The most common delimiter is Comma ( , ) but can be any character. 

 
Options:
  • Custom
Characters:
  • Custom printable and non-printable (Ctrl).
    • Examples:
      • Comma:  , 
      • Pipe:  | 
      • Tab:  $[09] 
Enclosure and Escape If a value contains a delimiter or an EOR, it needs to be Enclosed or Escaped. 
  Enclosure The Text Enclosure configures the character(s) that encloses a value.

The most common text enclosure is Double Quotes ( " ) but can be any character.

 
Options:
  • Custom
Characters:
  • Custom printable.
  Escape The Escape Enclosure strategy configures how to escape the text enclosure.
 
Options:
  • Double Enclosure
  • Custom
Enclosure Escape Strategy:
  • The same character twice.
  • Custom printable.
  Escape EEC The Escape EEC (Escape Enclosure Character) is a flag that enables escaping the escape character that escapes the enclosure.
 
Options:
  • Flag (on/off)
Escape the Escape Enclosure Character:
  • The same character twice.
EDI There are some special notes around EDI files:
  1. Check the EOR of the masked EDI file (use Notepad++ or other text editor):
    • The EOR has to be correct and common EORs are (use Custom):
      • Single line: ~
      • Tilde with Unix LF: ~$[0A]
      • Tilde with Windows CR+LF: ~$[0D]$[0A]
  2. The Delimiter is usually: *
More info For information about investigating errors and the logs: Please also check the Masking Docs:

Applicable Delphix Versions

Click here to view the versions of the Delphix engine to which this article applies
Date Release
Apr 17, 2024 22.0.0.0
Mar 20, 2024 | Apr 2, 2024 21.0.0.0 | 21.0.0.1
Feb 21, 2024 20.0.0.0
Jan 25, 2024 19.0.0.0
Dec 20, 2023 | Jan 10, 2024 18.0.0.0 | 18.0.0.1
Nov 21, 2023 17.0.0.0
Oct 18, 2023 16.0.0.0
Sep 21, 2023 15.0.0.0
Aug 24, 2023 14.0.0.0
Jul 24, 2023 13.0.0.0
Jun 21, 2023 12.0.0.0
May 25, 2023 11.0.0.0
Apr 13, 2023 10.0.0.0 | 10.0.0.1
Mar 13, 2023 | Mar 20, 2023 9.0.0.0 | 9.0.0.1
Feb 13, 2023 8.0.0.0
Jan 12, 2023 7.0.0.0
Releases Prior to 2023
Major Release All Sub Releases
6.0

6.0.7.0, 6.0.8.0, 6.0.8.1, 6.0.9.0, 6.0.10.0, 6.0.10.1, 6.0.11.0, 6.0.12.0, 6.0.12.1, 6.0.13.0, 6.0.13.1, 6.0.14.0, 6.0.15.0, 6.0.16.0, 6.0.17.0, 6.0.17.1, 6.0.17.2

Delimited File Configuration

This section describes the different configurations in Delimited File Masking, such as EOR, delimiter, enclosures, and escape characters (strategies).

UI Example 

The Format is configured in the Rule Set. To access Special Characters, use the CTRL buttons shown below:  

KBA - Set Delimiter.png

End of Record (EOR)

This defines the character (or sequence of characters) used to define the End of Record.

Option Hex Code Comment
Select one of:
  • Custom
  • For: 
    • CR+LF terminated (Windows, DOS)
    • LF terminated (Unix)
Use Hex:
  • -
  • -
  • 0x0D+0x0A
  • 0x0A
This sets the EOR:
  • Use Custom to define desired character(s) as the EOR.
  • For these set it to:
    • $[0D]$[0A]
    • $[0A]

Custom End of Record

The EOR can be custom and be any character or a sequence of characters (printable or non-printable).

For non-printable characters please see the table in the section below. 

Example

An example of a printable EOR sequence is: #n 

This means that the two characters ('#' and 'n') will be the EOR.

This will not look like may be expected - below is a comma-separated example (3 columns, 3 rows):

R1C1_Val,R1C2_Val,R1C3_Val#nR2C1_Val,R2C2_Val,R2C3_Val#nR3C1_Val,R3C2_Val,R3C3_Val
EDI

Another example involves EDI files. These are frequently just separated with '~' (tilde) and '*' as a delimiter. A file will look some things like:

BSN*00*000*19880106*11:12:13~HL*1**S~REF*SI*0000001013~HL*2*1*O~PRF*123456~REF*OQ*12345-C~...

If the EOR contains New Line characters - it is essential to determine if it is a Windows or Unix-based file. In this case, it can either be:

  • Tilde with Unix LF: ~$[0A]
  • Tilde with Windows CR+LF: ~$[0D]$[0A]
BSN*00*000*19880106*11:12:13~
HL*1**S~REF*SI*0000001013~
HL*2*1*O~
PRF*123456~
REF*OQ*12345-C~
...

Common error \n or \r\n

A common error is to define the EOR using '\n' or '\r\n'. When '\n' is defined as the EOR, this means the two characters '\' and 'n' are used as an EOR (not LF (0x0A) as commonly thought).

To define LF (0x0A) please use the 'LF terminated (Unix)' or a Custom Control Character as listed in the table below. 

Note

Note:

'\n' is frequently used in text to indicate newline (note that this is an escaped encoding, part of the text, and not the LF control character).

 

Non-printable Control Characters 

non-printable control character is a character/code (defined in ASCII) that does not represent a printable symbol. For the EOR, this is usually CR+LF or LF. 

The table shows the Ctrl Code (also used in the UI), the Hex Value, a Description, and a Comment.

UI Symbol Ctrl Character Description  Comment
^@ [NUL]
^A [SOH]
^B [STX]
^C [ETX]
^D [EOT]
^E [ENQ]
^F [ACK]
^G [BEL]
^H [BS]
^I [HT]
^J [LF]
^K [VT]
^L [FF]
^M [CR]
^N [SO]
^O [SI]
^P [DLE]
^Q [DC1]
^R [DC2]
^S [DC3]
^T [DC4]
^U [NAK]
^V [SYN]
^W [ETB]
^X [CAN]
^Y [EM]
^Z [SUB]
^[ [ESC]
^\ [FS]
^] [GS]
^^ [RS]
^_ [US]
$[00]
$[01]
$[02]
$[03]
$[04]
$[05]
$[06]
$[07]
$[08]
$[09]
$[0A]
$[0B]
$[0C]
$[0D]
$[0E]
$[0F]
$[10]
$[11]
$[12]
$[13]
$[14]
$[15]
$[16]
$[17]
$[18]
$[19]
$[1A]
$[1B]
$[1C]
$[1D]
$[1E]
$[1F]
NULL (character)
START OF HEADING
START OF TEXT
END OF TEXT
END OF TRANSMISSION
ENQUIRY
ACKNOWLEDGE
BELL (beep)
BACKSPACE
HORIZONTAL TAB
LINE FEED
VERTICAL TAB
FORM FEED
CARRIAGE RETURN
SHIFT OUT
SHIFT IN
DATA LINK ESCAPE
DEVICE CONTROL 1 (XON)
DEVICE CONTROL 2
DEVICE CONTROL 3
DEVICE CONTROL 4 (XOFF)
NEGATIVE ACKNOWLEDGE
SYNCHRONOUS IDLE
END OF TRANSMISSION BLOCK
CANCEL
END OF MEDIUM
SUBSTITUTE
ESCAPE
FILE SEPARATOR
GROUP SEPARATOR
RECORD SEPARATOR
UNIT SEPARATOR










HT = TAB (used in tab-delimited file)
LF = Windows EOR (2 of 2). Unix EOR.


CR = Windows EOR (1 of 2).

 

Delimiter

This defines the character (or characters) used to define the delimiter. The delimiter can be any character or a sequence of characters (printable or non-printable).

File Type Delimiter Comment
Common delimited files are: 
  • Comma Separated Values (CSV)
  • Pipe Delimited File
  • Tab Delimited File
Use:
  •  , 
  •  | 
  •  $[09] 
Notes:
  • Printable character: comma.
  • Printable character: pipe.
  • Click the button 'Ctrl' and select ^I [HT]  Horizontal Tab.

Common error: '^t'

A common error is to define '^t' as the tab character. This will in fact define two characters ('^' and 't') as the delimiter. 

To define TAB, click the 'Ctrl' and select Horizontal Tab (as listed in the table above). 

 

Note

Note:

Do not use '^t'. It is frequently used in text data. This is an escaped encoding in the text (not a tab character).

 

Non-printable Control Characters 

To define non-printable control characters as the delimiter, click the Ctrl button and select the character. These are the same control characters as in the EOR; see above for a list of Control Characters.  

Text Enclosure

This section answer what to do if a value (string) in the delimited file contains a delimiter or EOR character.

This occurs when the value (string) is enclosed with an enclosure. This enclosure is also configurable on the engine.

Example

The example shows a CSV file with a value ('Jane,Smith') that contains a comma. Note the 3rd example where the value contains a newline, CR+LF (Windows) or LF (Unix).

ID,Full_name
1,John Smith
2,Jane,Smith
3,Scott
Smith

Solution

The solution is the enclose the strings with an enclosure (here " ). Note that the enclosure is optional.

ID,Full_name
1,"John Smith"
2,"Jane,Smith"
3,"Scott
Smith"

Enclosure Escape Strategy

This section answer what to do if the enclosed value does include the enclosure character.

There is a solution for this, also. It is referred to as an enclosure escape strategy.

Escape Strategy Comment
Two options: 
  • Double Enclosure
  • Custom
To enable the engine to ignore an enclosure in a value:
  • Use the enclosure twice. 
  • Use a special escape character (printable only).

Example

Following the example above but using a single quote as the enclosure. The surname is now O'connell

ID,Full_name
1,John O'connell
2,Jane,O'connell
3,Scott
O'connell

Solution - Double Enclosure

The solution here is to escape the enclosure by doubling it (providing it twice).

ID,Full_name
1,'John O''connell'
2,'Jane,O''connell'
3,'Scott
O''connell'

Solution - Custom 

The solution used here is to escape with a character. The character used is / .

ID,Full_name
1,'John O/'connell'
2,'Jane,O/'connell'
3,'Scott
O/'connell'

Escape "Enclosure Escape Character"

This section answer what to do if you need to escape the escape character that is used to escape the enclosure. There is a solution for that too.

 

Escape EEC Comment
Only one option: 
  • 'Double Escape'
To enable the engine to ignore an enclosure in a value:
  • Use the escape EEC twice. 

Example

Following the example above but using using an added ' / ' in one of the values. 

ID,Full_name
1,John /O'connell
2,Jane,/O'connell
3,Scott /
O'connell

Solution - Escape EEC 

In the example below - we now have:

  1. Enclosed the value with '
  2. Escaped the enclosure with /
  3. Escaped the Escape Enclosure Character by doubling the /. 
ID,Full_name
1,'John //O/'connell'
2,'Jane,//O/'connell'
3,'Scott //
O/'connell'

How to detect

Detecting errors in the input file can be a challenge. If the number of masked rows does not match the known number of rows in the file, this indicates there is an issue with EOR, enclosures, and/or escaping. 

Some methods:

  • The first step is to check the masked row counts and compare that with the known number of rows in the source file.
  • Open the file and check (this can be hard on large files).
  • Use a CSV validation tool (search on Google) or import the data into Excel.
Use Bash

It is also possible to use bash to quickly scan the data. Below is an example of a basic bash command to check delimiters and enclosures.

Note

Notes:

  • This example is not a complete tool and should be used as a starting point. 
  • It uses variables to make the code easier to read.
  • It will not show which row.

How it works:

  • The script uses sed to remove the correct delimiters, then counts the number of enclosures, and then removes any row that is not 2 (starting and ending enclosure).
-- Set variables for delimiter and enclosure (these need to be escaped)
del=,
encl=\"
file=yourfile.txt

-- If this returns any rows - it indicates that there might be issues with the data in the file
cat "$file" | sed "s/$encl$del$encl//g" | sed "s/[^($encl)]//g" |  awk '{ print length }' | grep -v "2"