Skip to main content
Delphix

Column and Data Level Profiling (KBA1008)

 

This page covers the technical overview of Delphix's Masking Profiler that describes Column and Data Level profiling and RegEx constructs.

This page provides additional information to the user documentation about Docs: Managing Profiler Settings.  

At a Glance  

Available in:

Available in all known versions.  3x, 4.x, 5.x
This document is based on 5.2.

Characteristics:

The Profiler uses standard RegEx (JavaScript).
All matches are case insensitive, option /i.

Modes:

There are two modes:

  • Column Level 
  • Data Level 
Special notes:
  • Data  Level
    • Complete strings are searched. This means large text objects will take a longer time.
    • Default: No of Rows searched: Top 100 
    • Default: Percentage Required for match is 80%
Configuration:

Data Level Defaults can be modified. Please contact support. 

When is Profiling Used? 

Assigning Domains/Algorithms to Columns in a Rule Set can be time-consuming and profiling is a way to automate this process. Profiling can look at both the Column name (in other words, First Name, Last Name, Address 1, Credit Card Number etc) and the Data (values) in the database. 

Different types of profiles (or Profile Sets) can be defined and customized for specific requirements. These are then defined in a Profile Job which when executed assigns Domains/Algorithms to Columns based on the search criteria in the Profile Set.

How it Works Together 

To use profiling, there are three different components that need to be defined:

  • Profile Set with search Expressions.
  • Rule Set to the columns/data to Profile.
  • Profile Job

Profile Set 

There are two different components in a Profile:

  1. Profile Set
    • The Profile Set is a set of expressions which are used in the Profile Job. You can add and modify the Profile Sets. 
  2. Expression 
    • These are Regular Expressions (RegEx) that are applied to either a Column name or the data in a Column. 

 

Rule Set 

A Rule Set contains the Connector to the data, the Tables in the Database to be masked and the Columns in the tables.

Profile Job 

The Profile Job is configured around the Rule Set (see above) and when executed applies the Profile Expressions to the database. The matched Domains/Algorithms are assigned to the Columns in the Rule Set.

Profiling Levels - Column and Data Level 

There are two different levels of profiling:

  1. Column Level profiling 
    • Searches through the Column names in the Rule Set from the connected database. looking for specific patterns in Column names.
  2. Data Level profiling 
    • Searches through the top x rows of data itself in the Rule Set, looking for specific patterns in the data.
    • By default, the 100 top rows are searched (this is configurable via support). 

The fastest way to profile a Rule Set is to use Column Level profiling.  There are special cases where Column Level profiling is not sufficient (in other words, the Column names are not specific enough) and Data Level profiling is the only way to profile and assign the correct Domain/Algorithm to a column.

Note that a Profile Set can contain both Column Level profiling and Data Level profiling. Column Level is executed first and will override Data Level profiling. 

 

Note

Note:

Profiling will profile the data as per formatting from the JDBC connection, which will strictly follow format for the data type. For example, Data Type Currency has 4 decimal points and will be formatted with 4 decimal points (when reading the data using a SELECT in say Management Studio the data is displayed with 2 decimal points). For more info, read the definition of the datatype. If there are False Positives then the RegEx might need to be modified (in other words, including a RegEx Exclusion using ?<!).

Default Profile Sets  

New Profile Sets (out of the box) can be created and they can be edited to suit specific profiling requirements. When the Masking Engine is installed, the following Profile Sets and Profile Configurations are pre-installed:

  1. Financial
    • Column Level profiling only configured (for a list of Expressions - scroll down).
  2. HIPAA
    • Column Level profiling only configured (for a list of Expressions - scroll down).

How to Customize Expression Examples 

Each database is different and each database has its own naming convention. It is necessary to update the Profile Sets provided out of the box and create custom Expressions and Profile Sets. 

Below are some examples of expressions. 

Note

Note:

 All Regular Expressions (RegEx) are case insensitive on the Masking Engine. 

To test, some RegEx might need parenthesis () around the expression. A good tool to test is regex101.com.
Remember to set the flag to make the search case insensitive.

Some evaluation tools, such as http://www.regexplanet.com/advanced/java/index.htmlcompare the output of different regular expression evaluation functions.  Note that the Masking Engine uses the find() method rather than matches(), which may need to be taken into account when writing the regular expression.  matches() will match an entire string whereas find() will match a substring.  A leading caret (^) and trailing dollar sign ($) may therefore be needed in the profiling expression to match an entire string.

Column Level Examples 

Simple Expression - First Name 

(?>(fi?rst)_?(na?me?)|f_?name)(?!\w*ID)

This RegEx has two groups and matches: 

  • Group 1: (?>(fi?rst)_?(na?me?)
    • Any leading string, then
    • first or frst, then
    • optional _, then 
    • namenamnme or nm
  • Group 2:  f_?name)(?!\w*ID)
    • f_name or fname
    • Do not matchID
## Examples - Group 1 matches
FIRSTNAME
First_NM
First_NM_ID
Spouse_First_Name

## Examples - Group 2 matches
F_Name
fname

## Examples - Group 2 No match
fname_id

Complex Expression - Address 

^(?:(?!postalcode|city|state|country|(l|ln|lin|line)?_?2{1}|ID).)*addre?s?s?_?(?:(?!city|state|country|(l|ln|lin|line)?_?2{1}|ID).)*$

This RegEx has multiple groups and quantifiers. 

## Examples - matches
ADDR
Address
address_line
addrln_1

## Example - No match
POSTALCODE
city
address_id
id_address
address_city
city_address

Data Level Examples 

Simple Expression - PO Box 

po box|p\.o\.

This RegEx has two groups and matches: 

  • Group 1: po box
    • po boxand any thing there after - in other words, po box 123
  • Group 2:  p\.o\.
    • p.o. and any thing there after - in other words, p.o. 1234
## Examples - matches
PO BOX
PO BOX 123
p.o. 123a

## Example - No match
PO. 123
POBOX 123
PBOX 123

Complex Expression - Address 

(.*[\s]+b(ou)?l(e)?v(ar)?d[\s]*.*)|(.*[\s]+st[.]?(reet)?[\s]*.*)|(.*[\s]+ave[.]?(nue)?[\s]*.*)|(.*[\s]+r(oa)?d[\s]*.*)|(.*[\s]+l(a)?n(e)?[\s]*.*)|(.*[\s]+cir(cle)?[\s]*.*)

This RegEx has multiple groups and quantifiers. 

  • Need to capture the most frequent addresses only. 
    • boulevard (and shorts)
    • st | st. | street
    • ave | ave. | avenue
    • ...
## Examples - matches
ANY STREET
any st.
any blvd
any ave
Any Avenue
any lane
any rd
any cir

## Example - No match
ANY STR
ANY AV
ANY PLACE

Expression with Exclusion - US Phone Number 

This is a simple US Phone Number RegEx (this RegEx is not suitable for international numbers). This RegEx includes an exclusion for any number with 6 digits and 4 decimals. 

((\(?\b[0-9]{3}\)?[-. ]?[0-9]{3}[-. ]?[0-9]{4}\b)(?<![0-9]{6}[.][0-9]{4}))
  • Group 1:  (\(?\b....)
    • Match a number with 3 digits, 3 digits and 4 digits (there can be optional space, - or . between each set).
  • Negative Lookbehind:  (?<![...)
    • Do not match 6 digits, dot [.] and 4 digits. 
## Examples - matches
1231231234
123.123.1234
123 123-1234
123123 1234

## Example - No match
12 3456 1234
123456.1234
1.1
any123 123 1234

Out of the Box Expressions and Profile Sets 

Financial Profiling Expressions 

The Profile Set "Financial" contains, Out of the Box, the following Expressions - all are Column Level expressions:

  •  Birth Date
  •  Address
  •  Address Line 2 - after
  •  Account Number
  •  Birth Date1
  •  Customer Number
  •  Credit Card Number
  •  Card Number
  •  Drivers License Number
  •  Drivers License Number1
  •  Email
  •  First Name
  •  Last Name
  •  Middle Name
  •  Security Code
  •  Social Security Number
  •  Tax ID Code or Number
  •  Telephone or Contact Number
  •  Zip or Postal Code
  •  School Name
  •  Tax ID Number
  •  Address Line2 - before
  •  Birth Date2
  •  PO Box
  •  Fax Number
  •  Street Address
  •  IP Address
  •  Web or URL Address

HIPAA Profiling Expressions 

The Profile Set "HIPAA" contains, Out of the Box, the following Expressions - all are Column Level expressions:

  •  Birth Date
  •  Address
  •  Address Line 2 - after
  •  Account Number
  •  Birth Date1
  •  Customer Number
  •  Credit Card Number
  •  Card Number
  •  Drivers License Number
  •  Drivers License Number1
  •  Email
  •  First Name
  •  Last Name
  •  Middle Name
  •  Security Code
  •  Social Security Number
  •  Tax ID Code or Number
  •  Telephone or Contact Number
  •  Zip or Postal Code
  •  School Name
  •  Tax ID Number
  •  Address Line2 - before
  •  Birth Date2
  •  Beneficiary Number
  •  Certificate Number
  •  City
  •  County
  •  License Plate
  •  PO Box
  •  Precinct
  •  Record Number
  •  Serial Number
  •  Signature
  •  VIN
  •  Vehicle
  •  Fax Number
  •  Admission Date
  •  Treatment Date
  •  Discharge Date
  •  Street Address
  •  IP Address
  •  Web or URL Address
  •  Biometric
  •  Certificate ID
  •  Beneficiary ID