Column and Data Level Profiling (KBA1008)
KBA
KBA#1008At a Glance
Description | This KBA provides a technical overview of Delphix's Column and Data Level Profiling. This feature is partly deprecated (see below). This KBA does only cover Column and Data Level Profiling. It describes:
|
---|---|
Updated | Please note that Profiling has been updated and a new feature, Automated Sensitive Data Discovery (ASDD). ASDD was officially introduced in version 9.0.0.0. Use this ASDD for advanced data profiling. See docs for more information. |
Levels and Priority | There are two profiling levels:
|
Search Expressions | The Profiler uses JavaScript RegEx. All matches are case insensitive (uses option /i by default). |
Data Level | Note these things about Data Level profiling:
|
Pro Tip | To exclude a column from Profiling:
|
Configuration | Profiling defaults can be modified using the API (Application Settings).
|
More info | How to troubleshoot:
Information about Profiling and Automated Sensitive Data Discovery (ASDD): |
Applicable Delphix Versions
- Click here to view the versions of the Delphix engine to which this article applies
-
Date Release Jan 25, 2024 19.0.0.0 Dec 20, 2023 | Jan 10, 2024 18.0.0.0 | 18.0.0.1 Nov 21, 2023 17.0.0.0 Oct 18, 2023 16.0.0.0 Sep 21, 2023 15.0.0.0 Aug 24, 2023 14.0.0.0 Jul 24, 2023 13.0.0.0 Jun 21, 2023 12.0.0.0 May 25, 2023 11.0.0.0 Apr 13, 2023 10.0.0.0 | 10.0.0.1 Mar 13, 2023 | Mar 20, 2023 9.0.0.0 | 9.0.0.1 Feb 13, 2023 8.0.0.0 Jan 12, 2023 7.0.0.0 Releases Prior to 2023 Major Release All Sub Releases 6.0 6.0.0.0, 6.0.1.0, 6.0.1.1, 6.0.2.0, 6.0.2.1, 6.0.3.0, 6.0.3.1, 6.0.4.0, 6.0.4.1, 6.0.4.2, 6.0.5.0, 6.0.6.0, 6.0.6.1, 6.0.7.0, 6.0.8.0, 6.0.8.1, 6.0.9.0, 6.0.10.0, 6.0.10.1, 6.0.11.0, 6.0.12.0, 6.0.12.1, 6.0.13.0, 6.0.13.1, 6.0.14.0, 6.0.15.0, 6.0.16.0, 6.0.17.0, 6.0.17.1, 6.0.17.2
5.3
5.3.0.0, 5.3.0.1, 5.3.0.2, 5.3.0.3, 5.3.1.0, 5.3.1.1, 5.3.1.2, 5.3.2.0, 5.3.3.0, 5.3.3.1, 5.3.4.0, 5.3.5.0, 5.3.6.0, 5.3.7.0, 5.3.7.1, 5.3.8.0, 5.3.8.1, 5.3.9.0 5.2
5.2.2.0, 5.2.2.1, 5.2.3.0, 5.2.4.0, 5.2.5.0, 5.2.5.1, 5.2.6.0, 5.2.6.1
5.1
5.1.0.0, 5.1.1.0, 5.1.2.0, 5.1.3.0, 5.1.4.0, 5.1.5.0, 5.1.5.1, 5.1.6.0, 5.1.7.0, 5.1.8.0, 5.1.8.1, 5.1.9.0, 5.1.10.0
5.0
5.0.1.0, 5.0.1.1, 5.0.2.0, 5.0.2.1, 5.0.2.2, 5.0.2.3, 5.0.3.0, 5.0.3.1, 5.0.4.0, 5.0.4.1, 5.0.5.0, 5.0.5.1, 5.0.5.2, 5.0.5.3, 5.0.5.4
When is Profiling Used?
Assigning Domains/Algorithms to Columns in a Rule Set can be time-consuming and profiling is a way to automate this process. Profiling can look at both the Column name (in other words, First Name, Last Name, Address 1, Credit Card Number etc) and the Data (values) in the database.
Different types of profiles (or Profile Sets) can be defined and customized for specific requirements. These are then defined in a Profile Job which when executed assigns Domains/Algorithms to Columns based on the search criteria in the Profile Set.
How it Works Together
To use profiling, there are three different components that need to be defined:
- Profile Set with search Expressions.
- Rule Set to the columns/data to Profile.
- Profile Job
Profile Set
There are two different components in a Profile:
- Profile Set
- The Profile Set is a set of expressions that are used in the Profile Job. You can add and modify the Profile Sets.
- Expression
- These are Regular Expressions (RegEx) that are applied to either a Column name or the data in a Column.
Rule Set
A Rule Set contains the Connector to the data, the Tables in the Database to be masked, and the Columns in the tables.
Profile Job
The Profile Job is configured around the Rule Set (see above) and when executed applies the Profile Expressions to the database. The matched Domains/Algorithms are assigned to the Columns in the Rule Set.
Profiling Levels - Column and Data Level
There are two different levels of profiling:
- Column Level profiling
- Searches through the Column names in the Rule Set from the connected database. looking for specific patterns in Column names.
- Data Level profiling
- Searches through the top x rows of data itself in the Rule Set, looking for specific patterns in the data.
- By default, the 100 top rows are searched (this is configurable via support).
The fastest way to profile a Rule Set is to use Column Level profiling. There are special cases where Column Level profiling is not sufficient (in other words, the Column names are not specific enough) and Data Level profiling is the only way to profile and assign the correct Domain/Algorithm to a column.
Note that a Profile Set can contain both Column Level profiling and Data Level profiling. Column Level is executed first and will override Data Level profiling.
Default Profile Sets
New Profile Sets (out of the box) can be created and they can be edited to suit specific profiling requirements. When the Masking Engine is installed, the following Profile Sets and Profile Configurations are pre-installed:
- Financial
- Column Level profiling is only configured (for a list of Expressions - scroll down).
- HIPAA
- Column Level profiling is only configured (for a list of Expressions - scroll down).
How to Customize Expression Examples
Each database is different and each database has its own naming convention. It is necessary to update the Profile Sets provided out of the box and create custom Expressions and Profile Sets.
Below are some examples of expressions.
Column Level Examples
Simple Expression - First Name
(?>(fi?rst)_?(na?me?)|f_?name)(?!\w*ID)
- 1st Alternative:
(fi?rst)_?(na?me?)
- Any leading string, then
first
orfrst
, then- optional
_
, then name
,nam
,nme
ornm
- 2nd Alternative:
f_?name
f_name
orfname
- NegativeLookahead:
(?!\w*ID)
- Do not match
ID
- Do not match
## Examples - Group 1 matches FIRSTNAME First_NM First_NM_ID Spouse_First_Name ## Examples - Group 2 matches F_Name fname ## Examples - Group 2 No match fname_id
Complex Expression - Address
^(?:(?!postalcode|city|state|country|(l|ln|lin|line)?_?2{1}|ID).)*addre?s?s?_?(?:(?!city|state|country|(l|ln|lin|line)?_?2{1}|ID).)*$
This RegEx has multiple groups and quantifiers.
## Examples - matches ADDR Address address_line addrln_1 ## Example - No match POSTALCODE city address_id id_address address_city city_address
Data Level Examples
Simple Expression - PO Box
po box|p\.o\.
This RegEx has two groups and matches:
- Group 1:
po box
po
box
and any thing there after - in other words,po box 123
- Group 2:
p\.o\.
p.o.
and any thing there after - in other words,p.o. 1234
## Examples - matches PO BOX PO BOX 123 p.o. 123a ## Example - No match PO. 123 POBOX 123 PBOX 123
Complex Expression - Address
(.*[\s]+b(ou)?l(e)?v(ar)?d[\s]*.*)|(.*[\s]+st[.]?(reet)?[\s]*.*)|(.*[\s]+ave[.]?(nue)?[\s]*.*)|(.*[\s]+r(oa)?d[\s]*.*)|(.*[\s]+l(a)?n(e)?[\s]*.*)|(.*[\s]+cir(cle)?[\s]*.*)
This RegEx has multiple groups and quantifiers.
- Need to capture the most frequent addresses only.
- boulevard (and shorts)
- st | st. | street
- ave | ave. | avenue
- ...
## Examples - matches ANY STREET any st. any blvd any ave Any Avenue any lane any rd any cir ## Example - No match ANY STR ANY AV ANY PLACE
Expression with Exclusion - US Phone Number
This is a simple US Phone Number RegEx (this RegEx is not suitable for international numbers). This RegEx includes an exclusion for any number with 6 digits and 4 decimals.
((\(?\b[0-9]{3}\)?[-. ]?[0-9]{3}[-. ]?[0-9]{4}\b)(?<![0-9]{6}[.][0-9]{4}))
- Group 1:
(\(?\b....)
- Match a number with 3 digits, 3 digits and 4 digits (there can be optional space, - or . between each set).
- Negative Lookbehind:
(?<![...)
- Do not match 6 digits, dot [.] and 4 digits.
## Examples - matches 1231231234 123.123.1234 123 123-1234 123123 1234 ## Example - No match 12 3456 1234 123456.1234 1.1 any123 123 1234
Out of the Box Expressions and Profile Sets
Financial Profiling Expressions
The Profile Set "Financial" contains, Out of the Box, the following Expressions - all are Column Level expressions:
- Birth Date
- Address
- Address Line 2 - after
- Account Number
- Birth Date1
- Customer Number
- Credit Card Number
- Card Number
- Drivers License Number
- Drivers License Number1
- First Name
- Last Name
- Middle Name
- Security Code
- Social Security Number
- Tax ID Code or Number
- Telephone or Contact Number
- Zip or Postal Code
- School Name
- Tax ID Number
- Address Line2 - before
- Birth Date2
- PO Box
- Fax Number
- Street Address
- IP Address
- Web or URL Address
HIPAA Profiling Expressions
The Profile Set "HIPAA" contains, Out of the Box, the following Expressions - all are Column Level expressions:
- Birth Date
- Address
- Address Line 2 - after
- Account Number
- Birth Date1
- Customer Number
- Credit Card Number
- Card Number
- Drivers License Number
- Drivers License Number1
- First Name
- Last Name
- Middle Name
- Security Code
- Social Security Number
- Tax ID Code or Number
- Telephone or Contact Number
- Zip or Postal Code
- School Name
- Tax ID Number
- Address Line2 - before
- Birth Date2
- Beneficiary Number
- Certificate Number
- City
- County
- License Plate
- PO Box
- Precinct
- Record Number
- Serial Number
- Signature
- VIN
- Vehicle
- Fax Number
- Admission Date
- Treatment Date
- Discharge Date
- Street Address
- IP Address
- Web or URL Address
- Biometric
- Certificate ID
- Beneficiary ID