Skip to main content
Delphix

Data Level Profiling - Parameters and Calculations (KBA8131)

 

 

KBA

KBA# 8131

At a Glance

Description: 

This KBA details how Data Level Profiling is calculated and working.

  • The new improved ASDD version is now available.
Percentage (Confidence): Data Level Percentage (Confidence) is calculated using (for exact formula see below): 
% = 100 * [Count rows with PII] / DataLevelRows


This means that the table profiled needs to have more than DataLevelRows rows in the table.
Setting:
[DataLevelRows]
This parameter defines the TOP x number of rows selected for profiling.

It is also used in the Percentage calculation. 
Feature Flag:
[EnableDataLevelCount]
Feature flag. If enabled it will count the number of records in the table at the end of the Profile Job.

This feature has issues and is incorrectly documented. Recommendation: Do not use. 
Improvements (ASDD): A new version of Profiling, called ASDD, is now available. This is a significant improvement to Profiling and it is recommended to upgrade and use ASDD.
  • Use version 14.0 or higher.

 

Note

Note:

The KBA will use PII to indicate any type of Sensitive Information to be identified by the Profiler. It can be personal but does not need to be. 

Applicable Delphix Versions

Click here to view the versions of the Delphix engine to which this article applies
Major Release All Sub Releases
6.0 6.0.0.0, 6.0.1.0, 6.0.1.1, 6.0.2.0, 6.0.2.1, 6.0.3.0, 6.0.3.1, 6.0.4.0, 6.0.4.1, 6.0.4.2, 6.0.5.0, 6.0.6.0, 6.0.6.1, 6.0.7.0, 6.0.8.0, 6.0.8.1, 6.0.9.0, 6.0.10.0

5.3

5.3.0.0, 5.3.0.1, 5.3.0.2, 5.3.0.3, 5.3.1.0, 5.3.1.1, 5.3.1.2, 5.3.2.0, 5.3.3.0, 5.3.3.1, 5.3.4.0, 5.3.5.0, 5.3.6.0, 5.3.7.0, 5.3.7.1, 5.3.8.0, 5.3.8.1, 5.3.9.0

5.2

5.2.2.0, 5.2.2.1, 5.2.3.0, 5.2.4.0, 5.2.5.0, 5.2.5.1, 5.2.6.0, 5.2.6.1

5.1

5.1.0.0, 5.1.1.0, 5.1.2.0, 5.1.3.0, 5.1.4.0, 5.1.5.0, 5.1.5.1, 5.1.6.0, 5.1.7.0, 5.1.8.0, 5.1.8.1, 5.1.9.0, 5.1.10.0

5.0

5.0.1.0, 5.0.1.1, 5.0.2.0, 5.0.2.1, 5.0.2.2, 5.0.2.3, 5.0.3.0, 5.0.3.1, 5.0.4.0, 5.0.4.1, 5.0.5.0, 5.0.5.1, 5.0.5.2, 5.0.5.3, 5.0.5.4

Key Parameters

There are five key parameters to consider. One is set in the Job Configuration and four are Application Settings.

Job Configuration

'Multiple Profiler Expression Check'

Job Configuration: 'Multiple Profiler Expression Check'

  • Default: Disabled.
  • This sets how the profiler will act when there is a match of Multiple Profile Expressions.
  • All Data Level Profile Expressions in the Profile Set are used to calculate the total rows with PII.
  • If more than one is identified, use DefaultMultiphiAlgorithm

Application Settings

DefaultMultiphiAlgorithm

Application Settings: DefaultMultiphiAlgorithm

  • Default: NULL SL.
  • This is the algorithm used when 'Multiple Profiler Expression Check' is selected (see above).
Note

Note:

The Domain remains the one selected if the flag was not set.

 

DataLevelPercentage

Application Settings: DataLevelPercentage

  • Default: 80% 
  • This is the threshold for the percentage of the total number of PII rows detected.
  • See below for calculation.

DataLevelRows

Application Settings: DataLevelRows

  • Default: 100
  • This sets the number of rows selected from the profiled table. This ResultSet is used to calculate the number of PII rows. 
  • This number is also used as the Sample Size unless EnableDataLevelCount is enabled. 

EnableDataLevelCount

Application Settings: EnableDataLevelCount

  • Default: False
  • If enabled, the number of rows in the table is calculated using a SELECT COUNT(1) statement for each table profiled.

 

Note

Note:

This does not profile all rows in the table.

 

Warning

Warning:

On MS SQL Server, this SELECT COUNT is missing the SCHEMA and is therefore using the default schema for the user (usually dbo). 

If the schema.table does not exist in the default schema for the user the Profile Job will fail.

 

Calculation and Allocation

The calculation is simple.

Process

  1. Create and execute a Data Level Profile Job that Reads the TOP DataLevelRows from each table. 
    • The Profile Job will Count the number of PII rows matching each individual Profile Expressions in the ResultSet (from the query above - note the TOP).

     

  2. After the Profile Job execution:
    • If the EnableDataLevelCount is set, Count the number of rows in the table.
    • If this fails, all domains and algorithms will be set to null (as expected).

     

  3. Percentage Calculation: 
    • if (countedRows < DataLevelRows)
      • then Percentage = PIIrows / countedRows
      • else Percentage = PIIrows / DataLevelRows

       

  4. If Percentage >= DataLevelPercentage (this is the total - Percentage of all PII rows in the sample)
    • For each Expression having a PII count above, set the Domain in the Profile Result (this is what you see in the UI).
    • If one match only, set Domain and Algorithm to this one. 
    • If multiple matches and Multiple Profiler Expression Check is:
      • Disabled: Set the algorithm to the last one in the list (exact details here unknown).
      • Enabled: Set the algorithm to DefaultMultiphiAlgorithm.


Known issues

The issue happens when the profiled table has fewer rows than DataLevelRows and the SELECT Count is counted on the incorrect schema. Amended to this is that the Percentage is counted on the DataLevelRows (which can cause missing PII in small tables). 

Solutions

There are some solutions:

  • Use a user that has the default schema set to the profiled tables.
  • If not possible, do not use EnableDataLevelCount and try to group tables in Profile Jobs based on number of rows in the tables.
    • This will minimize the effects and will calculate a close as possible Percentage.

 

 


Related Articles

The following articles may provide more information or related information to this article: