Data Level Profiling - Parameters and Calculations (KBA8131)
KBA
KBA# 8131At a Glance
Description: | This KBA details how Data Level Profiling is calculated and working. |
---|---|
Percentage (Confidence): | Data Level Percentage (Confidence) is calculated using (for exact formula see below):
This means that the table profiled needs to have more than DataLevelRows rows in the table. |
Setting: [DataLevelRows] |
This parameter defines the TOP x number of rows selected for profiling. It is also used in the Percentage calculation. |
Feature Flag: [EnableDataLevelCount] |
Feature flag. If enabled it will count the number of records in the table at the end of the Profile Job. This feature has bugs and is incorrectly documented. Recommendation: Do not use. |
Improvements: | There is an ongoing project to update and improve Profiling. The feature will also get a new name. Improvements will be incremental and these are slated for the next upcoming versions. ETA unknown. |
Applicable Delphix Versions
- Click here to view the versions of the Delphix engine to which this article applies
-
Major Release All Sub Releases 6.0 6.0.0.0, 6.0.1.0, 6.0.1.1, 6.0.2.0, 6.0.2.1, 6.0.3.0, 6.0.3.1, 6.0.4.0, 6.0.4.1, 6.0.4.2, 6.0.5.0, 6.0.6.0, 6.0.6.1, 6.0.7.0, 6.0.8.0, 6.0.8.1, 6.0.9.0, 6.0.10.0 5.3
5.3.0.0, 5.3.0.1, 5.3.0.2, 5.3.0.3, 5.3.1.0, 5.3.1.1, 5.3.1.2, 5.3.2.0, 5.3.3.0, 5.3.3.1, 5.3.4.0, 5.3.5.0, 5.3.6.0, 5.3.7.0, 5.3.7.1, 5.3.8.0, 5.3.8.1, 5.3.9.0 5.2
5.2.2.0, 5.2.2.1, 5.2.3.0, 5.2.4.0, 5.2.5.0, 5.2.5.1, 5.2.6.0, 5.2.6.1
5.1
5.1.0.0, 5.1.1.0, 5.1.2.0, 5.1.3.0, 5.1.4.0, 5.1.5.0, 5.1.5.1, 5.1.6.0, 5.1.7.0, 5.1.8.0, 5.1.8.1, 5.1.9.0, 5.1.10.0
5.0
5.0.1.0, 5.0.1.1, 5.0.2.0, 5.0.2.1, 5.0.2.2, 5.0.2.3, 5.0.3.0, 5.0.3.1, 5.0.4.0, 5.0.4.1, 5.0.5.0, 5.0.5.1, 5.0.5.2, 5.0.5.3, 5.0.5.4
Key Parameters
There are five key parameters to consider. One is set in the Job Configuration and four are Application Settings.
Job Configuration
'Multiple Profiler Expression Check'
Job Configuration: 'Multiple Profiler Expression Check'
- Default: Disabled.
- This sets how the profiler will act when there is a match of Multiple Profile Expressions.
- All Data Level Profile Expressions in the Profile Set are used to calculate the total rows with PII.
- If more than one is identified, use DefaultMultiphiAlgorithm.
Application Settings
DefaultMultiphiAlgorithm
Application Settings: DefaultMultiphiAlgorithm
- Default: NULL SL.
- This is the algorithm used when 'Multiple Profiler Expression Check' is selected (see above).
DataLevelPercentage
Application Settings: DataLevelPercentage
- Default: 80%
- This is the threshold for the percentage of the total number of PII rows detected.
- See below for calculation.
DataLevelRows
Application Settings: DataLevelRows
- Default: 100
- This sets the number of rows selected from the profiled table. This ResultSet is used to calculate the number of PII rows.
- This number is also used as the Sample Size unless EnableDataLevelCount is enabled.
EnableDataLevelCount
Application Settings: EnableDataLevelCount
- Default: False
- If enabled, the number of rows in the table is calculated using a SELECT COUNT(1) statement for each table profiled.
Calculation and Allocation
The calculation is simple.
Process
- Create and execute a Data Level Profile Job that Reads the TOP DataLevelRows from each table.
- The Profile Job will Count the number of PII rows matching each individual Profile Expressions in the ResultSet (from the query above - note the TOP).
- After the Profile Job execution:
- If the EnableDataLevelCount is set, Count the number of rows in the table.
- If this fails, all domains and algorithms will be set to null (as expected).
- Percentage Calculation:
- if (countedRows < DataLevelRows)
- then Percentage = PIIrows / countedRows
- else Percentage = PIIrows / DataLevelRows
- if (countedRows < DataLevelRows)
- If Percentage >= DataLevelPercentage (this is the total - Percentage of all PII rows in the sample)
- For each Expression having a PII count above, set the Domain in the Profile Result (this is what you see in the UI).
- If one match only, set Domain and Algorithm to this one.
- If multiple matches and Multiple Profiler Expression Check is:
- Disabled: Set the algorithm to the last one in the list (exact details here unknown).
- Enabled: Set the algorithm to DefaultMultiphiAlgorithm.
Known issues
The issue happens when the profiled table has fewer rows than DataLevelRows and the SELECT Count is counted on the incorrect schema. Amended to this is that the Percentage is counted on the DataLevelRows (which can cause missing PII in small tables).
Solutions
There are some solutions:
- Use a user that has the default schema set to the profiled tables.
- If not possible, do not use EnableDataLevelCount and try to group tables in Profile Jobs based on number of rows in the tables.
- This will minimize the effects and will calculate a close as possible Percentage.
Related Articles
The following articles may provide more information or related information to this article:
- Masking Doc: Configuring Profiling Settings
- Masking Doc: Reporting Profiling Results
- KBA: Column and Data Level Profiling (KBA1008)
- KBA: Application Settings API (KBA4539)