Skip to main content
Delphix

Profiling Data in Database Column for MySQL (KBA5238)

 

KBA

KBA# 5238

Applicable Delphix Versions

Click here to view the versions of the Delphix engine to which this article applies
Major Release All Sub Releases
6.0 6.0.0.0, 6.0.1.0, 6.0.1.1, 6.0.2.0, 6.0.2.1, 6.0.3.0, 6.0.3.1, 6.0.4.0, 6.0.4.1, 6.0.4.2, 6.0.5.0, 6.0.6.0, 6.0.6.1, 6.0.7.0

5.3

5.3.0.0, 5.3.0.1, 5.3.0.2, 5.3.0.3, 5.3.1.0, 5.3.1.1, 5.3.1.2, 5.3.2.0, 5.3.3.0, 5.3.3.1, 5.3.4.0, 5.3.5.0, 5.3.6.0, 5.3.7.0, 5.3.7.1, 5.3.8.0, 5.3.8.1, 5.3.9.0

5.2

5.2.2.0, 5.2.2.1, 5.2.3.0, 5.2.4.0, 5.2.5.0, 5.2.5.1, 5.2.6.0, 5.2.6.1

5.1

5.1.0.0, 5.1.1.0, 5.1.2.0, 5.1.3.0, 5.1.4.0, 5.1.5.0, 5.1.5.1, 5.1.6.0, 5.1.7.0, 5.1.8.0, 5.1.8.1, 5.1.9.0, 5.1.10.0

5.0

5.0.1.0, 5.0.1.1, 5.0.2.0, 5.0.2.1, 5.0.2.2, 5.0.2.3, 5.0.3.0, 5.0.3.1, 5.0.4.0, 5.0.4.1, 5.0.5.0, 5.0.5.1, 5.0.5.2, 5.0.5.3, 5.0.5.4

At a Glance 

Description: This page shows queries to use on MySQL to profile data. This is especially useful to discover non-conforming data. 
Queries applicable to: MySQL
Queries listed:  The following queries are listed below: 
  • How many rows in the table?
  • How many rows with a specific length?
  • How many duplicates (is the data unique)?
  • How many rows with a specific profile?
  • How many rows with special characters in the cell?
Non-Conforming Classification: This is useful to identify Non-Conforming characters on the Masking Engine.

For details about the Unicode characters please look here: 
https://www.compart.com/en/unicode/category/

Profiling of Data in Masked Column 

These queries (and functions) are used to profile data in columns to be masked. Doing so will help you understand what algorithm is best suited and how to configure the algorithm. The Unicode character profile has been used to match the patterns reported in the masking engine Monitor Page for nonconforming data. 

The following codes have been used: 

  • N - Numbers (0-9)
  • L - Letters (A-Z)
  • Z - Space

All other characters are shown.

Queries for MySQL  

How many rows in the table? 

This is important to know to understand the statistical distribution of the profile result below. 

-- Replace [table]
--  
SELECT COUNT(*) cnt_rows FROM [table];
Example 
 cnt_rows
  4534824 

How many rows with a specific length (characters and bytes)? 

This query shows the number of rows with a specific length. There are two queries - the first for characters and the second for bytes. The query has been limited to only count the first 1,000 rows. 

For memory-related issues use bytes. For masking results, use characters. 

Note

Note:

The LENGTH function might need to be replaced for specific datatypes. 

 

-- Replace [table]
-- Replace [mask_col]
-- Only displaying top 1000 rows. Change if more is needed. 
--
SELECT foo.str_len, COUNT(*) cnt 
FROM (SELECT CHAR_LENGTH([mask_col]) str_len FROM [table] LIMIT 1000) AS foo
GROUP BY 1
ORDER BY 2 DESC;
SELECT foo.bytes, COUNT(*) cnt
FROM (SELECT LENGTH([mask_col]) bytes FROM [table] LIMIT 1000) AS foo
GROUP BY 1
ORDER BY 2 DESC;
Example 

Some algorithms are specifically created to fit a specific length. This example shows 'Home Phone Number' and there is clearly some additional data other than phone numbers here. 

 str_len       cnt 
 10            204 
 11             65 
 12             10 
...
 24              2 
 25              1 
 26              1 

How many duplicates (is the data unique)?

Use this query when the uniqueness of the values in a column is important. This query should be executed on the column before and after masking. Especially investigate columns that are referential keys (i.e., PK and FK) or have unique constraints. Note that these might be composite and multiple columns may need to be investigated. Also, values might need to be formatted using a cast. 

This query shows if there are duplicated values in a column. The expected result is one row with 'dups' = 1.

Note

Note:

No values from the columns are shown. This query only shows statistics (number of duplicates). 

 

-- Replace [table]
-- Replace [column]
--
-- dups should be 1 for no duplicates.
--
SELECT count(*) as cnts, s1.dups
FROM (
   SELECT [column], count(*) dups
   FROM [table]
   GROUP BY [column]) s1
GROUP BY s1.dups;

Concatenate columns (using | as a separator as it is not frequently used):

CONCAT(column1, "|", column2)
Example  

'dups' = 1 means unique entries. There are 4,534,822 unique entries in the example below and there is 1 duplicated value.

      cnts    dups
   4534822       1
         1       2

How many rows with a specific profile? 

This query shows the number of rows with a specific length (number of characters). The query has been limited to only count the 1,000 first rows. 

-- Create Function Below
-- Replace [table]
-- Replace [mask_col]
-- 
SELECT profileASCII([mask_col]) as profile, Count(*) as cnt
FROM [table]
GROUP BY profileASCII([mask_col])
ORDER BY cnt DESC
LIMIT 50;
Function 

To use the query above, please create this function first. It will profile data based on: 

  • N - Numbers (0-9)
  • L - Letters (A-Z)
  • Z - Space
  • All the other characters are shown.
-- MySQL
CREATE FUNCTION profileASCII(inStr VARCHAR(1000))
RETURNS VARCHAR(1000)
DETERMINISTIC
  BEGIN
    DECLARE tmpChr varchar(10);
    DECLARE tmpStr VARCHAR(1000);
    DECLARE incr INT;
    SET incr = 1;
    SET tmpStr = '';

    WHILE incr <= LENGTH( inStr ) DO
       SET tmpChr = LOWER(SUBSTRING( inStr, incr, 1));
       IF FIND_IN_SET( tmpChr, '0,1,2,3,4,5,6,7,8,9') > 0 THEN 
           tmpChr = 'N'
       ELSEIF FIND_IN_SET( tmpChr, 'a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z') > THEN
           tmpChr = 'L'
       ELSEIF tmpChr = ' ' THEN
           tmpChr = 'Z'
       tmpStr = CONCAT( tmpStr, tmpChr)
       SET incr = incr + 1;
    END WHILE;
    RETURN inStr;
  END;
Example

The example below shows an example from profiling 'Salutation' (for example 'Mr'). The example shows the length and that the data has some records with full stop and some without. 

 profile       cnt 
--------- ---------
 LL        376,849 
 LL.       206,021 
 LLL.       34,005 
 LLLL       31,291 
 LLL           127 
 LLLL.           3 

What rows have Special Characters (how to find them)? 

This query can be used to show the first records in the database with Special Characters. The number of rows selected might need to be changed.

-- Replace [table]
-- Replace [mask_col]
-- Limited to 50 rows.

SELECT [mask_col] FROM [table]
WHERE [mask_col] REGEXP '[^a-zA-Z0-9]'
LIMIT 50;

Related Articles