Delphix

Profile Data in Database Column for Oracle (KBA4448)

 

 

KBA

KBA# 4448

 

At a Glance

Versions: Applicable Delphix Masking versions: 4.x, 5.x, 6.x
Description: This page shows queries to use on Oracle to profile data. This is especially handy to discover non-conforming data. 
Queries applicable to: Oracle
Queries listed: The following queries are covered below:
  • How many rows in the table?
  • How many rows with a specific length?
  • How many duplicates (is the data unique)?
  • How many rows with a specific profile?
  • How many rows with special characters in the cell?
  • What rows have Special Characters (how to find them)?
Non-Conforming Classification: This is useful to identify Non-Conforming characters on the Masking Engine.

For details about the Unicode character categories, see:
https://www.compart.com/en/unicode/category/

Profiling of Data in Masked Column

These queries (and functions) profile the data in columns that are to be masked, to help determine which algorithm is best suited and how to configure it. The profile codes follow the Unicode character categories so that they match the patterns reported for non-conforming data on the masking engine's Monitor page.

The following codes have been used: 

  • N - Numbers (0-9)
  • L - Letters (A-Z, a-z)
  • Z - Space

All other characters are shown unchanged.

Queries for Oracle 

How many rows in the table?

Knowing the total row count helps put the statistical distribution of the profile results below into perspective.

-- Replace [table]
--  
SELECT COUNT(*) cnt_rows FROM [table];
Example
 CNT_ROWS 
----------
  4534824 

How many duplicates (is the data unique)?

Use this query when the uniqueness of the values in a column is important. Run it on the column before and after masking. Pay particular attention to columns that are referential keys (i.e. PK and FK) or have unique constraints. Note that keys might be composite, in which case multiple columns need to be investigated together. Values might also need to be formatted using a cast.

This query shows whether a column contains duplicated values. For a unique column, the expected result is a single row with 'dups' = 1.

Note:

No values from the columns are shown. This query only shows statistics (number of duplicates).

 

-- Replace [table]
-- Replace [column]
--
-- dups should be 1 for no duplicates.
--
SELECT count(*) as cnts, s1.dups
FROM (
   SELECT [column], count(*) dups
   FROM [table]
   GROUP BY [column]) s1
GROUP BY dups;

For composite keys, concatenate the columns (using | as a separator, as it rarely occurs in data):

column1 || '|' || column2
Example

DUPS = 1 means unique entries. There are 4,534,822 unique entries in the example below and there is 1 duplicated value.

     CNTS      DUPS
---------- ---------
  4534822         1
        1         2     
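
For a composite key, the concatenated expression above can be substituted for [column] in the duplicate query. The following is a minimal sketch, assuming a hypothetical two-column key ([column1] and [column2] are placeholder names) whose values never contain the | separator:

```sql
-- Replace [table]
-- Replace [column1], [column2] (hypothetical composite key columns)
--
-- dups should be 1 for no duplicates.
--
SELECT COUNT(*) AS cnts, s1.dups
FROM (
   SELECT [column1] || '|' || [column2] AS key_val, COUNT(*) AS dups
   FROM [table]
   GROUP BY [column1] || '|' || [column2]) s1
GROUP BY s1.dups
ORDER BY s1.dups;
```

Grouping by the columns directly (GROUP BY [column1], [column2]) is an alternative that does not rely on a separator.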

How many rows with a specific length (characters and bytes)?

This query shows the number of rows for each value length. There are two queries: one for characters and one for bytes. Only the first 1,000 rows are counted. For memory-related issues, use bytes (LENGTHB). For masking results, use characters (LENGTH).

Note:

The function might need to be replaced for some specific datatypes.

 

-- Replace [table]
-- Replace [mask_col]
-- 
-- Length in characters
SELECT LENGTH([mask_col]) as STR_LEN, COUNT(*) as cnt 
FROM [table] 
WHERE ROWNUM < 1001
GROUP BY LENGTH([mask_col]) 
ORDER BY cnt DESC;

-- Length in bytes
SELECT LENGTHB([mask_col]) as bytes, COUNT(*) as cnt
FROM [table]
WHERE ROWNUM < 1001
GROUP BY LENGTHB([mask_col])
ORDER BY cnt DESC;
Example

Some algorithms are designed for values of a specific length. This example shows a 'Home Phone Number' column, which clearly contains some data other than phone numbers.

 STR_LEN       CNT 
--------- ---------
 10        727,204 
 11        399,465 
 12        104,710 
 9          81,835 
...
 24              2 
 25              1 
 26              1 
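
To inspect the outliers at the bottom of such a distribution, the rows themselves can be listed by length. This is a sketch only: the threshold of 20 characters and the limit of 100 rows are assumptions chosen for illustration. Unlike the statistics above, this query displays the actual column values.

```sql
-- Replace [table]
-- Replace [mask_col]
-- 20 and 101 are illustrative values; adjust as needed.
--
SELECT ROWID, LENGTH([mask_col]) AS str_len, [mask_col]
FROM [table]
WHERE LENGTH([mask_col]) > 20
AND ROWNUM < 101;
```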

How many rows with a specific profile?

This query shows the number of rows for each character profile (the pattern of numbers, letters, spaces, and other characters in the value). The query is limited to the first 1,000 rows.

-- Create Function Below
-- Replace [table]
-- Replace [mask_col]
-- 
SELECT profileASCII([mask_col]) as profile, Count(*) as cnt 
FROM [table]
WHERE ROWNUM < 1001
GROUP BY profileASCII([mask_col]) 
ORDER BY cnt DESC;
Function

To use the query above, please create this function first. It will profile data based on: 

  • N - Numbers (0-9)
  • L - Letters (A-Z, a-z)
  • Z - Space

All other characters are shown unchanged.

-- ORACLE
CREATE OR REPLACE FUNCTION profileASCII (inpStr in VARCHAR2)
RETURN VARCHAR2 IS
   tmpChr VARCHAR2(10);
   tmpStr VARCHAR2(10);
   resStr VARCHAR2(1000);
   increment INT := 1;
BEGIN
   WHILE increment <= LENGTH(inpStr)
   LOOP
     tmpChr := SUBSTR(inpStr, increment, 1);
     IF REGEXP_LIKE(tmpChr,'[0-9]') THEN
        tmpStr := 'N';
     ELSIF REGEXP_LIKE(tmpChr, '[a-zA-Z]') THEN
        tmpStr := 'L';
     ELSIF REGEXP_LIKE(tmpChr, '[ ]') THEN
        tmpStr := 'Z';
     ELSE
        tmpStr := tmpChr;
     END IF;
     resStr := resStr || tmpStr;
     increment := increment + 1;

   END LOOP;
   RETURN resStr;
END;
/
Example

The example below is from profiling a 'Salutation' column (for example 'Mr'). It shows the length of the values and that some records end with a full stop while others do not.

 profile       cnt 
--------- ---------
 LL        376,849 
 LL.       206,021 
 LLL.       34,005 
 LLLL       31,291 
 LLL           127 
 LLLL.           3 
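
Before running the function against a table, it can be checked on a literal value. This is a minimal sketch, assuming the profileASCII function above has been created; the input string is an illustrative value, not real data.

```sql
-- Requires the profileASCII function defined above.
-- 'Mr. Smith 42' is an illustrative literal.
SELECT profileASCII('Mr. Smith 42') AS profile FROM dual;
-- Expected profile: LL.ZLLLLLZNN
```

Each letter maps to L, each digit to N, each space to Z, and the full stop is passed through unchanged.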

How many rows with special characters in the cell?

This query extracts the special characters in the data, which may need to be ignored in masking algorithms. It shows the number of rows for each combination of special characters. The query is limited to the first 1,000 rows.

-- Create Function Below
-- Replace [table]
-- Replace [mask_col]
-- 
SELECT profileSpecialASCII([mask_col]) as special, Count(*) as cnt 
FROM [table]
WHERE ROWNUM < 1001
GROUP BY profileSpecialASCII([mask_col]) 
ORDER BY cnt DESC;
Function 

To use the query above, create this function first. 

-- ORACLE
CREATE OR REPLACE FUNCTION profileSpecialASCII (inpStr in VARCHAR2)
RETURN VARCHAR2 IS
   tmpChr VARCHAR2(10);
   tmpStr VARCHAR2(10);
   resStr VARCHAR2(1000);
   increment INT := 1;
BEGIN
   WHILE increment <= LENGTH(inpStr)
   LOOP
     tmpChr := SUBSTR(inpStr, increment, 1);
     IF REGEXP_LIKE(tmpChr,'[0-9a-zA-Z ]') THEN
        tmpStr := '';
     ELSE
        tmpStr := tmpChr;
     END IF;
     resStr := resStr || tmpStr;
     increment := increment + 1;

   END LOOP;
   RETURN resStr;
END;
/
Example

Finding special characters can be very useful and reveals a lot about the data. In some cases, they need to be handled explicitly. The sample below is taken from a 'Fullname' column.

 special       cnt 
--------- ---------
 _         205,585 
 _&         98,783 
 _-         72,624 
 _++        24,233 
 _'         17,766 
 _()        13,579 
 _/         12,158 
...
 _--         3,677 
 _,          1,365 
 _()*        1,245 
 _(/)        1,204 
 _---        1,146 
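
Where creating a function in the database is not possible (for example, with a read-only account), a similar result can be approximated with Oracle's built-in REGEXP_REPLACE. This sketch is an alternative, not the function used above; it assumes the intent is to strip digits, ASCII letters, and spaces and keep everything else.

```sql
-- Replace [table]
-- Replace [mask_col]
--
-- With no replacement string, REGEXP_REPLACE removes every match.
SELECT REGEXP_REPLACE([mask_col], '[0-9a-zA-Z ]') AS special, COUNT(*) AS cnt
FROM [table]
WHERE ROWNUM < 1001
GROUP BY REGEXP_REPLACE([mask_col], '[0-9a-zA-Z ]')
ORDER BY cnt DESC;
```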

What rows have Special Characters (how to find them)?

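This query shows the first rows in the table that contain special characters. The number of rows selected might need to be changed. Note that the pattern used here also treats spaces as special characters, and that the query displays actual column values.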

-- Replace [table]
-- Replace [mask_col]
-- Modify 1001 if more sample data is needed.

SELECT ROWID, [mask_col] FROM [table]
WHERE ROWNUM < 1001
AND regexp_like([mask_col],'[^a-zA-Z0-9]');
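
The same approach can locate rows that match one specific profile, for example a pattern reported as non-conforming. This is a minimal sketch, assuming the profileASCII function from earlier in this article has been created; the pattern 'LLLL.' is only an illustrative value.

```sql
-- Replace [table]
-- Replace [mask_col]
-- Replace 'LLLL.' with the profile of interest.

SELECT ROWID, [mask_col] FROM [table]
WHERE profileASCII([mask_col]) = 'LLLL.'
AND ROWNUM < 1001;
```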
