Profiling Data in Database Column for DB2 (KBA6995)

Last updated
Save as PDF
Share
1. Share
2. Tweet
3. Share

KBA

KBA# 6995

Applicable Delphix Versions

Click here to view the versions of the Delphix engine to which this article applies

Major Release	All Sub Releases
6.0	6.0.0.0, 6.0.1.0, 6.0.1.1, 6.0.2.0, 6.0.2.1, 6.0.3.0, 6.0.3.1, 6.0.4.0, 6.0.4.1, 6.0.4.2, 6.0.5.0, 6.0.6.0, 6.0.6.1, 6.0.7.0, 6.0.8.0, 6.0.8.1
5.3	5.3.0.0, 5.3.0.1, 5.3.0.2, 5.3.0.3, 5.3.1.0, 5.3.1.1, 5.3.1.2, 5.3.2.0, 5.3.3.0, 5.3.3.1, 5.3.4.0, 5.3.5.0, 5.3.6.0, 5.3.7.0, 5.3.7.1, 5.3.8.0, 5.3.8.1, 5.3.9.0
5.2	5.2.2.0, 5.2.2.1, 5.2.3.0, 5.2.4.0, 5.2.5.0, 5.2.5.1, 5.2.6.0, 5.2.6.1
5.1	5.1.0.0, 5.1.1.0, 5.1.2.0, 5.1.3.0, 5.1.4.0, 5.1.5.0, 5.1.5.1, 5.1.6.0, 5.1.7.0, 5.1.8.0, 5.1.8.1, 5.1.9.0, 5.1.10.0
5.0	5.0.1.0, 5.0.1.1, 5.0.2.0, 5.0.2.1, 5.0.2.2, 5.0.2.3, 5.0.3.0, 5.0.3.1, 5.0.4.0, 5.0.4.1, 5.0.5.0, 5.0.5.1, 5.0.5.2, 5.0.5.3, 5.0.5.4

At a Glance

Description:	This article provides queries to use on IBM DB2 to profile data. This is especially handy to discover non-conforming data or other data-related issues.
Queries applicable to:	IBM DB2
Queries listed:	The following queries are listed below: How many rows in the table? How many duplicates (is the data unique)? How many rows with a specific length? How many rows with a specific profile? How many rows with special characters in the cell? What rows have Special Characters (how to find them)?
Non-Conforming Classification:	This is useful to identify Non-Conforming characters on the Masking Engine. For details about the Unicode characters please look here: https://www.compart.com/en/unicode/category/

Profiling of Data in Masked Column

These queries (and functions) are used to profile data in columns to be masked to understand what algorithm is best suited and how to configure the algorithm. Unicode character profile has been used to match the patterns reported in the masking engine Monitor Page for nonconforming data.

The following codes have been used:

N - Numbers (0-9)
L - Letters (A-Z)
Z - Space

All the other characters are shown.

Queries for DB2

How many rows in the table?

This is important to know to understand the statistical distribution of the profile result below.

-- Replace [table]
--  
SELECT COUNT(*) cnt_rows FROM [table];

Example

 CNT_ROWS
----------
  4534824

How many duplicates (is the data unique)?

Use this query when the uniqueness of the values in a column is important. This query should be executed on the column before and after masking. Especially investigate columns that are referential keys (i.e. PK and FK) or have unique constraints. These might be composite and require multiple columns to be investigated. Also, values might need to be formatted using a cast.

This query shows if there are duplicated values in a column. The expected result is one row with 'dups' = 1.

-- Replace [table]
-- Replace [column]
--
-- dups should be 1 for no duplicates.
--
SELECT s1.dups, count(*) as cnts
FROM (
   SELECT [column], count(*) as dups
   FROM [table]
   GROUP BY [column]) s1
GROUP BY dups;

Concatenate columns (using | as a separator as it is not frequently used):

column1 || '|' || column2

Example

DUPS = 1 means unique entries. There are 4,534,822 unique entries in the example below and there is 1 duplicated value.

      DUPS      CNTS
---------- ----------
         1   4534822
         2         1

How many rows with a specific length (characters and bytes)?

This query shows the number of rows with a specific length. There are two queries - one for characters and one for bytes. The first 1,000 rows are counted. For memory-related issues use bytes (LENGTHB). For masking results, use characters (LENGTH).

-- Replace [table]
-- Replace [mask_col]
-- 
SELECT s1.len_char, s1.len_bytes, COUNT(*) as cnt 
FROM 
(SELECT LENGTH([mask_col]) as len_char, LENGTHB([mask_col]) as len_bytes 
 FROM [table] FETCH FIRST 1000 ROWS ONLY) s1
GROUP BY s1.len_char, s1.len_bytes 
ORDER BY cnt DESC;

Example

Some algorithms are specifically created to fit a specific length. This example shows 'Home Phone Number' and there are clearly some additional data other than phone numbers in here.

 len_char   len_bytes       CNT
---------- ----------- ---------
       10          20       604
       11          22        65
       12          24        10
...
       24          48         2
       26          52         1

How many rows with a specific profile?

This query shows the number of rows with a specific length (number of characters). The query has been limited to only count the first 1,000 rows.

This will profile data based on:

N - Numbers (0-9)
L - Letters (A-Z)
Z - Space

-- Create Function Below
-- Replace [table]
-- Replace [mask_col]
-- 
SELECT s1.profile, count(*) cnt from 
(SELECT 
 TRANSLATE(TRANSLATE(TRANSLATE([mask_col],
  'LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLL',
   'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'),
  'NNNNNNNNNN',
   '1234567890'),
  'Z',
   ' ') as profile
 FROM [table] FETCH FIRST 1000 ROWS ONLY) s1
GROUP BY profile
ORDER BY cnt DESC;

Example

The example below shows an example from profiling 'Salutation' (for example 'Mr'). The example shows the length and that the data has some records with full stop and some without.

 profile       cnt
--------- ---------
 LL            376
 LL.           206
 LLL.           34
 LLLL           31
 ...

How many rows with special characters in the cell?

This query has specifically been created to extract special characters in data that need to be ignored in masking algorithms. The query shows the number of rows with special characters. The query has been limited to only count the first 1,000 rows.

-- Create Function Below
-- Replace [table]
-- Replace [mask_col]
-- 
SELECT s1.profile, count(*) cnt from
(SELECT TRANSLATE([mask_col], '', 
 ' 1234567890ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz') as profile
 FROM [table] FETCH FIRST 1000 ROWS ONLY) s1
GROUP BY profile
ORDER BY cnt DESC;

Example

Finding special characters in the data could be very useful and tells a lot about the data - in some cases, it needs to be taken care of. The sample below is taken from a 'Fullname' column.

 special       cnt
--------- ---------
 _             205
 _&             98
 _-             72
 _++            24
...

What rows have Special Characters (how to find them)?

This query can be used to show the first records in the database with Special Characters. The number of rows selected might need to be changed.

-- Replace [table]
-- Replace [mask_col]
-- Include [key] to include an identifier so the record can be found.
-- Modify 1000 if more sample data is needed.

SELECT * FROM 
(SELECT [key], [mask_col] 
 FROM [table]
 FETCH FIRST 1000 ROWS ONLY)
WHERE regexp_like([mask_col],'[^a-zA-Z0-9]');

Example

Example of the output (the [key] here is 'EMPNO'):

EMPNO  FIRSTNME    
------ ------------
000200 DAVID_123     
1 record(s) selected.

For more Queries and additional Databases:
- Profile Data in Database Column
For Non-Conforming Data
- Displaying Non conformant Data

KBA

Applicable Delphix Versions

At a Glance

Profiling of Data in Masked Column

Queries for DB2

How many rows in the table?

Example

How many duplicates (is the data unique)?

Example

How many rows with a specific length (characters and bytes)?

Example

Example

How many rows with special characters in the cell?

Example

What rows have Special Characters (how to find them)?

Example

Related Articles