Profile Data in Database Column for SQL Server (KBA5051)

Last updated
Save as PDF
Share
1. Share
2. Tweet
3. Share

KBA

KBA# 5051

Applicable Delphix Versions

Click here to view the versions of the Delphix engine to which this article applies

Major Release	All Sub Releases
6.0	6.0.0.0, 6.0.1.0, 6.0.1.1, 6.0.2.0
5.3	5.3.0.0, 5.3.0.1, 5.3.0.2, 5.3.0.3, 5.3.1.0, 5.3.1.1, 5.3.1.2, 5.3.2.0, 5.3.3.0, 5.3.3.1, 5.3.4.0, 5.3.5.0 5.3.6.0, 5.3.7.0, 5.3.7.1, 5.3.8.0, 5.3.8.1, 5.3.9.0
5.2	5.2.2.0, 5.2.2.1, 5.2.3.0, 5.2.4.0, 5.2.5.0, 5.2.5.1, 5.2.6.0, 5.2.6.1
5.1	5.1.0.0, 5.1.1.0, 5.1.2.0, 5.1.3.0, 5.1.4.0, 5.1.5.0, 5.1.5.1, 5.1.6.0, 5.1.7.0, 5.1.8.0, 5.1.8.1, 5.1.9.0, 5.1.10.0
5.0	5.0.1.0, 5.0.1.1, 5.0.2.0, 5.0.2.1, 5.0.2.2, 5.0.2.3, 5.0.3.0, 5.0.3.1, 5.0.4.0, 5.0.4.1 ,5.0.5.0, 5.0.5.1, 5.0.5.2, 5.0.5.3, 5.0.5.4

At a Glance

Description:	This page shows queries to use on SQL Server to profile data. This is especially handy to discover non-conforming data.
Queries applicable to:	MS SQL Server Tested on 2005, 2008, 2008R2, 2012, 2014, 2016, 2019.
Queries listed:	The following queries are listed below: How many rows in the table? How many rows with a specific length? How many duplicates (is the data unique)? How many rows with a specific profile? How many rows with special characters in the cell? What rows have Special Characters (how to find them)?
Non-Conforming Classification:	This is useful to identify Non-Conforming characters on the Masking Engine. For details about the Unicode characters please look here: https://www.compart.com/en/unicode/category/

Profiling of Data in Masked Column

These queries (and functions) are used to profile data in columns to be masked to understand what algorithm is best suited and how to configure the algorithm. Unicode character profile has been used to match the patterns reported in the masking engine's Monitor Page for nonconforming data.

The following codes have been used:

N - Numbers (0-9)
L - Letters (A-Z)
Z - Space
All the other characters are shown.

Queries for SQL Server

How many rows in the table?

This is important to know to understand the statistical distribution of the profile result below.

-- Replace [table]
--  
SELECT COUNT(*) cnt_rows FROM [table];

Example

+----------+
| cnt_rows |
+----------+
|  4534824 |
+----------+

Data length (chars and bytes)

This query shows the number of rows with a specific length (number of characters and bytes). The query has been limited to only count the 1,000 first rows.

-- Replace [table]
-- Replace [mask_col]
-- 
SELECT TOP 1000 LEN([mask_col]) as str_len, DATALENGTH([mask_col]) as bytes, Count(*) as cnt 
FROM [table] 
GROUP BY LEN([mask_col]), DATALENGTH([mask_col]) 
ORDER BY cnt DESC;

Example

Some algorithms are specifically created to fit a specific length. This example shows 'Home Phone Number' and there are clearly some additional data other than phone numbers in here.

+---------+---------+---------+
| str_len | bytes   |     cnt |
+---------+---------+---------+
| 10      | 10      | 727,204 |
| 11      | 11      | 399,465 |
| 12      | 12      | 104,710 |
| 9       | 9       |  81,835 |
| 8       | 8       |  34,741 |
...
| 24      | 24      |       2 |
| 25      | 25      |       1 |
| 26      | 26      |       1 |
+---------+---------+---------+

Duplicates (is the data unique)

Use this query when the uniqueness of the values in a column is important. This query should be executed on the column before and after masking. Especially investigate columns that are referential keys (i.e. PK and FK) or have unique constraints. Note that these might be composite and multiple columns needs to be investigated. Also, values might need to be formatted using a cast.

This query shows if there are duplicated values in a column. The expected result is one row with 'dups' = 1.

-- Replace [table]
-- Replace [column]
--
-- dups should be 1 for no duplicates.
--
SELECT count(*) as cnts, s1.dups
FROM (
   SELECT [column], count(*) dups
   FROM [table]
   GROUP BY [column]) s1
GROUP BY s1.dups;

Concatenate columns:

column1 + "|" + column2

Example

'dups' = 1 means unique entries. There are 4,534,822 unique entries in the example below and there is 1 duplicated value.

+---------+---------+
|    cnts |    dups |
+---------+---------+
| 4534822 |       1 |
|       1 |       2 |
+---------+---------+

Data profile

This query shows the number of rows with a specific length (number of characters). The query has been limited to only count the 1,000 first rows.

-- Create Function Below
-- Replace [table]
-- Replace [mask_col]
-- 
SELECT TOP 1000 [dbo].[profileASCII]([mask_col]) as profile, Count(*) as cnt 
FROM [table] 
GROUP BY [dbo].[profileASCII]([mask_col]) 
ORDER BY cnt DESC;

Function

To use the query above, please create this function first. It will profile data based on:

N - Numbers (0-9)
L - Letters (A-Z)
Z - Space
All the other characters are shown.

-- MS SQL
CREATE FUNCTION [dbo].[profileASCII](@inputString VARCHAR(1000))
RETURNS VARCHAR(1000)
AS
  BEGIN
    DECLARE @tmpChr VARCHAR(10);
    DECLARE @tmpStr VARCHAR(10);
    DECLARE @increment INT;
    SET @increment = 1;
    WHILE @increment <= DATALENGTH(@inputString)
    BEGIN
      SET @tmpChr = SUBSTRING(@inputString, @increment, 1);
      IF (@tmpChr LIKE '[0-9]') SET @tmpStr = 'N'; ELSE
      IF (@tmpChr LIKE '[A-Z]') SET @tmpStr = 'L'; ELSE
      IF (@tmpChr LIKE '[ ]')   SET @tmpStr = 'Z'; ELSE
      SET @tmpStr = @tmpChr;
      SET @inputString = LEFT(@inputString, @increment -1) + @tmpStr + RIGHT(@inputString, LEN(@inputString) - @increment );
      SET @increment = @increment + 1;
    END;
  RETURN @inputString;
  END;
GO

Example

The example below shows an example from profiling 'Salutation' (for example 'Mr'). The example shows the length and that the data has some records with full stop and some without.

+---------+---------+
| profile |     cnt |
+---------+---------+
| LL      | 376,849 |
| LL.     | 206,021 |
| LLL.    |  34,005 |
| LLLL    |  31,291 |
| LLL     |     127 |
| LLLL.   |       3 |
+---------+---------+

Special characters

This query has specifically been created to extract out special characters in data that needs to be ignored in masking algorithms. The query shows the number of rows with special characters. The query has been limited to only count the 1,000 first rows.

-- Create Function Below
-- Replace [table]
-- Replace [mask_col]
-- 
SELECT TOP 1000 [dbo].[profileSpecialASCII]([mask_col]) as special, Count(*) as cnt 
FROM [table] 
GROUP BY [dbo].[profileSpecialASCII]([mask_col]) 
ORDER BY cnt DESC;

Function

To use the query above, please create this function first.

All the other characters are only shown.

-- MS SQL
CREATE FUNCTION [dbo].[profileSpecialASCII](@inputString VARCHAR(1000))
RETURNS VARCHAR(1000)
  AS
    BEGIN
      DECLARE @tmpChr VARCHAR(10);
      DECLARE @tmpStr VARCHAR(10);
      DECLARE @increment INT;
      SET @increment = 1;
      WHILE @increment <= DATALENGTH(@inputString)
        BEGIN
          SET @tmpChr = SUBSTRING(@inputString, @increment, 1);
          IF (@tmpChr LIKE '[0-9]') SET @tmpStr = 'x'; ELSE
          IF (@tmpChr LIKE '[A-Z]') SET @tmpStr = 'x'; ELSE
          SET @tmpStr = @tmpChr;
          SET @inputString = LEFT(@inputString, @increment -1) + @tmpStr + RIGHT(@inputString, LEN(@inputString) - @increment );
          SET @increment = @increment + 1;
        END;
      RETURN REPLACE(@inputString, 'x', '');
   END;
GO

Example

Finding special characters in the data could be very useful and tells a lot about the data - in some cases, it needs to be taken care of. The sample below is taken from a 'Fullname' column.

+---------+---------+
| special |     cnt |
+---------+---------+
| _       | 205,585 |
| _&      |  98,783 |
| _-      |  72,624 |
| _++     |  24,233 |
| _'      |  17,766 |
| _()     |  13,579 |
| _/      |  12,158 |
| _&&     |   8,431 |
| _.      |   8,053 |
| _**     |   7,714 |
| _+      |   7,425 |
| _*      |   6,294 |
| _//     |   4,992 |
| _#      |   4,213 |
| _****   |   4,162 |
| _--     |   3,677 |
| _,      |   1,365 |
| _()*    |   1,245 |
| _(/)    |   1,204 |
| _---    |   1,146 |
+---------+---------+

Special Characters (how to find them)

This query can be used to show the first records in the database with Special Characters. The number of rows selected might need to be changed.

-- Replace [table]
-- Replace [mask_col]
-- Modify 100 if more sample data is needed.
-- Add URI (Unique Row Identifier) if needed.

SELECT TOP (100) [mask_col]
FROM [table] where [mask_col] LIKE '%[^0-9a-zA-Z]%';

Example

The example below shows the identification of Special Characters that are Japanese Katakana and Kanji. [ID] has been added as URI to show which rows have the Special Characters.

+----+-----------+
| ID |  mask_col |
+----+-----------+
|  2 | ABC番1234 |
|  4 | ABナ12345 |
+----+-----------+

Troubleshooting and Accessing Masking Logs (KBA7988)
For Non-Conforming Data
- Analyze Non-Conforming Data in Masking (KBA5039)
- Displaying Non conformant Data

KBA

Applicable Delphix Versions

At a Glance

Profiling of Data in Masked Column

Queries for SQL Server

How many rows in the table?

Example

Data length (chars and bytes)

Example

Duplicates (is the data unique)

Example

Data profile

Function

Example

Special characters

Function

Example

Special Characters (how to find them)

Example

Related Articles