« | »

Data profiling rules and data format strings

A very useful technique in data profiling is data format analysis.

Rather than looking at the actual individual values for a given column, by profiling the structure of the values you can understand at a higher level the quality of data. This technique is primarily used for string based data.

Can’t find the forest because of the trees?

Lets look at a simple but very real example. You have a data set with a few tens of thousands of rows that includes a text field for telephone number.

What kind of phone numbers do you have?

Nobody wants to look through 40,000 rows value by value to see where the bad records are.

Profiling with character substitution and elimination

With a data profiler you can define rules:

  1. IF the character is a digit (0-9) then replace it with the letter “n”
  2. IF the character is a letter (a-z) then replace it with the letter “a”
  3. These two simple rules turn 40,000 rows of phone numbers into a short list of different phone number formats- and we can see which ones seem valid or seem to have problems.

    By adding one more rule, eliminating a few characters that do not directly affect the phone number, we can reduce the list even further.

    The last rule ignores the space, open and close brackets and dash characters- and that simplifies the different formats in the analysis to a mere five formats;

    1. 10 digits
    2. 11 digits
    3. 9 digits
    4. A 7 character string (drilling down by clicking on this bar reveals the word “unknown”- not so useful.)
    5. Missing.

    Of course, data format profiling does not always give a final answer regarding data quality. In this case, just because a phone number has the right number of digits, does not mean that it is a valid phone number, and even if its a valid phone number it may not be the right phone number for that customer…

    Data format analysis for structured codes

    This same technique is also useful for any fields that are meant to contain structured codes.

    For example, say a valid product code is supposed to start with one letter, where the first letter should be one of A,B,D or G be followed by a dash, and then four digits.

    So A-2324 is a valid code, but M-2334 and J234 are not, for example.

    The following rules would help detect a validly formatted string (and point out the issues:)

    The result on a mocked up set of data:

    Using data format patterns to examine the contents of a string column is a very useful way to start to understand what’s in the column. More than just a Yes-No result, it actually gives you a visual look at what types of issues exist.

    The analysis here was done using the Datamartist tool, an easy to use data profiler and data transformation tool. To try making some data format rules for your own data, give the free trial a try.

    Up next in our data profiling series of blog posts- an even more powerful (although often more complex) technique called “Regular Expressions” or “Regex”. This specification language can define complex rules that analyze strings and determine if they belong to a given set (say, “Valid product codes” as in this example) or not.

Tagged as:

Twitter

« | »

Comments are closed.