« An introduction to using regular expressions for data quality validation | The hitchhikers guide to Data Quality »
Using regular expressions to check data quality Part 2
Regular expressions are a powerful way to test if strings match a given pattern or rule set. They can be used to validate the structure of a string field in data, highlighting any obviously incorrect string values.
Note: In a previous post, I introduced regular expressions and went through a simple example using Canadian Postal code validation.
In this post, we’ll learn some more regular expression syntax, and explore some examples.
Using regular expressions to validate codes
Often, within ERP systems, various entities such as organisational units, products, etc. are assigned structured alphanumeric codes. These codes have a defined structure- all valid codes must have this structure. For example, a product code might have the pattern:
aaa-nnnnn
Where the first three characters are a product group code, a mandatory dash, and then a five digit product number.
You will remember from last time, a set of acceptable characters is defined using square brackets. So digits only can be specified by [0123456789] and letters by [abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ].
To simplify things, dashes can be used to include all digits or letters in a range, and these to formats can of course be combined. The string [a-zA-Z0-9] specifies that any letter or digit is acceptable.
In addition, you can make a rule as to how many consecutive characters are required that follow the rule by adding curly brackets. So the following pattern would test for a valid product code in this format:
^[a-zA-Z]{3}[-][0-9]{5}$
The {3} and {5} apply a length limit to the immediately preceeding test- so there must be exactly 3 letters, a dash and then five numbers.
To specify a range of lengths, you can put two numbers in the curly brackets, separated by a comma. So to allow product codes like ABC-1, DFG-12 or HGF-34564 we can use {1,5} instead for the second test- this allows the number after the dash to be made up of between one and five digits.
^[a-zA-Z]{3}[-][0-9]{1,5}$
From this example, you can see how its possible to fairly easily make patterns to test for a wide range of structured codes.
Defining more than one rule set at a time
Sometimes, you might have two different formats possible for a given value. As painful as it is, companies often change their coding rules, but due to legacy constraints, keep legacy coded entities in their data sets as well.
In regular expressions, it is easy to combine two completely different patterns in one test by using the pipe character (“|”). In fact, this operator can also be used within a single pattern to check for one OR the other of two different things.
For example, lets pretend that we’ve added a whole new product line to our offering, and the codes for these new products will have the structure XY-NNNNN-AA and XY is either the letter “A” followed by a single non-zero digit (ie A1,A2,A3…) OR if X is any letter other than A then Y should be a letter too. (I’m just making this up, but I can tell you some of the rules I’ve seen in the real world are a lot more complex and seemly arbitrary than this.)
A regular expression that would validate this new product code would be as follows:
^([aA][1-9]|[b-zB-Z]{2})[-][0-9]{5}[-][a-zA-Z]{2}$
Note that we put the first expressions all in some curved brackets, and separated two different rules by the “|” to make an or. So if [aA][1-0] OR [b-zB-Z]{2} is true at the start of the string, then that part of the test is OK. By starting at “b” we exclude the bad codes of “AB” or “AG” etc. because the rule says if it starts with A, the second character needs to be a number.
But, of course all of our old product codes are still going to be there, and they won’t pass this new test- so to combine the two, we just take each of them, and put a pipe between them:
^[a-zA-Z]{3}[-][0-9]{1,5}$|^([aA][1-9]|[b-zB-Z]{2})[-][0-9]{5}[-][a-zA-Z]{2}$
It looks like crazy gibberish- but when you build it a bit at a time, it all makes sense. And now, you can easily detect if there are any product codes that don’t conform to either valid structure.
Using this code in Datamartist
If you were using the Datamartist tool, you could find the bad product codes by adding a new column using a calculation block, called, say “ProductCodeValid”, that was defined as the following Datamartist expression:
REGEX([PRODUCT],”^[a-zA-Z]{3}[-][0-9]{1,5}$|^([aA][1-9]|[b-zB-Z]{2})[-][0-9]{5}[-][a-zA-Z]{2}$”)
This column will now have TRUE for the records where the product code is well formed, and FALSE where there are issues.
Regular expressions are a very useful way to check many types of data quality and they can help you avoid all sorts of crazy tests with LEFT, RIGHT and MID string functions.
Using REGEX will let you create much more powerful data quality code.
If you want to try building some REGEX yourself, you can download the Datamartist free trial, and give it a go with your own data.
« An introduction to using regular expressions for data quality validation | The hitchhikers guide to Data Quality »