Useful Regular Expressions for data quality

Data quality regular expressions

You will find here a few useful REGEX that I have found or written, they can be used easily to validate values within data by using the REGEX() function in Datamartist.

For example, here is a simple expression that validates if a string contains a valid Canadian postal code or not:

^[ABCEGHJKLMNPRSTVXY]{1}\d{1}[A-Z]{1} *\d{1}[A-Z]{1}\d{1}$

If in Datamartist you have a column called “PostalCode” in a data set, you can add a column called “ValidCanPC”, for example, using the following expression:

REGEX([PostalCode],”^[ABCEGHJKLMNPRSTVXY]{1}\d{1}[A-Z]{1} *\d{1}[A-Z]{1}\d{1}$”)

Combining REGEX

To combine more than one regular expression, you can either add the expressions together using the regex “OR” operator which is “|”, or you can use the OR function to combine two REGEX() functions.

For example, to determine if a postal code is either a valid Canadian postal code, OR a valid US postal code, you can use the following:

REGEX([PostalCode],”^\d{5}(-\d{4})?$|^[ABCEGHJKLMNPRSTVXY]{1}\d{1}[A-Z]{1} *\d{1}[A-Z]{1}\d{1}$”)

Where the “|” you see divides the two regex, the first shorter one being that for the US postal codes.

Find below some more regular expressions you might find useful (Please note, while I have tested many of these, I can’t vouch for complete accuracy- “Caveat Emptor” and please let me know if you find any issues (or have an improved version)- never assume any regex is infallible- depending on what they are trying to detect, false positives and false negatives are of course possible.

In a way, far from being a cure all for data validation, regular expressions underline how difficult it is- but they are a useful tool.

Postal Codes

Canada

^[abceghjklmnprstvxyABCEGHJKLMNPRSTVXY][0-9][abceghjklmnprstvwxyzABCEGHJKLMNPRSTVWXYZ][ ]?[0-9][[abceghjklmnprstvwxyzABCEGHJKLMNPRSTVWXYZ][0-9]

US

^\d{5}(-\d{4})?$

United Kingdom

^([A-PR-UWYZa-pr-uwyz0-9][A-HK-Y0-9][AEHMNPRTVXYaehmnprtvxy0-9]?[ABEHMNPRVWXYabehmnprvwxy0-9]? {1,2}[0-9][ABD-HJLN-UW-Zabd-hjln-uw-z]{2}|GIR 0AA)$

Email Address

^([a-zA-Z0-9_\-\.]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})(\]?)$

Phone Number

This one is a real killer- so many different ways to write phone numbers, and so many different formats- just a few examples;

North American phone numbers (allows an “x” and extension on the end):

^[1]?[ ]?[\\(]{0,1}([0-9]){3}[\\)]{0,1}[ ]?([0-9]){3}[ ]?[-]?[ ]?([0-9]){4}[ ]?([xX][ ]?[0-9]{1,5})?$

UK Phone numbers

(\s*\(?0\d{4}\)?\s*\d{6}\s*)|(\s*\(?0\d{3}\)?\s*\d{3}\s*\d{4}\s*)

Web page URI

^(http\://)?[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,6}(/\S*)?$

Dates

This one is obviously a huge one, considering how many different formats their are- here is an example of just one format, yyyy-mm-dd. Obviously, while a universal date detector is probably not something you’d want to try to build, as you have specific needs building patterns provides a handy way to validate strings.

^(19|20)\d\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])$