Data quality regular expressions
You will find here a few useful REGEX that I have found or written, they can be used easily to validate values within data by using the REGEX() function in Datamartist.
For example, here is a simple expression that validates if a string contains a valid Canadian postal code or not:
^[ABCEGHJKLMNPRSTVXY]{1}\d{1}[A-Z]{1} *\d{1}[A-Z]{1}\d{1}$
If in Datamartist you have a column called “PostalCode” in a data set, you can add a column called “ValidCanPC”, for example, using the following expression:
REGEX([PostalCode],”^[ABCEGHJKLMNPRSTVXY]{1}\d{1}[A-Z]{1} *\d{1}[A-Z]{1}\d{1}$”)
Combining REGEX
To combine more than one regular expression, you can either add the expressions together using the regex “OR” operator which is “|”, or you can use the OR function to combine two REGEX() functions.
For example, to determine if a postal code is either a valid Canadian postal code, OR a valid US postal code, you can use the following:
REGEX([PostalCode],”^\d{5}(-\d{4})?$|^[ABCEGHJKLMNPRSTVXY]{1}\d{1}[A-Z]{1} *\d{1}[A-Z]{1}\d{1}$”)
Where the “|” you see divides the two regex, the first shorter one being that for the US postal codes.
Find below some more regular expressions you might find useful (Please note, while I have tested many of these, I can’t vouch for complete accuracy- “Caveat Emptor” and please let me know if you find any issues (or have an improved version)- never assume any regex is infallible- depending on what they are trying to detect, false positives and false negatives are of course possible.
In a way, far from being a cure all for data validation, regular expressions underline how difficult it is- but they are a useful tool.
Postal Codes
Canada
^[abceghjklmnprstvxyABCEGHJKLMNPRSTVXY][0-9][abceghjklmnprstvwxyzABCEGHJKLMNPRSTVWXYZ][ ]?[0-9][[abceghjklmnprstvwxyzABCEGHJKLMNPRSTVWXYZ][0-9]
US
^\d{5}(-\d{4})?$
United Kingdom
^([A-PR-UWYZa-pr-uwyz0-9][A-HK-Y0-9][AEHMNPRTVXYaehmnprtvxy0-9]?[ABEHMNPRVWXYabehmnprvwxy0-9]? {1,2}[0-9][ABD-HJLN-UW-Zabd-hjln-uw-z]{2}|GIR 0AA)$
Email Address
^([a-zA-Z0-9_\-\.]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})(\]?)$
Phone Number
This one is a real killer- so many different ways to write phone numbers, and so many different formats- just a few examples;
North American phone numbers (allows an “x” and extension on the end):
^[1]?[ ]?[\\(]{0,1}([0-9]){3}[\\)]{0,1}[ ]?([0-9]){3}[ ]?[-]?[ ]?([0-9]){4}[ ]?([xX][ ]?[0-9]{1,5})?$
UK Phone numbers
(\s*\(?0\d{4}\)?\s*\d{6}\s*)|(\s*\(?0\d{3}\)?\s*\d{3}\s*\d{4}\s*)
Web page URI
^(http\://)?[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,6}(/\S*)?$
Dates
This one is obviously a huge one, considering how many different formats their are- here is an example of just one format, yyyy-mm-dd. Obviously, while a universal date detector is probably not something you’d want to try to build, as you have specific needs building patterns provides a handy way to validate strings.
^(19|20)\d\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])$