Become a regex regular & wrangle imperfect data

Tableau 9.0 added regular expressions to the lineup of Tableau functions. If, like me, your reaction to hearing this news was, โ€œAwesome!โ€โ€”and by โ€œawesomeโ€ I mean โ€œwhat the heck is RegEx?โ€ then this blog post is for you!

Tableau 9.0 added regular expressions to the lineup of Tableau functions. If, like me, your reaction to hearing this news was, โ€œAwesome!โ€โ€”and by โ€œawesomeโ€ I mean โ€œwhat the heck is RegEx?โ€โ€”then this blog post is for you!


Xkcd.com on regular expressions.

Letโ€™s say youโ€™re like I was. Regular expressions, or regex, isnโ€™t part of our vocabulary. So what is regular expressions? As the comic above indicates, itโ€™s a syntax for searching specific patterns in text, such as an address, URL, or a hex value.

Why you should care about regex

What if youโ€™re working with less-than-perfect data? Mark Fraser discovered his dataset recorded UK postcodes in a nonstandard manner. He used regex to parse out the first portion of the postal code as that is the granularity that Tableau can map.

What if youโ€™re looking at web traffic for links like this: website.com/domain/orderpage.html?productcat=shoes&brand=nike&shoenumber=12345&color=blue and you want to be able to do analysis based on pulling out the product category, brand, shoe number, or color? Thatโ€™s exactly what Chris Smith needed, and with regex, the solution became much simpler than the string parsing ELSEIF & FIND solution.

Hereโ€™s a closer look. The regex /[0-9]{3}-[0-9]{3}-[0-9]{4}/ matches a phone number written in the format ###-###-####. If we want to match a phone number written in the format (###) ###-####, or with periods instead of dashes, or with an extension, or international formats, itโ€™s all possible with a more complicated bit of regex.

How to get started with regex

What kind of strings can regex find? Hereโ€™s a list of common regex broken down and explained. I always work best with examples, and these eight regular expressions helped me solidify my understanding in a concrete way.

If you want to learn regex, I recommend this quick start guide. It defines regex as โ€œa pattern describing a certain amount of text,โ€ and goes into the basics at a nice high level. Once you have an overview, jump in a bit deeper. Learn Regex the Hard Way walks through the basic elements of regex step by step, slowly adding a framework and offering suggested exercises.

This brings us to perhaps the most important resource I can suggest, the simulator. RegExr is a phenomenal tool to learn, build, and test regex. With syntax highlighting, mouse-over explanations, real-time visual results, a powerful navigation for examples, a cheat sheet, and a detailed reference library, this is THE tool.

Going beyond the basics

Once you have the basics down, itโ€™s always good to have a resource for the nitty-gritty details. Tableauโ€™s Online Help references ICUโ€™s regex documentation as the official syntax guide for Tableau Data Extracts. Your data source may have a different syntax if you have a live connection. I always like to have multiple resources, so I also checked out the Python regex documentation (though make sure the behavior matches ICUโ€™s, or your specific data source).

Now that we have all these great resources for writing regex, how do we use it in Tableau itself? Tableau supports the following:

  • REGEXP_REPLACE(string, pattern, replacement): Returns a copy of the given string where the regular expression pattern is replaced by the replacement string.
  • REGEXP_MATCH(string, pattern): Returns true if a substring matches the regex pattern.
  • REGEXP_EXTRACT(string, pattern): Returns the portion of the string matching the regular expression pattern.
  • REGEXP_EXTRACT_NTH(string, pattern, index): Returns the portion of the string that matches the regular expression pattern. The substring is matched to the nth capturing group, where n is the given index.

Tableau Zen Master Mark Jackson has a great blog post on regex, and gives this example of using REGEXP_EXTRACT to pull out dosage information from medical data:

I took some Airbnb data from Portland and parsed out the bed/bath information from the listings:

Want to see what else is out there? The Regex Library lets you search existing regex patterns, leveraging the communityโ€™s knowledge so you donโ€™t have to reinvent the wheel.

What regex resources do you use? Share in the comments below.