Become a regex regular & wrangle imperfect data
Tableau 9.0 added regular expressions to the lineup of Tableau functions. If, like me, your reaction to hearing this news was, “Awesome!”—and by “awesome” I mean “what the heck is RegEx?”—then this blog post is for you!
Xkcd.com on regular expressions.
Let’s say you’re like I was. Regular expressions, or regex, isn’t part of our vocabulary. So what is regular expressions? As the comic above indicates, it’s a syntax for searching specific patterns in text, such as an address, URL, or a hex value.
Why you should care about regex
What if you’re working with less-than-perfect data? Mark Fraser discovered his dataset recorded UK postcodes in a nonstandard manner. He used regex to parse out the first portion of the postal code as that is the granularity that Tableau can map.
What if you’re looking at web traffic for links like this: website.com/domain/orderpage.html?productcat=shoes&brand=nike&shoenumber=12345&color=blue
and you want to be able to do analysis based on pulling out the product category, brand, shoe number, or color? That’s exactly what Chris Smith needed, and with regex, the solution became much simpler than the string parsing ELSEIF & FIND solution.
Here’s a closer look. The regex /[0-9]{3}-[0-9]{3}-[0-9]{4}/
matches a phone number written in the format ###-###-####. If we want to match a phone number written in the format (###) ###-####, or with periods instead of dashes, or with an extension, or international formats, it’s all possible with a more complicated bit of regex.
How to get started with regex
What kind of strings can regex find? Here’s a list of common regex broken down and explained. I always work best with examples, and these eight regular expressions helped me solidify my understanding in a concrete way.
If you want to learn regex, I recommend this quick start guide. It defines regex as “a pattern describing a certain amount of text,” and goes into the basics at a nice high level. Once you have an overview, jump in a bit deeper. Learn Regex the Hard Way walks through the basic elements of regex step by step, slowly adding a framework and offering suggested exercises.
This brings us to perhaps the most important resource I can suggest, the simulator. RegExr is a phenomenal tool to learn, build, and test regex. With syntax highlighting, mouse-over explanations, real-time visual results, a powerful navigation for examples, a cheat sheet, and a detailed reference library, this is THE tool.
Going beyond the basics
Once you have the basics down, it’s always good to have a resource for the nitty-gritty details. Tableau’s Online Help references ICU’s regex documentation as the official syntax guide for Tableau Data Extracts. Your data source may have a different syntax if you have a live connection. I always like to have multiple resources, so I also checked out the Python regex documentation (though make sure the behavior matches ICU’s, or your specific data source).
Now that we have all these great resources for writing regex, how do we use it in Tableau itself? Tableau supports the following:
- REGEXP_REPLACE(string, pattern, replacement): Returns a copy of the given string where the regular expression pattern is replaced by the replacement string.
- REGEXP_MATCH(string, pattern): Returns true if a substring matches the regex pattern.
- REGEXP_EXTRACT(string, pattern): Returns the portion of the string matching the regular expression pattern.
- REGEXP_EXTRACT_NTH(string, pattern, index): Returns the portion of the string that matches the regular expression pattern. The substring is matched to the nth capturing group, where n is the given index.
Tableau Zen Master Mark Jackson has a great blog post on regex, and gives this example of using REGEXP_EXTRACT to pull out dosage information from medical data:
I took some Airbnb data from Portland and parsed out the bed/bath information from the listings:
Want to see what else is out there? The Regex Library lets you search existing regex patterns, leveraging the community’s knowledge so you don’t have to reinvent the wheel.
What regex resources do you use? Share in the comments below.
相關文章
Subscribe to our blog
在收件匣中收到最新的 Tableau 消息。