Fall in Love with Verbose Regular Expressions
Why use Regular Expressions?
Many programming languages support regular expression processing; in fact some tools, such as grep, sed and awk (arguably not programming languages), offer little else.
Regular expressions, also known as regex or re, are great for searching, matching, parsing and even changing text (on modern systems this includes UTF-8, etc.). They are often even used as the parser for Domain Specific Languages.
There are a number of dialects of regular expression but they all have the following in common:
- They provide mechanisms for pattern matching
- They include wildcards and character classes, e.g. . for any character (usually excluding a newline by default), \d for any digit, \s for any whitespace, \S for any non-whitespace, [ABCabc] for any of those characters, etc.
- They include “Quantifiers” i.e. some way of specifying lengths of matches, e.g. * for any number of matches including 0, + for one or more, ? for 0 or 1, {4,7} for 4–7, etc.
- They include “anchors”, e.g. ^ or \A for start of string, $ for end of string, etc.
- They include group constructs, which may include capturing, named capturing, non-capturing, lookbehind/lookahead, etc.
- They may have modifiers that change the behaviour such as case matching, Unicode/ASCII switches, line end handling, etc.
- There are often many ways to achieve a given result.
- They tend to be very fast, memory efficient and excel at stream processing.
- They have their own language usually distinct from the “host” language that they are running within.
- The Syntax is extremely terse but can be very powerful.
A really simple example might be extracting the digits from a string:
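A minimal sketch of that idea in Python follows. The sample text is invented for illustration and is not from any real data set.

```python
import re

# Hypothetical sample text, invented for this illustration.
text = "Order 66 shipped 3 boxes on day 12"

# \d+ matches one or more consecutive digits; findall returns every match.
digit_runs = re.findall(r"\d+", text)
print(digit_runs)  # ['66', '3', '12']

# Join the runs if a single string of digits is wanted.
all_digits = "".join(digit_runs)
print(all_digits)  # 66312
```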
But if you have more complex requirements this can get complicated quickly.
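As a rough illustration of how quickly things grow, here is a sketch for a hypothetical requirement (the line format and the names date, level and msg are assumptions made up for this example): capture a date, a level and a message, but skip DEBUG lines. It combines anchors, quantifiers, named groups and a negative lookahead on one line.

```python
import re

# Hypothetical requirement: pull date, level and message out of lines such as
# "2021-01-03 ERROR disk full", while rejecting DEBUG lines.
pattern = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2})\s+(?!DEBUG\b)(?P<level>[A-Z]+)\s+(?P<msg>.+)$"
)

match = pattern.match("2021-01-03 ERROR disk full")
print(match.groupdict())
# {'date': '2021-01-03', 'level': 'ERROR', 'msg': 'disk full'}

# The negative lookahead (?!DEBUG\b) rejects this line.
rejected = pattern.match("2021-01-03 DEBUG heartbeat")
print(rejected)  # None
```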
As you can see, regular expressions are very powerful, and there are a number of books and courses on how to use them effectively. There are also a lot of tools and sites available to help with developing and testing your regular expressions (one of my favourites is https://regex101.com/), and some even support verbose or extended syntax.
What is a Verbose Regular Expression and why use it?
It is the last two points in the list above (regular expressions have their own language, and their syntax is extremely terse) that I would like to address, by suggesting the verbose or extended option that Python and PHP programmers, among others, can use when dealing with regular expressions. As a Python developer I will concentrate on the Python re dialect of regular expressions rather than PHP’s PCRE/PCRE2; both offer similar features.
The regular expression syntax can differ greatly from the “host” programming language and can be very powerful and sophisticated. However, the power and terseness, in addition to the lower familiarity, can lead to problems, especially as the use of the powerful features grows. Some programming language parsers, especially for domain specific languages (DSLs), are implemented largely with regular expressions, which as you can imagine can get very “sophisticated” (read: hard to understand).
Verbose regular expressions, also known as extended regular expressions, are unusual in that they can include inline comments. That’s right, comments! In Python, whitespace in the pattern is ignored unless it is inside a character class or escaped with a backslash, and anything from an unescaped hash (#) outside a character class to the end of the line is a comment and is also ignored. Of course this is an area where the Python triple-quoted multiline string comes in very handy.
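A small sketch of the same pattern written both ways may make this concrete. The sample string is an assumption for illustration; re.VERBOSE is the flag Python’s re module provides for this mode.

```python
import re

# Compact form: a number with a decimal point, split into two groups.
compact = re.compile(r"(\d+)\.(\d+)")

# Verbose form of the same pattern, spread over lines with comments.
verbose = re.compile(
    r"""
    (\d+)   # integer part: one or more digits
    \.      # a literal decimal point
    (\d+)   # fractional part: one or more digits
    """,
    re.VERBOSE,
)

sample = "gain 123.45"  # hypothetical input
compact_groups = compact.search(sample).groups()
verbose_groups = verbose.search(sample).groups()
print(verbose_groups)  # ('123', '45')
```

Both compiled patterns match exactly the same text; only the readability of the source differs.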
Let’s have a concrete example to illustrate the problem. I recently had to process a number of log files where the lines that I was interested in extracting all started with a timestamp, e.g. “2021 Jan 03 Mon 03:45:12.000, ” followed by any of:
- 12345.6 Mohms some other text
- Some text 12.345 nf
- real gain some text 123.45
In these examples “some other text”, “Some text” and “some text” stand for variable content of variable length, and the numbers (12345.6, 12.345 and 123.45) are the information that I was interested in. BTW: I know that the units should be MOhms and nF, but I didn’t write the code that produced the log files; I just had to cope with the results.
As a first pass I created 3 Python regular expressions, not in verbose mode:
- re.compile(r'(\d{4}\s\S{3}\s\d{2}\s\S{3}\s[\d:\.]{12}),\s(\d+\.\d+)\sMohms.*')
- re.compile(r'(\d{4}\s\S{3}\s\d{2}\s\S{3}\s[\d:\.]{12}),\s.*\s(\d+\.\d+)nf.*')
- re.compile(r'(\d{4}\s\S{3}\s\d{2}\s\S{3}\s[\d:\.]{12}),\sreal gain.*\s(\d+\.\d+)\n')
These do the job and, as you can see, they are extremely compact. The problem is that if you come back to them after a few weeks, or even after lunch, you are likely to have trouble working out what they do and how, which is one definition of unmaintainable code. Of course I could have added a long comment on the lines before, but it was likely to be convoluted and to get out of step with the code. Python’s re.VERBOSE flag to the rescue:
Let’s deal with the timestamp first:
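The following is a sketch of how the timestamp prefix of the compact patterns could be written in verbose mode. The names TIMESTAMP and timestamp_re are my own choices for illustration, and the comments record my reading of each fragment rather than a specification of the log format.

```python
import re

# Verbose form of the timestamp prefix shared by the three compact patterns.
TIMESTAMP = r"""
    (                 # start of capture group 1: the whole timestamp
      \d{4}           # four-digit year, e.g. 2021
      \s\S{3}         # whitespace, then a three-character month, e.g. Jan
      \s\d{2}         # whitespace, then a two-digit day of the month, e.g. 03
      \s\S{3}         # whitespace, then a three-character day name, e.g. Mon
      \s[\d:\.]{12}   # whitespace, then 12 digits, colons or dots, e.g. 03:45:12.000
    )                 # end of capture group 1
    ,\s               # comma and whitespace separating the timestamp from the rest
"""

timestamp_re = re.compile(TIMESTAMP, re.VERBOSE)

line = "2021 Jan 03 Mon 03:45:12.000, 12345.6 Mohms some other text"
match = timestamp_re.match(line)
print(match.group(1))  # 2021 Jan 03 Mon 03:45:12.000
```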
As you can hopefully see, coming back to this after a break is a lot less intimidating.
Likewise, I could define patterns for the non-timestamp parts of the three lines above with:
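A sketch of what those three patterns might look like is below. The names (RESISTANCE, CAPACITANCE, GAIN, PATTERNS, parse_line) are hypothetical. Two details are assumptions on my part and differ from the compact originals: the capacitance pattern allows optional whitespace before nf, because the sample line shows a space there, and the gain pattern ends with $ rather than a literal \n so that it also matches a line whose newline has been stripped.

```python
import re

# Timestamp prefix kept compact here so that this block is self-contained.
TIMESTAMP = r"(\d{4}\s\S{3}\s\d{2}\s\S{3}\s[\d:\.]{12}),\s"

RESISTANCE = r"""
    (\d+\.\d+)    # captured reading, e.g. 12345.6
    \sMohms       # whitespace, then the unit exactly as logged
    .*            # any trailing text
"""

CAPACITANCE = r"""
    .*\s          # any leading text, then whitespace
    (\d+\.\d+)    # captured reading, e.g. 12.345
    \s*nf         # ASSUMPTION: optional whitespace before the unit as logged
    .*            # any trailing text
"""

GAIN = r"""
    real\sgain    # fixed label; the space must be written as \s in verbose mode
    .*\s          # any text, then whitespace
    (\d+\.\d+)    # captured reading, e.g. 123.45
    $             # ASSUMPTION: end of line, in place of a literal newline
"""

PATTERNS = {
    "resistance": re.compile(TIMESTAMP + RESISTANCE, re.VERBOSE),
    "capacitance": re.compile(TIMESTAMP + CAPACITANCE, re.VERBOSE),
    "gain": re.compile(TIMESTAMP + GAIN, re.VERBOSE),
}


def parse_line(line):
    """Return (kind, timestamp, value) for the first matching pattern, else None."""
    for kind, pattern in PATTERNS.items():
        match = pattern.match(line)
        if match:
            return kind, match.group(1), float(match.group(2))
    return None


stamp = "2021 Jan 03 Mon 03:45:12.000, "
for rest in ("12345.6 Mohms some other text", "Some text 12.345 nf",
             "real gain some text 123.45"):
    print(parse_line(stamp + rest))
```

Building each full pattern by concatenating the timestamp with a value pattern keeps the shared prefix in one place, so a change to the timestamp format only needs to be made once.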
I think that you will agree that this is a lot cleaner and more maintainable. Of course I could have combined all 3 into a single regular expression, but that didn’t suit how I was processing the results going forwards, and I had got the results and performance that I needed.
The one “gotcha” to watch out for is that whitespace in your verbose regular expression is ignored unless it is escaped or inside a character class, so you must use \s or an equivalent to include whitespace in what you are looking for. For the same reason, a literal # has to be escaped or placed in a character class, otherwise it starts a comment.
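A short sketch of this gotcha, using invented sample strings, shows a pattern that silently stops matching in verbose mode and a few ways to write the whitespace explicitly.

```python
import re

text = "real gain 123.45"  # hypothetical input

# The plain space is ignored in verbose mode, so this searches for "realgain".
wrong = re.compile(r"real gain", re.VERBOSE)
print(wrong.search(text))  # None

# Three ways to require the whitespace explicitly.
with_class = re.compile(r"real\sgain", re.VERBOSE)      # any whitespace character
with_escape = re.compile(r"real\ gain", re.VERBOSE)     # backslash-escaped space
with_brackets = re.compile(r"real[ ]gain", re.VERBOSE)  # space inside a character class

found = [p.search(text).group(0) for p in (with_class, with_escape, with_brackets)]
print(found)  # ['real gain', 'real gain', 'real gain']

# A literal '#' must be escaped too, otherwise it starts a comment.
hash_re = re.compile(r"item\s\#\d+", re.VERBOSE)
hash_match = hash_re.search("see item #42 for details")
print(hash_match.group(0))  # item #42
```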
Winding Up
Hopefully the above has encouraged you to use verbose regular expressions, if you are in a programming environment that permits them. Your future self and any other maintainers will thank you if you ever have to touch this code again. The usual rule applies, of course: if you change the code but not the comments, the comments become worse than useless, so keep them up to date.