Regular Expression— Python — Part I

Diane Khambu
Python in Plain English
5 min read · Jan 1, 2023

Photo by Priscilla Du Preez on Unsplash

It’s almost the end of 2022 and I haven’t written an article for December yet. So let’s take a look at the special characters in Python’s re package and at its interfaces to the regular expression (RE) engine:

. (Dot) In default mode, matches any character except a newline character \n .

^ (Caret) Matches the start of a string.

$ Matches the end of the string or just before the newline at the end of the string.

* Causes the resulting RE to match 0 or more repetitions of the preceding RE, as many repetitions as possible.

? Causes the resulting RE to match 0 or 1 repetitions of the preceding RE .

+ Causes the resulting RE to match 1 or more repetitions of the preceding RE .

\ Either escapes special characters like * and ? , or signals a special sequence. A special sequence consists of \ and an ASCII character. For example, \s matches a whitespace character.

{m} Specifies that exactly m copies of the previous RE should be matched.

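{m,n} Causes the resulting RE to match from m to n repetitions of the preceding RE . Note that there is no space after the comma: Python treats {m, n} with a space as literal text, not as a repetition count.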

[] Used to indicate a set of characters. [abc] will match 'a' or 'b' or 'c' . A ^ as the first character inside [] negates the set, so [^abc] matches any character except 'a' , 'b' and 'c' .
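To see a few of these special characters in action, here is a minimal sketch. The sample strings and variable names are made up for illustration; only re.findall(), re.search() and re.fullmatch() from the standard library are used.

```python
import re

# . matches any character except a newline
dot_matches = re.findall(r'h.t', 'hat hot h\nt')

# ^ and $ anchor the pattern to the start and end of the string
anchored = re.search(r'^happy$', 'happy') is not None

# * (0 or more), + (1 or more) and ? (0 or 1) apply to the preceding RE
star = re.findall(r'ab*', 'a ab abbb')
plus = re.findall(r'ab+', 'a ab abbb')
optional = re.findall(r'colou?r', 'color colour')

# {m} and {m,n} count repetitions (no space after the comma)
exact = re.fullmatch(r'\d{4}', '2023') is not None
ranged = re.findall(r'\b\d{2,3}\b', '1 22 333 4444')

# [] defines a set of characters; a leading ^ negates it
vowels = re.findall(r'[aeiou]', 'happy new year')
non_vowels = re.findall(r'[^aeiou]', 'happy')

print(dot_matches, star, plus, optional, ranged, vowels, non_vowels)
```

Printing each variable in a Python shell is a quick way to check your own reading of a pattern against what the engine actually does.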

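The following is a list of some special sequences. For Unicode (str) patterns, they match any character that’s in the appropriate category in the Unicode database. The ASCII character classes shown below are exact equivalents only when the re.ASCII flag is set or when matching bytes patterns.

\d Matches any decimal digit; in ASCII mode this is equivalent to the class [0-9] .

\D Matches any non-digit character; in ASCII mode this is equivalent to the class [^0-9] .

\s Matches any whitespace character, such as a space, a tab or a newline.

\S Matches any non-whitespace character.

\w Matches any word character, that is, any alphanumeric character or the underscore; in ASCII mode this is equivalent to the class [a-zA-Z0-9_] .

\W Matches any non-word character; in ASCII mode this is equivalent to the class [^a-zA-Z0-9_] .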

\b Matches the empty string, but only at the beginning or end of a word aka word boundary.
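Here is a small sketch of these special sequences at work. The sample text and variable names are only illustrative. The last two lines assume a str pattern, for which \d is Unicode-aware by default and becomes ASCII-only when the re.ASCII flag is passed.

```python
import re

text = 'Happy New Year 2023!'

digits = re.findall(r'\d+', text)       # runs of decimal digits
non_space = re.findall(r'\S+', text)    # runs of non-whitespace characters
words = re.findall(r'\b\w+\b', text)    # whole words between word boundaries

# '\u0663' is ARABIC-INDIC DIGIT THREE, a decimal digit in the Unicode database
sample = '\u0663 3'
unicode_digits = re.findall(r'\d', sample)          # Unicode-aware (default)
ascii_digits = re.findall(r'\d', sample, re.ASCII)  # restricted to [0-9]

print(digits, non_space, words, len(unicode_digits), ascii_digits)
```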

Now let’s use them in Python by importing the re module, which provides an interface to the regular expression engine. The engine allows us to compile REs into objects and then perform matches with them.

The following examples are shown as they would appear in a Python shell.

>>> import re
>>> p = re.compile('happy*')
>>> p
re.compile('happy*')

re.compile() also accepts an optional flags argument, which is used to enable various special features and syntax variations.

>>> p = re.compile('happy*', re.IGNORECASE)
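As a quick sketch of what the flag changes, the snippet below compares a case-insensitive pattern with a case-sensitive one. It also combines two flags with | ; re.MULTILINE is not covered in this article and is included only to show that flags can be combined. The sample strings are made up.

```python
import re

# With re.IGNORECASE the pattern matches regardless of letter case
insensitive = re.compile('happy*', re.IGNORECASE).match('HAPPY new year')

# Without the flag, the same pattern does not match the upper-case text
sensitive = re.compile('happy*').match('HAPPY new year')

# Flags are combined with the bitwise OR operator.
# re.MULTILINE makes ^ match at the start of every line.
p = re.compile(r'^happy', re.IGNORECASE | re.MULTILINE)
line_starts = p.findall('Happy\nhappy\nunhappy')

print(insensitive.group(), sensitive, line_starts)
```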

We can write an RE either as a regular string or as a raw string. A raw string is prefixed with r , as in r'happy*' , while a regular string needs extra escaping for backslashes.

For example, if we want to match the text \new , the backslash must be escaped once for re.compile() and once more for the Python string literal.

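Different backslash escaping for different stages:

  Characters     Stage
  \new           Text string to be matched
  \\new          Escaped backslash for re.compile()
  "\\\\new"      Escaped backslashes for a string literal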

From the table, we see the backslash avalanche: four \ characters are needed to match \new . The solution is to use Python’s raw string notation for regular expressions. Backslashes are not handled in any special way in a string literal prefixed with r . So r'\n' is a two-character string containing '\' and 'n' , while '\n' is a one-character string containing a newline.
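The backslash counting above can be checked directly in Python. This is a minimal sketch; the variable names are only for illustration.

```python
import re

text = r'\new'        # the text to match: a backslash followed by 'new'

regular = '\\\\new'   # regular literal: four backslashes typed
raw = r'\\new'        # raw literal: two backslashes typed

# Both literals produce the same pattern string with two backslashes,
# which the RE engine reads as one escaped, literal backslash.
same_pattern = regular == raw
found = re.search(raw, text)

# A raw '\n' is two characters, a regular '\n' is a single newline
lengths = (len(r'\n'), len('\n'))

print(same_pattern, found.group(), lengths)
```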

Now let’s perform matches with the compiled RE objects. Before that, here are some methods of the compiled objects.

  • match() : Determine if the RE matches at the beginning of the string.
  • search() : Scan through a string, looking for any location where this RE matches.
  • findall() : Find all substrings where the RE matches, and return them as a list.
  • finditer() : Find all substrings where the RE matches, and return them as an iterator of match objects.

Let’s see examples:

>>> import re
>>> p = re.compile('happy*')
>>> p
re.compile('happy*')
>>> m = p.match('happy new year 2023!')
>>> m
<re.Match object; span=(0, 5), match='happy'>

A match object instance has the following methods:

  • group() : Return the string matched by the RE .
  • start() : Return the starting position of the match.
  • end() : Return the ending position of the match.
  • span() : Return a tuple containing the (start, end) position of the match.

Let’s use these methods on our match object instance m .

>>> m.group()
'happy'
>>> m.start(), m.end()
(0, 5)
>>> m.span()
(0, 5)

The match() and search() methods of a compiled object return None if there is no match.

>>> p = re.compile('happiest*')
>>> p
re.compile('happiest*')
>>> m = p.match('happy new year 2023!')
>>> m
>>> print(p.match('happy new year 2023!'))
None

In real programs, the common style is to store the match object in a variable and then check whether it is None .

>>> p = re.compile('happy*')
>>> p
re.compile('happy*')
>>> m = p.match('happy new year 2023!')
>>> if m:
...     print('Happiest new year 2023!')
... else:
...     print('Not a new year')
...
Happiest new year 2023!

Let’s use the search() method of a compiled object to find a match.

>>> p = re.compile(r'(\d+)!')
>>> p
re.compile('(\\d+)!')
>>> m = p.search('happy new year 2023!')
>>> m
<re.Match object; span=(15, 20), match='2023!'>
>>> m.group()
'2023!'
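The pattern above also contains a pair of parentheses, which forms a capture group. As a small sketch that goes slightly beyond the methods listed earlier, group(1) and span(1) return the text and position of that group alone, and match() with the same pattern finds nothing because the digits are not at the beginning of the string.

```python
import re

p = re.compile(r'(\d+)!')
text = 'happy new year 2023!'

m = p.search(text)
whole = m.group(0)       # the whole match, same as m.group()
year = m.group(1)        # only the part captured by (\d+)
year_span = m.span(1)    # (start, end) of the captured part

# match() only looks at the beginning of the string
at_start = p.match(text)

print(whole, year, year_span, at_start)
```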

Now let’s look at the findall() method of a compiled object. Here we are looking for all words that end with est .

>>> p = re.compile(r'\b\w+est\b')
>>> p
re.compile('\\b\\w+est\\b')
>>> m = p.findall('One of the coolest, brightest, humblest, funniest, kindest, '
...               'friendliest person one can meet.')
>>> m
['coolest', 'brightest', 'humblest', 'funniest', 'kindest', 'friendliest']

Now let’s look at the finditer() method.

>>> p = re.compile(r'\b\w+est\b')
>>> p
re.compile('\\b\\w+est\\b')
>>> for quality in p.finditer('One of the coolest, brightest, humblest, '
...                           'funniest, kindest, friendliest person one '
...                           'can meet.'):
...     print(quality.group())
...
coolest
brightest
humblest
funniest
kindest
friendliest
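Because finditer() yields match objects rather than plain strings, we can also ask each match where it was found. Here is a minimal sketch with a shortened sample sentence; the positions list is just an illustrative name.

```python
import re

text = 'One of the coolest, brightest, humblest person one can meet.'
p = re.compile(r'\b\w+est\b')

# Collect the matched word together with its start and end position
positions = []
for m in p.finditer(text):
    positions.append((m.group(), m.start(), m.end()))

for word, start, end in positions:
    # Slicing the original text with the span gives back the matched word
    print(word, start, end, text[start:end] == word)
```

findall() is handy when only the matched strings are needed, while finditer() is the better fit when positions or groups matter.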

We can combine the compile step and the matching method into a single call by using the module-level functions.

>>> print(re.match(r'happy*\s', 'happy new year 2023!'))
<re.Match object; span=(0, 6), match='happy '>
>>> print(re.match(r'happiest*\s', 'happy new year 2023!'))
None
>>> print(re.findall(r'\w+!', 'goodbye 2022!, happy new year 2023!'))
['2022!', '2023!']

Under the hood, a compiled object is created and the appropriate matching method is called on it. If you use a regex inside a loop, pre-compiling it will save a few function calls. Outside of loops, there’s not much difference, thanks to the module’s internal cache of compiled patterns.
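The two styles can be put side by side as follows. This is only a sketch with made-up sample lines; it shows that both approaches give the same result and does not measure any speed difference.

```python
import re

lines = ['happy new year 2023!', 'goodbye 2022!', 'no digits here']

# Style 1: compile once, reuse the compiled object inside the loop
year = re.compile(r'\d{4}')
compiled_hits = []
for line in lines:
    m = year.search(line)
    if m:
        compiled_hits.append(m.group())

# Style 2: module-level function, the pattern is looked up on every call
module_hits = []
for line in lines:
    m = re.search(r'\d{4}', line)
    if m:
        module_hits.append(m.group())

print(compiled_hits == module_hits, compiled_hits)
```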

In conclusion, to use the re module in Python, create a compiled object with the regex you need and call the matching method that suits your task.

Congratulations! 🦄

Happy New Year 2023! 🐥 See you soon with more articles, and keep on learning.
