Re Regular Expression in Python

re

A regular expression (or RegEx) specifies a set of strings that matches it.

A regex is a sequence of characters that defines a search pattern, mainly for the use of string pattern matching.

The re.search() function scans through a string looking for the first location where the regex pattern produces a match.
It returns a Match object if a match is found, or None if no position in the string matches the pattern.

Code

>>> import re
>>> print(bool(re.search(r"ly", "similarly")))
True

The re.match() function only matches at the beginning of the string.
It returns a Match object if the start of the string matches the pattern, or None otherwise.
Code

>>> import re
>>> print(bool(re.match(r"ly", "similarly")))
False
>>> print(bool(re.match(r"ly", "ly should be in the beginning")))
True
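
The difference between the two functions is easier to see when the returned object is inspected rather than converted to bool. The following is a minimal sketch; the variable names found and missing are only illustrative:

```python
import re

# re.search() scans the whole string; re.match() only tries position 0.
found = re.search(r"ly", "similarly")
print(found.group())   # the text that matched
print(found.span())    # (start, end) offsets of the match

# re.match() returns None here, so check before using the result.
missing = re.match(r"ly", "similarly")
if missing is None:
    print("no match at the start of the string")
```

Calling group() or span() on None raises an AttributeError, which is why the None check comes first.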

Metacharacters

Metacharacters are characters with a special meaning:

Character	Description	Example
[]	A set of characters	"[a-m]"
\	Signals a special sequence (can also be used to escape special characters)	"\d"
.	Any character (except newline character)	"he..o"
^	Starts with	"^hello"
$	Ends with	"planet$"
*	Zero or more occurrences	"he.*o"
+	One or more occurrences	"he.+o"
?	Zero or one occurrences	"he.?o"
{}	Exactly the specified number of occurrences	"he.{2}o"
|	Either or	"falls|stays"
()	Capture and group
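
As a rough illustration, the example patterns from the table can be tried out as follows. The sample strings and variable names are assumptions chosen for this sketch:

```python
import re

text = "hello planet"

# . matches any single character except a newline
dots = re.search(r"he..o", text).group()

# ^ and $ anchor the pattern to the start and end of the string
starts = bool(re.search(r"^hello", text))
ends = bool(re.search(r"planet$", text))

# [] matches one character from the set; \d matches a digit
letters = re.findall(r"[a-m]", "planet")
digits = re.findall(r"\d", "room 101")

# {2} repeats the preceding item exactly twice; | means "either or"
exact = re.search(r"he.{2}o", text).group()
either = re.findall(r"falls|stays", "The rain in Spain falls mainly")

print(dots, starts, ends, letters, digits, exact, either)
```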

Special Characters

^ | Matches the expression to its right at the start of a string. In multi-line mode, it also matches immediately after each \n in the string.

$ | Matches the expression to its left at the end of a string, or just before a \n that ends the string. In multi-line mode, it also matches before each \n in the string.

. | Matches any character except a newline (\n), unless the s (dot matches all) flag is set.

\ | Escapes special characters or denotes character classes.

A|B | Matches expression A or B. If A is matched first, B is left untried.

+ | Greedily matches the expression to its left 1 or more times.

* | Greedily matches the expression to its left 0 or more times.

? | Greedily matches the expression to its left 0 or 1 times. But if ? is added to a quantifier (+, *, or ? itself), that quantifier matches in a non-greedy manner.

{m} | Matches the expression to its left exactly m times.

{m,n} | Greedily matches the expression to its left from m to n times, and not less than m.

{m,n}? | Non-greedy version of {m,n}: matches the expression to its left at least m and at most n times, preferring as few repetitions as possible. See ? above.
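
The following sketch shows how multi-line mode changes ^ and how greedy and non-greedy quantifiers differ. The sample strings are assumptions made up for the illustration:

```python
import re

lines = "one\ntwo\nthree"

# Without re.MULTILINE, ^ only matches at the very start of the string.
first_only = re.findall(r"^\w+", lines)
# With re.MULTILINE, ^ also matches right after each newline.
every_line = re.findall(r"^\w+", lines, re.MULTILINE)

html = "<b>bold</b>"
greedy = re.search(r"<.+>", html).group()    # takes as much as possible
lazy = re.search(r"<.+?>", html).group()     # takes as little as possible

most = re.search(r"a{2,4}", "aaaaa").group()    # up to 4 repetitions
fewest = re.search(r"a{2,4}?", "aaaaa").group() # stops at 2 repetitions

print(first_only, every_line, greedy, lazy, most, fewest)
```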

Character Classes (a.k.a. Special Sequences)

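\w | Matches word characters: a-z, A-Z, 0-9, and the underscore, _. For str patterns in Python 3, it also matches other Unicode letters and digits unless the ASCII flag is set.

\d | Matches digits, which means 0-9 (and, for str patterns in Python 3, other Unicode decimal digits unless the ASCII flag is set).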

\D | Matches any non-digits.

\s | Matches whitespace characters, which include the \t, \n, \r, \f, \v, and space characters.

\S | Matches non-whitespace characters.

\b | Matches the boundary (or empty string) at the start and end of a word, that is, between \w and \W.

\B | Matches where \b does not, that is, at any position that is not a word boundary (for example, between two \w characters).

\A | Matches the expression to its right at the absolute start of a string whether in single or multi-line mode.

\Z | Matches the expression to its left at the absolute end of a string whether in single or multi-line mode.
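
A short sketch of how some of these sequences behave in practice. The sample strings and variable names are assumptions for illustration only:

```python
import re

text = "Order 42 shipped on 2022-10-25"

numbers = re.findall(r"\d+", text)               # runs of digits
tokens = re.findall(r"\S+", "a  b\tc\n")         # runs of non-whitespace

# \b requires a word boundary, \B requires the absence of one.
whole_words = re.findall(r"\bcat\b", "cat concat cat.")
inside_word = re.findall(r"\Bcat", "cat concat")

# For str patterns, \w is Unicode-aware unless re.ASCII is passed.
unicode_word = re.findall(r"\w+", "caf\u00e9")
ascii_word = re.findall(r"\w+", "caf\u00e9", re.ASCII)

print(numbers, tokens, whole_words, inside_word, unicode_word, ascii_word)
```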

Sets

[ ] | Contains a set of characters to match.

[amk] | Matches either a, m, or k. It does not match amk.

[a-z] | Matches any lowercase letter from a to z.

[a\-z] | Matches a, -, or z. It matches - because \ escapes it.

[a-] | Matches a or -, because - is not being used to indicate a series of characters.

[-a] | As above, matches a or -.

[a-z0-9] | Matches characters from a to z and also from 0 to 9.

[(+*)] | Special characters become literal inside a set, so this matches (, +, *, and ).

[^ab5] | Adding ^ excludes any character in the set. Here, it matches characters that are not a, b, or 5.
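
The set rules above can be checked with a few findall() calls. This is only a sketch; the input strings are assumptions picked to exercise each rule:

```python
import re

# Each set matches exactly one character at a time.
single_chars = re.findall(r"[amk]", "amk")

# An escaped hyphen, or a hyphen at the edge of the set, is literal.
escaped_dash = re.findall(r"[a\-z]", "a-z b")
edge_dash = re.findall(r"[-a]", "a-b")

# Metacharacters such as ( + * ) lose their special meaning in a set.
literals = re.findall(r"[(+*)]", "(1+2)*3")

# A leading ^ negates the set.
negated = re.findall(r"[^ab5]", "ab5c6")

print(single_chars, escaped_dash, edge_dash, literals, negated)
```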

Groups

( ) | Matches the expression inside the parentheses and groups it.

(? ) | Inside parentheses like this, ? acts as an extension notation. Its meaning depends on the character immediately to its right.

(?P<name>AB) | Matches the expression AB, and it can be accessed with the group name.

(?aiLmsux) | Here, a, i, L, m, s, u, and x are flags:

a — Matches ASCII only
i — Ignore case
L — Locale dependent
m — Multi-line
s — Dot matches all (including newline)
u — Matches unicode
x — Verbose

(?:A) | Matches the expression as represented by A, but unlike (?P<name>AB), it cannot be retrieved afterwards.

(?#...) | A comment. Contents are for us to read, not for matching.

A(?=B) | Lookahead assertion. This matches the expression A only if it is followed by B.

A(?!B) | Negative lookahead assertion. This matches the expression A only if it is not followed by B.

(?<=B)A | Positive lookbehind assertion. This matches the expression A only if B is immediately to its left. B must be a fixed-length expression.

(?<!B)A | Negative lookbehind assertion. This matches the expression A only if B is not immediately to its left. B must be a fixed-length expression.

(?P=name) | Matches the expression matched by an earlier group named “name”.

(...)\1 | A backreference. The number 1 refers to the first capturing group, and \1 matches the same text that group matched, so you do not have to write the expression out again. Groups can be referenced this way by number, from 1 up to 99.
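
The group extensions above can be sketched as follows. The sample strings (prices, currencies, repeated words) are assumptions invented for this illustration:

```python
import re

prices = "10 USD, 20 EUR, 30 USD"

# Lookahead: digits only if followed (or not followed) by " USD".
usd = re.findall(r"\d+(?= USD)", prices)
not_usd = re.findall(r"\b\d+\b(?! USD)", prices)

# Lookbehind: digits only if preceded (or not preceded) by "$".
with_dollar = re.findall(r"(?<=\$)\d+", "cost $15 or 20")
without_dollar = re.findall(r"(?<!\$)\b\d+", "cost $15 or 20")

# Backreferences: \1 by number, (?P=name) by name.
doubled_letter = re.search(r"(\w)\1", "hello").group()
doubled_word = re.search(r"(?P<word>\b\w+)\s+(?P=word)", "this is is a test")

# (?:...) groups without capturing; (?i) turns on ignore-case.
repeats = re.findall(r"(?:ab)+", "ababab ab")
any_case = re.findall(r"(?i)python", "Python PYTHON python")

print(usd, not_usd, with_dollar, without_dollar)
print(doubled_letter, doubled_word.group("word"), repeats, any_case)
```

Because (?:ab)+ has no capturing group, findall() returns the whole matches rather than the group contents.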

Popular Python `re` Module Functions

re.findall(A, B) | Matches all instances of an expression A in a string B and returns them in a list.

re.search(A, B) | Matches the first instance of an expression A in a string B, and returns it as a Match object (or None if there is no match).

re.split(A, B) | Splits a string B into a list using the delimiter pattern A.

re.sub(A, B, C) | Replaces every match of A with B in the string C.
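
A minimal sketch of the four functions side by side, using a made-up input string:

```python
import re

text = "apples: 3, pears: 12, plums: 7"

all_numbers = re.findall(r"\d+", text)        # every match, as a list of strings
first_number = re.search(r"\d+", text)        # first match only, as a Match object
parts = re.split(r",\s*", text)               # split on a comma plus optional spaces
masked = re.sub(r"\d+", "N", text)            # replace every match with "N"

print(all_numbers)
print(first_number.group())
print(parts)
print(masked)
```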

Introduction to the Python regex capturing groups

Suppose you have the following path that shows the news with the id 100 on a website:

news/100

The following regular expression matches the above path:

\w+/\d+

Note that the above regular expression also matches any path that starts with one or more word characters, e.g., posts, todos, etc. not just news.

In this pattern:

\w+ is a word character set with a quantifier (+) that matches one or more word characters.
/ matches the forward slash / character.
\d+ is a digit character set with a quantifier (+) that matches one or more digits.

The following program uses the \w+/\d+ pattern to match the string 'news/100':

import re

s = 'news/100'
# A raw string (r'...') passes the backslashes to the regex engine unchanged.
pattern = r'\w+/\d+'

matches = re.finditer(pattern,s)
for match in matches:
    print(match)

Output:

<re.Match object; span=(0, 8), match='news/100'>

It shows one match as expected.

To get the id from the path, you use a capturing group. To define a capturing group for a pattern, you place the rule in parentheses:

(rule)

For example, to create a capturing group that captures the id from the path, you use the following pattern:

'\w+/(\d+)'

In this pattern, we place the rule \d+ inside the parentheses (). If you run the program with the new pattern, you’ll see that it displays one match:

import re

s = 'news/100'
pattern = r'\w+/(\d+)'

matches = re.finditer(pattern, s)
for match in matches:
    print(match)

Output:

<re.Match object; span=(0, 8), match='news/100'>

To get the capturing groups from a match, you use the group() method of the Match object:

match.group(index)

The group(0) will return the entire match while the group(1), group(2), etc., return the first, second, … group.

The lastindex attribute of the Match object returns the index of the last matched capturing group (or None if no group was matched). The following program shows the entire match (group(0)) and all the subgroups:

import re

s = 'news/100'
pattern = r'\w+/(\d+)'

matches = re.finditer(pattern, s)
for match in matches:
    for index in range(0, match.lastindex + 1):
        print(match.group(index))

Output:

news/100
100

In the output, the news/100 is the entire match while 100 is the subgroup.
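
One caveat with looping up to lastindex is that it is None when the pattern has no capturing groups, so lastindex + 1 would raise a TypeError. A possible alternative, sketched here with illustrative variable names, is the groups() method, which returns all subgroups as a tuple:

```python
import re

s = 'news/100'

# With one capturing group, groups() returns a one-element tuple.
with_group = re.search(r'\w+/(\d+)', s)
print(with_group.groups())

# With no capturing groups, lastindex is None and groups() is empty.
no_group = re.search(r'\w+/\d+', s)
print(no_group.lastindex)
print(no_group.groups())
```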

If you want to capture also the resource (news) in the path (news/100), you can create an additional capturing group like this:

'(\w+)/(\d+)'

In this pattern, we have two capturing groups: one for \w+ and the other for \d+. The following program shows the entire match and all the subgroups:

import re

s = 'news/100'
pattern = r'(\w+)/(\d+)'

matches = re.finditer(pattern, s)
for match in matches:
    for index in range(0, match.lastindex + 1):
        print(match.group(index))

Output:

news/100
news
100

In the output, the news/100 is the entire match while news and 100 are the subgroups.

Named capturing groups

By default, you can access a subgroup in a match using an index, for example, match.group(1). Sometimes, accessing a subgroup by a meaningful name is more convenient.

You use the named capturing group to assign a name to a group. The following shows the syntax for assigning a name to a capturing group:

(?P<name>rule)

In this syntax:

() indicates a capturing group.
?P<name> specifies the name of the capturing group.
rule is a rule in the pattern.

For example, the following pattern names its two capturing groups resource and id:

'(?P<resource>\w+)/(?P<id>\d+)'

In this syntax, the resource is the name for the first capturing group and the id is the name for the second capturing group.

To get all the named subgroups of a match, you use the groupdict() method of the Match object. For example:

import re

s = 'news/100'
pattern = r'(?P<resource>\w+)/(?P<id>\d+)'

matches = re.finditer(pattern, s)
for match in matches:
    print(match.groupdict())

Output:

{'resource': 'news', 'id': '100'}

In this example, the groupdict() method returns a dictionary where the keys are group names and values are the subgroups.
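
A single named subgroup can also be read directly, without building the whole dictionary. The sketch below assumes Python 3.6 or later for the match['name'] form:

```python
import re

s = 'news/100'
pattern = r'(?P<resource>\w+)/(?P<id>\d+)'

match = re.search(pattern, s)

by_name = match.group('resource')   # access by group name
by_key = match['id']                # subscript form (Python 3.6+)
by_index = match.group(2)           # named groups still have an index

print(by_name, by_key, by_index)
```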

Another named capturing group example

The following pattern:

\w+/\d{4}/\d{2}/\d{2}

matches this path:

news/2021/12/31

And you can add the named capturing groups to the pattern like this:

'(?P<resource>\w+)/(?P<year>\d{4})/(?P<month>\d{1,2})/(?P<day>\d{1,2})'

This program uses the pattern to match the path and shows all the subgroups:

import re

s = 'news/2021/12/31'
pattern = r'(?P<resource>\w+)/(?P<year>\d{4})/(?P<month>\d{1,2})/(?P<day>\d{1,2})'

matches = re.finditer(pattern, s)
for match in matches:
    print(match.groupdict())

Output:

{'resource': 'news', 'year': '2021', 'month': '12', 'day': '31'}
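
The subgroups are always strings, so they usually need converting before further use. As one possible follow-up (the use of datetime.date and the name published are assumptions of this sketch, not part of the original example), the year, month, and day can be turned into a date object:

```python
import re
from datetime import date

s = 'news/2021/12/31'
pattern = r'(?P<resource>\w+)/(?P<year>\d{4})/(?P<month>\d{1,2})/(?P<day>\d{1,2})'

match = re.search(pattern, s)
parts = match.groupdict()

# Convert the captured strings to integers before building the date.
published = date(int(parts['year']), int(parts['month']), int(parts['day']))
print(parts['resource'], published.isoformat())
```

The pattern only checks the number of digits, so a path such as news/2021/13/40 would still match and date() would then raise a ValueError.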

Summary

Place a rule of a pattern inside parentheses () to create a capturing group.
Use the group() method of the Match object to get the subgroup by an index.
Use the (?P<name>rule) to create a named capturing group for the rule in a pattern.
Use the groupdict() method of the Match object to get the named subgroups as a dictionary.

Source : HackerRank, W3School, https://www.dataquest.io/blog/regex-cheatsheet/, https://www.pythontutorial.net/python-regex/python-regex-capturing-group/

Tuesday, 25 October 2022