Regular expressions
Regular expressions are a special, widely supported language for defining text parsing rules. They are commonly used for matching parts of a text, finding specific information, and so on.
It is a very large subject, so we are only going to look at the basic concepts.
A convenient online tool for writing, testing, and experimenting with regular expressions: Regex101
More information: https://docs.python.org/3/library/re.html
A regular expression usually consists of two parts (examples: a+, [a-z]*, .*):
character group (examples: a, [a-z], \w, [\w\d\S], [^a-z], .) - the symbol or group of symbols that should be matched
quantifier (examples: *, +, ?, {5}, {1,3}) - how many times the preceding character group should be matched
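The two parts described above can be tried out with a minimal sketch like the one below. The sample strings are made up for illustration, and re.fullmatch is used on the assumption that the whole string should match the pattern.

```python
import re

# Each pattern is a character group followed by a quantifier.
samples = {
    r"a+": "aaa",        # the character "a", one or more times
    r"[a-z]*": "hello",  # any lowercase Latin letter, zero or more times
    r".*": "any text!",  # any character, zero or more times
}

# re.fullmatch succeeds only if the entire string matches the pattern.
results = {p: re.fullmatch(p, text) is not None for p, text in samples.items()}
print(results)

# Counterexample: "Hello" starts with an uppercase letter,
# so [a-z]* cannot cover the whole string.
uppercase_rejected = re.fullmatch(r"[a-z]*", "Hello") is None
print(uppercase_rejected)
```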
Email regexp pattern example
[\w\.\+\-]+@[a-z0-9\-]+(\.[a-z0-9\-]+)*
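As a rough illustration, the email pattern above can be checked against a couple of strings. The addresses are invented for this sketch, and re.fullmatch is an assumption about how the pattern is meant to be applied (the whole string must be an address).

```python
import re

# The email pattern from the text, compiled once for reuse.
EMAIL_PATTERN = re.compile(r"[\w\.\+\-]+@[a-z0-9\-]+(\.[a-z0-9\-]+)*")

# Hypothetical sample inputs.
addresses = ["john.doe+news@example.com", "not an email"]

# Map each input to True/False depending on whether it matches entirely.
checks = {a: EMAIL_PATTERN.fullmatch(a) is not None for a in addresses}
print(checks)
```

Note that the domain part of this pattern only accepts lowercase letters, so an address with an uppercase domain would be rejected unless a case-insensitive flag such as re.I is added.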
Examples:
. - any single character (except a newline by default)
a - just the character "a"
[abc] - any one of the three characters
[a-z] - any one lowercase letter
[^abc] - any character except "a", "b" or "c"
\w - any word character ([a-zA-Z0-9_]) (\W - anything except word characters)
\d - any digit ([0-9]) (\D - anything except digits)
\s - any whitespace character (\S - anything except whitespace)
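The character classes listed above can be explored with re.findall, which returns every match in a string. This is only a sketch; the sample text is made up.

```python
import re

text = "Room 42, floor 3"

digits = re.findall(r"\d", text)             # every single digit
vowels = re.findall(r"[aeiou]", text)        # one character from a custom list
punctuation = re.findall(r"[^\w\s]", text)   # neither a word character nor whitespace
spaces = re.findall(r"\s", text)             # every whitespace character

print(digits, vowels, punctuation, len(spaces))
```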
Number of occurrences
Examples:
* - any number of times (0 or more)
+ - 1 or more times
? - 0 or 1 time
{n1} - exactly n1 times
{n1,n2} - from n1 to n2 times
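A small sketch of how each quantifier behaves. The helper name matches and the test strings are hypothetical; each row states the result we expect from re.fullmatch.

```python
import re

def matches(pattern, text):
    """Return True if the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

# (pattern, text, expected result)
table = [
    (r"ab*", "a", True),           # * allows zero occurrences of "b"
    (r"ab+", "a", False),          # + requires at least one "b"
    (r"ab+", "abbb", True),
    (r"colou?r", "color", True),   # ? makes the "u" optional
    (r"colou?r", "colour", True),
    (r"\d{4}", "2024", True),      # exactly four digits
    (r"\d{4}", "202", False),
    (r"\d{1,3}", "12", True),      # from one to three digits
    (r"\d{1,3}", "1234", False),
]

outcomes = [matches(pattern, text) for pattern, text, _ in table]
print(outcomes)
```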
Anchors | Character classes | Quantifiers | Groups and Ranges |
---|---|---|---|
To mark a position in the text | To describe target characters to match/search | To specify the number of occurrences of the target character class | To set a custom character range or list |
.at matches any three-character string ending with "at", including "hat", "cat", and "bat".
[hc]at matches "hat" and "cat".
[^b]at matches all strings matched by .at except "bat".
[^hc]at matches all strings matched by .at other than "hat" and "cat".
^[hc]at matches "hat" and "cat", but only at the beginning of the string or line.
[hc]at$ matches "hat" and "cat", but only at the end of the string or line.
\[.\] matches any single character surrounded by "[" and "]" since the brackets are escaped, for example: "[a]" and "[b]".
s.* matches s followed by zero or more characters, for example: "s" and "saw" and "seed".
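The examples above can be reproduced with a short sketch. The sample string is invented for illustration.

```python
import re

words = "hat cat bat [a] seed"

any_at = re.findall(r".at", words)        # any character followed by "at"
h_or_c = re.findall(r"[hc]at", words)     # only "hat" and "cat"
not_b = re.findall(r"[^b]at", words)      # everything .at finds except "bat"
bracketed = re.findall(r"\[.\]", words)   # escaped brackets around one character
s_rest = re.findall(r"s.*", words)        # "s" followed by the rest of the line

# Anchors tie the match to the start or the end of the string.
starts_with = re.search(r"^[hc]at", words)   # the string begins with "hat"
ends_with = re.search(r"[hc]at$", words)     # the string ends with "seed", so no match

print(any_at, h_or_c, not_b, bracketed, s_rest)
print(starts_with is not None, ends_with is None)
```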
^\d+$ - non-negative integer (digits only)
^-\d+$ - negative integer
^\+?[\d\s]+(?:\(\d{3}\))?[\d\s-]{4,}$ - phone
^(?:19|20)\d{2}$ - year 1900-2099
^[\w\d_\.]{4,}$ - username
^[\w.']{2,}(\s[\w.']{2,})+$ - personal name
^.{6,}$ - password, 6 characters minimum
^.{6,}$|^$ - password or empty string
^([a-z][a-z0-9-]+\.)+([a-z]{2,6})$ - domain name
^[_]*([a-z0-9]+(\.|_*)?)+@([a-z][a-z0-9-]+\.)+[a-z]{2,6}$ - email
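A few of the validation patterns above, applied to made-up inputs. This is only a sketch: the constant names and sample values are hypothetical, and real-world validation usually needs stricter rules than these patterns provide.

```python
import re

# Patterns copied from the list above.
YEAR = re.compile(r"^(?:19|20)\d{2}$")
USERNAME = re.compile(r"^[\w\d_\.]{4,}$")
PASSWORD_OR_EMPTY = re.compile(r"^.{6,}$|^$")
DOMAIN = re.compile(r"^([a-z][a-z0-9-]+\.)+([a-z]{2,6})$")

checks = {
    "year 1999": bool(YEAR.search("1999")),
    "year 2100": bool(YEAR.search("2100")),              # outside 1900-2099
    "username ab": bool(USERNAME.search("ab")),          # shorter than 4 characters
    "username john_doe.1": bool(USERNAME.search("john_doe.1")),
    "password empty": bool(PASSWORD_OR_EMPTY.search("")),
    "password abc": bool(PASSWORD_OR_EMPTY.search("abc")),  # too short, but not empty
    "domain docs.python.org": bool(DOMAIN.search("docs.python.org")),
}

for name, ok in checks.items():
    print(name, ok)
```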
The re module contains all the regexp-related functions.
A pattern is one or more regular expressions describing the structure of the text that needs to be parsed.
Patterns in Python are usually defined as raw strings, for example: r"\d+[abc]{2,3}". This way \ can be used in regular expressions without extra escaping.
An example that illustrates raw strings:
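The sketch below compares an ordinary string literal with a raw one and then uses a raw-string pattern. The sample text is made up for illustration.

```python
import re

plain = "\\d+"   # ordinary string literal: the backslash has to be doubled
raw = r"\d+"     # raw string literal: the backslash is kept as typed
same = plain == raw
print(same)      # both literals produce the same three-character string

# "\n" is a single newline character, while r"\n" is a backslash followed by "n".
lengths = (len("\n"), len(r"\n"))
print(lengths)

# One or more digits followed by two or three letters from "abc".
found = re.findall(r"\d+[abc]{2,3}", "12ab 7c 345cba")
print(found)
```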
The re module can compile a regex pattern, which makes its repeated use faster.
It is also worth understanding the difference between the match and search methods:
match - tries to match the pattern at the beginning of the string.
search - scans the string for the first place where the pattern matches.
That is why search is preferred in most cases when we are looking for some text.
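A minimal sketch of the difference between match and search, using a compiled pattern. The sample text is invented.

```python
import re

text = "order id: 12345"
number = re.compile(r"\d+")   # compiled once, reusable many times

at_start = number.match(text)    # the text does not start with a digit
anywhere = number.search(text)   # finds the first run of digits anywhere in the text

print(at_start)
print(anywhere.group())

# match succeeds when the pattern fits at the very beginning of the text.
prefix = re.match(r"order", text)
print(prefix.group())
```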
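Main re methods:
compile(pattern) - compiles the pattern into a reusable pattern object
match(pattern, text) - matches the pattern at the beginning of the text
search(pattern, text) - finds the first place in the text where the pattern matches
findall(pattern, text) - returns a list of all non-overlapping matches
finditer(pattern, text) - returns an iterator over match objects for all non-overlapping matches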
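The methods listed above, side by side in one short sketch. The text and patterns are made up for illustration.

```python
import re

text = "cat bat hat"

compiled = re.compile(r"[cbh]at")                      # reusable pattern object
first_at_start = re.match(r"[cbh]at", text).group()    # match at the beginning
first_anywhere = re.search(r"[bh]at", text).group()    # first match anywhere
all_matches = compiled.findall(text)                   # list of matched strings
# finditer yields match objects, which also carry positions.
positions = [(m.group(), m.start()) for m in compiled.finditer(text)]

print(first_at_start, first_anywhere)
print(all_matches)
print(positions)
```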
Groups are needed to refer to parts of the matched text and extract information from them:
(...) - creates a numbered group (numbering starts from 1)
(?P<gr_name>...) - creates a named group ("gr_name" in this example)
(?:...) - creates a non-capturing group (usually used just for grouping inside the regexp)
If re.match or re.search does not match the pattern, it returns None.
If the pattern matches, it returns a re.Match object that holds information about the groups found, the positions of the matched text, etc.
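A match object has the following methods:
groups() - returns a tuple with all captured groups
group() - returns the whole match, or a specific group when given its number or name
start(), end() - return the start and end positions of the match
span() - returns the (start, end) positions as a tuple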
An example of using groups and a match object:
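The sketch below combines a named group, a numbered group, and a non-capturing group, then reads the result through the match object. The sample text and group names are hypothetical.

```python
import re

text = "Released on 2024-03-15 (stable)"

# year: named group; month: numbered group; day: non-capturing group.
pattern = re.compile(r"(?P<year>\d{4})-(\d{2})-(?:\d{2})")

m = pattern.search(text)
if m is not None:                 # search returns None when nothing matches
    whole = m.group()             # the entire matched text
    year_by_name = m.group("year")
    year_by_number = m.group(1)   # named groups are numbered too
    month = m.group(2)
    captured = m.groups()         # the non-capturing day group is not included
    span = m.span()               # same as (m.start(), m.end())
    print(whole, year_by_name, month, captured, span)
```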
An example of re-using a previously found group in the regexp. Here we try to find the username and password for the main account (which is defined by the main_user config option):
NOTE: we use the re.S (single-line, also known as re.DOTALL) flag to make . match any character, including the newline \n.
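One possible way to do this is sketched below. The config layout, the section names, and the passwords are all invented for illustration; only the main_user option name comes from the text. The named group is re-used inside the same pattern with (?P=user).

```python
import re

# Hypothetical config text; the layout is an assumption made for this sketch.
config = """\
main_user = alice

[bob]
password = hunter2

[alice]
password = s3cret
"""

pattern = re.compile(
    r"main_user\s*=\s*(?P<user>\w+)"        # remember the main user's name
    r".*?"                                  # skip ahead lazily; re.S lets . cross newlines
    r"\[(?P=user)\]\s*"                     # the section named after that same user
    r"password\s*=\s*(?P<password>\S+)",    # the password inside that section
    re.S,
)

match = pattern.search(config)
print(match.group("user"), match.group("password"))

# Without re.S the dot cannot cross line breaks, so the same pattern finds nothing.
without_flag = re.search(pattern.pattern, config)
print(without_flag)
```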
### Multiple matching
If we want to find all occurrences of text matching the given pattern, we should use re.findall. It returns a list of all non-overlapping matches in the string.
If one or more capturing groups are present in the pattern, it returns a list of groups; this will be a list of tuples if the pattern has more than one group.
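A short sketch of how the result of re.findall changes with the number of capturing groups. The sample string is made up.

```python
import re

log = "alice=10, bob=25, carol=7"

no_groups = re.findall(r"\d+", log)             # list of whole matches
one_group = re.findall(r"(\w+)=\d+", log)       # list of the single group's values
two_groups = re.findall(r"(\w+)=(\d+)", log)    # list of tuples, one item per group

print(no_groups)
print(one_group)
print(two_groups)
```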
Anchor | Description |
---|---|
^ | Start of string, or start of line in multi-line pattern |
$ | End of string, or end of line in multi-line pattern |
\b | Word boundary |

Character class | Description |
---|---|
. | Any character except new line (\n) |
\s | White space |
\S | Not white space |
\d | Digit ([0-9]) |
\D | Not digit |
\w | Word ([a-zA-Z0-9_]) |
\W | Not word |

Quantifier | Description |
---|---|
* | 0 or more ({0,}) |
+ | 1 or more ({1,}) |
? | 0 or 1 ({0,1}) |
{5} | Exactly 5 |
{3,} | 3 or more |
{3,5} | 3, 4 or 5 |

Group/range | Description |
---|---|
(a\|b) | a or b |
[abc] | a or b or c |
[^abc] | Not (a or b or c) |
[a-z] | Lower case letter from a to z (all lowercases) |
[A-Z] | Upper case letter from A to Z (all uppercases) |
[0-9] | Digit from 0 to 9 |

Group/range | Description |
---|---|
(…) | Numbered capturing group (starting from 1) |
(?:…) | Non-capturing group (usually used for grouping in regexp) |
(?P<group_name>…) | Named capturing group with name "group_name" |
\n (for example \1) | Group number n |