Regular Expression (RegEx) in Python
A regular expression (RegEx) is a sequence of characters that defines a search pattern, describing a string or a set of strings.
In this chapter of the Advanced Python series, TipsMake.com walks you through regular expressions with the re module, using concrete examples to make RegEx easier to understand. Let's get started!
For example:
^a...s$
The pattern above defines a RegEx rule: any five-character string that starts with a and ends with s.
Pattern: ^a...s$
- abs: no match, because it has only 3 characters.
- alias: match.
- abyss: match.
- Alias: no match, because it starts with an uppercase A.
- An abacus: no match, because it starts with an uppercase A and has more than 5 characters.
Regular expressions in Python are provided by the re module, so the first step is to import that module into your program. Try it with the example above:
    import re

    pattern = '^a...s$'
    test_string = 'abyss'
    result = re.match(pattern, test_string)

    if result:
        print("Search successful.")
    else:
        print("Search unsuccessful.")
Here we used the re.match() function to check whether test_string matches the pattern, starting from the beginning of the string. The function returns a match object if the search succeeds and None otherwise.
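To see how the rule behaves on several inputs at once, the following sketch runs re.match() over the sample strings from the table. The names candidates and results are only illustrative.

```python
import re

# Five-character strings that start with "a" and end with "s".
pattern = r'^a...s$'

# Illustrative test strings taken from the table above.
candidates = ['abs', 'alias', 'abyss', 'Alias', 'An abacus']

# re.match() returns a match object on success and None otherwise.
results = {text: re.match(pattern, text) is not None for text in candidates}

for text, matched in results.items():
    print(f'{text!r}: {"match" if matched else "no match"}')
```

Only alias and abyss are reported as matches, which agrees with the table.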
Most languages and tools support regular expressions, including JavaScript, C#, Java, PHP, Ruby, SQL, Oracle and Perl, and they are especially common on Unix/Linux.
The re module has several other functions for working with RegEx. Before diving into them, let's learn more about the syntax of regular expressions.
Pattern syntax used in Python RegEx
A pattern describes the text to search for (in Python, a pattern object is the compiled version of a regular expression). To specify a regular expression, we use special characters called metacharacters:
[] . ^ $ * + ? {} () \ |
The example above used two of them: ^ and $.
Square brackets []
Square brackets are used to represent the set of characters you want to match.
Pattern: [abc]
- a: 1 match (a).
- ac: 2 matches (a and c).
- Hey Jude: no match.
Here, [abc] matches if the string you pass contains any of the characters a, b or c.
You can also specify a range of characters using - inside the square brackets.
- [a-e] is the same as [abcde].
- [1-4] is the same as [1234].
- [0-39] is the same as [01239].
If the first character of the set is the caret ^, the set is inverted: it matches any character that is not listed in it.
- [^abc] matches any character except a, b or c.
- [^0-9] matches any non-digit character.
Most special characters lose their special meaning inside [] and are treated as ordinary characters.
- [(+)] matches any of the characters (, + or ).
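The character-set rules above can be tried out with re.findall(), which returns every match as a list. This is a small sketch; the sample string text is made up for illustration.

```python
import re

text = 'ab(1+2)9'  # hypothetical sample string

in_set = re.findall(r'[abc]', text)        # any of a, b or c
in_range = re.findall(r'[1-4]', text)      # same as [1234]
mixed = re.findall(r'[0-39]', text)        # same as [01239]
non_digits = re.findall(r'[^0-9]', text)   # anything that is not a digit
literals = re.findall(r'[(+)]', text)      # (, + and ) as ordinary characters

print(in_set)      # ['a', 'b']
print(in_range)    # ['1', '2']
print(mixed)       # ['1', '2', '9']
print(non_digits)  # ['a', 'b', '(', '+', ')']
print(literals)    # ['(', '+', ')']
```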
Dot .
The dot matches any single character except the newline character '\n'.
Pattern: ..
- a: no match, because there is only one character.
- ac: match, because there are two characters.
- acd: match, because there are at least two characters.
Caret ^
The caret symbol ^ matches at the start of a string.
Pattern: ^a
- a: match, because the string starts with a.
- abc: match, because the string starts with a.
- bac: no match, because a is not the first character.
Pattern: ^ab
- abc: match, because the string starts with ab.
- acb: no match: the string starts with a, but the next character is not b.
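As a quick check of the caret examples, the sketch below filters a list of words with re.search(). The list words and the two result names are illustrative only.

```python
import re

words = ['a', 'abc', 'bac', 'acb']

# ^ anchors the pattern to the start of the string.
starts_with_a = [w for w in words if re.search(r'^a', w)]
starts_with_ab = [w for w in words if re.search(r'^ab', w)]

print(starts_with_a)   # ['a', 'abc', 'acb']
print(starts_with_ab)  # ['abc']
```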
Dollar sign $
The dollar sign $ matches at the end of a string.
Pattern: a$
- a: match, because the string ends with a.
- formula: match, because the string ends with a.
- cab: no match, because a is not in the last position.
Asterisk *
The asterisk * matches zero or more occurrences of the character before it, with no upper limit on the number of repetitions.
Pattern: ma*n
- mn: match, because the character before * may be absent.
- man: match, because all the characters appear.
- maaaan: match, because the character before * may repeat many times.
- main: no match, because a is not followed by n.
- woman: match, because it contains man.
Plus sign +
The plus sign + matches one or more occurrences of the character before it, with no upper limit on the number of repetitions.
Pattern: ma+n
- mn: no match, because the character before + does not appear.
- man: match, because all the characters appear.
- maaaan: match, because the character before + may repeat many times.
- main: no match, because a is not followed by n.
- woman: match, because it contains man.
Question mark ?
The question mark ? matches zero or one occurrence of the character before it; the character cannot repeat.
Pattern: ma?n
- mn: match, because the character before ? may be absent.
- man: match, because all the characters appear.
- maaaan: no match, because the character before ? can appear at most once.
- main: no match, because a is not followed by n.
- woman: match, because it contains man.
Braces {}
Braces use the general form {n,m}, meaning that the character before them must appear at least n times and at most m times. n and m are non-negative integers with n <= m.
- If n is omitted, it defaults to 0.
- If m is omitted, there is no upper limit.
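The quantifiers *, +, ? and {n,m} can be compared side by side. The sketch below uses re.fullmatch(), which requires the whole string to match, so it is stricter than the tables in this section (which look for a match anywhere in the string). The names patterns, samples and table are illustrative.

```python
import re

patterns = [r'ma*n', r'ma+n', r'ma?n', r'ma{2,3}n']
samples = ['mn', 'man', 'maan', 'maaaan']

# For each pattern, keep the samples that match it in full.
table = {p: [s for s in samples if re.fullmatch(p, s)] for p in patterns}

for p, matched in table.items():
    print(p, '->', matched)
```

Here ma*n accepts all four samples, ma+n rejects only mn, ma?n accepts mn and man, and ma{2,3}n accepts only maan.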
Pattern: a{2,3}
- abc dat: no match, because a never appears at least twice in a row.
- abc daat: 1 match (aa in daat).
- aabc daaat: 2 matches (aa in aabc and aaa in daaat).
- aabc daaaat: 2 matches (aa in aabc and the first aaa in daaaat).
Let's try another example: the RegEx [0-9]{2,4} matches at least 2 and at most 4 consecutive digits.
Pattern: [0-9]{2,4}
- ab123csde: 1 match (123).
- 12 and 345673: 3 matches (12, 3456 and 73).
- 1 and 2: no match, because each number has only 1 digit.
Vertical bar |
The vertical bar | is used for alternation (the "or" operator): it matches either the expression before it or the expression after it.
Pattern: a|b
- cde: no match, because neither a nor b appears.
- ade: 1 match (a).
- acdbea: 3 matches (a, b and a).
Here, a|b matches any string containing a or b.
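A short sketch of alternation with re.findall(), using the three strings from the table. The dictionary name found is illustrative.

```python
import re

# Collect every occurrence of "a" or "b" in each sample string.
found = {text: re.findall(r'a|b', text) for text in ['cde', 'ade', 'acdbea']}

for text, hits in found.items():
    print(text, '->', hits)

# Alternation also works between longer alternatives.
pets = re.findall(r'cat|dog', 'cat and dog')
print(pets)  # ['cat', 'dog']
```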
Parentheses ()
Parentheses () are used to group sub-patterns; the string must match the regular expression inside the parentheses.
For example, (a|b|c)xz matches any string that has a, b or c immediately followed by xz.
Pattern: (a|b|c)xz
- ab xz: no match: a and b appear, but neither is immediately followed by xz.
- abxz: 1 match (bxz), because b appears right before xz.
- axz cabxz: 2 matches (axz and bxz), because a and b each appear right before xz.
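The grouping example can be sketched with re.finditer(), which yields one match object per match: group(0) is the whole match and group(1) is the part captured by the parentheses. The variable names are illustrative.

```python
import re

pattern = r'(a|b|c)xz'

# Pair each full match with the text captured by the group.
matches = [(m.group(0), m.group(1)) for m in re.finditer(pattern, 'axz cabxz')]
print(matches)  # [('axz', 'a'), ('bxz', 'b')]

# No letter from the group is directly followed by "xz" here.
no_match = re.search(pattern, 'ab xz')
print(no_match)  # None
```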
Backslash \
A backslash is used to escape special characters: placed before a metacharacter, it turns it into an ordinary character, so you can search for that character in a string like any other.
For example, \$a matches a string that contains $ followed by a. Here, the RegEx engine does not interpret $ as the end-of-string anchor; it is just an ordinary character.
Conversely, a backslash can also turn the ordinary character after it into a special one.
For example, the letter b without a backslash matches the lowercase character b, but with a backslash, \b becomes a special sequence that no longer matches the letter b.
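The effect of escaping can be sketched as follows. The sample string text is made up; re.escape() is a standard-library helper that adds the backslashes for you.

```python
import re

text = 'cost: $a or $b'  # hypothetical sample string

# Escaped: \$ is a literal dollar sign.
escaped = re.findall(r'\$a', text)

# Unescaped: $ means "end of string", so nothing can follow it.
unescaped = re.findall(r'$a', text)

# re.escape() builds the escaped pattern from plain text.
auto_pattern = re.escape('$a')

print(escaped)       # ['$a']
print(unescaped)     # []
print(auto_pattern)  # \$a
```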
Some special sequences that use \
1. \A
- Matches if the specified characters are at the start of the string.
Pattern: \Athe
- the sun: match.
- In the sun: no match, because the is not at the start of the string.
2. \b
- Matches if the specified characters are at the beginning or end of a word.
Pattern: \bfoo
- football: match, because foo is at the beginning of the word.
- a football: match, because foo is at the beginning of the second word in the string.
- afootball: no match, because foo is in the middle of a word.
Pattern: foo\b
- the foo: match, because foo is at the end of the string.
- the afoo test: match, because foo is at the end of the second word in the string.
- the afootest: no match, because foo is in the middle of a word.
3. \B
- The opposite of \b: matches if the specified characters are not at the beginning or end of a word.
Pattern: \Bfoo
- football: no match, because foo is at the beginning of the word.
- a football: no match, because foo is at the beginning of the second word in the string.
- afootball: match, because foo is in the middle of a word.
Pattern: foo\B
- the foo: no match, because foo is at the end of the string.
- the afoo test: no match, because foo is at the end of the second word in the string.
- the afootest: match, because foo is in the middle of a word.
4. \d
- Matches any decimal digit, equivalent to [0-9] for ASCII text.
Pattern: \d
- 12abc3: 3 matches (1, 2 and 3).
- Python: no match, because the string contains no digits.
5. \D
- Matches any non-digit character, equivalent to [^0-9].
Pattern: \D
- 1ab34"50: 3 matches (a, b and ").
- 1345: no match, because the string consists only of digits.
6. \s
- Matches any whitespace character, equivalent to [ \t\n\r\f\v].
Pattern: \s
- Python RegEx: 1 match, because the string contains a space.
- PythonRegEx: no match, because the string contains no whitespace.
7. \S
- Matches any non-whitespace character, equivalent to [^ \t\n\r\f\v].
Pattern: \S
- a b: 2 matches (a and b).
- A string made up only of whitespace: no match.
8. \w
- Matches any alphanumeric character (letters and digits), equivalent to [a-zA-Z0-9_] for ASCII text.
Note: The underscore _ is also treated as an alphanumeric character.
Pattern: \w
- 12&": ;c: 3 matches (1, 2 and c).
- %"> !: no match, because the string contains no alphanumeric characters.
9. \W
- Matches any non-alphanumeric character, equivalent to [^a-zA-Z0-9_].
Pattern: \W
- 1a2%c: 1 match (%).
- Python: no match, because the string contains only alphanumeric characters.
Note: The underscore _ is also treated as an alphanumeric character, so \W does not match it.
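The special sequences listed above can be tried together in one sketch. The sample strings text and phrase are made up for illustration.

```python
import re

text = 'Room 42, floor_3!'                  # hypothetical sample string
phrase = 'football a football afootball'    # hypothetical sample string

digits = re.findall(r'\d', text)        # every digit
words = re.findall(r'\w+', text)        # runs of letters, digits and _
non_word = re.findall(r'\W', text)      # everything \w does not match
spaces = re.findall(r'\s', text)        # whitespace characters
at_start = bool(re.search(r'\ARoom', text))   # "Room" at the very start?
word_start = re.findall(r'\bfoo', phrase)     # foo at the start of a word
inside_word = re.findall(r'\Bfoo', phrase)    # foo not at the start of a word

print(digits)       # ['4', '2', '3']
print(words)        # ['Room', '42', 'floor_3']
print(non_word)     # [' ', ',', ' ', '!']
print(spaces)       # [' ', ' ']
print(at_start)     # True
print(word_start)   # ['foo', 'foo']
print(inside_word)  # ['foo']
```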
Tip: To build and test regular expressions, you can use a RegEx tester such as regex101. Such a tool not only helps you build regular expressions but also helps you learn them more thoroughly.
Now that you understand the basics of RegEx, let's discuss how to use RegEx in Python code.
Regular Expression in Python
Regular expressions in Python are provided by the re module, so before using them you need to import the module into your program.
    import re
This module has many functions and constants for working with RegEx. TipsMake.com lists the most commonly used ones below, each with an example, so you can easily picture how they work.
re.findall()
The re.findall() function returns a list of strings containing all matches of the pattern.
Syntax:
    re.findall(pattern, string)
Where:
- pattern is the RegEx.
- string is the string to search.
Example: extract the numbers from the string "hello 12 hi 89. Howdy 34":
    import re

    string = 'hello 12 hi 89. Howdy 34'
    pattern = r'\d+'

    result = re.findall(pattern, string)
    print(result)
Result:
    ['12', '89', '34']
re.split()
The re.split() function splits the string at every match of the pattern and returns a list of the resulting substrings.
Syntax:
    re.split(pattern, string, maxsplit)
Where:
- pattern is the RegEx.
- string is the string to split.
- maxsplit (integer) is the maximum number of splits to perform. If omitted, Python splits the string at every match.
Example: split at each whitespace character:
    import re

    string = 'The rain in Vietnam.'
    pattern = r'\s'

    result = re.split(pattern, string)
    print(result)
Result:
    ['The', 'rain', 'in', 'Vietnam.']
Example: split only at the first whitespace character:
    import re

    string = 'The rain in Vietnam.'
    pattern = r'\s'

    result = re.split(pattern, string, maxsplit=1)
    print(result)
Result:
    ['The', 'rain in Vietnam.']
If the pattern is not found, re.split() returns a list containing the original string.
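The behavior of re.split() when nothing matches, and the effect of maxsplit, can be checked with a short sketch. The sample strings are illustrative.

```python
import re

# No digit in the string, so there is nothing to split on:
# the result is a one-element list holding the original string.
no_match = re.split(r'\d', 'The rain in Vietnam.')

# \s+ treats a run of spaces as a single separator.
runs = re.split(r'\s+', 'one  two   three')

# maxsplit=1 stops after the first split.
limited = re.split(r'\s+', 'one  two   three', maxsplit=1)

print(no_match)  # ['The rain in Vietnam.']
print(runs)      # ['one', 'two', 'three']
print(limited)   # ['one', 'two   three']
```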
re.sub()
This is one of the most important functions for working with regular expressions.
re.sub() replaces every match of the pattern in the string with the given replacement and returns the modified string.
Syntax:
    re.sub(pattern, replace, string, count)
Where:
- pattern is the RegEx.
- replace is the content that replaces each match of the pattern.
- string is the string to search.
- count (integer) is the maximum number of replacements. If omitted, it defaults to 0, which means all matches are replaced.
Example: remove all whitespace:
    import re

    # string containing a newline character
    string = 'abc 12de 23 \n f45 6'
    # match runs of whitespace characters
    pattern = r'\s+'
    # empty string
    replace = ''

    new_string = re.sub(pattern, replace, string)
    print(new_string)
Result:
    abc12de23f456
If no match is found, re.sub() returns the original string unchanged.
Example: remove only the first 2 runs of whitespace:
    import re

    # string containing newline characters
    string = 'abc 12de 23 \n f45 6 \n quantrimang website'
    # match runs of whitespace characters
    pattern = r'\s+'
    replace = ''

    new_string = re.sub(pattern, replace, string, count=2)
    print(new_string)
Result:
    abc12de23 
     f45 6 
     quantrimang website
re.subn()
The re.subn() function works like re.sub() above, but it returns a tuple of two values: the new string after replacement and the number of replacements made.
    import re

    # string containing newline characters
    string = 'abc 12de 23 \n f45 6 \n quantrimang website'
    # match runs of whitespace characters
    pattern = r'\s+'
    # empty string
    replace = ''

    new_string = re.subn(pattern, replace, string)
    print(new_string)
Result:
    ('abc12de23f456quantrimangwebsite', 6)
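A compact sketch of re.sub() and re.subn() on a shorter, made-up string: it shows the result when nothing matches, the count argument, and how to unpack the tuple returned by re.subn().

```python
import re

text = 'abc 12 de'  # hypothetical sample string

# No five-digit number in the text: the string comes back unchanged.
unchanged = re.sub(r'\d{5}', '', text)

# count=1 replaces only the first run of whitespace.
first_only = re.sub(r'\s+', '-', text, count=1)

# re.subn() returns (new_string, number_of_replacements).
new_text, replacements = re.subn(r'\s+', '-', text)

print(unchanged)               # abc 12 de
print(first_only)              # abc-12 de
print(new_text, replacements)  # abc-12-de 2
```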
re.search()
The re.search() function scans the string for the first location where the RegEx pattern matches. If the search succeeds, re.search() returns a match object; otherwise it returns None.
Syntax:
    re.search(pattern, string)
Where:
- pattern is the RegEx.
- string is the string to search.
    import re

    string = "TipsMake.com is a website where you can learn Python"

    # Check whether 'TipsMake' is at the start of the string
    match = re.search(r'\ATipsMake', string)

    if match:  # a match exists
        print("Found 'TipsMake' at the start of the string")
    else:
        print("'TipsMake' is not at the start of the string")
Result:
    Found 'TipsMake' at the start of the string
In this example, match holds the match object for the pattern.
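Since re.match() appeared earlier in this article, it may help to contrast it with re.search(): re.match() only checks the beginning of the string, while re.search() scans the whole string. A minimal sketch with a made-up sample string:

```python
import re

text = 'learn Python at TipsMake.com'  # hypothetical sample string

# "Python" is not at the start, so re.match() finds nothing.
from_start = re.match(r'Python', text)

# re.search() scans forward and finds it at index 6.
anywhere = re.search(r'Python', text)

print(from_start)       # None
print(anywhere.span())  # (6, 12)
```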
Match object
Some methods and attributes are commonly used with match objects.
match.group()
The group() method returns the part of the string that matched the pattern.
    import re

    string = '39801 356, 2102 1111'
    pattern = r'(\d{3}) (\d{2})'

    match = re.search(pattern, string)

    if match:  # a match exists
        print(match.group())  # print the result
    else:
        print("No match")

    # Output: 801 35
Here, the variable match contains the match object.
The pattern (\d{3}) (\d{2}) contains two groups, (\d{3}) and (\d{2}). You can get the part of the string matched by each parenthesized group as follows:
    >>> match.group(1)
    '801'
    >>> match.group(2)
    '35'
    >>> match.group(1, 2)
    ('801', '35')
    >>> match.groups()
    ('801', '35')
match.start(), match.end() and match.span()
The start() method returns the index where the matched substring starts. Similarly, end() returns the index where the matched substring ends.
    >>> match.start()
    2
    >>> match.end()
    8
The span() method returns a tuple containing the start and end index of the matched part.
    >>> match.span()
    (2, 8)
match.re and match.string
The re attribute of a match object returns the regular expression object. Similarly, the string attribute returns the string that was passed in.
    >>> match.re
    re.compile('(\\d{3}) (\\d{2})')
    >>> match.string
    '39801 356, 2102 1111'
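The match-object methods above also work on every match, not only the first. The sketch below uses re.finditer() on the same string and pattern; the list name found is illustrative.

```python
import re

string = '39801 356, 2102 1111'
pattern = r'(\d{3}) (\d{2})'

# One tuple per match: full text, (start, end) span, captured groups.
found = [(m.group(), m.span(), m.groups()) for m in re.finditer(pattern, string)]

for text, span, groups in found:
    print(text, span, groups)
# 801 35 (2, 8) ('801', '35')
# 102 11 (12, 18) ('102', '11')
```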
Those are the most commonly used functions and methods in the re module.
Using the r prefix before RegEx
When the prefix r or R is placed before a string literal, the string is a raw string: the backslashes in it are treated as ordinary characters.
For example, '\n' is a newline, while r'\n' is a string of two characters: a backslash followed by n.
As mentioned above, a backslash is used to escape characters. With the r prefix, however, Python treats \ as an ordinary character in the string literal, so the backslash reaches the re module unchanged.
    import re

    string = '\n and \r are escape sequences.'

    result = re.findall(r'[\n\r]', string)
    print(result)

    # Output: ['\n', '\r']
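The difference between a normal and a raw string literal can be checked directly. This is a small sketch; the variable names are illustrative.

```python
import re

newline = '\n'   # one character: a newline
raw = r'\n'      # two characters: a backslash and the letter n

print(len(newline))  # 1
print(len(raw))      # 2

# Without the r prefix, the backslash itself has to be doubled.
same_literal = (raw == '\\n')
with_raw = re.findall(r'\d+', 'a1b22')
without_raw = re.findall('\\d+', 'a1b22')

print(same_literal)             # True
print(with_raw == without_raw)  # True
```

Raw strings are the usual choice for patterns because they avoid doubling every backslash.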