Introduction
Text preprocessing is one of the most important tasks in Natural Language Processing (NLP). For instance, you may want to remove all punctuation marks from text documents before using them for text classification. Similarly, you may want to extract numbers from a text string. Writing manual scripts for such preprocessing tasks requires a lot of effort and is prone to errors. To make these tasks easier, most programming languages provide support for Regular Expressions (aka Regex).
A Regular Expression is a text string that describes a search pattern which can be used to match or replace patterns inside a string with a minimal amount of code. In this tutorial, we will implement different types of regular expressions in the Python language.
To work with regular expressions in Python, you can use the built-in re package. Import it with the following command:
import re
Searching Patterns in a String
One of the most common NLP tasks is to search if a string contains a certain pattern or not. For instance, you may want to perform an operation on the string based on the condition that the string contains a number.
To search for a pattern within a string, the match, search, and findall functions of the re package are used.
The match Function
Initialize a variable text
with a text string as follows:
text = "The film Titanic was released in 1998"
Let's write a regex expression that matches a string of any length and any character:
result = re.match(r".*", text)
The first parameter of the match function is the regular expression that you want to search for. The pattern here is written as a raw string: the r before the opening quote is Python's raw string prefix rather than part of the regex, and it keeps Python from interpreting backslashes inside the pattern. The pattern is enclosed in single or double quotes like any other string.
The above regex will match the text string, since we are trying to match a string of any length and any character. If a match is found, the match function returns a match object. Recent Python 3 versions report its type as re.Match, while older versions report _sre.SRE_Match:
type(result)
Output:
<class 're.Match'>
Now to find the matched string, you can use the following command:
result.group(0)
Output:
'The film Titanic was released in 1998'
If no match is found, the match function returns None.
The previous regex matches a string of any length and any character. It will also match an empty string of length zero. To test this, set the text variable to an empty string:
text = ""
Now, if you execute the same regex again, a match will still be found:
result = re.match(r".*", text)
Since we specified to match the string with any length and any character, even an empty string is being matched.
To match a string with a length of at least 1, the following regex expression is used:
result = re.match(r".+", text)
Here the plus sign specifies that the string should have at least one character.
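The difference between the two quantifiers can be checked directly on an empty string. The following is a minimal sketch; the variable names star_on_empty and plus_on_empty are only illustrative and do not come from the article.

```python
import re

# ".*" accepts zero or more characters, so it matches even an empty string.
star_on_empty = re.match(r".*", "")
# ".+" needs at least one character, so it fails on an empty string.
plus_on_empty = re.match(r".+", "")

print(star_on_empty is not None)  # True
print(plus_on_empty is None)      # True

# A failed match returns None, so test the result before calling group().
text = "The film Titanic was released in 1998"
result = re.match(r".+", text)
if result:
    print(result.group(0))
```

Checking the result before calling group avoids an AttributeError when the pattern does not match.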
Searching Alphabets
The match function can be used to find letters at the start of a string. Let's initialize the text variable with the following text:
text = "The film Titanic was released in 1998"
Now, to find letters, both uppercase and lowercase, we can use the following regex:
result = re.match(r"[a-zA-Z]+", text)
This regex matches any letter from lowercase a to lowercase z or from uppercase A to uppercase Z. The plus sign specifies that there must be at least one such character. Let's print the match found by the above expression:
print(result.group(0))
Output:
The
In the output, you can see that the first word, i.e. The, is returned. This is because the match function only tries to match at the beginning of the string. The regex accepts uppercase and lowercase letters from a to z, and the first three characters of the string satisfy it. After the word The there is a space, which is not a letter, so the matching stops there and the expression returns just The.
However, there is a problem with this. If a string starts with a number instead of a letter, the match function will return None even if there are letters after the number. Let's see this in action:
text = "1998 was the year when the film titanic was released"
result = re.match(r"[a-zA-Z]+", text)
type(result)
Output:
NoneType
In the above script, we have updated the text variable and now it starts with a digit. We then used the match function to search for letters in the string. Though the text string contains letters, None is returned since the match function only matches at the beginning of the string.
To solve this problem we can use the search function.
The search Function
The search function is similar to the match function, i.e. it tries to match the specified pattern. However, unlike the match function, it scans the whole string for the pattern instead of checking only the beginning. Therefore, the search function will return a match even if the string doesn't start with a letter but contains a letter elsewhere, as shown below:
text = "1998 was the year when the film titanic was released"
result = re.search(r"[a-zA-Z]+", text)
print(result.group(0))
Output:
was
The search function returns "was" since this is the first match found in the text string.
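To compare the two functions side by side, here is a small sketch that runs both on the same string. It also illustrates why the character range should be written A-Z: the range A-z additionally covers the ASCII characters that sit between "Z" and "a", such as the underscore. The "__init__" sample string is an assumption chosen only for illustration.

```python
import re

text = "1998 was the year when the film titanic was released"

# match() only tries the pattern at position 0, where it finds a digit.
at_start = re.match(r"[a-zA-Z]+", text)
print(at_start)                          # None

# search() scans forward and stops at the first run of letters.
found = re.search(r"[a-zA-Z]+", text)
print(found.group(0), found.start())     # was 5

# The range A-z also covers characters such as "_" and "^",
# so it is wider than the intended A-Z.
too_wide = re.search(r"[a-zA-z]+", "__init__").group(0)
letters_only = re.search(r"[a-zA-Z]+", "__init__").group(0)
print(too_wide)      # __init__
print(letters_only)  # init
```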
Matching String from the Start
To check if a string starts with a specific word, you can use the caret, i.e. ^, followed by the word to match, with the search function as shown below. Suppose we have the following string:
text = "XYZ 1998 was the year when the film titanic was released"
If we want to find out whether the string starts with "1998", we can use the search
function as follows:
result = re.search(r"^1998", text)
type(result)
Here, None is returned since the text string doesn't contain "1998" directly at the start.
Now let's change the content of the text variable so that "1998" comes first, and then check whether "1998" is found at the beginning. Execute the following script:
text = "1998 was the year when the film titanic was released"
if re.search(r"^1998", text):
    print("Match found")
else:
    print("Match not found")
Output:
Match found
Matching Strings from the End
To check whether a string ends with a specific word or not, we can use the word in the regular expression, followed by the dollar sign. The dollar sign marks the end of the string. Take a look at the following example:
text = "1998 was the year when the film titanic was released"
if re.search(r"1998$", text):
    print("Match found")
else:
    print("Match not found")
In the above script, we tried to find if the text string ends with "1998", which is not the case.
Output:
Match not found
Now if we update the string and add "1998" at the end of the text string, the above script will return "Match found" as shown below:
text = "was the year when the film titanic was released 1998"
if re.search(r"1998$", text):
    print("Match found")
else:
    print("Match not found")
Output:
Match found
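Both anchors can be tried on one string, as in the sketch below. It also shows one detail that is easy to miss: in Python, the dollar sign matches at the end of the string and also just before a newline at the very end, while \Z matches only at the true end. The variable names and the sample line ending in a newline are assumptions added for illustration.

```python
import re

text = "1998 was the year when the film titanic was released"

starts_with_year = re.search(r"^1998", text) is not None
ends_with_year = re.search(r"1998$", text) is not None
print(starts_with_year, ends_with_year)  # True False

# "$" also matches just before a newline at the very end of the string,
# which matters for lines read from a file.
line = "the film titanic was released in 1998\n"
dollar_hit = re.search(r"1998$", line) is not None
print(dollar_hit)   # True

# "\Z" matches only at the true end of the string.
strict_hit = re.search(r"1998\Z", line) is not None
print(strict_hit)   # False
```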
Substituting text in a String
So far we have used regex to find whether a pattern exists in a string. Let's move on to another regex operation: substituting text in a string. The sub function is used for this purpose.
Let's take a simple example of the substitute function. Suppose we have the following string:
text = "The film Pulp Fiction was released in year 1994"
To replace the string "Pulp Fiction" with "Forrest Gump" (another movie released in 1994) we can use the sub
function as follows:
result = re.sub(r"Pulp Fiction", "Forrest Gump", text)
The first parameter to the sub
function is the regular expression that finds the pattern to substitute. The second parameter is the new text that you want as a replacement for the old text and the third parameter is the text string on which the substitute operation will be performed.
If you print the result variable, you will see the new string.
Now let's substitute all the letters in our string with the character "X". Execute the following script:
text = "The film Pulp Fiction was released in year 1994"
result = re.sub(r"[a-z]", "X", text)
print(result)
Output:
TXX XXXX PXXX FXXXXXX XXX XXXXXXXX XX XXXX 1994
It can be seen from the output that all the letters have been replaced except the uppercase ones. This is because we specified a-z only and not A-Z. There are two ways to solve this problem. You can either specify A-Z in the regular expression along with a-z as follows:
result = re.sub(r"[a-zA-Z]", "X", text)
Or you can pass the additional parameter flags to the sub function and set its value to re.I, which makes the match case-insensitive, as follows:
result = re.sub(r"[a-z]", "X", text, flags=re.I)
More details about the different flags can be found in the official Python documentation for the re module.
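As a quick check of the case-insensitive substitution, here is a minimal sketch. It uses re.IGNORECASE, the long name of re.I, and also shows the optional count parameter of re.sub, which the text above does not cover; it limits how many matches are replaced.

```python
import re

text = "The film Pulp Fiction was released in year 1994"

# re.I is a short alias for re.IGNORECASE.
masked = re.sub(r"[a-z]", "X", text, flags=re.IGNORECASE)
print(masked)  # XXX XXXX XXXX XXXXXXX XXX XXXXXXXX XX XXXX 1994

# count limits how many matches are replaced, counting from the left.
first_three = re.sub(r"[a-z]", "X", text, count=3, flags=re.IGNORECASE)
print(first_three)  # XXX film Pulp Fiction was released in year 1994
```

Passing count and flags by keyword keeps the call readable and avoids mixing up their positions.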
Shorthand Character Classes
There are different types of shorthand character classes that can be used to perform a variety of different string manipulation functions without having to write complex logic. In this section we will discuss some of them:
Removing Digits from a String
The regex to find digits in a string is \d. This pattern can be used to remove digits from a string by replacing them with an empty string, as shown below:
text = "The film Pulp Fiction was released in year 1994"
result = re.sub(r"\d", "", text)
print(result)
Output:
The film Pulp Fiction was released in year
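Removing Alphabet Letters from a String
To remove letters, you can replace every match of the [a-z] character class with an empty string and pass the re.I flag so that uppercase letters are matched too: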
text = "The film Pulp Fiction was released in year 1994"
result = re.sub(r"[a-z]", "", text, flags=re.I)
print(result)
Output:
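1994
The letters are removed but the spaces between the words are kept, so the printed string has leading spaces before 1994.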
Removing Word Characters
If you want to remove all the word characters (letters, digits, and the underscore) from a string and keep the remaining characters, you can use the \w pattern in your regex and replace it with an empty string, as shown below:
text = "The film, '@Pulp Fiction' was ? released in % $ year 1994."
result = re.sub(r"\w", "", text)
print(result)
Output:
, '@ ' ? % $ .
The output shows that all the letters and digits have been removed.
Removing Non-Word Characters
To remove all the non-word characters, the \W pattern can be used as follows:
text = "The film, '@Pulp Fiction' was ? released in % $ year 1994."
result = re.sub(r"\W", "", text)
print(result)
Output:
ThefilmPulpFictionwasreleasedinyear1994
From the output, you can see that everything has been removed (even spaces), except the letters and digits. An underscore would also be kept, since it counts as a word character.
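The three shorthand classes used so far can be compared on one short string. This is a small sketch; the sample string "snake_case, 42 items!" is an assumption chosen so that the treatment of the underscore is visible.

```python
import re

text = "snake_case, 42 items!"

# \d matches a single digit.
digits = re.findall(r"\d", text)
print(digits)        # ['4', '2']

# \W matches anything that is not a letter, digit or underscore.
only_word = re.sub(r"\W", "", text)
print(only_word)     # snake_case42items

# \w matches letters, digits and the underscore.
only_nonword = re.sub(r"\w", "", text)
print(repr(only_nonword))
```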
Grouping Multiple Patterns
You can match any one of several characters by listing them inside square brackets, which form a character class. In fact, we did this when we matched uppercase and lowercase letters. Let's list multiple punctuation marks and remove them from a string:
text = "The film, '@Pulp Fiction' was ? released _ in % $ year 1994."
result = re.sub(r"[,@\'?\.$%_]", "", text)
print(result)
Output:
The film Pulp Fiction was released in year 1994
The string in the text variable had multiple punctuation marks, and we listed all of them in the regex using square brackets. The dot and the single quote are escaped with a backslash in this expression, although neither escape is strictly required: inside square brackets the dot is a literal character rather than the any-character operator, and a single quote does not end a string delimited by double quotes.
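To confirm that the escapes are optional inside square brackets, the sketch below runs the substitution with and without them and compares the results. The variable names no_punct and escaped are illustrative only.

```python
import re

text = "The film, '@Pulp Fiction' was ? released _ in % $ year 1994."

# Inside [...] the dot, "$" and "?" are literal characters, and a single
# quote needs no escape inside a double-quoted Python string.
no_punct = re.sub(r"[,@'?.$%_]", "", text)

# The escaped form of the same character class gives the same result.
escaped = re.sub(r"[,@\'?\.$%_]", "", text)

print(no_punct == escaped)  # True
print(no_punct)
```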
Removing Multiple Spaces
Sometimes, multiple spaces appear between words as a result of removing words or punctuation. For instance, in the output of the last example, there are multiple spaces between in and year. These can be collapsed using the \s pattern, which matches a single whitespace character (a space, tab, or newline).
text = "The film Pulp Fiction was released in   year 1994."
result = re.sub(r"\s+", " ", text)
print(result)
Output:
The film Pulp Fiction was released in year 1994.
In the script above we used the expression \s+, which matches one or more whitespace characters, and replaced each such run with a single space.
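The choice of replacement string matters here. The following sketch, using an assumed sample string that mixes spaces, a tab and a newline, contrasts collapsing whitespace into a single space with deleting it entirely.

```python
import re

text = "The film  Pulp\tFiction   was released\nin year 1994."

# \s matches spaces, tabs and newlines; \s+ matches a run of them.
collapsed = re.sub(r"\s+", " ", text)
print(collapsed)  # The film Pulp Fiction was released in year 1994.

# An empty replacement deletes the whitespace instead of collapsing it.
deleted = re.sub(r"\s+", "", text)
print(deleted)    # ThefilmPulpFictionwasreleasedinyear1994.
```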
Removing Spaces from Start and End
Sometimes we have a sentence that starts or ends with a space, which is often not desirable. The following script removes spaces from the beginning of a sentence:
text = " The film Pulp Fiction was released in year 1994"
result = re.sub(r"^\s+", "", text)
print(result)
Output:
The film Pulp Fiction was released in year 1994
Similarly, to remove space at the end of the string, the following script can be used:
text = "The film Pulp Fiction was released in year 1994 "
result = re.sub(r"\s+$", "", text)
print(result)
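The two patterns above can also be combined with the alternation operator so that both ends are trimmed in one call. This is a sketch of one possible approach rather than the only one; for plain whitespace, Python's built-in str.strip gives the same result.

```python
import re

text = "   The film Pulp Fiction was released in year 1994  "

# The alternation matches leading whitespace or trailing whitespace.
trimmed = re.sub(r"^\s+|\s+$", "", text)
print(repr(trimmed))

# For plain whitespace trimming, str.strip() gives the same result.
print(trimmed == text.strip())  # True
```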
Removing a Single Character
Sometimes removing punctuation marks, such as an apostrophe, results in a single character which has no meaning. For instance, if you remove the apostrophe from the word Jacob's and replace it with a space, the resulting string is Jacob s. Here the s makes no sense. Such single characters can be removed using regex as shown below:
text = "The film Pulp Fiction s was b released in year 1994"
result = re.sub(r"\s+[a-zA-Z]\s+", " ", text)
print(result)
Output:
The film Pulp Fiction was released in year 1994
The script replaces any single lowercase or uppercase letter that is surrounded by whitespace with a single space.
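One limitation of this pattern is that it consumes the whitespace on both sides of the letter, so two stray letters in a row are not both removed. The sketch below shows the effect and one possible workaround using a lookahead, which is an addition not covered in the text above. Note that either pattern also removes legitimate one-letter words such as "a" or "I".

```python
import re

text = "The film Pulp Fiction s b was released in year 1994"

# " s " is consumed by the first match, so "b" has no leading space left.
partial = re.sub(r"\s+[a-zA-Z]\s+", " ", text)
print(partial)
# The film Pulp Fiction b was released in year 1994

# The lookahead (?=\s) checks for trailing whitespace without consuming it.
cleaned = re.sub(r"\s+[a-zA-Z](?=\s)", "", text)
print(cleaned)
# The film Pulp Fiction was released in year 1994
```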
Splitting a String
String splitting is another very important operation. Strings can be split using the split function from the re package. The split function returns a list of tokens. Let's split a string wherever one or more whitespace characters are found, as shown below:
text = "The film Pulp Fiction was released in year 1994 "
result = re.split(r"\s+", text)
print(result)
Output:
['The', 'film', 'Pulp', 'Fiction', 'was', 'released', 'in', 'year', '1994', '']
The last element is an empty string because the text ends with a space. Similarly, you can use other regular expressions with the split function. For instance, the following call splits a string wherever a comma is found:
text = "The film, Pulp Fiction, was released in year 1994"
result = re.split(r",", text)
print(result)
Output:
['The film', ' Pulp Fiction', ' was released in year 1994']
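Both outputs above contain leftovers: an empty string in the first and leading spaces in the second. A minimal sketch of how these could be cleaned up follows; the filtering step and the \s*,\s* pattern are suggestions that go beyond the original examples.

```python
import re

words_text = "The film Pulp Fiction was released in year 1994 "
comma_text = "The film, Pulp Fiction, was released in year 1994"

# Filtering drops the empty token produced by the trailing space.
tokens = [tok for tok in re.split(r"\s+", words_text) if tok]
print(tokens)

# Including the surrounding whitespace in the delimiter trims each part.
parts = re.split(r"\s*,\s*", comma_text)
print(parts)  # ['The film', 'Pulp Fiction', 'was released in year 1994']
```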
Finding All Instances
The match function matches only at the beginning of the string, while the search function scans the whole string and returns the first match.
For instance, if we have the following string:
text = "I want to buy a mobile between 200 and 400 euros"
We want to find all the numbers in this string. If we use the search function, only the first occurrence, i.e. 200, will be returned as shown below:
result = re.search(r"\d+", text)
print(result.group(0))
Output:
200
On the other hand, the findall function returns a list that contains all the matches, as shown below:
text = "I want to buy a mobile between 200 and 400 euros"
result = re.findall(r"\d+", text)
print(result)
Output:
['200', '400']
You can see from the output that both "200" and "400" are returned by the findall function.
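Since findall returns strings, the matches usually need converting before they can be used as numbers. The sketch below shows that conversion, together with re.finditer, a related function not covered above that yields match objects and therefore also exposes the position of each match.

```python
import re

text = "I want to buy a mobile between 200 and 400 euros"

# findall returns strings, so convert them before doing arithmetic.
prices = [int(num) for num in re.findall(r"\d+", text)]
print(prices)       # [200, 400]
print(max(prices))  # 400

# finditer yields match objects, which also carry the match positions.
spans = [m.span() for m in re.finditer(r"\d+", text)]
print(spans)
```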
Conclusion
In this article we studied some of the most commonly used regex functions in Python. Regular expressions are extremely useful for preprocessing text that can then be used for a variety of applications, such as topic modeling, text classification, sentiment analysis, and text summarization.