Regular Expressions¶
A RegEx, or Regular Expression, is a sequence of characters that forms a search pattern. RegEx can be used to check if a string contains the specified search pattern, and extract it.
Quick start¶
Suppose you have a column like this:
Address |
---|
876 Highland St Atlanta GA 30301 |
296 Maple St New York City NY 10001 |
906 7th St Portland OR 97035 |
and you want to extract all the city short forms i.e. GA, NY, and OR. It’s tough to extract this using one of the pre-defined extract rules.
Extracting such sub-strings is much easier with RegEx, you just need to write a search pattern.
Here, we can use the expression - [A-Z]{2}. This search pattern looks for two consecutive capital letters in each row of the Address text column and extracts them into a new column or an existing one as specified by the user.
Note
Here, [A-Z] specifies the UPPERCASE alphabet letter range and {2} specifies the frequency, here two.
The result would look something like this:
Address |
City Short Forms |
---|---|
876 Highland St Atlanta GA 30301 |
GA |
296 Maple St New York City NY 10001 |
NY |
906 7th St Portland OR 97035 |
OR |
In Mammoth, follow these steps to extract strings using RegEx:
Open the tasks menu → “Text Functions” → “Extract Text”
In the “What to extract” drop down, select “Regular Expression”
Enter the RegEx in the field below it. It should look something like this:
Click Apply and all city short forms would get extracted.
How to create Regular Expressions?¶
We can create regular expressions using words and special characters, which we refer to as meta-characters.
For example, if we write “^apple” as a RegEx, it will match all the words starting with “apple”. Here, ^ means that the word should start with whatever follows, in this case, “apple”.
The tables below list some of the most common meta-characters in RegEx.
The first column talks about what symbol we are discussing. The second column talks about what it is supposed to do. And the third column shows an illustration of its usage.
1. Anchors & Wildcards¶
Symbol |
Description |
Example |
---|---|---|
^ |
Matches characters from the beginning |
^123 matches 123456, 123abc, 123#@! |
$ |
Matches characters from the end |
123$ matches abc123, 257123 |
| |
Creates an OR match |
1|10 matches 1, 10 |
. |
Matches any single character (letter, number or symbol) |
1. matches 10, 1A; 1.1 matches 111, 1A1 |
2. Quantifiers¶
Symbol |
Description |
Example |
---|---|---|
? |
Matches the preceding character 0 or 1 times |
10? matches 1, 10 |
+ |
Matches the preceding character 1 or more times |
10+ matches 10, 100 |
* |
Matches the preceding character 0 or more times |
1* matches 1, 10 |
{2} |
Matches the specified characters exactly twice |
[0-9]{2} matches 256abc, abcd2567 |
{2, 10} |
Matches the specified characters anywhere between 2 and 10 |
[A-Z]{2, 10} matches AB1234, ABCDefg, etc. |
[2,] |
Matches the specifies characters at least twice |
[0-9][2,] will match 26abc, kl420xhj, etc. |
3. Groups¶
Symbol |
Function |
Example |
---|---|---|
() |
Matches the enclosed characters in exact order anywhere in a string |
(10) matches 10, 101, 1011, ([0-9]|[a-z]) matches any number or lower-case letter |
[] |
Matches the enclosed characters in any order anywhere in a string |
[10] matches 012, 120, 210 |
- |
Creates a range of characters within brackets to match anywhere in a string |
[0-9] matches any number 0 through 9 (equivalent to d) |
(?:xyz) |
Non-capturing group of characters |
(?:abc) matches everything except a, b, & c |
[^xyz] |
Matches a character other than x or y or z |
[^abc] will match everything else other than a, b, & c |
[a-z] |
Matches any smallcase alphabet between a & z |
[a-z]$ will match f in ABCdef, k in 123jk, etc. |
[A-Z] |
Matches any uppercase alphabet between A & Z |
^[A-Z] will match A in ABCdef, etc. |
[a-j] |
Matches any smallcase alphabet between a & j |
[a-j] will match a in ABCabc, g in g1234, etc. |
4. Escapes¶
Symbol |
Function |
Example |
---|---|---|
\ |
\ indicates that the adjacent character should be interpreted literally rather than as a regex metacharacter |
\. indicates that the adjacent dot should be interpreted as a period or decimal rather than as a wildcard. For e.g. 216\.239\.32\.34 matches 216.239.32.34 |
5. Character classes¶
Symbol |
Function |
Example |
---|---|---|
\d |
\d matches any numeric character (equivalent to [0-9]) |
\d matches any single digit from 0,1,2….,9; \d+ matches one or more digit character |
\D |
\D is the inverse operation, doesn’t match digit ([^\d]) |
\D* matches zero or more non-digit characters |
\w |
\w matches any alphanumeric character plus underscore (equivalent to [a-zA-Z0-9_]) |
\w matches 7,a,_,10; \w+ matches abcdef09ge_wf |
\W |
\W matches anything except the above (equivalent to [^\w]) |
In @$@%@%@ \W+ matches @$@%@%@ |
\s |
Matches any whitespace characters, like space, tab, new line |
\s matches the space in Apple Ball |
\S |
\S matches everything except whitespace characters |
\S matches Apple Ball |
Some common regular expressions¶
Description |
Expression |
---|---|
Starts with a positive number |
^d+ |
Starts with a positive or a negative number |
^-?d+ |
Ends with a positive number |
d+$ |
Starts with a lowercase letter |
^[a-z]+ |
Ends with a lowercase letter |
[a-z]+$ |
Starts with an uppercase letter |
^[A-Z]+ |
Ends with an uppercase letter |
[A-Z]+$ |
Contains a (positive) number without decimals (5,2837, etc) |
d+ |
Contains a (positive) decimal number (5.00, 10.1, 9999.99) |
d+.d+ |
Contains either of integer or decimal number (including negatives) |
^-?(d*)(.d+)?$ |
All except regular expressions¶
All except regular expressions is used to remove substrings that matches with the regex and extract the rest of the string.
Suppose you have a column like this:
Product |
---|
Charging Cable |
Google Phone |
Wired Headphones |
Laptop |
You want to extract all items except Laptop, you can write a regex expression like ^Laptop, choose All except regular expression as the extraction type and it’ll give you all results minus Laptop.
Like this:
Product |
Product Extract |
---|---|
Charging Cable |
Charging Cable |
Google Phone |
Google Phone |
Wired Headphones |
Wired Headphones |
Laptop |
The expression ^Laptop specifies strings starting with Laptop (denoted by ^).