Regular Expressions

A RegEx, or Regular Expression, is a sequence of characters that forms a search pattern. RegEx can be used to check if a string contains the specified search pattern, and extract it.

Quick start

Suppose you have a column like this:

Address

876 Highland St Atlanta GA 30301

296 Maple St New York City NY 10001

906 7th St Portland OR 97035

and you want to extract all the city short forms i.e. GA, NY, and OR. It’s tough to extract this using one of the pre-defined extract rules.

Extracting such sub-strings is much easier with RegEx, you just need to write a search pattern.

Here, we can use the expression - [A-Z]{2}. This search pattern looks for two consecutive capital letters in each row of the Address text column and extracts them into a new column or an existing one as specified by the user.

Note

Here, [A-Z] specifies the UPPERCASE alphabet letter range and {2} specifies the frequency, here two.

The result would look something like this:

Address

City Short Forms

876 Highland St Atlanta GA 30301

GA

296 Maple St New York City NY 10001

NY

906 7th St Portland OR 97035

OR

In Mammoth, follow these steps to extract strings using RegEx:

  • Open the tasks menu → “Text Functions” → “Extract Text”

  • In the “What to extract” drop down, select “Regular Expression”

  • Enter the RegEx in the field below it. It should look something like this:

regular expressions

Writing a regular expression to extract substring

  • Click Apply and all city short forms would get extracted.

regular expressions

Extracted data

How to create Regular Expressions?

We can create regular expressions using words and special characters, which we refer to as meta-characters.

For example, if we write “^apple” as a RegEx, it will match all the words starting with “apple”. Here, ^ means that the word should start with whatever follows, in this case, “apple”.

The tables below list some of the most common meta-characters in RegEx.

The first column talks about what symbol we are discussing. The second column talks about what it is supposed to do. And the third column shows an illustration of its usage.

1. Anchors & Wildcards

Anchors & Wildcards

Symbol

Description

Example

^

Matches characters from the beginning

^123 matches 123456, 123abc, 123#@!

$

Matches characters from the end

123$ matches abc123, 257123

|

Creates an OR match

1|10 matches 1, 10

.

Matches any single character (letter, number or symbol)

1. matches 10, 1A; 1.1 matches 111, 1A1

2. Quantifiers

Quantifiers

Symbol

Description

Example

?

Matches the preceding character 0 or 1 times

10? matches 1, 10

+

Matches the preceding character 1 or more times

10+ matches 10, 100

*

Matches the preceding character 0 or more times

1* matches 1, 10

{2}

Matches the specified characters exactly twice

[0-9]{2} matches 256abc, abcd2567

{2, 10}

Matches the specified characters anywhere between 2 and 10

[A-Z]{2, 10} matches AB1234, ABCDefg, etc.

[2,]

Matches the specifies characters at least twice

[0-9][2,] will match 26abc, kl420xhj, etc.

3. Groups

Groups

Symbol

Function

Example

()

Matches the enclosed characters in exact order anywhere in a string

(10) matches 10, 101, 1011, ([0-9]|[a-z]) matches any number or lower-case letter

[]

Matches the enclosed characters in any order anywhere in a string

[10] matches 012, 120, 210

-

Creates a range of characters within brackets to match anywhere in a string

[0-9] matches any number 0 through 9 (equivalent to d)

(?:xyz)

Non-capturing group of characters

(?:abc) matches everything except a, b, & c

[^xyz]

Matches a character other than x or y or z

[^abc] will match everything else other than a, b, & c

[a-z]

Matches any smallcase alphabet between a & z

[a-z]$ will match f in ABCdef, k in 123jk, etc.

[A-Z]

Matches any uppercase alphabet between A & Z

^[A-Z] will match A in ABCdef, etc.

[a-j]

Matches any smallcase alphabet between a & j

[a-j] will match a in ABCabc, g in g1234, etc.

4. Escapes

Escapes

Symbol

Function

Example

\

\ indicates that the adjacent character should be interpreted literally rather than as a regex metacharacter

\. indicates that the adjacent dot should be interpreted as a period or decimal rather than as a wildcard. For e.g. 216\.239\.32\.34 matches 216.239.32.34

5. Character classes

Character Classes

Symbol

Function

Example

\d

\d matches any numeric character (equivalent to [0-9])

\d matches any single digit from 0,1,2….,9; \d+ matches one or more digit character

\D

\D is the inverse operation, doesn’t match digit ([^\d])

\D* matches zero or more non-digit characters

\w

\w matches any alphanumeric character plus underscore (equivalent to [a-zA-Z0-9_])

\w matches 7,a,_,10; \w+ matches abcdef09ge_wf

\W

\W matches anything except the above (equivalent to [^\w])

In @$@%@%@ \W+ matches @$@%@%@

\s

Matches any whitespace characters, like space, tab, new line

\s matches the space in Apple Ball

\S

\S matches everything except whitespace characters

\S matches Apple Ball

Some common regular expressions

Common RegEx

Description

Expression

Starts with a positive number

^d+

Starts with a positive or a negative number

^-?d+

Ends with a positive number

d+$

Starts with a lowercase letter

^[a-z]+

Ends with a lowercase letter

[a-z]+$

Starts with an uppercase letter

^[A-Z]+

Ends with an uppercase letter

[A-Z]+$

Contains a (positive) number without decimals (5,2837, etc)

d+

Contains a (positive) decimal number (5.00, 10.1, 9999.99)

d+.d+

Contains either of integer or decimal number (including negatives)

^-?(d*)(.d+)?$

All except regular expressions

All except regular expressions is used to remove substrings that matches with the regex and extract the rest of the string.

Suppose you have a column like this:

Product

Charging Cable

Google Phone

Wired Headphones

Laptop

You want to extract all items except Laptop, you can write a regex expression like ^Laptop, choose All except regular expression as the extraction type and it’ll give you all results minus Laptop.

Like this:

Product

Product Extract

Charging Cable

Charging Cable

Google Phone

Google Phone

Wired Headphones

Wired Headphones

Laptop

The expression ^Laptop specifies strings starting with Laptop (denoted by ^).

Note

All except regular expressions only removes the first group match. So if you have a string Apple Apple Apple and you enter Apple to be removed, it leave us with Apple Apple. For instance, see the following images:

note
note 2