Regex in Excel

RegEx, or Regular Expressions, is now in Excel! Regex functions simplify text manipulation, like search, match, clean, and extract text.

In this article, we’ll discover three Regular Expression functions in Excel through useful examples.

Table of contents:

  1. What are Regular Expressions?
  2. REGEXEXTRACT
  3. REGEXREPLACE
  4. REGEXTEST
  5. Excel Regex Function Examples
  6. How to use regular expressions in Excel?
  7. Regex Add-in for Excel

Can you imagine how useful it would be to standardize a list of email addresses in different formats? Or how about extracting dates from free-form text fields? Introducing: RegEx! RegEx, short for Regular Expressions, are powerful and flexible patterns. You can use patterns to match strings within blocks of text. In this article, we’ll talk about common regex syntax. Furthermore, you will learn how to enable regex in Excel, and finally, how to use them effectively for data cleaning.

What are Regular Expressions?

Regular expressions, often called “regex” or “regexp,” are like special formulas for finding and changing text. Over time, regex has become an essential tool in various fields for tasks such as searching, parsing, and replacing text. Regex is widely used in many programming languages, including Python, JavaScript, Perl, Java, and Ruby. These languages support regex because it is incredibly useful for text processing and manipulation.

One of the most exciting developments is that Excel now includes native support for regex, making data-cleaning tasks more efficient. In Microsoft 365, you can use three functions: REGEXEXTRACT, REGEXTEST, and REGEXREPLACE. By using regex patterns in Excel, you can now standardize formats, extract specific information, and clean up large datasets directly in Excel without needing additional software or programming skills.

REGEXEXTRACT Function in Excel

The REGEXEXTRACT function extracts specific parts of the text that match a pattern you define. You provide the text or a cell reference containing the text you want to search, and then you specify the regular expression (regex) that describes the pattern of the text you want to extract.

For example, you can use REGEXEXTRACT in Excel to find URLs, postal codes, or specific keywords in a list of text.

Syntax:

=REGEXEXTRACT(text, pattern, [return_mode], [case_sensitivity])

Arguments:

The function extracts strings within the provided text that match the pattern. The REGEXEXTRACT function uses two required and two optional arguments.

  • text: The text or the reference to a cell containing the text you want to extract strings from. Required argument.
  • pattern: The regular expression (“regex”) that describes the pattern of text you want to extract. Required argument.
  • [return_mode]: A number that specifies what strings you want to extract. The default value is 0.
    • 0“: Extracts the first match.
    • 1“: Extracts all matches as an array.
    • “2”: Extracts parts of the first match (capturing groups) as an array.
  • [case_sensitivity]: Specifies if the match is case-sensitive. By default, matching is case-sensitive.
    • 0“: Case sensitive.
    • 1“: Case insensitive.

REGEXREPLACE Function in Excel

The REGEXREPLACE function replaces strings within the provided text that matches the pattern with replacement. This functionality is particularly helpful for several reasons. For example, it can automate the correction of common errors or formatting issues in data. From now on, it is an easy task to fix date formats, standardize phone numbers, or correct misspelled words.

Syntax:

REGEXREPLACE(text, pattern, replacement, [occurrence], [case_sensitivity])

Arguments:

The REGEXREPLACE function uses three required and two optional arguments.

  • text: The text or cell reference containing the text you want to replace.
  • pattern: The regular expression that describes the pattern of text you want to replace.
  • replacement: The text you want to replace instances of pattern.
  • occurrence: This sets which specific occurrence of the pattern you want to change. Normally, if you set this to 0, it replaces every instance of the pattern. If you use a negative number, it starts replacing from the end and targets that specific occurrence.
  • case_sensitivity: Specifies if the match is case-sensitive. By default, matching is case-sensitive.
    • 0“: Case sensitive.
    • 1“: Case insensitive.

REGEXTEST Function in Excel

The REGEXTEST function allows you to check whether any part of the supplied text matches a regular expression (“regex”). It will return TRUE if there is a match and FALSE if there is not.

The REGEXTEST function is my personal favorite. It offers quick and simple solutions to many problems that were previously difficult to solve. This includes writing formulas like “if cell contains”. Checking these used to be a cumbersome task. I emphasize, used to be…

Syntax:

REGEXTEST(text, pattern, [case_sensitivity])

Arguments:

The function uses two required and one optional argument.

  • text: The text or the reference to a cell containing the text you want to match against.
  • pattern: The regular expression that describes the pattern of text you want to match.
  • case_sensitivity: Determines whether the match is case-sensitive. By default, the match is case-sensitive.
    • 0“: Case sensitive.
    • 1“: Case insensitive.

Excel Regex Function Examples

Example 1: Extract dates from a cell containing text

In the first example, I want to extract all dates from a cell that contains various text using an Excel regex formula.

The formula looks like this:

=REGEXEXTRACT(B3,"(\d{2}/\d{2}/\d{4})",1)
Regex in Excel - How to extract dates from a text

Explanation: Here’s a breakdown of each component in the Excel regex formula:

  1. Function: REGEXEXTRACT is an Excel function that searches for a text pattern defined by a regular expression and extracts it.
  2. Target Cell (B3): This is where the function looks for the text to extract. The text in cell B3 is scanned for any sequence that matches the regular expression pattern provided.
  3. Regular Expression (“(\d{2}/\d{2}/\d{4})”):
    • \d: This is a regex token that matches any digit (0-9).
    • {2}: This specifies that exactly two digits should be found (hence, \d{2} matches two consecutive digits).
    • /: This is a literal character in the regex pattern. It must appear exactly as it is.
    • \d{4}: This matches exactly four consecutive digits.
    • The overall pattern (\d{2}/\d{2}/\d{4}) matches dates in the format DD/MM/YYYY, where D and M are digits of the day and month, and Y are digits of the year.
  4. Return Mode (1): This argument in REGEXEXTRACT specifies what part of the match to return. 1 here indicates that all matches of the pattern within the text should be returned as an array. If there is more than one date in the format specified, all will be returned.

So, the formula =REGEXEXTRACT(B3,”(\d{2}/\d{2}/\d{4})”,1) will extract every occurrence of a date in the format DD/MM/YYYY from the text in cell B3. If there’s more than one date in the specified format, it will return them all as an array, assuming array formulas are properly enabled and supported in your version of Excel.

Example 2: IF Cell contains specific text

in this example, the goal is to check if a cell contains specific text. We’ll use the REGEXTEST function. The search string is “green”.

if cell contains specific text.

The first argument is the cell you want to check. Next, add the regex pattern you want to check. The third argument of the REGEXTEST function is case sensitivity. In the example, I’m using a non-case-sensitive search, so type 1. If you want to use a case-sensitive test, use 0 as a third argument.

Non-case sensitive matches:

=IF(REGEXTEST(B3,"green",1),"Found","Not found") //partial match
=IF(REGEXTEST(B3,"\bgreen\b",1),"Found","Not found") //exact match

Case-sensitive matches:

=IF(REGEXTEST(B3,"green",0),"Found","Not found") //partial match
=IF(REGEXTEST(B3,"\bgreen\b",0),"Found","Not found") //exact match

How to use Regular Expressions in Excel

When crafting regex patterns in Excel, you utilize specific symbols known as ‘tokens’ that correspond with various character types. These tokens simplify the process of pattern creation by representing broad categories of characters, rather than individual characters. Here’s a more detailed look at some simple tokens commonly used in regex:

  • Dot (.): Matches any single character except newline characters. For example, “a.c” matches “abc”, “acc”, “adc”, and so on.
  • Asterisk (*): Matches zero or more occurrences of the preceding element. For instance, “ba” could match “b”, “ba”, “baa”, “baaa”, etc.
  • Plus (+): Similar to the asterisk, but requires at least one occurrence of the preceding element. So, “ba+” would match “ba”, “baa”, “baaa”, etc., but not “b”.
  • Question Mark (?): Makes the preceding element optional, meaning it can occur at zero or one time. For example, “colo?r” matches both “color” and “color”.
  • Caret (^): Used at the start of a pattern to denote the beginning of a line or string. “^A” matches any string that starts with “A”.
  • Dollar Sign ($): Placed at the end of a pattern to indicate the end of a line or string. “end$” would match any string that ends with “end”.
  • Square Brackets ([ ]): Used to specify a set or range of characters to match. “[abc]” matches “a”, “b”, or “c”. You can specify a range using a hyphen, like “[a-z]” to match any lowercase letter.
  • Backslash (\): Used as an escape character to allow special characters to be used in patterns. For instance, “\$” would be used to match the dollar sign character itself.
  • Pipe (|): Acts as a logical OR operator, allowing you to specify alternative patterns to match. For example, “cat|dog” matches either “cat” or “dog”.
  • Parentheses (()): Used for grouping characters or subpatterns together. For instance, “(ab)+” matches one or more repetitions of “ab”.

Character classes

Character classes in regular expressions (regex) are special notations used to match any one character from a specific set of characters. They allow you to specify a group of characters that you want to match within a text, making your patterns more flexible and powerful.

Here are some commonly used character classes in regex:

  • Dot (.): This is the most basic character class that matches any single character except for newline characters.
  • [abc]: A custom character class that matches any single character listed inside the square brackets. For example, [abc] matches “a”, “b”, or “c”.
  • [^abc]: A negated character class that matches any character that is not listed inside the square brackets. For instance, [^abc] matches any character except “a”, “b”, or “c”.
  • [a-z]: A range character class that matches any single lowercase letter from “a” to “z”. Similarly, [A-Z] matches any uppercase letter, and [0-9] matches any digit from “0” to “9”.
  • \d: Matches any digit, equivalent to [0-9].
  • \D: Matches any non-digit character, equivalent to [^0-9].
  • \w: Matches any word character, which includes letters, digits, and underscores. It is equivalent to [a-zA-Z0-9_].
  • \W: Matches any non-word character, which is anything not included in \w, such as punctuation or spaces. It’s equivalent to [^a-zA-Z0-9_].
  • \s: Matches any whitespace character including spaces, tabs, and line breaks.
  • \S: Matches any non-whitespace character.

Anchors

Anchors are special characters that do not match any character themselves but instead assert a position in the text. They are used to specify where a pattern is supposed to start or end within the string. Here are the main types of anchors used in Excel regex:

  • Caret (^): This anchor is used to match the beginning of a string or the beginning of a line in multiline mode. For example, ^Hello matches “Hello” only if it appears at the start of the string.
  • Dollar Sign ($): This anchor matches the end of a string or the end of a line in multiline mode. For instance, end$ matches “end” only if it is at the end of the string.
  • Word Boundary (\b): This anchor matches a position where a word character is next to a non-word character. It is used to ensure that the characters to be matched are at the beginning, middle, or end of a word. For example, \bword\b matches “word” in “word is the word” but not in “swordfish” or “keywords”.
  • Non-word Boundary (\B): This is the opposite of a word boundary. \B matches at positions where there are no word boundaries. This can be useful for matching patterns that should not occur at the beginning or end of words. For instance, \Bain\B matches “ain” in “Spain” and “complaint” but not in “pain” or “main”.