Regular Expressions

What Regex Is Good For

Regex is a way of describing a class of strings with a pattern. Often a strict equality check is not enough, and that is where regex can help. For instance, to check whether a string might be a phone number, we could write:

// is string in 555-555-5555 format
const isPhone = (str) => /^\d{3}-\d{3}-\d{4}$/.test(str);

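Trying to achieve the above without regex is needlessly messy. Regex also makes modifying the template easier. Say we wanted our isPhone function to also recognize strings without the dashes. Capturing the optional dash and reusing it with the \1 backreference keeps both delimiters consistent, so a half-dashed string like 555-5555555 is still rejected: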

// is string in 555-555-5555/5555555555 format
const isPhone = (str) => /^\d{3}(-?)\d{3}\1\d{4}$/.test(str);

Say we wanted to relax the delimiter to also allow a period or a space. No problem (note that \s accepts any whitespace character, not only a literal space):

// is string in
// 555-555-5555/555.555.5555/555 555 5555/5555555555 format
const isPhone = (str) =>
  /^\d{3}([-.\s]?)\d{3}\1\d{4}$/.test(str);

Regex can also reduce the number of passes over a given string. Here’s one example of a function that aims to extract the attribute name from a CSS data attribute selector. The regex-free version:

// [data-test_attribute] -> test_attribute
const getDataAttributeKey = (selector) =>
  selector
    .replace("[data-", "")
    .replace("]", "");

This works, but involves two passes over the input string, one per replace call. Now with regex:

// [data-test_attribute] -> test_attribute
const getDataAttributeKey = (selector) =>
  selector.replace(/\[data-([^\]]+)\]/, "$1");

The regex isn’t easier to read, but we’ve cut down the number of passes from 2 to 1.

In general, the clearer the template the easier it is to write the regex. The following are good use cases for regex:

What Regex Is Not Good For

With great power… etc, etc, etc.

At first people tend to avoid regex because the syntax is difficult to read or write. However, the bigger problems often arise from those (like myself) who try to solve every problem with regex. And it’s easy to understand why that’s so tempting. Consider:

Prime string length
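
A well-known trick can be sketched as follows, assuming the input is a single character repeated so that only its length matters. The pattern matches lengths 0 and 1 and every composite length, so a failed match means the length is prime. The names NON_PRIME_LENGTH and isPrimeLength are made up for this illustration.

```javascript
// Matches lengths 0 and 1, or any length that splits into two or more
// equal chunks of at least two characters (a composite length).
const NON_PRIME_LENGTH = /^.?$|^(..+?)\1+$/;

// Assumes `str` is one character repeated, e.g. "xxxxx".
const isPrimeLength = (str) => !NON_PRIME_LENGTH.test(str);

isPrimeLength("x".repeat(7)); // true
isPrimeLength("x".repeat(9)); // false
```

It is clever, but the engine is effectively doing trial division by backtracking, so it slows down quickly as the string grows.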

Powers of 2
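
One possible reading of this, assuming the number n is written in unary as n copies of x: n is a power of 2 exactly when it has no odd divisor of 3 or more, and a negative lookahead can express that. The names below are illustrative only.

```javascript
// (x(xx)+) captures an odd run of at least three x's; \1*$ then checks
// whether the whole string is a whole number of copies of that run.
// The negative lookahead rejects any length with such an odd divisor.
const POWER_OF_TWO_UNARY = /^(?!(x(xx)+)\1*$)x+$/;

const isPowerOfTwo = (n) => POWER_OF_TWO_UNARY.test("x".repeat(n));

isPowerOfTwo(8); // true
isPowerOfTwo(12); // false
```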

Binary Divisibility by 3
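
A sketch of how this can work, assuming the number is given as a string of binary digits. The pattern comes from a three-state automaton that tracks the remainder mod 3 as bits are read left to right. The names are illustrative.

```javascript
// Describes every path through the automaton that starts and ends in
// the "remainder 0" state: either read a 0, or read 1, loop through
// (01*0) any number of times, then read 1 to come back.
const DIVISIBLE_BY_THREE = /^(0|1(01*0)*1)+$/;

const isDivisibleByThree = (n) => DIVISIBLE_BY_THREE.test(n.toString(2));

isDivisibleByThree(9); // true  ("1001")
isDivisibleByThree(10); // false ("1010")
```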

But there are many instances where regex quickly falls apart. One of the more common cases is using regex to validate or match HTML. Under ideal circumstances this might not be so bad, but HTML is almost always malformed. The browser does a lot of heavy lifting to fill in the gaps when tags that should be closed aren’t, or when tags contain certain unescaped characters.

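Regex also struggles with larger texts spanning multiple lines. One reason is the risk of catastrophic backtracking: when a pattern can match the same text in many different ways, a backtracking engine may try all of them before giving up, and the work can grow exponentially with the input length. Innocent-looking patterns can bring a regex engine to a crawl.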
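
To make the backtracking risk concrete, here is a small sketch. Nested quantifiers such as (a+)+ let a backtracking engine divide a run of a's among group iterations in exponentially many ways, and it tries them all when the match ultimately fails. The patterns and the timeMatch helper are hypothetical, and actual timings depend on the engine.

```javascript
// Nested quantifiers: a run of a's can be split up in many ways.
const RISKY = /^(a+)+$/;
// Accepts exactly the same strings, with only one way to match.
const SAFE = /^a+$/;

// Hypothetical helper: returns the result and elapsed milliseconds.
const timeMatch = (regex, str) => {
  const start = Date.now();
  const matched = regex.test(str);
  return { matched, ms: Date.now() - start };
};

// The trailing "b" forces a failure only after the run of a's.
const input = "a".repeat(20) + "b";

const risky = timeMatch(RISKY, input); // work roughly doubles per extra "a"
const safe = timeMatch(SAFE, input); // a single linear scan
```

Both calls report no match. The difference is how much work it takes to find that out, so keep the run of a's short when experimenting.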

Regex Readability and Testing

A common complaint leveled against regex is how unreadable it is. There is no real counterargument. Regex can be parsed by those familiar with it, but there’s no guarantee you’ll be able to catch errors in the regex just by looking at it. For example, consider this widely circulated RFC 5322-style regex for email validation:

const EMAIL_REGEX = /(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])/;

No one should be expected to validate this on sight. Instead we should test these with known pass/fail cases:

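// Assumes Node's built-in assert module; a bare assert(value, false)
// would throw here, because the second argument is only a message.
import assert from "node:assert";

assert.strictEqual(
  EMAIL_REGEX.test("aaron@aarondilley.com"),
  true,
);

assert.strictEqual(
  EMAIL_REGEX.test("aaron@aarondilley@com"),
  false,
);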

We should avoid writing a regex as complex as this from scratch where possible. But regardless of the origin, we should have coverage for all regex authored.

Regex Basics

Character ranges
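
As a brief illustration (the constant names are made up for this sketch): square brackets define a character class, a hyphen between two characters denotes a range, and a leading caret negates the class.

```javascript
const LOWER_HEX_DIGIT = /[0-9a-f]/; // one character from 0-9 or a-f
const NOT_A_DIGIT = /[^0-9]/; // one character that is not 0-9

LOWER_HEX_DIGIT.test("c"); // true
LOWER_HEX_DIGIT.test("g"); // false
NOT_A_DIGIT.test("42"); // false
NOT_A_DIGIT.test("4x"); // true
```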

Single tokens
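
Reading "single tokens" as the shorthand escapes that each stand for one character, a quick sketch of the most common ones in JavaScript might look like this. The helper name matchesOne is hypothetical.

```javascript
// Hypothetical helper: does the whole string consist of one such token?
const matchesOne = (token, str) => new RegExp(`^${token}$`).test(str);

matchesOne("\\d", "7"); // true: \d is a digit, same as [0-9]
matchesOne("\\w", "_"); // true: \w is [A-Za-z0-9_]
matchesOne("\\s", "\t"); // true: \s is any whitespace, tabs included
matchesOne(".", "\n"); // false: . skips line terminators by default
matchesOne("\\D", "7"); // false: uppercase forms negate (\D, \W, \S)
```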

Group constructs
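
A minimal sketch of capturing groups, with an illustrative phone-style input: parentheses capture what they match, and captures are numbered by the position of their opening parenthesis, starting at 1.

```javascript
const match = "555-867-5309".match(/^(\d{3})-(\d{3})-(\d{4})$/);

match[0]; // "555-867-5309" (the whole match)
match[1]; // "555"
match[3]; // "5309"

// Named groups attach a label instead of relying on position.
const named = "555-867-5309".match(/^(?<area>\d{3})-/);
named.groups.area; // "555"
```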

Non-capturing group constructs
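
By way of example, (?:...) groups a sub-pattern for alternation or quantifiers without capturing it, so it does not take up a capture number. The URL below is just sample input.

```javascript
const match = "https://example.com".match(/^(?:https?|ftp):\/\/(.+)$/);

match[1]; // "example.com"
match.length; // 2: the whole match plus one capture
```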

Quantifiers
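
A short sketch of the common quantifiers and of greedy versus lazy matching. The helper names are invented for this example.

```javascript
// Hypothetical helpers that return the first match of each pattern.
const greedyTag = (str) => str.match(/<.+>/)[0];
const lazyTag = (str) => str.match(/<.+?>/)[0];

/^a?$/.test(""); // true: ? means zero or one
/^a*$/.test(""); // true: * means zero or more
/^a+$/.test(""); // false: + means one or more
/^a{2,3}$/.test("aaaa"); // false: {2,3} means two to three

// Quantifiers are greedy by default; a trailing ? makes them lazy.
greedyTag("<b>x</b>"); // "<b>x</b>"
lazyTag("<b>x</b>"); // "<b>"
```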

Anchors
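
A few illustrative cases: anchors match positions rather than characters. The word "cat" is just a stand-in pattern, and the constant names are made up.

```javascript
const ANYWHERE = /cat/; // unanchored, matches anywhere
const AT_START = /^cat/; // ^ requires the start of the input
const AT_END = /cat$/; // $ requires the end of the input
const WHOLE_WORD = /\bcat\b/; // \b is a word boundary

ANYWHERE.test("concatenate"); // true
AT_START.test("concatenate"); // false
AT_END.test("bobcat"); // true
WHOLE_WORD.test("a cat nap"); // true
WHOLE_WORD.test("concatenate"); // false
```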

Flags
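
A rough sketch of four commonly used JavaScript flags. The inputs and the replaceDashes helper are illustrative only.

```javascript
// Hypothetical helper to compare replacing with and without g.
const replaceDashes = (str, regex) => str.replace(regex, "+");

/hello/i.test("HELLO"); // true: i ignores case

replaceDashes("a-b-c", /-/); // "a+b-c": first match only
replaceDashes("a-b-c", /-/g); // "a+b+c": g replaces every match

/^b$/.test("a\nb"); // false: ^ and $ anchor to the whole input
/^b$/m.test("a\nb"); // true: m lets ^ and $ match at line breaks

/a.b/s.test("a\nb"); // true: s lets . match line terminators
```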

Regex Building

Start with test cases

It is usually easiest to start by compiling the cases the pattern should match alongside the cases it should not. For example, suppose we have:

# Matches
afoot
catfoot
dogfoot
fanfoot
foody
foolery
foolish
fooster
footage
foothot
footle
footpad
footway
hotfoot
jawfoot
mafoo
nonfood
padfoot
prefool
sfoot
unfool

# Doesn't match
Atlas
Aymoro
Iberic
Mahran
Ormazd
Silipan
altared
chandoo
crenel
crooked
fardo
folksy
forest
hebamic
idgah
manlike
marly
palazzi
sixfold
tarrock
unfold

We can surmise that the common thread in all the match cases is the substring foo, which none of the non-matching cases contain. Therefore our regex is as easy as /foo/.
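
Rather than eyeballing the lists, we can let code confirm the guess. The sketch below uses a hypothetical checkPattern helper and a subset of the words above, and reports which cases a candidate pattern gets wrong.

```javascript
// Hypothetical helper: lists the cases a candidate pattern gets wrong.
const checkPattern = (regex, shouldMatch, shouldNotMatch) => ({
  missed: shouldMatch.filter((s) => !regex.test(s)),
  falsePositives: shouldNotMatch.filter((s) => regex.test(s)),
});

// A subset of the lists above
const matches = ["afoot", "foolish", "mafoo", "sfoot", "unfool"];
const nonMatches = ["Atlas", "chandoo", "folksy", "forest", "unfold"];

checkPattern(/foo/, matches, nonMatches);
// { missed: [], falsePositives: [] }

checkPattern(/^foo/, matches, nonMatches);
// { missed: ["afoot", "mafoo", "sfoot", "unfool"], falsePositives: [] }
```

Two empty arrays mean the candidate agrees with every listed case. A non-empty one points at exactly which words to look at next.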

Exercises

FP Regex

Consider our earlier example:

// is string in 555-555-5555 format
const isPhone = (str) => /^\d{3}-\d{3}-\d{4}$/.test(str);

It would probably be pretty handy if we could eliminate the need to specify the str param. FP to the rescue:

import { curry } from "lodash";
import { filter } from "lodash/fp";

const regexTest = curry((regex, str) => regex.test(str));

// is string in 555-555-5555 format
const isPhone = regexTest(/^\d{3}-\d{3}-\d{4}$/);

isPhone("555-555-5555"); // true

const onlyVowels = filter(regexTest(/^[aeiou]+$/i));

onlyVowels([
  "Hello",
  "EIEIO",
  "aaaaaa",
  "aaaaah",
]); // ["EIEIO", "aaaaaa"]
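
One caveat is worth a sketch: a regex with the g (or y) flag keeps state in lastIndex between .test() calls, so a shared, partially applied regex can give alternating results. The variant below is dependency-free (hand-rolled currying rather than lodash's curry) and resets lastIndex before each test. The name regexTest mirrors the helper above, and hasDigit is illustrative.

```javascript
// Hand-curried stand-in for the lodash-based helper above.
const regexTest = (regex) => (str) => {
  regex.lastIndex = 0; // clear state left behind by the g or y flag
  return regex.test(str);
};

const hasDigit = regexTest(/\d/g);

["a1", "b2", "c3"].filter(hasDigit); // ["a1", "b2", "c3"]

// Without the reset, the shared regex resumes from lastIndex:
const stateful = /\d/g;
["a1", "b2", "c3"].filter((s) => stateful.test(s)); // ["a1", "c3"]
```

The simpler fix, where possible, is to leave the g flag off any regex that is only ever used with .test().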

Resources