Regular Expressions (Regex)


A regular expression (regex) is a tool for expressing patterns in text.

  • a sequence of characters that describes a search pattern to apply to a given text
  • two forms:
    • basic
    • extended
  • each program determines which form it supports
    • differences are complex and subtle
  • use cases:
    • validate data
    • find patterns in large amounts of text
    • search and replace text
    • validate email addresses
    • search for phone numbers

Element   Purpose
[ABC]     Character set
[A-Z]     Range
\w        Word
\d        Digit
\s        Whitespace
^         Beginning
$         End
?         May or may not exist
{1,3}     Quantifier
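
As a rough illustration of several elements from the table, here is a minimal sketch using Python's re module. Python implements its own regex flavor rather than the basic or extended forms, and the sample strings are made up for the example.

```python
import re

# ^\d{1,3}$ : the whole string must be one to three digits.
short_number = re.compile(r"^\d{1,3}$")

# [A-Z]\w* : an uppercase letter followed by zero or more word characters.
capitalized = re.compile(r"[A-Z]\w*")

# \s+ : one or more whitespace characters, used here as a separator.
fields = re.split(r"\s+", "a  b\tc")

print(bool(short_number.search("42")))     # True
print(bool(short_number.search("12345")))  # False
print(capitalized.findall("Ada met Bob at noon"))
print(fields)
```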

Basic Features

Bracket Expressions

A bracket expression is a set of characters enclosed in brackets [] that matches any one character from the set.

  • e.g., b[aeiou]g matches bag, beg, big, bog, and bug
  • the bracket expression stands for exactly one character in the matched text
  • using a caret ^ after the opening bracket matches against any character except the ones specified
    • e.g., b[^aeiou]g matches bbg or bAg, but not bag or beg
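
The two bracket examples above can be checked with a short Python sketch. It assumes Python's re module, whose flavor agrees with the notes on this feature; the word list is only illustrative.

```python
import re

words = ["bag", "beg", "big", "bog", "bug", "bbg", "bAg"]

vowel = re.compile(r"b[aeiou]g")       # one lowercase vowel between b and g
non_vowel = re.compile(r"b[^aeiou]g")  # any one character except those vowels

# fullmatch requires the entire word to match the pattern.
vowel_hits = [w for w in words if vowel.fullmatch(w)]
non_vowel_hits = [w for w in words if non_vowel.fullmatch(w)]

print(vowel_hits)      # ['bag', 'beg', 'big', 'bog', 'bug']
print(non_vowel_hits)  # ['bbg', 'bAg']
```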

Range Expressions

A range expression is a variant of a bracket expression that specifies a range of characters by its start and end points, separated by a dash -.

  • e.g., a[2-4]z matches a2z, a3z, and a4z
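
A small Python sketch of the range example, assuming the re module; the candidate strings are hypothetical and include values just outside the range to show the boundaries.

```python
import re

pattern = re.compile(r"a[2-4]z")  # a, then one digit from 2 through 4, then z

candidates = ["a1z", "a2z", "a3z", "a4z", "a5z"]
matches = [c for c in candidates if pattern.fullmatch(c)]

print(matches)  # ['a2z', 'a3z', 'a4z']
```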

Any Single Character

The dot . represents any single character except a newline.

  • e.g., a.z matches a2z, abz, etc.
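
To see the dot in action, the following Python sketch (assuming the re module with its default flags) tries a few made-up strings, including one with no middle character and one with a newline in the middle.

```python
import re

pattern = re.compile(r"a.z")  # a, then any one character except a newline, then z

candidates = ["a2z", "abz", "a z", "az", "a\nz"]
matches = [c for c in candidates if pattern.fullmatch(c)]

# "az" fails because the dot needs exactly one character;
# "a\nz" fails because the dot does not match a newline by default.
print(matches)  # ['a2z', 'abz', 'a z']
```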

Start and End of Line

A text line (aka a record) consists of all the characters up to the newline that terminates it.

  • caret ^ represents the start of a line
    • when not used inside of brackets
  • dollar sign $ represents the end of a line
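
The sketch below shows both anchors in Python. It assumes the re module and tests each line separately, roughly the way line-oriented tools do; the sample lines are invented.

```python
import re

lines = ["error: disk full", "warning: error ignored", "all done"]

# ^error : "error" must appear at the start of the line.
starts_with_error = [ln for ln in lines if re.search(r"^error", ln)]

# done$ : "done" must appear at the end of the line.
ends_with_done = [ln for ln in lines if re.search(r"done$", ln)]

# Without an anchor, "error" may appear anywhere in the line.
contains_error = [ln for ln in lines if re.search(r"error", ln)]

print(starts_with_error)  # ['error: disk full']
print(ends_with_done)     # ['all done']
print(contains_error)     # ['error: disk full', 'warning: error ignored']
```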

Repetition

A full or partial regular expression may be followed by a special symbol to denote repetition of the matched item.

  • an asterisk * denotes zero or more occurrences of the preceding item
  • often combined with a dot as .* to match any substring
    • e.g., A.*Lincoln matches Abe Lincoln and Abraham Lincoln
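
Here is the Lincoln example as a minimal Python sketch, assuming the re module. The extra names are hypothetical and show that .* can also match an empty substring.

```python
import re

pattern = re.compile(r"A.*Lincoln")  # A, then any substring, then Lincoln

names = ["Abe Lincoln", "Abraham Lincoln", "ALincoln", "Mary Lincoln"]
matches = [n for n in names if pattern.search(n)]

# "ALincoln" matches because .* allows zero characters;
# "Mary Lincoln" does not because it contains no uppercase A.
print(matches)  # ['Abe Lincoln', 'Abraham Lincoln', 'ALincoln']
```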

Escaping

To match a special character literally, you need to escape it.

  • precede the character with a backslash \
  • e.g., to match www.test.com literally, use www\.test\.com
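
The difference escaping makes can be seen in this Python sketch, which assumes the re module. The string wwwXtestYcom is a made-up input chosen to show what the unescaped dot lets through.

```python
import re

unescaped = re.compile(r"www.test.com")   # each dot matches any character
escaped = re.compile(r"www\.test\.com")   # each dot matches only a literal "."

loose_hit = unescaped.fullmatch("wwwXtestYcom") is not None
strict_hit = escaped.fullmatch("wwwXtestYcom") is not None
literal_hit = escaped.fullmatch("www.test.com") is not None

print(loose_hit, strict_hit, literal_hit)  # True False True

# re.escape adds the backslashes for you.
print(re.escape("www.test.com"))
```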

Extended Features

Additional Repetition Operators

A plus sign + matches one or more occurrences of the preceding item.

A question mark ? matches zero or one occurrence of the preceding item.
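
A brief Python sketch of both operators, assuming the re module (which supports them without any special option). The patterns ab+c and colou?r are illustrative choices, not taken from the notes.

```python
import re

plus = re.compile(r"ab+c")         # a, one or more b, then c
optional = re.compile(r"colou?r")  # the u may appear zero times or once

plus_hits = [s for s in ["ac", "abc", "abbbc"] if plus.fullmatch(s)]
optional_hits = [s for s in ["color", "colour", "colouur"] if optional.fullmatch(s)]

print(plus_hits)      # ['abc', 'abbbc']
print(optional_hits)  # ['color', 'colour']
```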

Multiple Possible Strings

The vertical bar | separates two possible matches.

  • e.g., car|truck matches either car or truck
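
The alternation example looks like this in Python, assuming the re module; the sample phrases are invented.

```python
import re

pattern = re.compile(r"car|truck")  # match either "car" or "truck"

phrases = ["a red car", "a big truck", "a fast bike"]
matches = [p for p in phrases if pattern.search(p)]

print(matches)  # ['a red car', 'a big truck']
```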

Parentheses

Parentheses () surround subexpressions.

  • often used to specify how to apply operators
  • e.g., file(one|two|three)\.txt matches fileone.txt, filetwo.txt, and filethree.txt
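
To show why the parentheses matter, this Python sketch (assuming the re module) compares the grouped pattern with a hypothetical ungrouped variant, where the vertical bar splits the whole expression instead of just the middle part.

```python
import re

grouped = re.compile(r"file(one|two|three)\.txt")
ungrouped = re.compile(r"fileone|two|three\.txt")  # alternatives: "fileone", "two", "three.txt"

names = ["fileone.txt", "filetwo.txt", "filethree.txt", "filefour.txt"]
grouped_hits = [n for n in names if grouped.fullmatch(n)]

print(grouped_hits)  # ['fileone.txt', 'filetwo.txt', 'filethree.txt']

# Without parentheses, the bare word "two" is a complete match.
bare_two_ungrouped = ungrouped.fullmatch("two") is not None
bare_two_grouped = grouped.fullmatch("two") is not None
print(bare_two_ungrouped, bare_two_grouped)  # True False

# In Python, parentheses also capture the text of the subexpression.
captured = grouped.fullmatch("filetwo.txt").group(1)
print(captured)  # two
```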

Grep with Regular Expressions

  • to use an extended regular expression with grep, you need to include the -E option