What regular expressions are (and aren’t)
In this section we will explore the similarities and differences between the terms “regular expressions”, “wildcards”, and “shell globbing”, and learn some of the basic syntax of regular expressions.
Regular expressions
A regular expression is a sequence of characters, some of which have special meanings, that describes a pattern in a string of text according to a standardized set of rules.
Some regular expressions can be difficult to read at first because every character in the expression has a specific rule-defined meaning. For example, the regular expression below contains letters, numbers, and punctuation to describe a typical username consisting of a specific number of lowercase characters and numbers as well as (optionally) a hyphen or underscore:
(Image from Learn Regex the Easy Way)
Specifically, this pattern will match the text john_doe, jo-hn_doe and john12_as, but not Jo, because that string contains an uppercase letter (and is also too short).
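The pattern from the image is not reproduced in the text here, so the sketch below uses an assumed pattern, ^[a-z0-9_-]{3,16}$, chosen only because it fits the description above (lowercase letters, digits, hyphen or underscore, with a minimum and maximum length). The variable names are illustrative.

```shell
# Assumed pattern: 3 to 16 characters, each a lowercase letter, digit, underscore or hyphen.
pattern='^[a-z0-9_-]{3,16}$'

# LC_ALL=C keeps [a-z] restricted to ASCII lowercase letters.
matches=$(printf '%s\n' john_doe jo-hn_doe john12_as Jo | LC_ALL=C grep -E "$pattern")

# Prints the three matching usernames; "Jo" is filtered out.
printf '%s\n' "$matches"
```

The anchors ^ and $ force the whole line to fit the pattern, which is why a string that is too short or contains an uppercase letter is rejected.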
Wildcards
A wildcard is a special character that represents one or more unknown other characters. It is very common in a wide range of contexts and applications – particularly when searching through text – to have one or more wildcards available to expand the range of possible results.
For example, in MS Word, you can use the ? character to find any single character (including spaces). So, searching for c?t would find each of the following results: cat, cot, cut, c t. Other wildcard characters match any number of characters, the beginning of a word, the end of a word, and so on.
Such wildcards often differ from one application to the next (in other words, they are not standardized or “regularized”). This can be confusing if you switch between applications.
Regular expressions, on the other hand, are a standardized set of conventions for finding patterns in strings of text, where the definition of each special character typically does not differ between applications and implementations. (Other aspects of regex behaviour may differ between “flavours” of regex, but the syntax usually remains the same.)
The wildcard symbol for matching a single character in regex is . (a single dot or period). This is equivalent to ? in MS Word, while the ? in regex has a very different usage: it represents either zero or one instances of the preceding character.
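As a small illustration of the two symbols in regex, the sketch below runs grep -E over a few sample words. The word lists and variable names are made up for the example.

```shell
# "." stands for exactly one character, so "ct" and "coat" are not matched.
dot_matches=$(printf '%s\n' cat cot cut 'c t' ct coat | grep -E '^c.t$')

# "?" makes the preceding character optional: "u" may appear zero or one time.
optional_matches=$(printf '%s\n' color colour colouur | grep -E '^colou?r$')

printf '%s\n' "$dot_matches"
printf '%s\n' "$optional_matches"
```

The first search keeps cat, cot, cut and "c t"; the second keeps color and colour but not colouur.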
Shell globbing
If you use the UNIX command line (or shell), it is important to note that the shell itself does not treat the patterns you type as regular expressions. Instead, separate programs, for example grep, sed, or awk, are used to interpret regex.
This is an important distinction to make, because the shell has a feature called filename expansion or shell globbing that in some ways can look very similar. For example, in a shell context the wildcard * matches any number of characters, so on its own it matches every (non-hidden) file in the directory:
ls * # list all non-hidden files in the directory (and the contents of its immediate subdirectories)
ls *.txt # list all files in the directory with the extension .txt
ls test.* # list any file named "test" with any extension, e.g.: test.txt, test.doc, test.png, etc.
ls t*.* # list all files beginning with "t" that have an extension
This is different from the way that the * character is used in regex, where it means zero or more instances of the preceding character.
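To see the two meanings of * side by side, the following sketch creates a few empty files in a temporary directory for the glob, then feeds a few words to grep -E for the regex. The file names, words and variable names are invented for the example.

```shell
# Scratch directory so the glob has something to match.
workdir=$(mktemp -d)
touch "$workdir/test.txt" "$workdir/test.doc" "$workdir/notes.txt"

# Glob: "*" stands for any sequence of characters after "test.".
glob_result=$(cd "$workdir" && ls test.*)

# Regex: "*" repeats the preceding "e" zero or more times.
regex_result=$(printf '%s\n' tst test teest toast | grep -E '^te*st$')

rm -r "$workdir"
printf '%s\n' "$glob_result" "$regex_result"
```

The glob picks up test.doc and test.txt, while the regex keeps tst, test and teest: the star in the regex never stands for "anything", only for repetitions of the e before it.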
Another similar-but-different feature of shell globbing and regex is the ability to define a range or class of characters using square brackets. For example, [a-z] means any single lowercase letter from a to z, and [0-9] means any single digit from 0 to 9. So if you have a group of files in a directory with filenames test1.txt, test2.txt, test3.txt, test4.txt and so on, you could list them all with the following command:
ls test[0-9].txt
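One detail worth checking for yourself is that a bracket expression stands for exactly one character. The sketch below, using made-up file names in a temporary directory, suggests that test10.txt would not be listed by this command.

```shell
# Scratch directory with single-digit, double-digit and lettered file names.
workdir=$(mktemp -d)
touch "$workdir/test1.txt" "$workdir/test2.txt" "$workdir/test10.txt" "$workdir/testA.txt"

# [0-9] matches one digit only, so test10.txt and testA.txt are left out.
single_digit=$(cd "$workdir" && ls test[0-9].txt)

rm -r "$workdir"
printf '%s\n' "$single_digit"
```

To include two-digit names as well, one option is to add a second pattern, for example ls test[0-9].txt test[0-9][0-9].txt.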
Square brackets operate in a very similar way in regular expressions. However, combining them with other shell globbing characters can lead to unexpected results. For example, the following command lists all files in a directory that have the extension .md and whose filenames begin with any letter between a and l:
ls -l [a-l]*.md
This is because the shell expands the expression [a-l] as any character between a and l, and then expands * to any sequence of characters following that initial letter, followed by the literal string .md (the file extension). If [a-l]*.md were interpreted as a regular expression, however, it would match zero or more instances (*) of a character between a and l ([a-l]), followed by any single character (.) and finally the literal string md.
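The difference can be checked with a small experiment. The sketch below uses invented file names: the same text, [a-l]*.md, is first expanded by the shell as a glob and then passed (quoted) to grep -E as a regex.

```shell
# Scratch directory; "zmd" deliberately has no dot and no leading a-l letter.
workdir=$(mktemp -d)
touch "$workdir/apple.md" "$workdir/kiwi.md" "$workdir/mango.md" "$workdir/zmd"

# As a glob: must start with a-l and end in the literal ".md".
as_glob=$(cd "$workdir" && ls [a-l]*.md)

# As a regex: any line containing some character followed by "md".
as_regex=$(printf '%s\n' apple.md kiwi.md mango.md zmd | grep -E '[a-l]*.md')

rm -r "$workdir"
printf 'glob:\n%s\nregex:\n%s\n' "$as_glob" "$as_regex"
```

The glob lists only apple.md and kiwi.md. The regex matches all four names, including mango.md and zmd, because [a-l]* is allowed to match nothing and the dot accepts any character.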
To make sure that regexes you provide to a program like grep on the command line are not accidentally expanded by the shell, it is always a good idea to protect any regex you are using by surrounding it in quotation marks (single quotes are the safest choice, since the shell leaves their contents untouched).
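A quick way to see what a program would actually receive is to print the argument with and without quotes. This is a minimal sketch with a made-up file name in a temporary directory.

```shell
# Scratch directory containing one file that the pattern happens to match as a glob.
workdir=$(mktemp -d)
touch "$workdir/apple.md"

# Unquoted: the shell replaces the pattern with the matching file name.
unquoted=$(cd "$workdir" && printf '%s\n' [a-l]*.md)

# Single-quoted: the pattern is passed through unchanged.
quoted=$(cd "$workdir" && printf '%s\n' '[a-l]*.md')

rm -r "$workdir"
printf 'unquoted: %s\nquoted:   %s\n' "$unquoted" "$quoted"
```

In the unquoted case a program such as grep would be handed apple.md as its pattern rather than the regex you typed, which is the kind of silent substitution that quoting prevents.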