B.15
Regular Expressions
Regular expressions are, in effect, an extra language that lives
inside the Perl language, and Perl's version has a great many features.
First, I'll summarize how regular expressions work in Perl;
then, I'll present some of those features.
B.15.1
Overview
Regular expressions describe patterns in strings. The pattern
described by a single regular expression may match many different
strings.
Regular expressions are used in pattern matching, that is, when you
look to see if a certain pattern exists in a string. They can also
change strings, as with the s/// operator, which
replaces the pattern, if found, with a replacement. The related
tr/// operator transliterates several characters into replacement
characters throughout a string, although it works on lists of
characters rather than on true regular expressions. Regular
expressions are case-sensitive unless explicitly told otherwise.
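The three operations just mentioned can be sketched side by side. This is only an illustration: the string and the variable names ($dna, $rna, $complement) are made up for the example.

```perl
use strict;
use warnings;

my $dna = 'ACGTacgtACGT';   # illustrative string, not from the text

# Pattern matching is case-sensitive by default; /i ignores case.
my $found_exact  = ($dna =~ /CGTA/)  ? 1 : 0;   # uppercase CGTA never occurs
my $found_nocase = ($dna =~ /CGTA/i) ? 1 : 0;   # matches 'CGTa'

# Substitution: replace the first match of the pattern with a replacement.
(my $rna = $dna) =~ s/T/U/;

# Transliteration: map each listed character to its counterpart, everywhere.
(my $complement = $dna) =~ tr/ACGTacgt/TGCAtgca/;

print "exact: $found_exact, ignoring case: $found_nocase\n";
print "after s/T/U/: $rna\n";
print "after tr:     $complement\n";
```

Note that s/T/U/ changes only the first T here; the /g modifier described later makes it change every one.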
The simplest pattern match is a string that matches itself. For
instance, to see if the pattern 'abc' appears in
the string 'abcdefghijklmnopqrstuvwxyz', write the
following in Perl:
$alphabet = 'abcdefghijklmnopqrstuvwxyz';
if( $alphabet =~ /abc/ ) {
print $&;
}
The =~ operator binds a pattern match to a string.
/abc/ is the pattern abc,
enclosed in forward slashes // to indicate that
it's a regular-expression pattern. $& is
set to the part of the string that matched, if any. In this case, the match succeeds,
since 'abc' appears in the string
$alphabet, and the code just given prints out
abc.
Regular expressions are made from two kinds of characters. Many
characters, such as 'a' or 'Z',
match themselves. Metacharacters have a special meaning in the
regular-expression language. For instance, parentheses (
) are used to group other characters and don't match
themselves. If you want to match a metacharacter such as
( in a string, you have to precede it with the
backslash metacharacter, writing \( in the pattern.
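As a small sketch of the difference, the following compares an unescaped and an escaped parenthesis. The string 'print(42)' and the variable names are invented for the illustration.

```perl
use strict;
use warnings;

my $call = 'print(42)';   # illustrative string

# Unescaped: ( ) only group the 42, so the pattern looks for 'print42'.
my $grouped = ($call =~ /print(42)/) ? 'match' : 'no match';

# Escaped: \( and \) match literal parentheses in the string.
my $literal = ($call =~ /print\(42\)/) ? 'match' : 'no match';

print "unescaped: $grouped\n";
print "escaped:   $literal\n";
```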
There are three basic ideas behind regular expressions. The first is
concatenation: two items next to each other in a regular-expression
pattern (that's the string between the forward slashes
// in the examples) must match two items next to
each other in the string being matched (the
$alphabet in the examples). So to match
'abc' followed by 'def',
concatenate them in the regular expression:
$alphabet = 'abcdefghijklmnopqrstuvwxyz';
if( $alphabet =~ /abcdef/ ) {
print $&;
}
This prints:
abcdef
The second major idea is alternation. Items separated by the
| metacharacter match any one of the items. For
example:
$alphabet = 'abcdefghijklmnopqrstuvwxyz';
if( $alphabet =~ /a(b|c|d)c/ ) {
print $&;
}
This prints:
abc
The example also shows how parentheses group things in a regular
expression. The parentheses are metacharacters that aren't
matched in the string; rather, they group the alternation, given as
b|c|d, meaning any one of b,
c, or d at that position in the
pattern. Since b is actually in
$alphabet at that position, the alternation, and
indeed the entire pattern a(b|c|d)c, matches in
the $alphabet. (One additional point:
ab|cd means (ab)|(cd), not
a(b|c)d.)
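The difference between the two groupings can be checked directly. In this sketch the ^ and $ anchors (described later in this section) force each pattern to cover the whole string, which is why ab|cd is wrapped in parentheses; the test strings are chosen only for illustration.

```perl
use strict;
use warnings;

my %result;
for my $s ('ab', 'cd', 'abd', 'acd') {
    my $either  = ($s =~ /^(ab|cd)$/) ? 'yes' : 'no';   # ab, or else cd
    my $grouped = ($s =~ /^a(b|c)d$/) ? 'yes' : 'no';   # a, then b or c, then d
    $result{$s} = "$either/$grouped";
    print "$s: (ab|cd) $either, a(b|c)d $grouped\n";
}
```

The first pattern accepts only 'ab' and 'cd'; the second accepts only 'abd' and 'acd'.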
The third major idea of regular expressions is repetition (or
closure). This is indicated in a pattern with the quantifier
metacharacter *, sometimes called the Kleene star
after one of the inventors of regular expressions. When
* appears after an item, it means that the item
may appear 0, 1, or any number of times at that place in the string.
So, for example, all of the following pattern matches will succeed:
'AC' =~ /AB*C/;
'ABC' =~ /AB*C/;
'ABBBBBBBBBBBC' =~ /AB*C/;
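To see the results, the same matches can be run in a loop. The extra string 'ADC' is added here, as an assumption for contrast, to show a case that fails.

```perl
use strict;
use warnings;

my %matched;
for my $s ('AC', 'ABC', 'ABBBBBBBBBBBC', 'ADC') {
    # B* allows zero or more B's between the A and the C, and nothing else.
    $matched{$s} = ($s =~ /AB*C/) ? 1 : 0;
    print "$s: ", ($matched{$s} ? 'matches' : 'does not match'), "\n";
}
```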
B.15.2
Metacharacters
The following are metacharacters:
\ | ( ) [ { ^ $ * + ? .
B.15.2.1
Escaping with \
A backslash \ before a metacharacter causes
it to match itself; for instance, \\ matches a
single \ in the string.
B.15.2.2
Alternation with |
The pipe | indicates alternation, as described previously.
B.15.2.3
Grouping with ( )
The parentheses ( ) provide grouping, as described
previously.
B.15.2.4
Character classes
Square brackets [ ] specify a character
class. A character class matches one character, which can be any
character specified. For instance, [abc] matches
either a, or b, or
c at that position (so it's the same as
a|b|c). Inside a character class, A-Z is a range that
matches any uppercase letter, a-z matches any
lowercase letter, and 0-9 matches any digit. For
instance, [A-Za-z0-9] matches any single letter or
digit at that position. If the first character in a character class
is ^, any character except those specified matches;
for instance, [^0-9] matches any character that
isn't a digit.
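A brief sketch of character classes in use follows. The strings and variable names are hypothetical; each match records the first character that the class accepts.

```perl
use strict;
use warnings;

my $seq = 'ACGTNACGT';   # illustrative sequence with one unexpected letter
my $id  = 'gene_42';     # illustrative identifier

# Negated class: the first character that is not A, C, G, or T.
my $bad_base = ($seq =~ /[^ACGT]/) ? $& : '';

# Range: the first digit in the identifier.
my $first_digit = ($id =~ /[0-9]/) ? $& : '';

# Negated ranges: the first character that is neither a letter nor a digit.
my $first_other = ($id =~ /[^A-Za-z0-9]/) ? $& : '';

print "bad base: $bad_base\n";
print "first digit: $first_digit\n";
print "first other: $first_other\n";
```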
B.15.2.5
Matching any character with .
The period or dot . represents
any character except a newline. (The pattern modifier
/s makes it also match a newline.) So,
. is like a character class that specifies every
character except newline.
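The effect of /s on the dot can be sketched with a two-line string; the text and variable names here are made up for the example.

```perl
use strict;
use warnings;

my $text = "line one\nline two";   # illustrative two-line string

# The dot must match the newline between 'one' and 'line'.
my $without_s = ($text =~ /one.line/)  ? 'match' : 'no match';
my $with_s    = ($text =~ /one.line/s) ? 'match' : 'no match';

print "without /s: $without_s\n";
print "with /s:    $with_s\n";
```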
B.15.2.6
Beginning and end of strings with ^ and $
The ^ metacharacter doesn't match
a character; rather, it asserts that the item that follows must be at
the beginning of the string. Similarly, the $
metacharacter doesn't match a character but asserts that the
item that precedes it must be at the end of the string (or before the
final newline). For example: /^Watson and Crick/
matches if the string starts with Watson and
Crick; and /Watson and Crick$/ matches
if the string ends with Watson and Crick or
Watson and Crick\n.
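The two anchored patterns can be tried against a few strings. This is a sketch; the four test strings are assumptions chosen to cover each case, including the trailing newline.

```perl
use strict;
use warnings;

my @strings = (
    "Watson and Crick",        # both anchors succeed
    "Watson and Crick\n",      # $ also matches before a final newline
    "Watson and Crick won",    # only ^ succeeds
    "and Watson and Crick",    # only $ succeeds
);

my @results;
for my $s (@strings) {
    my $starts = ($s =~ /^Watson and Crick/) ? 1 : 0;
    my $ends   = ($s =~ /Watson and Crick$/) ? 1 : 0;
    push @results, "starts=$starts ends=$ends";
}
print "$_\n" for @results;
```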
B.15.2.7
Quantifiers: * + {MIN,} {MIN,MAX} ?
These metacharacters indicate the
repetition of an item. The * metacharacter
indicates zero, one, or more of the preceding item. The +
metacharacter indicates one or more of the preceding item. The brace
{ } metacharacters let you specify an exact number of the
preceding item, or a range. For instance, {3}
means exactly three of the preceding item; {3,7}
means three, four, five, six, or seven of the preceding item; and
{3,} means three or more of the preceding item.
The ? matches zero or one of the preceding item.
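The quantifiers can be compared by testing strings with different numbers of B's. In this sketch the anchors ^ and $ make each pattern account for the whole string, and the hash %summary is just a hypothetical place to keep the results.

```perl
use strict;
use warnings;

my %summary;
for my $n (0, 1, 3, 8) {
    my $s = 'A' . ('B' x $n) . 'C';             # A, then $n B's, then C
    my $star  = ($s =~ /^AB*C$/)     ? 1 : 0;   # zero or more
    my $plus  = ($s =~ /^AB+C$/)     ? 1 : 0;   # one or more
    my $opt   = ($s =~ /^AB?C$/)     ? 1 : 0;   # zero or one
    my $range = ($s =~ /^AB{3,7}C$/) ? 1 : 0;   # three to seven
    $summary{$n} = "$star$plus$opt$range";
    print "B x $n: * $star, + $plus, ? $opt, {3,7} $range\n";
}
```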
B.15.2.8
Making quantifiers match minimally with ?
The quantifiers just shown are greedy (or
maximal) by default, meaning that they match as many items as
possible. Sometimes, you want a minimal match that will match as few
items as possible. You get that by following any of the quantifiers
* + {} ? with a ?. So, for instance,
*? tries to match as few as possible, perhaps even
none, of the preceding item before it tries to match one or more of
the preceding item. Here's a maximal match:
'hear ye hear ye hear ye' =~ /hear.*ye/;
print $&;
This matches 'hear' followed by
.* (as many characters as possible), followed by
'ye', and prints:
hear ye hear ye hear ye
Here is a minimal match:
'hear ye hear ye hear ye' =~ /hear.*?ye/;
print $&;
This matches 'hear' followed by
.*? (the fewest number of characters possible),
followed by 'ye', and prints:
hear ye
B.15.3
Capturing Matched Patterns
You can place parentheses around parts of
the pattern for which you want to know the matched string. For
example:
$alphabet = 'abcdefghijklmnopqrstuvwxyz';
$alphabet =~ /k(lmnop)q/;
print $1;
prints:
lmnop
You can place as many pairs of parentheses in a regular expression as
you like; Perl automatically stores their matched substrings in
special variables named $1, $2,
and so on. The matches are numbered in order of the left-to-right
appearance of their opening parenthesis.
Here's a more intricate example of capturing parts of a matched
pattern in a string:
$alphabet = 'abcdefghijklmnopqrstuvwxyz';
$alphabet =~ /(((a)b)c)/;
print "First pattern = ", $1,"\n";
print "Second pattern = ", $2,"\n";
print "Third pattern = ", $3,"\n";
This prints:
First pattern = abc
Second pattern = ab
Third pattern = a
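In practice it is safer to test whether the match succeeded before reading $1 and $2, since a failed match leaves them holding values from an earlier successful match. The following is a minimal sketch; the record format and the names $record, $name, and $length are invented for the example.

```perl
use strict;
use warnings;

my $record = 'gene=abc1 length=1200';   # hypothetical record format

my ($name, $length);
if ($record =~ /gene=([A-Za-z0-9]+) length=([0-9]+)/) {
    # Copy the captures right away; a later match would overwrite them.
    ($name, $length) = ($1, $2);
    print "name = $name, length = $length\n";
} else {
    print "record not recognized\n";
}
```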
B.15.4
Metasymbols
Metasymbols are sequences of two or more characters
consisting of a backslash before normal characters. These
metasymbols have special meanings in Perl regular expressions (and in
double-quoted strings for most of them). There are quite a few of
them, but that's because they're so useful. Table B-3 lists most of
these metasymbols. The column "Atomic" indicates Yes if the metasymbol
matches an item, No if the metasymbol just makes an assertion, and - if
it takes some other action.
Table B-3. Alphanumeric metasymbols
Symbol       | Atomic | Meaning
-------------|--------|--------------------------------------------------------
\0           | Yes    | Match the null character (ASCII NUL)
\NNN         | Yes    | Match the character given in octal, up to 377
\1, \2, ...  | Yes    | Match the nth previously captured string (n in decimal)
\a           | Yes    | Match the alarm character (BEL)
\A           | No     | True at the beginning of a string
\b           | Yes    | Match the backspace character (BS); inside a character class only
\b           | No     | True at word boundary
\B           | No     | True when not at word boundary
\cX          | Yes    | Match the control character Control-X
\d           | Yes    | Match any digit character
\D           | Yes    | Match any nondigit character
\e           | Yes    | Match the escape character (ASCII ESC, not backslash)
\E           | -      | End case (\L, \U) or metaquote (\Q) translation
\f           | Yes    | Match the formfeed character (FF)
\G           | No     | True at end-of-match position of prior m//g
\l           | -      | Lowercase the next character only
\L           | -      | Lowercase till \E
\n           | Yes    | Match the newline character (usually NL, but CR on Macs)
\Q           | -      | Quote (de-meta) metacharacters till \E
\r           | Yes    | Match the return character (usually CR, but NL on Macs)
\s           | Yes    | Match any whitespace character
\S           | Yes    | Match any nonwhitespace character
\t           | Yes    | Match the tab character (HT)
\u           | -      | Titlecase the next character only
\U           | -      | Uppercase (not titlecase) till \E
\w           | Yes    | Match any "word" character (alphanumerics plus _ )
\W           | Yes    | Match any nonword character
\x{abcd}     | Yes    | Match the character given in hexadecimal
\z           | No     | True at end of string only
\Z           | No     | True at end of string or before optional newline
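A few of these metasymbols are sketched in action below: \w, \d, \s, the word-boundary assertion \b, and a backreference \1. The input line and variable names are assumptions made for the illustration.

```perl
use strict;
use warnings;

my $line = "sample_7\tcount: 42 42 reads";   # illustrative input

# \w+ : the first run of word characters (letters, digits, underscore).
my $word = ($line =~ /\w+/) ? $& : '';

# \b\d+\b : the first run of digits that stands alone as a word.
# The 7 in 'sample_7' is skipped because '_' and '7' are both word characters.
my $number = ($line =~ /\b\d+\b/) ? $& : '';

# (\d+)\s+\1 : a number, whitespace, then the same number again.
my $repeat = ($line =~ /\b(\d+)\s+\1\b/) ? $& : '';

print "word:   $word\n";
print "number: $number\n";
print "repeat: $repeat\n";
```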
B.15.5
Extending Regular-Expression Sequences
Table B-4 includes several useful features that
have been added to Perl's regular-expression capabilities.
Table B-4. Extended regular-expression sequences
Extension         | Atomic | Meaning
------------------|--------|----------------------------------------
(?#...)           | No     | Comment, discard
(?:...)           | Yes    | Cluster-only parentheses, no capturing
(?imsx-imsx)      | No     | Enable/disable pattern modifiers
(?imsx-imsx:...)  | Yes    | Cluster-only parentheses plus modifiers
(?=...)           | No     | True if lookahead assertion succeeds
(?!...)           | No     | True if lookahead assertion fails
(?<=...)          | No     | True if lookbehind assertion succeeds
(?<!...)          | No     | True if lookbehind assertion fails
(?>...)           | Yes    | Match nonbacktracking subpattern
(?{...})          | No     | Execute embedded Perl code
(??{...})         | Yes    | Match regex from embedded Perl code
(?(...)...|...)   | Yes    | Match with if-then-else pattern
(?(...)...)       | Yes    | Match with if-then pattern
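Three of the more commonly used extensions can be sketched as follows: cluster-only parentheses, a lookahead assertion, and an embedded modifier. The sequence and variable names are invented for the example.

```perl
use strict;
use warnings;

my $seq = 'ATGGCCTAA';   # illustrative sequence

# (?:...) groups the alternation without capturing,
# so the three letters that follow become $1.
my $after_start = ($seq =~ /(?:ATG|GTG)([ACGT]{3})/) ? $1 : '';

# (?=...) requires TAA to follow but does not make it part of the match.
my $before_stop = ($seq =~ /[ACGT]{3}(?=TAA)/) ? $& : '';

# (?i) turns on case-insensitive matching from inside the pattern.
my $nocase = ('atggcc' =~ /(?i)ATG/) ? 1 : 0;

print "after start:  $after_start\n";
print "before stop:  $before_stop\n";
print "(?i) matched: $nocase\n";
```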
B.15.6
Pattern Modifiers
Pattern modifiers
are single-letter commands placed after the final forward slash that
delimits a regular expression or a substitution. They change the
behavior of some regular-expression features. Table B-5 lists the most common pattern modifiers,
followed by an example.
Table B-5. Pattern modifiers
Modifier | Meaning
---------|---------------------------------------------------------
/i       | Ignore upper- or lowercase distinctions
/s       | Let . match newline
/m       | Let ^ and $ match next to embedded \n
/x       | Ignore (most) whitespace and permit comments in patterns
/o       | Compile pattern once only
/g       | Find all matches, not just the first one
As an example, say you were looking for a name in text, but you
didn't know if the name had an initial capital letter or was
all capitalized. You can use the /i modifier, like so:
$text = "WATSON and CRICK won the Nobel Prize";
$text =~ /Watson/i;
print $&;
This matches (since /i causes upper- and lowercase
distinctions to be ignored) and prints out the matched string
WATSON.
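Modifiers can be combined. The sketch below tries /g, /m, and /x on a two-line string; the text and variable names are assumptions for the illustration. In list context, a match with /g returns all of the matches (or all of the captured substrings, if the pattern has parentheses).

```perl
use strict;
use warnings;

my $text = "WATSON and Crick\nwatson and crick";   # illustrative text

# /i and /g together: every occurrence, in any case.
my @all = $text =~ /watson/ig;

# Without /m, ^ matches only at the very start of the string.
my @string_start = $text =~ /^(\w+)/g;

# With /m, ^ also matches just after each embedded newline.
my @line_starts = $text =~ /^(\w+)/mg;

# /x lets the pattern be spread out and commented.
my $ends_with_crick = ($text =~ /
    crick    # the surname, lowercase
    \s*      # optional trailing whitespace
    $        # end of the string
/x) ? 1 : 0;

print "all: @all\n";
print "string start: @string_start\n";
print "line starts: @line_starts\n";
print "ends with crick: $ends_with_crick\n";
```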