Safari | Beginning Perl for Bioinformatics -> B.15 Regular Expressions

B.15 Regular Expressions

Regular expressions are, in effect, an extra language that lives inside the Perl language. In Perl, they have quite a lot of features. First, I'll summarize how regular expressions work in Perl; then, I'll present some of their many features.

B.15.1 Overview

Regular expressions describe patterns in strings. The pattern described by a single regular expression may match many different strings.

Regular expressions are used in pattern matching, that is, when you look to see if a certain pattern exists in a string. They can also change strings, as with the s/// operator, which substitutes a replacement for the pattern, if found. A related operator, tr, transliterates several characters into replacement characters throughout a string; it uses the same slash-delimited syntax but operates on lists of characters rather than on true regular expressions. Regular expressions are case-sensitive, unless explicitly told otherwise.
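As a minimal sketch of the two string-changing operators just mentioned (the sample sequence and variable names are illustrative assumptions, not taken from the text), the following applies s/// and tr to copies of a DNA string:

```perl
use strict;
use warnings;

# Illustrative sample sequence (an assumption for this sketch).
my $dna = 'ACGTACGT';

# s/// replaces the first match of the pattern with the replacement.
(my $substituted = $dna) =~ s/ACG/acg/;

# tr maps each character in the first list to the character at the
# same position in the second list (here, the complement of each base).
(my $complement = $dna) =~ tr/ACGT/TGCA/;

print "$substituted\n";   # acgTACGT
print "$complement\n";    # TGCATGCA
```

Copying into a new variable inside the parentheses leaves the original $dna unchanged.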

The simplest pattern match is a string that matches itself. For instance, to see if the pattern 'abc' appears in the string 'abcdefghijklmnopqrstuvwxyz', write the following in Perl:

$alphabet = 'abcdefghijklmnopqrstuvwxyz';
if( $alphabet =~ /abc/ ) {
	print $&;
}

The =~ operator binds a pattern match to a string. /abc/ is the pattern abc, enclosed in forward slashes // to indicate that it's a regular-expression pattern. $& is set to the matched pattern, if any. In this case, the match succeeds, since 'abc' appears in the string $alphabet, and the code just given prints out abc.

Regular expressions are made from two kinds of characters. Many characters, such as 'a' or 'Z', match themselves. Metacharacters have a special meaning in the regular-expression language. For instance, parentheses ( ) are used to group other characters and don't match themselves. If you want to match a metacharacter such as ( in a string, you have to precede it with the backslash metacharacter, writing \( in the pattern.
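A short sketch of escaping, assuming an illustrative string that contains literal parentheses (the string and variable names are not from the text):

```perl
use strict;
use warnings;

# Illustrative string containing literal parentheses (an assumption).
my $text = 'p53 (tumor protein)';

# \( and \) match the literal characters ( and ) in the string.
my $matched = '';
if ( $text =~ /\(tumor protein\)/ ) {
    $matched = $&;
}

print "$matched\n";   # (tumor protein)
```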

There are three basic ideas behind regular expressions. The first is concatenation: two items next to each other in a regular-expression pattern (that's the string between the forward slashes // in the examples) must match two items next to each other in the string being matched (the $alphabet in the examples). So to match 'abc' followed by 'def', concatenate them in the regular expression:

$alphabet = 'abcdefghijklmnopqrstuvwxyz';
if( $alphabet =~ /abcdef/ ) {
        print $&; 
}

This prints:

abcdef

The second major idea is alternation. Items separated by the | metacharacter match any one of the items. For example:

$alphabet = 'abcdefghijklmnopqrstuvwxyz';
if( $alphabet =~ /a(b|c|d)c/ ) {
        print $&;
}

This prints:

abc

The example also shows how parentheses group things in a regular expression. The parentheses are metacharacters that aren't matched in the string; rather, they group the alternation, given as b|c|d, meaning any one of b, c, or d at that position in the pattern. Since b is actually in $alphabet at that position, the alternation, and indeed the entire pattern a(b|c|d)c, matches in the $alphabet. (One additional point: ab|cd means (ab)|(cd), not a(b|c)d.)
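The difference between ab|cd and a(b|c)d can be checked with a small sketch; the test string here is an illustrative assumption chosen so that the two patterns disagree:

```perl
use strict;
use warnings;

my $string = 'abXd';   # illustrative test string (an assumption)

# ab|cd is (ab)|(cd): it matches because 'ab' appears in the string.
my $loose = ( $string =~ /ab|cd/ ) ? 1 : 0;

# a(b|c)d needs 'abd' or 'acd', neither of which appears in 'abXd'.
my $grouped = ( $string =~ /a(b|c)d/ ) ? 1 : 0;

print "ab|cd matches: $loose\n";       # 1
print "a(b|c)d matches: $grouped\n";   # 0
```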

The third major idea of regular expressions is repetition (or closure). This is indicated in a pattern with the quantifier metacharacter *, sometimes called the Kleene star after one of the inventors of regular expressions. When * appears after an item, it means that the item may appear 0, 1, or any number of times at that place in the string. So, for example, all of the following pattern matches will succeed:

'AC' =~ /AB*C/;
'ABC' =~ /AB*C/;
'ABBBBBBBBBBBC' =~ /AB*C/;

B.15.2 Metacharacters

The following are metacharacters:

\ | ( ) [ { ^ $ * + ? .

B.15.2.1 Escaping with \

A backslash \ before a metacharacter causes it to match itself; for instance, \\ matches a single \ in the string.

B.15.2.2 Alternation with |

The pipe | indicates alternation, as described previously.

B.15.2.3 Grouping with ( )

The parentheses ( ) provide grouping, as described previously.

B.15.2.4 Character classes

Square brackets [ ] specify a character class. A character class matches one character, which can be any character specified. For instance, [abc] matches either a, or b, or c at that position (so it's the same as a|b|c). Inside a character class, A-Z is a range that matches any uppercase letter, a-z matches any lowercase letter, and 0-9 matches any digit. For instance, [A-Za-z0-9] matches any single letter or digit at that position. If the first character in a character class is ^, any character except those specified matches; for instance, [^0-9] matches any character that isn't a digit.
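As a brief sketch of character classes in use (the sample strings are illustrative assumptions), a negated class can find the first character in a DNA string that is not a standard base:

```perl
use strict;
use warnings;

# Illustrative sequence with one ambiguous base (an assumption).
my $dna = 'ACGTNACGT';

# [^ACGT] matches any single character that is not A, C, G, or T.
my $odd = '';
if ( $dna =~ /[^ACGT]/ ) {
    $odd = $&;
}

# [A-Za-z0-9] matches one letter or digit; here, the first one found.
my $first = '';
if ( '--x9' =~ /[A-Za-z0-9]/ ) {
    $first = $&;
}

print "$odd\n";     # N
print "$first\n";   # x
```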

B.15.2.5 Matching any character with .

The period or dot . represents any character except a newline. (The pattern modifier /s makes it also match a newline.) So, . is like a character class that specifies every character.

B.15.2.6 Beginning and end of strings with ^ and $

The ^ metacharacter doesn't match a character; rather, it asserts that the item that follows must be at the beginning of the string. Similarly, the $ metacharacter doesn't match a character but asserts that the item that precedes it must be at the end of the string (or before the final newline). For example: /^Watson and Crick/ matches if the string starts with Watson and Crick; and /Watson and Crick$/ matches if the string ends with Watson and Crick or Watson and Crick\n.
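The anchors can be tried out with a sketch like the following, which assumes a sample line ending in a newline (the variable names are illustrative):

```perl
use strict;
use warnings;

my $line = "Watson and Crick\n";   # note the trailing newline

# ^ anchors the match to the beginning of the string.
my $starts = ( $line =~ /^Watson/ ) ? 1 : 0;

# $ anchors to the end of the string, or just before a final newline.
my $ends = ( $line =~ /Crick$/ ) ? 1 : 0;

# Crick does appear in the string, but not at the beginning.
my $misplaced = ( $line =~ /^Crick/ ) ? 1 : 0;

print "$starts $ends $misplaced\n";   # 1 1 0
```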

B.15.2.7 Quantifiers: * + {MIN,} {MIN,MAX} ?

These metacharacters indicate the repetition of an item. The * metacharacter indicates zero, one, or more of the preceding item. The + metacharacter indicates one or more of the preceding item. The brace { } metacharacters let you specify the exact number of repetitions of the preceding item, or a range. For instance, {3} means exactly three of the preceding item; {3,7} means three, four, five, six, or seven of the preceding item; and {3,} means three or more of the preceding item. The ? metacharacter matches zero or one of the preceding item.
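A small sketch comparing the quantifiers, assuming an illustrative sequence with a run of five A's between C and G:

```perl
use strict;
use warnings;

# Illustrative sequence: five A's between C and G (an assumption).
my $seq = 'CAAAAAG';

my $star     = ( $seq =~ /CA*G/ )     ? 1 : 0;   # zero or more A's
my $plus     = ( 'CG' =~ /CA+G/ )     ? 1 : 0;   # + needs at least one A
my $optional = ( 'CG' =~ /CA?G/ )     ? 1 : 0;   # ? allows zero A's
my $range    = ( $seq =~ /CA{3,7}G/ ) ? 1 : 0;   # five is within 3..7
my $exact    = ( $seq =~ /CA{3}G/ )   ? 1 : 0;   # needs exactly three A's

print "$star $plus $optional $range $exact\n";   # 1 0 1 1 0
```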

B.15.2.8 Making quantifiers match minimally with ?

The quantifiers just shown are greedy (or maximal) by default, meaning that they match as many items as possible. Sometimes, you want a minimal match that will match as few items as possible. You get that by following each of * + {} ? with a ?. So, for instance, *? tries to match as few as possible, perhaps even none, of the preceding item before it tries to match one or more of the preceding item. Here's a maximal match:

'hear ye hear ye hear ye' =~ /hear.*ye/;
print $&;

This matches 'hear' followed by .* (as many characters as possible), followed by 'ye', and prints:

hear ye hear ye hear ye

Here is a minimal match:

'hear ye hear ye hear ye' =~ /hear.*?ye/;
print $&;

This matches 'hear' followed by .*? (the fewest number of characters possible), followed by 'ye', and prints:

hear ye

B.15.3 Capturing Matched Patterns

You can place parentheses around parts of the pattern for which you want to know the matched string. For example:

$alphabet = 'abcdefghijklmnopqrstuvwxyz';
$alphabet =~ /k(lmnop)q/;
print $1;

prints:

lmnop

You can place as many pairs of parentheses in a regular expression as you like; Perl automatically stores their matched substrings in special variables named $1, $2, and so on. The matches are numbered in order of the left-to-right appearance of their opening parenthesis.

Here's a more intricate example of capturing parts of a matched pattern in a string:

$alphabet = 'abcdefghijklmnopqrstuvwxyz';
$alphabet =~ /(((a)b)c)/;
print "First pattern = ", $1,"\n";
print "Second pattern = ", $2,"\n";
print "Third pattern = ", $3,"\n";

This prints:

First pattern = abc
Second pattern = ab
Third pattern = a

B.15.4 Metasymbols

Metasymbols are sequences of two or more characters consisting of backslashes before normal characters. These metasymbols have special meanings in Perl regular expressions (and in double-quoted strings for most of them). There are quite a few of them, but that's because they're so useful. Table B-3 lists most of these metasymbols. The column "Atomic" indicates Yes if the metasymbol matches an item, No if the metasymbol just makes an assertion, and - if it takes some other action.

Table B-3. Alphanumeric metasymbols

Symbol       Atomic  Meaning
\0           Yes     Match the null character (ASCII NUL)
\NNN         Yes     Match the character given in octal, up to 377
\1, \2, ...  Yes     Match the nth previously captured string (decimal)
\a           Yes     Match the alarm character (BEL)
\A           No      True at the beginning of a string
\b           Yes     Match the backspace character (BS); only inside a character class
\b           No      True at word boundary
\B           No      True when not at word boundary
\cX          Yes     Match the control character Control-X
\d           Yes     Match any digit character
\D           Yes     Match any nondigit character
\e           Yes     Match the escape character (ASCII ESC, not backslash)
\E           -       End case (\L, \U) or metaquote (\Q) translation
\f           Yes     Match the formfeed character (FF)
\G           No      True at end-of-match position of prior m//g
\l           -       Lowercase the next character only
\L           -       Lowercase till \E
\n           Yes     Match the newline character (usually NL, but CR on Macs)
\Q           -       Quote (de-meta) metacharacters till \E
\r           Yes     Match the return character (usually CR, but NL on Macs)
\s           Yes     Match any whitespace character
\S           Yes     Match any nonwhitespace character
\t           Yes     Match the tab character (HT)
\u           -       Titlecase the next character only
\U           -       Uppercase (not titlecase) till \E
\w           Yes     Match any "word" character (alphanumerics plus _ )
\W           Yes     Match any nonword character
\x{abcd}     Yes     Match the character given in hexadecimal
\z           No      True at end of string only
\Z           No      True at end of string or before optional newline
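To show a few of these metasymbols working together, here is a minimal sketch; the record line, the filename, and the variable names are illustrative assumptions rather than anything defined in this appendix:

```perl
use strict;
use warnings;

# Illustrative record line (an assumption, loosely GenBank-like).
my $line = 'LOCUS   AB000123   450 bp';

# \s+ matches a run of whitespace, \w+ a run of word characters,
# and \d+ a run of digits; \b asserts a word boundary.
my ( $name, $length ) = ( '', '' );
if ( $line =~ /^LOCUS\s+(\w+)\s+(\d+)\s+bp\b/ ) {
    ( $name, $length ) = ( $1, $2 );
}

# \Q...\E quotes metacharacters, so the dot in $file matches only a dot.
my $file    = 'seq.fa';
my $literal = ( 'seqXfa' =~ /^\Q$file\E\z/ ) ? 1 : 0;
my $plain   = ( 'seqXfa' =~ /^$file\z/ )     ? 1 : 0;   # . matches X

print "$name $length\n";      # AB000123 450
print "$literal $plain\n";    # 0 1
```

Quoting with \Q is a common precaution when a variable interpolated into a pattern may contain metacharacters.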

B.15.5 Extending Regular-Expression Sequences

Table B-4 includes several useful features that have been added to Perl's regular-expression capabilities.

Table B-4. Extended regular-expression sequences

Extension         Atomic  Meaning
(?#...)           No      Comment, discard
(?:...)           Yes     Cluster-only parentheses, no capturing
(?imsx-imsx)      No      Enable/disable pattern modifiers
(?imsx-imsx:...)  Yes     Cluster-only parentheses plus modifiers
(?=...)           No      True if lookahead assertion succeeds
(?!...)           No      True if lookahead assertion fails
(?<=...)          No      True if lookbehind assertion succeeds
(?<!...)          No      True if lookbehind assertion fails
(?>...)           Yes     Match nonbacktracking subpattern
(?{...})          No      Execute embedded Perl code
(??{...})         Yes     Match regex from embedded Perl code
(?(...)...|...)   Yes     Match with if-then-else pattern
(?(...)...)       Yes     Match with if-then pattern
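Three of the more commonly used extensions are sketched below on an illustrative DNA string (the sequence and variable names are assumptions made for this example):

```perl
use strict;
use warnings;

my $dna = 'ATGGCCTAGTTT';   # illustrative sequence (an assumption)

# (?:...) groups without capturing, so the codon after ATG lands in $1.
my $codon = '';
if ( $dna =~ /(?:ATG)(\w{3})/ ) {
    $codon = $1;
}

# (?=...) is a lookahead: TAG must follow, but is not part of the match.
my $upstream = '';
if ( $dna =~ /^\w+?(?=TAG)/ ) {
    $upstream = $&;
}

# (?!...) is a negative lookahead: a G that is not followed by another G.
my $lone_g = '';
if ( $dna =~ /G(?!G)\w/ ) {
    $lone_g = $&;
}

print "$codon $upstream $lone_g\n";   # GCC ATGGCC GC
```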

B.15.6 Pattern Modifiers

Pattern modifiers are single-letter commands placed after the final forward slash that delimits a regular expression or a substitution. They change the behavior of some regular-expression features. Table B-5 lists the most common pattern modifiers; an example follows the table.

Table B-5. Pattern modifiers

Modifier  Meaning
/i        Ignore upper- or lowercase distinctions
/s        Let . match newline
/m        Let ^ and $ match next to embedded \n
/x        Ignore (most) whitespace and permit comments in patterns
/o        Compile pattern once only
/g        Find all matches, not just the first one

As an example, say you were looking for a name in text, but you didn't know if the name had an initial capital letter or was all capitalized. You can use the /i modifier, like so:

$text = "WATSON and CRICK won the Nobel Prize";
$text =~ /Watson/i;
print $&;

This matches (since /i causes upper- and lowercase distinctions to be ignored) and prints out the matched string WATSON.
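The other modifiers in Table B-5 can be combined in the same way. The following sketch assumes a two-line sample text and relies on the fact that a /g match in list context returns all of the matches; the variable names are illustrative:

```perl
use strict;
use warnings;

my $text = "WATSON and CRICK\nwatson and crick\n";   # two lines

# /g in list context returns every match; /i ignores case.
my @names = ( $text =~ /watson/gi );      # ('WATSON', 'watson')

# /m lets ^ match at the start of each line, not just of the string.
my @starts = ( $text =~ /^(\w+)/gm );     # ('WATSON', 'watson')

# /s lets . match the newline between the two lines.
my $no_s   = ( $text =~ /CRICK.watson/ )  ? 1 : 0;   # 0
my $with_s = ( $text =~ /CRICK.watson/s ) ? 1 : 0;   # 1

# /x permits whitespace and comments inside the pattern.
my $with_x = ( $text =~ / crick   # surname
                          \n      # then a newline
                        /x ) ? 1 : 0;                # 1

print scalar(@names), " ", scalar(@starts), "\n";    # 2 2
```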
