9.1 Regular Expressions
We've been dealing with regular expressions for a while now.
This section fills in some background and ties together the somewhat
scattered discussions of regular expressions from earlier parts of
the book.
Regular expressions are interesting, important, and rich in
capabilities. Jeffrey Friedl's book Mastering Regular
Expressions (O'Reilly) is entirely devoted to them.
Perl makes particularly good use of regular expressions, and the Perl
documentation explains them well. Regular expressions are useful when
programming with biological data such as sequences, or with GenBank,
PDB, and BLAST files.
Regular expressions are ways of representing—and searching
for—many strings with one string. Although they are not
strictly the same thing, it's useful to think of regular
expressions as a kind of highly developed set of
wildcards. The
special characters in regular expressions are more properly known as
metacharacters.
Most people are familiar with wildcards, which are found in search
engines or in the game of poker. You might find references to
every word that starts with biolog by typing
biolog*, for instance. Or you may find yourself
holding five aces. (Different situations may use different wildcards.
Perl regular expressions use * to mean "0 or
more of the preceding item," not "followed by
anything" as in the wildcard example just given.)
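To make the difference concrete, here is a minimal sketch that compares the two readings of the asterisk. The word list is hypothetical and chosen only for illustration. The anchors ^ and $ force each pattern to account for the whole word, and the period stands for any character except a newline.

```perl
use strict;
use warnings;

# Hypothetical word list, chosen only to illustrate the difference.
my @words = ('biolo', 'biolog', 'biologgg', 'biology', 'biological');

my (%star, %dot_star);
foreach my $word (@words) {
    # biolog*  : "biolo" followed by zero or more g's.
    $star{$word} = ($word =~ /^biolog*$/) ? 'yes' : 'no';
    # biolog.* : "biolog" followed by zero or more of any character,
    # which is closer to the search-engine wildcard.
    $dot_star{$word} = ($word =~ /^biolog.*$/) ? 'yes' : 'no';
    print "$word: biolog* $star{$word}, biolog.* $dot_star{$word}\n";
}
```

In this sketch only biolog.* accepts biology and biological, while only biolog* accepts biolo, because the asterisk alone lets the g repeat zero times.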
In computer science, these kinds of wildcards or metacharacters have
an important history, both practically and theoretically. The
asterisk character in particular is called the
Kleene
closure after the eminent logician who invented it. As a nod to the
theory, I'll mention there is a simple model of a computer,
less powerful than a Turing machine, that can deal with exactly the
same kinds of languages that can be described by regular expressions.
This machine model is called a finite state
automaton. But enough theory for now.
We've already seen many examples that use regular expressions
to find things in a DNA or protein sequence. Here I'll talk
briefly about the fundamental ideas behind regular expressions as an
introduction to some terminology. There is a useful summary of
regular-expression features in Appendix B. Finally,
we'll see how to learn more about them in the Perl
documentation.
So let's start with a practical example that should be familiar
by now to those who have been reading this text sequentially: using
character classes to search
DNA. Let's say there is a small motif you'd like to find
in your library of DNA that is six basepairs long: CT followed by C
or G or T followed by ACG. The third nucleotide in this motif is
never A, but it can be C, G, or T. You can make a regular expression
by letting the character class [CGT] stand for the variable position.
The motif can then be represented by a regular expression that looks
like this: CT[CGT]ACG. This is a motif that is six base pairs long
with a C, G, or T in the third position. If your DNA is in a scalar
variable $dna, you can test for the presence of
the motif by using the regular expression as a conditional test in a
pattern-matching statement, like so:
if ( $dna =~ /CT[CGT]ACG/ ) {
    print "I found the motif!!\n";
}
Regular expressions are based on three fundamental ideas:
- Repetition (or closure): The asterisk (*), also called Kleene
  closure or star, indicates 0 or more repetitions of the item just
  before it. For example, abc* matches any of these strings: ab, abc,
  abcc, abccc, abcccc, and so on. The regular expression matches an
  infinite number of strings.
- Alternation: In Perl, the pattern (a|b) (read: a or b) matches the
  string a or the string b.
- Concatenation: This is the most obvious one. In Perl, the string ab
  means the character a followed by (concatenated with) the
  character b.
The use of parentheses for
grouping is important: they are also
metacharacters. So, for instance, the string
(abc|def)z*x matches such strings as
abcx, abczx,
abczzx, defx,
defzx, defzzzzzx, and so on. In
English, it matches either abc or
def followed by zero or more
z's, and ending with an
x. This example combines the ideas of grouping,
alternation, closure, and concatenation. The real power of regular
expressions is seen in this combining of the three fundamental ideas.
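The following short sketch tries the two patterns discussed above, abc* and (abc|def)z*x, against a handful of test strings. The test strings, including the ones that should fail, are chosen for illustration. The anchors ^ and $ are added so that each pattern has to match the whole string rather than just part of it.

```perl
use strict;
use warnings;

my (%closure, %combined);

# Repetition: abc* is "ab" followed by zero or more c's.
foreach my $string ('ab', 'abc', 'abccc', 'abd') {
    $closure{$string} = ($string =~ /^abc*$/) ? 1 : 0;
}

# Grouping, alternation, closure, and concatenation together:
# either abc or def, then zero or more z's, then an x.
foreach my $string ('abcx', 'abczzx', 'defx', 'defzzzzzx', 'abcdefx', 'abcz') {
    $combined{$string} = ($string =~ /^(abc|def)z*x$/) ? 1 : 0;
}

foreach my $string (sort keys %closure) {
    my $verdict = $closure{$string} ? 'matches' : 'does not match';
    print "abc* $verdict $string\n";
}
foreach my $string (sort keys %combined) {
    my $verdict = $combined{$string} ? 'matches' : 'does not match';
    print "(abc|def)z*x $verdict $string\n";
}
```

The strings abd, abcdefx, and abcz are rejected: the first has a d where only c's may follow ab, the second uses both alternatives at once, and the third lacks the final x.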
Perl has many regular-expression features. They are basically
shortcuts for the three fundamental ideas we've just
seen—repetition, alternation, and concatenation. For instance,
the character class shown earlier can be written using alternation as
(C|G|T). Another common feature is the period,
which can stand for any character, except a newline. So
ACG.*GCA stands for any DNA that starts with
ACG and ends with GCA. In
English, this reads as: ACG followed by 0 or more
characters followed by GCA.
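As a rough illustration, the sketch below checks that the character class and the alternation accept the same motifs, and then tries ACG.*GCA on two sequences. All of the sequences are hypothetical, made up for this example. The parentheses in the second part capture the matched text in $1.

```perl
use strict;
use warnings;

# Character class versus alternation on hypothetical six-base motifs.
my (%by_class, %by_alternation);
foreach my $motif ('CTCACG', 'CTGACG', 'CTTACG', 'CTAACG') {
    $by_class{$motif}       = ($motif =~ /CT[CGT]ACG/)   ? 'yes' : 'no';
    $by_alternation{$motif} = ($motif =~ /CT(C|G|T)ACG/) ? 'yes' : 'no';
    print "$motif: class $by_class{$motif}, ",
          "alternation $by_alternation{$motif}\n";
}

# The period: ACG, then 0 or more characters, then GCA.
my $dna   = 'TTACGTTTTGCAGG';
my $found = ($dna =~ /(ACG.*GCA)/) ? $1 : 'none';
print "Matched: $found\n";

# The period does not match a newline, so a sequence broken
# across two lines is not matched by this pattern.
my $two_lines    = "ACGTT\nTTGCA";
my $across_lines = ($two_lines =~ /ACG.*GCA/) ? 'yes' : 'no';
print "Matched across a newline: $across_lines\n";
```

The last test is one reason it can help to remove newlines from sequence data before searching it.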
In Perl, regular expressions are usually enclosed within
forward slashes and are used as
pattern-matching specifiers. Check
the documentation (or Appendix B) for
m//, which includes some options that affect the
behavior of the regular expressions. Regular expressions are also
used in many of Perl's built-in commands, as you will see.
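As one small example of an option, the i option makes a match ignore case. The sketch below uses a hypothetical lowercase sequence to show the difference, writes the same test with m and braces instead of slashes, and then uses a regular expression in the substitution operator to strip whitespace. This is an illustration under those assumptions, not a survey of the available options.

```perl
use strict;
use warnings;

my $dna = 'ccctgacgaa';    # hypothetical lowercase sequence

# Without options, the match is case-sensitive.
my $plain = ($dna =~ /CT[CGT]ACG/) ? 'yes' : 'no';

# The i option after the closing slash ignores case.
my $ignore_case = ($dna =~ /CT[CGT]ACG/i) ? 'yes' : 'no';

# With an explicit m, other delimiters such as braces may be used.
my $with_braces = ($dna =~ m{CT[CGT]ACG}i) ? 'yes' : 'no';

print "Case-sensitive: $plain\n";
print "With i option: $ignore_case\n";
print "With m{}i: $with_braces\n";

# A regular expression inside the substitution operator:
# remove all whitespace, including newlines, from a copy of the data.
my $raw = "ACG TTT\nGCA\n";
(my $clean = $raw) =~ s/\s//g;
print "Cleaned sequence: $clean\n";
```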
The Perl
documentation is essential: start
with the perlre section of the Perl manual at
http://www.perldoc.com/perl5.6/pod/perlre.html#top.