
9.1 Regular Expressions

We've been dealing with regular expressions for a while now. This section fills in some background and ties together the somewhat scattered discussions of regular expressions from earlier parts of the book.

Regular expressions are interesting, important, and rich in capabilities. Jeffrey Friedl's book Mastering Regular Expressions (O'Reilly) is entirely devoted to them. Perl makes particularly good use of regular expressions, and the Perl documentation explains them well. Regular expressions are useful when programming with biological data such as sequence, or with GenBank, PDB, and BLAST files.

Regular expressions are ways of representing—and searching for—many strings with one string. Although they are not strictly the same thing, it's useful to think of regular expressions as a kind of highly developed set of wildcards. The special characters in regular expressions are more properly known as metacharacters.

Most people are familiar with wildcards, which are found in search engines or in the game of poker. You might find the reference to every word that starts with biolog by typing biolog*, for instance. Or you may find yourself holding five aces. (Different situations may use different wildcards. Perl regular expressions use * to mean "0 or more of the preceding item," not "followed by anything" as in the wildcard example just given.)

In computer science, these kinds of wildcards or metacharacters have an important history, both practically and theoretically. The asterisk character in particular is called the Kleene closure after the eminent logician who invented it. As a nod to the theory, I'll mention there is a simple model of a computer, less powerful than a Turing machine, that can deal with exactly the same kinds of languages that can be described by regular expressions. This machine model is called a finite state automaton. But enough theory for now.

We've already seen many examples that use regular expressions to find things in a DNA or protein sequence. Here I'll talk briefly about the fundamental ideas behind regular expressions as an introduction to some terminology. There is a useful summary of regular-expression features in Appendix B. Finally, we'll see how to learn more about them in the Perl documentation.

So let's start with a practical example that should be familiar by now to those who have been reading this text sequentially: using character classes to search DNA. Let's say there is a small motif you'd like to find in your library of DNA that is six base pairs long: CT followed by C or G or T followed by ACG. The third nucleotide in this motif is never A, but it can be C, G, or T. You can make a regular expression by letting the character class [CGT] stand for the variable position. The motif can then be represented by a regular expression that looks like this: CT[CGT]ACG. If your DNA is in a scalar variable $dna, you can test for the presence of the motif by using the regular expression as a conditional test in a pattern-matching statement, like so:

if( $dna =~ /CT[CGT]ACG/ ) {
    print "I found the motif!!\n";
}
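
The test above only reports that the motif is present somewhere. As a minimal sketch of how you might also see which of the three variants was found, the following wraps the motif in parentheses so that Perl saves the matched text in $1. The subroutine name find_motif and the sample sequence are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the first stretch of $dna that matches
# the motif CT[CGT]ACG, or undef if the motif is absent.
sub find_motif {
    my ($dna) = @_;
    if ( $dna =~ /(CT[CGT]ACG)/ ) {
        return $1;    # the text matched inside the parentheses
    }
    return;
}

my $dna = 'AAGCTTACGGA';    # made-up sequence for illustration
my $hit = find_motif($dna);
print defined $hit ? "I found the motif: $hit\n" : "No motif found.\n";
```

Returning undef when nothing matches lets the caller tell "no motif" apart from any real match.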

Regular expressions are based on three fundamental ideas:

Repetition (or closure)

The asterisk (*), also called Kleene closure or star, indicates 0 or more repetitions of the character just before it. For example, abc* matches any of these strings: ab, abc, abcc, abccc, abcccc, and so on. The regular expression matches an infinite number of strings.

Alternation

In Perl, the pattern (a|b) (read: a or b) matches the string a or the string b.

Concatenation

This one is the most obvious. In Perl, the pattern ab means the character a followed by (concatenated with) the character b.
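
The three ideas can be tried out one at a time. The sketch below is only an illustration: the helper matches_whole is a made-up name, and it adds the anchors ^ and \z (not discussed in this section) so that a pattern must account for the entire string rather than just part of it.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if $pattern matches the whole of $string,
# 0 otherwise. The ^ and \z anchors are an addition for this illustration.
sub matches_whole {
    my ($pattern, $string) = @_;
    return $string =~ /^(?:$pattern)\z/ ? 1 : 0;
}

my @checks = (
    [ 'abc*',  [qw(ab abc abccc abd)] ],    # repetition
    [ '(a|b)', [qw(a b c)] ],               # alternation
    [ 'ab',    [qw(ab ba)] ],               # concatenation
);

for my $check (@checks) {
    my ($pattern, $strings) = @$check;
    for my $string (@$strings) {
        my $verdict = matches_whole($pattern, $string) ? 'matches' : 'does not match';
        print "$pattern $verdict $string\n";
    }
}
```

Without the anchors, a pattern such as abc* would also succeed on abd, because Perl is satisfied as soon as any part of the string (here, ab) matches.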

The use of parentheses for grouping is important: they are also metacharacters. So, for instance, the pattern (abc|def)z*x matches such strings as abcx, abczx, abczzx, defx, defzx, defzzzzzx, and so on. In English, it matches either abc or def, followed by zero or more z's, followed by an x. This example combines the ideas of grouping, alternation, closure, and concatenation. The real power of regular expressions is seen in this combining of the three fundamental ideas.
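
A quick way to convince yourself of this is to run the pattern against a handful of strings. This is a sketch under a couple of assumptions: the subroutine name is invented, and the ^ and \z anchors are added so that the whole string has to fit the pattern.

```perl
use strict;
use warnings;

# Hypothetical helper: does the entire string fit (abc|def)z*x ?
sub fits_grouped_pattern {
    my ($string) = @_;
    return $string =~ /^(abc|def)z*x\z/ ? 1 : 0;
}

# The first four should fit; the last three should not.
for my $string (qw(abcx abczzx defx defzzzzzx abx abcdefx abcz)) {
    my $verdict = fits_grouped_pattern($string) ? 'fits' : 'does not fit';
    print "$string $verdict\n";
}
```

The parentheses matter here: without them, abc|defz*x would mean "abc, or else def followed by z's and an x," because alternation binds more loosely than concatenation.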

Perl has many regular-expression features. They are basically shortcuts for the three fundamental ideas we've just seen: repetition, alternation, and concatenation. For instance, the character class shown earlier can be written using alternation as (C|G|T). Another common feature is the period, which can stand for any character except a newline. So ACG.*GCA matches any stretch of DNA that starts with ACG and ends with GCA, as long as no newline falls between them. In English, this reads as: ACG followed by 0 or more characters followed by GCA.
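
Both points can be checked directly. In this sketch (the subroutine names and sample sequences are invented for illustration), the character class and the alternation give the same answer on every test sequence, and the period refuses to cross a newline.

```perl
use strict;
use warnings;

# Hypothetical helpers: the same motif written two ways.
sub has_motif_class { my ($dna) = @_; return $dna =~ /CT[CGT]ACG/   ? 1 : 0; }
sub has_motif_alt   { my ($dna) = @_; return $dna =~ /CT(C|G|T)ACG/ ? 1 : 0; }

# Hypothetical helper: ACG, then 0 or more non-newline characters, then GCA.
sub has_acg_to_gca { my ($dna) = @_; return $dna =~ /ACG.*GCA/ ? 1 : 0; }

for my $dna (qw(CTCACG CTGACG CTTACG CTAACG)) {
    print "$dna class=", has_motif_class($dna), " alternation=", has_motif_alt($dna), "\n";
}

print "same line: ", has_acg_to_gca("TTACGTTTTGCATT"), "\n";
print "split by newline: ", has_acg_to_gca("ACGTT\nTTGCA"), "\n";
```

The second result is worth remembering when sequence data is read in line by line from a file: a motif that spans two lines will not be found unless the newlines are removed first.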

In Perl, regular expressions are usually enclosed within forward slashes and are used as pattern-matching specifiers. Check the documentation (or Appendix B) for m//, which includes some options that affect the behavior of the regular expressions. Regular expressions are also used in many of Perl's built-in commands, as you will see.
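
As one example of such an option, the i modifier makes a match ignore case, which can be handy when sequence data arrives in lowercase. The sketch below assumes nothing beyond the standard m// operator; the subroutine names and the sample sequence are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: case-sensitive test for the motif.
sub has_motif {
    my ($dna) = @_;
    return $dna =~ m/CT[CGT]ACG/ ? 1 : 0;
}

# Hypothetical helper: the same test with the i option, which ignores case.
sub has_motif_any_case {
    my ($dna) = @_;
    return $dna =~ m/CT[CGT]ACG/i ? 1 : 0;
}

my $dna = 'aagcttacgga';    # made-up lowercase sequence
print "case-sensitive: ", has_motif($dna), "\n";
print "with /i:        ", has_motif_any_case($dna), "\n";
```

Writing the leading m explicitly is optional when the delimiters are forward slashes; /CT[CGT]ACG/i means the same thing.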

The Perl documentation is essential: start with the perlre section of the Perl manual at http://www.perldoc.com/perl5.6/pod/perlre.html#top.


© 2002, O'Reilly & Associates, Inc.