< BACKCONTINUE >

5.3 Finding Motifs

One of the most common things we do in bioinformatics is to look for motifs, short segments of DNA or protein that are of particular interest. They may be regulatory elements of DNA or short stretches of protein that are known to be conserved across many species. (The PROSITE web site at http://www.expasy.ch/prosite/ has extensive information about protein motifs.)

The motifs you look for in biological sequences are usually not one specific sequence. They may have several variants—for example, positions in which it doesn't matter which base or residue is present. They may have variant lengths as well. They can often be represented as regular expressions, which you'll see more of in the discussion following Example 5-3, in Chapter 9, and elsewhere in the book.

Perl has a handy set of features for finding things in strings. This, as much as anything, has made it a popular language for bioinformatics. Example 5-3 introduces this string-searching capability; it does something genuinely useful, and similar programs are used all the time in biology research. It does the following:

  • Reads in protein sequence data from a file

  • Puts all the sequence data into one string for easy searching

  • Looks for motifs the user types in at the keyboard

Example 5-3. Searching for motifs
#!/usr/bin/perl -w
# Searching for motifs

# Ask the user for the filename of the file containing
# the protein sequence data, and collect it from the keyboard
print "Please type the filename of the protein sequence data: ";

$proteinfilename = <STDIN>;

# Remove the newline from the protein filename
chomp $proteinfilename;

# open the file, or exit
unless ( open(PROTEINFILE, $proteinfilename) ) {

    print "Cannot open file \"$proteinfilename\"\n\n";
    exit;
}

# Read the protein sequence data from the file, and store it
# into the array variable @protein
@protein = <PROTEINFILE>;

# Close the file - we've read all the data into @protein now.
close PROTEINFILE;

# Put the protein sequence data into a single string, as it's easier
# to search for a motif in a string than in an array of
# lines (what if the motif occurs over a line break?)
$protein = join( '', @protein);

# Remove whitespace
$protein =~ s/\s//g;

# In a loop, ask the user for a motif, search for the motif,
# and report if it was found.
# Exit if no motif is entered.
do {
    print "Enter a motif to search for: ";

    $motif = <STDIN>;

    # Remove the newline at the end of $motif

    chomp $motif;

    # Look for the motif

    if ( $protein =~ /$motif/ ) {

        print "I found it!\n\n";

    } else {

        print "I couldn\'t find it.\n\n";
    }

# exit on an empty user input
} until ( $motif =~ /^\s*$/ );

# exit the program
exit;

Here's some typical output from Example 5-3:

Please type the filename of the protein sequence data: NM_021964fragment.pep
Enter a motif to search for: SVLQ
I found it!

Enter a motif to search for: jkl
I couldn't find it.

Enter a motif to search for: QDSV
I found it!

Enter a motif to search for: HERLPQGLQ
I found it!

Enter a motif to search for: 
I couldn't find it. 

As you see from the output, this program finds motifs that the user types in at the keyboard. With such a program, you no longer have to search manually through potentially huge amounts of data. The computer does the work and does it much faster and more accurately than a human.

It'd be nice if this program not only reported it found the motif but at what position. You'll see how this can be accomplished in Chapter 9. An exercise in that chapter challenges you to modify this program so that it reports the positions of the motifs.

The following sections examine and discuss the parts of Example 5-3 that are new:

  • Getting user input from the keyboard

  • Joining lines of a file into a single scalar variable

  • Regular expressions and character classes

  • do-until loops

  • Pattern matching

5.3.1 Getting User Input from the Keyboard

You first saw filehandles in Example 4-5. In Example 5-3 (as was true in Example 4-3), a filehandle and the angle bracket input operator are used to read in data from an opened file into an array, like so:

@protein = <PROTEINFILE>;

Perl uses the same syntax to get input that is typed by the user at the keyboard. In Example 5-3, a special filehandle called STDIN (short for standard input), is used for this purpose, as in this line that collects a filename from the user:

$proteinfilename = <STDIN>;

So, a filehandle can be associated with a file; it can also be associated with the keyboard where the user types responses to questions the program asks.

If the variable you're using to save the input is a scalar variable starts with a dollar sign $), as in this fragment, only one line is read, which is almost always what you want in this case.

In Example 5-3, the user is requested to enter the filename of a file containing protein sequence data. After getting a filename in this fashion, there's one more step before you can open the file. When the user types in a filename and sends a newline by hitting the Enter key (also known as the Return key), the filename also gets a newline character at the end as it is stored in the variable. This newline is not part of the filename and has to be removed before the open system call will work. The Perl function chomp removes newlines (or its cousins linefeeds and carriage returns) from the end of a string. (The older function chop removes the last character, no matter what it is; this caused trouble, so chomp was introduced and is almost always preferred.)

So this part of Perl requires a little bit extra: removing the newline from the input collected from the user at the keyboard. Try commenting out the chomp function, and you'll see that the open fails, because no filename has a newline at the end. (Operating systems have rules as to which characters are allowed in filenames.)

5.3.2 Turning Arrays into Scalars with join

It's common to find protein sequence data broken into short segments of 80 or so characters each. The reason is simple: when data is printed out on paper or displayed on the screen, it needs to be broken up into lines that fit into the space. Having your data broken into segments, however, is inconvenient for your Perl program. What if you're searching for a motif that's split by a newline character? Your program won't find it. In fact, some of the motifs searched for in Example 5-3 are split by line breaks. In Perl you deal with this sort of segmented data with the Perl function join. In Example 5-3 join collapses an array @protein by combining all the lines of data into a single string stored in a new scalar variable $protein:

$protein = join( '', @protein);

You specify a string to be placed between the elements of the array as they're joined. In this case, you specify the empty string to be placed between the lines of the input file. The empty string is represented with the pair of single quotes '' (double quotes "" also serve).

Recall that in Example 4-2, I introduced several equivalent ways to concatenate two fragments of DNA. The use of the join function is very similar. It takes the scalar values that are the elements of the array and concatenates them into a single scalar value. Recall the following statement from Example 4-2, which is one of the equivalent ways to concatenate two strings:

$DNA3 = $DNA1 . $DNA2;

Another way to accomplish the same concatenation uses the join function:

$DNA3 = join( "", ($DNA1, $DNA2) );

In this version, instead of giving an array name, I specify a list of scalar elements:

($DNA1, $DNA2)

5.3.3 do-until Loops

There's a new kind of loop in Example 5-3, the do-until loop, which first executes a block and then does a conditional test. Sometimes this is more convenient than the usual order in which you test first, then do the block if the test succeeds. Here, you want to prompt the user, get the user's input, search for the motif, and report the results. Before doing it again, you check the conditional test to see if the user has input an empty line. This means that the user has no more motifs to look for, so you exit the loop.

5.3.4 Regular Expressions

Regular expressions let you easily manipulate strings of all sorts, such as DNA and protein sequence data. What's great about regular expressions is that if there's something you want to do with a string, you usually can do it with Perl regular expressions.

Some regular expressions are very simple. For instance, you can just use the exact text of what you're searching for as a regular expression: if I was looking for the word "bioinformatics" in the text of this book, I could use the regular expression:

/bioinformatics/

Some regular expressions can be more complex, however. In this section, I'll explain their use in Example 5-3.

5.3.4.1 Regular expressions and character classes

Regular expressions are ways of matching one or more strings using special wildcard-like operators. Regular expressions can be as simple as a word, which matches the word itself, or they can be complex and made to match a large set of different words (or even every word!).

After you join the protein sequence data into the scalar variable $protein in Example 5-3, you also need to remove newlines and anything else that's not sequence data. This can include numbers on the lines, comments, informational or "header" lines, and so on. In this case, you want to remove newlines and any spaces or tabs that might be invisibly present. The following line of code in Example 5-3 removes this whitespace:

$protein =~ s/\s//g;

The sequence data in the scalar variable $protein is altered by this statement. You first saw the binding operator =~ and the substitute function s/// back in Example 4-3, where they were used to change one character into another. Here, they're used a little differently. You substitute any one of a set of whitespace characters, represented by \s with nothing and by the lack of anything between the second and third forward slashes. In other words, you delete any of a set of whitespace characters, which is done globally throughout the string by virtue of the g at the end of the statement.

The \s is one of several metasymbols. You've already seen the metasymbol \n. The \s metasymbol matches any space, tab, newline, carriage return, or formfeed. \s can also be written as:

[ \t\n\f\r]

This expression is an example of a character class and is enclosed in square brackets. A character class matches one character, any one of the characters named within the square brackets. A space is just typed as a space; other whitespace characters have their own metasymbols: \t for tab, \n for newline, \f for formfeed, and \r for carriage return. A carriage return causes the next character to be written at the beginning of the line, and a formfeed advances to the next line. The two of them together amount to the same thing as a newline character.

Each s/// command I've detailed has some kind of regular expression between the first two forward slashes /. You've seen single letters as the C in s/C/G/g in that position. The C is an example of a valid regular expression.

There's another use of regular expressions in Example 5-3. The line of code:

if ( $motif =~ /^\s*$/ ) {

is, in English, testing for a blank line in the variable $motif. If the user input is nothing except for perhaps some whitespace, represented as \s*, the match succeeds, and the program exits. The whole regular expression is:

/^\s*$/ 

which translates as: match a string that, from the beginning (indicated by the ^), is zero or more (indicated by the *) whitespace characters (indicated by the \s) until the end of the string (indicated by the $).

If this seems somewhat cryptic, just hang in there and you'll soon get familiar with the terminology. Regular expressions are a great way to manipulate sequence and other text-based data, and Perl is particularly good at making regular expressions relatively easy to use, powerful, and flexible. Many of the references in Appendix A contain material on regular expressions, and there's a concise summary in Appendix B.

5.3.4.2 Pattern matching with =~ and regular expressions

The actual search for the motif happens in this line from Example 5-3:

if ( $protein =~ /$motif/ ) {

Here, the binding operator =~ searches for the regular expression stored as the value of the variable $motif in the protein $protein. Using this feature, you can interpolate the value of a variable into a string match. (Interpolation in Perl strings means inserting the value of a variable into a string, as you first saw in Example 4-2 when you were concatenating strings). The actual motif, that is, the value of the string variable $motif, is your regular expression. The simplest regular expressions are just strings of characters, such as the motif AQQK, for example.

You can use Example 5-3 to play with some more features of regular expressions. You can type in any regular expression to search for in the protein. Try starting up the program, referring to the documentation on regular expressions, and play! Here are some examples of typing in regular expressions:

  • Search for an A followed by a D or S, followed by a V:

    Enter a motif to search for: A[DS]V
    I couldn't find it.
  • Search for K, N, zero or more D's, and two or more E's (note that {2,} means "two or more"):

    Enter a motif to search for: KND*E{2,}
    I found it!
  • Search for two E's, followed by anything, followed by another two E's:

    Enter a motif to search for: EE.*EE
    I found it!

In that last search, notice that a period stands for any character except a newline, and ".*" stands for zero or more such characters. (If you want to actually match a period, you have to escape it with a backslash.)

< BACKCONTINUE >

Index terms contained in this section

" (quotes, double)
      empty strings, representing with
' (quotes, single)
      empty strings, representing with
(angle brackets)
      line input (angle) operator
* (asterisk)
      character wildcard
. (dot)
      character wildcard
= (equal sign)
      =~ (pattern binding) operator
[] (brackets)
      enclosing character class in
angle operator
arrays
      converting to scalar variables
binding operators
      (=~)
blank line (user input), testing for
carriage returns
      matching with metasymbols
      removing with chomp function
character classes
      regular expressions and
chomp function, removing newlines with
chop function, chomp vs.
concatenating DNA fragments
      join function, using
conditional statements
      testing loops with
DNA
      concatenating fragments with join function
do-until loops
empty string
filehandles
filenames, removing newlines from
formfeeds, matching with metasymbols
input
      user, getting from keyboard
interpolating variables
      into string matches
join function, converting arrays to scalars
line input (angle) operator
linefeeds, removing with chomp function
loops
      do-until
matching patterns
metasymbols
motifs
      finding in strings
      newlines, eliminating with join
      protein (PROSITE web site)
names
      filenames, operating system
newlines
      matching with metasymbols
      motifs split by
      removing from filenames
open system call
      newlines in filenames and
operating systems
      filenames, characters not allowed in
operators
      binding
      input
      s/// (substitution) 2nd
patterns (and regular expressions)
      character classes and
      example of regular expressions
     matching
            =~ (binding) operator
period , under Symbols)
proteins
      motifs, finding
            PROSITE web site
removing
      carriage returns, linefeeds, and newlines
      newlines and other nonsequence data
      newlines from filenames
s/// (substitution) operator
      regular expressions, use with
scalar variables
      arrays, converting to
      user input, saving as
sequences
      removing nonsequence data from
strings
      empty
      finding motifs in
      newlines, removing from ends
substitution (s///) operator
tabs, matching with metasymbols
user input, getting from keyboard
      blank line, testing for 2nd
variables
     interpolating
            into string matches
web sites
      PROSITE, protein motif information
whitespace
      metasymbols, matching with
wildcards
      in regular expressions

© 2002, O'Reilly & Associates, Inc.