5.3
Finding Motifs
One of the most common things we do in
bioinformatics is to look for motifs,
short segments of DNA or protein that are of particular interest.
They may be regulatory elements of DNA or short stretches of protein
that are known to be conserved across many species. (The PROSITE web
site at http://www.expasy.ch/prosite/ has extensive
information about
protein motifs.)
The motifs you look for in biological sequences are usually not one
specific sequence. They may have several variants—for example,
positions in which it doesn't matter which base or residue is
present. They may have variant lengths as well. They can often be
represented as regular expressions, which you'll see more of in
the discussion following Example 5-3, in Chapter 9, and elsewhere in the book.
Perl has a handy set of features for finding things in strings. This, as much as anything,
has made it a popular language for bioinformatics. Example 5-3 introduces this string-searching capability;
it does something genuinely useful, and similar programs are used all
the time in biology research. It does the following:
-
Reads in protein sequence data from a file
-
Puts all the sequence data into one string for easy searching
-
Looks for motifs the user types in at the keyboard
Example 5-3. Searching for motifs
#!/usr/bin/perl -w
# Searching for motifs
# Ask the user for the filename of the file containing
# the protein sequence data, and collect it from the keyboard
print "Please type the filename of the protein sequence data: ";
$proteinfilename = <STDIN>;
# Remove the newline from the protein filename
chomp $proteinfilename;
# open the file, or exit
unless ( open(PROTEINFILE, $proteinfilename) ) {
print "Cannot open file \"$proteinfilename\"\n\n";
exit;
}
# Read the protein sequence data from the file, and store it
# into the array variable @protein
@protein = <PROTEINFILE>;
# Close the file - we've read all the data into @protein now.
close PROTEINFILE;
# Put the protein sequence data into a single string, as it's easier
# to search for a motif in a string than in an array of
# lines (what if the motif occurs over a line break?)
$protein = join( '', @protein);
# Remove whitespace
$protein =~ s/\s//g;
# In a loop, ask the user for a motif, search for the motif,
# and report if it was found.
# Exit if no motif is entered.
do {
print "Enter a motif to search for: ";
$motif = <STDIN>;
# Remove the newline at the end of $motif
chomp $motif;
# Look for the motif
if ( $protein =~ /$motif/ ) {
print "I found it!\n\n";
} else {
print "I couldn\'t find it.\n\n";
}
# exit on an empty user input
} until ( $motif =~ /^\s*$/ );
# exit the program
exit;
Here's some typical output from Example 5-3:
Please type the filename of the protein sequence data: NM_021964fragment.pep
Enter a motif to search for: SVLQ
I found it!
Enter a motif to search for: jkl
I couldn't find it.
Enter a motif to search for: QDSV
I found it!
Enter a motif to search for: HERLPQGLQ
I found it!
Enter a motif to search for:
I couldn't find it.
As you see from the output, this program finds motifs that the user
types in at the keyboard. With such a program, you no longer have to
search manually through potentially huge amounts of data. The
computer does the work and does it much faster and more accurately
than a human.
It'd be nice if this program not only reported it found the
motif but at what position. You'll see how this can be
accomplished in Chapter 9. An exercise in that
chapter challenges you to modify this program so that it reports the
positions of the motifs.
The following sections examine and discuss the parts of Example 5-3 that are new:
-
Getting user input from the keyboard
-
Joining lines of a file into a single scalar variable
-
Regular expressions and character classes
-
do-until loops
-
Pattern matching
5.3.1
Getting User Input from the Keyboard
You first
saw filehandles in Example 4-5. In Example 5-3
(as was true in Example 4-3), a filehandle and the
angle bracket
input operator are used to read in data from an opened file into an
array, like so:
@protein = <PROTEINFILE>;
Perl uses the same syntax to get input that is typed by the user at
the keyboard. In Example 5-3, a special filehandle
called STDIN (short for standard input), is used for this purpose, as
in this line that collects a filename from the user:
$proteinfilename = <STDIN>;
So, a filehandle can be associated with a file; it can also be
associated with the keyboard where the user types responses to
questions the program asks.
If the variable you're using to save the input is a
scalar variable starts with a dollar
sign $), as in this fragment, only one line is
read, which is almost always what you want in this case.
In Example 5-3, the user is requested to enter the
filename of a file containing protein sequence data. After getting a
filename in this fashion, there's one more step before you can
open the file. When the user types in a filename and sends a newline
by hitting the Enter key (also known as the Return key), the filename
also gets a newline character at the end as it is stored in the
variable. This newline is not part of the filename and has to be
removed before the open
system call will work. The Perl
function chomp
removes newlines (or
its cousins linefeeds and carriage returns) from the end of a
string. (The older function
chop removes the last character, no matter what
it is; this caused trouble, so chomp was
introduced and is almost always preferred.)
So this part of Perl requires a little bit extra: removing the
newline from the input collected from the user at the keyboard. Try
commenting out the chomp function, and
you'll see that the open fails, because no
filename has a newline at the end. (Operating systems have rules as
to which characters are allowed in filenames.)
5.3.2
Turning Arrays into Scalars with join
It's common to find protein
sequence data broken into
short segments of 80 or so characters each. The reason is simple:
when data is printed out on paper or displayed on the screen, it
needs to be broken up into lines that fit into the space. Having your
data broken into segments, however, is inconvenient for your Perl
program. What if you're searching for a motif that's
split by a
newline character? Your program
won't find it. In fact, some of the motifs searched for in
Example 5-3 are split by line breaks. In Perl you
deal with this sort of segmented data with the Perl function
join. In Example 5-3
join collapses an array
@protein by combining all the lines of data into a
single string stored in a new scalar variable
$protein:
$protein = join( '', @protein);
You specify a string to be placed between the elements of the array
as they're joined. In this case, you specify the
empty string to be
placed between the lines of the input file. The empty string is
represented with the pair of single quotes ''
(double quotes "" also serve).
Recall that in Example 4-2, I introduced several
equivalent ways to concatenate two fragments of DNA. The use of the
join function is very similar. It takes the
scalar values that are the elements of the array and concatenates
them into a single scalar value. Recall the following statement from
Example 4-2, which is one of the equivalent ways to
concatenate two strings:
$DNA3 = $DNA1 . $DNA2;
Another way to accomplish the same concatenation uses the
join function:
$DNA3 = join( "", ($DNA1, $DNA2) );
In this version, instead of giving an array name, I specify a list of
scalar elements:
($DNA1, $DNA2)
5.3.3
do-until Loops
There's a new kind of loop in Example 5-3, the
do-until
loop, which first executes a block and then does a
conditional test. Sometimes this is more
convenient than the usual order in which you test first, then do the
block if the test succeeds. Here, you want to prompt the
user, get the user's input, search for the motif, and report
the results. Before doing it again, you check the conditional test to
see if the user has input an empty line. This means that the user has
no more motifs to look for, so you exit the loop.
5.3.4
Regular Expressions
Regular expressions let you easily manipulate
strings of all
sorts, such as DNA and protein sequence data. What's great
about regular expressions is that if there's something you want
to do with a string, you usually can do it with Perl regular
expressions.
Some regular expressions are very simple. For instance, you can just
use the exact text of what you're searching for as a regular
expression: if I was looking for the word
"bioinformatics" in the text of this book, I could use
the regular expression:
/bioinformatics/
Some regular expressions can be more complex, however. In this
section, I'll explain their use in Example 5-3.
5.3.4.1
Regular expressions and character classes
Regular expressions are ways of matching one or more strings using
special wildcard-like operators. Regular
expressions can be as simple as a word, which matches the word
itself, or they can be complex and made to match a large set of
different words (or even every word!).
After you join the protein sequence data into the scalar variable
$protein in Example 5-3, you also
need to
remove newlines and
anything else that's not sequence data. This can include
numbers on the lines, comments, informational or "header"
lines, and so on. In this case, you want to remove newlines and any
spaces or tabs that might be invisibly present. The following line of
code in Example 5-3 removes this whitespace:
$protein =~ s/\s//g;
The sequence data in the scalar variable $protein
is altered by this statement. You first saw the binding operator
=~ and the
substitute function
s/// back in Example 4-3, where
they were used to change one character into another. Here,
they're used a little differently. You substitute any one of a
set of whitespace characters, represented by \s
with nothing and by the lack of anything between the second and third
forward slashes. In other words, you delete any of a set of
whitespace characters, which is done globally throughout the string
by virtue of the g at the end of the statement.
The \s is one of several
metasymbols.
You've already seen the metasymbol \n. The
\s metasymbol matches any space, tab, newline,
carriage return, or formfeed. \s can also be
written as:
[ \t\n\f\r]
This expression is an example of a character class and is enclosed in
square brackets. A
character class matches one character, any one of the characters
named within the square brackets. A space is just typed as a space;
other whitespace characters have their own metasymbols:
\t for tab, \n for newline,
\f for formfeed, and \r for
carriage return. A carriage return causes the next character to be
written at the beginning of the line, and a formfeed advances to the
next line. The two of them together amount to the same thing as a
newline character.
Each s///
command I've detailed
has some kind of regular expression between the first two forward
slashes /. You've seen single letters as the
C in s/C/G/g in that
position. The C is an example of a valid regular
expression.
There's another use of regular expressions in Example 5-3. The line of code:
if ( $motif =~ /^\s*$/ ) {
is, in English, testing for a blank line in the variable
$motif. If the user input is nothing except for
perhaps some whitespace, represented as \s*, the
match succeeds, and the program exits. The whole regular expression
is:
/^\s*$/
which translates as: match a string that, from the beginning
(indicated by the ^), is zero or more (indicated
by the *) whitespace characters (indicated by the
\s) until the end of the string (indicated by the
$).
If this seems somewhat cryptic, just hang in there and you'll
soon get familiar with the terminology. Regular expressions are a
great way to manipulate sequence and other text-based data, and Perl
is particularly good at making regular expressions relatively easy to
use, powerful, and flexible. Many of the references in Appendix A contain material on regular expressions, and
there's a concise summary in Appendix B.
5.3.4.2
Pattern matching with =~ and regular expressions
The actual search for the motif happens in this line from Example 5-3:
if ( $protein =~ /$motif/ ) {
Here, the
binding
operator =~ searches for the regular expression
stored as the value of the variable $motif in the
protein $protein. Using this feature, you can
interpolate
the value of a variable into a string match. (Interpolation in Perl
strings means inserting the value of a variable into a string, as you
first saw in Example 4-2 when you were concatenating
strings). The actual motif, that is, the value of the string variable
$motif, is your regular expression. The simplest
regular expressions are just strings of characters, such as the motif
AQQK, for example.
You can use Example 5-3 to play with some more
features of regular expressions. You can type in any regular
expression to search for in the protein. Try starting up the program,
referring to the documentation on regular expressions, and play! Here
are some examples of typing in regular expressions:
-
Search for an A followed by a D or S, followed by a V:
Enter a motif to search for: A[DS]V
I couldn't find it.
-
Search for K, N, zero or more D's, and two or more E's
(note that {2,} means "two or more"):
Enter a motif to search for: KND*E{2,}
I found it!
-
Search for two E's, followed by anything, followed by another
two E's:
Enter a motif to search for: EE.*EE
I found it!
In that last search, notice that a
period stands for any
character except a newline, and ".*" stands for zero or
more such characters. (If you want to actually match a period, you
have to escape it with a backslash.)