10.3
Separating Sequence and Annotation
In previous chapters you saw how to
examine the lines of a file using Perl's array operations.
Usually, you do this by saving the data in an array with each
appearing as an element of the array.
Let's look at two methods to extract the annotation and the DNA
from a GenBank file. In the first method, you'll slurp the file
into an array and look through the lines, as in previous programs. In
the second, you'll put the whole GenBank record into a scalar
variable and use regular expressions to parse the information. Is one
approach better than the other? Not necessarily: it depends on the
data. There are advantages and disadvantages to each, but both get
the job done.
I've put five GenBank records in a file called
library.gb. As before, you can download the file
from this book's web site. You'll use this datafile and
the file record.gb in the next few examples.
10.3.1
Using Arrays
Example 10-1 shows the first method, which operates
on an
array containing the lines of the
GenBank record. The main program is followed by a subroutine that
does the real work.
Example 10-1. Extract annotation and sequence from GenBank file
#!/usr/bin/perl
# Extract annotation and sequence from GenBank file
use strict;
use warnings;
use BeginPerlBioinfo; # see Chapter 6 about this module
# declare and initialize variables
my @annotation = ( );
my $sequence = '';
my $filename = 'record.gb';
parse1(\@annotation, \$sequence, $filename);
# Print the annotation, and then
# print the DNA in new format just to check if we got it okay.
print @annotation;
print_sequence($sequence, 50);
exit;
################################################################################
# Subroutine
################################################################################
# parse1
#
# --parse annotation and sequence from GenBank record
sub parse1 {
my($annotation, $dna, $filename) = @_;
# $annotation--reference to array
# $dna --reference to scalar
# $filename --scalar
# declare and initialize variables
my $in_sequence = 0;
my @GenBankFile = ( );
# Get the GenBank data into an array from a file
@GenBankFile = get_file_data($filename);
# Extract all the sequence lines
foreach my $line (@GenBankFile) {
if( $line =~ /^\/\/\n/ ) { # If $line is end-of-record line //\n,
last; #break out of the foreach loop.
} elsif( $in_sequence) { # If we know we're in a sequence,
$$dna .= $line; # add the current line to $$dna.
} elsif ( $line =~ /^ORIGIN/ ) { # If $line begins a sequence,
$in_sequence = 1; # set the $in_sequence flag.
} else{ # Otherwise
push( @$annotation, $line); # add the current line to @annotation.
}
}
# remove whitespace and line numbers from DNA sequence
$$dna =~ s/[\s0-9]//g;
}
Here's the beginning and end of Example 10-1's
output
of the sequence data:
agatggcggcgctgaggggtcttgggggctctaggccggccacctactgg
tttgcagcggagacgacgcatggggcctgcgcaataggagtacgctgcct
gggaggcgtgactagaagcggaagtagttgtgggcgcctttgcaaccgcc
tgggacgccgccgagtggtctgtgcaggttcgcgggtcgctggcgggggt
cgtgagggagtgcgccgggagcggagatatggagggagatggttcagacc
...
cggtgcccgtgtgtccgttcctccactcatctgtttctccggttctccct
gtgcccatccaccggttgaccgcccatctgcctttatcagagggactgtc
cccgtcgacatgttcagtgcctggtggggctgcggagtccactcatcctt
gcctcctctccctgggttttgttaataaaattttgaagaaaccaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
The foreach loop in subroutine
parse1 in Example 10-1 moves one
by one through the lines from the GenBank file stored in the array
@GenBankFile. It takes advantage of the structure
of a GenBank file, which begins with annotation and runs until the
line:
ORIGIN
is found, after which sequence appears until the end-of-record line
// is reached. The loop uses a flag variable
$in_sequence to remember that it has found the
ORIGIN line and is now reading sequence lines.
The foreach loop has a new feature: the Perl
built-in function last, which breaks out of the
nearest enclosing loop. It's triggered by the end-of-record
line //, which is reached when the entire record
has been seen.
A regular expression is used to find the end-of-record line. To
correctly match the end-of-record (forward) slashes, you must escape
them by putting a backslash in front of each one, so that Perl
doesn't interpret them as prematurely ending the pattern. The
regular expression also ends with a newline
\/\/\n, which is then placed inside the usual
delimiters: /\/\/\n/. (When you have a lot of
forward slashes in a regular expression, you can use another
delimiter around the regular expression and precede it with an
m, thus avoiding having to backslash the forward
slashes. It's done like so: m!//\n!).
An interesting point about subroutine parse1 is
the order of the tests in the foreach loop that
goes through the lines of the GenBank record. As you read through the
lines of the record, you want to first gather the annotation lines,
set a flag when the ORIGIN start-of-sequence line
is found, and then collect the lines until the end-of-record
// line is found.
Notice that the order of the tests is exactly the opposite. First,
you test for the end-of-record line, collect the sequence if the
$in_sequence flag is set, and then test for the
start-of-sequence ORIGIN line. Finally, you
collect the annotation.
The technique of reading lines one by one and using flag variables to
mark what section of the file you're in, is a common
programming technique. So, take a moment to think about how the loop
would behave if you changed the order of the tests. If you collected
sequence lines before testing for the end-of-record, you'd
never get to the end-of-record test!
Other methods of collecting annotation and sequence lines are
possible, especially if you go through the lines of the array more
than once. You can scan through the array, keeping track of the
start-of-sequence and end-of-record line numbers, and then go back
and extract the annotation and sequence using an array splice (which
was described in the parseREBASE subroutine in
Example 9-2). Here's an example:
# find line numbers of ORIGIN and // in the GenBank record
$linenumber = 0;
foreach my $line (@GenBankFile) {
if ( $line =~ /^//\n/ ) { # end-of-record // line
$end = $linenumber;
last;
} elsif ( $line =~ /^ORIGIN/ ) { # end annotation, begin sequence
$origin = $linenumber;
}
$linenumber++;
}
# extract annotation and sequence with "array splice"
@annotation = @GenBankFile[0..($origin-1)];
@sequence = @GenBankFile[($origin+1)..($end-1)];
</programlisting>
10.3.2
Using Scalars
A second
way to separate annotations from
sequences in GenBank records is to read the entire record into a
scalar variable and operate on it with regular expressions. For some
kinds of data, this can be a more convenient way to parse the input
(compared to scanning through an array, as in Example 10-1).
Usually string data is stored one line per scalar variable with its
newlines, if any, at the end of the string. Sometimes, however, you
store several lines concatenated together in one string that is, in
turn, stored in a single scalar variable. These multiline strings
aren't uncommon; you used them to gather the sequence from a
FASTA file in Examples Example 6-2 and Example 6-3. Regular expressions have pattern modifiers
that can be used to make multiline
strings with their embedded newlines easy to
use.
10.3.2.1
Pattern modifiers
The
pattern modifiers we've used so
far are /g, for global matching, and
/i, for case-insensitive matching. Let's
take a look at two more that affect the way regular expressions
interact with the newlines
in scalars.
Recall that previous regular expressions have used the
caret (^), dot
(.), and dollar sign ($)
metacharacters. The ^ anchors a regular expression
to the beginning of a string, by default, so that /^THE
BEGUINE/ matches a string that begins with "THE
BEGUINE". Similarly, $ anchors an expression
to the end of the string, and the dot (.) matches
any character except a newline.
The following pattern modifiers affect these three metacharacters:
-
The /s
modifier assumes you want to treat the whole string as a single line,
even with embedded newlines, so it makes the dot metacharacter match
any character including newlines.
-
The /m modifier
assumes you want to treat the whole string as a multiline, with
embedded newlines, so it extends the ^ and the
$ to match after, or before, a newline, embedded
in the string.
10.3.2.2
Examples of pattern modifiers
Here's an example of the default behavior of caret
(^), dot (.), and dollar sign
($):
use warnings;
"AAC\nGTT" =~ /^.*$/;
print $&, "\n";
This demonstrates the default behavior without the
/m or /s modifiers and prints
the warning:
Use of uninitialized value in print statement at line 3.
The print statement tries to print
$&
,
a special variable that is always set to the last successful pattern
match. This time, since the pattern doesn't match, the variable
$& isn't set, and you get a warning
message for attempting to print an uninitialized value.
Why doesn't the match succeed? First, let's examine the
^.*$ pattern. It begins with a
^, which means it must match from the beginning of
the string. It ends with a $, which means it must
also match at the end of the string (the end of the string may
contain a single newline, but no other newlines are allowed). The
.* means it must match zero or more
(*) of any characters (.)
except the newline. So, in other words, the pattern
^.*$ matches any string that doesn't contain
a newline except for a possible single newline as the last character.
But since the string in question, "ACC\nGTT" does
contain an embedded newline \n that isn't
the last character, the pattern match fails.
In the next examples, the pattern modifiers /m and
/s change the default behaviors for the
metacharacters ^, and $, and
the dot:
"AAC\nGTT" =~ /^.*$/m;
print $&, "\n";
This snippet prints out AAC and demonstrates the
/m modifier. The /m extends the
meaning of the ^ and the $ so
they also match around embedded newlines. Here, the pattern matches
from the beginning of the string up to the first embedded newline.
The next snippet of code demonstrates the /s
modifier:
"AAC\nGTT" =~ /^.*$/s;
print $&, "\n";
which produces the output:
AAC
GTT
The /s modifier changes the meaning of the dot
metacharacter so that it matches any character including newlines.
With the /s modifier, the pattern matches
everything from the beginning of the string to the end of the string,
including the newline. Notice when it prints, it prints the embedded
newline.
10.3.2.3
Separating annotations from sequence
Now that you've met the pattern-matching modifiers and regular
expressions that will be your main tools for parsing a GenBank file
as a scalar, let's try separating the annotations from the
sequence.
The first step is to get the GenBank record stored as a scalar
variable. Recall that a GenBank record starts with a line beginning
with the word "LOCUS" and ends with the
end-of-record separator: a line
containing two forward slashes.
First you want to read a GenBank record and store it in a scalar
variable. There's a device called an input record
separator denoted by the special variable
$/ that lets you do exactly that. The input record
separator is usually set to a newline, so each call to read a scalar
from a filehandle gets one line. Set it to the GenBank end-of-record
separator like so:
$/ = "//\n";
A call to read a scalar from a filehandle takes all the data up to
the GenBank end-of-record separator. So the line
$record =
<GBFILE> in Example 10-2
stores the multiline GenBank record into the scalar variable
$record. Later, you'll see that you can keep
repeating this call in order to read in successive GenBank records
from a GenBank library file.
After reading in the record, you'll parse it into the
annotation and sequence parts making use of
/s and
/m pattern modifiers. Extracting the
annotation and sequence is the easy part; parsing the annotation will
occupy most of the remainder of the chapter.
Example 10-2. Extract annotation and sequence from Genbank record
#!/usr/bin/perl
# Extract the annotation and sequence sections from the first
# record of a GenBank library
use strict;
use warnings;
use BeginPerlBioinfo; # see Chapter 6 about this module
# Declare and initialize variables
my $annotation = '';
my $dna = '';
my $record = '';
my $filename = 'record.gb';
my $save_input_separator = $/;
# Open GenBank library file
unless (open(GBFILE, $filename)) {
print "Cannot open GenBank file \"$filename\"\n\n";
exit;
}
# Set input separator to "//\n" and read in a record to a scalar
$/ = "//\n";
$record = <GBFILE>;
# reset input separator
$/ = $save_input_separator;
# Now separate the annotation from the sequence data
($annotation, $dna) = ($record =~ /^(LOCUS.*ORIGIN\s*\n)(.*)\/\/\n/s);
# Print the two pieces, which should give us the same as the
# original GenBank file, minus the // at the end
print $annotation, $dna;
exit;
The output from this program is the same as the GenBank file listed
previously, minus the last line, which is the end-of-record separator
//.
Let's focus on the regular expression that parses the
annotation and sequence out of the $record
variable. This is the most complicated regular expression so far:
$record = /^(LOCUS.*ORIGIN\s*\n)(.*)\/\/\n/s.
There are two pairs of parentheses in the regular
expression: (LOCUS.*ORIGIN\s*\n) and
(.*). The parentheses are
metacharacters whose
purpose is to remember the parts of the data that match the pattern
within the parentheses, namely, the annotation and the sequence. Also
note that the pattern match returns an array whose elements are the
matched parenthetical patterns. After you match the annotation and
the sequence within the pairs of parentheses in the regular
expression, you simply assign the matched patterns to the two
variables $annotation and $dna,
like so:
($annotation, $dna) = ($record =~ /^(LOCUS.*ORIGIN\s*\n)(.*)\/\/\n/s);
Notice that at the end of the pattern, we've added the
/s pattern matching modifier, which, as
you've seen earlier, allows a dot to match any character
including an embedded newline. (Of course, since we've got a
whole GenBank record in the $record scalar, there
are a lot of embedded newlines.)
Next, look at the first pair of parentheses:
(LOCUS.*ORIGIN\s*\n)
This whole expression is anchored at the beginning of the string by
preceding it with a ^ metacharacter. (/s
doesn't change the meaning of the ^
character in a regular expression.)
Inside the parentheses, you match from where the string
LOCUS appears at the beginning of the GenBank
record, followed by any number of characters including newlines with
.*, followed by the string
ORIGIN, followed by possibly some whitespace with
\s*, followed by a newline \n.
This matches the annotation part of the GenBank record.
Now, look at the second parentheses and the remainder:
(.*)\/\/\n
This is easier. The .* matches any character,
including newlines because of the /s pattern
modifier at the end of the pattern match. The parentheses are
followed by the end-of-record line, //, including
the newline at the end, with the slashes preceded by backslashes to
show that you want to match them exactly. They're not
delimiters of the pattern matching operator. The end result is the
GenBank record with the annotation and the sequence separated into
the variables $annotation and
$sequence. Although the regular expression I used
requires a bit of explanation, the attractive thing about this
approach is that it took only one line of Perl code to extract both
annotation and sequence.