< BACKCONTINUE >

10.3 Separating Sequence and Annotation

In previous chapters you saw how to examine the lines of a file using Perl's array operations. Usually, you do this by saving the data in an array with each appearing as an element of the array.

Let's look at two methods to extract the annotation and the DNA from a GenBank file. In the first method, you'll slurp the file into an array and look through the lines, as in previous programs. In the second, you'll put the whole GenBank record into a scalar variable and use regular expressions to parse the information. Is one approach better than the other? Not necessarily: it depends on the data. There are advantages and disadvantages to each, but both get the job done.

I've put five GenBank records in a file called library.gb. As before, you can download the file from this book's web site. You'll use this datafile and the file record.gb in the next few examples.

10.3.1 Using Arrays

Example 10-1 shows the first method, which operates on an array containing the lines of the GenBank record. The main program is followed by a subroutine that does the real work.

Example 10-1. Extract annotation and sequence from GenBank file
#!/usr/bin/perl
# Extract annotation and sequence from GenBank file

use strict;
use warnings;
use BeginPerlBioinfo;     # see Chapter 6 about this module

# declare and initialize variables
my @annotation = (  );
my $sequence = '';
my $filename = 'record.gb';

parse1(\@annotation, \$sequence, $filename);

# Print the annotation, and then
#   print the DNA in new format just to check if we got it okay.
print @annotation;

print_sequence($sequence, 50);

exit;

################################################################################
# Subroutine
################################################################################

# parse1
#
# --parse annotation and sequence from GenBank record

sub parse1 {

    my($annotation, $dna, $filename) = @_;

    # $annotation--reference to array
    # $dna       --reference to scalar
    # $filename  --scalar
    
    # declare and initialize variables
    my $in_sequence = 0; 
    my @GenBankFile = (  );
    
    # Get the GenBank data into an array from a file
    @GenBankFile = get_file_data($filename);
    
    # Extract all the sequence lines
    foreach my $line (@GenBankFile) {

        if( $line =~ /^\/\/\n/ ) { # If $line is end-of-record line //\n,
            last; #break out of the foreach loop.
        } elsif( $in_sequence) { # If we know we're in a sequence,
            $$dna .= $line; # add the current line to $$dna.
        } elsif ( $line =~ /^ORIGIN/ ) { # If $line begins a sequence,
            $in_sequence = 1; # set the $in_sequence flag.
        } else{ # Otherwise
            push( @$annotation, $line); # add the current line to @annotation.
        }
    }
    
    # remove whitespace and line numbers from DNA sequence
    $$dna =~ s/[\s0-9]//g;
}

Here's the beginning and end of Example 10-1's output of the sequence data:

agatggcggcgctgaggggtcttgggggctctaggccggccacctactgg
tttgcagcggagacgacgcatggggcctgcgcaataggagtacgctgcct
gggaggcgtgactagaagcggaagtagttgtgggcgcctttgcaaccgcc
tgggacgccgccgagtggtctgtgcaggttcgcgggtcgctggcgggggt
cgtgagggagtgcgccgggagcggagatatggagggagatggttcagacc
...
cggtgcccgtgtgtccgttcctccactcatctgtttctccggttctccct
gtgcccatccaccggttgaccgcccatctgcctttatcagagggactgtc
cccgtcgacatgttcagtgcctggtggggctgcggagtccactcatcctt
gcctcctctccctgggttttgttaataaaattttgaagaaaccaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

The foreach loop in subroutine parse1 in Example 10-1 moves one by one through the lines from the GenBank file stored in the array @GenBankFile. It takes advantage of the structure of a GenBank file, which begins with annotation and runs until the line:

ORIGIN

is found, after which sequence appears until the end-of-record line // is reached. The loop uses a flag variable $in_sequence to remember that it has found the ORIGIN line and is now reading sequence lines.

The foreach loop has a new feature: the Perl built-in function last, which breaks out of the nearest enclosing loop. It's triggered by the end-of-record line //, which is reached when the entire record has been seen.

A regular expression is used to find the end-of-record line. To correctly match the end-of-record (forward) slashes, you must escape them by putting a backslash in front of each one, so that Perl doesn't interpret them as prematurely ending the pattern. The regular expression also ends with a newline \/\/\n, which is then placed inside the usual delimiters: /\/\/\n/. (When you have a lot of forward slashes in a regular expression, you can use another delimiter around the regular expression and precede it with an m, thus avoiding having to backslash the forward slashes. It's done like so: m!//\n!).

An interesting point about subroutine parse1 is the order of the tests in the foreach loop that goes through the lines of the GenBank record. As you read through the lines of the record, you want to first gather the annotation lines, set a flag when the ORIGIN start-of-sequence line is found, and then collect the lines until the end-of-record // line is found.

Notice that the order of the tests is exactly the opposite. First, you test for the end-of-record line, collect the sequence if the $in_sequence flag is set, and then test for the start-of-sequence ORIGIN line. Finally, you collect the annotation.

The technique of reading lines one by one and using flag variables to mark what section of the file you're in, is a common programming technique. So, take a moment to think about how the loop would behave if you changed the order of the tests. If you collected sequence lines before testing for the end-of-record, you'd never get to the end-of-record test!

Other methods of collecting annotation and sequence lines are possible, especially if you go through the lines of the array more than once. You can scan through the array, keeping track of the start-of-sequence and end-of-record line numbers, and then go back and extract the annotation and sequence using an array splice (which was described in the parseREBASE subroutine in Example 9-2). Here's an example:

# find line numbers of ORIGIN and // in the GenBank record

$linenumber = 0;
foreach my $line (@GenBankFile) {
     if ( $line =~ /^//\n/ ) {  # end-of-record // line
         $end = $linenumber;
         last;
     } elsif ( $line =~ /^ORIGIN/ ) { # end annotation, begin sequence
         $origin = $linenumber;
     }
     $linenumber++;
}

# extract annotation and sequence with "array splice"

@annotation = @GenBankFile[0..($origin-1)];
@sequence   = @GenBankFile[($origin+1)..($end-1)];
</programlisting>

10.3.2 Using Scalars

A second way to separate annotations from sequences in GenBank records is to read the entire record into a scalar variable and operate on it with regular expressions. For some kinds of data, this can be a more convenient way to parse the input (compared to scanning through an array, as in Example 10-1).

Usually string data is stored one line per scalar variable with its newlines, if any, at the end of the string. Sometimes, however, you store several lines concatenated together in one string that is, in turn, stored in a single scalar variable. These multiline strings aren't uncommon; you used them to gather the sequence from a FASTA file in Examples Example 6-2 and Example 6-3. Regular expressions have pattern modifiers that can be used to make multiline strings with their embedded newlines easy to use.

10.3.2.1 Pattern modifiers

The pattern modifiers we've used so far are /g, for global matching, and /i, for case-insensitive matching. Let's take a look at two more that affect the way regular expressions interact with the newlines in scalars.

Recall that previous regular expressions have used the caret (^), dot (.), and dollar sign ($) metacharacters. The ^ anchors a regular expression to the beginning of a string, by default, so that /^THE BEGUINE/ matches a string that begins with "THE BEGUINE". Similarly, $ anchors an expression to the end of the string, and the dot (.) matches any character except a newline.

The following pattern modifiers affect these three metacharacters:

  • The /s modifier assumes you want to treat the whole string as a single line, even with embedded newlines, so it makes the dot metacharacter match any character including newlines.

  • The /m modifier assumes you want to treat the whole string as a multiline, with embedded newlines, so it extends the ^ and the $ to match after, or before, a newline, embedded in the string.

10.3.2.2 Examples of pattern modifiers

Here's an example of the default behavior of caret (^), dot (.), and dollar sign ($):

use warnings;
"AAC\nGTT" =~ /^.*$/;
print $&, "\n";

This demonstrates the default behavior without the /m or /s modifiers and prints the warning:

Use of uninitialized value in print statement at line 3.

The print statement tries to print $& , a special variable that is always set to the last successful pattern match. This time, since the pattern doesn't match, the variable $& isn't set, and you get a warning message for attempting to print an uninitialized value.

Why doesn't the match succeed? First, let's examine the ^.*$ pattern. It begins with a ^, which means it must match from the beginning of the string. It ends with a $, which means it must also match at the end of the string (the end of the string may contain a single newline, but no other newlines are allowed). The .* means it must match zero or more (*) of any characters (.) except the newline. So, in other words, the pattern ^.*$ matches any string that doesn't contain a newline except for a possible single newline as the last character. But since the string in question, "ACC\nGTT" does contain an embedded newline \n that isn't the last character, the pattern match fails.

In the next examples, the pattern modifiers /m and /s change the default behaviors for the metacharacters ^, and $, and the dot:

"AAC\nGTT" =~ /^.*$/m;
print $&, "\n";

This snippet prints out AAC and demonstrates the /m modifier. The /m extends the meaning of the ^ and the $ so they also match around embedded newlines. Here, the pattern matches from the beginning of the string up to the first embedded newline.

The next snippet of code demonstrates the /s modifier:

"AAC\nGTT" =~ /^.*$/s;
print $&, "\n";

which produces the output:

AAC
GTT

The /s modifier changes the meaning of the dot metacharacter so that it matches any character including newlines. With the /s modifier, the pattern matches everything from the beginning of the string to the end of the string, including the newline. Notice when it prints, it prints the embedded newline.

10.3.2.3 Separating annotations from sequence

Now that you've met the pattern-matching modifiers and regular expressions that will be your main tools for parsing a GenBank file as a scalar, let's try separating the annotations from the sequence.

The first step is to get the GenBank record stored as a scalar variable. Recall that a GenBank record starts with a line beginning with the word "LOCUS" and ends with the end-of-record separator: a line containing two forward slashes.

First you want to read a GenBank record and store it in a scalar variable. There's a device called an input record separator denoted by the special variable $/ that lets you do exactly that. The input record separator is usually set to a newline, so each call to read a scalar from a filehandle gets one line. Set it to the GenBank end-of-record separator like so:

$/ = "//\n";

A call to read a scalar from a filehandle takes all the data up to the GenBank end-of-record separator. So the line $record = <GBFILE> in Example 10-2 stores the multiline GenBank record into the scalar variable $record. Later, you'll see that you can keep repeating this call in order to read in successive GenBank records from a GenBank library file.

After reading in the record, you'll parse it into the annotation and sequence parts making use of /s and /m pattern modifiers. Extracting the annotation and sequence is the easy part; parsing the annotation will occupy most of the remainder of the chapter.

Example 10-2. Extract annotation and sequence from Genbank record
#!/usr/bin/perl
# Extract the annotation and sequence sections from the first
#   record of a GenBank library

use strict;
use warnings;
use BeginPerlBioinfo;     # see Chapter 6 about this module

# Declare and initialize variables
my $annotation = '';
my $dna = '';
my $record = '';
my $filename = 'record.gb';
my $save_input_separator = $/;

# Open GenBank library file
unless (open(GBFILE, $filename)) {
    print "Cannot open GenBank file \"$filename\"\n\n";
    exit;
}

# Set input separator to "//\n" and read in a record to a scalar
$/ = "//\n";

$record = <GBFILE>;

# reset input separator 
$/ = $save_input_separator;

# Now separate the annotation from the sequence data
($annotation, $dna) = ($record =~ /^(LOCUS.*ORIGIN\s*\n)(.*)\/\/\n/s);

# Print the two pieces, which should give us the same as the
#  original GenBank file, minus the // at the end
print $annotation, $dna;

exit;

The output from this program is the same as the GenBank file listed previously, minus the last line, which is the end-of-record separator //.

Let's focus on the regular expression that parses the annotation and sequence out of the $record variable. This is the most complicated regular expression so far:

$record = /^(LOCUS.*ORIGIN\s*\n)(.*)\/\/\n/s.

There are two pairs of parentheses in the regular expression: (LOCUS.*ORIGIN\s*\n) and (.*). The parentheses are metacharacters whose purpose is to remember the parts of the data that match the pattern within the parentheses, namely, the annotation and the sequence. Also note that the pattern match returns an array whose elements are the matched parenthetical patterns. After you match the annotation and the sequence within the pairs of parentheses in the regular expression, you simply assign the matched patterns to the two variables $annotation and $dna, like so:

($annotation, $dna) = ($record =~ /^(LOCUS.*ORIGIN\s*\n)(.*)\/\/\n/s);

Notice that at the end of the pattern, we've added the /s pattern matching modifier, which, as you've seen earlier, allows a dot to match any character including an embedded newline. (Of course, since we've got a whole GenBank record in the $record scalar, there are a lot of embedded newlines.)

Next, look at the first pair of parentheses:

(LOCUS.*ORIGIN\s*\n)

This whole expression is anchored at the beginning of the string by preceding it with a ^ metacharacter. (/s doesn't change the meaning of the ^ character in a regular expression.)

Inside the parentheses, you match from where the string LOCUS appears at the beginning of the GenBank record, followed by any number of characters including newlines with .*, followed by the string ORIGIN, followed by possibly some whitespace with \s*, followed by a newline \n. This matches the annotation part of the GenBank record.

Now, look at the second parentheses and the remainder:

(.*)\/\/\n

This is easier. The .* matches any character, including newlines because of the /s pattern modifier at the end of the pattern match. The parentheses are followed by the end-of-record line, //, including the newline at the end, with the slashes preceded by backslashes to show that you want to match them exactly. They're not delimiters of the pattern matching operator. The end result is the GenBank record with the annotation and the sequence separated into the variables $annotation and $sequence. Although the regular expression I used requires a bit of explanation, the attractive thing about this approach is that it took only one line of Perl code to extract both annotation and sequence.

< BACKCONTINUE >

Index terms contained in this section

$ (dollar sign)
      $& variables
      $/ (input record separator)
      metacharacter in regular expressions
() (parentheses)
      in regular expressions
. (dot)
      metacharacter in regular expressions
/ (forward slash)
     // (double slash)
            end-of-record separator
      in pattern modifiers
/g (global) pattern modifier
/i (case-insensitive) matching
/m pattern modifier
      program to extract annotation from sequence
/s pattern modifier
      program to extract annotation from sequence
^ (caret)
      beginning of string, anchoring to
      metacharacter in regular expressions
annotations, GenBank files
      separating from DNA sequences
arrays
      GenBank records, using with
case-insensitive matching
DNA
      separating from GenBank file annotations
end-of-record separator (//) 2nd
GenBank (Genetic Sequence Data Bank)
      separating annotations from sequences
            using arrays
            using scalars
global matching (/g)
input record separator ($/)
metacharacters
      parentheses, in regular expressions
modifiers, pattern matching
multiline strings
      /m pattern modifier and
newlines
      in scalars, regular expressions and
pattern modifiers
patterns (and regular expressions)
      GenBank records, operating on
      modifiers 2nd
            /m and /s 2nd
            /s
records, GenBank files
     separating annotation from sequence
            using arrays
            using scalars
scalar variables
      GenBank records, reading as
single line (/s pattern modifier)
strings
     multiline
            using regular expressions on
subroutines
      extracting annotation from sequence, GenBank files
variables
      $&

© 2002, O'Reilly & Associates, Inc.