< BACKCONTINUE >

12.4 Parsing BLAST Output

So why parse BLAST output? One reason is to see if your DNA has any new matches against the DNA stored in the constantly growing databases. You can write a program to automatically perform a daily BLAST search and compare its results with those of the previous day by parsing the summary list of hits and comparing it with the previous day's summary list. You can then have the program email you if something new has turned up.

12.4.1 Extracting Annotation and Alignments

Example 12-1 consists of a main program and two new subroutines. The subroutines—parse_blast and parse_blast_alignment—use regular expressions to extract the various bits of data from a scalar string. I chose this method because the data, although structured, does not clearly identify each line with its function. (See the discussion in Chapter 10 and Chapter 11.)

Example 12-1. Extract annotation and alignments from BLAST output file
#!/usr/bin/perl
# Extract annotation and alignments from BLAST output file

use strict;
use warnings;
use BeginPerlBioinfo;     # see Chapter 6 about this module

# declare and initialize variables
my $beginning_annotation = '';
my $ending_annotation = '';
my %alignments = (  );
my $filename = 'blast.txt';

parse_blast(\$beginning_annotation, \$ending_annotation, \%alignments, $filename);

# Print the annotation, and then
#   print the DNA in new format just to check if we got it okay.
print $beginning_annotation;

foreach my $key (keys %alignments) {
    print "$key\nXXXXXXXXXXXX\n", $alignments{$key}, "\nXXXXXXXXXXX\n";
}

print $ending_annotation;

exit;

################################################################################
# Subroutines for Example 12-1
################################################################################

# parse_blast
#
# --parse beginning and ending annotation, and alignments,
#     from BLAST output file

sub parse_blast {

    my($beginning_annotation, $ending_annotation, $alignments, $filename) = @_;

    # $beginning_annotation--reference to scalar
    # $ending_annotation   --reference to scalar
    # $alignments          --reference to hash
    # $filename            --scalar
    
    # declare and initialize variables
    my $blast_output_file = '';
    my $alignment_section = '';
    
    # Get the BLAST program output into an array from a file
    $blast_output_file = join( '', get_file_data($filename));

    # Extract the beginning annotation, alignments, and ending annotation
    ($$beginning_annotation, $alignment_section, $$ending_annotation)
    = ($blast_output_file =~ /(.*^ALIGNMENTS\n)(.*)(^  Database:.*)/ms);
    
    # Populate %alignments hash
    # key = ID of hit
    # value = alignment section
    %$alignments = parse_blast_alignment($alignment_section);
}

# parse_blast_alignment
#
# --parse the alignments from a BLAST output file,
#       return hash with
#       key = ID
#       value = text of alignment

sub parse_blast_alignment {

    my($alignment_section) = @_;
    
    # declare and initialize variables
    my(%alignment_hash) = (  );

    # loop through the scalar containing the BLAST alignments,
    # extracting the ID and the alignment and storing in a hash
    #
    # The regular expression matches a line beginning with >,
    # and containing the ID between the first pair of | characters;
    # followed by any number of lines that don't begin with >

    while($alignment_section =~ /^>.*\n(^(?!>).*\n)+/gm) {
        my($value) = $&;
        my($key) = (split(/\|/, $value)) [1];
        $alignment_hash{$key} = $value;
    }

    return %alignment_hash;
}

The main program does little more than call the parsing subroutine and print the results. The arguments, initialized as empty, are passed by reference (see Chapter 6).

The subroutine parse_blast does the top-level parsing job of separating the three sections of a BLAST output file: the annotation at the beginning, the alignments in the middle, and the annotation at the end. It then calls the parse_blast_alignment subroutine to extract the individual alignments from that middle alignment section. The data is first read in from the named file with our old friend the get_file_data subroutine from Chapter 8. Use the join function to store the array of file data into a scalar string.

The three sections of the BLAST output file are separated by the following statement:

($$beginning_annotation, $alignment_section, $$ending_annotation)

    = ($blast_output_file =~ /(.*^ALIGNMENTS\n)(.*)(^  Database:.*)/ms);

The pattern match contains three parenthesized expressions:

(.*^ALIGNMENTS\n) 

which is returned into $$beginning_annotation;

(.*) 

which is saved in $alignment_section; and:

(^  Database:.*) 

which is saved in $$ending_annotation.

The use of $$ instead of $ at the beginning of two of these variables indicates that they are references to scalar variables. Recall that they were passed in as arguments to the subroutine, where they were preceded by a backslash, like so:

parse_blast(\$beginning_annotation, \$ending_annotation, \%alignments, $filename);

You've seen references to variables before, starting in Chapter 6. Let's review them briefly. Within the parse_blast subroutine, those variables with only one $ are references to the scalar variables. They need an additional $ to represent actual scalar variables. This is how references work; they need an additional special character to indicate what kinds of variables they are references to. So a reference to a scalar variable needs to start with $$, a reference to an array variable needs to start with @$, and a reference to a hash variable needs to start with %$.

The regular expression in the previous code snippet matches everything up to the word ALIGNMENTS at the end of a line (.*^ALIGNMENTS\n); then everything for a while (.*); then a line that begins with two spaces and the word Database: followed by the rest of the file (^ Database:.*). These three expressions in parentheses correspond to the three desired parts of the BLAST output file; the beginning annotation, the alignment section, and the ending annotation.

The alignments saved in $alignment_section are separated out by the subroutine parse_blast_alignment. This subroutine has one important loop:

    while($alignment_section =~ /^>.*\n(^(?!>).*\n)+/gm) {
        my($value) = $&;
        my($key) = (split(/\|/, $value)) [1];
        $alignment_hash{$key} = $value;
    }

You're probably thinking that this regular expression looks downright evil. At first glance, regular expressions do sometimes seem incomprehensible, so let's take a closer look. There are a few new things to examine.

The five lines comprise a while loop, which (due to the global /g modifier on the pattern match in the while loop) keeps matching the pattern as many times as it appears in the string. Each time the program cycles through the loop, the pattern match finds the value (the entire alignment), then determines the key. The key and values are saved in the hash %alignment_hash.

Here's an example of one of the matches that's found by this while loop when parsing the BLAST output shown in Section 12.3:

>emb|AL353748.13|AL353748 Human DNA sequence from clone RP11-317B17 on
chromosome 9, complete
             sequence [Homo sapiens]
          Length = 179155

 Score = 38.2 bits (19), Expect = 4.1
 Identities = 22/23 (95%)
 Strand = Plus / Plus

Query: 192   ggcgggggtcgtgagggagtgcg 214
             |||| ||||||||||||||||||
Sbjct: 48258 ggcgtgggtcgtgagggagtgcg 48280

This text starts with a line beginning with a > character. In the complete BLAST output, sections like these follow one another. What you want to do is start matching from a line beginning with > and include all following adjacent lines that don't start with a > character. You also want to extract the identifier, which appears between the first and second vertical bar | characters on the first line (e.g., AL353748.13 in this alignment).

Let's dissect the regular expression:

$alignment_section =~ /^>.*\n(^(?!>).*\n)+/gm

This pattern match, which appears in a while loop within the code, has the modifier m for multiline. The m modifier allows ^ to match any beginning-of-line inside the multiline string, and $ to match any end-of-line.

The regular expression breaks down as follows. The first part is:

^>.*\n

It looks for > at the beginning of the BLAST output line, followed by .*, which matches any quantity of anything (except newlines), up to the first newline. In other words, it matches the first line of the alignment.

Here's the rest of the regular expression:

(^(?!>).*\n)+

After the ^ which matches the beginning of the line, you'll see a negative lookahead assertion, (?!>), which ensures that a > doesn't follow. Next, the .* matches all non-newline characters, up to the final \n at the end of the line. All of that is wrapped in parentheses with a surrounding +, so that you match all the available lines.

Now that you've matched the entire alignment, you want to extract the key and populate the hash with your key and value. Within the while loop, the alignment that you just matched is automatically set by Perl as the value of the special variable $& and saved in the variable $value. Now you need to extract your key from the alignment. It can be found on the first line of the alignment stored in $value, between the first and second | symbols.

Extracting this identifying key is done using the split function, which breaks the string into an array. The call to split:

split(/\|/, $value)

splits $value into pieces delimited by | characters. That is, the | symbol is used to determine where one list element ends and the next one begins. (Remember that the vertical bar | is a metacharacter and must be escaped as \|.) By surrounding the call to split with parentheses and adding an array offset ([1]), you can isolate the key and save it into $key.

Let's step back now and look at Example 12-1 in its entirety. Notice that it's very short—barely more than two pages, including comments. Although it's not an easy program, due to the complexity of the regular expressions involved, you can make sense of it if you put a little effort into examining the BLAST output files and the regular expressions that parse it.

Regular expressions have lots of complex features, but as a result, they can do lots of useful things. As a Perl programmer, the effort you put into learning them is well worth it and can have significant payoffs down the road.

12.4.2 Parsing BLAST Alignments

Let's take the parsing of the BLAST output file a little further. Notice that some of the alignments include more than one aligned string—for instance, the alignment for ID AK017941.1, shown again here:

>dbj|AK017941.1|AK017941 Mus musculus adult male thymus cDNA, RIKEN
full-length enriched
            library, clone:5830420C16, full insert sequence
          Length = 1461

 Score =  210 bits (106), Expect = 5e-52
 Identities = 151/166 (90%)
 Strand = Plus / Plus

Query: 235  ggagatggttcagacccagagcctccagatgccggggaggacagcaagtccgagaatggg 294
            |||||||| |||||||  || ||||| ||||||||||| ||||||||||| |||||||||
Sbjct: 1048 ggagatggctcagacctggaacctccggatgccggggacgacagcaagtctgagaatggg 1107

Query: 295  gagaatgcgcccatctactgcatctgccgcaaaccggacatcaactgcttcatgatcggg 354
            ||||| || ||||||||||||||||| ||||||||||||||||| ||||||||||| ||
Sbjct: 1108 gagaacgctcccatctactgcatctgtcgcaaaccggacatcaattgcttcatgattgga 1167

Query: 355  tgtgacaactgcaatgagtggttccatggggactgcatccggatca 400
            |||||||||||||| |||||||||||||| ||||||||||||||||
Sbjct: 1168 tgtgacaactgcaacgagtggttccatggagactgcatccggatca 1213

 Score = 44.1 bits (22), Expect = 0.066
 Identities = 25/26 (96%)
 Strand = Plus / Plus

Query: 118 gcggaagtagttgtgggcgcctttgc 143
           ||||||||||||| ||||||||||||
Sbjct: 235 gcggaagtagttgcgggcgcctttgc 260

To parse these alignments, we have to parse out each of the matched strings, which in BLAST terminology are called high-scoring pairs (HSPs).

Each HSP also contains some annotation, and then the HSP itself. Let's parse each HSP into annotation, query string, and subject string, together with the starting and ending positions of the strings. More parsing is possible; you can extract specific features of the annotation, as well as the locations of identical and nonidentical bases in the HSP, for instance.

Example 12-2 includes a pair of subroutines; one to parse the alignments into their HSPs, and the second to extract the sequences and their end positions. The main program extends Example 12-1 using these new subroutines.

Example 12-2. Parse alignments from BLAST output file
#!/usr/bin/perl
# Parse alignments from BLAST output file

use strict;
use warnings;
use BeginPerlBioinfo;     # see Chapter 6 about this module

# declare and initialize variables
my $beginning_annotation = '';
my $ending_annotation = '';
my %alignments = (  );
my $alignment = '';
my $filename = 'blast.txt';
my @HSPs = (  );
my($expect, $query, $query_range, $subject, $subject_range) = ('','','','','');

parse_blast(\$beginning_annotation, \$ending_annotation, \%alignments, $filename);

$alignment = $alignments{'AK017941.1'};

@HSPs = parse_blast_alignment_HSP($alignment);

($expect, $query, $query_range, $subject, $subject_range)
 = extract_HSP_information($HSPs[1]);

# Print the results
print "\n-> Expect value:   $expect\n";
print "\n-> Query string:   $query\n";
print "\n-> Query range:    $query_range\n";
print "\n-> Subject String: $subject\n";
print "\n-> Subject range:  $subject_range\n";

exit;

################################################################################
# Subroutines for Example 12-2
################################################################################

# parse_blast_alignment_HSP
#
# --parse beginning annotation, and HSPs,
#     from BLAST alignment
#     Return an array with first element set to the beginning annotation,
#    and each successive element set to an HSP

sub parse_blast_alignment_HSP {

    my($alignment ) = @_;

    # declare and initialize variables
    my $beginning_annotation  = '';
    my $HSP_section  = '';
    my @HSPs = (  );
    
    # Extract the beginning annotation and HSPs
    ($beginning_annotation, $HSP_section )
        = ($alignment =~ /(.*?)(^ Score =.*)/ms);

    # Store the $beginning_annotation as the first entry in @HSPs
    push(@HSPs, $beginning_annotation);

    # Parse the HSPs, store each HSP as an element in @HSPs
    while($HSP_section =~ /(^ Score =.*\n)(^(?! Score =).*\n)+/gm) {
        push(@HSPs, $&);
    }

    # Return an array with first element = the beginning annotation,
    # and each successive element = an HSP
    return(@HSPs);
}

# extract_HSP_information
#
# --parse a HSP from a BLAST output alignment section
#        - return array with elements:
#    Expect value
#    Query string
#    Query range 
#    Subject string
#    Subject range

sub extract_HSP_information {

    my($HSP) = @_;
    
    # declare and initialize variables
    my($expect) = '';
    my($query) = '';
    my($query_range) = '';
    my($subject) = '';
    my($subject_range) = '';

    ($expect) = ($HSP =~ /Expect = (\S+)/);

    $query = join ( '' , ($HSP =~ /^Query.*\n/gm) );

    $subject = join ( '' , ($HSP =~ /^Sbjct.*\n/gm) );

    $query_range = join('..', ($query =~ /(\d+).*\D(\d+)/s));

    $subject_range = join('..', ($subject =~ /(\d+).*\D(\d+)/s));

    $query =~ s/[^acgt]//g;

    $subject =~ s/[^acgt]//g;

    return ($expect, $query, $query_range, $subject, $subject_range);
}

Example 12-2 gives the following output:

-> Expect value:   5e-52

-> Query string:   ggagatggttcagacccagagcctccagatgccggggaggacagcaagtccgagaatggg
gagaatgcgcccatctactgcatctgccgcaaaccggacatcaactgcttcatgatcgggtgtgacaactgcaatgagt
ggttccatggggactgcatccggatca

-> Query range:    235..400

-> Subject String: ctggagatggctcagacctggaacctccggatgccggggacgacagcaagtctgagaatg
ggctgagaacgctcccatctactgcatctgtcgcaaaccggacatcaattgcttcatgattggacttgtgacaactgca
acgagtggttccatggagactgcatccggatca

-> Subject range:  1048..1213

Let's discuss the new features of Example 12-2 and its subroutines. First notice that the two new subroutines from Example 12-1 have been placed into the BeginPerlBioinfo.pm module, so they aren't printed again here.

The main program, Example 12-2, starts the same as Example 12-1; it calls the parse_blast subroutine to separate the annotation from the alignments in the BLAST output file.

The next line fetches one of the alignments from the %alignments hash, which is then used as the argument to the parse_blast_alignment_HSP subroutine, which then returns an array of annotation (as the first element) and HSPs in @HSPs. Here you see that not only can a subroutine return an array on a scalar value; it can also return a hash.

Finally, Example 12-2 does the lower-level parsing of an individual HSP by calling the extract_HSP_information subroutine, and the extracted parts of one of the HSPs are printed.

Example 12-2 shows a certain inconsistency in our design. Some subroutines call their arguments by reference; others call them by value (see Chapter 6). You may ask: is this a bad thing?

The answer is: not necessarily. The subroutine parse_blast mixes several arguments, and one of them is not a scalar type. Recall that this is a potentially good place to use call-by-reference in Perl. The other subroutines don't mix argument types this way. However, they can be designed to call their arguments by reference.

Continuing with the code, let's examine the subroutine parse_blast_alignment_HSP . This takes one of the alignments from the BLAST output and separates out the individual HSP string matches. The technique used is, once again, regular expressions operating on a single string that contains all the lines of the alignment given as the input argument.

The first regular expression parses out the annotation and the section containing the HSPs:

($beginning_annotation, $HSP_section )

= ($alignment =~ /(.*?)(^ Score =.*)/ms);

The first parentheses in the regular expression is (.*?) This is the nongreedy or minimal matching mentioned in Chapter 9. Here it gobbles up everything before the first line that begins Score = (without the ? after the *, it would gobble everything until the final line that begins Score =). This is the exact dividing line between the beginning annotation and the HSP string matches.

The next loop and regular expression separates the individual HSP string matches:

while($HSP_section =~ /(^ Score =.*\n)(^(?! Score =).*\n)+/gm) {

    push(@HSPs, $&);
}

This is the same kind of global string match in a while loop you've seen before; it keeps iterating as long as the match can be found. The other modifier /m is the multiline modifier, which enables the metacharacters $ and ^ to match before and after embedded newlines.

The expression within the first pair of parentheses—(^ Score =.*\n)—matches a line that begins Score =, which is the kind of line that begins an HSP string match section.

The code within the second pair of parentheses—(^(?! Score =).*\n)+—matches one or more (the + following the other parentheses) lines that do not begin with Score =. The ?! at the beginning of the embedded parentheses is the negative lookahead assertion you encountered in Example 12-1. So, in total, the regular expression captures a line beginning with Score = and all succeeding adjacent lines that don't begin with Score =.

< BACKCONTINUE >

Index terms contained in this section

$ (dollar sign)
      before embedded newlines
* (asterisk)
      *? quantifier
/g (global) pattern modifier
/m pattern modifier
^ (caret)
      end of line assertion (in matching)
alignments, BLAST
      extracting
      parsing
annotations, BLAST files
      extracting
      HSPs
BLAST (Basic Local Alignment Search Tool)
     output files
            parsing
by reference
      subroutines, calling arguments by
by value
      subroutines, calling arguments by
calling
      subroutine arguments by reference
extract_HSP_information subroutine
files
     BLAST output
            parsing
global matching (/g)
hashes
      %alignments (BLAST program)
high-scoring pairs (HSPs), BLAST program
multiline strings
      /m pattern modifier and
newlines
      strings containing, matches against
nongreedy or minimal matching
output
     BLAST files
            parsing
parse_blast subroutine 2nd
parse_blast_alignment subroutine
parse_blast_alignment_HSP subroutine 2nd
parsing
      BLAST output files
            alignments
patterns (and regular expressions)
      extracting annotations and alignments (BLAST files)
      minimal matching
     modifiers
            /m
references
      calling by
subroutines
      call by reference
      extract_HSP_information
      parse_blast 2nd
      parse_blast_alignment
      parse_blast_alignment_HSP 2nd
while loops
      global "/g" string match in

© 2002, O'Reilly & Associates, Inc.