12.4
Parsing BLAST Output
So why parse BLAST output? One reason
is to see if your DNA has any new matches against the DNA stored in
the constantly growing databases. You can write a program to
automatically perform a daily BLAST search and compare its results
with those of the previous day by parsing the summary list of hits
and comparing it with the previous day's summary list. You can
then have the program email you if something new has turned up.
12.4.1
Extracting Annotation and Alignments
Example 12-1 consists of
a main
program and two new subroutines. The
subroutines—parse_blast
and parse_blast_alignment—use regular
expressions to extract the various bits of data from a scalar string.
I chose this method because the data, although structured, does not
clearly identify each line with its function. (See the discussion in Chapter 10 and Chapter 11.)
Example 12-1. Extract annotation and alignments from BLAST output file
#!/usr/bin/perl
# Extract annotation and alignments from BLAST output file
use strict;
use warnings;
use BeginPerlBioinfo; # see Chapter 6 about this module
# declare and initialize variables
my $beginning_annotation = '';
my $ending_annotation = '';
my %alignments = ( );
my $filename = 'blast.txt';
parse_blast(\$beginning_annotation, \$ending_annotation, \%alignments, $filename);
# Print the annotation, and then
# print the DNA in new format just to check if we got it okay.
print $beginning_annotation;
foreach my $key (keys %alignments) {
print "$key\nXXXXXXXXXXXX\n", $alignments{$key}, "\nXXXXXXXXXXX\n";
}
print $ending_annotation;
exit;
################################################################################
# Subroutines for Example 12-1
################################################################################
# parse_blast
#
# --parse beginning and ending annotation, and alignments,
# from BLAST output file
sub parse_blast {
my($beginning_annotation, $ending_annotation, $alignments, $filename) = @_;
# $beginning_annotation--reference to scalar
# $ending_annotation --reference to scalar
# $alignments --reference to hash
# $filename --scalar
# declare and initialize variables
my $blast_output_file = '';
my $alignment_section = '';
# Get the BLAST program output into an array from a file
$blast_output_file = join( '', get_file_data($filename));
# Extract the beginning annotation, alignments, and ending annotation
($$beginning_annotation, $alignment_section, $$ending_annotation)
= ($blast_output_file =~ /(.*^ALIGNMENTS\n)(.*)(^ Database:.*)/ms);
# Populate %alignments hash
# key = ID of hit
# value = alignment section
%$alignments = parse_blast_alignment($alignment_section);
}
# parse_blast_alignment
#
# --parse the alignments from a BLAST output file,
# return hash with
# key = ID
# value = text of alignment
sub parse_blast_alignment {
my($alignment_section) = @_;
# declare and initialize variables
my(%alignment_hash) = ( );
# loop through the scalar containing the BLAST alignments,
# extracting the ID and the alignment and storing in a hash
#
# The regular expression matches a line beginning with >,
# and containing the ID between the first pair of | characters;
# followed by any number of lines that don't begin with >
while($alignment_section =~ /^>.*\n(^(?!>).*\n)+/gm) {
my($value) = $&;
my($key) = (split(/\|/, $value)) [1];
$alignment_hash{$key} = $value;
}
return %alignment_hash;
}
The main program does little more than call the parsing subroutine
and print the results. The arguments, initialized as empty, are
passed by reference (see Chapter 6).
The subroutine parse_blast does the top-level
parsing job of separating the three sections of a BLAST output file:
the annotation at the beginning, the alignments in the middle, and
the annotation at the end. It then calls the
parse_blast_alignment subroutine to extract the
individual alignments from that middle alignment section. The data is
first read in from the named file with our old friend the
get_file_data subroutine from Chapter 8. Use the join function to
store the array of file data into a scalar string.
The three sections of the BLAST output file are separated by the
following statement:
($$beginning_annotation, $alignment_section, $$ending_annotation)
= ($blast_output_file =~ /(.*^ALIGNMENTS\n)(.*)(^ Database:.*)/ms);
The pattern match contains three parenthesized expressions:
(.*^ALIGNMENTS\n)
which is returned into $$beginning_annotation;
(.*)
which is saved in $alignment_section; and:
(^ Database:.*)
which is saved in $$ending_annotation.
The use of $$ instead of $ at
the beginning of two of these variables indicates that they are
references to scalar variables. Recall that they were passed in as
arguments to the subroutine, where they were preceded by a backslash,
like so:
parse_blast(\$beginning_annotation, \$ending_annotation, \%alignments, $filename);
You've seen references to variables before, starting in Chapter 6. Let's review them briefly. Within the
parse_blast subroutine, those variables with only
one $ are references to the scalar variables. They
need an additional $ to represent actual scalar
variables. This is how references work; they need an additional
special character to indicate what kinds of variables they are
references to. So a reference to a scalar variable needs to start
with $$, a reference to an array variable needs to
start with @$, and a reference to a hash variable
needs to start with %$.
The regular expression in the previous code snippet matches
everything up to the word ALIGNMENTS at the end of
a line (.*^ALIGNMENTS\n); then everything for a
while (.*); then a line that begins with two
spaces and the word Database: followed by the rest
of the file (^ Database:.*).
These three expressions in parentheses correspond to the three
desired parts of the BLAST output file; the beginning annotation, the
alignment section, and the ending annotation.
The alignments saved in $alignment_section are
separated out by the subroutine
parse_blast_alignment. This subroutine has one
important loop:
while($alignment_section =~ /^>.*\n(^(?!>).*\n)+/gm) {
my($value) = $&;
my($key) = (split(/\|/, $value)) [1];
$alignment_hash{$key} = $value;
}
You're probably thinking that this regular expression looks
downright evil. At first glance, regular expressions do sometimes
seem incomprehensible, so let's take a closer look. There are a
few new things to examine.
The five lines comprise a while loop, which (due
to the global /g modifier on the pattern match in
the while loop) keeps matching the pattern as many
times as it appears in the string. Each time the program cycles
through the loop, the pattern match finds the value (the entire
alignment), then determines the key. The key and values are saved in
the hash %alignment_hash.
Here's an example of one of the matches that's found by
this while loop when parsing the BLAST output
shown in Section 12.3:
>emb|AL353748.13|AL353748 Human DNA sequence from clone RP11-317B17 on
chromosome 9, complete
sequence [Homo sapiens]
Length = 179155
Score = 38.2 bits (19), Expect = 4.1
Identities = 22/23 (95%)
Strand = Plus / Plus
Query: 192 ggcgggggtcgtgagggagtgcg 214
|||| ||||||||||||||||||
Sbjct: 48258 ggcgtgggtcgtgagggagtgcg 48280
This text starts with a line beginning with a >
character. In the complete BLAST output, sections like these follow
one another. What you want to do is start matching from a line
beginning with > and include all following
adjacent lines that don't start with a >
character. You also want to extract the identifier, which appears
between the first and second vertical bar | characters on the first
line (e.g., AL353748.13 in this alignment).
Let's dissect the regular expression:
$alignment_section =~ /^>.*\n(^(?!>).*\n)+/gm
This pattern match, which appears in a while loop
within the code, has the modifier m for multiline.
The m modifier allows ^ to
match any beginning-of-line inside the multiline string, and
$ to match any end-of-line.
The regular expression breaks down as follows. The first part is:
^>.*\n
It looks for > at the beginning of the BLAST
output line, followed by .*, which matches any
quantity of anything (except newlines), up to the first newline. In
other words, it matches the first line of the alignment.
Here's the rest of the regular expression:
(^(?!>).*\n)+
After the ^ which matches the beginning of the
line, you'll see a negative lookahead
assertion, (?!>), which ensures that
a > doesn't follow. Next, the
.* matches all non-newline characters, up to the
final \n at the end of the line. All of that is
wrapped in parentheses with a surrounding +, so
that you match all the available lines.
Now that you've matched the entire alignment, you want to
extract the key and populate the hash with your key and value. Within
the while loop, the alignment that you just
matched is automatically set by Perl as the value of the special
variable $& and saved in the variable
$value. Now you need to extract your key from the
alignment. It can be found on the first line of the alignment stored
in $value, between the first and second
| symbols.
Extracting this identifying key is done using the
split function, which breaks the string into an
array. The call to split:
split(/\|/, $value)
splits $value into pieces delimited by
| characters. That is, the |
symbol is used to determine where one list element ends and the next
one begins. (Remember that the vertical bar | is a
metacharacter and must be escaped as \|.) By
surrounding the call to split with parentheses and adding an array
offset ([1]), you can isolate the key and save it
into $key.
Let's step back now and look at Example 12-1 in
its entirety. Notice that it's very short—barely more
than two pages, including comments. Although it's not an easy
program, due to the complexity of the regular expressions involved,
you can make sense of it if you put a little effort into examining
the BLAST output files and the regular expressions that parse it.
Regular expressions have lots of complex features, but as a result,
they can do lots of useful things. As a Perl programmer, the effort
you put into learning them is well worth it and can have
significant payoffs down the road.
12.4.2
Parsing BLAST Alignments
Let's take the
parsing of
the BLAST output file a little further. Notice that some of the
alignments include more than one aligned string—for instance,
the alignment for ID AK017941.1, shown again here:
>dbj|AK017941.1|AK017941 Mus musculus adult male thymus cDNA, RIKEN
full-length enriched
library, clone:5830420C16, full insert sequence
Length = 1461
Score = 210 bits (106), Expect = 5e-52
Identities = 151/166 (90%)
Strand = Plus / Plus
Query: 235 ggagatggttcagacccagagcctccagatgccggggaggacagcaagtccgagaatggg 294
|||||||| ||||||| || ||||| ||||||||||| ||||||||||| |||||||||
Sbjct: 1048 ggagatggctcagacctggaacctccggatgccggggacgacagcaagtctgagaatggg 1107
Query: 295 gagaatgcgcccatctactgcatctgccgcaaaccggacatcaactgcttcatgatcggg 354
||||| || ||||||||||||||||| ||||||||||||||||| ||||||||||| ||
Sbjct: 1108 gagaacgctcccatctactgcatctgtcgcaaaccggacatcaattgcttcatgattgga 1167
Query: 355 tgtgacaactgcaatgagtggttccatggggactgcatccggatca 400
|||||||||||||| |||||||||||||| ||||||||||||||||
Sbjct: 1168 tgtgacaactgcaacgagtggttccatggagactgcatccggatca 1213
Score = 44.1 bits (22), Expect = 0.066
Identities = 25/26 (96%)
Strand = Plus / Plus
Query: 118 gcggaagtagttgtgggcgcctttgc 143
||||||||||||| ||||||||||||
Sbjct: 235 gcggaagtagttgcgggcgcctttgc 260
To parse these alignments, we have to parse out each of the matched
strings, which in BLAST terminology are called
high-scoring pairs (HSPs).
Each HSP also contains some annotation, and
then the HSP itself. Let's parse each HSP into annotation,
query string, and subject string, together with the starting and
ending positions of the strings. More parsing is possible; you can
extract specific features of the annotation, as well as the locations
of identical and nonidentical bases in the HSP, for instance.
Example 12-2 includes a pair of subroutines; one to
parse the alignments into their HSPs, and the second to extract the
sequences and their end positions. The main program extends Example 12-1 using these new subroutines.
Example 12-2. Parse alignments from BLAST output file
#!/usr/bin/perl
# Parse alignments from BLAST output file
use strict;
use warnings;
use BeginPerlBioinfo; # see Chapter 6 about this module
# declare and initialize variables
my $beginning_annotation = '';
my $ending_annotation = '';
my %alignments = ( );
my $alignment = '';
my $filename = 'blast.txt';
my @HSPs = ( );
my($expect, $query, $query_range, $subject, $subject_range) = ('','','','','');
parse_blast(\$beginning_annotation, \$ending_annotation, \%alignments, $filename);
$alignment = $alignments{'AK017941.1'};
@HSPs = parse_blast_alignment_HSP($alignment);
($expect, $query, $query_range, $subject, $subject_range)
= extract_HSP_information($HSPs[1]);
# Print the results
print "\n-> Expect value: $expect\n";
print "\n-> Query string: $query\n";
print "\n-> Query range: $query_range\n";
print "\n-> Subject String: $subject\n";
print "\n-> Subject range: $subject_range\n";
exit;
################################################################################
# Subroutines for Example 12-2
################################################################################
# parse_blast_alignment_HSP
#
# --parse beginning annotation, and HSPs,
# from BLAST alignment
# Return an array with first element set to the beginning annotation,
# and each successive element set to an HSP
sub parse_blast_alignment_HSP {
my($alignment ) = @_;
# declare and initialize variables
my $beginning_annotation = '';
my $HSP_section = '';
my @HSPs = ( );
# Extract the beginning annotation and HSPs
($beginning_annotation, $HSP_section )
= ($alignment =~ /(.*?)(^ Score =.*)/ms);
# Store the $beginning_annotation as the first entry in @HSPs
push(@HSPs, $beginning_annotation);
# Parse the HSPs, store each HSP as an element in @HSPs
while($HSP_section =~ /(^ Score =.*\n)(^(?! Score =).*\n)+/gm) {
push(@HSPs, $&);
}
# Return an array with first element = the beginning annotation,
# and each successive element = an HSP
return(@HSPs);
}
# extract_HSP_information
#
# --parse a HSP from a BLAST output alignment section
# - return array with elements:
# Expect value
# Query string
# Query range
# Subject string
# Subject range
sub extract_HSP_information {
my($HSP) = @_;
# declare and initialize variables
my($expect) = '';
my($query) = '';
my($query_range) = '';
my($subject) = '';
my($subject_range) = '';
($expect) = ($HSP =~ /Expect = (\S+)/);
$query = join ( '' , ($HSP =~ /^Query.*\n/gm) );
$subject = join ( '' , ($HSP =~ /^Sbjct.*\n/gm) );
$query_range = join('..', ($query =~ /(\d+).*\D(\d+)/s));
$subject_range = join('..', ($subject =~ /(\d+).*\D(\d+)/s));
$query =~ s/[^acgt]//g;
$subject =~ s/[^acgt]//g;
return ($expect, $query, $query_range, $subject, $subject_range);
}
Example 12-2 gives the following output:
-> Expect value: 5e-52
-> Query string: ggagatggttcagacccagagcctccagatgccggggaggacagcaagtccgagaatggg
gagaatgcgcccatctactgcatctgccgcaaaccggacatcaactgcttcatgatcgggtgtgacaactgcaatgagt
ggttccatggggactgcatccggatca
-> Query range: 235..400
-> Subject String: ctggagatggctcagacctggaacctccggatgccggggacgacagcaagtctgagaatg
ggctgagaacgctcccatctactgcatctgtcgcaaaccggacatcaattgcttcatgattggacttgtgacaactgca
acgagtggttccatggagactgcatccggatca
-> Subject range: 1048..1213
Let's discuss the new features of Example 12-2
and its subroutines. First notice that the two new subroutines from
Example 12-1 have been placed into the
BeginPerlBioinfo.pm module, so they aren't
printed again here.
The main program, Example 12-2, starts the same as
Example 12-1; it calls the
parse_blast
subroutine to separate the annotation from the alignments in the
BLAST output file.
The next line fetches one of the alignments from the
%alignments hash, which is then used as
the argument to the
parse_blast_alignment_HSP
subroutine, which then returns an array of annotation (as the first
element) and HSPs in @HSPs. Here you see that not
only can a subroutine return an array on a scalar value; it can also
return a hash.
Finally, Example 12-2 does the lower-level parsing of
an individual HSP by calling the
extract_HSP_information
subroutine, and the extracted parts of one of the HSPs are printed.
Example 12-2 shows a certain inconsistency in our
design. Some subroutines call their arguments by reference; others call them
by value (see Chapter 6). You may ask: is this a bad thing?
The answer is: not necessarily. The subroutine
parse_blast mixes several arguments, and one of
them is not a scalar type. Recall that this is a potentially good
place to use
call-by-reference in Perl. The other
subroutines don't mix argument types this way. However, they
can be designed to call their arguments by reference.
Continuing with the code, let's examine the subroutine
parse_blast_alignment_HSP
.
This takes one of the alignments from the BLAST output and separates
out the individual HSP string matches. The technique used is, once
again, regular expressions operating on a single string that contains
all the lines of the alignment given as the input argument.
The first regular expression parses out the annotation and the
section containing the HSPs:
($beginning_annotation, $HSP_section )
= ($alignment =~ /(.*?)(^ Score =.*)/ms);
The first parentheses in the regular expression is
(.*?)
This is the nongreedy or minimal
matching mentioned in Chapter 9. Here it gobbles
up everything before the first line that begins
Score = (without the
? after the *, it would gobble
everything until the final line that begins Score
=). This is the exact dividing line between the
beginning annotation and the HSP string matches.
The next loop and regular expression separates the individual HSP
string matches:
while($HSP_section =~ /(^ Score =.*\n)(^(?! Score =).*\n)+/gm) {
push(@HSPs, $&);
}
This is the same kind of global string match in a
while loop you've seen before; it keeps
iterating as long as the match can be found. The other modifier
/m
is the multiline modifier, which enables the metacharacters
$ and
^ to match before and after embedded newlines.
The expression within the first pair of
parentheses—(^ Score
=.*\n)—matches a line that begins
Score =, which is the kind of line that begins an
HSP string match section.
The code within the second pair of
parentheses—(^(?! Score
=).*\n)+—matches one or more (the
+ following the other parentheses) lines that do
not begin with Score =. The
?! at the beginning of the embedded parentheses is
the negative lookahead assertion you encountered in Example 12-1. So, in total, the regular expression captures
a line beginning with Score =
and all succeeding adjacent lines
that don't begin with Score
=.