12.2 String Matching and Homology
String matching is the computer-science term for algorithms that find
one string embedded in another. It has a fairly long and fruitful
history, and many string-matching algorithms have been developed using
a variety of techniques and for different cases. (See the Gusfield book
in Appendix A for an excellent treatment with a biological emphasis.)
We've already done a fair amount of string matching, using the binding
operator to search for motifs and other text with regular expressions.
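As a reminder of what this kind of matching looks like, here is a minimal sketch written in Python with its standard re module. The helper name find_motif and the DNA string are made up for illustration; the lookahead is one possible way to report overlapping matches too.

```python
import re

def find_motif(motif, sequence):
    """Return the 0-based start position of every match of motif in sequence.

    motif is treated as a regular expression. Wrapping it in a lookahead
    makes each match zero-width, so overlapping occurrences are found too.
    """
    pattern = re.compile(f"(?={motif})")
    return [m.start() for m in pattern.finditer(sequence)]

# Made-up sequence, for illustration only.
dna = "ACGTTGACTTACGGACGT"

print(find_motif("ACGT", dna))     # [0, 14]
print(find_motif("AC[GT]T", dna))  # [0, 6, 14]
```

The first call is a plain substring search; the second uses a character class, so it also finds the ACTT at position 6.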
BLAST is basically a string-matching program. Details of the
string-matching algorithms, and of the algorithms used in BLAST in
particular, are beyond the scope of this book. Before parsing BLAST
output, however, I want to define some terms that are frequently
confused or used interchangeably, and briefly introduce the BLAST
statistics.
Biological string matching looks for similarity as an indication of
homology. Similarity between the query and the sequences in the
database may be measured by the percent identity, that is, the
percentage of bases in the query that exactly match the corresponding
region of a sequence from the database. It may also be measured by the
degree of conservation, which additionally counts matches between
equivalent (redundant) codons or between amino acid residues with
similar properties that don't alter the function of a protein (see
Chapter 8).
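The difference between the two measures can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the two strings are already aligned, are the same length, and contain no gaps. The residue groups in SIMILAR_GROUPS are a crude, hypothetical grouping by side-chain properties; real tools such as BLAST score protein similarity with a substitution matrix such as BLOSUM62.

```python
# Crude, illustrative grouping of amino acids with similar properties.
# This is an assumption for the example, not a real substitution matrix.
SIMILAR_GROUPS = [set("ILVM"), set("FWY"), set("KRH"),
                  set("DE"), set("ST"), set("NQ")]

def _check(query, subject):
    if not query or len(query) != len(subject):
        raise ValueError("sequences must be non-empty and the same length")

def percent_identity(query, subject):
    """Percentage of aligned positions that match exactly."""
    _check(query, subject)
    matches = sum(1 for a, b in zip(query, subject) if a == b)
    return 100.0 * matches / len(query)

def is_conservative(a, b):
    """True if two residues are identical or fall in the same group."""
    return a == b or any(a in g and b in g for g in SIMILAR_GROUPS)

def percent_similarity(query, subject):
    """Percentage of aligned positions that are identical or conservative."""
    _check(query, subject)
    matches = sum(1 for a, b in zip(query, subject) if is_conservative(a, b))
    return 100.0 * matches / len(query)

print(percent_identity("ACGTACGT", "ACGTTCGA"))  # 75.0
print(percent_identity("LKDS", "IRET"))          # 0.0
print(percent_similarity("LKDS", "IRET"))        # 100.0
```

The last two calls show why the measures can disagree: the two made-up peptides share no identical residue, yet every position is a conservative substitution under the grouping assumed here.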
Homology between sequences means the sequences are related
evolutionarily; that is, they descend from a common ancestor. Two
sequences either are or are not homologous; there's no degree of
homology.
At the risk of oversimplifying a complex topic, I'll summarize
a few facts about BLAST statistics. (See the BLAST documentation for
a complete picture.) The output of a BLAST search reports a set of
scores and statistics on the matches it
has found based on the raw score S, various parameters of the scoring
algorithm, and properties of the query and database. The
raw score S is a measure of similarity and
the size of the match. The BLAST output lists the hits ranked by
their E value. The E (expect) value of a match is, roughly, the number
of matches with a score at least as good (allowing for gaps) that you
would expect to find purely by chance in a randomly generated database
of the same size and composition. The closer to 0 the E value is, the
less likely it is that the match occurred by chance. In other words,
the lower the E value, the better
the match. As a general rule of thumb for BLASTN, an E value less
than 1 may be a solid hit, and an E value of less than 10 may be
worth looking at, but this is not a hard and fast rule. (Of course,
proteins can be homologous with even a very
small percent identity; the percent similarity is typically higher
for homologous DNA.)
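How the raw score, the scoring parameters, and the sizes of the query and database combine can be made concrete with the Karlin-Altschul formula that underlies BLAST statistics, E = K * m * n * exp(-lambda * S). The Python below is a minimal sketch of that formula, not BLAST's implementation. The values chosen for LAM and K are illustrative placeholders (BLAST derives them from the scoring system and prints them with its results), and the sketch uses raw lengths where BLAST uses adjusted "effective" lengths.

```python
import math

def expect_value(raw_score, query_len, db_len, lam, k):
    """Karlin-Altschul expect value: E = K * m * n * exp(-lambda * S)."""
    return k * query_len * db_len * math.exp(-lam * raw_score)

def bit_score(raw_score, lam, k):
    """Normalized (bit) score: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)

# Placeholder parameters and sizes, assumed for illustration only.
LAM, K = 1.37, 0.711
QUERY_LEN = 500          # m: length of the query
DB_LEN = 1_000_000       # n: total length of the database

for s in (20, 30, 40):
    e = expect_value(s, QUERY_LEN, DB_LEN, LAM, K)
    print(f"S = {s}  bits = {bit_score(s, LAM, K):.1f}  E = {e:.3g}")
```

Two properties of the formula are worth noticing: E falls exponentially as the raw score rises, and it grows in proportion to the size of the database, so the same alignment earns a larger (worse) E value when it is found in a bigger database.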
Now that you have the basics, let's write code to parse BLAST
output. First, you separate the hits, then extract the sequence, and
finally, you find the annotation showing the E value statistic.