Chapter 12. BLAST
In biological research, searching for sequence similarity is a
fundamental task. For instance, a researcher who has discovered a
potentially important DNA or protein sequence wants to know if
it's already been identified and characterized by another
researcher. If it hasn't, the researcher wants to know if it
resembles any known sequence from any organism. This information can
provide vital clues as to the role of the sequence in the organism.

The Basic Local Alignment Search Tool (BLAST) is one of the most
popular software
tools in biological research. It tests a query sequence against a
library of known sequences in order to find similarity. BLAST is
actually a collection of programs with versions for query-to-database
pairs such as nucleotide-nucleotide, protein-nucleotide,
protein-protein, nucleotide-protein, and more.

This chapter examines the output from the nucleotide-nucleotide
version of the program, BLASTN. For simplicity, I'll refer to it
here as BLAST. The main goal of
this chapter is to show how to write code to parse a BLAST output
file using regular expressions. The code is simple and basic, but it
does the job. Once you understand the basics, you can build more
features into your parser or obtain one of the fancier BLAST output
parsers available on the Web. In either case,
you'll know enough about output parsers to use or extend them.
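
To give a sense of what parsing BLAST output with regular expressions looks like, here is a minimal sketch in Perl. It assumes a score line in the style that BLASTN prints at the top of each alignment. The sample line and the subroutine name parse_score_line are illustrative assumptions, not code from this chapter.

```perl
use strict;
use warnings;

# Hypothetical helper: extract the bit score, raw score, and E-value
# from one BLASTN-style score line. Returns an empty list on no match.
sub parse_score_line {
    my ($line) = @_;
    if ($line =~ /Score\s*=\s*([\d.]+)\s+bits\s+\((\d+)\),\s*Expect\s*=\s*(\S+)/) {
        return ($1, $2, $3);
    }
    return;
}

# Illustrative sample line (an assumption, not real search output).
my $sample = ' Score = 1140 bits (575), Expect = 0.0';
my ($bits, $raw, $expect) = parse_score_line($sample);
print "bit score: $bits, raw score: $raw, E-value: $expect\n";
```

The E-value is captured with \S+ rather than a numeric pattern because BLAST may print it in forms such as 2e-31, which a digits-and-dot pattern would truncate.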

This chapter also gives you a brief introduction to Bioperl, which is
a collection of Perl bioinformatics modules. The Bioperl
project is an example of an open source project that you, the Perl
bioinformatics programmer, can put to good use. The Perl programming
language is itself an open source project. Perl and its source
code are available at no cost for use and modification, with only
very reasonable restrictions.