8.5
Reading DNA from Files in FASTA Format
Over the fairly short history of bioinformatics, several different
biologists and programmers invented several ways to format sequence
data in computer files, and so bioinformaticians must deal with these
different formats. We need to extract the sequence data and the
annotations from these files, which requires writing code to deal
with each different format.
There are many such formats, perhaps as many as 20 in regular
use for DNA alone. The very multiplicity of these formats can be an
annoyance when you're analyzing a sequence in the lab: it
becomes necessary to translate from one format to another for the
various programs you use to examine the sequence. Here are some of
the most popular:
-
FASTA
-
The FASTA and Basic Local
Alignment Search Technique (BLAST) programs are popular; they both
use the FASTA format. Because of its simplicity, the FASTA format is
perhaps the most widely used of all formats, aside from GenBank.
-
Genetic Sequence Data Bank (GenBank)
-
GenBank is a collection of all publicly released genetic data. It
includes lots of information in addition to the DNA sequence.
It's very important, and we'll be looking closely at
GenBank files in Chapter 10.
-
European Molecular Biology Laboratory (EMBL)
-
The EMBL database has substantially the same data
as the GenBank and the DDBJ (DNA Data Bank of Japan), but the format
is somewhat different.
-
Simple data, or Applied Biosystems (ABI) sequencer output
-
This is DNA sequence
data that has no formatting whatsoever,
just the characters that represent the bases; it is output into files
by the sequencing machines from ABI and from other machines and
programs.
-
Protein Identification Resource (PIR)
-
PIR is
a well-curated collection of protein
sequence data.
-
Genetics Computer Group (GCG)
-
The GCG program (a.k.a.
the GCG Wisconsin package) from Accelrys is used at many large
research institutions. Data must be in GCG format to be usable by
their programs.
Of these six sequence formats, GenBank and FASTA are by far the most
common. The next few sections take you through the process of reading
and manipulating data in FASTA.
8.5.1
FASTA Format
Let's write a subroutine that can handle
FASTA-style data. This is useful in
its own right and as a warm-up for the upcoming chapters on GenBank,
PDB, and BLAST.
FASTA format is basically just lines of sequence data with newlines
at the end so it can be printed on a page or displayed on a computer
screen. The length of the lines isn't specified, but for
compatibility, it's best to limit them to 80 characters in
length. There is also header
information, a line or lines at the
beginning of the file that start with the greater-than > character, that can
contain any text whatsoever (or no text). Typically, a header line
contains the name of the DNA or the gene it comes from, often
separated by a vertical bar from additional information about the
sequence, the experiment that produced it, or other, nonsequence
information of that nature.
Much FASTA-aware software insists that there must be only one header
line; others permit several lines. Our subroutine will accept either
one or several header lines plus comments beginning with #.
The following is a FASTA file. We'll call it
sample.dna and use it in several programs. You
should copy it, download it from this book's web site, or make
up your own file with your own data.
> sample dna | (This is a typical fasta header.)
agatggcggcgctgaggggtcttgggggctctaggccggccacctactgg
tttgcagcggagacgacgcatggggcctgcgcaataggagtacgctgcct
gggaggcgtgactagaagcggaagtagttgtgggcgcctttgcaaccgcc
tgggacgccgccgagtggtctgtgcaggttcgcgggtcgctggcgggggt
cgtgagggagtgcgccgggagcggagatatggagggagatggttcagacc
cagagcctccagatgccggggaggacagcaagtccgagaatggggagaat
gcgcccatctactgcatctgccgcaaaccggacatcaactgcttcatgat
cgggtgtgacaactgcaatgagtggttccatggggactgcatccggatca
ctgagaagatggccaaggccatccgggagtggtactgtcgggagtgcaga
gagaaagaccccaagctagagattcgctatcggcacaagaagtcacggga
gcgggatggcaatgagcgggacagcagtgagccccgggatgagggtggag
ggcgcaagaggcctgtccctgatccagacctgcagcgccgggcagggtca
gggacaggggttggggccatgcttgctcggggctctgcttcgccccacaa
atcctctccgcagcccttggtggccacacccagccagcatcaccagcagc
agcagcagcagatcaaacggtcagcccgcatgtgtggtgagtgtgaggca
tgtcggcgcactgaggactgtggtcactgtgatttctgtcgggacatgaa
gaagttcgggggccccaacaagatccggcagaagtgccggctgcgccagt
gccagctgcgggcccgggaatcgtacaagtacttcccttcctcgctctca
ccagtgacgccctcagagtccctgccaaggccccgccggccactgcccac
ccaacagcagccacagccatcacagaagttagggcgcatccgtgaagatg
agggggcagtggcgtcatcaacagtcaaggagcctcctgaggctacagcc
acacctgagccactctcagatgaggaccta
8.5.2
A Design to Read FASTA Files
In Chapter 4, you learned how to read in sequence
data; here, you just have to extend that method to deal with the
header lines. You'll also learn how to discard empty lines and
lines that begin with the pound sign #, i.e., comments in Perl and
other languages and file formats. (These don't appear in the
FASTA file sample.dna just shown.)
There are two choices when reading in the data. You can read from the
open file one line at a time, making decisions as you go. Or, you can
slurp the whole file into an array and then operate on the array. For
very big files, it's sometimes best to read them one line at a
time, especially if you're looking for some small bit of
information. (This is because reading a large file into an array uses
a large amount of memory. If your system isn't robust enough,
it may crash.)
For smaller, normal-sized files, the advantage to reading all the
data into an array is that you can then easily look through at the
data and do operations on it. That's what we'll do with
our subroutine, but remember, this approach can cause memory space
problems with larger files, and there are other ways of proceeding.
Let's write a
subroutine
that, given as an argument a filename containing FASTA-formatted
data, returns the sequence data.
Before doing so you should think about whether you should have just
one subroutine, or perhaps one subroutine that opens and reads a
file, called by another subroutine that extracts the sequence data.
Let's use two subroutines, keeping in mind that you can reuse
the subroutine that deals with arbitrary files every time you need to
write such a program for other formats.
Let's start with some pseudocode:
subroutine get data from a file
argument = filename
open file
if can't open, print error message and exit
read in data and
return @data
}
Subroutine extract sequence data from fasta file
argument = array of file data in fasta format
Discard all header lines
(and blank and comment lines for good measure)
If first character of first line is >, discard it
Read in the rest of the file, join in a scalar,
edit out nonsequence data
return sequence
}
In the first subroutine that gets data from a file, there's a
question as to what's the best thing to do when the file
can't be read. Here, we're taking the drastic approach:
yelling "Fire!" and exiting. But you wouldn't
necessarily want your program to just stop whenever it can't
open a file. Maybe you're asking for filenames from the user at
the keyboard or on a web page, and you'd like to give them
three chances to type in the filename correctly. Or maybe, if the
file can't be opened, you want a default file instead.
Maybe you can return a false value, such as an
empty array, if you can't open the file. Then a program that
calls this subroutine can exit, try again, or whatever it wants. But
what if you successfully open the file, but it was absolutely empty?
Then you'd have succeeded and returned an empty array, and the
program calling this subroutine would think incorrectly, that the
file couldn't be opened. So, that wouldn't work.
There are other options, such as returning the special
"undefined" value. Let's keep what we've got,
but it's important to remember that handling errors can be an
important, and sometimes tricky, part of writing robust
code, code that responds well in unusual circumstances.
The second subroutine takes the array of FASTA-formatted sequence and
returns just the unformatted sequence in a string.
8.5.3
A Subroutine to Read FASTA Files
Now that
you've thought about the problem,
written some pseudocode, considered alternate ways of designing the
subroutines and the costs and benefits of the choices, you're
ready to
code:
# get_file_data
#
# A subroutine to get data from a file given its filename
sub get_file_data {
my($filename) = @_;
use strict;
use warnings;
# Initialize variables
my @filedata = ( );
unless( open(GET_FILE_DATA, $filename) ) {
print STDERR "Cannot open file \"$filename\"\n\n";
exit;
}
@filedata = <GET_FILE_DATA>;
close GET_FILE_DATA;
return @filedata;
}
# extract_sequence_from_fasta_data
#
# A subroutine to extract FASTA sequence data from an array
sub extract_sequence_from_fasta_data {
my(@fasta_file_data) = @_;
use strict;
use warnings;
# Declare and initialize variables
my $sequence = '';
foreach my $line (@fasta_file_data) {
# discard blank line
if ($line =~ /^\s*$/) {
next;
# discard comment line
} elsif($line =~ /^\s*#/) {
next;
# discard fasta header line
} elsif($line =~ /^>/) {
next;
# keep line, add to sequence string
} else {
$sequence .= $line;
}
}
# remove non-sequence data (in this case, whitespace) from $sequence string
$sequence =~ s/\s//g;
return $sequence;
}
Notice that nowhere in the code for
extract_sequence_from_fasta_data
do you check to see what's in the file: is it really DNA or
protein sequence data in FASTA format? Of course, you can write a
subroutine—call it is_fasta—that
checks the data to see if it's what we expect. But I'll
leave that for the exercises.
A few comments about the
extract_sequence_from_fasta_data subroutine
should be made. The following line includes a
variable declaration as it is
used in a loop:
foreach my $line (@fasta_file_data) {
You've seen this in for loops as well.
It's convenient to declare these my
variables as $line on the spot, as they tend to
have common names and aren't used outside the loop.
Some of the regular expressions deserve brief
comment. In this line:
if ($line =~ /^\s*$/) {
the \s matches
whitespace, that is,
space, tab, formfeed, carriage return, or newline.
\s* matches any amount of whitespace (even none).
The ^ matches the beginning of the line, and the
$ matches the end of the line. So altogether, this
regular expression matches blank lines with nothing or only
whitespace in them.
This regular expression also has nothing or only whitespace at the
beginning of the line, up to a pound sign:
} elsif($line =~ /^\s*#/) {
This expression matches a greater-than sign at the beginning of the
line:
} elsif($line =~ /^>/) {
Finally, the following statement
removes
whitespace, including newlines:
$sequence =~ s/\s//g;
We've placed these two new subroutines into our
BeginPerlBioinfo.pm module. Now let's
write a main program for these subroutines and look at the output.
First, there's one more subroutine to write that handles the
printing of long sequences.
8.5.4
Writing Formatted Sequence Data
When
you try to
print the "raw" sequence
data, it can be a problem if the data is much longer than the width
of the page. For most practical purposes, 80 characters is about the
maximum length you should try to fit across a page. Let's write
a print_sequence subroutine that takes as its
arguments some sequence and a line length and prints out the
sequence, breaking it up into lines of that length. It will have a
strong similarity to the dna2peptide subroutine.
Here it is:
# print_sequence
#
# A subroutine to format and print sequence data
sub print_sequence {
my($sequence, $length) = @_;
use strict;
use warnings;
# Print sequence in lines of $length
for ( my $pos = 0 ; $pos < length($sequence) ; $pos += $length ) {
print substr($sequence, $pos, $length), "\n";
}
}
The code depends on the behavior of substr,
which gives the partial substring at the end of the string, even if
it's less than the requested length. You can see there's
a new print_sequence subroutine in the
BeginPerlBioinfo.pm module (see Chapter 6). We remembered to keep the statement
1; as the last line of the module. Example 8-2 shows the main program.
Example 8-2. Read a FASTA file and extract the sequence data
#!/usr/bin/perl
# Read a fasta file and extract the sequence data
use strict;
use warnings;
use BeginPerlBioinfo; # see Chapter 6 about this module
# Declare and initialize variables
my @file_data = ( );
my $dna = '';
# Read in the contents of the file "sample.dna"
@file_data = get_file_data("sample.dna");
# Extract the sequence data from the contents of the file "sample.dna"
$dna = extract_sequence_from_fasta_data(@file_data);
# Print the sequence in lines 25 characters long
print_sequence($dna, 25);
exit;
Here's the output of Example 8-2:
agatggcggcgctgaggggtcttgg
gggctctaggccggccacctactgg
tttgcagcggagacgacgcatgggg
cctgcgcaataggagtacgctgcct
gggaggcgtgactagaagcggaagt
agttgtgggcgcctttgcaaccgcc
tgggacgccgccgagtggtctgtgc
aggttcgcgggtcgctggcgggggt
cgtgagggagtgcgccgggagcgga
gatatggagggagatggttcagacc
cagagcctccagatgccggggagga
cagcaagtccgagaatggggagaat
gcgcccatctactgcatctgccgca
aaccggacatcaactgcttcatgat
cgggtgtgacaactgcaatgagtgg
ttccatggggactgcatccggatca
ctgagaagatggccaaggccatccg
ggagtggtactgtcgggagtgcaga
gagaaagaccccaagctagagattc
gctatcggcacaagaagtcacggga
gcgggatggcaatgagcgggacagc
agtgagccccgggatgagggtggag
ggcgcaagaggcctgtccctgatcc
agacctgcagcgccgggcagggtca
gggacaggggttggggccatgcttg
ctcggggctctgcttcgccccacaa
atcctctccgcagcccttggtggcc
acacccagccagcatcaccagcagc
agcagcagcagatcaaacggtcagc
ccgcatgtgtggtgagtgtgaggca
tgtcggcgcactgaggactgtggtc
actgtgatttctgtcgggacatgaa
gaagttcgggggccccaacaagatc
cggcagaagtgccggctgcgccagt
gccagctgcgggcccgggaatcgta
caagtacttcccttcctcgctctca
ccagtgacgccctcagagtccctgc
caaggccccgccggccactgcccac
ccaacagcagccacagccatcacag
aagttagggcgcatccgtgaagatg
agggggcagtggcgtcatcaacagt
caaggagcctcctgaggctacagcc
acacctgagccactctcagatgagg
accta
8.5.5
A Main Program for Reading DNA and Writing Protein
Now, one final program for this section. Let's add to the
preceding program a
translation from DNA to protein and
print out the protein instead. Notice how short Example 8-3 is! As you accumulate useful subroutines in
our modules, programs get easier and easier to write.
Example 8-3. Read a DNA FASTA file, translate to protein, and format output
#!/usr/bin/perl
# Read a fasta file and extract the DNA sequence data
# Translate it to protein and print it out in 25-character-long lines
use strict;
use warnings;
use BeginPerlBioinfo; # see Chapter 6 about this module
# Initialize variables
my @file_data = ( );
my $dna = '';
my $protein = '';
# Read in the contents of the file "sample.dna"
@file_data = get_file_data("sample.dna");
# Extract the sequence data from the contents of the file "sample.dna"
$dna = extract_sequence_from_fasta_data(@file_data);
# Translate the DNA to protein
$protein = dna2peptide($dna);
# Print the sequence in lines 25 characters long
print_sequence($protein, 25);
exit;
Here's the output of Example 8-3:
RWRR_GVLGALGRPPTGLQRRRRMG
PAQ_EYAAWEA_LEAEVVVGAFATA
WDAAEWSVQVRGSLAGVVRECAGSG
DMEGDGSDPEPPDAGEDSKSENGEN
APIYCICRKPDINCFMIGCDNCNEW
FHGDCIRITEKMAKAIREWYCRECR
EKDPKLEIRYRHKKSRERDGNERDS
SEPRDEGGGRKRPVPDPDLQRRAGS
GTGVGAMLARGSASPHKSSPQPLVA
TPSQHHQQQQQQIKRSARMCGECEA
CRRTEDCGHCDFCRDMKKFGGPNKI
RQKCRLRQCQLRARESYKYFPSSLS
PVTPSESLPRPRRPLPTQQQPQPSQ
KLGRIREDEGAVASSTVKEPPEATA
TPEPLSDEDL