10.4
Parsing Annotations
Now that you've
successfully extracted the sequence, let's look at parsing the
annotations of a GenBank file.
Looking at a GenBank record, it's interesting to think about
how to extract the useful information. The FEATURES table is
certainly a key part of the story. It has considerable structure:
what should be preserved, and what is unnecessary? For instance,
sometimes you just want to see if a word such as
"endonuclease" appears anywhere in the record. For this,
you just need a subroutine that searches for any regular expression
in the annotation. Sometimes this is enough, but when detailed
surgery is necessary, Perl has the necessary tools to make the
operation successful.
10.4.1
Using Arrays
Example 10-3 parses a few pieces
of information from the annotations in a GenBank file. It does this
using the data in the form of an array.
Example 10-3. Parsing GenBank annotations using arrays
#!/usr/bin/perl
# Parsing GenBank annotations using arrays
use strict;
use warnings;
use BeginPerlBioinfo; # see Chapter 6 about this module
# Declare and initialize variables
my @genbank = ( );
my $locus = '';
my $accession = '';
my $organism = '';
# Get GenBank file data
@genbank = get_file_data('record.gb');
# Let's start with something simple. Let's get some of the identifying
# information, let's say the locus and accession number (here the same
# thing) and the definition and the organism.
for my $line (@genbank) {
if($line =~ /^LOCUS/) {
$line =~ s/^LOCUS\s*//;
$locus = $line;
}elsif($line =~ /^ACCESSION/) {
$line =~ s/^ACCESSION\s*//;
$accession = $line;
}elsif($line =~ /^ ORGANISM/) {
$line =~ s/^\s*ORGANISM\s*//;
$organism = $line;
}
}
print "*** LOCUS ***\n";
print $locus;
print "*** ACCESSION ***\n";
print $accession;
print "*** ORGANISM ***\n";
print $organism;
exit;
Here's the output from Example 10-3:
*** LOCUS ***
AB031069 2487 bp mRNA PRI 27-MAY-2000
*** ACCESSION ***
AB031069
*** ORGANISM ***
Homo sapiens
Now let's slightly extend that program to handle the DEFINITION field. Notice that the
DEFINITION field can extend over more than one line. To collect that
field, use a trick you've already seen in Example 10-1: set a flag when you're in the
"state" of collecting a definition. The flag variable is
called, unsurprisingly, $flag.
Example 10-4. Parsing GenBank annotations using arrays, take 2
#!/usr/bin/perl
# Parsing GenBank annotations using arrays, take 2
use strict;
use warnings;
use BeginPerlBioinfo; # see Chapter 6 about this module
# Declare and initialize variables
my @genbank = ( );
my $locus = '';
my $accession = '';
my $organism = '';
my $definition = '';
my $flag = 0;
# Get GenBank file data
@genbank = get_file_data('record.gb');
# Let's start with something simple. Let's get some of the identifying
# information, let's say the locus and accession number (here the same
# thing) and the definition and the organism.
for my $line (@genbank) {
if($line =~ /^LOCUS/) {
$line =~ s/^LOCUS\s*//;
$locus = $line;
}elsif($line =~ /^DEFINITION/) {
$line =~ s/^DEFINITION\s*//;
$definition = $line;
$flag = 1;
}elsif($line =~ /^ACCESSION/) {
$line =~ s/^ACCESSION\s*//;
$accession = $line;
$flag = 0;
}elsif($flag) {
chomp($definition);
$definition .= $line;
}elsif($line =~ /^ ORGANISM/) {
$line =~ s/^\s*ORGANISM\s*//;
$organism = $line;
}
}
print "*** LOCUS ***\n";
print $locus;
print "*** DEFINITION ***\n";
print $definition;
print "*** ACCESSION ***\n";
print $accession;
print "*** ORGANISM ***\n";
print $organism;
exit;
Example 10-4 outputs:
*** LOCUS ***
AB031069 2487 bp mRNA PRI 27-MAY-2000
*** DEFINITION ***
Homo sapiens PCCX1 mRNA for protein containing CXXC domain 1, complete cds.
*** ACCESSION ***
AB031069
*** ORGANISM ***
Homo sapiens
This use of flags to remember which part of the file you're in,
from one iteration of a loop to the next, is a common technique when
extracting information from files that have multiline sections. As
the files and their fields get more complex, the code must keep track
of many flags at a time to remember which part of the file it's
in and what information needs to be extracted. It works, but as the
files become more complex, so does the code. It becomes hard to read
and hard to modify. So let's look at regular expressions as
a
vehicle for parsing annotations.
10.4.2
When to Use Regular Expressions
We've used two methods to parse GenBank
files: regular expressions and looping
through arrays of lines and setting flags. We used both methods to
separate the annotation from the sequence in a previous section of
this chapter. Both methods were equally well suited, since in GenBank
files, the annotation is followed by the sequence, clearly delimited
by an ORIGIN line: a simple structure. However,
parsing the annotations seems a bit more complicated; therefore,
let's try to use regular expressions to accomplish the task.
To begin, let's wrap the code we've been working on into
some convenient subroutines to focus on parsing the
annotations. You'll want to fetch GenBank records one at a time
from a library (a file containing one or more GenBank records),
extract the annotations and the sequence, and then if desired parse
the annotations. This would be useful if, say, you were looking for
some motif in a GenBank library. Then you can search for the motif,
and, if found, you can parse the annotations to look for additional
information about the sequence.
As mentioned previously, we'll use the file
library.gb, which you can download from this
book's web site.
Since dealing with annotation
data is somewhat complex, let's take a minute to break our
tasks into convenient subroutines. Here's the pseudocode:
sub open_file
given the filename, return the filehandle
sub get_next_record
given the filehandle, get the record
(we can get the offset by first calling "tell")
sub get_annotation_and_dna
given a record, split it into annotation and cleaned-up sequence
sub search_sequence
given a sequence and a regular expression,
return array of locations of hits
sub search_annotation
given a GenBank annotation and a regular expression,
return array of locations of hits
sub parse_annotation
separate out the fields of the annotation in a convenient form
sub parse_features
given the features field, separate out the components
The idea is to make a subroutine for each important task you want to
accomplish and then combine them into useful programs. Some of these
can be combined into other subroutines: for instance, perhaps you
want to open a file and get the record from it, all in one subroutine
call.
You're designing these subroutines to work with library files,
that is, files with multiple GenBank records. You pass the filehandle
into the subroutines as an argument, so that your subroutines can
access open library files as represented by the filehandles. Doing so
enables you to have a get_next_record
function, which is handy in a loop. Using the Perl function
tell also allows you to save the byte offset
of any record of interest, and then return later and extract the
record at that byte offset very quickly. (A byte
offset is just the number of characters into the file
where the information of interest lies.) The operating system
supports Perl in letting you go immediately to any byte offset
location in even huge files, thus bypassing the usual way of opening
the file and reading from the beginning until you get where you want
to be.
Using a byte offset is important when you're dealing with large
files. Perl gives you built-in functions such as
seek that allow you, on an open file, to go
immediately to any location in the
file. The idea is that when you find
something in a file, you can save the byte offset using the Perl
function tell. Then, when you want to return to
that point in the file, you can just call the Perl function
seek with the byte offset as an argument.
You'll see this later in this chapter when you build a DBM file
to look up records based on their accession numbers. But the main
point is that with a 250-MB file, it takes too long to find something
by searching from the beginning, and there are ways of getting around
it.
The parsing of the data is done in three steps, according to the
design:
-
You'll separate out the annotation and the sequence (which
you'll clean up by removing whitespace, etc., and making it a
simple string of sequence). Even at this step, you can search for
motifs in the sequence, as well as look for text in the annotation.
-
Extract out the fields.
-
Parse the features table.
These steps seem natural, and, depending on what you want to do,
allow you to parse to whatever depth is needed.
Here's a main program in pseudocode that shows how to use those
subroutines:
open_file
while ( get_next_record )
get_annotation_and_dna
if ( search_sequence for a motif AND
search_annotation for chromosome 22 )
parse_annotation
parse_features to get sizes of exons, look for small sizes
}
}
return accession numbers of records meeting the criteria
This example shows how to use subroutines to answer a question such
as: what are the genes on chromosome 22 that contain a
given
motif and have small exons?
10.4.3
Main Program
Let's test these
subroutines with Example 10-5, which has some subroutine definitions that
will be added to the BeginPerlBioinfo.pm module:
Example 10-5. GenBank library subroutines
#!/usr/bin/perl
# - test program of GenBank library subroutines
use strict;
use warnings;
# Don't use BeginPerlBioinfo
# Since all subroutines defined in this file
# use BeginPerlBioinfo; # see Chapter 6 about this module
# Declare and initialize variables
my $fh; # variable to store filehandle
my $record;
my $dna;
my $annotation;
my $offset;
my $library = 'library.gb';
# Perform some standard subroutines for test
$fh = open_file($library);
$offset = tell($fh);
while( $record = get_next_record($fh) ) {
($annotation, $dna) = get_annotation_and_dna($record);
if( search_sequence($dna, 'AAA[CG].')) {
print "Sequence found in record at offset $offset\n";
}
if( search_annotation($annotation, 'homo sapiens')) {
print "Annotation found in record at offset $offset\n";
}
$offset = tell($fh);
}
exit;
################################################################################
# Subroutines
################################################################################
# open_file
#
# - given filename, set filehandle
sub open_file {
my($filename) = @_;
my $fh;
unless(open($fh, $filename)) {
print "Cannot open file $filename\n";
exit;
}
return $fh;
}
# get_next_record
#
# - given GenBank record, get annotation and DNA
sub get_next_record {
my($fh) = @_;
my($offset);
my($record) = '';
my($save_input_separator) = $/;
$/ = "//\n";
$record = <$fh>;
$/ = $save_input_separator;
return $record;
}
# get_annotation_and_dna
#
# - given GenBank record, get annotation and DNA
sub get_annotation_and_dna {
my($record) = @_;
my($annotation) = '';
my($dna) = '';
# Now separate the annotation from the sequence data
($annotation, $dna) = ($record =~ /^(LOCUS.*ORIGIN\s*\n)(.*)\/\/\n/s);
# clean the sequence of any whitespace or / characters
# (the / has to be written \/ in the character class, because
# / is a metacharacter, so it must be "escaped" with \)
$dna =~ s/[\s\/]//g;
return($annotation, $dna)
}
# search_sequence
#
# - search sequence with regular expression
sub search_sequence {
my($sequence, $regularexpression) = @_;
my(@locations) = ( );
while( $sequence =~ /$regularexpression/ig ) {
push( @locations, pos );
}
return (@locations);
}
# search_annotation
#
# - search annotation with regular expression
sub search_annotation {
my($annotation, $regularexpression) = @_;
my(@locations) = ( );
# note the /s modifier--. matches any character including newline
while( $annotation =~ /$regularexpression/isg ) {
push( @locations, pos );
}
return (@locations);
}
Example 10-5 generates the following output on our
little GenBank library:
Sequence found in record at offset 0
Annotation found in record at offset 0
Sequence found in record at offset 6256
Annotation found in record at offset 6256
Sequence found in record at offset 12366
Annotation found in record at offset 12366
Sequence found in record at offset 17730
Annotation found in record at offset 17730
Sequence found in record at offset 22340
Annotation found in record at offset 22340
The tell
function reports the byte offset of the
file up to the point where it's been read; so you want to first
call tell and then read the record to get the
proper offset associated with the beginning of the record.
10.4.4
Parsing Annotations at the Top Level
Now let's parse the annotations.
There is a document from NCBI we mentioned earlier that gives the
details of the structure of a GenBank record. This file is
gbrel.txt and is part of the GenBank release,
available at the NCBI web site and their FTP site. It's updated
with every release (every two months at present), and it includes
notices of changes to the format. If you program with GenBank
records, you should read this document and keep a copy around for
reference use, and check periodically for announced changes in the
GenBank record format.
If you look back at the complete
GenBank record earlier in this chapter, you'll see that the
annotations have a certain structure. You have some fields, such as
LOCUS, DEFINITION, ACCESSION, VERSION, KEYWORDS, SOURCE, REFERENCE,
FEATURES, and BASE COUNT that start at the beginning of a line. Some
of these fields have subfields, especially the FEATURES field, which
has a fairly complex structure.
But for now, let's just extract the top-level fields. You will
need a regular expression that matches
everything from a word at the beginning of a line to a newline that
just precedes another word at the beginning of a line.
Here's a regular expression that matches our definition of a
field:
/^[A-Z].*\n(^\s.*\n)*/m
What does this regular expression say? First of all, it has the
/m pattern matching modifier, which means the
caret ^ and the dollar sign $
also match around embedded newlines (not just at the beginning and
end of the entire string, which is the default behavior).
The first part of the regular expression:
^[A-Z].*\n
matches a capital letter at the beginning of a line, followed by any
number of characters (except newlines), followed by a newline.
That's a good description of the first lines of the fields
you're trying to match.
The second part of the regular expression:
(^\s.*\n)*
matches a space or tab \s at the beginning of a
line, followed by any number of characters (except newlines),
followed by a newline. This is surrounded by parentheses and followed
by a *, which means 0 or more such lines. This matches succeeding
lines in a field, lines that start with whitespace. A field may have
no extra lines of this sort or several such lines.
So, the two parts of the regular expression combined match the fields
with their optional additional lines.
Example 10-6 shows a subroutine that, given the
annotations section of a GenBank record stored in a scalar variable,
returns a hash with keys equal to the names of the top-level fields
and values equal to the contents of those fields.
Example 10-6. Parsing Genbank annotation
#!/usr/bin/perl
# - test program for parse_annotation subroutine
use strict;
use warnings;
use BeginPerlBioinfo; # see Chapter 6 about this module
# Declare and initialize variables
my $fh;
my $record;
my $dna;
my $annotation;
my %fields;
my $library = 'library.gb';
# Open library and read a record
$fh = open_file($library);
$record = get_next_record($fh);
# Parse the sequence and annotation
($annotation, $dna) = get_annotation_and_dna($record);
# Extract the fields of the annotation
%fields = parse_annotation($annotation);
# Print the fields
foreach my $key (keys %fields) {
print "******** $key *********\n";
print $fields{$key};
}
exit;
################################################################################
# Subroutine
################################################################################
# parse_annotation
#
# given a GenBank annotation, returns a hash with
# keys: the field names
# values: the fields
sub parse_annotation {
my($annotation) = @_;
my(%results) = ( );
while( $annotation =~ /^[A-Z].*\n(^\s.*\n)*/gm ) {
my $value = $&;
(my $key = $value) =~ s/^([A-Z]+).*/$1/s;
$results{$key} = $value;
}
return %results;
}
In the
subroutine
parse_annotation, note how the variables
$key and $value are scoped
within the while block. One benefit of this is
that you don't have to reinitialize the variables each time
through the loop. Also note that the key is the name of the field,
and the value is the whole field.
You should take the time to understand the regular expression that
extracts the field name for the key:
(my $key = $value) =~ s/^([A-Z]+).*/$1/s;
This first assigns $key the value
$value. It then replaces everything in
$key (note the /s modifier for
embedded newlines) with $1, which is a special
variable pattern between the first pair of parentheses
([A-Z]+). This pattern is one or more capital
letters (anchored to the beginning of the string, i.e., the field
name), so it sets $key to the value of the first
word in the field name.
You get the following output from Example 10-6 (the
test just fetches the first record in the GenBank library):
******** SOURCE *********
SOURCE Homo sapiens embryo male lung fibroblast cell_line:HuS-L12 cDNA to
mRNA.
ORGANISM Homo sapiens
Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
******** DEFINITION *********
DEFINITION Homo sapiens PCCX1 mRNA for protein containing CXXC domain 1,
complete cds.
******** KEYWORDS *********
KEYWORDS .
******** VERSION *********
VERSION AB031069.1 GI:8100074
******** FEATURES *********
FEATURES Location/Qualifiers
source 1..2487
/organism="Homo sapiens"
/db_xref="taxon:9606"
/sex="male"
/cell_line="HuS-L12"
/cell_type="lung fibroblast"
/dev_stage="embryo"
gene 229..2199
/gene="PCCX1"
CDS 229..2199
/gene="PCCX1"
/note="a nuclear protein carrying a PHD finger and a CXXC
domain"
/codon_start=1
/product="protein containing CXXC domain 1"
/protein_id="BAA96307.1"
/db_xref="GI:8100075"
/translation="MEGDGSDPEPPDAGEDSKSENGENAPIYCICRKPDINCFMIGCD
NCNEWFHGDCIRITEKMAKAIREWYCRECREKDPKLEIRYRHKKSRERDGNERDSSEP
RDEGGGRKRPVPDPDLQRRAGSGTGVGAMLARGSASPHKSSPQPLVATPSQHHQQQQQ
QIKRSARMCGECEACRRTEDCGHCDFCRDMKKFGGPNKIRQKCRLRQCQLRARESYKY
FPSSLSPVTPSESLPRPRRPLPTQQQPQPSQKLGRIREDEGAVASSTVKEPPEATATP
EPLSDEDLPLDPDLYQDFCAGAFDDHGLPWMSDTEESPFLDPALRKRAVKVKHVKRRE
KKSEKKKEERYKRHRQKQKHKDKWKHPERADAKDPASLPQCLGPGCVRPAQPSSKYCS
DDCGMKLAANRIYEILPQRIQQWQQSPCIAEEHGKKLLERIRREQQSARTRLQEMERR
FHELEAIILRAKQQAVREDEESNEGDSDDTDLQIFCVSCGHPINPRVALRHMERCYAK
YESQTSFGSMYPTRIEGATRLFCDVYNPQSKTYCKRLQVLCPEHSRDPKVPADEVCGC
PLVRDVFELTGDFCRLPKRQCNRHYCWEKLRRAEVDLERVRVWYKLDELFEQERNVRT
AMTNRAGLLALMLHQTIQHDPLTTDLRSSADR"
******** REFERENCE *********
REFERENCE 2 (bases 1 to 2487)
AUTHORS Fujino,T., Hasegawa,M., Shibata,S., Kishimoto,T., Imai,S. and
Takano,T.
TITLE Direct Submission
JOURNAL Submitted (15-AUG-1999) to the DDBJ/EMBL/GenBank databases.
Tadahiro Fujino, Keio University School of Medicine, Department of
Microbiology; Shinanomachi 35, Shinjuku-ku, Tokyo 160-8582, Japan
(E-mail:fujino@microb.med.keio.ac.jp,
Tel:+81-3-3353-1211(ex.62692), Fax:+81-3-5360-1508)
******** ACCESSION *********
ACCESSION AB031069
******** LOCUS *********
LOCUS AB031069 2487 bp mRNA PRI 27-MAY-2000
******** ORIGIN *********
ORIGIN
******** BASE *********
BASE COUNT 564 a 715 c 768 g 440 t
As you see, the method is working, and apart from the difficulty of
reading the regular expressions (which will become easier with
practice), the code is very straightforward, just a few short
subroutines.
10.4.5
Parsing the FEATURES Table
Let's take this one step further and parse the
features table to its next level, composed of the
source , gene, and
CDS features keys. (See
later in this section for a more complete list of these features
keys.) In the exercises at the end of the chapter, you'll be
challenged to descend further into the FEATURES table.
To study the FEATURES table, you should first look over the
NCBI gbrel.txt document mentioned previously.
Then you should study the most complete documentation for the
FEATURES table, available at http://www.ncbi.nlm.nih.gov/collab/FT/index.html.
10.4.5.1
Features
Although our GenBank entry is fairly simple and
includes only three features, there are actually quite a few of them.
Notice that the parsing code will find all of them, because
it's just looking at the structure of the document, not for
specific features.
The following is a list of the features defined for GenBank records.
Although lengthy, I think it's important to read through it to
get an idea of the range of information that may be present in a
GenBank record.
-
allele
-
Obsolete; see variation feature key
-
attenuator
-
Sequence related to transcription termination
-
C_region
-
Span of the C immunological feature
-
CAAT_signal
-
CAAT box in eukaryotic promoters
-
CDS
-
Sequence coding for amino acids in protein (includes stop codon)
-
conflict
-
Independent sequence determinations differ
-
D-loop
-
Displacement loop
-
D_segment
-
Span of the D immunological feature
-
enhancer
-
Cis-acting enhancer of promoter function
-
exon
-
Region that codes for part of spliced mRNA
-
gene
-
Region that defines a functional gene, possibly including upstream
(promoter, enhancer, etc.) and downstream control elements, and for
which a name has been assigned
-
GC_signal
-
GC box in eukaryotic promoters
-
iDNA
-
Intervening DNA eliminated by recombination
-
intron
-
Transcribed region excised by mRNA splicing
-
J_region
-
Span of the J immunological feature
-
LTR
-
Long terminal repeat
-
mat_peptide
-
Mature peptide coding region (doesn't include stop codon)
-
misc_binding
-
Miscellaneous binding site
-
misc_difference
-
Miscellaneous difference feature
-
misc_feature
-
Region of biological significance that can't be described by
any other feature
-
misc_recomb
-
Miscellaneous recombination feature
-
misc_RNA
-
Miscellaneous transcript feature not defined by other RNA keys
-
misc_signal
-
Miscellaneous signal
-
misc_structure
-
Miscellaneous DNA or RNA structure
-
modified_base
-
The indicated base is a modified nucleotide
-
mRNA
-
Messenger RNA
-
mutation
-
Obsolete: see variation feature key
-
N_region
-
Span of the N immunological feature
-
old_sequence
-
Presented sequence revises a previous version
-
polyA_signal
-
Signal for cleavage and polyadenylation
-
polyA_site
-
Site at which polyadenine is added to mRNA
-
precursor_RNA
-
Any RNA species that isn't yet the mature RNA product
-
prim_transcript
-
Primary (unprocessed) transcript
-
primer
-
Primer binding region used with PCR
-
primer_bind
-
Noncovalent primer binding site
-
promoter
-
A region involved in transcription initiation
-
protein_bind
-
Noncovalent protein binding site on DNA or RNA
-
RBS
-
Ribosome binding site
-
rep_origin
-
Replication origin for duplex DNA
-
repeat_region
-
Sequence containing repeated subsequences
-
repeat_unit
-
One repeated unit of a repeat_region
-
rRNA
-
Ribosomal RNA
-
S_region
-
Span of the S immunological feature
-
satellite
-
Satellite repeated sequence
-
scRNA
-
Small cytoplasmic RNA
-
sig_peptide
-
Signal peptide coding region
-
snRNA
-
Small nuclear RNA
-
source
-
Biological source of the sequence data represented by a GenBank
record; mandatory feature, one or more per record; for organisms that
have been incorporated within the NCBI taxonomy database, an
associated /db_xref="taxon:NNNN" qualifier will be
present (where NNNNN is the numeric identifier
assigned to the organism within the NCBI taxonomy database)
-
stem_loop
-
Hairpin loop structure in DNA or RNA
-
STS
-
Sequence Tagged Site: operationally unique sequence that identifies
the combination of primer spans used in a PCR assay
-
TATA_signal
-
TATA box in eukaryotic promoters
-
terminator
-
Sequence causing transcription termination
-
transit_peptide
-
Transit peptide coding region
-
transposon
-
Transposable element (TN)
-
tRNA
-
Transfer RNA
-
unsure
-
Authors are unsure about the sequence in this region
-
V_region
-
Span of the V immunological feature
-
variation
-
A related population contains stable mutation
-
-
-
Placeholder (hyphen)
-
-10_signal
-
Pribnow box in prokaryotic promoters
-
-35_signal
-
-35 box in prokaryotic promoters
-
3'clip
-
3'-most region of a precursor transcript removed in processing
-
3'UTR
-
3' untranslated region (trailer)
-
5'clip
-
5'-most region of a precursor transcript removed in processing
-
5'UTR
-
5' untranslated region (leader)
These feature keys can have their own additional features, which
you'll see here and in the exercises.
10.4.5.2
Parsing
Example 10-7 finds
whatever features are present and returns an array populated with
them. It doesn't look for the complete list of features as
presented in the last section; it finds just the features that are
actually present in the GenBank record and returns them for further
use.
It's often the case that there are multiple instances of the
same feature in a record. For instance, there may be several exons
specified in the FEATURES table of a GenBank record. For this reason
we'll store the features as elements in an array, rather than
in a hash keyed on the feature name (as this allows you to store, for
instance, only one instance of an exon).
Example 10-7. Testing subroutine parse_features
#!/usr/bin/perl
# - main program to test parse_features
use strict;
use warnings;
use BeginPerlBioinfo; # see Chapter 6 about this module
# Declare and initialize variables
my $fh;
my $record;
my $dna;
my $annotation;
my %fields;
my @features;
my $library = 'library.gb';
# Get the fields from the first GenBank record in a library
$fh = open_file($library);
$record = get_next_record($fh);
($annotation, $dna) = get_annotation_and_dna($record);
%fields = parse_annotation($annotation);
# Extract the features from the FEATURES table
@features = parse_features($fields{'FEATURES'});
# Print out the features
foreach my $feature (@features) {
# extract the name of the feature (or "feature key")
my($featurename) = ($feature =~ /^ {5}(\S+)/);
print "******** $featurename *********\n";
print $feature;
}
exit;
############################################################################
# Subroutine
############################################################################
# parse_features
#
# extract the features from the FEATURES field of a GenBank record
sub parse_features {
my($features) = @_; # entire FEATURES field in a scalar variable
# Declare and initialize variables
my(@features) = (); # used to store the individual features
# Extract the features
while( $features =~ /^ {5}\S.*\n(^ {21}\S.*\n)*/gm ) {
my $feature = $&;
push(@features, $feature);
}
return @features;
}
Example 10-7 gives the output:
******** source *********
source 1..2487
/organism="Homo sapiens"
/db_xref="taxon:9606"
/sex="male"
/cell_line="HuS-L12"
/cell_type="lung fibroblast"
/dev_stage="embryo"
******** gene *********
gene 229..2199
/gene="PCCX1"
******** CDS *********
CDS 229..2199
/gene="PCCX1"
/note="a nuclear protein carrying a PHD finger and a CXXC
domain"
/codon_start=1
/product="protein containing CXXC domain 1"
/protein_id="BAA96307.1"
/db_xref="GI:8100075"
/translation="MEGDGSDPEPPDAGEDSKSENGENAPIYCICRKPDINCFMIGCD
NCNEWFHGDCIRITEKMAKAIREWYCRECREKDPKLEIRYRHKKSRERDGNERDSSEP
RDEGGGRKRPVPDPDLQRRAGSGTGVGAMLARGSASPHKSSPQPLVATPSQHHQQQQQ
QIKRSARMCGECEACRRTEDCGHCDFCRDMKKFGGPNKIRQKCRLRQCQLRARESYKY
FPSSLSPVTPSESLPRPRRPLPTQQQPQPSQKLGRIREDEGAVASSTVKEPPEATATP
EPLSDEDLPLDPDLYQDFCAGAFDDHGLPWMSDTEESPFLDPALRKRAVKVKHVKRRE
KKSEKKKEERYKRHRQKQKHKDKWKHPERADAKDPASLPQCLGPGCVRPAQPSSKYCS
DDCGMKLAANRIYEILPQRIQQWQQSPCIAEEHGKKLLERIRREQQSARTRLQEMERR
FHELEAIILRAKQQAVREDEESNEGDSDDTDLQIFCVSCGHPINPRVALRHMERCYAK
YESQTSFGSMYPTRIEGATRLFCDVYNPQSKTYCKRLQVLCPEHSRDPKVPADEVCGC
PLVRDVFELTGDFCRLPKRQCNRHYCWEKLRRAEVDLERVRVWYKLDELFEQERNVRT
AMTNRAGLLALMLHQTIQHDPLTTDLRSSADR
In subroutine parse_features of Example 10-7, the regular expression that extracts the
features is much like the regular expression used in Example 10-6 to parse the top level of the annotations.
Let's look at the essential parsing code of Example 10-7:
while( $features =~ /^ {5}\S.*\n(^ {21}\S.*\n)*/gm ) {
On the whole, and in brief, this regular expression finds features
formatted with the first lines beginning with 5 spaces, and optional
continuation lines beginning with 21 spaces.
First, note that the pattern modifier /m enables
the ^ metacharacter to match after embedded
newlines. Also, the {5} and
{21} are quantifiers that specify there should be
exactly 5, or 21, of the preceding item, which in both cases is a
space.
The regular expression is in two parts, corresponding to the first
line and optional continuation lines of the feature. The first part
^ {5}\S.*\n means that the
beginning of a line (^) has 5 spaces
({5}), followed by a non-whitespace character
(\S) followed by any number of non-newlines
(.*) followed by a newline
(\n). The second part of the regular expression,
(^ {21}\S.*\n)* means the
beginning of a line (^) has 21 spaces
({21}) followed by a non-whitespace character
(\S) followed by any number of non-newlines
(.*) followed by a newline
(\n); and there may be 0 or
more such lines, indicated by the ()* around the
whole expression.
The main program has a short regular expression along similar lines
to extract the feature name (also called the feature key) from the
feature.
So, again, success. The FEATURES table is now decomposed or
"parsed" in some detail, down to the level of separating
the individual features. The next stage in parsing the FEATURES table
is to extract the detailed information for each feature. This
includes the location (given on the same line as the feature name,
and possibly on additional lines); and the qualifiers indicated by a
slash, a qualifier name, and if applicable, an equals sign and
additional information of various kinds, possibly continued on
additional lines.
I'll leave this final step for the exercises. It's a
fairly straightforward extension of the approach we've been
using to parse the features. You will want to consult the
documentation from the NCBI web site for complete details about the
structure of the FEATURES table before trying to parse the location
and qualifiers from a feature.
The method I've used to parse the FEATURES table maintains the
structure of the information. However, sometimes you just want to see
if some word such as "endonulease" appears anywhere in
the record. For this, recall that you created a
search_annotation
subroutine in Example 10-5 that searches for any
regular expression in the entire annotation; very often, this is all
you really need. As you've now seen, however, when you really
need to dig into the FEATURES table, Perl has its own features that
make the job possible and even fairly easy.