< BACKCONTINUE >

10.4 Parsing Annotations

Now that you've successfully extracted the sequence, let's look at parsing the annotations of a GenBank file.

Looking at a GenBank record, it's interesting to think about how to extract the useful information. The FEATURES table is certainly a key part of the story. It has considerable structure: what should be preserved, and what is unnecessary? For instance, sometimes you just want to see if a word such as "endonuclease" appears anywhere in the record. For this, you just need a subroutine that searches for any regular expression in the annotation. Sometimes this is enough, but when detailed surgery is necessary, Perl has the necessary tools to make the operation successful.

10.4.1 Using Arrays

Example 10-3 parses a few pieces of information from the annotations in a GenBank file. It does this using the data in the form of an array.

Example 10-3. Parsing GenBank annotations using arrays
#!/usr/bin/perl
# Parsing GenBank annotations using arrays

use strict;
use warnings;
use BeginPerlBioinfo;     # see Chapter 6 about this module

# Declare and initialize variables
my @genbank = (  );
my $locus = '';
my $accession = '';
my $organism = '';

# Get GenBank file data
@genbank = get_file_data('record.gb');

# Let's start with something simple.  Let's get some of the identifying
# information, let's say the locus and accession number (here the same
# thing) and the definition and the organism.

for my $line (@genbank) {
  if($line =~ /^LOCUS/) {
    $line =~ s/^LOCUS\s*//;
    $locus = $line;
  }elsif($line =~ /^ACCESSION/) {
    $line =~ s/^ACCESSION\s*//;
    $accession = $line;
  }elsif($line =~ /^  ORGANISM/) {
    $line =~ s/^\s*ORGANISM\s*//;
    $organism = $line;
  }
}

print "*** LOCUS ***\n";
print $locus;
print "*** ACCESSION ***\n";
print $accession;
print "*** ORGANISM ***\n";
print $organism;

exit;

Here's the output from Example 10-3:

*** LOCUS ***
AB031069     2487 bp    mRNA            PRI       27-MAY-2000
*** ACCESSION ***
AB031069
*** ORGANISM ***
Homo sapiens

Now let's slightly extend that program to handle the DEFINITION field. Notice that the DEFINITION field can extend over more than one line. To collect that field, use a trick you've already seen in Example 10-1: set a flag when you're in the "state" of collecting a definition. The flag variable is called, unsurprisingly, $flag.

Example 10-4. Parsing GenBank annotations using arrays, take 2
#!/usr/bin/perl
# Parsing GenBank annotations using arrays, take 2

use strict;
use warnings;
use BeginPerlBioinfo;     # see Chapter 6 about this module

# Declare and initialize variables
my @genbank = (  );
my $locus = '';
my $accession = '';
my $organism = '';
my $definition = '';
my $flag = 0;

# Get GenBank file data
@genbank = get_file_data('record.gb');

# Let's start with something simple.  Let's get some of the identifying
# information, let's say the locus and accession number (here the same
# thing) and the definition and the organism.

for my $line (@genbank) {
  if($line =~ /^LOCUS/) {
    $line =~ s/^LOCUS\s*//;
    $locus = $line;
  }elsif($line =~ /^DEFINITION/) {
    $line =~ s/^DEFINITION\s*//;
    $definition = $line;
    $flag = 1;
  }elsif($line =~ /^ACCESSION/) {
    $line =~ s/^ACCESSION\s*//;
    $accession = $line;
    $flag = 0;
  }elsif($flag) {
    chomp($definition);
    $definition .= $line;
  }elsif($line =~ /^  ORGANISM/) {
    $line =~ s/^\s*ORGANISM\s*//;
    $organism = $line;
  }
}

print "*** LOCUS ***\n";
print $locus;
print "*** DEFINITION ***\n";
print $definition;
print "*** ACCESSION ***\n";
print $accession;
print "*** ORGANISM ***\n";
print $organism;

exit;

Example 10-4 outputs:

*** LOCUS ***
AB031069     2487 bp    mRNA            PRI       27-MAY-2000
*** DEFINITION ***
Homo sapiens PCCX1 mRNA for protein containing CXXC domain 1, complete cds.
*** ACCESSION ***
AB031069
*** ORGANISM ***
Homo sapiens

This use of flags to remember which part of the file you're in, from one iteration of a loop to the next, is a common technique when extracting information from files that have multiline sections. As the files and their fields get more complex, the code must keep track of many flags at a time to remember which part of the file it's in and what information needs to be extracted. It works, but as the files become more complex, so does the code. It becomes hard to read and hard to modify. So let's look at regular expressions as a vehicle for parsing annotations.

10.4.2 When to Use Regular Expressions

We've used two methods to parse GenBank files: regular expressions and looping through arrays of lines and setting flags. We used both methods to separate the annotation from the sequence in a previous section of this chapter. Both methods were equally well suited, since in GenBank files, the annotation is followed by the sequence, clearly delimited by an ORIGIN line: a simple structure. However, parsing the annotations seems a bit more complicated; therefore, let's try to use regular expressions to accomplish the task.

To begin, let's wrap the code we've been working on into some convenient subroutines to focus on parsing the annotations. You'll want to fetch GenBank records one at a time from a library (a file containing one or more GenBank records), extract the annotations and the sequence, and then if desired parse the annotations. This would be useful if, say, you were looking for some motif in a GenBank library. Then you can search for the motif, and, if found, you can parse the annotations to look for additional information about the sequence.

As mentioned previously, we'll use the file library.gb, which you can download from this book's web site.

Since dealing with annotation data is somewhat complex, let's take a minute to break our tasks into convenient subroutines. Here's the pseudocode:

sub open_file
    given the filename, return the filehandle

sub get_next_record
    given the filehandle, get the record
    (we can get the offset by first calling "tell")

sub get_annotation_and_dna
    given a record, split it into annotation and cleaned-up sequence

sub search_sequence
    given a sequence and a regular expression,
      return array of locations of hits

sub search_annotation
    given a GenBank annotation and a regular expression,
      return array of locations of hits

sub parse_annotation
    separate out the fields of the annotation in a convenient form

sub parse_features
    given the features field, separate out the components

The idea is to make a subroutine for each important task you want to accomplish and then combine them into useful programs. Some of these can be combined into other subroutines: for instance, perhaps you want to open a file and get the record from it, all in one subroutine call.

You're designing these subroutines to work with library files, that is, files with multiple GenBank records. You pass the filehandle into the subroutines as an argument, so that your subroutines can access open library files as represented by the filehandles. Doing so enables you to have a get_next_record function, which is handy in a loop. Using the Perl function tell also allows you to save the byte offset of any record of interest, and then return later and extract the record at that byte offset very quickly. (A byte offset is just the number of characters into the file where the information of interest lies.) The operating system supports Perl in letting you go immediately to any byte offset location in even huge files, thus bypassing the usual way of opening the file and reading from the beginning until you get where you want to be.

Using a byte offset is important when you're dealing with large files. Perl gives you built-in functions such as seek that allow you, on an open file, to go immediately to any location in the file. The idea is that when you find something in a file, you can save the byte offset using the Perl function tell. Then, when you want to return to that point in the file, you can just call the Perl function seek with the byte offset as an argument. You'll see this later in this chapter when you build a DBM file to look up records based on their accession numbers. But the main point is that with a 250-MB file, it takes too long to find something by searching from the beginning, and there are ways of getting around it.

The parsing of the data is done in three steps, according to the design:

  1. You'll separate out the annotation and the sequence (which you'll clean up by removing whitespace, etc., and making it a simple string of sequence). Even at this step, you can search for motifs in the sequence, as well as look for text in the annotation.

  2. Extract out the fields.

  3. Parse the features table.

These steps seem natural, and, depending on what you want to do, allow you to parse to whatever depth is needed.

Here's a main program in pseudocode that shows how to use those subroutines:

open_file 

while ( get_next_record )

    get_annotation_and_dna

    if ( search_sequence for a motif AND 
         search_annotation for chromosome 22 )

        parse_annotation

        parse_features to get sizes of exons, look for small sizes
    }
}

return accession numbers of records meeting the criteria

This example shows how to use subroutines to answer a question such as: what are the genes on chromosome 22 that contain a given motif and have small exons?

10.4.3 Main Program

Let's test these subroutines with Example 10-5, which has some subroutine definitions that will be added to the BeginPerlBioinfo.pm module:

Example 10-5. GenBank library subroutines
#!/usr/bin/perl
#  - test program of GenBank library subroutines

use strict;
use warnings;
# Don't use BeginPerlBioinfo
# Since all subroutines defined in this file
# use BeginPerlBioinfo;     # see Chapter 6 about this module

# Declare and initialize variables
my $fh; # variable to store filehandle
my $record;
my $dna;
my $annotation;
my $offset;
my $library = 'library.gb';

# Perform some standard subroutines for test
$fh = open_file($library);

$offset = tell($fh);

while( $record = get_next_record($fh) ) {

    ($annotation, $dna) = get_annotation_and_dna($record);

    if( search_sequence($dna, 'AAA[CG].')) {
        print "Sequence found in record at offset $offset\n";
    }
    if( search_annotation($annotation, 'homo sapiens')) {
        print "Annotation found in record at offset $offset\n";
    }

    $offset = tell($fh);
}

exit;

################################################################################
# Subroutines
################################################################################

# open_file
#
#   - given filename, set filehandle

sub open_file {

    my($filename) = @_;
    my $fh;

    unless(open($fh, $filename)) {
        print "Cannot open file $filename\n";
        exit;
    }
    return $fh;
}

# get_next_record
#
#   - given GenBank record, get annotation and DNA

sub get_next_record {

    my($fh) = @_;

    my($offset);
    my($record) = '';
    my($save_input_separator) = $/;

    $/ = "//\n";

    $record = <$fh>;

    $/ = $save_input_separator;

    return $record;
}

# get_annotation_and_dna
#
#   - given GenBank record, get annotation and DNA

sub get_annotation_and_dna {

    my($record) = @_;

    my($annotation) = '';
    my($dna) = '';

    # Now separate the annotation from the sequence data
    ($annotation, $dna) = ($record =~ /^(LOCUS.*ORIGIN\s*\n)(.*)\/\/\n/s);

    # clean the sequence of any whitespace or / characters 
    #  (the / has to be written \/ in the character class, because
    #   / is a metacharacter, so it must be "escaped" with \)
    $dna =~ s/[\s\/]//g;

    return($annotation, $dna)
}

# search_sequence
#
#   - search sequence with regular expression

sub search_sequence {

    my($sequence, $regularexpression) = @_;

    my(@locations) = (  );

    while( $sequence =~ /$regularexpression/ig ) {
        push( @locations, pos );
    }

    return (@locations);
}

# search_annotation
#
#   - search annotation with regular expression

sub search_annotation {

    my($annotation, $regularexpression) = @_;

    my(@locations) = (  );

    # note the /s modifier--. matches any character including newline
    while( $annotation =~ /$regularexpression/isg ) {
        push( @locations, pos );
    }

    return (@locations);
}

Example 10-5 generates the following output on our little GenBank library:

Sequence found in record at offset 0
Annotation found in record at offset 0
Sequence found in record at offset 6256
Annotation found in record at offset 6256
Sequence found in record at offset 12366
Annotation found in record at offset 12366
Sequence found in record at offset 17730
Annotation found in record at offset 17730
Sequence found in record at offset 22340
Annotation found in record at offset 22340

The tell function reports the byte offset of the file up to the point where it's been read; so you want to first call tell and then read the record to get the proper offset associated with the beginning of the record.

10.4.4 Parsing Annotations at the Top Level

Now let's parse the annotations.

There is a document from NCBI we mentioned earlier that gives the details of the structure of a GenBank record. This file is gbrel.txt and is part of the GenBank release, available at the NCBI web site and their FTP site. It's updated with every release (every two months at present), and it includes notices of changes to the format. If you program with GenBank records, you should read this document and keep a copy around for reference use, and check periodically for announced changes in the GenBank record format.

If you look back at the complete GenBank record earlier in this chapter, you'll see that the annotations have a certain structure. You have some fields, such as LOCUS, DEFINITION, ACCESSION, VERSION, KEYWORDS, SOURCE, REFERENCE, FEATURES, and BASE COUNT that start at the beginning of a line. Some of these fields have subfields, especially the FEATURES field, which has a fairly complex structure.

But for now, let's just extract the top-level fields. You will need a regular expression that matches everything from a word at the beginning of a line to a newline that just precedes another word at the beginning of a line.

Here's a regular expression that matches our definition of a field:

/^[A-Z].*\n(^\s.*\n)*/m

What does this regular expression say? First of all, it has the /m pattern matching modifier, which means the caret ^ and the dollar sign $ also match around embedded newlines (not just at the beginning and end of the entire string, which is the default behavior).

The first part of the regular expression:

^[A-Z].*\n

matches a capital letter at the beginning of a line, followed by any number of characters (except newlines), followed by a newline. That's a good description of the first lines of the fields you're trying to match.

The second part of the regular expression:

(^\s.*\n)*

matches a space or tab \s at the beginning of a line, followed by any number of characters (except newlines), followed by a newline. This is surrounded by parentheses and followed by a *, which means 0 or more such lines. This matches succeeding lines in a field, lines that start with whitespace. A field may have no extra lines of this sort or several such lines.

So, the two parts of the regular expression combined match the fields with their optional additional lines.

Example 10-6 shows a subroutine that, given the annotations section of a GenBank record stored in a scalar variable, returns a hash with keys equal to the names of the top-level fields and values equal to the contents of those fields.

Example 10-6. Parsing Genbank annotation
#!/usr/bin/perl
#  - test program for parse_annotation subroutine

use strict;
use warnings;
use BeginPerlBioinfo;     # see Chapter 6 about this module

# Declare and initialize variables
my $fh;
my $record;
my $dna;
my $annotation;
my %fields;
my $library = 'library.gb';

# Open library and read a record
$fh = open_file($library);

$record = get_next_record($fh);

# Parse the sequence and annotation
($annotation, $dna) = get_annotation_and_dna($record);

# Extract the fields of the annotation
%fields = parse_annotation($annotation);

# Print the fields
foreach my $key (keys %fields) {
    print "******** $key *********\n";
    print $fields{$key};
}

exit;

################################################################################
# Subroutine
################################################################################

# parse_annotation
#
#  given a GenBank annotation, returns a hash  with
#   keys: the field names
#   values: the fields

sub parse_annotation {

    my($annotation) = @_; 
    my(%results) = (  );

    while( $annotation =~ /^[A-Z].*\n(^\s.*\n)*/gm ) {
        my $value = $&;
        (my $key = $value) =~ s/^([A-Z]+).*/$1/s;
        $results{$key} = $value;
    }

    return %results;
}

In the subroutine parse_annotation, note how the variables $key and $value are scoped within the while block. One benefit of this is that you don't have to reinitialize the variables each time through the loop. Also note that the key is the name of the field, and the value is the whole field.

You should take the time to understand the regular expression that extracts the field name for the key:

(my $key = $value) =~ s/^([A-Z]+).*/$1/s;

This first assigns $key the value $value. It then replaces everything in $key (note the /s modifier for embedded newlines) with $1, which is a special variable pattern between the first pair of parentheses ([A-Z]+). This pattern is one or more capital letters (anchored to the beginning of the string, i.e., the field name), so it sets $key to the value of the first word in the field name.

You get the following output from Example 10-6 (the test just fetches the first record in the GenBank library):

******** SOURCE *********
SOURCE      Homo sapiens embryo male lung fibroblast cell_line:HuS-L12 cDNA to
            mRNA.
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
******** DEFINITION *********
DEFINITION  Homo sapiens PCCX1 mRNA for protein containing CXXC domain 1,
            complete cds.
******** KEYWORDS *********
KEYWORDS    .
******** VERSION *********
VERSION     AB031069.1  GI:8100074
******** FEATURES *********
FEATURES             Location/Qualifiers
     source          1..2487
                     /organism="Homo sapiens"
                     /db_xref="taxon:9606"
                     /sex="male"
                     /cell_line="HuS-L12"
                     /cell_type="lung fibroblast"
                     /dev_stage="embryo"
     gene            229..2199
                     /gene="PCCX1"
     CDS             229..2199
                     /gene="PCCX1"
                     /note="a nuclear protein carrying a PHD finger and a CXXC
                     domain"
                     /codon_start=1
                     /product="protein containing CXXC domain 1"
                     /protein_id="BAA96307.1"
                     /db_xref="GI:8100075"
                     /translation="MEGDGSDPEPPDAGEDSKSENGENAPIYCICRKPDINCFMIGCD
                     NCNEWFHGDCIRITEKMAKAIREWYCRECREKDPKLEIRYRHKKSRERDGNERDSSEP
                     RDEGGGRKRPVPDPDLQRRAGSGTGVGAMLARGSASPHKSSPQPLVATPSQHHQQQQQ
                     QIKRSARMCGECEACRRTEDCGHCDFCRDMKKFGGPNKIRQKCRLRQCQLRARESYKY
                     FPSSLSPVTPSESLPRPRRPLPTQQQPQPSQKLGRIREDEGAVASSTVKEPPEATATP
                     EPLSDEDLPLDPDLYQDFCAGAFDDHGLPWMSDTEESPFLDPALRKRAVKVKHVKRRE
                     KKSEKKKEERYKRHRQKQKHKDKWKHPERADAKDPASLPQCLGPGCVRPAQPSSKYCS
                     DDCGMKLAANRIYEILPQRIQQWQQSPCIAEEHGKKLLERIRREQQSARTRLQEMERR
                     FHELEAIILRAKQQAVREDEESNEGDSDDTDLQIFCVSCGHPINPRVALRHMERCYAK
                     YESQTSFGSMYPTRIEGATRLFCDVYNPQSKTYCKRLQVLCPEHSRDPKVPADEVCGC
                     PLVRDVFELTGDFCRLPKRQCNRHYCWEKLRRAEVDLERVRVWYKLDELFEQERNVRT
                     AMTNRAGLLALMLHQTIQHDPLTTDLRSSADR"
******** REFERENCE *********
REFERENCE   2  (bases 1 to 2487)
  AUTHORS   Fujino,T., Hasegawa,M., Shibata,S., Kishimoto,T., Imai,S. and
            Takano,T.
  TITLE     Direct Submission
  JOURNAL   Submitted (15-AUG-1999) to the DDBJ/EMBL/GenBank databases.
            Tadahiro Fujino, Keio University School of Medicine, Department of
            Microbiology; Shinanomachi 35, Shinjuku-ku, Tokyo 160-8582, Japan
            (E-mail:fujino@microb.med.keio.ac.jp,
            Tel:+81-3-3353-1211(ex.62692), Fax:+81-3-5360-1508)
******** ACCESSION *********
ACCESSION   AB031069
******** LOCUS *********
LOCUS       AB031069     2487 bp    mRNA            PRI       27-MAY-2000
******** ORIGIN *********
ORIGIN      
******** BASE *********
BASE COUNT      564 a    715 c    768 g    440 t

As you see, the method is working, and apart from the difficulty of reading the regular expressions (which will become easier with practice), the code is very straightforward, just a few short subroutines.

10.4.5 Parsing the FEATURES Table

Let's take this one step further and parse the features table to its next level, composed of the source , gene, and CDS features keys. (See later in this section for a more complete list of these features keys.) In the exercises at the end of the chapter, you'll be challenged to descend further into the FEATURES table.

To study the FEATURES table, you should first look over the NCBI gbrel.txt document mentioned previously. Then you should study the most complete documentation for the FEATURES table, available at http://www.ncbi.nlm.nih.gov/collab/FT/index.html.

10.4.5.1 Features

Although our GenBank entry is fairly simple and includes only three features, there are actually quite a few of them. Notice that the parsing code will find all of them, because it's just looking at the structure of the document, not for specific features.

The following is a list of the features defined for GenBank records. Although lengthy, I think it's important to read through it to get an idea of the range of information that may be present in a GenBank record.

allele

Obsolete; see variation feature key

attenuator

Sequence related to transcription termination

C_region

Span of the C immunological feature

CAAT_signal

CAAT box in eukaryotic promoters

CDS

Sequence coding for amino acids in protein (includes stop codon)

conflict

Independent sequence determinations differ

D-loop

Displacement loop

D_segment

Span of the D immunological feature

enhancer

Cis-acting enhancer of promoter function

exon

Region that codes for part of spliced mRNA

gene

Region that defines a functional gene, possibly including upstream (promoter, enhancer, etc.) and downstream control elements, and for which a name has been assigned

GC_signal

GC box in eukaryotic promoters

iDNA

Intervening DNA eliminated by recombination

intron

Transcribed region excised by mRNA splicing

J_region

Span of the J immunological feature

LTR

Long terminal repeat

mat_peptide

Mature peptide coding region (doesn't include stop codon)

misc_binding

Miscellaneous binding site

misc_difference

Miscellaneous difference feature

misc_feature

Region of biological significance that can't be described by any other feature

misc_recomb

Miscellaneous recombination feature

misc_RNA

Miscellaneous transcript feature not defined by other RNA keys

misc_signal

Miscellaneous signal

misc_structure

Miscellaneous DNA or RNA structure

modified_base

The indicated base is a modified nucleotide

mRNA

Messenger RNA

mutation

Obsolete: see variation feature key

N_region

Span of the N immunological feature

old_sequence

Presented sequence revises a previous version

polyA_signal

Signal for cleavage and polyadenylation

polyA_site

Site at which polyadenine is added to mRNA

precursor_RNA

Any RNA species that isn't yet the mature RNA product

prim_transcript

Primary (unprocessed) transcript

primer

Primer binding region used with PCR

primer_bind

Noncovalent primer binding site

promoter

A region involved in transcription initiation

protein_bind

Noncovalent protein binding site on DNA or RNA

RBS

Ribosome binding site

rep_origin

Replication origin for duplex DNA

repeat_region

Sequence containing repeated subsequences

repeat_unit

One repeated unit of a repeat_region

rRNA

Ribosomal RNA

S_region

Span of the S immunological feature

satellite

Satellite repeated sequence

scRNA

Small cytoplasmic RNA

sig_peptide

Signal peptide coding region

snRNA

Small nuclear RNA

source

Biological source of the sequence data represented by a GenBank record; mandatory feature, one or more per record; for organisms that have been incorporated within the NCBI taxonomy database, an associated /db_xref="taxon:NNNN" qualifier will be present (where NNNNN is the numeric identifier assigned to the organism within the NCBI taxonomy database)

stem_loop

Hairpin loop structure in DNA or RNA

STS

Sequence Tagged Site: operationally unique sequence that identifies the combination of primer spans used in a PCR assay

TATA_signal

TATA box in eukaryotic promoters

terminator

Sequence causing transcription termination

transit_peptide

Transit peptide coding region

transposon

Transposable element (TN)

tRNA

Transfer RNA

unsure

Authors are unsure about the sequence in this region

V_region

Span of the V immunological feature

variation

A related population contains stable mutation

-

Placeholder (hyphen)

-10_signal

Pribnow box in prokaryotic promoters

-35_signal

-35 box in prokaryotic promoters

3'clip

3'-most region of a precursor transcript removed in processing

3'UTR

3' untranslated region (trailer)

5'clip

5'-most region of a precursor transcript removed in processing

5'UTR

5' untranslated region (leader)

These feature keys can have their own additional features, which you'll see here and in the exercises.

10.4.5.2 Parsing

Example 10-7 finds whatever features are present and returns an array populated with them. It doesn't look for the complete list of features as presented in the last section; it finds just the features that are actually present in the GenBank record and returns them for further use.

It's often the case that there are multiple instances of the same feature in a record. For instance, there may be several exons specified in the FEATURES table of a GenBank record. For this reason we'll store the features as elements in an array, rather than in a hash keyed on the feature name (as this allows you to store, for instance, only one instance of an exon).

Example 10-7. Testing subroutine parse_features
#!/usr/bin/perl
#  - main program to test parse_features

use strict;
use warnings;
use BeginPerlBioinfo;     # see Chapter 6 about this module

# Declare and initialize variables
my $fh;
my $record;
my $dna;
my $annotation;
my %fields;
my @features;
my $library = 'library.gb';

# Get the fields from the first GenBank record in a library
$fh = open_file($library);

$record = get_next_record($fh);

($annotation, $dna) = get_annotation_and_dna($record);

%fields = parse_annotation($annotation);

# Extract the features from the FEATURES table
@features = parse_features($fields{'FEATURES'});

# Print out the features
foreach my $feature (@features) {

     # extract the name of the feature (or "feature key")
     my($featurename) = ($feature =~ /^ {5}(\S+)/);

     print "******** $featurename *********\n";
     print $feature;
}

exit;

############################################################################ 
# Subroutine
############################################################################ 

# parse_features
#
#  extract the features from the FEATURES field of a GenBank record

sub parse_features {

     my($features) = @_;   # entire FEATURES field in a scalar variable

     # Declare and initialize variables
     my(@features) = ();   # used to store the individual features

     # Extract the features
     while( $features =~ /^ {5}\S.*\n(^ {21}\S.*\n)*/gm ) {
         my $feature = $&;
     push(@features, $feature);
     }

     return @features;
}

Example 10-7 gives the output:

******** source *********
      source          1..2487
                      /organism="Homo sapiens"
                      /db_xref="taxon:9606"
                      /sex="male"
                      /cell_line="HuS-L12"
                      /cell_type="lung fibroblast"
                      /dev_stage="embryo"
******** gene *********
      gene            229..2199
                      /gene="PCCX1"
******** CDS *********
      CDS             229..2199
                      /gene="PCCX1"
                      /note="a nuclear protein carrying a PHD finger and a CXXC
                      domain"
                      /codon_start=1
                      /product="protein containing CXXC domain 1"
                      /protein_id="BAA96307.1"
                      /db_xref="GI:8100075"
                      /translation="MEGDGSDPEPPDAGEDSKSENGENAPIYCICRKPDINCFMIGCD
                      NCNEWFHGDCIRITEKMAKAIREWYCRECREKDPKLEIRYRHKKSRERDGNERDSSEP
                      RDEGGGRKRPVPDPDLQRRAGSGTGVGAMLARGSASPHKSSPQPLVATPSQHHQQQQQ
                      QIKRSARMCGECEACRRTEDCGHCDFCRDMKKFGGPNKIRQKCRLRQCQLRARESYKY
                      FPSSLSPVTPSESLPRPRRPLPTQQQPQPSQKLGRIREDEGAVASSTVKEPPEATATP
                      EPLSDEDLPLDPDLYQDFCAGAFDDHGLPWMSDTEESPFLDPALRKRAVKVKHVKRRE
                      KKSEKKKEERYKRHRQKQKHKDKWKHPERADAKDPASLPQCLGPGCVRPAQPSSKYCS
                      DDCGMKLAANRIYEILPQRIQQWQQSPCIAEEHGKKLLERIRREQQSARTRLQEMERR
                      FHELEAIILRAKQQAVREDEESNEGDSDDTDLQIFCVSCGHPINPRVALRHMERCYAK
                      YESQTSFGSMYPTRIEGATRLFCDVYNPQSKTYCKRLQVLCPEHSRDPKVPADEVCGC
                      PLVRDVFELTGDFCRLPKRQCNRHYCWEKLRRAEVDLERVRVWYKLDELFEQERNVRT
                      AMTNRAGLLALMLHQTIQHDPLTTDLRSSADR

In subroutine parse_features of Example 10-7, the regular expression that extracts the features is much like the regular expression used in Example 10-6 to parse the top level of the annotations. Let's look at the essential parsing code of Example 10-7:

     while( $features =~ /^ {5}\S.*\n(^ {21}\S.*\n)*/gm ) {

On the whole, and in brief, this regular expression finds features formatted with the first lines beginning with 5 spaces, and optional continuation lines beginning with 21 spaces.

First, note that the pattern modifier /m enables the ^ metacharacter to match after embedded newlines. Also, the {5} and {21} are quantifiers that specify there should be exactly 5, or 21, of the preceding item, which in both cases is a space.

The regular expression is in two parts, corresponding to the first line and optional continuation lines of the feature. The first part ^ {5}\S.*\n means that the beginning of a line (^) has 5 spaces ({5}), followed by a non-whitespace character (\S) followed by any number of non-newlines (.*) followed by a newline (\n). The second part of the regular expression, (^ {21}\S.*\n)* means the beginning of a line (^) has 21 spaces ({21}) followed by a non-whitespace character (\S) followed by any number of non-newlines (.*) followed by a newline (\n); and there may be 0 or more such lines, indicated by the ()* around the whole expression.

The main program has a short regular expression along similar lines to extract the feature name (also called the feature key) from the feature.

So, again, success. The FEATURES table is now decomposed or "parsed" in some detail, down to the level of separating the individual features. The next stage in parsing the FEATURES table is to extract the detailed information for each feature. This includes the location (given on the same line as the feature name, and possibly on additional lines); and the qualifiers indicated by a slash, a qualifier name, and if applicable, an equals sign and additional information of various kinds, possibly continued on additional lines.

I'll leave this final step for the exercises. It's a fairly straightforward extension of the approach we've been using to parse the features. You will want to consult the documentation from the NCBI web site for complete details about the structure of the FEATURES table before trying to parse the location and qualifiers from a feature.

The method I've used to parse the FEATURES table maintains the structure of the information. However, sometimes you just want to see if some word such as "endonulease" appears anywhere in the record. For this, recall that you created a search_annotation subroutine in Example 10-5 that searches for any regular expression in the entire annotation; very often, this is all you really need. As you've now seen, however, when you really need to dig into the FEATURES table, Perl has its own features that make the job possible and even fairly easy.

< BACKCONTINUE >

Index terms contained in this section

annotations, GenBank files
      parsing
            at top-level
            FEATURES table
            using arrays
            using regular expressions
arrays
      GenBank annotations, parsing with
byte offsets
      reporting with tell function
CDS feature key
DEFINITION field, collecting in GenBank annotations
FEATURES table (GenBank annotations)
     feature keys
            listing with definitions
            parsing
fields, GenBank records
flags, use of
GenBank (Genetic Sequence Data Bank)
     annotations
            parsing
            parsing at top level
            parsing FEATURES table
      library subroutines, testing
gene features key
get_next_record function
offsets
opening files
      GenBank
parsing
      GenBank file annotations 2nd
            FEATURES table
            using arrays
            using regular expressions
patterns (and regular expressions)
      GenBank annotations, parsing 2nd
records, GenBank files
      fields
scoping
      variables within while loop
search_annotation subroutine
seek function
source feature key
subroutines
      GenBank library
            testing
      search_annotation
tell function 2nd
variables
     scoping
            in while loop
web sites
      FEATURES table

© 2002, O'Reilly & Associates, Inc.