Safari | Beginning Perl for Bioinformatics -> 10.5 Indexing GenBank with DBM

10.5 Indexing GenBank with DBM

DBM stands for Database Management. Perl provides a set of built-in functions that give Perl programmers access to DBM files.

10.5.1 DBM Essentials

When you open a DBM file, you access it like a hash: you give it keys and it returns values, and you can add and delete key-value pairs. What's useful about DBM is that it saves the key-value data in a permanent disk file on your computer. It can thus save information between the times you run your program; it can also serve as a way to share information between different programs that need the same data. A DBM file can get very big without killing the main memory on your computer and making your program—and everything else—slow to a crawl.
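
The persistence described above can be seen in a minimal sketch. It assumes a Unix-like system and a standard Perl installation. The basename demo_db, the hashes %first_run and %second_run, and the stored value are hypothetical, chosen only for illustration; a temporary directory stands in for wherever you would keep a real DBM file.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Temporary directory for the illustration; removed when the program exits
my $dir    = tempdir(CLEANUP => 1);
my $dbname = "$dir/demo_db";      # hypothetical DBM basename

# "First run": tie a hash to the DBM file, store one pair, and close
my %first_run;
dbmopen(%first_run, $dbname, 0644)
    or die "Cannot open DBM file $dbname: $!\n";
$first_run{'AB031069'} = 12345;
dbmclose(%first_run);

# "Second run": a different hash tied to the same file sees the stored pair
my %second_run;
dbmopen(%second_run, $dbname, 0644)
    or die "Cannot open DBM file $dbname: $!\n";
my $stored = $second_run{'AB031069'};
dbmclose(%second_run);

print "Value read back from disk: $stored\n";
```

Two separate programs sharing the same DBM file would behave like the two halves of this sketch.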

Two functions manage the connection between a hash and a DBM file: dbmopen "ties" a hash to the file, and dbmclose breaks the tie; in between, you just use the hash. As you've seen, with a hash, lookups are easy, as are assignments. You can get a list of all the keys from a hash called %my_hash by typing keys %my_hash, and a list of all values by typing values %my_hash. For large DBM files, you may not want to do this, because each list is built in memory all at once; the Perl function each lets you read key-value pairs one at a time, thus saving the memory of your running program. There is also a delete function to remove the definitions of keys:

delete $my_hash{'DNA'}

entirely removes that key from the hash.
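
As a rough illustration of each and delete on a DBM-tied hash, here is a small sketch. It assumes a standard Perl installation on a Unix-like system; the basename each_demo and the three sample entries are made up for the example.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $dir = tempdir(CLEANUP => 1);   # scratch directory, removed at exit
my %my_hash;
dbmopen(%my_hash, "$dir/each_demo", 0644)
    or die "Cannot open DBM file: $!\n";

# Sample data, invented for this sketch
$my_hash{'DNA'}     = 'ACGT';
$my_hash{'RNA'}     = 'ACGU';
$my_hash{'protein'} = 'MKV';

# Walk the pairs one at a time instead of building the full key list
my $count = 0;
while ( my ($key, $value) = each %my_hash ) {
    print "$key => $value\n";
    $count++;
}

# Remove one key; a later lookup of that key returns undef
delete $my_hash{'DNA'};
my $after_delete = $my_hash{'DNA'};

dbmclose(%my_hash);
```

The order in which each returns the pairs is not something to rely on; it depends on the DBM implementation.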

DBM files are a very simple database. They don't have the power of a relational database such as MySQL, Oracle, or PostgreSQL; however, it's remarkable how often a simple database is all that a problem really needs. When you have a set of key-value data (or several such sets), consider using DBM. It's certainly easy to use with Perl.

The main wrinkle to using DBM is that there are several slightly different DBM implementations: NDBM, GDBM, SDBM, and Berkeley DB. The differences are small but real; for most purposes, though, the implementations are interchangeable. Which one you get depends on how your Perl was built; many Perl installations include Berkeley DB, and it's easy to get it and install it for your Perl if you want. If you don't have really long keys or values, the choice of implementation is rarely a problem. Some older DBMs require you to add null bytes to keys and delete them from values:

$value = $my_hash{"$key\0"};
chop $value;

Chances are good that you won't have to do that. Berkeley DB handles long strings well (some of the other DBM implementations have limits), and because you have some potentially long strings in biology, I recommend installing Berkeley DB if you don't have it.
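
If you are curious which implementation your own Perl picks, the following sketch may help. It relies on the core AnyDBM_File module, which tries the installed DBM modules in a fixed order and keeps the first one that loads; on typical installations dbmopen goes through this same module. The variable name $implementation is just for illustration.

```perl
use strict;
use warnings;
use AnyDBM_File;

# After AnyDBM_File loads, its @ISA holds the one implementation it found
my $implementation = $AnyDBM_File::ISA[0];

print "DBM implementation selected by AnyDBM_File: $implementation\n";
```

The name printed will be one of NDBM_File, DB_File (Berkeley DB), GDBM_File, SDBM_File, or ODBM_File, depending on your system.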

10.5.2 A DBM Database for GenBank

You've seen how to extract information from a GenBank record or from a library of GenBank records. You've just seen how DBM files can save your hash data on your hard disk between program runs. You've also seen the use of tell and seek to quickly access a location in a file.

Now let's combine the three ideas and use DBM to build a database of information about a GenBank library. It'll be something simple: you'll extract the accession numbers for the keys and store the byte offsets in the GenBank library of records for the values. You'll add some code that, given a library and an offset, returns the record at that offset, and write a main program that allows the user to interactively request GenBank records by accession number. When complete, your program should very quickly return a GenBank record if given its accession number.
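
Before bringing DBM into it, the offset idea can be sketched on its own with an ordinary hash. This is only an illustration under stated assumptions: the subroutine get_record_at_offset, the hash %offset_of, and the two tiny records (FAKE0001 and FAKE0002) are all hypothetical, and each record is assumed to end with a line containing only //.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Return the record that starts at byte $offset in the open library $fh.
# Assumes each record ends with a line containing only "//".
sub get_record_at_offset {
    my ($fh, $offset) = @_;
    local $/ = "//\n";                 # read up to the end-of-record line
    seek($fh, $offset, 0) or return;   # 0 means "from the start of the file"
    return scalar <$fh>;
}

# Build a tiny two-record library (made-up data) in a temporary file
my ($out, $library) = tempfile(UNLINK => 1);
print $out "LOCUS       FAKE0001\nACCESSION   FAKE0001\nORIGIN\n  1 acgt\n//\n";
print $out "LOCUS       FAKE0002\nACCESSION   FAKE0002\nORIGIN\n  1 ttga\n//\n";
close $out;

# Index the library: accession number => byte offset of its record
open(my $fh, '<', $library) or die "Cannot open $library: $!\n";
my %offset_of;
{
    local $/ = "//\n";
    my $offset = tell($fh);
    while ( my $rec = <$fh> ) {
        my ($accession) = $rec =~ /^ACCESSION\s+(\S+)/m;
        $offset_of{$accession} = $offset if defined $accession;
        $offset = tell($fh);           # where the next record begins
    }
}

# Jump straight to the second record without rereading the first
my $record = get_record_at_offset($fh, $offset_of{'FAKE0002'});
print $record;
close $fh;
```

Note that the first record sits at offset 0, which is a perfectly valid offset even though Perl treats 0 as false. Replacing %offset_of with a DBM-tied hash is what makes the index survive between runs.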

This general idea is extended in the exercises at the end of the chapter to a considerable extent; you may want to glance ahead at them now to get an idea of the potential power of the technique I'm about to present.

With just the appropriate amount of further ado, here is a code fragment that opens (creating if necessary) a DBM file:

unless(dbmopen(%my_hash, 'DBNAME', 0644)) {

    print "Cannot open DBM file DBNAME with mode 0644\n";
    exit;
}

%my_hash is like any other hash in Perl, but it will be tied to the DBM file with this statement. DBNAME is the basename of the actual DBM files that will be created. Some DBM versions create one file of exactly that name; others create two files with file extensions .dir and .pag.

The third parameter is called the mode. Unix or Linux users will be familiar with file permissions in this form, written as octal numbers (hence the leading zero). Many possibilities exist; here are the most common ones:

0644: You can read and write; others can just read.
0600: Only you can read or write.
0666: Anyone can read or write.
0444: Anyone can read (nobody can write).
0400: Only you can read (nobody else can do anything).

The dbmopen call fails if you try to open a file with a mode that assumes there are more permissions than were conferred on the DBM file when it was created. Usually, the mode 0644 is declared by the owner if only the owner should be allowed to write, and 0444 is declared by readers. Mode 0666 is declared by the owner and others if the file is meant to be read or written by anyone.
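
To see what a mode actually does, here is a hedged sketch that creates a DBM file with mode 0600 and then reports the permissions of the file or files that result. It assumes a Unix-like system, where the process umask can remove permissions from the requested mode but never add them; the basename private_db is hypothetical.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $dir = tempdir(CLEANUP => 1);   # scratch directory, removed at exit
my %private;

# The leading 0 makes the mode an octal number: 0600, not decimal 600
dbmopen(%private, "$dir/private_db", 0600)
    or die "Cannot open DBM file with mode 0600: $!\n";
$private{'key'} = 'value';
dbmclose(%private);

# Find whatever file(s) this DBM implementation created for the basename
opendir(my $dh, $dir) or die "Cannot read $dir: $!\n";
my @created = grep { /^private_db/ } readdir $dh;
closedir $dh;

# Report the permission bits of each file
my @modes;
for my $file (@created) {
    my $mode = (stat("$dir/$file"))[2] & 07777;   # keep only permission bits
    push @modes, $mode;
    printf "%s has mode %04o\n", $file, $mode;
}
```

Depending on the implementation, you may see one file or a .dir and .pag pair, each without any permissions for group or others.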

That's pretty much it; DBM files are that simple. Example 10-8 builds a DBM file that stores key-value pairs, with the accession numbers of GenBank records as keys and the byte offsets of the records as values, and then uses it to look up records interactively.

Example 10-8. A DBM index of a GenBank library

#!/usr/bin/perl
#  - make a DBM index of a GenBank library,
#     and demonstrate its use interactively

use strict;
use warnings;
use BeginPerlBioinfo;     # see Chapter 6 about this module

# Declare and initialize variables
my $fh;
my $record;
my $dna;
my $annotation;
my %fields;
my %dbm;
my $answer;
my $offset;
my $library = 'library.gb';

# open DBM file, creating if necessary
unless(dbmopen(%dbm, 'GB', 0644)) {
    print "Cannot open DBM file GB with mode 0644\n";
    exit;
}

# Parse GenBank library, saving accession number and offset in DBM file
$fh = open_file($library);

$offset = tell($fh);

while ( $record = get_next_record($fh) ) {

    # Get accession field for this record.
    ($annotation, $dna) = get_annotation_and_dna($record);

    %fields = parse_annotation($annotation);

    my $accession = $fields{'ACCESSION'};

    # extract just the accession number from the accession field
    # --remove any trailing spaces
    $accession =~ s/^ACCESSION\s*//;

    $accession =~ s/\s*$//;

    # store the key/value of  accession/offset
    $dbm{$accession} = $offset;

    # get offset for next record
    $offset = tell($fh);
}

# Now interactively query the DBM database with accession numbers
#  to see associated records

print "Here are the available accession numbers:\n";

print join ( "\n", keys %dbm ), "\n";

print "Enter accession number (or quit): ";

while( $answer = <STDIN> ) {
    chomp $answer;
    if($answer =~ /^\s*q/) {
        last;
    }
    $offset = $dbm{$answer};

    # an offset of 0 is valid (the first record), so test for defined
    if (defined $offset) {
        seek($fh, $offset, 0);
        $record = get_next_record($fh);
        print $record;
    }else{
        print "Do not have an entry for accession number $answer\n";
    }

    print "\nEnter accession number (or quit): ";
}

dbmclose(%dbm);

close($fh);

exit;

Here's the truncated output of Example 10-8:

Here are the available accession numbers:
XM_006271
NM_021964
XM_009873
AB031069
XM_006269
Enter accession number (or quit): NM_021964
LOCUS       NM_021964    3032 bp    mRNA            PRI       14-MAR-2001
DEFINITION  Homo sapiens zinc finger protein 148 (pHZ-52) (ZNF148), mRNA.
...
//

Enter accession number (or quit): q
