10.5 Indexing GenBank with DBM
DBM stands for Database Management.
Perl provides a set of built-in functions that give Perl programmers
access to DBM files.
10.5.1 DBM Essentials
When you open a DBM file, you access it like a hash: you give it keys
and it returns values, and you can add and delete key-value pairs.
What's useful about DBM is that it saves the key-value data in
a permanent disk file on your computer. It can thus save information
between the times you run your program; it can also serve as a way to
share information between different programs that need the same data.
A DBM file can get very big without killing the main memory on your
computer and making your program—and everything else—slow
to a crawl.
There are two functions, dbmopen and dbmclose, that "tie" a
hash to a DBM file; then you just use the
hash. As you've seen, with a hash, lookups are easy, as are
definitions. You can get a list of all the keys from a hash called
%my_hash by typing keys
%my_hash. You then can get a list of all values by
typing values %my_hash. For large DBM files, you
may not want to do this; the Perl function each
allows you to read key-value pairs one at a time, thus saving the
memory of your running program. There is also a
delete function to remove the definitions of
keys:
delete $my_hash{'DNA'}
entirely removes that key from the hash.
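The operations just described can be sketched in a few lines. This is only an illustration under stated assumptions: the basename demo, the scratch directory, and the sample keys and values are made up for the example, and the two dbmopen calls stand in for two separate runs of a program.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Work in a scratch directory so the sketch leaves no files behind.
my $dir    = tempdir(CLEANUP => 1);
my $dbname = "$dir/demo";    # hypothetical basename for the DBM file(s)

# First "run": tie a hash to the DBM file and store some pairs.
my %my_hash;
dbmopen(%my_hash, $dbname, 0644) or die "Cannot open DBM file $dbname: $!";
$my_hash{'DNA'}     = 'ACGT';
$my_hash{'RNA'}     = 'ACGU';
$my_hash{'protein'} = 'MKV';
dbmclose(%my_hash);

# Second "run": a different hash tied to the same file sees the saved data.
my %again;
dbmopen(%again, $dbname, 0644) or die "Cannot open DBM file $dbname: $!";

# Remove one key, then walk the remaining pairs one at a time with each.
delete $again{'DNA'};
my %seen;
while ( my ($key, $value) = each %again ) {
    $seen{$key} = $value;
    print "$key => $value\n";
}
dbmclose(%again);
```

The order in which each returns the pairs depends on the DBM implementation, so don't rely on it.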
DBM files are a very simple database. They don't have the power
of a relational database such as MySQL, Oracle, or PostgreSQL;
however, it's
remarkable how often a simple database is all that a problem really
needs. When you have a set of key-value data (or several such sets),
consider using DBM. It's certainly easy to use with Perl.
The main wrinkle to using DBM is that there are several slightly
different DBM implementations: NDBM, GDBM, SDBM, and Berkeley DB. The
differences are small but real; for most purposes, though, the
implementations are interchangeable. Which one dbmopen uses depends on
how your Perl was built: SDBM comes bundled with Perl, so it is always
available, and Berkeley DB is easy to get and install for your Perl if
you want it. If you don't have really long keys or values, the choice
rarely matters. Some older DBMs require you to add null
bytes to keys and delete them from values:
$value = $my_hash{"$key\0"};
chop $value;
Chances are good that you won't have to do that. Berkeley DB
handles long strings well (some of the
other DBM implementations have limits), and because you have some
potentially long strings in biology, I recommend installing Berkeley
DB if you don't have it.
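If you want to see which implementation your Perl picks, or to state a preference, the standard AnyDBM_File module is one way to do it. The following is a minimal sketch, assuming you prefer Berkeley DB (the DB_File module) and are willing to fall back to SDBM; the preference list, the basename demo, and the sample data are choices made for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# State a preference order before AnyDBM_File loads. It keeps the first
# module in the list that is actually installed: Berkeley DB if present,
# otherwise SDBM.
BEGIN { @AnyDBM_File::ISA = qw(DB_File SDBM_File) }
use AnyDBM_File;

# After loading, the list holds only the module that was chosen.
my $chosen = $AnyDBM_File::ISA[0];
print "DBM implementation in use: $chosen\n";

# dbmopen now uses the chosen implementation.
my $dir = tempdir(CLEANUP => 1);
my %my_hash;
dbmopen(%my_hash, "$dir/demo", 0644) or die "Cannot open DBM file: $!";
$my_hash{'AB031069'} = 'ACGT' x 50;    # a made-up 200-character value
my $length = length $my_hash{'AB031069'};
print "Stored and retrieved $length characters\n";
dbmclose(%my_hash);
```

The value here is kept short on purpose so that it fits whichever implementation is chosen.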
10.5.2
A DBM Database for GenBank
You've seen how to extract information from a
GenBank record or from a library of GenBank records. You've
just seen how DBM files can save your hash data on your hard disk
between program runs. You've also seen the use of
tell and seek to quickly access
a location in a file.
Now let's combine the three ideas and use DBM to build a
database of information about a GenBank library. It'll be
something simple: you'll extract the accession numbers for the
keys and store the byte offsets in the GenBank library of records for
the values. You'll add some code that, given a library and an
offset, returns the record at that offset, and write a main program
that allows the user to interactively request GenBank records by
accession number. When complete, your program should very quickly
return a GenBank record if given its accession number.
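Before bringing DBM into it, here is a rough sketch of the core idea: record where each record starts with tell, then jump back with seek. Everything in it is an assumption made for illustration: the two fake records, the accession numbers AAA001 and BBB002, and the helper record_at_offset are invented, and an ordinary in-memory hash stands in for the DBM file.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Build a tiny stand-in "library": two fake records, each ending with "//".
my ($out, $library) = tempfile(UNLINK => 1);
print $out "ACCESSION   AAA001\nORIGIN\nacgt\n//\n";
print $out "ACCESSION   BBB002\nORIGIN\nttga\n//\n";
close $out;

# Given an open filehandle and a byte offset, return the record found there.
sub record_at_offset {
    my ($fh, $offset) = @_;
    local $/ = "//\n";          # read up to the end-of-record line
    seek($fh, $offset, 0) or die "Cannot seek to $offset: $!";
    return scalar(<$fh>);
}

# Pass 1: remember where each record starts, keyed by accession number.
open(my $fh, '<', $library) or die "Cannot open $library: $!";
my %offset_of;
{
    local $/ = "//\n";
    my $offset = tell($fh);
    while ( my $record = <$fh> ) {
        my ($accession) = $record =~ /^ACCESSION\s+(\S+)/m;
        $offset_of{$accession} = $offset if defined $accession;
        $offset = tell($fh);
    }
}

# Pass 2: jump straight to a record. The first record sits at offset 0,
# which is false in Perl, so test with exists rather than for truth.
my $wanted = 'AAA001';
my $found  = exists $offset_of{$wanted}
           ? record_at_offset($fh, $offset_of{$wanted})
           : undef;
print $found if defined $found;
close $fh;
```

Only the accession numbers and offsets are held in the hash; the records themselves stay on disk until they are asked for.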
The exercises at the end of the chapter extend this general idea
considerably; you may want to glance ahead at
them now to get an idea of the potential power of the technique
I'm about to present.
With just the appropriate amount of further ado, here is a code
fragment that opens (creating if necessary) a DBM file:
unless(dbmopen(%my_hash, 'DBNAME', 0644)) {
    print "Cannot open DBM file DBNAME with mode 0644\n";
    exit;
}
%my_hash is like any other hash in Perl, but it
will be tied to the DBM file with this statement.
DBNAME is the basename of the actual DBM files
that will be created. Some DBM versions create one file of exactly
that name; others create two files with file extensions
.dir and .pag.
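To find out what your own DBM implementation creates on disk, you can open a DBM file in an empty directory and list what appears. This is a small sketch under that assumption; the scratch directory and the sample key are made up for the example, and the files you see depend on which DBM your Perl uses.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Start from an empty scratch directory.
my $dir = tempdir(CLEANUP => 1);

# Create the DBM file(s) and store one pair.
my %my_hash;
dbmopen(%my_hash, "$dir/DBNAME", 0644)
    or die "Cannot open DBM file DBNAME with mode 0644: $!";
$my_hash{'key'} = 'value';
dbmclose(%my_hash);

# List every file in the directory whose name starts with the basename.
opendir(my $dh, $dir) or die "Cannot read directory $dir: $!";
my @files = sort grep { /^DBNAME/ } readdir($dh);
closedir($dh);

print "$_\n" for @files;
```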
The third argument is called the mode. Unix or
Linux users will be familiar with file permissions in
this form. Many possibilities exist; here are the most common ones:
0644: You can read and write; others can just read.
0600: Only you can read or write.
0666: Anyone can read or write.
0444: Anyone can read (nobody can write).
0400: Only you can read (nobody else can do anything).
The dbmopen call fails if you try to open a file
with a mode that assumes there are more permissions than were
conferred on the DBM file when it was created. Usually, the mode 0644
is declared by the owner if only the owner should be allowed to
write, and 0444 is declared by readers. Mode 0666 is declared by the
owner and others if the file is meant to be read or written by
anyone.
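If the numbers look cryptic, it may help to decode them. The leading 0 makes each mode an octal literal in Perl, and each of the three digits that follow holds the read (4), write (2), and execute (1) bits for the owner, the group, and everyone else. Here is a small sketch that spells the modes out; the helper mode_string is invented for this illustration and is not part of Perl or of the book's module.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Turn a numeric mode into the rwx string that "ls -l" shows.
sub mode_string {
    my ($mode) = @_;
    my $string = '';
    for my $shift (6, 3, 0) {              # owner, group, others
        my $bits = ($mode >> $shift) & 07;
        $string .= ($bits & 4 ? 'r' : '-')
                 . ($bits & 2 ? 'w' : '-')
                 . ($bits & 1 ? 'x' : '-');
    }
    return $string;
}

# Print each mode from the list in octal next to its decoded form.
for my $mode (0644, 0600, 0666, 0444, 0400) {
    printf "%04o  %s\n", $mode, mode_string($mode);
}
```

Leaving off the leading 0 is an easy mistake to make: 644 is a decimal number and means a different set of permissions than 0644.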
That's pretty much it; DBM files are that simple. Example 10-8 builds a DBM file
whose keys are the accession numbers of GenBank records and whose
values are the byte offsets of those records in the library; it then
uses the file to look up records interactively.
Example 10-8. A DBM index of a GenBank library
#!/usr/bin/perl
# - make a DBM index of a GenBank library,
#     and demonstrate its use interactively

use strict;
use warnings;
use BeginPerlBioinfo;     # see Chapter 6 about this module

# Declare and initialize variables
my $fh;
my $record;
my $dna;
my $annotation;
my %fields;
my %dbm;
my $answer;
my $offset;
my $library = 'library.gb';

# open DBM file, creating if necessary
unless(dbmopen(%dbm, 'GB', 0644)) {
    print "Cannot open DBM file GB with mode 0644\n";
    exit;
}

# Parse GenBank library, saving accession number and offset in DBM file
$fh = open_file($library);

$offset = tell($fh);

while ( $record = get_next_record($fh) ) {

    # Get accession field for this record.
    ($annotation, $dna) = get_annotation_and_dna($record);

    %fields = parse_annotation($annotation);

    my $accession = $fields{'ACCESSION'};

    # extract just the accession number from the accession field
    # --remove the field name and any trailing spaces
    $accession =~ s/^ACCESSION\s*//;

    $accession =~ s/\s*$//;

    # store the key/value of accession/offset
    $dbm{$accession} = $offset;

    # get offset for next record
    $offset = tell($fh);
}

# Now interactively query the DBM database with accession numbers
#  to see associated records

print "Here are the available accession numbers:\n";

print join ( "\n", keys %dbm ), "\n";

print "Enter accession number (or quit): ";

while( $answer = <STDIN> ) {
    chomp $answer;
    if($answer =~ /^\s*q/) {
        last;
    }
    $offset = $dbm{$answer};

    # Test with defined, not truth: the first record in the
    # library is stored at offset 0, which is a valid value
    if (defined $offset) {
        seek($fh, $offset, 0);
        $record = get_next_record($fh);
        print $record;
    }else{
        print "Do not have an entry for accession number $answer\n";
    }

    print "\nEnter accession number (or quit): ";
}

dbmclose(%dbm);

close($fh);

exit;
Here's the truncated output of Example 10-8:
Here are the available accession numbers:
XM_006271
NM_021964
XM_009873
AB031069
XM_006269
Enter accession number (or quit): NM_021964
LOCUS NM_021964 3032 bp mRNA PRI 14-MAR-2001
DEFINITION Homo sapiens zinc finger protein 148 (pHZ-52) (ZNF148), mRNA.
...
//
Enter accession number (or quit): q