10.2
GenBank Libraries
GenBank
is distributed as a set of libraries—flat files containing many
records in succession.[2]
As of GenBank release 125.0,
August 2001, there are 243 files, most of which are over 200 MB in
size. Altogether, GenBank contains 12,813,516 loci and 13,543,364,296
bases from 12,813,516 reported sequences. The libraries are usually
distributed compressed, which means you can download somewhat smaller
files, but you need to uncompress them after you receive them.
Uncompressed, this amounts to about 50 GB of data. Since 1982, the
number of sequences in GenBank has doubled about every 14 months.
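The doubling figure can be turned into a rough projection. The sketch below assumes that growth stays strictly exponential with a 14-month doubling time, which is an idealization; the function name `projected_bases` is made up for illustration.

```python
# Rough projection of GenBank size, assuming (as an idealization) that the
# 14-month doubling time quoted in the text holds exactly.

BASES_AUG_2001 = 13_543_364_296   # bases in release 125.0, from the text
DOUBLING_MONTHS = 14              # doubling time quoted in the text


def projected_bases(months_ahead, start=BASES_AUG_2001, doubling=DOUBLING_MONTHS):
    """Return the projected base count `months_ahead` months after the start."""
    return start * 2 ** (months_ahead / doubling)


if __name__ == "__main__":
    for months in (14, 28, 60):
        print(f"after {months:2d} months: about {projected_bases(months):.3g} bases")
```

Under this assumption the database quadruples in 28 months, so disk space that is generous today will not stay generous for long.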
GenBank libraries are further organized into
divisions by the classification of the sequences they contain, either
phylogenetically or by sequencing technology. Here are the divisions:
- PRI: primate sequences
- ROD: rodent sequences
- MAM: other mammalian sequences
- VRT: other vertebrate sequences
- INV: invertebrate sequences
- PLN: plant, fungal, and algal sequences
- BCT: bacterial sequences
- VRL: viral sequences
- PHG: bacteriophage sequences
- SYN: synthetic and chimeric sequences
- UNA: unannotated sequences
- EST: EST sequences (expressed sequence tags)
- PAT: patent sequences
- STS: STS sequences (sequence tagged sites)
- GSS: GSS sequences (genome survey sequences)
- HTG: HTGS sequences (high throughput genomic sequencing data)
- HTC: HTC sequences (high throughput cDNA sequencing data)
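The division codes above lend themselves to a simple lookup table. The following is a minimal sketch that maps a library filename to its division. It assumes the files are named `gb` plus the lowercase division code plus an optional number plus `.seq` (for example `gbpri1.seq`); that naming pattern and the function name `division_of` are assumptions for illustration, not something stated in this section.

```python
import re

# Division codes and descriptions, copied from the list above.
DIVISIONS = {
    "PRI": "primate sequences",
    "ROD": "rodent sequences",
    "MAM": "other mammalian sequences",
    "VRT": "other vertebrate sequences",
    "INV": "invertebrate sequences",
    "PLN": "plant, fungal, and algal sequences",
    "BCT": "bacterial sequences",
    "VRL": "viral sequences",
    "PHG": "bacteriophage sequences",
    "SYN": "synthetic and chimeric sequences",
    "UNA": "unannotated sequences",
    "EST": "EST sequences (expressed sequence tags)",
    "PAT": "patent sequences",
    "STS": "STS sequences (sequence tagged sites)",
    "GSS": "GSS sequences (genome survey sequences)",
    "HTG": "HTGS sequences (high throughput genomic sequencing data)",
    "HTC": "HTC sequences (high throughput cDNA sequencing data)",
}

# ASSUMPTION: filenames look like "gbpri1.seq" or "gbest123.seq.gz".
_LIBRARY_NAME = re.compile(r"^gb([a-z]{3})(\d*)\.seq(\.gz)?$")


def division_of(filename):
    """Return (code, description) for a library filename, or None if it does not match."""
    match = _LIBRARY_NAME.match(filename.lower())
    if match is None:
        return None
    code = match.group(1).upper()
    if code not in DIVISIONS:
        return None
    return code, DIVISIONS[code]
```

A lookup like this makes it easy to skip whole divisions, such as EST, when deciding which files to fetch.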
Some divisions are very large: the largest, the
EST (expressed sequence
tag) division, comprises 123 library files! A
portion of human DNA is stored in the PRI
division, which contains (as of this writing) 13 library files, for a
total of almost 3.5 GB of data. Human data is also stored in the STS,
GSS, HTG, and HTC divisions. Human data alone in GenBank makes up
almost 5 million record entries with over 8 billion bases of
sequence.
The
public database
servers such as Entrez or BLAST at
http://www.ncbi.nlm.nih.gov/ give
you access to well-maintained and updated sequence data and programs,
but many researchers find that they need to write their own programs
to manipulate and analyze the data. The problem is, there's so
much data. For many purposes, you can download a selected set of
records from NCBI or other locations, but sometimes you need the
whole dataset.
It's possible to set up a desktop workstation (Windows, Mac,
Unix, or Linux) that contains all of
GenBank; just be sure
to buy a very large hard disk! Getting all that data onto your hard
drive, however, is more difficult. A Perl program called
mirror.pl helps to address this need.
Downloading GenBank, even over a university-standard, high-speed Internet
connection, can be time-consuming; downloading an entire dataset with
a modem can be an exercise in frustration. The best solution is to
download only the files you need, in compressed form. The EST data,
for example, is about half the entire database; don't download
it unless you really need to. If you need to download GenBank, I
recommend contacting the help desk at NCBI. They'll help you
get the most up-to-date information.
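To get a feel for the download times involved, here is a back-of-the-envelope calculation. The link speeds (a 56 kbit/s modem and a 10 Mbit/s campus connection) are assumptions chosen for illustration, and the estimate uses the 50 GB uncompressed figure and ignores protocol overhead, so treat the results as rough orders of magnitude only.

```python
def transfer_hours(size_gb, link_mbit_per_s):
    """Hours needed to move size_gb gigabytes (10**9 bytes each) over a link.

    Ignores protocol overhead, retries, and server-side limits.
    """
    bits = size_gb * 1e9 * 8
    seconds = bits / (link_mbit_per_s * 1e6)
    return seconds / 3600


if __name__ == "__main__":
    # ASSUMPTION: illustrative link speeds, not measurements.
    for label, speed in (("56 kbit/s modem", 0.056), ("10 Mbit/s link", 10.0)):
        hours = transfer_hours(50, speed)
        print(f"{label}: about {hours:,.0f} hours ({hours / 24:.1f} days)")
```

Under these assumptions the full dataset takes about 11 hours on the faster link and roughly 83 days over the modem, which is why downloading only the compressed files you need matters so much.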
Since you're learning to program, it makes more sense to
practice on a tiny, five-record library file, but the programs
you'll write will work just fine on the real files.
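As a starting point for that kind of practice, here is a minimal sketch of stepping through a library file one record at a time. It assumes each record ends with a line containing only `//`, which is not spelled out in this section; the function name `read_records` and the file name `library.gb` are placeholders.

```python
import gzip


def read_records(path):
    """Yield each record in a library file as one string.

    ASSUMPTION: every record ends with a line containing only "//".
    Files whose names end in ".gz" are read without uncompressing them first.
    Any header text before the first record is returned as part of that record,
    and trailing text with no closing "//" is dropped.
    """
    opener = gzip.open if path.endswith(".gz") else open
    lines = []
    with opener(path, "rt") as handle:
        for line in handle:
            lines.append(line)
            if line.rstrip() == "//":
                yield "".join(lines)
                lines = []


if __name__ == "__main__":
    # "library.gb" is a placeholder name for a small practice file.
    count = sum(1 for _ in read_records("library.gb"))
    print(f"{count} records")
```

Because the function yields one record at a time, memory use stays small whether the file holds five records or several hundred megabytes of them.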