Safari | Beginning Perl for Bioinformatics -> 10.2 GenBank Libraries

10.2 GenBank Libraries

GenBank is distributed as a set of libraries: flat files containing many records in succession.^[2] As of GenBank release 125.0 (August 2001), there are 243 files, most of which are over 200 MB in size. Altogether, GenBank contains 12,813,516 loci and 13,543,364,296 bases from 12,813,516 reported sequences. The libraries are usually distributed compressed, which means you can download somewhat smaller files, but you need to uncompress them after you receive them. Uncompressed, the libraries amount to about 50 GB of data. Since 1982, the number of sequences in GenBank has doubled about every 14 months.

^[2] The data is also distributed in the ASN.1 format.
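
To get a feel for what doubling every 14 months means, here is a minimal sketch that projects the sequence count forward from the release 125.0 figure. The subroutine name project_growth is made up for this illustration, and the projection simply assumes the historical rate continues, which it may not.

```perl
use strict;
use warnings;

# Project a count forward, assuming it doubles every $doubling_months months.
# (project_growth is a hypothetical helper, not part of any library.)
sub project_growth {
    my ($start, $months, $doubling_months) = @_;
    return $start * 2 ** ($months / $doubling_months);
}

my $sequences_2001 = 12_813_516;   # reported sequences in release 125.0
my $doubling       = 14;           # months; the approximate rate cited above

for my $years (1, 2, 5) {
    my $estimate = project_growth($sequences_2001, 12 * $years, $doubling);
    printf "After %d year(s): about %.0f sequences\n", $years, $estimate;
}
```

The same formula works for any starting figure; only the assumption of a constant doubling time is doing the work.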

GenBank libraries are further organized into divisions by the classification of the sequences they contain, either phylogenetically or by sequencing technology. Here are the divisions:

PRI: primate sequences
ROD: rodent sequences
MAM: other mammalian sequences
VRT: other vertebrate sequences
INV: invertebrate sequences
PLN: plant, fungal, and algal sequences
BCT: bacterial sequences
VRL: viral sequences
PHG: bacteriophage sequences
SYN: synthetic and chimeric sequences
UNA: unannotated sequences
EST: EST sequences (expressed sequence tags)
PAT: patent sequences
STS: STS sequences (sequence tagged sites)
GSS: GSS sequences (genome survey sequences)
HTG: HTGS sequences (high throughput genomic sequencing data)
HTC: HTC sequences (high throughput cDNA sequencing data)
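
In a program, the division list above fits naturally into a hash keyed by the three-letter code. The following is only a sketch: the hash name %division and the subroutine describe_division are made up for this illustration.

```perl
use strict;
use warnings;

# Division codes and descriptions, copied from the list above.
my %division = (
    PRI => 'primate sequences',
    ROD => 'rodent sequences',
    MAM => 'other mammalian sequences',
    VRT => 'other vertebrate sequences',
    INV => 'invertebrate sequences',
    PLN => 'plant, fungal, and algal sequences',
    BCT => 'bacterial sequences',
    VRL => 'viral sequences',
    PHG => 'bacteriophage sequences',
    SYN => 'synthetic and chimeric sequences',
    UNA => 'unannotated sequences',
    EST => 'EST sequences (expressed sequence tags)',
    PAT => 'patent sequences',
    STS => 'STS sequences (sequence tagged sites)',
    GSS => 'GSS sequences (genome survey sequences)',
    HTG => 'HTGS sequences (high throughput genomic sequencing data)',
    HTC => 'HTC sequences (high throughput cDNA sequencing data)',
);

# Return the description for a division code, or undef if the code is unknown.
# The lookup ignores case, so 'est' and 'EST' are treated alike.
sub describe_division {
    my ($code) = @_;
    my $key = uc $code;
    return exists $division{$key} ? $division{$key} : undef;
}

for my $code (qw(PRI est XYZ)) {
    my $desc = describe_division($code);
    print "$code: ", (defined $desc ? $desc : 'unknown division'), "\n";
}
```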

Some divisions are very large: the largest, the EST (expressed sequence tag) division, comprises 123 library files! A portion of human DNA is stored in the PRI division, which contains (as of this writing) 13 library files, for a total of almost 3.5 GB of data. Human data is also stored in the STS, GSS, HTG, and HTC divisions. Human data alone in GenBank makes up almost 5 million record entries with over 8 billion bases of sequence.

The public database servers such as Entrez or BLAST at http://www.ncbi.nlm.nih.gov/ give you access to well-maintained and updated sequence data and programs, but many researchers find that they need to write their own programs to manipulate and analyze the data. The problem is, there's so much data. For many purposes, you can download a selected set of records from NCBI or other locations, but sometimes you need the whole dataset.

It's possible to set up a desktop workstation (Windows, Mac, Unix, or Linux) that contains all of GenBank; just be sure to buy a very large hard disk! Getting all that data onto your hard drive, however, is more difficult. A Perl program called mirror.pl helps to address this need. Even with a university-standard, high-speed Internet connection, downloading all of GenBank can be time-consuming; downloading an entire dataset with a modem can be an exercise in frustration. The best solution is to download only the files you need, in compressed form. The EST data, for example, is about half the entire database; don't download it unless you really need to. If you need to download GenBank, I recommend contacting the help desk at NCBI. They'll help you get the most up-to-date information.
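
If disk space is tight, one option is to read a compressed file directly rather than uncompressing it to disk first. The sketch below assumes the file is gzip-compressed and that your Perl installation includes the IO::Compress::Gzip and IO::Uncompress::Gunzip modules; the subroutine name count_compressed_lines and the placeholder text are made up for this illustration.

```perl
use strict;
use warnings;
use IO::Compress::Gzip     qw(gzip $GzipError);
use IO::Uncompress::Gunzip qw($GunzipError);

# Count the lines in gzip-compressed data without writing an
# uncompressed copy to disk. $source may be a filename or a
# reference to a scalar that holds gzip data.
sub count_compressed_lines {
    my ($source) = @_;
    my $in = IO::Uncompress::Gunzip->new($source)
        or die "Cannot read compressed input: $GunzipError\n";
    my $count = 0;
    while (defined(my $line = $in->getline())) {
        $count++;
    }
    $in->close();
    return $count;
}

# Stand-in for a downloaded library: compress some placeholder
# text in memory (this is not real GenBank data).
my $text = "first line\nsecond line\nthird line\n";
my $compressed;
gzip(\$text => \$compressed)
    or die "gzip failed: $GzipError\n";

print "Lines read: ", count_compressed_lines(\$compressed), "\n";
```

To use this on a real download, pass the name of the compressed file instead of the scalar reference. Reading this way trades some speed for disk space, since the data is uncompressed again on every pass.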

Since you're learning to program, it makes more sense to practice on a tiny, five-record library file, but the programs you'll write will work just fine on the real files.
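
Because a library is just many records in succession, a program can walk through it one record at a time. The following minimal sketch counts the records in a library; it relies on the GenBank flat-file convention that each record ends with a line containing only //. The subroutine name count_records and the two-record sample are made up for this illustration, and the sample is not real GenBank data.

```perl
use strict;
use warnings;

# Count the records in a GenBank-style library read from an open filehandle.
# Assumes each record ends with a line containing only "//".
sub count_records {
    my ($fh) = @_;
    local $/ = "//\n";    # make <$fh> return one whole record per read
    my $count = 0;
    while (my $record = <$fh>) {
        next unless $record =~ /\S/;    # ignore trailing whitespace, if any
        $count++;
    }
    return $count;
}

# A made-up, two-record stand-in for a library file.
my $library = "LOCUS       EXAMPLE1\nORIGIN\n        1 acgt\n//\n"
            . "LOCUS       EXAMPLE2\nORIGIN\n        1 ttga\n//\n";

# Open the string as if it were a file, so the sketch needs no data on disk.
open(my $fh, '<', \$library) or die "Cannot open in-memory library: $!\n";
print "Records found: ", count_records($fh), "\n";
close $fh;
```

Since only one record is held in memory at a time, the same loop should cope with a real library file as well as with a tiny practice file; just open the file by name instead of the in-memory string.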
