11.1 Overview of PDB
The main source of information about the
3D structures of macromolecules
(including proteins, peptides, viruses, protein/nucleic acid
complexes, nucleic acids, and carbohydrates) is the Protein Data Bank
(PDB), and its file format is the de facto standard for exchanging
structural information. Most of these structures are determined
experimentally by X-ray diffraction or nuclear magnetic resonance
(NMR) studies.
PDB started in 1971 with seven proteins; it will soon grow to 20,000
structures. With the international effort in structural genomics
increasing, the PDB is certain to continue its rapid growth. Within a
few short years the number of known structures will approach 100,000.
PDB files are like GenBank records, in that they are human-readable
ASCII flat files. The text conforms to a
specific format, so computer programs may be written to extract the
information. PDB is organized with one structure per file, unlike
GenBank, which is distributed with many records in each
"library" file.
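Because the format is column-based, extracting information usually comes down to slicing each line at fixed positions. The following is a minimal sketch in Python, not a complete parser. The function name parse_atom_line and the SAMPLE line are made up for illustration; the column positions are those given for ATOM and HETATM records in the published PDB format description.

```python
# Hypothetical helper for illustration; not part of any PDB library.
def parse_atom_line(line):
    """Return a dict of fields from one ATOM/HETATM line, or None otherwise."""
    record = line[0:6].strip()           # columns 1-6: record name
    if record not in ("ATOM", "HETATM"):
        return None
    return {
        "record": record,
        "serial": int(line[6:11]),        # columns 7-11: atom serial number
        "atom_name": line[12:16].strip(), # columns 13-16: atom name
        "res_name": line[17:20].strip(),  # columns 18-20: residue name
        "chain_id": line[21],             # column 22: chain identifier
        "res_seq": int(line[22:26]),      # columns 23-26: residue number
        "x": float(line[30:38]),          # columns 31-38: x coordinate
        "y": float(line[38:46]),          # columns 39-46: y coordinate
        "z": float(line[46:54]),          # columns 47-54: z coordinate
    }

# An invented sample line laid out in the fixed columns described above.
SAMPLE = "ATOM      1  N   MET A   1      27.340  24.430   2.614  1.00  9.67           N"

atom = parse_atom_line(SAMPLE)
print(atom["res_name"], atom["chain_id"], atom["x"], atom["y"], atom["z"])
```

Slicing by column rather than splitting on whitespace matters here, because adjacent fields in a PDB line are allowed to touch or to be blank.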
Bioinformaticians who work extensively with PDB files report that
there are serious problems with the consistency of the PDB format.
For instance, as the field has advanced and the data format has
evolved to meet new knowledge requirements, some of the older files
have become out of date, and efforts are underway to address the
uniformity of PDB data. Until these efforts are complete and a new
data format is developed, programmers must cope with inconsistencies
in the current format. If you program extensively with PDB files,
expect to find inconsistencies and errors in the data, especially in
the older files. In addition, many parsing tools that work well on
newer files perform poorly on older ones.
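One common way to cope with such inconsistencies is to parse defensively: pad short lines, treat unreadable fields as missing, and record problems instead of stopping at the first bad line. The sketch below illustrates the idea in Python under stated assumptions. The names to_float and check_coordinates are hypothetical, and the only check made is whether the coordinate columns of ATOM and HETATM lines can be read as numbers.

```python
from collections import Counter

def to_float(field):
    """Convert a fixed-width field to float; return None if blank or malformed."""
    try:
        return float(field)
    except ValueError:
        return None

def check_coordinates(lines):
    """Return (record_counts, problems) for an iterable of PDB-style lines."""
    counts = Counter()
    problems = []
    for number, raw in enumerate(lines, start=1):
        # Pad to 80 columns so slicing is safe on truncated lines.
        line = raw.rstrip("\n").ljust(80)
        record = line[0:6].strip()
        if not record:
            continue                      # skip blank lines
        counts[record] += 1
        if record in ("ATOM", "HETATM"):
            # Columns 31-38, 39-46, 47-54 hold x, y, z.
            xyz = [to_float(line[a:b]) for a, b in ((30, 38), (38, 46), (46, 54))]
            if None in xyz:
                problems.append((number, "unreadable coordinates"))
    return counts, problems
```

A caller could run check_coordinates over an open file and inspect the returned list of (line number, message) pairs before deciding whether the file is safe to pass to stricter code.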
As you become a more experienced programmer, these and other issues
the PDB faces become more important. For instance, as PDB evolves,
the code you write to interact with it must also evolve; you must
always maintain your code with an eye on how the rest of the world is
changing. As links between databases become better supported, your
code will take advantage of the new opportunities the links provide.
With new standards of data storage becoming established, your code
will have to evolve to include them.
The PDB web site contains a wealth of information on how to download
all the files. They are also conveniently distributed, at no
cost, on a set of CDs, which is a real advantage for those
lacking high-throughput Internet connections.