Chapter 11. Protein Data Bank
The success of the Human Genome Project in decoding the DNA
sequence of human genes has captured the public imagination, but
another project has been quietly gaining momentum, and it promises
equally revolutionary results. This project is an international
effort to determine the 3D structure of a comprehensive range of
proteins on a genome-wide level using high-throughput analytical
technologies. It is the foundation of the new field of structural
genomics.
Recent and expected advances in technology promise an accelerating
pace of protein structure determination. The storehouse for all of
this data is the Protein Data Bank (PDB). The PDB may be found on
the web at http://www.rcsb.org/pdb/.
Finding the amino acid sequence, or primary sequence, is just the
beginning of studying a protein. Proteins fold locally into secondary
structures such as alpha helices, beta strands, and turns. Two or
three adjacent secondary structures might combine into common local
folds called "motifs" or "supersecondary" structures, such as beta
sheets or alpha-alpha units. These building blocks then fold into the
3D or tertiary structure of a protein.
Finally, one or more tertiary structures may be combined as subunits
into a quaternary structure such as an enzyme or a virus.
Without knowing how a protein folds into a 3D structure, you are less
likely to know what the protein does or how it does it. Even if you
know that the protein is implicated in a disease,
knowledge of its tertiary structure is usually needed to find a
possible treatment. Knowing the tertiary conformation of the active
site of a protein (which may involve amino acids that are far apart
in terms of the primary sequence but which are brought together by
the folding of the protein) is critical to guide the selection of
targets for new drugs.
Now that the basic genetic information of a number of organisms,
including humans, has been decoded, a primary challenge facing
biologists is to learn as much as possible about the proteins those
genes produce and how they interact.
In fact, one of the great questions of modern biology is how the
primary amino acid sequence of a protein determines its ultimate 3D
shape. If a computational method could be found to reliably predict the
fold of a protein from its amino acid sequence, the effect on biology
and medicine would be profound.
In this chapter, you'll learn the basics of PDB files and how
to parse out selected information from them. You'll also
explore interesting Perl techniques for finding and iterating over
lots of files, as well as controlling other bioinformatics programs
from a Perl program. The exercises at the end of the chapter
challenge you to extend the introductory material presented here to
gain access to more of the PDB data.