11.3
PDB Files
Here's a section of an actual PDB file:
HEADER SUGAR BINDING PROTEIN 03-MAR-99 1C1F
TITLE LIGAND-FREE CONGERIN I
COMPND MOL_ID: 1;
COMPND 2 MOLECULE: CONGERIN I;
COMPND 3 CHAIN: A;
COMPND 4 FRAGMENT: CARBOHYDRATE-RECOGNITION-DOMAIN;
COMPND 5 BIOLOGICAL_UNIT: HOMODIMER
SOURCE MOL_ID: 1;
SOURCE 2 ORGANISM_SCIENTIFIC: CONGER MYRIASTER;
SOURCE 3 ORGANISM_COMMON: CONGER EEL;
SOURCE 4 TISSUE: SKIN MUCUS;
SOURCE 5 SECRETION: NON-CLASSICAL
KEYWDS GALECTIN, LECTIN, BETA-GALACTOSE-BINDING, SUGAR BINDING
KEYWDS 2 PROTEIN
EXPDTA X-RAY DIFFRACTION
AUTHOR T.SHIRAI,C.MITSUYAMA,Y.NIWA,Y.MATSUI,H.HOTTA,T.YAMANE,
AUTHOR 2 H.KAMIYA,C.ISHII,T.OGAWA,K.MURAMOTO
REVDAT 2 14-OCT-99 1C1F 1 SEQADV HEADER
REVDAT 1 08-OCT-99 1C1F 0
JRNL AUTH T.SHIRAI,C.MITSUYAMA,Y.NIWA,Y.MATSUI,H.HOTTA,
JRNL AUTH 2 T.YAMANE,H.KAMIYA,C.ISHII,T.OGAWA,K.MURAMOTO
JRNL TITL HIGH-RESOLUTION STRUCTURE OF CONGER EEL GALECTIN,
JRNL TITL 2 CONGERIN I, IN LACTOSE- LIGANDED AND LIGAND-FREE
JRNL TITL 3 FORMS: EMERGENCE OF A NEW STRUCTURE CLASS BY
JRNL TITL 4 ACCELERATED EVOLUTION
JRNL REF STRUCTURE (LONDON) V. 7 1223 1999
JRNL REFN ASTM STRUE6 UK ISSN 0969-2126 2005
REMARK 1
REMARK 2
REMARK 2 RESOLUTION. 1.6 ANGSTROMS.
REMARK 3
REMARK 3 REFINEMENT.
REMARK 3 PROGRAM : X-PLOR 3.1
REMARK 3 AUTHORS : BRUNGER
REMARK 3
REMARK 3 DATA USED IN REFINEMENT.
REMARK 3 RESOLUTION RANGE HIGH (ANGSTROMS) : 1.60
REMARK 3 RESOLUTION RANGE LOW (ANGSTROMS) : 8.00
REMARK 3 DATA CUTOFF (SIGMA(F)) : 3.000
REMARK 3 DATA CUTOFF HIGH (ABS(F)) : NULL
REMARK 3 DATA CUTOFF LOW (ABS(F)) : NULL
REMARK 3 COMPLETENESS (WORKING+TEST) (%) : 85.0
REMARK 3 NUMBER OF REFLECTIONS : 17099
REMARK 3
REMARK 3
REMARK 3 FIT TO DATA USED IN REFINEMENT.
REMARK 3 CROSS-VALIDATION METHOD : THROUGHOUT
REMARK 3 FREE R VALUE TEST SET SELECTION : RANDOM
REMARK 3 R VALUE (WORKING SET) : 0.201
REMARK 3 FREE R VALUE : 0.247
REMARK 3 FREE R VALUE TEST SET SIZE (%) : 5.000
REMARK 3 FREE R VALUE TEST SET COUNT : 855
REMARK 3 ESTIMATED ERROR OF FREE R VALUE : NULL
REMARK 3
... (file truncated here)
REMARK 4
REMARK 4 1C1F COMPLIES WITH FORMAT V. 2.3, 09-JULY-1998
REMARK 7
REMARK 7 >>> WARNING: CHECK REMARK 999 CAREFULLY
REMARK 8
REMARK 8 SIDE-CHAINS OF SER123 AND LEU124 ARE MODELED AS ALTERNATIVE
REMARK 8 CONFORMERS.
REMARK 9
REMARK 9 SER1 IS ACETYLATED.
REMARK 10
REMARK 10 TER
REMARK 10 SER: THE N-TERMINAL RESIDUE WAS NOT OBSERVED
REMARK 100
REMARK 100 THIS ENTRY HAS BEEN PROCESSED BY RCSB ON 07-MAR-1999.
REMARK 100 THE RCSB ID CODE IS RCSB000566.
REMARK 200
REMARK 200 EXPERIMENTAL DETAILS
REMARK 200 EXPERIMENT TYPE : X-RAY DIFFRACTION
REMARK 200 DATE OF DATA COLLECTION : NULL
REMARK 200 TEMPERATURE (KELVIN) : 291.0
REMARK 200 PH : 9.00
REMARK 200 NUMBER OF CRYSTALS USED : 1
REMARK 200
REMARK 200 SYNCHROTRON (Y/N) : Y
REMARK 200 RADIATION SOURCE : PHOTON FACTORY
REMARK 200 BEAMLINE : BL6A
REMARK 200 X-RAY GENERATOR MODEL : NULL
REMARK 200 MONOCHROMATIC OR LAUE (M/L) : M
REMARK 200 WAVELENGTH OR RANGE (A) : 1.00
REMARK 200 MONOCHROMATOR : NULL
REMARK 200 OPTICS : NULL
REMARK 200
... (file truncated here)
REMARK 500
REMARK 500 GEOMETRY AND STEREOCHEMISTRY
REMARK 500 SUBTOPIC: COVALENT BOND ANGLES
REMARK 500
REMARK 500 THE STEREOCHEMICAL PARAMETERS OF THE FOLLOWING RESIDUES
REMARK 500 HAVE VALUES WHICH DEVIATE FROM EXPECTED VALUES BY MORE
REMARK 500 THAN 4*RMSD (M=MODEL NUMBER; RES=RESIDUE NAME; C=CHAIN
REMARK 500 IDENTIFIER; SSEQ=SEQUENCE NUMBER; I=INSERTION CODE).
REMARK 500
REMARK 500 STANDARD TABLE:
REMARK 500 FORMAT: (10X,I3,1X,A3,1X,A1,I4,A1,3(1X,A4,2X),12X,F5.1)
REMARK 500
REMARK 500 EXPECTED VALUES: ENGH AND HUBER, 1991
REMARK 500
REMARK 500 M RES CSSEQI ATM1 ATM2 ATM3
REMARK 500 HIS A 44 N - CA - C ANGL. DEV. =-10.3 DEGREES
REMARK 500 LEU A 132 CA - CB - CG ANGL. DEV. = 12.5 DEGREES
REMARK 700
REMARK 700 SHEET
REMARK 700 DETERMINATION METHOD: AUTHOR-DETERMINED
REMARK 999
REMARK 999 SEQUENCE
REMARK 999 LEU A 135 IS NOT PRESENT IN SEQUENCE DATABASE
REMARK 999
DBREF 1C1F A 1 136 SWS P26788 LEG_CONMY 1 135
SEQADV 1C1F LEU A 135 SWS P26788 SEE REMARK 999
SEQRES 1 A 136 SER GLY GLY LEU GLN VAL LYS ASN PHE ASP PHE THR VAL
SEQRES 2 A 136 GLY LYS PHE LEU THR VAL GLY GLY PHE ILE ASN ASN SER
SEQRES 3 A 136 PRO GLN ARG PHE SER VAL ASN VAL GLY GLU SER MET ASN
SEQRES 4 A 136 SER LEU SER LEU HIS LEU ASP HIS ARG PHE ASN TYR GLY
SEQRES 5 A 136 ALA ASP GLN ASN THR ILE VAL MET ASN SER THR LEU LYS
SEQRES 6 A 136 GLY ASP ASN GLY TRP GLU THR GLU GLN ARG SER THR ASN
SEQRES 7 A 136 PHE THR LEU SER ALA GLY GLN TYR PHE GLU ILE THR LEU
SEQRES 8 A 136 SER TYR ASP ILE ASN LYS PHE TYR ILE ASP ILE LEU ASP
SEQRES 9 A 136 GLY PRO ASN LEU GLU PHE PRO ASN ARG TYR SER LYS GLU
SEQRES 10 A 136 PHE LEU PRO PHE LEU SER LEU ALA GLY ASP ALA ARG LEU
SEQRES 11 A 136 THR LEU VAL LYS LEU GLU
FORMUL 2 HOH *81(H2 O1)
HELIX 1 1 GLY A 66 ASN A 68 5 3
SHEET 1 S1 1 GLY A 3 VAL A 6 0
SHEET 1 S2 1 PHE A 121 GLY A 126 0
SHEET 1 S3 1 ARG A 29 GLY A 35 0
SHEET 1 S4 1 LEU A 41 ASN A 50 0
SHEET 1 S5 1 GLN A 55 THR A 63 0
SHEET 1 S6 1 GLN A 74 SER A 76 0
SHEET 1 F1 1 ALA A 128 GLU A 136 0
SHEET 1 F2 1 PHE A 16 ILE A 23 0
SHEET 1 F3 1 TYR A 86 TYR A 93 0
SHEET 1 F4 1 LYS A 97 ILE A 102 0
SHEET 1 F5 1 ASN A 107 PRO A 111 0
CRYST1 94.340 36.920 40.540 90.00 90.00 90.00 P 21 21 2 4
ORIGX1 1.000000 0.000000 0.000000 0.00000
ORIGX2 0.000000 1.000000 0.000000 0.00000
ORIGX3 0.000000 0.000000 1.000000 0.00000
SCALE1 0.010600 0.000000 0.000000 0.00000
SCALE2 0.000000 0.027085 0.000000 0.00000
SCALE3 0.000000 0.000000 0.024667 0.00000
ATOM 1 N GLY A 2 1.888 -8.251 -2.511 1.00 36.63 N
ATOM 2 CA GLY A 2 2.571 -8.428 -1.248 1.00 33.02 C
ATOM 3 C GLY A 2 2.586 -7.069 -0.589 1.00 30.43 C
ATOM 4 O GLY A 2 2.833 -6.107 -1.311 1.00 33.27 O
ATOM 5 N GLY A 3 2.302 -6.984 0.693 1.00 24.67 N
ATOM 6 CA GLY A 3 2.176 -5.723 1.348 1.00 18.88 C
ATOM 7 C GLY A 3 0.700 -5.426 1.526 1.00 16.58 C
ATOM 8 O GLY A 3 -0.187 -6.142 1.010 1.00 12.47 O
ATOM 9 N LEU A 4 0.494 -4.400 2.328 1.00 15.00 N
... (file truncated here)
ATOM 1078 CG GLU A 136 -0.873 9.368 16.046 1.00 38.96 C
ATOM 1079 CD GLU A 136 -0.399 9.054 17.456 1.00 44.66 C
ATOM 1080 OE1 GLU A 136 0.789 8.749 17.641 1.00 47.97 O
ATOM 1081 OE2 GLU A 136 -1.236 9.099 18.361 1.00 47.75 O
ATOM 1082 OXT GLU A 136 0.764 12.146 12.712 1.00 26.22 O
TER 1083 GLU A 136
HETATM 1084 O HOH 200 -1.905 -7.624 2.822 1.00 14.50 O
HETATM 1085 O HOH 201 -8.374 7.981 9.202 1.00 20.77 O
HETATM 1086 O HOH 202 -4.047 9.199 11.632 1.00 38.24 O
HETATM 1087 O HOH 203 6.172 14.210 8.483 1.00 14.50 O
HETATM 1088 O HOH 204 2.903 7.804 15.329 1.00 24.51 O
HETATM 1089 O HOH 205 16.654 0.676 11.968 1.00 10.49 O
... (file truncated here)
HETATM 1157 O HOH 286 6.960 14.840 -3.025 1.00 35.59 O
HETATM 1158 O HOH 287 -3.222 10.410 7.061 1.00 38.91 O
HETATM 1159 O HOH 288 28.306 0.551 4.876 1.00 52.13 O
HETATM 1160 O HOH 290 21.506 -12.424 9.751 1.00 31.68 O
HETATM 1161 O HOH 291 12.951 10.424 -7.324 1.00 46.10 O
HETATM 1162 O HOH 292 18.119 -15.184 14.793 1.00 56.82 O
HETATM 1163 O HOH 293 13.501 22.220 8.216 1.00 43.30 O
HETATM 1164 O HOH 294 13.916 -11.387 9.695 1.00 47.13 O
MASTER 240 0 0 1 11 0 0 6 1163 1 0 11
END
PDB files are long, mostly because they must record information about
each atom in the molecule; even this relatively short one runs to
28 formatted pages when complete. I cut it here to a little over
three pages, showing just enough of the principal sections to give
you the overall idea.
The PDB web site has the basic documents you need to read and program
with PDB files. The Protein Data Bank Contents Guide (http://www.rcsb.org/pdb/docs/format/pdbguide2.2/guide2.2_frame.html)
is the best reference, and there are also FAQs and additional
documents available.
In the following sections, you'll extract information from
these files. Since the information in these files describes the 3D
structure of macromolecules, the files are frequently used by
graphical programs that display a spatial representation of the
molecules. The scope of this book does not include graphics; however,
you will see how to get spatial coordinates out of the files. The
largest part of a PDB file consists of the ATOM records, which contain
the coordinates of the atoms. Because of this level of detail, PDB
files are typically longer than GenBank records. (Note the
inconsistent terminology: a unit of PDB is the file, which
contains one structure; a unit of GenBank is the record, which
contains one entry.)
11.3.1
PDB File Format
Let's take a look at a PDB file and the documentation that tells
how the information is formatted in a PDB file. Based on that
information, you'll parse the file to extract information of
interest.
PDB files are composed of lines of 80 columns that begin with one of
several predefined record names and end with a newline.
("Column" means position on a line: the first character
is in the first column, and so forth.) Blank columns are padded with
spaces. A record type is one or more lines with the same record name.
Different record types have different types of fields defined within
the lines. They are also grouped according to function.
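As a minimal sketch of this layout (not the program developed later in the chapter), the record name can be read from the first six columns of each line and used to group lines by record type. The subroutine name `group_by_record_name` and the sample lines are illustrative assumptions.

```perl
use strict;
use warnings;

# Hypothetical helper: group the lines of a PDB file by record name.
# The record name occupies columns 1-6; shorter names are padded with spaces.
sub group_by_record_name {
    my (@lines) = @_;
    my %record_types;
    foreach my $line (@lines) {
        chomp(my $copy = $line);
        next if $copy =~ /^\s*$/;          # skip empty lines
        my $name = substr($copy, 0, 6);    # columns 1-6 (Perl offsets start at 0)
        $name =~ s/\s+$//;                 # strip the padding
        push @{ $record_types{$name} }, $copy;
    }
    return %record_types;
}

# Sample lines; only the first six columns matter for this sketch.
my @sample = (
    "HEADER    SUGAR BINDING PROTEIN\n",
    "TITLE     LIGAND-FREE CONGERIN I\n",
    "SEQRES   1 A  136  SER GLY GLY LEU GLN VAL LYS ASN PHE ASP PHE THR VAL\n",
    "SEQRES   2 A  136  GLY LYS PHE LEU THR VAL GLY GLY PHE ILE ASN ASN SER\n",
    "END\n",
);

my %record_types = group_by_record_name(@sample);
foreach my $name (sort keys %record_types) {
    printf "%-6s %d line(s)\n", $name, scalar @{ $record_types{$name} };
}
```

Keying a hash on the record name keeps every line of a record type together, which is convenient because most later processing works on one record type at a time.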
The SEQRES record type is one of four record types in the
Primary Structure Section, which presents the primary structure of
the peptide or nucleotide sequence:
- DBREF: Reference to the entry in the sequence database(s)
- SEQADV: Identification of conflicts between PDB and the named
  sequence database
- SEQRES: Primary sequence of backbone residues
- MODRES: Identification of modifications to standard residues
The DBREF and SEQADV record types in the example PDB entry from the
previous section give reference information and details on conflicts
between the PDB and the original database. (The example doesn't
include a MODRES record type.) Here are those record types from the
entry:
DBREF 1C1F A 1 136 SWS P26788 LEG_CONMY 1 135
SEQADV 1C1F LEU A 135 SWS P26788 SEE REMARK 999
Briefly, the DBREF line
states that in the PDB entry 1C1F
(from a file named pdb1c1f.ent), the residues in
chain A are numbered from 1 to 136; that the corresponding sequence
is in the Swiss-Prot (SWS) database, where it is assigned the ID
number P26788 and the name LEG_CONMY (in many databases these are
identical); and that the residues are numbered 1 to 135 in that
database. The discrepancy in the
numbering between the original database and PDB is explained in the
SEQADV
record type, which refers you to the REMARK 999 lines (included in the
listing in the previous section, though not in the two lines shown
here), where you discover that the leucine at position 135 of the PDB
entry is not present in the Swiss-Prot sequence (perhaps two different
groups determined the structure, and they disagree at this
point).[2]
You can see that parsing the information in those two lines with a
program requires several steps, such as following links to other
lines in the PDB entry that further explain discrepancies and
identifying other databases.
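A first step might look like the following sketch, which names the fields of the DBREF line. It relies on an assumption worth stating loudly: the line is split on whitespace, which works here only because no field on this line is blank, whereas the format itself is column-based. The subroutine name `parse_dbref` is hypothetical, and the hash keys are chosen to follow the field names in the Contents Guide's DBREF definition.

```perl
use strict;
use warnings;

# Hypothetical helper: split a DBREF line into named fields.
# ASSUMPTION: no field on the line is blank, so splitting on whitespace
# yields the fields in order. The real format is defined by columns.
sub parse_dbref {
    my ($line) = @_;
    my @fields = split ' ', $line;
    return unless @fields == 10 and $fields[0] eq 'DBREF';
    my %dbref;
    @dbref{qw(record idCode chainID seqBegin seqEnd
              database dbAccession dbIdCode dbseqBegin dbseqEnd)} = @fields;
    return %dbref;
}

my %dbref = parse_dbref('DBREF 1C1F A 1 136 SWS P26788 LEG_CONMY 1 135');

print "PDB entry $dbref{idCode}, chain $dbref{chainID}: ",
      "residues $dbref{seqBegin} to $dbref{seqEnd}\n";
print "$dbref{database} entry $dbref{dbAccession} ($dbref{dbIdCode}): ",
      "residues $dbref{dbseqBegin} to $dbref{dbseqEnd}\n";
```

Following the SEQADV reference to REMARK 999 would be a separate step, and is not attempted here.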
Links between databases are important in bioinformatics.
Table 11-1 displays the databases that are referred
to in PDB files. As you already know, there are many biological
databases; those shown here have a
good deal of protein or structural data.
Table 11-1. Databases referenced in PDB files
Database                              | PDB code
BioMagResBank                         | BMRB
BLOCKS                                | BLOCKS
European Molecular Biology Laboratory | EMBL
GenBank                               | GB
Genome Data Base                      | GDB
Nucleic Acid Database                 | NDB
PROSITE                               | PROSIT
Protein Data Bank                     | PDB
Protein Identification Resource       | PIR
SWISS-PROT                            | SWS
TREMBL                                | TREMBL
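In a program, the codes in Table 11-1 fit naturally into a hash keyed by the PDB code. The following is a small sketch; the names `%database_name` and `database_for_code` are illustrative, and returning the code unchanged when it is unknown is just one possible design choice.

```perl
use strict;
use warnings;

# Database codes from Table 11-1, keyed by the code used in PDB files.
my %database_name = (
    BMRB   => 'BioMagResBank',
    BLOCKS => 'BLOCKS',
    EMBL   => 'European Molecular Biology Laboratory',
    GB     => 'GenBank',
    GDB    => 'Genome Data Base',
    NDB    => 'Nucleic Acid Database',
    PROSIT => 'PROSITE',
    PDB    => 'Protein Data Bank',
    PIR    => 'Protein Identification Resource',
    SWS    => 'SWISS-PROT',
    TREMBL => 'TREMBL',
);

# Hypothetical helper: translate a code, falling back to the code itself
# when it does not appear in the table.
sub database_for_code {
    my ($code) = @_;
    return exists $database_name{$code} ? $database_name{$code} : $code;
}

print database_for_code('SWS'), "\n";
```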
11.3.2
SEQRES
For starters, let's try a fairly easy task in Perl: extracting
the amino acid sequence data. To extract the amino
acid primary sequence information, you need to parse the record type
SEQRES. Here is a SEQRES line from the PDB file listed earlier:
SEQRES   1 A  136  SER GLY GLY LEU GLN VAL LYS ASN PHE ASP PHE THR VAL
The following excerpt shows the SEQRES record type as defined in the
Protein Data Bank Contents Guide. This section on SEQRES, which is a
fairly simple record type, is shown in its entirety to help
familiarize you with this kind of documentation.
SEQRES
Overview
SEQRES records contain the amino acid or nucleic acid sequence of residues in
each chain of the macromolecule that was studied.
Record Format
COLUMNS        DATA TYPE       FIELD         DEFINITION
---------------------------------------------------------------------------------
 1 -  6        Record name     "SEQRES"
 9 - 10        Integer         serNum        Serial number of the SEQRES record
                                             for the current chain. Starts at 1
                                             and increments by one each line.
                                             Reset to 1 for each chain.
12             Character       chainID       Chain identifier. This may be any
                                             single legal character, including a
                                             blank which is used if there is
                                             only one chain.
14 - 17        Integer         numRes        Number of residues in the chain.
                                             This value is repeated on every
                                             record.
20 - 22        Residue name    resName       Residue name.
24 - 26        Residue name    resName       Residue name.
28 - 30        Residue name    resName       Residue name.
32 - 34        Residue name    resName       Residue name.
36 - 38        Residue name    resName       Residue name.
40 - 42        Residue name    resName       Residue name.
44 - 46        Residue name    resName       Residue name.
48 - 50        Residue name    resName       Residue name.
52 - 54        Residue name    resName       Residue name.
56 - 58        Residue name    resName       Residue name.
60 - 62        Residue name    resName       Residue name.
64 - 66        Residue name    resName       Residue name.
68 - 70        Residue name    resName       Residue name.
Details
* PDB entries use the three-letter abbreviation for amino acid names and the
one-letter code for nucleic acids.
* In the case of non-standard groups, a hetID of up to three (3) alphanumeric
characters is used. Common HET names appear in the HET dictionary.
* Each covalently contiguous sequence of residues (connected via the "backbone"
atoms) is represented as an individual chain.
* Heterogens which are integrated into the backbone of the chain are listed as
being part of the chain and are included in the SEQRES records for that chain.
* Each set of SEQRES records and each HET group is assigned a component number.
The component number is assigned serially beginning with 1 for the first set
of SEQRES records. This number is given explicitly in the FORMUL record, but
only implicitly in the SEQRES record.
* The SEQRES records must list residues present in the molecule studied, even
if the coordinates are not present.
* C- and N-terminus residues for which no coordinates are provided due to
disorder must be listed on SEQRES.
* All occurrences of standard amino or nucleic acid residues (ATOM records)
must be listed on a SEQRES record. This implies that a numRes of 1 is valid.
* No distinction is made between ribo- and deoxyribonucleotides in the SEQRES
records. These residues are identified with the same residue name (i.e., A,
C, G, T, U, I).
* If the entire residue sequence is unknown, the serNum in column 10 is "0",
the number of residues thought to comprise the molecule is entered as numRes
in columns 14 - 17, and resName in columns 20 - 22 is "UNK".
* In case of microheterogeneity, only one of the sequences is presented. A
REMARK is generated to explain this and a SEQADV is also generated.
Verification/Validation/Value Authority Control
The residues presented on the SEQRES records must agree with those found in
the ATOM records.
The SEQRES records are checked by PDB using the sequence databases and
information provided by the depositor.
SEQRES is compared to the ATOM records during processing, and both are checked
against the sequence database. All discrepancies are either resolved or
annotated in the entry.
Relationships to Other Record Types
The residues presented on the SEQRES records must agree with those found in
the ATOM records. DBREF refers to the corresponding entry in the sequence
databases. SEQADV lists all discrepancies between the entry's sequence for
which there are coordinates and that referenced in the sequence database.
MODRES describes modifications to a standard residue.
Example
         1         2         3         4         5         6         7
1234567890123456789012345678901234567890123456789012345678901234567890
SEQRES   1 A   21  GLY ILE VAL GLU GLN CYS CYS THR SER ILE CYS SER LEU
SEQRES   2 A   21  TYR GLN LEU GLU ASN TYR CYS ASN
SEQRES   1 B   30  PHE VAL ASN GLN HIS LEU CYS GLY SER HIS LEU VAL GLU
SEQRES   2 B   30  ALA LEU TYR LEU VAL CYS GLY GLU ARG GLY PHE PHE TYR
SEQRES   3 B   30  THR PRO LYS ALA
SEQRES   1 C   21  GLY ILE VAL GLU GLN CYS CYS THR SER ILE CYS SER LEU
SEQRES   2 C   21  TYR GLN LEU GLU ASN TYR CYS ASN
SEQRES   1 D   30  PHE VAL ASN GLN HIS LEU CYS GLY SER HIS LEU VAL GLU
SEQRES   2 D   30  ALA LEU TYR LEU VAL CYS GLY GLU ARG GLY PHE PHE TYR
SEQRES   3 D   30  THR PRO LYS ALA
Known Problems
Polysaccharides do not lend themselves to being represented in SEQRES.
There is no mechanism provided to describe sequence runs when the exact
ordering of the sequence is not known.
For cyclic peptides, PDB arbitrarily assigns a residue as the N-terminus.
For microheterogeneity only one of the possible residues in a given position
is provided in SEQRES.
No distinction is made between ribo- and deoxyribonucleotides in the SEQRES
records. These residues are identified with the same residue name (i.e., A,
C, G, T, U).
The structure of the line containing the SEQRES record type is fairly
straightforward, with fields assigned to specific locations or
columns in the line. You'll see later how to use these
locations to parse the information. Note that the documentation
includes many details that arise when handling such complex
experimental data.
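To make the column layout concrete, here is a minimal sketch that pulls the fields out of a single SEQRES line with `substr`. It assumes the line keeps its original column spacing (a line whose whitespace has been collapsed will not parse this way), and the subroutine name `parse_seqres_line` is hypothetical. Note that `substr` offsets start at 0, so column N is offset N-1.

```perl
use strict;
use warnings;

# Hypothetical helper: extract the fields of one SEQRES line by column.
# ASSUMPTION: the line preserves the fixed column layout of the PDB format.
sub parse_seqres_line {
    my ($line) = @_;
    chomp $line;
    return unless length($line) >= 17 and substr($line, 0, 6) eq 'SEQRES';

    my $ser_num  = substr($line, 8, 2);     # columns  9-10
    my $chain_id = substr($line, 11, 1);    # column  12 (may be blank)
    my $num_res  = substr($line, 13, 4);    # columns 14-17
    my $residues = length($line) >= 20      # columns 20-70
                 ? substr($line, 19, 51)
                 : '';

    s/^\s+|\s+$//g for $ser_num, $num_res;  # strip padding around the integers
    my @residues = split ' ', $residues;    # the three-letter residue names

    return ($ser_num, $chain_id, $num_res, @residues);
}

my $line = 'SEQRES   1 A  136  SER GLY GLY LEU GLN VAL LYS ASN PHE ASP PHE THR VAL';
my ($ser_num, $chain_id, $num_res, @residues) = parse_seqres_line($line);
print "line $ser_num of chain '$chain_id' ($num_res residues): @residues\n";
```

Reading by column rather than splitting the whole line on whitespace matters because the chain identifier may legitimately be a blank.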
Apart from the fairly standard problem of accumulating the sequence,
there is the added complication of multiple chains. By reading the
documentation just shown, you'll see that the SEQRES record name
is followed by a number representing the line number for that chain,
and the chain identifier is given in the next field (although in
older records it was optional and may be blank). Following those
fields comes a number that gives the total number of residues in the
chain. Finally, after that, come the residues, represented as
three-letter codes. What is needed, and what can be ignored to meet
our programming goals?
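One possible way to frame an answer, offered as a sketch under stated assumptions rather than as the definitive program: the chain identifier and the residue names are needed, the residue count is useful as a check, and the serial number can be ignored if the lines are read in file order. The subroutine name `sequences_by_chain` is hypothetical, and the sample lines come from the example in the documentation quoted above.

```perl
use strict;
use warnings;

# Hypothetical helper: collect residues per chain from SEQRES lines.
# ASSUMPTIONS: lines keep the fixed column layout and appear in file
# order, so the serial number (columns 9-10) can be ignored.
sub sequences_by_chain {
    my (@lines) = @_;
    my (%residues, %expected);
    foreach my $line (@lines) {
        chomp(my $copy = $line);
        next unless length($copy) >= 20 and substr($copy, 0, 6) eq 'SEQRES';
        my $chain   = substr($copy, 11, 1);       # column 12; may be blank
        my $num_res = substr($copy, 13, 4) + 0;   # columns 14-17
        push @{ $residues{$chain} }, split ' ', substr($copy, 19, 51);
        $expected{$chain} = $num_res;
    }
    # Use the residue count repeated on every line as a sanity check.
    foreach my $chain (sort keys %residues) {
        my $found = @{ $residues{$chain} };
        warn "chain '$chain': expected $expected{$chain} residues, found $found\n"
            unless $found == $expected{$chain};
    }
    return %residues;
}

my @seqres = (
    'SEQRES   1 A   21  GLY ILE VAL GLU GLN CYS CYS THR SER ILE CYS SER LEU',
    'SEQRES   2 A   21  TYR GLN LEU GLU ASN TYR CYS ASN',
    'SEQRES   1 B   30  PHE VAL ASN GLN HIS LEU CYS GLY SER HIS LEU VAL GLU',
    'SEQRES   2 B   30  ALA LEU TYR LEU VAL CYS GLY GLU ARG GLY PHE PHE TYR',
    'SEQRES   3 B   30  THR PRO LYS ALA',
);

my %chains = sequences_by_chain(@seqres);
foreach my $chain (sort keys %chains) {
    print "chain $chain: @{ $chains{$chain} }\n";
}
```

Keeping one array of residues per chain leaves the conversion to one-letter codes, if wanted, as a separate step.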