
11.3 PDB Files

Here's a section of an actual PDB file:

HEADER    SUGAR BINDING PROTEIN                   03-MAR-99   1C1F              
TITLE     LIGAND-FREE CONGERIN I                                                
COMPND    MOL_ID: 1;                                                            
COMPND   2 MOLECULE: CONGERIN I;                                                
COMPND   3 CHAIN: A;                                                            
COMPND   4 FRAGMENT: CARBOHYDRATE-RECOGNITION-DOMAIN;                           
COMPND   5 BIOLOGICAL_UNIT: HOMODIMER                                           
SOURCE    MOL_ID: 1;                                                            
SOURCE   2 ORGANISM_SCIENTIFIC: CONGER MYRIASTER;                               
SOURCE   3 ORGANISM_COMMON: CONGER EEL;                                         
SOURCE   4 TISSUE: SKIN MUCUS;                                                  
SOURCE   5 SECRETION: NON-CLASSICAL                                             
KEYWDS    GALECTIN, LECTIN, BETA-GALACTOSE-BINDING, SUGAR BINDING               
KEYWDS   2 PROTEIN                                                              
EXPDTA    X-RAY DIFFRACTION                                                     
AUTHOR    T.SHIRAI,C.MITSUYAMA,Y.NIWA,Y.MATSUI,H.HOTTA,T.YAMANE,                
AUTHOR   2 H.KAMIYA,C.ISHII,T.OGAWA,K.MURAMOTO                                  
REVDAT   2   14-OCT-99 1C1F    1       SEQADV HEADER                            
REVDAT   1   08-OCT-99 1C1F    0                                                
JRNL        AUTH   T.SHIRAI,C.MITSUYAMA,Y.NIWA,Y.MATSUI,H.HOTTA,                
JRNL        AUTH 2 T.YAMANE,H.KAMIYA,C.ISHII,T.OGAWA,K.MURAMOTO                 
JRNL        TITL   HIGH-RESOLUTION STRUCTURE OF CONGER EEL GALECTIN,            
JRNL        TITL 2 CONGERIN I, IN LACTOSE- LIGANDED AND LIGAND-FREE             
JRNL        TITL 3 FORMS: EMERGENCE OF A NEW STRUCTURE CLASS BY                 
JRNL        TITL 4 ACCELERATED EVOLUTION                                        
JRNL        REF    STRUCTURE (LONDON)            V.   7  1223 1999              
JRNL        REFN   ASTM STRUE6  UK ISSN 0969-2126                 2005          
REMARK   1                                                                      
REMARK   2                                                                      
REMARK   2 RESOLUTION. 1.6 ANGSTROMS.                                           
REMARK   3                                                                      
REMARK   3 REFINEMENT.                                                          
REMARK   3   PROGRAM     : X-PLOR 3.1                                           
REMARK   3   AUTHORS     : BRUNGER                                              
REMARK   3                                                                      
REMARK   3  DATA USED IN REFINEMENT.                                            
REMARK   3   RESOLUTION RANGE HIGH (ANGSTROMS) : 1.60                           
REMARK   3   RESOLUTION RANGE LOW  (ANGSTROMS) : 8.00                           
REMARK   3   DATA CUTOFF            (SIGMA(F)) : 3.000                          
REMARK   3   DATA CUTOFF HIGH         (ABS(F)) : NULL                           
REMARK   3   DATA CUTOFF LOW          (ABS(F)) : NULL                           
REMARK   3   COMPLETENESS (WORKING+TEST)   (%) : 85.0                           
REMARK   3   NUMBER OF REFLECTIONS             : 17099                          
REMARK   3                                                                      
REMARK   3                                                                      
REMARK   3  FIT TO DATA USED IN REFINEMENT.                                     
REMARK   3   CROSS-VALIDATION METHOD          : THROUGHOUT                      
REMARK   3   FREE R VALUE TEST SET SELECTION  : RANDOM                          
REMARK   3   R VALUE            (WORKING SET) : 0.201                           
REMARK   3   FREE R VALUE                     : 0.247                           
REMARK   3   FREE R VALUE TEST SET SIZE   (%) : 5.000                           
REMARK   3   FREE R VALUE TEST SET COUNT      : 855                             
REMARK   3   ESTIMATED ERROR OF FREE R VALUE  : NULL                            
REMARK   3                                                                      
... 
(file truncated here)
REMARK   4
REMARK   4 1C1F COMPLIES WITH FORMAT V. 2.3, 09-JULY-1998
REMARK   7
REMARK   7 >>> WARNING: CHECK REMARK 999 CAREFULLY
REMARK   8
REMARK   8 SIDE-CHAINS OF SER123 AND LEU124 ARE MODELED AS ALTERNATIVE
REMARK   8 CONFORMERS.
REMARK   9
REMARK   9 SER1 IS ACETYLATED.
REMARK  10
REMARK  10 TER
REMARK  10 SER: THE N-TERMINAL RESIDUE WAS NOT OBSERVED
REMARK 100
REMARK 100 THIS ENTRY HAS BEEN PROCESSED BY RCSB ON 07-MAR-1999.
REMARK 100 THE RCSB ID CODE IS RCSB000566.
REMARK 200
REMARK 200 EXPERIMENTAL DETAILS
REMARK 200 EXPERIMENT TYPE : X-RAY DIFFRACTION
REMARK 200 DATE OF DATA COLLECTION : NULL
REMARK 200 TEMPERATURE (KELVIN) : 291.0
REMARK 200 PH : 9.00
REMARK 200 NUMBER OF CRYSTALS USED : 1
REMARK 200
REMARK 200 SYNCHROTRON (Y/N) : Y
REMARK 200 RADIATION SOURCE : PHOTON FACTORY
REMARK 200 BEAMLINE : BL6A
REMARK 200 X-RAY GENERATOR MODEL : NULL
REMARK 200 MONOCHROMATIC OR LAUE (M/L) : M
REMARK 200 WAVELENGTH OR RANGE (A) : 1.00
REMARK 200 MONOCHROMATOR : NULL
REMARK 200 OPTICS : NULL
REMARK 200
...
(file truncated here)
REMARK 500
REMARK 500 GEOMETRY AND STEREOCHEMISTRY
REMARK 500 SUBTOPIC: COVALENT BOND ANGLES
REMARK 500
REMARK 500 THE STEREOCHEMICAL PARAMETERS OF THE FOLLOWING RESIDUES
REMARK 500 HAVE VALUES WHICH DEVIATE FROM EXPECTED VALUES BY MORE
REMARK 500 THAN 4*RMSD (M=MODEL NUMBER; RES=RESIDUE NAME; C=CHAIN
REMARK 500 IDENTIFIER; SSEQ=SEQUENCE NUMBER; I=INSERTION CODE).
REMARK 500
REMARK 500 STANDARD TABLE:
REMARK 500 FORMAT: (10X,I3,1X,A3,1X,A1,I4,A1,3(1X,A4,2X),12X,F5.1)
REMARK 500
REMARK 500 EXPECTED VALUES: ENGH AND HUBER, 1991
REMARK 500
REMARK 500 M RES CSSEQI ATM1 ATM2 ATM3
REMARK 500 HIS A 44 N - CA - C ANGL. DEV. =-10.3 DEGREES
REMARK 500 LEU A 132 CA - CB - CG ANGL. DEV. = 12.5 DEGREES
REMARK 700
REMARK 700 SHEET
REMARK 700 DETERMINATION METHOD: AUTHOR-DETERMINED
REMARK 999
REMARK 999 SEQUENCE
REMARK 999 LEU A 135 IS NOT PRESENT IN SEQUENCE DATABASE
REMARK 999
DBREF  1C1F A    1   136  SWS    P26788   LEG_CONMY        1    135
SEQADV 1C1F LEU A  135  SWS  P26788              SEE REMARK 999
SEQRES   1 A  136  SER GLY GLY LEU GLN VAL LYS ASN PHE ASP PHE THR VAL
SEQRES   2 A  136  GLY LYS PHE LEU THR VAL GLY GLY PHE ILE ASN ASN SER
SEQRES   3 A  136  PRO GLN ARG PHE SER VAL ASN VAL GLY GLU SER MET ASN
SEQRES   4 A  136  SER LEU SER LEU HIS LEU ASP HIS ARG PHE ASN TYR GLY
SEQRES   5 A  136  ALA ASP GLN ASN THR ILE VAL MET ASN SER THR LEU LYS
SEQRES   6 A  136  GLY ASP ASN GLY TRP GLU THR GLU GLN ARG SER THR ASN
SEQRES   7 A  136  PHE THR LEU SER ALA GLY GLN TYR PHE GLU ILE THR LEU
SEQRES   8 A  136  SER TYR ASP ILE ASN LYS PHE TYR ILE ASP ILE LEU ASP
SEQRES   9 A  136  GLY PRO ASN LEU GLU PHE PRO ASN ARG TYR SER LYS GLU
SEQRES  10 A  136  PHE LEU PRO PHE LEU SER LEU ALA GLY ASP ALA ARG LEU
SEQRES  11 A  136  THR LEU VAL LYS LEU GLU
FORMUL   2  HOH   *81(H2 O1)
HELIX    1   1 GLY A   66  ASN A   68  5                                   3
SHEET    1  S1 1 GLY A   3  VAL A   6  0
SHEET    1  S2 1 PHE A 121  GLY A 126  0
SHEET    1  S3 1 ARG A  29  GLY A  35  0
SHEET    1  S4 1 LEU A  41  ASN A  50  0
SHEET    1  S5 1 GLN A  55  THR A  63  0
SHEET    1  S6 1 GLN A  74  SER A  76  0
SHEET    1  F1 1 ALA A 128  GLU A 136  0
SHEET    1  F2 1 PHE A  16  ILE A  23  0
SHEET    1  F3 1 TYR A  86  TYR A  93  0
SHEET    1  F4 1 LYS A  97  ILE A 102  0
SHEET    1  F5 1 ASN A 107  PRO A 111  0
CRYST1   94.340   36.920   40.540  90.00  90.00  90.00 P 21 21 2     4
ORIGX1      1.000000  0.000000  0.000000        0.00000
ORIGX2      0.000000  1.000000  0.000000        0.00000
ORIGX3      0.000000  0.000000  1.000000        0.00000
SCALE1      0.010600  0.000000  0.000000        0.00000
SCALE2      0.000000  0.027085  0.000000        0.00000
SCALE3      0.000000  0.000000  0.024667        0.00000
ATOM      1  N   GLY A   2       1.888  -8.251  -2.511  1.00 36.63           N
ATOM      2  CA  GLY A   2       2.571  -8.428  -1.248  1.00 33.02           C
ATOM      3  C   GLY A   2       2.586  -7.069  -0.589  1.00 30.43           C
ATOM      4  O   GLY A   2       2.833  -6.107  -1.311  1.00 33.27           O
ATOM      5  N   GLY A   3       2.302  -6.984   0.693  1.00 24.67           N
ATOM      6  CA  GLY A   3       2.176  -5.723   1.348  1.00 18.88           C
ATOM      7  C   GLY A   3       0.700  -5.426   1.526  1.00 16.58           C
ATOM      8  O   GLY A   3      -0.187  -6.142   1.010  1.00 12.47           O
ATOM      9  N   LEU A   4       0.494  -4.400   2.328  1.00 15.00           N
...
(file truncated here)
ATOM   1078  CG  GLU A 136      -0.873   9.368  16.046  1.00 38.96           C
ATOM   1079  CD  GLU A 136      -0.399   9.054  17.456  1.00 44.66           C
ATOM   1080  OE1 GLU A 136       0.789   8.749  17.641  1.00 47.97           O
ATOM   1081  OE2 GLU A 136      -1.236   9.099  18.361  1.00 47.75           O
ATOM   1082  OXT GLU A 136       0.764  12.146  12.712  1.00 26.22           O
TER    1083      GLU A 136
HETATM 1084  O   HOH   200      -1.905  -7.624   2.822  1.00 14.50           O
HETATM 1085  O   HOH   201      -8.374   7.981   9.202  1.00 20.77           O
HETATM 1086  O   HOH   202      -4.047   9.199  11.632  1.00 38.24           O
HETATM 1087  O   HOH   203       6.172  14.210   8.483  1.00 14.50           O
HETATM 1088  O   HOH   204       2.903   7.804  15.329  1.00 24.51           O
HETATM 1089  O   HOH   205      16.654   0.676  11.968  1.00 10.49           O
...
(file truncated here)
HETATM 1157  O   HOH   286       6.960  14.840  -3.025  1.00 35.59           O
HETATM 1158  O   HOH   287      -3.222  10.410   7.061  1.00 38.91           O
HETATM 1159  O   HOH   288      28.306   0.551   4.876  1.00 52.13           O
HETATM 1160  O   HOH   290      21.506 -12.424   9.751  1.00 31.68           O
HETATM 1161  O   HOH   291      12.951  10.424  -7.324  1.00 46.10           O
HETATM 1162  O   HOH   292      18.119 -15.184  14.793  1.00 56.82           O
HETATM 1163  O   HOH   293      13.501  22.220   8.216  1.00 43.30           O
HETATM 1164  O   HOH   294      13.916 -11.387   9.695  1.00 47.13           O
MASTER      240    0    0    1   11    0    0    6 1163    1    0   11
END

PDB files are long, mostly because they record information about every atom in the molecule. Even this relatively short file runs to 28 formatted pages when complete. I cut it here to a little over three pages, showing just enough of the principal sections to give you the overall idea.

The PDB web site has the basic documents you need to read and program with PDB files. The Protein Data Bank Contents Guide (http://www.rcsb.org/pdb/docs/format/pdbguide2.2/guide2.2_frame.html) is the best reference, and there are also FAQs and additional documents available.

In the following sections, you'll extract information from these files. Since the information in these files describes the 3D structure of macromolecules, the files are frequently used by graphical programs that display a spatial representation of the molecules. The scope of this book does not include graphics; however, you will see how to get spatial coordinates out of the files. The largest part of a PDB file is the set of ATOM record type lines, which contain the coordinates of the atoms. Because of this level of detail, PDB files are typically longer than GenBank records. (Note the inconsistent terminology: a unit of PDB is the file, which contains one structure; a unit of GenBank is the record, which contains one entry.)

11.3.1 PDB File Format

Let's take a look at a PDB file and the documentation that tells how the information is formatted in a PDB file. Based on that information, you'll parse the file to extract information of interest.

PDB files are composed of lines of 80 columns that begin with one of several predefined record names and end with a newline. ("Column" means position on a line: the first character is in the first column, and so forth.) Blank columns are padded with spaces. A record type is one or more lines with the same record name. Different record types have different types of fields defined within the lines. They are also grouped according to function.
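
As a minimal sketch of what this layout means in practice, the following Perl fragment takes the first six columns of each line as the record name and counts the lines per record type. The sample lines are copied from the listing above with their trailing padding removed; the variable names are only illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A few lines copied from the 1C1F listing above (trailing padding omitted).
my @lines = (
    'HEADER    SUGAR BINDING PROTEIN                   03-MAR-99   1C1F',
    'TITLE     LIGAND-FREE CONGERIN I',
    'COMPND    MOL_ID: 1;',
    'COMPND   2 MOLECULE: CONGERIN I;',
    'REMARK   2 RESOLUTION. 1.6 ANGSTROMS.',
    'END',
);

# Count how many lines belong to each record type.
my %count;
foreach my $line (@lines) {
    # The record name occupies columns 1-6; substr offsets start at 0.
    my $name = substr($line, 0, 6);
    $name =~ s/\s+$//;    # drop the padding, so "TITLE " becomes "TITLE"
    $count{$name}++;
}

foreach my $name (sort keys %count) {
    print "$name: $count{$name}\n";
}
```

Taking the record name by position rather than by splitting on whitespace matters because some record names, such as REMARK or CRYST1, fill all six columns and can be followed directly by data.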

The SEQRES record type is one of four record types in the Primary Structure Section, which presents the primary structure of the peptide or nucleotide sequence:

DBREF
    Reference to the entry in the sequence database(s)

SEQADV
    Identification of conflicts between PDB and the named sequence database

SEQRES
    Primary sequence of backbone residues

MODRES
    Identification of modifications to standard residues

The DBREF and SEQADV record types in the example PDB entry from the previous section give reference information and details on conflicts between the PDB and the original database. (The example doesn't include a MODRES record type.) Here are those record types from the entry:

DBREF  1C1F A    1   136  SWS    P26788   LEG_CONMY        1    135             
SEQADV 1C1F LEU A  135  SWS  P26788              SEE REMARK 999                 

Briefly, the DBREF line states that there's a PDB entry called 1C1F (from a file named pdb1c1f.ent), that the residues in chain A are numbered from 1 to 136 in the PDB entry, that the corresponding sequence is in the Swiss-Prot (SWS) database under the ID number P26788 and the name LEG_CONMY (in many databases these are identical), and that the residues are numbered 1 to 135 in that database. The discrepancy in the numbering between PDB and the original database is explained in the SEQADV record type, which refers you to a REMARK 999 line (included in the full listing above), where you discover that the leucine at position 135 of the PDB entry is not present in the Swiss-Prot sequence (perhaps two different groups determined the sequence, and they disagree at this point).[2]

[2] The cross-referencing to different databases is problematic in older PDB files: it may be missing, or buried somewhere in a REMARK 999 line.

As you can see, parsing the information in those two lines with a program requires several steps, such as following links to other lines in the PDB entry that further explain discrepancies, and identifying other databases.
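
The first of those steps, pulling the fields out of the DBREF line, might be sketched as follows. This is an illustration only: the column positions and field names are an assumption based on the DBREF table in the format guide, which is not reproduced in this section, although they do line up with the example line. The `trim` helper is a hypothetical convenience function.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The DBREF line from the 1C1F entry shown above.
my $dbref = 'DBREF  1C1F A    1   136  SWS    P26788   LEG_CONMY        1    135';

# Remove leading and trailing spaces from a fixed-width field.
sub trim {
    my ($field) = @_;
    $field =~ s/^\s+|\s+$//g;
    return $field;
}

# ASSUMPTION: offsets and lengths follow the DBREF column table in the
# format guide. substr offsets start at 0, so column 8 is offset 7.
my %field = (
    idCode      => trim(substr($dbref,  7,  4)),   # columns  8-11
    chainID     => trim(substr($dbref, 12,  1)),   # column  13
    seqBegin    => trim(substr($dbref, 14,  4)),   # columns 15-18
    seqEnd      => trim(substr($dbref, 20,  4)),   # columns 21-24
    database    => trim(substr($dbref, 26,  6)),   # columns 27-32
    dbAccession => trim(substr($dbref, 33,  8)),   # columns 34-41
    dbIdCode    => trim(substr($dbref, 42, 12)),   # columns 43-54
    dbseqBegin  => trim(substr($dbref, 55,  5)),   # columns 56-60
    dbseqEnd    => trim(substr($dbref, 62,  5)),   # columns 63-67
);

print "PDB entry $field{idCode}, chain $field{chainID}: ",
      "residues $field{seqBegin}-$field{seqEnd}\n";
print "$field{database} entry $field{dbAccession} ($field{dbIdCode}): ",
      "residues $field{dbseqBegin}-$field{dbseqEnd}\n";
```

Under that reading of the columns, the first pair of numbers (1 and 136) is the residue range in the PDB entry and the second pair (1 and 135) is the range in the sequence database. Following the SEQADV pointer to REMARK 999 would be a separate step.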

Links between databases are important in bioinformatics. Table 11-1 displays the databases that are referred to in PDB files. As you already know, there are many biological databases; those shown here have a good deal of protein or structural data.

Table 11-1. Databases referenced in PDB files

Database                                  PDB code
--------------------------------------------------
BioMagResBank                             BMRB
BLOCKS                                    BLOCKS
European Molecular Biology Laboratory     EMBL
GenBank                                   GB
Genome Data Base                          GDB
Nucleic Acid Database                     NDB
PROSITE                                   PROSIT
Protein Data Bank                         PDB
Protein Identification Resource           PIR
SWISS-PROT                                SWS
TREMBL                                    TREMBL
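
In a program, a table like this is naturally held in a hash keyed on the PDB code. The following is a small sketch of that idea; the hash and the `database_for_code` subroutine are hypothetical names chosen for illustration, and the entries are copied from Table 11-1.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Codes and names copied from Table 11-1.
my %database_name = (
    BMRB   => 'BioMagResBank',
    BLOCKS => 'BLOCKS',
    EMBL   => 'European Molecular Biology Laboratory',
    GB     => 'GenBank',
    GDB    => 'Genome Data Base',
    NDB    => 'Nucleic Acid Database',
    PROSIT => 'PROSITE',
    PDB    => 'Protein Data Bank',
    PIR    => 'Protein Identification Resource',
    SWS    => 'SWISS-PROT',
    TREMBL => 'TREMBL',
);

# Translate a database code found in a PDB file into a full name;
# fall back to the raw code when it is not in the table.
sub database_for_code {
    my ($code) = @_;
    return exists $database_name{$code} ? $database_name{$code} : $code;
}

print database_for_code('SWS'), "\n";    # prints "SWISS-PROT"
print database_for_code('XYZ'), "\n";    # unknown code: prints "XYZ"
```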

11.3.2 SEQRES

For starters, let's try a fairly easy task in Perl: extracting the amino acid sequence data. To extract the amino acid primary sequence information, you need to parse the record type SEQRES. Here is a SEQRES line from the PDB file listed earlier:

SEQRES   1 A  136  SER GLY GLY LEU GLN VAL LYS ASN PHE ASP PHE THR VAL

The following excerpt shows the SEQRES record type as defined in the Protein Data Bank Contents Guide. This section on SEQRES, which is a fairly simple record type, is shown in its entirety to help familiarize you with this kind of documentation.

SEQRES 
 
Overview 

SEQRES records contain the amino acid or nucleic acid sequence of residues in
each chain of the macromolecule that was studied. 

Record Format 

COLUMNS        DATA TYPE       FIELD         DEFINITION                           
---------------------------------------------------------------------------------
 1 -  6        Record name     "SEQRES"                                           

 9 - 10        Integer         serNum        Serial number of the SEQRES record   
                                             for the current chain.  Starts at 1  
                                             and increments by one each line.     
                                             Reset to 1 for each chain.           

12             Character       chainID       Chain identifier.  This may be any   
                                             single legal character, including a  
                                             blank which is used if there is      
                                             only one chain.                      

14 - 17        Integer         numRes        Number of residues in the chain.     
                                             This value is repeated on every      
                                             record.                              

20 - 22        Residue name    resName       Residue name.                        

24 - 26        Residue name    resName       Residue name.                        

28 - 30        Residue name    resName       Residue name.                        

32 - 34        Residue name    resName       Residue name.                        

36 - 38        Residue name    resName       Residue name.                        

40 - 42        Residue name    resName       Residue name.                        

44 - 46        Residue name    resName       Residue name.                        

48 - 50        Residue name    resName       Residue name.                        

52 - 54        Residue name    resName       Residue name.                        

56 - 58        Residue name    resName       Residue name.                        

60 - 62        Residue name    resName       Residue name.                        

64 - 66        Residue name    resName       Residue name.                        

68 - 70        Residue name    resName       Residue name.                        

Details 

* PDB entries use the three-letter abbreviation for amino acid names and the
  one-letter code for nucleic acids. 

* In the case of non-standard groups, a hetID of up to three (3) alphanumeric
  characters is used. Common HET names appear in the HET dictionary. 

* Each covalently contiguous sequence of residues (connected via the "backbone"
  atoms) is represented as an individual chain. 

* Heterogens which are integrated into the backbone of the chain are listed as
  being part of the chain and are included in the SEQRES records for that chain. 

* Each set of SEQRES records and each HET group is assigned a component number.
  The component number is assigned serially beginning with 1 for the first set
  of SEQRES records. This number is given explicitly in the FORMUL record, but
  only implicitly in the SEQRES record. 

* The SEQRES records must list residues present in the molecule studied, even
  if the coordinates are not present. 

* C- and N-terminus residues for which no coordinates are provided due to
  disorder must be listed on SEQRES. 

* All occurrences of standard amino or nucleic acid residues (ATOM records)
  must be listed on a SEQRES record. This implies that a numRes of 1 is valid. 

* No distinction is made between ribo- and deoxyribonucleotides in the SEQRES
  records. These residues are identified with the same residue name (i.e., A,
  C, G, T, U, I). 

* If the entire residue sequence is unknown, the serNum in column 10 is "0",
  the number of residues thought to comprise the molecule is entered as numRes
  in columns 14 - 17, and resName in columns 20 - 22 is "UNK". 

* In case of microheterogeneity, only one of the sequences is presented. A
  REMARK is generated to explain this and a SEQADV is also generated. 

Verification/Validation/Value Authority Control 

The residues presented on the SEQRES records must agree with those found in
the ATOM records. 

The SEQRES records are checked by PDB using the sequence databases and
information provided by the depositor. 

SEQRES is compared to the ATOM records during processing, and both are checked
against the sequence database. All discrepancies are either resolved or
annotated in the entry. 

Relationships to Other Record Types 

The residues presented on the SEQRES records must agree with those found in
the ATOM records. DBREF refers to the corresponding entry in the sequence
databases. SEQADV lists all discrepancies between the entry's sequence for
which there are coordinates and that referenced in the sequence database.
MODRES describes modifications to a standard residue. 

Example 

         1         2         3         4         5         6         7
1234567890123456789012345678901234567890123456789012345678901234567890
SEQRES   1 A   21  GLY ILE VAL GLU GLN CYS CYS THR SER ILE CYS SER LEU
SEQRES   2 A   21  TYR GLN LEU GLU ASN TYR CYS ASN                    
SEQRES   1 B   30  PHE VAL ASN GLN HIS LEU CYS GLY SER HIS LEU VAL GLU
SEQRES   2 B   30  ALA LEU TYR LEU VAL CYS GLY GLU ARG GLY PHE PHE TYR
SEQRES   3 B   30  THR PRO LYS ALA                                    
SEQRES   1 C   21  GLY ILE VAL GLU GLN CYS CYS THR SER ILE CYS SER LEU
SEQRES   2 C   21  TYR GLN LEU GLU ASN TYR CYS ASN                    
SEQRES   1 D   30  PHE VAL ASN GLN HIS LEU CYS GLY SER HIS LEU VAL GLU
SEQRES   2 D   30  ALA LEU TYR LEU VAL CYS GLY GLU ARG GLY PHE PHE TYR
SEQRES   3 D   30  THR PRO LYS ALA                                    

Known Problems 

Polysaccharides do not lend themselves to being represented in SEQRES. 

There is no mechanism provided to describe sequence runs when the exact
ordering of the sequence is not known. 

For cyclic peptides, PDB arbitrarily assigns a residue as the N-terminus. 

For microheterogeneity only one of the possible residues in a given position
is provided in SEQRES. 

No distinction is made between ribo- and deoxyribonucleotides in the SEQRES
records. These residues are identified with the same residue name (i.e., A,
C, G, T, U). 

The structure of the line containing the SEQRES record type is fairly straightforward, with fields assigned to specific locations or columns in the line. You'll see later how to use these locations to parse the information. Note that the documentation includes many details that arise when handling such complex experimental data.
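
To make the column layout concrete, here is a minimal sketch that pulls the documented fields out of the first SEQRES line of the 1C1F entry with `substr`. The offsets come straight from the Record Format table above (remember that `substr` counts from 0, while the table counts columns from 1); the variable names simply mirror the field names in that table.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The first SEQRES line of the 1C1F entry.
my $line = 'SEQRES   1 A  136  SER GLY GLY LEU GLN VAL LYS ASN PHE ASP PHE THR VAL';

# Fixed-width fields; column 9 of the table is offset 8 for substr.
my $serNum  = substr($line,  8, 2);   # columns  9-10
my $chainID = substr($line, 11, 1);   # column  12 (may be a blank)
my $numRes  = substr($line, 13, 4);   # columns 14-17
my $resText = substr($line, 19);      # columns 20 onward: the residue names

# The numeric fields are right-justified, so remove the leading padding.
s/^\s+// for ($serNum, $numRes);

# split ' ' breaks on runs of whitespace and ignores leading blanks.
my @residues = split ' ', $resText;

print "line $serNum of chain '$chainID', $numRes residues in the chain\n";
print scalar(@residues), " residues on this line: @residues\n";
```

Splitting the residue text on whitespace is a shortcut that works because residue names never contain spaces; a stricter version could instead step through columns 20-22, 24-26, and so on.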

Apart from the fairly standard problem of accumulating the sequence, there is the added complication of multiple chains. The documentation just shown tells you that the SEQRES record name is followed by a serial number giving the line number for that chain, and that the chain identifier is given in the next field (although in older records it was optional and may be blank). Following those fields comes a number that gives the total number of residues in the chain. Finally come the residues, represented as three-letter codes. What is needed, and what can be ignored, to meet our programming goals?
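
One possible answer, offered only as a sketch: to collect the sequence of each chain, the chain identifier and the residue names are enough, and the serial number and residue count can be skipped (though the count is handy as a sanity check). The sample lines below are the first two chains of the documentation example; the `%chain` and `@order` names are illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# SEQRES lines for two chains, copied from the documentation example above.
my @seqres = (
    'SEQRES   1 A   21  GLY ILE VAL GLU GLN CYS CYS THR SER ILE CYS SER LEU',
    'SEQRES   2 A   21  TYR GLN LEU GLU ASN TYR CYS ASN',
    'SEQRES   1 B   30  PHE VAL ASN GLN HIS LEU CYS GLY SER HIS LEU VAL GLU',
    'SEQRES   2 B   30  ALA LEU TYR LEU VAL CYS GLY GLU ARG GLY PHE PHE TYR',
    'SEQRES   3 B   30  THR PRO LYS ALA',
);

my %chain;    # chain ID => reference to a list of residue names
my @order;    # chain IDs in order of first appearance

foreach my $line (@seqres) {
    next unless $line =~ /^SEQRES/;
    my $id = substr($line, 11, 1);                 # column 12: chain identifier
    push @order, $id unless exists $chain{$id};
    # Columns 20 onward hold the residue names for this line.
    push @{ $chain{$id} }, split ' ', substr($line, 19);
}

foreach my $id (@order) {
    my @residues = @{ $chain{$id} };
    print "Chain $id (", scalar(@residues), " residues): @residues\n";
}
```

Because the chain identifier is read by position, a blank identifier in an older file simply becomes the hash key `' '` rather than shifting the other fields.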


© 2002, O'Reilly & Associates, Inc.