Safari | Beginning Perl for Bioinformatics -> 11.3 PDB Files

Beginning Perl for Bioinformatics > 11. Protein Data Bank > 11.3 PDB Files

11.3 PDB Files

Here's a section of an actual PDB file:

HEADER    SUGAR BINDING PROTEIN                   03-MAR-99   1C1F              
TITLE     LIGAND-FREE CONGERIN I                                                
COMPND    MOL_ID: 1;                                                            
COMPND   2 MOLECULE: CONGERIN I;                                                
COMPND   3 CHAIN: A;                                                            
COMPND   4 FRAGMENT: CARBOHYDRATE-RECOGNITION-DOMAIN;                           
COMPND   5 BIOLOGICAL_UNIT: HOMODIMER                                           
SOURCE    MOL_ID: 1;                                                            
SOURCE   2 ORGANISM_SCIENTIFIC: CONGER MYRIASTER;                               
SOURCE   3 ORGANISM_COMMON: CONGER EEL;                                         
SOURCE   4 TISSUE: SKIN MUCUS;                                                  
SOURCE   5 SECRETION: NON-CLASSICAL                                             
KEYWDS    GALECTIN, LECTIN, BETA-GALACTOSE-BINDING, SUGAR BINDING               
KEYWDS   2 PROTEIN                                                              
EXPDTA    X-RAY DIFFRACTION                                                     
AUTHOR    T.SHIRAI,C.MITSUYAMA,Y.NIWA,Y.MATSUI,H.HOTTA,T.YAMANE,                
AUTHOR   2 H.KAMIYA,C.ISHII,T.OGAWA,K.MURAMOTO                                  
REVDAT   2   14-OCT-99 1C1F    1       SEQADV HEADER                            
REVDAT   1   08-OCT-99 1C1F    0                                                
JRNL        AUTH   T.SHIRAI,C.MITSUYAMA,Y.NIWA,Y.MATSUI,H.HOTTA,                
JRNL        AUTH 2 T.YAMANE,H.KAMIYA,C.ISHII,T.OGAWA,K.MURAMOTO                 
JRNL        TITL   HIGH-RESOLUTION STRUCTURE OF CONGER EEL GALECTIN,            
JRNL        TITL 2 CONGERIN I, IN LACTOSE- LIGANDED AND LIGAND-FREE             
JRNL        TITL 3 FORMS: EMERGENCE OF A NEW STRUCTURE CLASS BY                 
JRNL        TITL 4 ACCELERATED EVOLUTION                                        
JRNL        REF    STRUCTURE (LONDON)            V.   7  1223 1999              
JRNL        REFN   ASTM STRUE6  UK ISSN 0969-2126                 2005          
REMARK   1                                                                      
REMARK   2                                                                      
REMARK   2 RESOLUTION. 1.6 ANGSTROMS.                                           
REMARK   3                                                                      
REMARK   3 REFINEMENT.                                                          
REMARK   3   PROGRAM     : X-PLOR 3.1                                           
REMARK   3   AUTHORS     : BRUNGER                                              
REMARK   3                                                                      
REMARK   3  DATA USED IN REFINEMENT.                                            
REMARK   3   RESOLUTION RANGE HIGH (ANGSTROMS) : 1.60                           
REMARK   3   RESOLUTION RANGE LOW  (ANGSTROMS) : 8.00                           
REMARK   3   DATA CUTOFF            (SIGMA(F)) : 3.000                          
REMARK   3   DATA CUTOFF HIGH         (ABS(F)) : NULL                           
REMARK   3   DATA CUTOFF LOW          (ABS(F)) : NULL                           
REMARK   3   COMPLETENESS (WORKING+TEST)   (%) : 85.0                           
REMARK   3   NUMBER OF REFLECTIONS             : 17099                          
REMARK   3                                                                      
REMARK   3                                                                      
REMARK   3  FIT TO DATA USED IN REFINEMENT.                                     
REMARK   3   CROSS-VALIDATION METHOD          : THROUGHOUT                      
REMARK   3   FREE R VALUE TEST SET SELECTION  : RANDOM                          
REMARK   3   R VALUE            (WORKING SET) : 0.201                           
REMARK   3   FREE R VALUE                     : 0.247                           
REMARK   3   FREE R VALUE TEST SET SIZE   (%) : 5.000                           
REMARK   3   FREE R VALUE TEST SET COUNT      : 855                             
REMARK   3   ESTIMATED ERROR OF FREE R VALUE  : NULL                            
REMARK   3                                                                      
... (file truncated here)
REMARK   4                                                                      
REMARK   4 1C1F COMPLIES WITH FORMAT V. 2.3, 09-JULY-1998                       
REMARK   7                                                                      
REMARK   7 >>> WARNING: CHECK REMARK 999 CAREFULLY                              
REMARK   8                                                                      
REMARK   8 SIDE-CHAINS OF SER123 AND LEU124 ARE MODELED AS ALTERNATIVE          
REMARK   8 CONFORMERS.                                                          
REMARK   9                                                                      
REMARK   9 SER1 IS ACETYLATED.                                                  
REMARK  10                                                                      
REMARK  10 TER                                                                  
REMARK  10  SER: THE N-TERMINAL RESIDUE WAS NOT OBSERVED                        
REMARK 100                                                                      
REMARK 100 THIS ENTRY HAS BEEN PROCESSED BY RCSB ON 07-MAR-1999.                
REMARK 100 THE RCSB ID CODE IS RCSB000566.                                      
REMARK 200                                                                      
REMARK 200 EXPERIMENTAL DETAILS                                                 
REMARK 200  EXPERIMENT TYPE                : X-RAY DIFFRACTION                  
REMARK 200  DATE OF DATA COLLECTION        : NULL                               
REMARK 200  TEMPERATURE           (KELVIN) : 291.0                              
REMARK 200  PH                             : 9.00                               
REMARK 200  NUMBER OF CRYSTALS USED        : 1                                  
REMARK 200                                                                      
REMARK 200  SYNCHROTRON              (Y/N) : Y                                  
REMARK 200  RADIATION SOURCE               : PHOTON FACTORY                     
REMARK 200  BEAMLINE                       : BL6A                               
REMARK 200  X-RAY GENERATOR MODEL          : NULL                               
REMARK 200  MONOCHROMATIC OR LAUE    (M/L) : M                                  
REMARK 200  WAVELENGTH OR RANGE        (A) : 1.00                               
REMARK 200  MONOCHROMATOR                  : NULL                               
REMARK 200  OPTICS                         : NULL                               
REMARK 200                                                                      
... (file truncated here)
REMARK 500                                                                      
REMARK 500 GEOMETRY AND STEREOCHEMISTRY                                         
REMARK 500 SUBTOPIC: COVALENT BOND ANGLES                                       
REMARK 500                                                                      
REMARK 500 THE STEREOCHEMICAL PARAMETERS OF THE FOLLOWING RESIDUES              
REMARK 500 HAVE VALUES WHICH DEVIATE FROM EXPECTED VALUES BY MORE               
REMARK 500 THAN 4*RMSD (M=MODEL NUMBER; RES=RESIDUE NAME; C=CHAIN               
REMARK 500 IDENTIFIER; SSEQ=SEQUENCE NUMBER; I=INSERTION CODE).                 
REMARK 500                                                                      
REMARK 500 STANDARD TABLE:                                                      
REMARK 500 FORMAT: (10X,I3,1X,A3,1X,A1,I4,A1,3(1X,A4,2X),12X,F5.1)              
REMARK 500                                                                      
REMARK 500 EXPECTED VALUES: ENGH AND HUBER, 1991                                
REMARK 500                                                                      
REMARK 500  M RES CSSEQI ATM1   ATM2   ATM3                                     
REMARK 500    HIS A  44   N   -  CA  -  C   ANGL. DEV. =-10.3 DEGREES           
REMARK 500    LEU A 132   CA  -  CB  -  CG  ANGL. DEV. = 12.5 DEGREES           
REMARK 700                                                                      
REMARK 700 SHEET                                                                
REMARK 700 DETERMINATION METHOD: AUTHOR-DETERMINED                              
REMARK 999                                                                      
REMARK 999 SEQUENCE                                                             
REMARK 999 LEU A 135 IS NOT PRESENT IN SEQUENCE DATABASE                        
REMARK 999                                                                      
DBREF  1C1F A    1   136  SWS    P26788   LEG_CONMY        1    135             
SEQADV 1C1F LEU A  135  SWS  P26788              SEE REMARK 999                 
SEQRES   1 A  136  SER GLY GLY LEU GLN VAL LYS ASN PHE ASP PHE THR VAL          
SEQRES   2 A  136  GLY LYS PHE LEU THR VAL GLY GLY PHE ILE ASN ASN SER          
SEQRES   3 A  136  PRO GLN ARG PHE SER VAL ASN VAL GLY GLU SER MET ASN          
SEQRES   4 A  136  SER LEU SER LEU HIS LEU ASP HIS ARG PHE ASN TYR GLY          
SEQRES   5 A  136  ALA ASP GLN ASN THR ILE VAL MET ASN SER THR LEU LYS          
SEQRES   6 A  136  GLY ASP ASN GLY TRP GLU THR GLU GLN ARG SER THR ASN          
SEQRES   7 A  136  PHE THR LEU SER ALA GLY GLN TYR PHE GLU ILE THR LEU          
SEQRES   8 A  136  SER TYR ASP ILE ASN LYS PHE TYR ILE ASP ILE LEU ASP          
SEQRES   9 A  136  GLY PRO ASN LEU GLU PHE PRO ASN ARG TYR SER LYS GLU          
SEQRES  10 A  136  PHE LEU PRO PHE LEU SER LEU ALA GLY ASP ALA ARG LEU          
SEQRES  11 A  136  THR LEU VAL LYS LEU GLU                                      
FORMUL   2  HOH   *81(H2 O1)                                                    
HELIX    1   1 GLY A   66  ASN A   68  5                                   3    
SHEET    1  S1 1 GLY A   3  VAL A   6  0                                        
SHEET    1  S2 1 PHE A 121  GLY A 126  0                                        
SHEET    1  S3 1 ARG A  29  GLY A  35  0                                        
SHEET    1  S4 1 LEU A  41  ASN A  50  0                                        
SHEET    1  S5 1 GLN A  55  THR A  63  0                                        
SHEET    1  S6 1 GLN A  74  SER A  76  0                                        
SHEET    1  F1 1 ALA A 128  GLU A 136  0                                        
SHEET    1  F2 1 PHE A  16  ILE A  23  0                                        
SHEET    1  F3 1 TYR A  86  TYR A  93  0                                        
SHEET    1  F4 1 LYS A  97  ILE A 102  0                                        
SHEET    1  F5 1 ASN A 107  PRO A 111  0                                        
CRYST1   94.340   36.920   40.540  90.00  90.00  90.00 P 21 21 2     4          
ORIGX1      1.000000  0.000000  0.000000        0.00000                         
ORIGX2      0.000000  1.000000  0.000000        0.00000                         
ORIGX3      0.000000  0.000000  1.000000        0.00000                         
SCALE1      0.010600  0.000000  0.000000        0.00000                         
SCALE2      0.000000  0.027085  0.000000        0.00000                         
SCALE3      0.000000  0.000000  0.024667        0.00000                         
ATOM      1  N   GLY A   2       1.888  -8.251  -2.511  1.00 36.63           N  
ATOM      2  CA  GLY A   2       2.571  -8.428  -1.248  1.00 33.02           C  
ATOM      3  C   GLY A   2       2.586  -7.069  -0.589  1.00 30.43           C  
ATOM      4  O   GLY A   2       2.833  -6.107  -1.311  1.00 33.27           O  
ATOM      5  N   GLY A   3       2.302  -6.984   0.693  1.00 24.67           N  
ATOM      6  CA  GLY A   3       2.176  -5.723   1.348  1.00 18.88           C  
ATOM      7  C   GLY A   3       0.700  -5.426   1.526  1.00 16.58           C  
ATOM      8  O   GLY A   3      -0.187  -6.142   1.010  1.00 12.47           O  
ATOM      9  N   LEU A   4       0.494  -4.400   2.328  1.00 15.00           N  
... (file truncated here)
ATOM   1078  CG  GLU A 136      -0.873   9.368  16.046  1.00 38.96           C  
ATOM   1079  CD  GLU A 136      -0.399   9.054  17.456  1.00 44.66           C  
ATOM   1080  OE1 GLU A 136       0.789   8.749  17.641  1.00 47.97           O  
ATOM   1081  OE2 GLU A 136      -1.236   9.099  18.361  1.00 47.75           O  
ATOM   1082  OXT GLU A 136       0.764  12.146  12.712  1.00 26.22           O  
TER    1083      GLU A 136                                                      
HETATM 1084  O   HOH   200      -1.905  -7.624   2.822  1.00 14.50           O  
HETATM 1085  O   HOH   201      -8.374   7.981   9.202  1.00 20.77           O  
HETATM 1086  O   HOH   202      -4.047   9.199  11.632  1.00 38.24           O  
HETATM 1087  O   HOH   203       6.172  14.210   8.483  1.00 14.50           O  
HETATM 1088  O   HOH   204       2.903   7.804  15.329  1.00 24.51           O  
HETATM 1089  O   HOH   205      16.654   0.676  11.968  1.00 10.49           O  
... (file truncated here)
HETATM 1157  O   HOH   286       6.960  14.840  -3.025  1.00 35.59           O  
HETATM 1158  O   HOH   287      -3.222  10.410   7.061  1.00 38.91           O  
HETATM 1159  O   HOH   288      28.306   0.551   4.876  1.00 52.13           O  
HETATM 1160  O   HOH   290      21.506 -12.424   9.751  1.00 31.68           O  
HETATM 1161  O   HOH   291      12.951  10.424  -7.324  1.00 46.10           O  
HETATM 1162  O   HOH   292      18.119 -15.184  14.793  1.00 56.82           O  
HETATM 1163  O   HOH   293      13.501  22.220   8.216  1.00 43.30           O  
HETATM 1164  O   HOH   294      13.916 -11.387   9.695  1.00 47.13           O  
MASTER      240    0    0    1   11    0    0    6 1163    1    0   11          
END

PDB files are long, mostly due to the need for information about each atom in the molecule; this relatively short one, when complete, is extensive—28 formatted pages. I cut it here to a little over three pages, showing just enough of the principal sections to give you the overall idea.

The PDB web site has the basic documents you need to read and program with PDB files. The Protein Data Bank Contents Guide (http://www.rcsb.org/pdb/docs/format/pdbguide2.2/guide2.2_frame.html) is the best reference, and there are also FAQs and additional documents available.

In the following sections, you'll extract information from these files. Since the information in these files describes the 3D structure of macromolecules, the files are frequently used by graphical programs that display a spatial representation of the molecules. The scope of this book does not include graphics; however, you will see how to get spatial coordinates out of the files. The largest part of PDB files are the ATOM record type lines containing the coordinates of the atoms. Because of this level of detail, PDB files are typically longer than GenBank records. (Note the inconsistent terminology—a unit of PDB is the file, which contains one structure; a unit of GenBank is the record, which contains one entry.)

11.3.1 PDB File Format

Let's take a look at a PDB file and the documentation that tells how the information is formatted in a PDB file. Based on that information, you'll parse the file to extract information of interest.

PDB files are composed of lines of 80 columns that begin with one of several predefined record names and end with a newline. ("Column" means position on a line: the first character is in the first column, and so forth.) Blank columns are padded with spaces. A record type is one or more lines with the same record name. Different record types have different types of fields defined within the lines. They are also grouped according to function.

The SEQRES record type is one of four record types in the Primary Structure Section, which presents the primary structure of the peptide or nucleotide sequence:

DBREF: Reference to the entry in the sequence database(s)
SEQADV: Identification of conflicts between PDB and the named sequence database
SEQRES: Primary sequence of backbone residues
MODRES: Identification of modifications to standard residues

The DBREF and SEQADV record types in the example PDB entry from the previous section give reference information and details on conflicts between the PDB and the original database. (The example doesn't include a MODRES record type.) Here are those record types from the entry:

DBREF  1C1F A    1   136  SWS    P26788   LEG_CONMY        1    135             
SEQADV 1C1F LEU A  135  SWS  P26788              SEE REMARK 999

Briefly, the DBREF line states there's a PDB file called 1C1F (from a file named pdb1c1f.ent), the residues in chain A are numbered from 1 to 136 in the original Swiss-Prot (SWS) database, the ID number P26788 and the name LEG_CONMY are assigned in that database (in many databases these are identical), and the residues are numbered 1 to 135 in PDB. The discrepancy in the numbering between the original database and PDB is explained in the SEQADV record type, which refers you to a REMARK 999 line (not shown here) where you discover that the PDB entry disagrees with the Swiss-Prot sequence concerning a leucine at position 135 (perhaps two different groups determined the structure, and they disagree at this point).^[2]

^[2] The cross-referencing to different databases is problematic in older PDB files: it may be missing, or buried somewhere in a REMARK 999 line.

You can see that to parse the information in those two lines by a program requires several steps, such as following links to other lines in the PDB entry that further explain discrepancies and identifying other databases.

Links between databases are important in bioinformatics. Table 11-1 displays the databases that are referred to in PDB files. As you already know, there are many biological databases; those shown here have a good deal of protein or structural data.

Table 11-1. Databases referenced in PDB files

Database

PDB code

BioMagResBank

BMRB

BLOCKS

BLOCKS

European Molecular Biology Laboratory

EMBL

GenBank

GB

Genome Data Base

GDB

Nucleic Acid Database

NDB

PROSITE

PROSIT

Protein Data Bank

PDB

Protein Identification Resource

PIR

SWISS-PROT

SWS

TREMBL

TREMBL

11.3.2 SEQRES

For starters, let's try a fairly easy task in Perl: extracting the amino acid sequence data. To extract the amino acid primary sequence information, you need to parse the record type SEQRES. Here is a SEQRES line from the PDB file listed earlier:

SEQRES   1 A  136  SER GLY GLY LEU GLN VAL LYS ASN PHE ASP PHE THR VAL

The following code shows the SEQRES record type as defined in the Protein Data Bank Contents Guide. This section on SEQRES, which is a fairly simple record type, is shown in its entirely to help familiarize you with this kind of documentation.

SEQRES 
 
Overview 

SEQRES records contain the amino acid or nucleic acid sequence of residues in
each chain of the 
macromolecule that was studied. 

Record Format 

COLUMNS        DATA TYPE       FIELD         DEFINITION                           
---------------------------------------------------------------------------------
 1 -  6        Record name     "SEQRES"                                           

 9 - 10        Integer         serNum        Serial number of the SEQRES record   
                                             for the current chain.  Starts at 1  
                                             and increments by one each line.     
                                             Reset to 1 for each chain.           

12             Character       chainID       Chain identifier.  This may be any   
                                             single legal character, including a  
                                             blank which is used if there is      
                                             only one chain.                      

14 - 17        Integer         numRes        Number of residues in the chain.     
                                             This value is repeated on every      
                                             record.                              

20 - 22        Residue name    resName       Residue name.                        

24 - 26        Residue name    resName       Residue name.                        

28 - 30        Residue name    resName       Residue name.                        

32 - 34        Residue name    resName       Residue name.                        

36 - 38        Residue name    resName       Residue name.                        

40 - 42        Residue name    resName       Residue name.                        

44 - 46        Residue name    resName       Residue name.                        

48 - 50        Residue name    resName       Residue name.                        

52 - 54        Residue name    resName       Residue name.                        

56 - 58        Residue name    resName       Residue name.                        

60 - 62        Residue name    resName       Residue name.                        

64 - 66        Residue name    resName       Residue name.                        

68 - 70        Residue name    resName       Residue name.                        

Details 

* PDB entries use the three-letter abbreviation for amino acid names and the
  one-letter code for nucleic acids. 

* In the case of non-standard groups, a hetID of up to three (3) alphanumeric
  characters is used. Common HET names appear in the HET dictionary. 

* Each covalently contiguous sequence of residues (connected via the "backbone"
  atoms) is represented as an individual chain. 

* Heterogens which are integrated into the backbone of the chain are listed as
  being part of the chain and are included in the SEQRES records for that chain. 

* Each set of SEQRES records and each HET group is assigned a component number.
  The component number is assigned serially beginning with 1 for the first set
  of SEQRES records. This number is given explicitly in the FORMUL record, but
  only implicitly in the SEQRES record. 

* The SEQRES records must list residues present in the molecule studied, even
  if the coordinates are not present. 

* C- and N-terminus residues for which no coordinates are provided due to
  disorder must be listed on SEQRES. 

* All occurrences of standard amino or nucleic acid residues (ATOM records)
  must be listed on a SEQRES record. This implies that a numRes of 1 is valid. 

* No distinction is made between ribo- and deoxyribonucleotides in the SEQRES
  records. These residues are identified with the same residue name (i.e., A,
  C, G, T, U, I). 

* If the entire residue sequence is unknown, the serNum in column 10 is "0",
  the number of residues thought to comprise the molecule is entered as numRes
  in columns 14 - 17, and resName in columns 20 - 22 is "UNK". 

* In case of microheterogeneity, only one of the sequences is presented. A
  REMARK is generated to explain this and a SEQADV is also generated. 

Verification/Validation/Value Authority Control 

The residues presented on the SEQRES records must agree with those found in
the ATOM records. 

The SEQRES records are checked by PDB using the sequence databases and
information provided by the depositor. 

SEQRES is compared to the ATOM records during processing, and both are checked
against the sequence database. All discrepancies are either resolved or
annotated in the entry. 

Relationships to Other Record Types 

The residues presented on the SEQRES records must agree with those found in
the ATOM records. DBREF refers to the corresponding entry in the sequence
databases. SEQADV lists all discrepancies between the entry's sequence for
which there are coordinates and that referenced in the sequence database.
MODRES describes modifications to a standard residue. 

Example 

         1         2         3         4         5         6         7
1234567890123456789012345678901234567890123456789012345678901234567890
SEQRES   1 A   21  GLY ILE VAL GLU GLN CYS CYS THR SER ILE CYS SER LEU
SEQRES   2 A   21  TYR GLN LEU GLU ASN TYR CYS ASN                    
SEQRES   1 B   30  PHE VAL ASN GLN HIS LEU CYS GLY SER HIS LEU VAL GLU
SEQRES   2 B   30  ALA LEU TYR LEU VAL CYS GLY GLU ARG GLY PHE PHE TYR
SEQRES   3 B   30  THR PRO LYS ALA                                    
SEQRES   1 C   21  GLY ILE VAL GLU GLN CYS CYS THR SER ILE CYS SER LEU
SEQRES   2 C   21  TYR GLN LEU GLU ASN TYR CYS ASN                    
SEQRES   1 D   30  PHE VAL ASN GLN HIS LEU CYS GLY SER HIS LEU VAL GLU
SEQRES   2 D   30  ALA LEU TYR LEU VAL CYS GLY GLU ARG GLY PHE PHE TYR
SEQRES   3 D   30  THR PRO LYS ALA                                    

Known Problems 

Polysaccharides do not lend themselves to being represented in SEQRES. 

There is no mechanism provided to describe sequence runs when the exact
ordering of the sequence is not known. 

For cyclic peptides, PDB arbitrarily assigns a residue as the N-terminus. 

For microheterogeneity only one of the possible residues in a given position
is provided in SEQRES. 

No distinction is made between ribo- and deoxyribonucleotides in the SEQRES
records. These residues are identified with the same residue name (i.e., A,
C, G, T, U).

The structure of the line containing the SEQRES record type is fairly straightforward, with fields assigned to specific locations or columns in the line. You'll see later how to use these locations to parse the information. Note that the documentation includes many details that arise when handling such complex experimental data.

Apart from the fairly standard problem of accumulating the sequence, there is the added complication of multiple strands. By reading the documentation just shown, you'll see that the SEQRES identifier is followed by a number representing the line number for that chain, and the chain is given in the next field (although in older records it was optional and may be blank). Following those fields comes a number that gives the total number of residues in the chain. Finally, after that, come residues represented as three-letter codes. What is needed, and what can be ignored to meet our programming goals?

< BACK

CONTINUE >

Index terms contained in this section

backbone residues, primary sequence of
columns, PDB files
databases
     PDB files
            record types
            referenced in
DBREF record type, PDB files
      example
files
     PDB
            format of
macromolecules
      residues, amino acid or nucleic acid sequence of
MODRES record type, PDB files
names
      PDB file records
nucleotides
      primary structure of
parsing
     PDB files
            SEQRES record
peptides
      primary structure of
primary structure, proteins
Protein Data Bank (PDB)
     files
            format of
            SEQRES record, parsing
records, PDB files
      names
      types
references
      to databases in PDB files
      DBREF record type, PDB files
residues in macromolecule chains
      amino acid or nucleic acid sequence of
SEQADV record type, PDB files
      example
SEQRES record type, PDB files
      parsing
standard residues, modifications to
structure
      primary, of peptide or nucleotide sequence