< BACKCONTINUE >

11.4 Parsing PDB Files

First, Example 11-5 shows the main program and three subroutines that will be discussed in this section.

Example 11-5. Extract sequence chains from PDB file
#!/usr/bin/perl
#  Extract sequence chains from PDB file

use strict;
use warnings;
use BeginPerlBioinfo;     # see Chapter 6 about this module

# Read in PDB file:  Warning - some files are very large!
my @file = get_file_data('pdb/c1/pdb1c1f.ent');

# Parse the record types of the PDB file
my %recordtypes = parsePDBrecordtypes(@file);

# Extract the amino acid sequences of all chains in the protein
my @chains = extractSEQRES( $recordtypes{'SEQRES'} );

# Translate the 3-character codes to 1-character codes, and print
foreach my $chain (@chains) {
    print "****chain $chain **** \n";
    print "$chain\n";
    print iub3to1($chain), "\n";
}

exit;

################################################################################
# Subroutines for Example 11-5
################################################################################

# parsePDBrecordtypes
#
#--given an array of a PDB file, return a hash with
#    keys   = record type names
#    values = scalar containing lines for that record type 

sub parsePDBrecordtypes {

    my @file = @_;

    use strict;
    use warnings;
    
    my %recordtypes = (  );
    
    foreach my $line (@file) {
    
        # Get the record type name which begins at the
        # start of the line and ends at the first space
        my($recordtype) = ($line =~ /^(\S+)/);
    
        # .= fails if a key is undefined, so we have to
        # test for definition and use either .= or = depending
        if(defined $recordtypes{$recordtype} ) {
            $recordtypes{$recordtype} .= $line;
        }else{
            $recordtypes{$recordtype} = $line;
        }
    }
    
    return %recordtypes;
}

# extractSEQRES
#
#--given an scalar containing SEQRES lines,
#    return an array containing the chains of the sequence

sub extractSEQRES {

    use strict;
    use warnings;

    my($seqres) = @_;

    my $lastchain = '';
    my $sequence = '';
    my @results = (  );

    # make array of lines
    my @record = split ( /\n/, $seqres);
    
    foreach my $line (@record) {
        # Chain is in column 12, residues start in column 20
        my ($thischain) = substr($line, 11, 1);
        my($residues)  = substr($line, 19, 52); # add space at end
    
        # Check if a new chain, or continuation of previous chain
        if("$lastchain" eq "") {
            $sequence = $residues;
        }elsif("$thischain" eq "$lastchain") {
            $sequence .= $residues;
    
        # Finish gathering previous chain (unless first record)
        }elsif ( $sequence ) {
            push(@results, $sequence);
            $sequence = $residues;
        }
        $lastchain = $thischain;
    }

    # save last chain
    push(@results, $sequence);
    
    return @results;
}

# iub3to1
#
#--change string of 3-character IUB amino acid codes (whitespace separated)
#    into a string of 1-character amino acid codes

sub iub3to1 {

    my($input) = @_;
    
    my %three2one = (
      'ALA' => 'A',
      'VAL' => 'V',
      'LEU' => 'L',
      'ILE' => 'I',
      'PRO' => 'P',
      'TRP' => 'W',
      'PHE' => 'F',
      'MET' => 'M',
      'GLY' => 'G',
      'SER' => 'S',
      'THR' => 'T',
      'TYR' => 'Y',
      'CYS' => 'C',
      'ASN' => 'N',
      'GLN' => 'Q',
      'LYS' => 'K',
      'ARG' => 'R',
      'HIS' => 'H',
      'ASP' => 'D',
      'GLU' => 'E',
    );

    # clean up the input
    $input =~ s/\n/ /g;

    my $seq = '';
    
    # This use of split separates on any contiguous whitespace
    my @code3 = split(' ', $input);

    foreach my $code (@code3) {
        # A little error checking
        if(not defined $three2one{$code}) {
            print "Code $code not defined\n";
            next;
        }
        $seq .= $three2one{$code};
    }
    return $seq;
}

It's important to note that the main program, which calls the subroutine get_file_data to read in the PDB file, has included a warning about the potentially large size of any given PDB file. (For instance, the PDB file 1gav weighs in at 3.45 MB.) Plus, the main program follows the reading in of the entire file, with the subroutine parsePDBrecordtypes that makes copies of all lines in the input file, separated by record type. At this point, the running program is using twice the amount of memory as the size of the file. This design has the advantage of clarity and modularity, but it can cause problems if main memory is in short supply. The use of memory can be lessened by not saving the results of reading in the file, but instead passing the file data directly to the parsePDBrecordtypes subroutine, like so:

# Get the file data and parse the record types of the PDB file
%recordtypes = parsePDBrecordtypes(get_file_data('pdb/c1/pdb1c1f.ent'));

Further savings of memory are possible. For instance, you can rewrite the program to just read the file one line at a time while parsing the data into the record types. I point out these considerations to give you an idea of the kinds of choices that are practically important in processing large files. However, let's stick with this design for now. It may be expensive in terms of memory, but it's very clear in terms of overall program structure.

In Chapter 10, I demonstrated two ways to parse GenBank files into sequence and annotation and then how to parse the annotation into finer and finer levels of detail.

The first method involved iterating through an array of the lines of the record. Recall that due to the structure of multiline fields, it was necessary to set flags while iterating to keep track of which field the input line was in.[3]

[3] In GenBank, the multiline information sets were called fields; in PDB, they're called record types. Just as in biology different researchers may use their own terminology for some structures or concepts, so too in computer science there can be a certain creativity in terminology. This is one of the interesting difficulties in integrating biological data sources.

The other method, which worked better for GenBank files, involved regular expressions. Which method will work best for PDB files? (Or should you settle on a third approach?)

There are several ways to extract this information. PDB makes it easy to collect record types, since they all start with the same keyword at the beginning of the line. The technique in the last chapter that used regular expressions parsed the top-level fields of the file; this would be somewhat unwieldy for PDB files. (See the exercises at the end of the chapter.) For instance, a regular expression such as the following matches all adjacent SEQRES lines into a scalar string:

$record =~ /SEQRES.*\n(SEQRES.*\n)*/;
$seqres = $&;

The regular expression matches a single SEQRES line with SEQRES.*\n and then matches zero or more additional lines with (SEQRES.*\n)*. Notice how the final * indicates zero or more of the preceding item, namely, the parenthesized expression (SEQRES.*\n). Also note that the .* matches zero or more nonnewline characters. Finally, the second line captures the pattern matched, denoted by $&, into the variable $seqres.

To extend this to capture all record types, see the exercises at the end of the chapter.

For PDB files, each line starts with a keyword that explicitly states to which record type that line belongs. You will find in the documentation that each record type has all its lines adjacent to each other in a group. In this case, it seems that simply iterating through the lines and collecting the record types would be the simplest programming approach.

Example 11-5 contains a subroutine parsePDBrecordtypes that parses the PDB record types from an array containing the lines of the PDB record. This is a short, clean subroutine that accomplishes what is needed. The comments describe what's happening pretty well, which, as you know, is a critical factor in writing good code. Basically, each line is examined for its record type and is then added to the value of a hash entry with the record type as the key. The hash is returned from the subroutine.

11.4.1 Extracting Primary Sequence

Let's examine the subroutine extractSEQRES , now that the record types have been parsed out, and extract the primary amino acid sequence.

You need to extract each chain separately and return an array of one or more strings of sequence corresponding to those chains, instead of just one sequence.

The previous parse, in Example 11-4, left the required SEQRES record type, which stretches over several lines, in a scalar string that is the value of the key 'SEQRES' in a hash. Our success with the previous parsePDBrecordtypes subroutine that used iteration over lines (as opposed to regular expressions over multiline strings) leads to the same approach here. The split Perl function enables you to turn a multiline string into an array.

As you iterate through the lines of the SEQRES record type, notice when a new chain is starting, save the previous chain in @results, reset the $sequence array, and reset the $lastchain flag to the new chain. Also, when done with all the lines, make sure to save the last sequence chain in the @results array.

Also notice (and verify by exploring the Perl documentation for the function) that split, with the arguments you gave it, does what you want.

The third and final subroutine of Example 11-5 is called iub3to1 . Since in PDB the sequence information is in three-character codes, you need this subroutine to change those sequences into one-character codes. It uses a straightforward hash lookup to perform the translation.

We've now decomposed the problem into a few complementary subroutines. It's always interesting as to how to best divide a problem into cooperating subroutines. You can put the call to iub3to1 inside the extractSEQRES subroutine; that might be a cleaner way to package these subroutines together, since, outside the PDB file format, you won't have use for the strings of amino acids in three-character codes.

The important observation at this juncture is to point out that a few short subroutines, tied together with a very short main program, were sufficient to do a great deal of parsing of PDB files.

11.4.2 Finding Atomic Coordinates

So far, I've tried not to give more than a very brief overview of protein structure. However, in parsing PDB files, you will be faced with a great deal of detailed information about the structures and the experimental conditions under which they were determined. I will now present a short program that extracts the coordinates of atoms in a PDB file. I don't cover the whole story: for that, you will want to read the PDB documentation in detail and consult texts on protein structure, X-ray crystallography, and NMR techniques.

That said, let's extract the coordinates from the ATOM record type. ATOM record types are the most numerous of the several record types that deal with atomic-coordinate data: MODEL, ATOM, SIGATM, ANISOU, SIGUIJ, TER, HETATM, and ENDMDL. There are also several record types that handle coordinate transformation: ORIGXn, SCALEn, MTRIXn, and TVECT.

Here is part of the PDB documentation that shows the field definitions of each ATOM record:

ATOM 

Overview 

The ATOM records present the atomic coordinates for standard residues.
They also present the occupancy and temperature factor for each atom.
Heterogen coordinates use the HETATM record type. The element symbol
is always present on each ATOM record; segment identifier and charge
are optional. 

Record Format 

COLUMNS        DATA TYPE       FIELD         DEFINITION                            
---------------------------------------------------------------------------------
 1 -  6        Record name     "ATOM  "                                            

 7 - 11        Integer         serial        Atom serial number.                   

13 - 16        Atom            name          Atom name.                            

17             Character       altLoc        Alternate location indicator.         

18 - 20        Residue name    resName       Residue name.                         

22             Character       chainID       Chain identifier.                     

23 - 26        Integer         resSeq        Residue sequence number.              

27             AChar           iCode         Code for insertion of residues.       

31 - 38        Real(8.3)       x             Orthogonal coordinates for X in       
                                             Angstroms.                       

39 - 46        Real(8.3)       y             Orthogonal coordinates for Y in       
                                             Angstroms.                            

47 - 54        Real(8.3)       z             Orthogonal coordinates for Z in       
                                             Angstroms.                            

55 - 60        Real(6.2)       occupancy     Occupancy.                            

61 - 66        Real(6.2)       tempFactor    Temperature factor.                   

73 - 76        LString(4)      segID         Segment identifier, left-justified.   

77 - 78        LString(2)      element       Element symbol, right-justified.      

79 - 80        LString(2)      charge        Charge on the atom.       

Here is a typical ATOM line:

ATOM      1  N   GLY A   2       1.888  -8.251  -2.511  1.00 36.63           N  

Let's do something fairly simple: let's extract all x, y, and z coordinates for each atom, plus the serial number (a unique integer for each atom in the molecule) and the element symbol. Example 11-6 is a subroutine that accomplishes that, with a main program to exercise the subroutine.

Example 11-6. Extract atomic coordinates from PDB file
#!/usr/bin/perl
#  Extract atomic coordinates from PDB file

use strict;
use warnings;
use BeginPerlBioinfo;     # see Chapter 6 about this module

# Read in PDB file
my @file = get_file_data('pdb/c1/pdb1c1f.ent');

# Parse the record types of the PDB file
my %recordtypes = parsePDBrecordtypes(@file);

# Extract the atoms of all chains in the protein
my %atoms = parseATOM ( $recordtypes{'ATOM'} );

# Print out a couple of the atoms
print $atoms{'1'}, "\n";
print $atoms{'1078'}, "\n";

exit;

################################################################################
# Subroutines of Example 11-6
################################################################################

# parseATOM
#
# --extract x, y, and z coordinates, serial number and element symbol
#     from PDB ATOM record type
#      Return a hash with key=serial number, value=coordinates in a string

sub parseATOM {

    my($atomrecord) = @_;

    use strict;
    use warnings;

    my %results = (  );

    # Turn the scalar into an array of ATOM lines
    my(@atomrecord) = split(/\n/, $atomrecord);

    foreach my $record (@atomrecord) {
       my $number  = substr($record,  6, 5);  # columns 7-11
       my $x       = substr($record, 30, 8);  # columns 31-38
       my $y       = substr($record, 38, 8);  # columns 39-46
       my $z       = substr($record, 46, 8);  # columns 47-54
       my $element = substr($record, 76, 2);  # columns 77-78

        # $number and $element may have leading spaces: strip them
        $number =~ s/^\s*//;
        $element =~ s/^\s*//;

        # Store information in hash
        $results{$number} = "$x $y $z $element";
    }

    # Return the hash
    return %results;
}

The parseATOM subroutine is quite short: the strict format of these ATOM records makes parsing the information quite straightforward. You first split the scalar argument, which contains the ATOM lines, into an array of lines.

Then, for each line, use the substr function to extract the specific columns of the line that contains the needed data: the serial number of the atom; the x, y, and z coordinates; and the element symbol.

Finally, save the results by making a hash with keys equal to the serial numbers and values set to strings containing the other four relevant fields. Now, this may not always be the most convenient way to return the data. For one thing, hashes are not sorted on the keys, so that would need to be an additional step if you had to sort the atoms by serial number. In particular, an array is a logical choice to store information sorted by serial number. Or, it could be that what you really want is to find all the metals, in which case, another data structure would be suggested. Nevertheless, this short subroutine shows one way to find and report information.

It often happens that what you really need is a reformatting of the data for use by another program. Using the technique of this subroutine, you can see how to extract the needed data and add a print statement that formats the data into the desired form. Take a look at the printf and sprintf functions to get more specific control over the format. For real heavy-duty formatting, there's the format function, which merits its own chapter in O'Reilly's comprehensive Programming Perl. (See also Chapter 12 and Appendix B of this book.)

Here's the output from Example 11-6:

 1.888   -8.251   -2.511  N
18.955  -10.180   10.777  C

You can now extract at least the major portion of the atomic coordinates from a PDB file. Again, notice the good news: it doesn't take a long or particularly complex program to do what is needed.

This program has been designed so that its parts can be used in the future to work well for other purposes. You parse all record types, for instance, not only the ATOM record types. Let's take a look at a very short program that just parses the ATOM record type lines from an input file; by targeting only this one problem, you can write a much shorter program. Here's the program:

while(<>) {
        /^ATOM/ or next;

        my($n, $x, $y, $z, $element)
           = ($_ =~ /^.{6}(.{5}).{19}(.{8})(.{8})(.{8}).{22}(..)/);

        # $n and $element may have leading spaces: strip them
        $n      =~ s/^\s*//;
        $element =~ s/^\s*//;

        if (($n == 1) or ($n == 1078)) {
                printf "%8.3f%8.3f%8.3f %2s\n", $x, $y, $z, $element;
        }
}

For each line, a regular-expression match extracts just the needed information. Recall that a regular expression that contains parentheses metacharacters returns an array whose elements are the parts of the string that matched within the parentheses. You assign the five variables $number, $x, $y, $z, and $element to the substrings from these matched parentheses.

The actual regular expression is simply using dots and the quantifier operator .{num} to stand for num characters. In this way, you can, starting from the beginning of the string as represented by the caret ^ metacharacter, specify the columns with the information you want returned by surrounding them with parentheses.

For instance, you don't want the first six characters, so you specify them as ^.{6}, but you do want the next five characters because they contain the serial number of the atom; so, specify that field as (.{5}).

Frankly, I think that the use of substr is clearer for this purpose, but I wanted to show you an alternative way using regular expressions as well.

We've already seen the use of the printf function to format output with more options then with the print function.

This program has another important shortcut. It doesn't specify the file to open and read from. In Perl, you can give the input filename on the command line (or drag and drop it onto a Mac droplet), and the program takes its input from that file. Just use the angle brackets as shown in the first line of the program to read from the file. For short, fast programs, such as the one demonstrated here, this is a great convenience. You can leave out all the calls and tests for success of the open function and just use the angle brackets. You would call it from the command line like so, assuming you saved the program in a file called get_two_atoms:

%perl get_two_atoms pdb1a4o.ent

Alternatively, you can pipe data to the program with the commands:

% cat pdb1a40.cat | perl get_two_atoms

or:

% perl get_two__atoms < pdb1a40.ent 

and use <STDIN> instead of <> in your program to read the data.

< BACKCONTINUE >

Index terms contained in this section

() (parentheses)
      metacharacter in regular expressions
(angle brackets)
      line input (angle) operator
^ (caret)
      metacharacter in regular expressions
angle operator
arrays
      multiline strings, converting to
atomic coordinates, finding in PDB files
      ATOM record type
            parsing from input file
extractSEQRES subroutine
fields, GenBank records
      PDB record types vs.
files
     PDB
            parsing
get_file_data subroutine
      reading in PDB file
hashes
      atomic coordinate data, PDB files
      PDB character codes, translating
input
      line input (angle) operator
iub3to1 subroutine
line input (angle) operator
multiline information sets, GenBank vs. PDB
multiline strings
      converting to arrays
operators
      input
parseATOM subroutine
parsePDBrecordtypes subroutine
parsing
      PDB files
            extracting primary sequence
            extracting sequence chains (main program)
            finding atomic coordinates
patterns (and regular expressions)
      ATOM record type, extracting from PDB file
      PDB file record types, matching
performance
      PDB files and
Protein Data Bank (PDB)
      parsing files
            extracting primary sequence
            extracting sequence chains (main program)
            finding atomic coordinates
reading
      PDB files
records, PDB files
     types
            for atomic coordinates
            GenBank fields vs.
            parsePDBrecordtypes subroutine
split function
      multiline strings, converting to arrays
strings
     multiline
            converting to arrays
subroutines
      extractSEQRES
      get_file_data
      iub3to1 (converting PDB character codes)
      parseATOM
      parsePDBrecordtypes

© 2002, O'Reilly & Associates, Inc.