11.4
Parsing PDB Files
First, Example 11-5 shows the main
program and
three subroutines that will be discussed in
this section.
Example 11-5. Extract sequence chains from PDB file
#!/usr/bin/perl
# Extract sequence chains from PDB file
use strict;
use warnings;
use BeginPerlBioinfo; # see Chapter 6 about this module
# Read in PDB file: Warning - some files are very large!
my @file = get_file_data('pdb/c1/pdb1c1f.ent');
# Parse the record types of the PDB file
my %recordtypes = parsePDBrecordtypes(@file);
# Extract the amino acid sequences of all chains in the protein
my @chains = extractSEQRES( $recordtypes{'SEQRES'} );
# Translate the 3-character codes to 1-character codes, and print
foreach my $chain (@chains) {
print "****chain $chain **** \n";
print "$chain\n";
print iub3to1($chain), "\n";
}
exit;
################################################################################
# Subroutines for Example 11-5
################################################################################
# parsePDBrecordtypes
#
#--given an array of a PDB file, return a hash with
# keys = record type names
# values = scalar containing lines for that record type
sub parsePDBrecordtypes {
my @file = @_;
use strict;
use warnings;
my %recordtypes = ( );
foreach my $line (@file) {
# Get the record type name which begins at the
# start of the line and ends at the first space
my($recordtype) = ($line =~ /^(\S+)/);
# .= fails if a key is undefined, so we have to
# test for definition and use either .= or = depending
if(defined $recordtypes{$recordtype} ) {
$recordtypes{$recordtype} .= $line;
}else{
$recordtypes{$recordtype} = $line;
}
}
return %recordtypes;
}
# extractSEQRES
#
#--given an scalar containing SEQRES lines,
# return an array containing the chains of the sequence
sub extractSEQRES {
use strict;
use warnings;
my($seqres) = @_;
my $lastchain = '';
my $sequence = '';
my @results = ( );
# make array of lines
my @record = split ( /\n/, $seqres);
foreach my $line (@record) {
# Chain is in column 12, residues start in column 20
my ($thischain) = substr($line, 11, 1);
my($residues) = substr($line, 19, 52); # add space at end
# Check if a new chain, or continuation of previous chain
if("$lastchain" eq "") {
$sequence = $residues;
}elsif("$thischain" eq "$lastchain") {
$sequence .= $residues;
# Finish gathering previous chain (unless first record)
}elsif ( $sequence ) {
push(@results, $sequence);
$sequence = $residues;
}
$lastchain = $thischain;
}
# save last chain
push(@results, $sequence);
return @results;
}
# iub3to1
#
#--change string of 3-character IUB amino acid codes (whitespace separated)
# into a string of 1-character amino acid codes
sub iub3to1 {
my($input) = @_;
my %three2one = (
'ALA' => 'A',
'VAL' => 'V',
'LEU' => 'L',
'ILE' => 'I',
'PRO' => 'P',
'TRP' => 'W',
'PHE' => 'F',
'MET' => 'M',
'GLY' => 'G',
'SER' => 'S',
'THR' => 'T',
'TYR' => 'Y',
'CYS' => 'C',
'ASN' => 'N',
'GLN' => 'Q',
'LYS' => 'K',
'ARG' => 'R',
'HIS' => 'H',
'ASP' => 'D',
'GLU' => 'E',
);
# clean up the input
$input =~ s/\n/ /g;
my $seq = '';
# This use of split separates on any contiguous whitespace
my @code3 = split(' ', $input);
foreach my $code (@code3) {
# A little error checking
if(not defined $three2one{$code}) {
print "Code $code not defined\n";
next;
}
$seq .= $three2one{$code};
}
return $seq;
}
It's important to note
that the main program, which calls the
subroutine
get_file_data to read in the PDB file, has
included a warning about the potentially large size of any given PDB
file. (For instance, the PDB file 1gav weighs in
at 3.45 MB.) Plus, the main program follows the reading in of the
entire file, with the
subroutine
parsePDBrecordtypes that makes copies of all lines
in the input file, separated by record type. At this point, the
running program is using twice the amount of memory as the size of
the file. This design has the advantage of clarity and modularity,
but it can cause problems if main memory is in short supply. The use
of memory can be lessened by not saving the results of reading in the
file, but instead passing the file data directly to the
parsePDBrecordtypes subroutine, like so:
# Get the file data and parse the record types of the PDB file
%recordtypes = parsePDBrecordtypes(get_file_data('pdb/c1/pdb1c1f.ent'));
Further savings of memory are possible. For instance, you can rewrite
the program to just read the file one line at a time while parsing
the data into the record types. I point out these considerations to
give you an idea of the kinds of choices that are practically
important in processing large files. However, let's stick with
this design for now. It may be expensive in terms of memory, but
it's very clear in terms of overall program structure.
In Chapter 10, I demonstrated two ways to parse
GenBank files into sequence and annotation and then how to parse the
annotation into finer and finer levels of detail.
The first method involved iterating through an array of the lines of
the record. Recall that due to the structure of multiline fields, it
was necessary to set flags while iterating to keep track of which
field the input line was in.[3]
The other method, which worked better for GenBank files, involved
regular expressions. Which method
will work best for PDB files? (Or should you settle on a third
approach?)
There are several ways to extract this information. PDB makes it easy
to collect record types, since they all start with the same keyword
at the beginning of the line. The technique in the last chapter that
used regular expressions parsed the top-level fields of the file;
this would be somewhat unwieldy for PDB files. (See the exercises at
the end of the chapter.) For instance, a regular expression such as
the following matches all adjacent SEQRES lines into a scalar string:
$record =~ /SEQRES.*\n(SEQRES.*\n)*/;
$seqres = $&;
The regular expression matches a single SEQRES line with
SEQRES.*\n and then matches zero or more
additional lines with (SEQRES.*\n)*. Notice how
the final * indicates zero or more of the
preceding item, namely, the parenthesized expression
(SEQRES.*\n). Also note that the
.* matches zero or more nonnewline characters.
Finally, the second line captures the pattern matched, denoted by
$&, into the variable
$seqres.
To extend this to capture all record types, see the exercises at the
end of the chapter.
For PDB files, each line starts with a keyword that explicitly states
to which record type that line belongs. You will find in the
documentation that each record type has all its lines adjacent to
each other in a group. In this case, it seems that simply iterating
through the lines and collecting the record types would be the
simplest programming approach.
Example 11-5 contains a subroutine
parsePDBrecordtypes that parses the PDB record
types from an array containing the lines of the PDB record. This is a
short, clean subroutine that accomplishes what is needed. The
comments describe what's happening pretty well, which, as you
know, is a critical factor in writing good code. Basically, each line
is examined for its record type and is then added to the value of a
hash entry with the record type as the key. The hash is returned
from
the subroutine.
11.4.1
Extracting Primary Sequence
Let's examine the subroutine
extractSEQRES
,
now that the record types have been parsed out, and extract the
primary amino acid sequence.
You need to extract each chain separately and return an array of one
or more strings of sequence corresponding to those chains, instead of
just one sequence.
The previous parse, in Example 11-4, left the
required SEQRES record type, which stretches over several lines, in a
scalar string that is the value of the key
'SEQRES' in a hash. Our success with the previous
parsePDBrecordtypes subroutine that used
iteration over lines (as opposed to regular expressions over
multiline strings) leads to the same approach here. The
split
Perl function enables you to turn a
multiline string into an array.
As you iterate through the lines of the SEQRES record type, notice
when a new chain is starting, save the previous chain in
@results, reset the $sequence
array, and reset the $lastchain flag to the new
chain. Also, when done with all the lines, make sure to save the last
sequence chain in the @results array.
Also notice (and verify by exploring the Perl documentation for the
function) that split, with the arguments you
gave it, does what you want.
The third and final subroutine of Example 11-5 is
called iub3to1
. Since in
PDB the sequence information is in three-character codes, you need
this subroutine to change those sequences into one-character codes.
It uses a straightforward
hash lookup to perform the
translation.
We've now decomposed the problem into a few complementary
subroutines. It's always interesting as to how to best divide a
problem into cooperating subroutines. You can put the call to
iub3to1 inside the
extractSEQRES subroutine; that might be a
cleaner way to package these subroutines together, since, outside the
PDB file format, you won't have use for the strings of amino
acids in three-character codes.
The important observation at this juncture is to point out that a few
short subroutines, tied together with a very short main program, were
sufficient to do a great deal of parsing of PDB files.
11.4.2
Finding Atomic Coordinates
So far, I've tried not to give more than a very brief overview
of protein structure. However, in parsing PDB files, you will be
faced with a great deal of detailed information about the structures
and the experimental conditions under which they were determined. I
will now present a short program that extracts the coordinates of
atoms in a PDB file. I don't cover the whole story: for that,
you will want to read the PDB documentation in detail and consult
texts on protein structure, X-ray crystallography, and NMR
techniques.
That said, let's extract the coordinates
from the
ATOM record type. ATOM record types are
the most numerous of the several record types that deal with
atomic-coordinate data: MODEL, ATOM, SIGATM, ANISOU, SIGUIJ, TER,
HETATM, and ENDMDL. There are also several record types that handle
coordinate transformation: ORIGXn, SCALEn, MTRIXn, and TVECT.
Here is part of the PDB documentation that shows the field
definitions of each ATOM record:
ATOM
Overview
The ATOM records present the atomic coordinates for standard residues.
They also present the occupancy and temperature factor for each atom.
Heterogen coordinates use the HETATM record type. The element symbol
is always present on each ATOM record; segment identifier and charge
are optional.
Record Format
COLUMNS DATA TYPE FIELD DEFINITION
---------------------------------------------------------------------------------
1 - 6 Record name "ATOM "
7 - 11 Integer serial Atom serial number.
13 - 16 Atom name Atom name.
17 Character altLoc Alternate location indicator.
18 - 20 Residue name resName Residue name.
22 Character chainID Chain identifier.
23 - 26 Integer resSeq Residue sequence number.
27 AChar iCode Code for insertion of residues.
31 - 38 Real(8.3) x Orthogonal coordinates for X in
Angstroms.
39 - 46 Real(8.3) y Orthogonal coordinates for Y in
Angstroms.
47 - 54 Real(8.3) z Orthogonal coordinates for Z in
Angstroms.
55 - 60 Real(6.2) occupancy Occupancy.
61 - 66 Real(6.2) tempFactor Temperature factor.
73 - 76 LString(4) segID Segment identifier, left-justified.
77 - 78 LString(2) element Element symbol, right-justified.
79 - 80 LString(2) charge Charge on the atom.
Here is a typical ATOM line:
ATOM 1 N GLY A 2 1.888 -8.251 -2.511 1.00 36.63 N
Let's do something fairly simple: let's extract all x, y,
and z coordinates for each atom, plus the serial number (a unique
integer for each atom in the molecule) and the element symbol. Example 11-6 is a subroutine that accomplishes that, with a
main program to exercise the subroutine.
Example 11-6. Extract atomic coordinates from PDB file
#!/usr/bin/perl
# Extract atomic coordinates from PDB file
use strict;
use warnings;
use BeginPerlBioinfo; # see Chapter 6 about this module
# Read in PDB file
my @file = get_file_data('pdb/c1/pdb1c1f.ent');
# Parse the record types of the PDB file
my %recordtypes = parsePDBrecordtypes(@file);
# Extract the atoms of all chains in the protein
my %atoms = parseATOM ( $recordtypes{'ATOM'} );
# Print out a couple of the atoms
print $atoms{'1'}, "\n";
print $atoms{'1078'}, "\n";
exit;
################################################################################
# Subroutines of Example 11-6
################################################################################
# parseATOM
#
# --extract x, y, and z coordinates, serial number and element symbol
# from PDB ATOM record type
# Return a hash with key=serial number, value=coordinates in a string
sub parseATOM {
my($atomrecord) = @_;
use strict;
use warnings;
my %results = ( );
# Turn the scalar into an array of ATOM lines
my(@atomrecord) = split(/\n/, $atomrecord);
foreach my $record (@atomrecord) {
my $number = substr($record, 6, 5); # columns 7-11
my $x = substr($record, 30, 8); # columns 31-38
my $y = substr($record, 38, 8); # columns 39-46
my $z = substr($record, 46, 8); # columns 47-54
my $element = substr($record, 76, 2); # columns 77-78
# $number and $element may have leading spaces: strip them
$number =~ s/^\s*//;
$element =~ s/^\s*//;
# Store information in hash
$results{$number} = "$x $y $z $element";
}
# Return the hash
return %results;
}
The parseATOM
subroutine is quite short: the strict format of these ATOM records
makes parsing the information quite straightforward. You first split
the scalar argument, which contains the ATOM lines, into an array of
lines.
Then, for each line, use the substr function to
extract the specific columns of the line that contains the needed
data: the serial number of the atom; the x, y, and z coordinates; and
the element symbol.
Finally, save the results by making a
hash with keys equal to the
serial numbers and values set to strings containing the other four
relevant fields. Now, this may not always be the most convenient way
to return the data. For one thing, hashes are not sorted on the keys,
so that would need to be an additional step if you had to sort the
atoms by serial number. In particular, an array is a logical choice
to store information sorted by serial number. Or, it could be that
what you really want is to find all the metals, in which case,
another data structure would be suggested. Nevertheless, this short
subroutine shows one way to find and report information.
It often happens that what you really need is a reformatting of the
data for use by another program. Using the technique of this
subroutine, you can see how to extract the needed data and add a
print statement that formats the data into the
desired form. Take a look at the printf and
sprintf functions to get more specific control
over the format. For real heavy-duty formatting, there's the
format function, which merits its own chapter in
O'Reilly's comprehensive Programming
Perl. (See also Chapter 12 and Appendix B of this book.)
Here's the output from Example 11-6:
1.888 -8.251 -2.511 N
18.955 -10.180 10.777 C
You can now extract at least the major portion of the atomic
coordinates from a PDB file. Again, notice the good news: it
doesn't take a long or particularly complex program to do what
is needed.
This program has been designed so that its parts can be used in the
future to work well for other purposes. You parse all record types,
for instance, not only the ATOM record types. Let's take a look
at a very short program that just parses the
ATOM record type lines from an input file;
by targeting only this one problem, you can write a much shorter
program. Here's the program:
while(<>) {
/^ATOM/ or next;
my($n, $x, $y, $z, $element)
= ($_ =~ /^.{6}(.{5}).{19}(.{8})(.{8})(.{8}).{22}(..)/);
# $n and $element may have leading spaces: strip them
$n =~ s/^\s*//;
$element =~ s/^\s*//;
if (($n == 1) or ($n == 1078)) {
printf "%8.3f%8.3f%8.3f %2s\n", $x, $y, $z, $element;
}
}
For each line, a regular-expression match extracts
just the needed information. Recall that a regular expression that
contains parentheses metacharacters returns
an array whose elements are the parts of the string that matched
within the parentheses. You assign the five variables
$number, $x,
$y, $z, and
$element to the substrings from these matched
parentheses.
The actual regular expression is simply using dots and the quantifier
operator .{num} to stand for
num characters. In this way, you can, starting
from the beginning of the string as represented by the
caret ^
metacharacter, specify the columns with the information you want
returned by surrounding them with parentheses.
For instance, you don't want the first six characters, so you
specify them as ^.{6}, but you do want the next
five characters because they contain the serial number of the atom;
so, specify that field as (.{5}).
Frankly, I think that the use of substr is clearer
for this purpose, but I wanted to show you an alternative way using
regular expressions as well.
We've already seen the use of the printf
function to format output with more options then with the
print function.
This program has another important shortcut. It doesn't specify
the file to open and read from. In Perl, you can give the input
filename on the command line (or drag and drop it onto a Mac
droplet), and the program takes its input from that file. Just
use the
angle
brackets as shown in the first line of the program to read from the
file. For short, fast programs, such as the one demonstrated here,
this is a great convenience. You can leave out all the calls and
tests for success of the open function and just
use the angle brackets. You would call it from the command line like
so,
assuming you saved the
program in a file called
get_two_atoms:
%perl get_two_atoms pdb1a4o.ent
Alternatively, you can pipe data to the program with the commands:
% cat pdb1a40.cat | perl get_two_atoms
or:
% perl get_two__atoms < pdb1a40.ent
and use <STDIN> instead of
<> in your program to read the data.