11.5
Controlling Other Programs
Perl makes it easy to start other programs and collect their
output, all from within your Perl program. This is an extremely
useful capability; for most programs, Perl makes it fairly simple to
accomplish.
You may need to run some particular program many times, for instance
over every file in PDB to extract secondary structure information.
The program itself may not have a way to tell it "run yourself
over all these files." Also, the output of the program may have
all sorts of extraneous information. What you need is a much simpler
report that just presents the information that interests
you—perhaps in a format that could then be input to another
program! With Perl you can write a program to do exactly this.
An important kind of program to automate is a web site that provides
some useful program or data online. Using the appropriate Perl
modules, you can connect to the web site, send it your input, collect
the output, and then parse and reformat as you wish. It's
actually not hard to do! O'Reilly's Perl
Cookbook, a companion volume to Programming
Perl, is an excellent source of short programs and
helpful descriptions to get you started.
Perl is a great way to automate other programs. The next section
shows an example of a Perl program that starts another program and
collects, parses, reformats, and outputs the results. This program
will control another program on the same computer. The example will
be from a Unix or Linux environment; consult your Perl documentation
on how to get the same functionality from your Windows or Macintosh
platform.
11.5.1
The Stride Secondary Structure Predictor
We will use an external program to
calculate the secondary structure
from the 3D coordinates of a PDB file. As a secondary structure
assignment engine, I use a program that outputs a secondary structure
report, called stride.
stride is available from EMBL (http://www.embl-heidelberg.de/stride/stride_info.html)
and runs on Unix, Linux, Windows, Macintosh, and VMS systems. The
program works very simply; just give it a command-line argument of a
PDB filename and collect the output in the subroutine
call_stride
that follows.
Example 11-7 is the entire program: two subroutines
and a main program, followed by a discussion.
Example 11-7. Call another program for secondary structure prediction
#!/usr/bin/perl
# Call another program to perform secondary structure prediction
use strict;
use warnings;
# Call "stride" on a file, collect the report
my(@stride_output) = call_stride('pdb/c1/pdb1c1f.ent');
# Parse the stride report into primary sequence, and secondary
# structure prediction
my($sequence, $structure) = parse_stride(@stride_output);
# Print out the beginnings of the sequence and the secondary structure
print substr($sequence, 0, 80), "\n";
print substr($structure, 0, 80), "\n";
exit;
################################################################################
# Subroutine for Example 11-7
################################################################################
# call_stride
#
# --given a PDB filename, return the output from the "stride"
# secondary structure prediction program
sub call_stride {
use strict;
use warnings;
my($filename) = @_;
# The stride program options
my($stride) = '/usr/local/bin/stride';
my($options) = '';
my(@results) = ( );
# Check for presence of PDB file
unless ( -e $filename ) {
print "File \"$filename\" doesn\'t seem to exist!\n";
exit;
}
# Start up the program, capture and return the output
@results = `$stride $options $filename`;
return @results;
}
# parse_stride
#
#--given stride output, extract the primary sequence and the
# secondary structure prediction, returning them in a
# two-element array.
sub parse_stride {
use strict;
use warnings;
my(@stridereport) = @_;
my($seq) = '';
my($str) = '';
my $length;
# Extract the lines of interest
my(@seq) = grep(/^SEQ /, @stridereport);
my(@str) = grep(/^STR /, @stridereport);
# Process those lines to discard all but the sequence
# or structure information
for (@seq) { $_ = substr($_, 10, 50) }
for (@str) { $_ = substr($_, 10, 50) }
# Return the information as an array of two strings
$seq = join('', @seq);
$str = join('', @str);
# Delete unwanted spaces from the ends of the strings.
# ($seq has no spaces that are wanted, but $str may)
$seq =~ s/(\s+)$//;
$length = length($1);
$str =~ s/\s{$length}$//;
return( ($seq, $str) );
}
As you can see in the
subroutine
call_stride, variables have been made for the
program name ($stride) and for the options you may
want to pass ($options). Since these are parts of
the program you may want to change, put them as variables near the
top of the code, to make them easy to find and alter. The argument to
the subroutine is the PDB filename ($filename).
(Of course, if you expect the options to change frequently, you can
make them another argument to the subroutine.)
Since you're dealing with a program that takes a file, do a
little error checking to see if a file by that name actually exists.
Use the -e file test operator.
Or you can omit this and let the stride program
figure it out, and capture its error output. But that requires
parsing the stride output for its error output, which involves
figuring out how stride reports errors. This can
get complicated, so I'd stick with using the
-e file test operator.
The actual running of the program and collecting its output happens
in just one line. The program to be run is enclosed in
backticks, which run the program (first
expanding variables) and return the output as an array of lines.
There are other ways to run a program. One common way is the
system function call. It behaves differently
from the backticks: it doesn't return the output of the command
it calls (it just returns the exit status, an integer indicating
success or failure of the command). Other methods include
qx
, the open
system call, and the fork and
exec functions.
11.5.2
Parsing Stride Output
I don't go into
too much detail here about parsing the
output of stride. Let's just exhibit some
code that extracts the primary sequence and the secondary structure
prediction. See the exercises at the end of the chapter for a
challenge to extract the secondary structure information from a PDB
file's HELIX, SHEET, and TURN record types
and output the information in a similar format as
stride does here.
Here is a typical section of a stride output
(not the entire output):
SEQ 1 MDKNELVQKAKLAEQAERYDDMAACMKSVTEQGAELSNEERNLLSVAYKN 50 1A4O
STR HHHHHHHHHHHHHH HHHHHHHHHHHHHTTT HHHHHHHHHHHHH 1A4O
REM 1A4O
REM . . . . . 1A4O
SEQ 51 VVGARRSSWRVVSSIEQKEKKQQMAREYREKIETELRDICNDVLSLLEKF 100 1A4O
STR HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHT 1A4O
REM 1A4O
REM . . . . . 1A4O
SEQ 101 LIPNAAESKVFYLKMKGDYYRYLAEVAAGDDKKGIVDQSQQAYQEAFEIS 150 1A4O
STR TTTTT HHHHHHHHHHHHHHHHHHHH HHHHHHHHHHHHHHHHHHHHH 1A4O
REM 1A4O
REM . . . . 1A4O
SEQ 151 KKEMIRLGLALNFSVFYYACSLAKTAFDEAIAELLIMQLLRDNLTLW 197 1A4O
STR TTTTHHHHHHHHHHHH HHHHHHHHHHHHH HHHHHHHHHH 1A4O
Notice that each line is prefaced by an identifier, which should make
collecting the different record types easy. Without even consulting
the documentation (a slightly dangerous but sometimes expedient
approach), you can see that the primary sequence has keyword
SEQ, the structure prediction has keyword
STR, and the data of interest lies from position
11 up to position 60 on each line. (We'll ignore everything
else for now.)
The following list shows the one-letter secondary structure codes
used by stride:
H
|
Alpha helix
|
G
|
3-10 helix
|
I
|
PI helix
|
E
|
Extended conformation
|
B or b
|
Isolated bridge
|
T
|
Turn
|
C
|
Coil (none of the above)
|
Using the substr function, the two
for loops alter each line of the two arrays by
saving the 11th to the 60th positions of those strings. This is where
the desired information lies.
Now let's examine the
subroutine
parse_stride in Example 11-7
that takes stride output and returns an array of
two strings, the primary sequence and the structure prediction.
This is a very "Perlish" subroutine that uses some
features that manipulate text. What's interesting is the
brevity of the program, which some of Perl's built-in functions
make possible.
First, you receive the output of the stride
program in the subroutine argument @_. Next, use
the grep function to extract those lines of
interest, which are easy to identify in this output, as they begin
with clear identifiers SEQ and
STR.
Next, you want to save just those positions (or columns) of these
lines that have the sequence or structure information; you
don't need the keywords, position numbers, or the PDB entry
name at the end of the lines.
Finally,
join
the arrays into single strings. Here, there's one detail to
handle; you need to remove any unneeded spaces from the ends of the
strings. Notice that stride sometimes leaves
spaces in the structure prediction, and in this example, has left
some at the end of the structure prediction. So you shouldn't
throw away all the spaces at the ends of the strings. Instead, throw
away all the spaces at the end of the sequence string, because they
are just superfluous spaces on the line. Now, see how many spaces
that was, and throw the equal amount away at the end of the structure
prediction string, thus preserving spaces that correspond to
undetermined secondary structure.
Example 11-7 contains a main program that calls two
subroutines, which, since they are short, are all included (so
there's no need here for the
BeginPerlBioinfo module). Here's the
output of Example 11-7:
GGLQVKNFDFTVGKFLTVGGFINNSPQRFSVNVGESMNSLSLHLDHRFNYGADQNTIVMNSTLKGDNGWETEQRSTNFTL
TTTTTTBTTT EEEEEEETTTT EEEEEEEEETTEEEEEEEEEEEETTEEEEEEEEEETTGGG B EEE
The first line shows the amino acids, and the second line shows the
prediction of the secondary structure. Check the next section for a
subroutine that will improve that output.