Safari | Beginning Perl for Bioinformatics -> 11.5 Controlling Other Programs

Beginning Perl for Bioinformatics > 11. Protein Data Bank > 11.5 Controlling Other Programs

11.5 Controlling Other Programs

Perl makes it easy to start other programs and collect their output, all from within your Perl program. This is an extremely useful capability; for most programs, Perl makes it fairly simple to accomplish.

You may need to run some particular program many times, for instance over every file in PDB to extract secondary structure information. The program itself may not have a way to tell it "run yourself over all these files." Also, the output of the program may have all sorts of extraneous information. What you need is a much simpler report that just presents the information that interests you—perhaps in a format that could then be input to another program! With Perl you can write a program to do exactly this.

An important kind of program to automate is a web site that provides some useful program or data online. Using the appropriate Perl modules, you can connect to the web site, send it your input, collect the output, and then parse and reformat as you wish. It's actually not hard to do! O'Reilly's Perl Cookbook, a companion volume to Programming Perl, is an excellent source of short programs and helpful descriptions to get you started.

Perl is a great way to automate other programs. The next section shows an example of a Perl program that starts another program and collects, parses, reformats, and outputs the results. This program will control another program on the same computer. The example will be from a Unix or Linux environment; consult your Perl documentation on how to get the same functionality from your Windows or Macintosh platform.

11.5.1 The Stride Secondary Structure Predictor

We will use an external program to calculate the secondary structure from the 3D coordinates of a PDB file. As a secondary structure assignment engine, I use a program that outputs a secondary structure report, called stride. stride is available from EMBL (http://www.embl-heidelberg.de/stride/stride_info.html) and runs on Unix, Linux, Windows, Macintosh, and VMS systems. The program works very simply; just give it a command-line argument of a PDB filename and collect the output in the subroutine call_stride that follows.

Example 11-7 is the entire program: two subroutines and a main program, followed by a discussion.

Example 11-7. Call another program for secondary structure prediction

#!/usr/bin/perl
#  Call another program to perform secondary structure prediction

use strict;
use warnings;

# Call "stride" on a file, collect the report
my(@stride_output)  = call_stride('pdb/c1/pdb1c1f.ent');

# Parse the stride report into primary sequence, and secondary
#  structure prediction
my($sequence, $structure)  = parse_stride(@stride_output);

# Print out the beginnings of the sequence and the secondary structure
print substr($sequence, 0, 80), "\n";
print substr($structure, 0, 80), "\n";

exit;

################################################################################
# Subroutine for Example 11-7
################################################################################

# call_stride
#
# --given a PDB filename, return the output from the "stride"
#     secondary structure prediction program

sub call_stride {

    use strict;
    use warnings;

    my($filename) = @_;

    # The stride program options
    my($stride) = '/usr/local/bin/stride';
    my($options) = '';
    my(@results) = (  );

    # Check for presence of PDB file
    unless ( -e $filename ) {
        print "File \"$filename\" doesn\'t seem to exist!\n";
        exit;
    }

    # Start up the program, capture and return the output
    @results = `$stride $options $filename`;

    return @results;
}

# parse_stride
#
#--given stride output, extract the primary sequence and the
#    secondary structure prediction, returning them in a
#    two-element array.

sub parse_stride {

    use strict;
    use warnings;

    my(@stridereport) = @_;
    my($seq) = '';
    my($str) = '';
    my $length;

    # Extract the lines of interest
    my(@seq) = grep(/^SEQ /, @stridereport);

    my(@str) = grep(/^STR /, @stridereport);

    # Process those lines to discard all but the sequence
    #  or structure information
    for (@seq) { $_ = substr($_, 10, 50) }
    for (@str) { $_ = substr($_, 10, 50) }

    # Return the information as an array of two strings
    $seq = join('', @seq);
    $str = join('', @str);

    # Delete unwanted spaces from the ends of the strings.
    # ($seq has no spaces that are wanted, but $str may)
    $seq =~ s/(\s+)$//;

    $length = length($1);

    $str =~ s/\s{$length}$//;

    return( ($seq, $str) );
}

As you can see in the subroutine call_stride, variables have been made for the program name ($stride) and for the options you may want to pass ($options). Since these are parts of the program you may want to change, put them as variables near the top of the code, to make them easy to find and alter. The argument to the subroutine is the PDB filename ($filename). (Of course, if you expect the options to change frequently, you can make them another argument to the subroutine.)

Since you're dealing with a program that takes a file, do a little error checking to see if a file by that name actually exists. Use the -e file test operator. Or you can omit this and let the stride program figure it out, and capture its error output. But that requires parsing the stride output for its error output, which involves figuring out how stride reports errors. This can get complicated, so I'd stick with using the -e file test operator.

The actual running of the program and collecting its output happens in just one line. The program to be run is enclosed in backticks, which run the program (first expanding variables) and return the output as an array of lines.

There are other ways to run a program. One common way is the system function call. It behaves differently from the backticks: it doesn't return the output of the command it calls (it just returns the exit status, an integer indicating success or failure of the command). Other methods include qx , the open system call, and the fork and exec functions.

11.5.2 Parsing Stride Output

I don't go into too much detail here about parsing the output of stride. Let's just exhibit some code that extracts the primary sequence and the secondary structure prediction. See the exercises at the end of the chapter for a challenge to extract the secondary structure information from a PDB file's HELIX, SHEET, and TURN record types and output the information in a similar format as stride does here.

Here is a typical section of a stride output (not the entire output):

SEQ  1    MDKNELVQKAKLAEQAERYDDMAACMKSVTEQGAELSNEERNLLSVAYKN   50          1A4O
STR         HHHHHHHHHHHHHH  HHHHHHHHHHHHHTTT   HHHHHHHHHHHHH               1A4O
REM                                                                        1A4O
REM                .         .         .         .         .               1A4O
SEQ  51   VVGARRSSWRVVSSIEQKEKKQQMAREYREKIETELRDICNDVLSLLEKF  100          1A4O
STR       HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHT               1A4O
REM                                                                        1A4O
REM                .         .         .         .         .               1A4O
SEQ  101  LIPNAAESKVFYLKMKGDYYRYLAEVAAGDDKKGIVDQSQQAYQEAFEIS  150          1A4O
STR       TTTTT HHHHHHHHHHHHHHHHHHHH   HHHHHHHHHHHHHHHHHHHHH               1A4O
REM                                                                        1A4O
REM                .         .         .         .                         1A4O
SEQ  151  KKEMIRLGLALNFSVFYYACSLAKTAFDEAIAELLIMQLLRDNLTLW     197          1A4O
STR       TTTTHHHHHHHHHHHH   HHHHHHHHHHHHH  HHHHHHHHHH                     1A4O

Notice that each line is prefaced by an identifier, which should make collecting the different record types easy. Without even consulting the documentation (a slightly dangerous but sometimes expedient approach), you can see that the primary sequence has keyword SEQ, the structure prediction has keyword STR, and the data of interest lies from position 11 up to position 60 on each line. (We'll ignore everything else for now.)

The following list shows the one-letter secondary structure codes used by stride:

H

Alpha helix

G

3-10 helix

I

PI helix

E

Extended conformation

B or b

Isolated bridge

T

Turn

C

Coil (none of the above)

Using the substr function, the two for loops alter each line of the two arrays by saving the 11th to the 60th positions of those strings. This is where the desired information lies.

Now let's examine the subroutine parse_stride in Example 11-7 that takes stride output and returns an array of two strings, the primary sequence and the structure prediction.

This is a very "Perlish" subroutine that uses some features that manipulate text. What's interesting is the brevity of the program, which some of Perl's built-in functions make possible.

First, you receive the output of the stride program in the subroutine argument @_. Next, use the grep function to extract those lines of interest, which are easy to identify in this output, as they begin with clear identifiers SEQ and STR.

Next, you want to save just those positions (or columns) of these lines that have the sequence or structure information; you don't need the keywords, position numbers, or the PDB entry name at the end of the lines.

Finally, join the arrays into single strings. Here, there's one detail to handle; you need to remove any unneeded spaces from the ends of the strings. Notice that stride sometimes leaves spaces in the structure prediction, and in this example, has left some at the end of the structure prediction. So you shouldn't throw away all the spaces at the ends of the strings. Instead, throw away all the spaces at the end of the sequence string, because they are just superfluous spaces on the line. Now, see how many spaces that was, and throw the equal amount away at the end of the structure prediction string, thus preserving spaces that correspond to undetermined secondary structure.

Example 11-7 contains a main program that calls two subroutines, which, since they are short, are all included (so there's no need here for the BeginPerlBioinfo module). Here's the output of Example 11-7:

GGLQVKNFDFTVGKFLTVGGFINNSPQRFSVNVGESMNSLSLHLDHRFNYGADQNTIVMNSTLKGDNGWETEQRSTNFTL
    TTTTTTBTTT EEEEEEETTTT EEEEEEEEETTEEEEEEEEEEEETTEEEEEEEEEETTGGG B   EEE

The first line shows the amino acids, and the second line shows the prediction of the secondary structure. Check the next section for a subroutine that will improve that output.

< BACK

CONTINUE >

Index terms contained in this section

-e (exists) file test
` (backticks), command input operator
automating programs
      secondary structure predictor
call_stride subroutine 2nd
exec function
exists (-e) file test
file test operators
      -e (exists)
fork function
grep function
HELIX, SHEET, and TURN record types (PDB files)
open system call
operators
      command input
      qx// (quoted execution)
output
      stride program, parsing
parsing
      stride program output
Perl
      controlling other programs from
programs
      external, automating with Perl
Protein Data Bank (PDB)
      controlling external programs from Perl
qx// (quoted execution) operator
records, PDB files
     types
            HELIX, SHEET, and TURN
removing
      whitespace
secondary structures, proteins
      predictor program (stride)
stride program, secondary structure predictor
      parsing output
strings
      whitespace, removing from end
subroutines
      call_stride 2nd
      parse_stride
system function
whitespace
      removing