Safari | Beginning Perl for Bioinformatics -> 6.1 Subroutines

6.1 Subroutines

Subroutines are an important way to organize a program and are used in all major programming languages.

A subroutine wraps up a bit of code, gives the code a name, and provides a way to pass in some values for its calculations and then report back the results. The rest of the program can then use the subroutine's code just by calling its name, giving the needed values to pass in to the subroutine code and then collecting the results. This use or "invocation" of a subroutine is commonly referred to as calling the subroutine. You can think of a subroutine as a program within a program; just as you run programs to get results, so your programs call subroutines to get results. Once you have a subroutine, you can use it in a program simply by knowing which values to pass in and what kind of values to expect it to pass out.

6.1.1 Advantages of Subroutines

Subroutines provide several benefits. They endow programs with abstraction, modularization, and the ability to create large programs by organizing the code into manageable chunks with defined inputs and outputs.

Say you need to calculate something, for instance the mean of a distribution, at several places in a program or in several different programs. By writing this calculation as a subroutine, you can write it once and then call it whenever you need it, thus making your program:

Shorter, since you're reusing the code.
Easier to test, since you can test the subroutine separately.
Easier to understand, since it reduces clutter and better organizes programs.
More reliable, since you have less code when you reuse subroutines, so there are fewer opportunities for something to go wrong.
Faster to write, since you may, for example, have already written some subroutines that handle basic statistics and can just call the one that calculates the mean without having to write it again. Or better yet, you may find a good statistics library someone else wrote and never have to write it at all.
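
As a minimal sketch of the idea, here is one way the mean calculation might be written once and then called from two places. The subroutine name mean and the sample numbers are assumptions made for illustration; they do not come from a particular statistics library.

```perl
use strict;
use warnings;

# Hypothetical helper: the arithmetic mean of a list of numbers.
# Returns nothing (undef in scalar context) if the list is empty.
sub mean {
    my (@values) = @_;
    return unless @values;

    my $sum = 0;
    $sum += $_ for @values;

    # In numeric context, @values gives the number of elements.
    return $sum / @values;
}

# The same subroutine reused at two places (illustrative numbers only).
my @gc_percentages   = (40, 50, 60);
my @sequence_lengths = (100, 200, 300, 400);

print "Mean GC percentage:   ", mean(@gc_percentages), "\n";
print "Mean sequence length: ", mean(@sequence_lengths), "\n";
```

With these sample values the program prints 50 and 250. The calculation itself lives in one place, so it can be tested separately from the rest of the program.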

There is another subtle, yet powerful idea at work here. Subroutines can themselves call other subroutines, that is, a subroutine can use another subroutine for help in its calculations.^[1] By writing a set of subroutines, each of which does one or a few things well, you can combine them in various ways to make new subroutines. You can then combine the new subroutines, and so on, and the end result can be large and flexible programming systems. Decomposing problems into sets of subroutines that can be conveniently combined allows you to create environments that can grow and adapt to changing conditions with a minimum of effort.

^[1] Subroutines can even call themselves, and this so-called recursion can be an elegant way to compute (see Chapter 11).
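
One way to picture subroutines being combined is a reverse complement built from two smaller pieces. This is only a sketch: the names reverse_dna, complement_dna, and reverse_complement are hypothetical and chosen for illustration.

```perl
use strict;
use warnings;

# Hypothetical building block: reverse the order of the bases.
sub reverse_dna {
    my ($dna) = @_;
    return scalar reverse $dna;
}

# Hypothetical building block: replace each base with its complement.
sub complement_dna {
    my ($dna) = @_;
    $dna =~ tr/ACGTacgt/TGCAtgca/;
    return $dna;
}

# A new subroutine made by combining the two above.
sub reverse_complement {
    my ($dna) = @_;
    return complement_dna(reverse_dna($dna));
}

print reverse_complement('ACGGT'), "\n";    # prints ACCGT
```

Each small subroutine does one thing, and the combined subroutine can in turn be used as a building block by other code.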

The trick to all this is how you partition the code into subroutines. You want subroutines that encapsulate something that will be generally useful, and not just called once (although that sometimes can be useful too). There are various rules of thumb: a subroutine should do one thing well, and it should be no more than a page or two of code. These are not real rules, and exceptions are frequent, but they can help you divide your code into manageable chunks, suitable for subroutines.

6.1.2 Writing Subroutines

Let's look at how subroutines are used and then at how they're defined.

To use a subroutine, you pass data into the subroutine as arguments, and then you collect the return value(s) of the subroutine. For example, say you want a subroutine that, given some DNA, appends "ACGT" to the end of the DNA and returns the new, longer DNA. Let's call the subroutine addACGT. In Perl, you usually call a subroutine by typing its name, followed by a parenthesized list of arguments (if any). For example, here's a call to addACGT with the one argument $dna:

addACGT($dna);

When calling a subroutine, older versions of Perl required starting the name of the subroutine with the & (ampersand) character. It's still okay to do so (e.g., &addACGT), but these days the ampersand is usually omitted.^[2]

^[2] There are times, even in the newer versions of Perl, when an ampersand is required; you'll see one such case in Chapter 11, in Section 11.2.3, which describes the File::Find module. (See also the defined and undef functions in the documentation or the perlref manpage.)
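
The two calling styles can be compared side by side in a short sketch. The addACGT definition is the one used in Example 6-1; the variable names $without_ampersand and $with_ampersand are made up for this illustration.

```perl
use strict;
use warnings;

my $dna = 'CGACGTCTTCTCAGGCGA';

# Usual style: the name followed by a parenthesized argument list.
my $without_ampersand = addACGT($dna);

# Older style: the same call with a leading ampersand.
my $with_ampersand = &addACGT($dna);

print "Both calls returned $with_ampersand\n"
    if $without_ampersand eq $with_ampersand;

# The same definition as in Example 6-1.
sub addACGT {
    my ($dna) = @_;

    $dna .= 'ACGT';
    return $dna;
}
```

When an explicit argument list is given in parentheses, as here, both forms return the same value.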

Example 6-1 shows in detail how this works.

Example 6-1. A subroutine to append ACGT to DNA

#!/usr/bin/perl -w
# A program with a subroutine to append ACGT to DNA

# The original DNA
$dna = 'CGACGTCTTCTCAGGCGA';

# The call to the subroutine "addACGT".
# The argument being passed in is $dna; the result is saved in $longer_dna
$longer_dna = addACGT($dna);

print "I added ACGT to $dna and got $longer_dna\n\n";

exit;

################################################################################
# Subroutines for Example 6-1
################################################################################

# Here is the definition for subroutine "addACGT"

sub addACGT {
    my($dna) = @_;

    $dna .= 'ACGT';
    return $dna;
}

Example 6-1 produces the following output:

I added ACGT to CGACGTCTTCTCAGGCGA and got CGACGTCTTCTCAGGCGAACGT

We'll now look at this code to see how subroutines are defined and used in a Perl program.

The first thing to notice, taking the large view, is that the program now has two sections. The first section starts from the beginning of the program and ends with the exit command. Following that (and announced by a blizzard of comments for easy reading) is a section for subroutine definitions, in this case, only the one definition for subroutine addACGT. It is common to place all subroutine definitions together at the end of a program, for ease in reading. Usually they're listed alphabetically or in some other convenient way.

Actually, it is legal to put the subroutine definitions almost anywhere in a program. This is because Perl first scans through the code and does things like check the syntax and learn subroutine definitions, before it starts to run the program. In particular, subroutine definitions can come after the point in the code where you use them (not necessarily before, which many people assume is the rule), and they don't have to be grouped together but can be scattered throughout the code. But our method of collecting them together at the end can make reading a program much easier. The possible exception is when a small subroutine is used in one section of code, as sometimes happens with the sort function, for instance. In this case having the definition right there can save the reader paging back and forth between the subroutine definition and its use. Usually, it's more convenient to read the program without the subroutine definitions, to get the overall flow of the program first, and then go back and look into the subroutines, if necessary.
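
Both placements can be seen in one short sketch: a small comparison subroutine kept beside the sort that uses it, and a subroutine called before the point where it is defined. The names by_length and print_fragments, and the sample data, are hypothetical.

```perl
use strict;
use warnings;

my @fragments = ('ACGTACGT', 'AC', 'ACGT');

# A small comparison subroutine kept right beside the sort that uses it.
# sort places the two items being compared in the special variables $a and $b.
sub by_length { return length($a) <=> length($b) }

my @sorted = sort by_length @fragments;

# This call comes before the definition of print_fragments, which is at
# the bottom of the file. Perl has already learned the definition by the
# time the program starts to run.
print_fragments(@sorted);

# Hypothetical helper, defined after the point where it is used.
# Prints one fragment per line and returns how many it printed.
sub print_fragments {
    my (@dna) = @_;
    print "$_\n" for @dna;
    return scalar @dna;
}
```

This sketch prints AC, ACGT, and ACGTACGT, one per line.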

As you see, Example 6-1 is very simple. It first stores some DNA into the variable $dna and then passes that variable as an argument to the subroutine call, which looks like this: addACGT($dna). The subroutine is called by its name, followed by parentheses containing the arguments to the subroutine. There may be no arguments, or if more than one, they are separated by commas. The value returned by the subroutine can be saved; in this program the value is saved in a variable called $longer_dna, which is then printed, and the program exits.
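
The following sketch shows a call with no arguments, a call with two arguments separated by a comma, and a return value saved in a variable. The subroutine names default_dna and join_dna are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical subroutine that takes no arguments.
sub default_dna {
    return 'CGACGTCTTCTCAGGCGA';
}

# Hypothetical subroutine that takes two arguments.
sub join_dna {
    my ($first, $second) = @_;
    return $first . $second;
}

my $dna        = default_dna();             # no arguments
my $longer_dna = join_dna($dna, 'ACGT');    # two arguments, separated by a comma

print "$longer_dna\n";    # prints CGACGTCTTCTCAGGCGAACGT
```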

The part of the program from the beginning to the exit statement is called variously the main program or the main body of the program. By looking over this section of the code, you can see what happens from the beginning to the end of the program without looking into the details of the subroutines.

Now that you've looked over the main program of Example 6-1, it's time to look at the subroutine definition and how it uses the principle of scoping.
