
8.4 Translating DNA into Proteins

Example 8-1 shows how the new codon2aa subroutine translates a whole DNA sequence into protein.

Example 8-1. Translate DNA into protein
#!/usr/bin/perl
# Translate DNA into protein

use strict;
use warnings;
use BeginPerlBioinfo;     # see Chapter 6 about this module

# Initialize variables
my $dna = 'CGACGTCTTCGTACGGGACTAGCTCGTGTCGGTCGC';
my $protein = '';
my $codon;

# Translate each three-base codon into an amino acid, and append to a protein 
for(my $i=0; $i < (length($dna) - 2) ; $i += 3) {
    $codon = substr($dna,$i,3);
    $protein .= codon2aa($codon);
}

print "I translated the DNA\n\n$dna\n\n  into the protein\n\n$protein\n\n";

exit;

To make this work, you'll need the BeginPerlBioinfo.pm module for your subroutines in a separate file the program can find, as discussed in Chapter 6. You also have to add the codon2aa subroutine to it. Alternatively, you can add the code for the subroutine codon2aa directly to the program in Example 8-1 and remove the reference to the BeginPerlBioinfo.pm module.

Here's the output from Example 8-1:

I translated the DNA

CGACGTCTTCGTACGGGACTAGCTCGTGTCGGTCGC

  into the protein

RRLRTGLARVGR

You've seen all the elements in Example 8-1 before, except for the way it loops through the DNA with this statement:

for(my $i=0; $i < (length($dna) - 2) ; $i += 3) {

Recall that a for loop has three parts, delimited by the two semicolons. The first part initializes a counter: my $i=0 statically scopes the $i variable so it's visible only inside this block, and any other $i elsewhere in the code (well, in this case, there aren't any, but it can happen) is now invisible inside the block. The third part of the for loop increments the counter after all the statements in the block are executed and before returning to the beginning of the loop:

$i += 3

Since you're trying to march through the DNA three bases at a shot, you increment by three.
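To see the scoping of the counter for yourself, here is a minimal sketch. The outer $i and the @seen array are made up for illustration; they aren't part of Example 8-1.

```perl
use strict;
use warnings;

# An unrelated $i declared outside the loop (illustrative only)
my $i = 'outer';
my @seen = ();

# The loop's own $i is a separate variable, visible only in the loop
for(my $i=0; $i < 9 ; $i += 3) {
    push(@seen, $i);
}

print "Inside the loop, \$i was: @seen\n";     # prints 0 3 6
print "After the loop, \$i is still: $i\n";    # prints outer
```

The loop's counter steps through 0, 3, and 6, and the outer $i is untouched when the loop finishes.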

The second, middle part of the for loop tests whether the loop should continue:

$i < (length($dna) - 2)

The point is that if there are zero, one, or two bases left, you should quit, because there aren't enough to make a codon. Now, the positions in a string of DNA of a certain length are numbered from 0 to length-1. So if the position counter $i has reached length-2, there are only two more bases (at positions length-2 and length-1), and you should quit. Only if the position counter $i is less than length-2 will you still have at least three bases left, enough for a codon. So the test succeeds only if:

$i < (length($dna) - 2)

(Notice also how the whole expression to the right of the less-than sign is enclosed in parentheses; we'll discuss this in Chapter 9 in Section 9.3.1.)
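You can check this reasoning with a small sketch that lists the positions the loop visits for strings of different lengths. The helper name codon_starts and the short test strings are invented for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the start positions the loop visits
sub codon_starts {
    my($dna) = @_;
    my @starts = ();
    for(my $i=0; $i < (length($dna) - 2) ; $i += 3) {
        push(@starts, $i);
    }
    return @starts;
}

# Strings of length 6, 7, 8, and 9
foreach my $dna ('CGACGT', 'CGACGTC', 'CGACGTCT', 'CGACGTCTT') {
    my @starts = codon_starts($dna);
    print length($dna), " bases: codons start at @starts\n";
}
```

Lengths 6, 7, and 8 all give two codons, at positions 0 and 3; the one or two leftover bases are skipped. Only at length 9 does a third codon appear, at position 6.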

The line of code:

$codon = substr($dna,$i,3);

actually extracts the 3-base codon from the DNA. The call to the substr function specifies a substring of $dna at position $i of length 3, and saves it in the variable $codon.
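Here is a short sketch of substr on its own, using a made-up 12-base string. It also shows why the loop test matters: if you ask for three bases near the end of the string, substr doesn't complain, it just returns whatever is left.

```perl
use strict;
use warnings;

my $dna = 'CGACGTCTTCGT';

# Positions count from 0, so position 3 is the fourth base
my $codon = substr($dna, 3, 3);
print "$codon\n";     # prints CGT

# Asking for more than is left returns only what is there
my $tail = substr($dna, 10, 3);
print "$tail\n";      # prints GT
```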

If you know you'll need to do this DNA-to-protein translation a lot, you can turn Example 8-1 into a subroutine. Whenever you write a subroutine, you have to think about which arguments you may want to give it. For instance, there may come a time when you have some large DNA sequence but only want to translate a given part of it. Should you add two arguments to the subroutine as beginning and end points? You could, but decide not to. It's a judgment call, part of the art of decomposing a collection of code into useful fragments. It might be better to have a subroutine that just translates; then you can make it part of a larger subroutine that picks endpoints in the sequence, if needed. The thinking is that you'll usually just translate the whole thing, and always typing in 0 for the start and length($dna)-1 for the end would be an annoyance. Of course, this depends on what you're doing, so this particular choice just illustrates the kind of thinking that goes into writing the code.

You should also remove the informative print statement at the end, because it's more suited to a main program than a subroutine.

Anyway, you've now thought through the design and just want a subroutine that takes one argument containing DNA and returns a peptide translation:

# dna2peptide 
#
# A subroutine to translate DNA sequence into a peptide

sub dna2peptide {

    my($dna) = @_;

    use strict;
    use warnings;
    use BeginPerlBioinfo;     # see Chapter 6 about this module

    # Initialize variables
    my $protein = '';

    # Translate each three-base codon to an amino acid, and append to a protein 
    for(my $i=0; $i < (length($dna) - 2) ; $i += 3) {
        $protein .= codon2aa( substr($dna,$i,3) );
    }

    return $protein;
}

Now add subroutine dna2peptide to the BeginPerlBioinfo.pm module.
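The design choice discussed earlier, keeping dna2peptide simple and layering the choice of endpoints on top of it, might be sketched as follows. The wrapper name translate_region and its start and length arguments are hypothetical. So that the sketch runs on its own, it uses a cut-down stand-in for codon2aa that knows only the codons in this example; the real subroutine belongs in BeginPerlBioinfo.pm.

```perl
use strict;
use warnings;

# Stand-in for codon2aa: covers only the codons used in this example
sub codon2aa {
    my($codon) = @_;
    my %genetic_code = (
        'CGA' => 'R', 'CGT' => 'R', 'CGC' => 'R',
        'CTT' => 'L', 'CTA' => 'L',
        'ACG' => 'T', 'GGA' => 'G', 'GGT' => 'G',
        'GCT' => 'A', 'GTC' => 'V',
    );
    if(exists $genetic_code{$codon}) {
        return $genetic_code{$codon};
    }
    die "Codon '$codon' is not in this cut-down table\n";
}

# Same loop as in the text, without the use statements
sub dna2peptide {
    my($dna) = @_;
    my $protein = '';
    for(my $i=0; $i < (length($dna) - 2) ; $i += 3) {
        $protein .= codon2aa( substr($dna,$i,3) );
    }
    return $protein;
}

# Hypothetical wrapper: pick out a region, then just translate it
sub translate_region {
    my($dna, $start, $length) = @_;
    return dna2peptide( substr($dna, $start, $length) );
}

my $dna = 'CGACGTCTTCGTACGGGACTAGCTCGTGTCGGTCGC';

print dna2peptide($dna), "\n";               # prints RRLRTGLARVGR
print translate_region($dna, 6, 9), "\n";    # prints LRT
```

The common case, translating everything, needs only one argument, and the less common case costs one extra line in a wrapper.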

Notice that you've eliminated one of the variables in making the subroutine out of Example 8-1: the variable $codon. Why?

Well, one reason is because you can. In Example 8-1, you were using substr to extract the codon from $dna, saving it in variable $codon and then passing it into the subroutine codon2aa. This new way eliminates the middleman. Put the call to substr that extracts the codon as the argument to the subroutine codon2aa so that the value is passed in just as before, but without having to copy it to the variable $codon first.

This has somewhat improved efficiency and speed. Since copying strings is one of the slower things computer programs do, eliminating a bunch of string copies is an easy and effective way to speed up a program.
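If you want to measure the difference rather than take it on faith, the standard Benchmark module can compare the two styles. The following is only a rough sketch: first_base is a made-up stand-in for codon2aa, the repeated test string is arbitrary, and the timings (and even which version comes out ahead) will vary with your machine and your version of Perl.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# An arbitrary 3600-base test string
my $dna = 'CGACGTCTTCGTACGGGACTAGCTCGTGTCGGTCGC' x 100;

# Made-up stand-in for codon2aa: returns the codon's first base
sub first_base {
    my($codon) = @_;
    return substr($codon, 0, 1);
}

# Copy each codon into a variable first, as in Example 8-1
sub with_copy {
    my $out = '';
    my $codon;
    for(my $i=0; $i < (length($dna) - 2) ; $i += 3) {
        $codon = substr($dna,$i,3);
        $out .= first_base($codon);
    }
    return $out;
}

# Pass the result of substr directly, as in dna2peptide
sub without_copy {
    my $out = '';
    for(my $i=0; $i < (length($dna) - 2) ; $i += 3) {
        $out .= first_base( substr($dna,$i,3) );
    }
    return $out;
}

# Run each version for at least one CPU second and compare rates
cmpthese(-1, {
    'with $codon'    => \&with_copy,
    'without $codon' => \&without_copy,
});
```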

But has it made the program less readable? You be the judge. I think it has, a little, but the comment right before the loop seems to make everything clear enough, for me, anyway. It's important to have readable code, so if you really need to boost the speed of a subroutine, but find it makes the code harder to read, be sure to include enough comments for the reader to be able to understand what's going on.

For the first time, use statements are being included in a subroutine instead of the main program:

use strict;
use warnings;
use BeginPerlBioinfo;

This may be redundant with the calls in the main program, but it doesn't do any harm (Perl checks and loads a module only once). If this subroutine should be called from a module that doesn't already load the modules, it's done some good after all.
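You can watch Perl's load-once behavior with a small experiment. This sketch writes a throwaway module named LoadOnceDemo (a made-up name) into a temporary directory, and the module bumps a counter every time its file is actually run. It uses require, which is what use calls behind the scenes at compile time, so the same check applies.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Counter bumped by the module each time its file is run
our $load_count = 0;

# Write a tiny throwaway module (LoadOnceDemo is a made-up name)
my $dir = tempdir(CLEANUP => 1);
open(my $fh, '>', "$dir/LoadOnceDemo.pm") or die "Cannot write module: $!";
print $fh "package LoadOnceDemo;\n";
print $fh "\$main::load_count++;\n";
print $fh "1;\n";
close($fh);

# Tell Perl where to find it
push(@INC, $dir);

# Ask for the module twice; the second request finds it in %INC
require LoadOnceDemo;
require LoadOnceDemo;

print "Module file was run $load_count time(s)\n";    # prints 1 time(s)
```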

Now let's improve how we deal with DNA in files.

© 2002, O'Reilly & Associates, Inc.