8.4 Translating DNA into Proteins
Example 8-1 shows how the new
codon2aa subroutine translates a whole DNA
sequence into protein.
Example 8-1. Translate DNA into protein
#!/usr/bin/perl
# Translate DNA into protein
use strict;
use warnings;
use BeginPerlBioinfo; # see Chapter 6 about this module
# Initialize variables
my $dna = 'CGACGTCTTCGTACGGGACTAGCTCGTGTCGGTCGC';
my $protein = '';
my $codon;
# Translate each three-base codon into an amino acid, and append to a protein
for(my $i=0; $i < (length($dna) - 2) ; $i += 3) {
    $codon = substr($dna,$i,3);
    $protein .= codon2aa($codon);
}
print "I translated the DNA\n\n$dna\n\n into the protein\n\n$protein\n\n";
exit;
To make this work, you'll need the
BeginPerlBioinfo.pm module for your subroutines
in a separate file the program can find, as discussed in Chapter 6. You also have to add the
codon2aa subroutine to it. Alternatively, you
can add the code for the subroutine codon2aa
directly to the program in Example 8-1 and remove
the reference to the BeginPerlBioinfo.pm module.
Here's the output from Example 8-1:
I translated the DNA

CGACGTCTTCGTACGGGACTAGCTCGTGTCGGTCGC

 into the protein

RRLRTGLARVGR
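For readers who take the second route described above and put the
subroutine directly in the program, here is a minimal self-contained
sketch. The codon2aa shown here is only a stand-in written for this
illustration, not the version from the BeginPerlBioinfo.pm module: it
builds the standard genetic code from a 64-letter string and, as an
assumption, returns '*' for stop codons and dies on any codon it cannot
look up.

```perl
#!/usr/bin/perl
# Sketch: Example 8-1 with a stand-in codon2aa defined in the same file
use strict;
use warnings;

# Standard genetic code, one letter per codon, with the codons enumerated
# in TCAG order (TTT, TTC, TTA, TTG, TCT, ...).
# '*' marks a stop codon (an assumption made for this sketch).
my $amino_acids =
    'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG';
my @bases = qw(T C A G);
my %genetic_code;
my $n = 0;
for my $first (@bases) {
    for my $second (@bases) {
        for my $third (@bases) {
            $genetic_code{"$first$second$third"} = substr($amino_acids, $n++, 1);
        }
    }
}

# Stand-in for codon2aa: translate one three-base codon
sub codon2aa {
    my ($codon) = @_;
    $codon = uc $codon;
    exists $genetic_code{$codon}
        or die "Bad codon '$codon'\n";
    return $genetic_code{$codon};
}

# The same data and loop as Example 8-1
my $dna     = 'CGACGTCTTCGTACGGGACTAGCTCGTGTCGGTCGC';
my $protein = '';

for (my $i = 0; $i < (length($dna) - 2); $i += 3) {
    $protein .= codon2aa(substr($dna, $i, 3));
}

print "$protein\n";    # prints RRLRTGLARVGR
```

Building the table from a string keeps the sketch short; a hash with all
64 codons written out would do the same job and is easier to scan by eye.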
You've seen all the elements in Example 8-1
before, except for the way it loops through the DNA with this
statement:
for(my $i=0; $i < (length($dna) - 2) ; $i += 3) {
Recall that a for loop has three parts, delimited
by the two semicolons. The first part initializes a counter:
my $i=0 lexically scopes the
$i variable so it's visible only inside the
loop, and any other $i elsewhere in the code
(there aren't any in this case, but it can happen) is
hidden inside the loop. The third part of the
for loop increments the counter after all the
statements in the block are executed and before returning to the
beginning of the loop:
$i += 3
Since you're trying to march through the DNA three bases at a
shot, you increment by three.
The second, middle part of the for loop tests
whether the loop should continue:
$i < (length($dna) - 2)
The point is that if there are zero, one, or two bases left, you
should quit, because there aren't enough to make a codon. Now,
the positions in a string of DNA of a certain length are numbered
from 0 to length-1. So if the
position counter $i has reached
length-2, there are only two more bases (at
positions length-2 and
length-1), and you should quit. Only if the
position counter $i is less than
length-2 will you still have at least three bases
left, enough for a codon. So the test succeeds only if:
$i < (length($dna) - 2)
(Notice also how the whole expression to the right of the less-than
sign is enclosed in parentheses; we'll discuss this in Chapter 9 in Section 9.3.1.)
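To see the loop test at work, here is a small sketch that runs the same
loop header over a sequence whose length is not a multiple of three, and
then repeats it with a looser test for comparison. The eight-base
sequence and the array names are made up for illustration.

```perl
use strict;
use warnings;

my $dna = 'ATGCCCGG';    # 8 bases: two full codons plus two leftover bases

# Same loop header as Example 8-1; length($dna) - 2 is 6 here.
# $i takes the values 0 and 3; at $i == 6 the test 6 < 6 fails.
my @codons;
for (my $i = 0; $i < (length($dna) - 2); $i += 3) {
    push @codons, substr($dna, $i, 3);
}
print "@codons\n";    # prints ATG CCC

# With the looser test $i < length($dna), the loop also runs at $i == 6
# and picks up a two-base fragment that is not a codon.
my @loose;
for (my $i = 0; $i < length($dna); $i += 3) {
    push @loose, substr($dna, $i, 3);
}
print "@loose\n";     # prints ATG CCC GG
```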
The line of code:
$codon = substr($dna,$i,3);
actually extracts the 3-base codon from the DNA. The call to the
substr function specifies a substring of
$dna starting at position $i with length
3, and saves it in the variable
$codon.
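As a quick illustration of how the offset and length arguments of
substr pick out successive codons, consider the following sketch. The
nine-base sequence and the variable names are chosen only for this
example.

```perl
use strict;
use warnings;

my $dna = 'CGACGTCTT';

# substr(STRING, OFFSET, LENGTH): offsets count from 0
my $first  = substr($dna, 0, 3);    # bases at positions 0, 1, 2
my $second = substr($dna, 3, 3);    # bases at positions 3, 4, 5
my $third  = substr($dna, 6, 3);    # bases at positions 6, 7, 8

print "$first $second $third\n";    # prints CGA CGT CTT
print "$dna\n";                     # prints CGACGTCTT (original unchanged)
```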
If you know you'll need to do this DNA-to-protein translation a
lot, you can turn Example 8-1 into a subroutine.
Whenever you write a subroutine, you have to think about which
arguments you may want to give it. There may come a time when
you have some large DNA sequence but
only want to translate a given part of it. Should you add two
arguments to the subroutine as beginning and end points? You could,
but decide not to. It's a judgment call, part of the art
of decomposing a collection of code into useful fragments. It
might be better to have a subroutine that just translates; then you
can make it part of a larger subroutine that picks endpoints in the
sequence, if needed. The thinking is that you'll usually just
translate the whole thing, and always typing in 0
for the start and length($dna)-1 for the end would
be an annoyance. Of course, this depends on what you're doing,
so this particular choice just illustrates the kind of thinking that
goes into writing the code.
You should also remove the informative print
statement at the end, because it's more suited to a main
program than a subroutine.
Anyway, you've now thought through the design and just want a
subroutine that takes one argument containing
DNA and returns a peptide
translation:
# dna2peptide
#
# A subroutine to translate DNA sequence into a peptide
sub dna2peptide {

    my($dna) = @_;

    use strict;
    use warnings;
    use BeginPerlBioinfo; # see Chapter 6 about this module

    # Initialize variables
    my $protein = '';

    # Translate each three-base codon to an amino acid, and append to a protein
    for(my $i=0; $i < (length($dna) - 2) ; $i += 3) {
        $protein .= codon2aa( substr($dna,$i,3) );
    }

    return $protein;
}
Now add subroutine dna2peptide to the
BeginPerlBioinfo.pm module.
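The design choice made earlier, a dna2peptide that only translates plus
a separate routine that picks endpoints when needed, might be sketched
as follows. The wrapper name translate_region and its zero-based,
inclusive start and end arguments are hypothetical, and the codon2aa
here is a compact stand-in (standard genetic code, '*' for stop codons)
so that the sketch runs on its own without the module.

```perl
use strict;
use warnings;

# Stand-in genetic code table: standard code, codons in TCAG order, '*' = stop
my $amino_acids =
    'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG';
my @bases = qw(T C A G);
my %genetic_code;
my $n = 0;
for my $first (@bases) {
    for my $second (@bases) {
        for my $third (@bases) {
            $genetic_code{"$first$second$third"} = substr($amino_acids, $n++, 1);
        }
    }
}

# Stand-in for codon2aa (not the module's version)
sub codon2aa {
    my ($codon) = @_;
    $codon = uc $codon;
    exists $genetic_code{$codon}
        or die "Bad codon '$codon'\n";
    return $genetic_code{$codon};
}

# dna2peptide as in the text; the use lines sit at the top of this file
sub dna2peptide {
    my ($dna) = @_;
    my $protein = '';
    for (my $i = 0; $i < (length($dna) - 2); $i += 3) {
        $protein .= codon2aa(substr($dna, $i, 3));
    }
    return $protein;
}

# Hypothetical wrapper: translate only positions $start..$end
# (zero-based, inclusive) by handing that slice to dna2peptide
sub translate_region {
    my ($dna, $start, $end) = @_;
    return dna2peptide(substr($dna, $start, $end - $start + 1));
}

my $dna = 'CGACGTCTTCGTACGGGACTAGCTCGTGTCGGTCGC';
print dna2peptide($dna), "\n";               # prints RRLRTGLARVGR
print translate_region($dna, 3, 11), "\n";   # bases 3..11 (CGTCTTCGT): prints RLR
```

Keeping the endpoint logic in the wrapper means the common case stays a
one-argument call, while the less common case costs one extra line.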
Notice that you've eliminated one of the variables in making
the subroutine out of Example 8-1: the variable
$codon. Why?
Well, one reason is because you can. In Example 8-1,
you were using substr to extract the codon from
$dna, saving it in variable
$codon and then passing it into the subroutine
codon2aa. This new way eliminates the middleman.
Put the call to substr that extracts the codon
as the argument to the subroutine codon2aa so
that the value is passed in just as before, but without having to
copy it to the variable $codon first.
This has somewhat improved efficiency and speed. Since
copying strings is one of the slower
things computer programs do, eliminating a bunch of string copies is
an easy and effective way to speed up a program.
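How much time this saves in practice depends on the data and the Perl
version, so it's worth measuring. One way is Perl's core Benchmark
module, as in the rough sketch below. Here fake_codon2aa is a dummy
stand-in that returns a constant, so that only the handling of the
substring differs between the two versions; the sequence length and the
iteration count are arbitrary choices.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $dna = 'CGACGTCTTCGTACGGGACTAGCTCGTGTCGGTCGC' x 10;    # 360 bases

# Dummy stand-in for codon2aa: the real lookup is irrelevant to this comparison
sub fake_codon2aa { return 'X' }

# Version 1: copy each codon into a variable first (as in Example 8-1)
sub with_variable {
    my $protein = '';
    my $codon;
    for (my $i = 0; $i < (length($dna) - 2); $i += 3) {
        $codon = substr($dna, $i, 3);
        $protein .= fake_codon2aa($codon);
    }
    return $protein;
}

# Version 2: pass the substr result straight to the subroutine
# (as in dna2peptide)
sub without_variable {
    my $protein = '';
    for (my $i = 0; $i < (length($dna) - 2); $i += 3) {
        $protein .= fake_codon2aa(substr($dna, $i, 3));
    }
    return $protein;
}

# Both versions must agree before timing them means anything
die "results differ\n" unless with_variable() eq without_variable();

# Run each version 20,000 times and print a comparison table
cmpthese(20_000, {
    with_variable    => \&with_variable,
    without_variable => \&without_variable,
});
```

The table that cmpthese prints will vary from machine to machine, so no
figures are shown here.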
But has it made the program less readable? You be the judge. I think
it has, a little, but the comment right before the loop seems to make
everything clear enough, for me, anyway. It's important to have
readable code, so if you really need to boost the speed of a
subroutine, but find it makes the code harder to read, be sure to
include enough comments for the reader to be able to understand
what's going on.
For the first time, use statements are being
included in a subroutine instead of the main program:
use strict;
use warnings;
use BeginPerlBioinfo;
This may be redundant with the use statements in the main program, but it
doesn't do any harm (Perl checks and loads a module only once).
If this subroutine should be called from a module that doesn't
already load the modules, it's done some good after all.
Now let's improve how we deal with DNA in files.