8.6 Reading Frames
The biologist knows that, given a sequence of DNA, it is necessary to
examine all six reading
frames of the DNA to find the coding regions
the cell uses to make proteins.
8.6.1 What Are Reading Frames?
Very often you won't know
where in the DNA you're studying the cell actually begins
translating the DNA into protein. Only about 1-1.5% of human DNA is
in genes, which are the parts of DNA used for the translation into
proteins. Furthermore, genes very often occur in pieces that are
spliced together during the transcription/translation process.
If you don't know where the translation starts, you have to
consider the six possible reading frames. Since the codons are three
bases long, the translation happens in three "frames,"
for instance starting at the first base, or the second, or perhaps
the third. (The fourth would be the same as starting from the first.)
Each starting place gives a different series of codons, and, as a
result, a different series of amino acids.
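To make the three frames concrete, here is a minimal sketch that splits a short, made-up sequence into codons for each forward frame. The subroutine name codons_in_frame is hypothetical and is not part of the book's module; frames are numbered 1 to 3.

```perl
use strict;
use warnings;

# codons_in_frame (hypothetical helper, for illustration only)
#
# Split DNA into complete codons, starting at frame 1, 2, or 3
sub codons_in_frame {
    my($dna, $frame) = @_;

    # Frame 1 starts at the first base, which is offset 0 for substr
    my $in_frame = substr($dna, $frame - 1);

    # Collect each complete group of three bases; a trailing
    # partial codon is ignored. Call this in list context.
    return $in_frame =~ /(...)/g;
}

my $dna = 'ATGGCCTAA';   # made-up sequence

foreach my $frame (1 .. 3) {
    my @codons = codons_in_frame($dna, $frame);
    print "Frame $frame: @codons\n";
}
```

Each frame yields a different list of codons from the same bases, which is why each frame translates to a different series of amino acids.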
Also, transcription and translation can happen on either strand of
the DNA; that is, either the DNA sequence, or its reverse complement,
might contain DNA code that is actually translated. The reverse
complement can also be read in any one of three frames. So a total of
six reading frames have to be considered when looking for
coding regions, the parts of the DNA that encode
proteins.
It is therefore quite common to examine all six reading frames of a
DNA sequence and to look at the resulting protein translations for
long stretches of
amino
acids that lack stop codons.
The stop codons are
definite breaks in the DNA-to-protein
translation process. During translation (actually of RNA to protein,
but I'm being deliberately informal and vague about the
biochemistry), if a stop codon is reached, the translation stops, and
the growing peptide chain grows no more.
Long stretches of DNA that don't contain any stop codons are
called open reading
frames (ORFs) and are important clues to the
presence of a gene in the DNA under study. So gene finder programs
need to perform the type of reading frame analysis we'll do in
this chapter.
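As a rough illustration of the idea, the longest stop-free stretch in a translated frame can be found by splitting the peptide at each stop. This sketch assumes that stop codons are written as an underscore in the peptide string; the name longest_open_stretch and the sample peptide are made up for the example. Real gene finders weigh much more evidence than this.

```perl
use strict;
use warnings;

# longest_open_stretch (hypothetical helper, for illustration only)
#
# Return the longest run of amino acids between stop codons,
# assuming stop codons appear as '_' in the peptide string
sub longest_open_stretch {
    my($peptide) = @_;

    my $longest = '';

    # Each piece between underscores is a stretch with no stop codon;
    # on a tie, the earliest stretch is kept
    foreach my $stretch (split /_/, $peptide) {
        if (length($stretch) > length($longest)) {
            $longest = $stretch;
        }
    }

    return $longest;
}

my $peptide = 'MKV_AGHLLVCSG_DD';   # made-up peptide
print longest_open_stretch($peptide), "\n";
```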
8.6.2
Translating Reading Frames
Based
on the facts just presented, let's
write some code that translates the DNA in all six reading frames.
In the real world, you'd look around for some subroutines that
are already written to do that task. Given the basic nature of the
task—something anyone who studies DNA has to
do—you'd likely find something. But this is a tutorial,
not the real world, so let's soldier on.
This problem doesn't sound too daunting. So, take stock of the
subroutines at your disposal, and think about where you are and how
you can get to your destination.
Looking through the subroutines we've already written, recall
dna2peptide. You may recall considering adding
some arguments to specify starting and end points. Let's do
this now.
Remember that although
we calculated reverse complements
back in Chapter 4, we never made a subroutine out
of it. So let's start there:
# revcom
#
# A subroutine to compute the reverse complement of DNA sequence
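sub revcom {

    my($dna) = @_;

    # First reverse the sequence. Assigning to a plain scalar puts
    # reverse in scalar context, which reverses the characters of the
    # string; in list context, reverse would return the string unchanged
    my $revcom = reverse $dna;

    # Next, complement the sequence, dealing with upper and lower case
    # A->T, T->A, C->G, G->C
    $revcom =~ tr/ACGTacgt/TGCAtgca/;

    return $revcom;
}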
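One Perl detail deserves care when reversing a sequence: reverse reverses the characters of a string only in scalar context. In list context it reverses the order of its arguments, so a single string comes back unchanged. A short sketch of the difference, using a made-up sequence:

```perl
use strict;
use warnings;

my $dna = 'AAACGT';   # made-up sequence

# Scalar context: the characters of the string are reversed
my $scalar_result = reverse $dna;

# List context (note the parentheses around the variable):
# the one-element list is reversed, so the string is unchanged
my($list_result) = reverse $dna;

print "scalar context: $scalar_result\n";
print "list context:   $list_result\n";
```

Only the scalar-context form produces the reversed sequence that a reverse complement needs.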
Now, a little pseudocode to sketch an idea for the subroutine that
will translate specific ranges of DNA:
Given DNA sequence

subroutine translate_frame ( DNA, start, end) {

    return dna2peptide( substr( DNA, start, end - start + 1 ) )
}
That went well! Luckily, the substr built-in
Perl function made it easy to apply the desired start and end points,
while passing the DNA into the already written
dna2peptide subroutine.
Note that the length of the sequence is
end-start+1. To give a small example: if you start
at position 3 and end at position 5, you've got the bases at
positions 3, 4, and 5, three bases in all, which is exactly what 5 -
3 + 1 equals.
Dealing with indices like this has to
be done carefully, or the code won't work. For many programs,
this is the worst the mathematics gets.
Pay attention to the indices!
You have to decide whether to number positions from 0, which is
Perl's way, or to put the first character of the sequence at
position 1, which is the biologist's way. Let's do it the
biologist's way. The positions will be decreased by one when
passed to the Perl function substr, which, of
course, does it Perl's way.
The corrected pseudocode looks like this:
Given DNA sequence

subroutine translate_frame ( DNA, start, end) {

    # start and end are numbering the sequence from 1 to length

    return dna2peptide( substr( DNA, start - 1, end - start + 1 ) )
}
The length of the desired sequence doesn't change with the
change in indices, since:
(end - 1) - (start - 1) + 1 = end - start + 1
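This arithmetic can be checked directly in Perl. The sketch below uses a made-up sequence and a hypothetical helper, subseq_1based, that takes positions numbered from 1 and converts only the start position for substr.

```perl
use strict;
use warnings;

# subseq_1based (hypothetical helper, for illustration only)
#
# Extract the bases from position start to position end, inclusive,
# where positions are numbered from 1
sub subseq_1based {
    my($dna, $start, $end) = @_;

    # substr counts from 0, so subtract 1 from the start position;
    # the length stays end - start + 1
    return substr($dna, $start - 1, $end - $start + 1);
}

# Positions 3, 4, and 5 of a made-up sequence
print subseq_1based('ACGTACGT', 3, 5), "\n";
```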
So let's write this subroutine:
# translate_frame
#
# A subroutine to translate a frame of DNA
sub translate_frame {

    my($seq, $start, $end) = @_;

    # To make the subroutine easier to use, you won't need to specify
    # the end point--it will just go to the end of the sequence
    # by default.
    unless($end) {
        $end = length($seq);
    }

    # Finally, calculate and return the translation
    return dna2peptide ( substr ( $seq, $start - 1, $end - $start + 1) );
}
Example 8-4 translates the DNA in all six reading
frames.
Example 8-4. Translate a DNA sequence in all six reading frames
#!/usr/bin/perl
# Translate a DNA sequence in all six reading frames
use strict;
use warnings;
use BeginPerlBioinfo; # see Chapter 6 about this module
# Initialize variables
my @file_data = ( );
my $dna = '';
my $revcom = '';
my $protein = '';
# Read in the contents of the file "sample.dna"
@file_data = get_file_data("sample.dna");
# Extract the sequence data from the contents of the file "sample.dna"
$dna = extract_sequence_from_fasta_data(@file_data);
# Translate the DNA to protein in six reading frames
# and print the protein in lines 70 characters long
print "\n -------Reading Frame 1--------\n\n";
$protein = translate_frame($dna, 1);
print_sequence($protein, 70);
print "\n -------Reading Frame 2--------\n\n";
$protein = translate_frame($dna, 2);
print_sequence($protein, 70);
print "\n -------Reading Frame 3--------\n\n";
$protein = translate_frame($dna, 3);
print_sequence($protein, 70);
# Calculate reverse complement
$revcom = revcom($dna);
print "\n -------Reading Frame 4--------\n\n";
$protein = translate_frame($revcom, 1);
print_sequence($protein, 70);
print "\n -------Reading Frame 5--------\n\n";
$protein = translate_frame($revcom, 2);
print_sequence($protein, 70);
print "\n -------Reading Frame 6--------\n\n";
$protein = translate_frame($revcom, 3);
print_sequence($protein, 70);
exit;
Here's the output of Example 8-4:
-------Reading Frame 1--------
RWRR_GVLGALGRPPTGLQRRRRMGPAQ_EYAAWEA_LEAEVVVGAFATAWDAAEWSVQVRGSLAGVVRE
CAGSGDMEGDGSDPEPPDAGEDSKSENGENAPIYCICRKPDINCFMIGCDNCNEWFHGDCIRITEKMAKA
IREWYCRECREKDPKLEIRYRHKKSRERDGNERDSSEPRDEGGGRKRPVPDPDLQRRAGSGTGVGAMLAR
GSASPHKSSPQPLVATPSQHHQQQQQQIKRSARMCGECEACRRTEDCGHCDFCRDMKKFGGPNKIRQKCR
LRQCQLRARESYKYFPSSLSPVTPSESLPRPRRPLPTQQQPQPSQKLGRIREDEGAVASSTVKEPPEATA
TPEPLSDEDL
-------Reading Frame 2--------
DGGAEGSWGL_AGHLLVCSGDDAWGLRNRSTLPGRRD_KRK_LWAPLQPPGTPPSGLCRFAGRWRGS_GS
APGAEIWREMVQTQSLQMPGRTASPRMGRMRPSTASAANRTSTAS_SGVTTAMSGSMGTASGSLRRWPRP
SGSGTVGSAERKTPS_RFAIGTRSHGSGMAMSGTAVSPGMRVEGARGLSLIQTCSAGQGQGQGLGPCLLG
ALLRPTNPLRSPWWPHPASITSSSSSRSNGQPACVVSVRHVGALRTVVTVISVGT_RSSGAPTRSGRSAG
CASASCGPGNRTSTSLPRSHQ_RPQSPCQGPAGHCPPNSSHSHHRS_GASVKMRGQWRHQQSRSLLRLQP
HLSHSQMRT
-------Reading Frame 3--------
MAALRGLGGSRPATYWFAAETTHGACAIGVRCLGGVTRSGSSCGRLCNRLGRRRVVCAGSRVAGGGREGV
RRERRYGGRWFRPRASRCRGGQQVREWGECAHLLHLPQTGHQLLHDRV_QLQ_VVPWGLHPDH_EDGQGH
PGVVLSGVQRERPQARDSLSAQEVTGAGWQ_AGQQ_APG_GWRAQEACP_SRPAAPGRVRDRGWGHACSG
LCFAPQILSAALGGHTQPASPAAAAADQTVSPHVW_V_GMSAH_GLWSL_FLSGHEEVRGPQQDPAEVPA
APVPAAGPGIVQVLPFLALTSDALRVPAKAPPATAHPTAATAITEVRAHP_R_GGSGVINSQGAS_GYSH
T_ATLR_GP
-------Reading Frame 4--------
_VLI_EWLRCGCSLRRLLDC__RHCPLIFTDAP_LL_WLWLLLGGQWPAGPWQGL_GRHW_ERGREVLVR
FPGPQLALAQPALLPDLVGAPELLHVPTEITVTTVLSAPTCLTLTTHAG_PFDLLLLLLVMLAGCGHQGL
RRGFVGRSRAPSKHGPNPCP_PCPALQVWIRDRPLAPSTLIPGLTAVPLIAIPLP_LLVPIANL_LGVFL
SALPTVPLPDGLGHLLSDPDAVPMEPLIAVVTPDHEAVDVRFAADAVDGRILPILGLAVLPGIWRLWV_T
ISLHISAPGALPHDPRQRPANLHRPLGGVPGGCKGAHNYFRF_SRLPGSVLLLRRPHASSPLQTSRWPA_
SPQDPSAPPS
-------Reading Frame 5--------
RSSSESGSGVAVASGGSLTVDDATAPSSSRMRPNFCDGCGCCWVGSGRRGLGRDSEGVTGESEEGKYLYD
SRARSWHWRSRHFCRILLGPPNFFMSRQKSQ_PQSSVRRHASHSPHMRADRLICCCCCW_CWLGVATKGC
GEDLWGEAEPRASMAPTPVPDPARRCRSGSGTGLLRPPPSSRGSLLSRSLPSRSRDFLCR_RISSLGSFS
LHSRQYHSRMALAIFSVIRMQSPWNHSLQLSHPIMKQLMSGLRQMQ_MGAFSPFSDLLSSPASGGSGSEP
SPSISPLPAHSLTTPASDPRTCTDHSAASQAVAKAPTTTSASSHASQAAYSYCAGPMRRLRCKPVGGRPR
APKTPQRRH
-------Reading Frame 6--------
GPHLRVAQVWL_PQEAP_LLMTPLPPHLHGCALTSVMAVAAVGWAVAGGALAGTLRASLVRARKGSTCTI
PGPAAGTGAAGTSAGSCWGPRTSSCPDRNHSDHSPQCADMPHTHHTCGLTV_SAAAAAGDAGWVWPPRAA
ERICGAKQSPEQAWPQPLSLTLPGAAGLDQGQASCALHPHPGAHCCPAHCHPAPVTSCADSESLAWGLSL
CTPDSTTPGWPWPSSQ_SGCSPHGTTHCSCHTRS_SS_CPVCGRCSRWAHSPHSRTCCPPRHLEALGLNH
LPPYLRSRRTPSRPPPATREPAQTTRRRPRRLQRRPQLLPLLVTPPRQRTPIAQAPCVVSAANQ_VAGLE
PPRPLSAAI
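The six nearly identical stanzas in Example 8-4 could also be written as a loop. The following is a sketch of that idea rather than the book's code: it collects the DNA for each frame instead of translating it, so that it runs without the BeginPerlBioinfo module. The name six_frame_dna is hypothetical, and revcom is repeated here only to keep the sketch self-contained.

```perl
use strict;
use warnings;

# Reverse complement, repeated here so the sketch is self-contained
sub revcom {
    my($dna) = @_;

    my $revcom = reverse $dna;
    $revcom =~ tr/ACGTacgt/TGCAtgca/;

    return $revcom;
}

# six_frame_dna (hypothetical helper, for illustration only)
#
# Return the DNA read from each of the six frames: frames 1-3 on
# the given strand, then frames 4-6 on the reverse complement
sub six_frame_dna {
    my($dna) = @_;

    my @frames = ( );

    foreach my $strand ($dna, revcom($dna)) {
        foreach my $start (1 .. 3) {
            # Positions are numbered from 1, but substr counts from 0
            push(@frames, substr($strand, $start - 1));
        }
    }

    return @frames;
}

my @frames = six_frame_dna('ATGGCCTAA');   # made-up sequence

for my $i (0 .. $#frames) {
    print "Frame ", $i + 1, ": $frames[$i]\n";
}
```

Each element of the returned list could then be passed to dna2peptide and printed with print_sequence, as Example 8-4 does for each frame in turn.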