7.3
A Program to Simulate DNA Mutation
Example 7-1 gave you the tools you'll
need to mutate DNA. In the following
examples, you'll represent DNA, as usual, by a string made out
of the alphabet A, C, G, and T. You'll randomly select
positions in the string and then use the substr
function to alter the DNA.
This time, let's go about things a little differently and first
compose some of the useful subroutines you'll need before
showing the whole program.
7.3.1
Pseudocode Design
Starting with simple pseudocode, here's a design for a
subroutine that mutates a random position in DNA to a random
nucleotide:
-
Select a random position in the string of DNA.
-
Choose a random nucleotide.
-
Substitute the random nucleotide into the random position in the DNA.
This seems short and to the point. So you decide to make each of the
first two sentences into a subroutine.
7.3.1.1
Select a random position in a string
How can you randomly
select a position in a
string? Recall that the built-in
function length returns the length of a string. Also
recall that positions in strings are numbered from
0 to length-1, just like
positions in arrays. So you can use the same general idea as in Example 7-1, and make a
subroutine:
# randomposition
#
# A subroutine to randomly select a position in a string.
#
# WARNING: make sure you call srand to seed the
# random number generator before you call this function.
sub randomposition {
my($string) = @_;
# This expression returns a random number between 0 and length-1,
# which is how the positions in a string are numbered in Perl.
return int(rand(length($string)));
}
randomposition is really a short function, if
you don't count the comments. It's just like the idea in
Example 7-1 to select a random array element.
Of course, if you were really writing this code, you'd make a
little test to see if your subroutine worked:
#!/usr/bin/perl -w
# Test the randomposition subroutine
my $dna = 'AACCGTTAATGGGCATCGATGCTATGCGAGCT';
srand(time|$$);
for (my $i=0 ; $i < 20 ; ++$i ) {
print randomposition($dna), " ";
}
print "\n";
exit;
sub randomposition {
my($string) = @_;
return int rand length $string;
}
Here's some representative output of the test (your results
should vary):
28 26 20 1 29 7 1 27 2 24 8 1 23 7 13 14 2 12 13 27
Notice the new look of the for loop:
for (my $i=0 ; $i < 20 ; ++$i ) {
This shows how you can localize the counter variables (in this case,
$i) to the loop by declaring them with
my inside the for loop.
7.3.1.2
Choose a random nucleotide
Next, let's write a
subroutine that randomly
chooses one
of the four nucleotides:
# randomnucleotide
#
# A subroutine to randomly select a nucleotide
#
# WARNING: make sure you call srand to seed the
# random number generator before you call this function.
sub randomnucleotide {
my(@nucs) = @_;
# scalar returns the size of an array.
# The elements of the array are numbered 0 to size-1
return $nucs[rand @nucs];
}
Again, this subroutine is short and sweet. (Most useful subroutines
are; although writing a short subroutine is no guarantee it will be
useful. In fact, you'll see in a bit how you can improve this
one.)
Let's test this one too:
#!/usr/bin/perl -w
# Test the randomnucleotide subroutine
my @nucleotides = ('A', 'C', 'G', 'T');
srand(time|$$);
for (my $i=0 ; $i < 20 ; ++$i ) {
print randomnucleotide(@nucleotides), " ";
}
print "\n";
exit;
sub randomnucleotide {
my(@nucs) = @_;
return $nucs[rand @nucs];
}
Here's some typical output (it's random, of course, so
there's a high probability your output will differ):
C A A A A T T T T T A C A C T A A G G G
7.3.1.3
Place a random nucleotide into a random position
Now for the third
and final subroutine, that
actually does the
mutation. Here's the code:
# mutate
#
# A subroutine to perform a mutation in a string of DNA
#
sub mutate {
my($dna) = @_;
my(@nucleotides) = ('A', 'C', 'G', 'T');
# Pick a random position in the DNA
my($position) = randomposition($dna);
# Pick a random nucleotide
my($newbase) = randomnucleotide(@nucleotides);
# Insert the random nucleotide into the random position in the DNA.
# The substr arguments mean the following:
# In the string $dna at position $position change 1 character to
# the string in $newbase
substr($dna,$position,1,$newbase);
return $dna;
}
Here, again, is a short program. As you look it over, notice that
it's relatively easy to read and understand. You mutate by
picking a random position then selecting a nucleotide at random and
substituting that nucleotide at that position in the string. (If
you've forgotten how
substr works, refer to Appendix B or other Perl documentation. If you're
like me, you probably have to do that a lot, especially to get the
order of the arguments right.)
There's a slightly different style used here for
declaring variables.
Whereas you've been declaring them at the beginning of a
program, here you're declaring each variable the first time
it's used. There are pros and cons for each programming style.
Having all the variables at the top of the program gives good
organization and can help in reading; declaring them on-the-fly can
seem like a more natural way to write. The choice is yours.
Also, notice how this subroutine is mostly built from other
subroutines, with a little bit added. That has a lot to do with its
readability. At this point, you may be thinking that you've
actually decomposed the problem pretty well, and the pieces are
fairly easy to build and, in the end, they fit together well. But do
they?
7.3.2
Improving the Design
You're about to pat yourself
on the back for
writing the program so quickly, but
you notice something. You keep having to declare that pesky
@nucleotides array and then pass it in to the
randomnucleotide
subroutine. But the only place you use
the array is inside the randomnucleotide
subroutine. So why not change your design a little? Here's a
new try:
# randomnucleotide
#
# A subroutine to randomly select a nucleotide
#
# WARNING: make sure you call srand to seed the
# random number generator before you call this function.
sub randomnucleotide {
my(@nucs) = ('A', 'C', 'G', 'T');
# scalar returns the size of an array.
# The elements of the array are numbered 0 to size-1
return $nucs[rand @nucs];
}
Notice that this function now has no arguments. It's called
like so:
$randomnucleotide = randomnucleotide( );
It's asking for a random element from a very specific set. Of
course, you're always thinking, and you say, "It'd
be handy to have a subroutine that
randomly selects an element from any array.
I might not need it right now, but I bet I'll need it
soon!" So you define two
subroutines instead of one:
# randomnucleotide
#
# A subroutine to randomly select a nucleotide
#
# WARNING: make sure you call srand to seed the
# random number generator before you call this function.
sub randomnucleotide {
my(@nucleotides) = ('A', 'C', 'G', 'T');
# scalar returns the size of an array.
# The elements of the array are numbered 0 to size-1
return randomelement(@nucleotides);
}
# randomelement
#
# A subroutine to randomly select an element from an array
#
# WARNING: make sure you call srand to seed the
# random number generator before you call this function.
sub randomelement {
my(@array) = @_;
return $array[rand @array];
}
Look back and notice that you didn't have to change your
subroutine mutate; just the internal workings of
randomnucleotide changed, not its behavior.
7.3.3
Combining the Subroutines to Simulate Mutation
Now you've got all your ducks in place, so you write your main program as in
Example 7-2 and see if your new subroutine works.
Example 7-2. Mutate DNA
#!/usr/bin/perl
# Mutate DNA
# using a random number generator to randomly select bases to mutate
use strict;
use warnings;
# Declare the variables
# The DNA is chosen to make it easy to see mutations:
my $DNA = 'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA';
# $i is a common name for a counter variable, short for "integer"
my $i;
my $mutant;
# Seed the random number generator.
# time|$$ combines the current time with the current process id
srand(time|$$);
# Let's test it, shall we?
$mutant = mutate($DNA);
print "\nMutate DNA\n\n";
print "\nHere is the original DNA:\n\n";
print "$DNA\n";
print "\nHere is the mutant DNA:\n\n";
print "$mutant\n";
# Let's put it in a loop and watch that bad boy accumulate mutations:
print "\nHere are 10 more successive mutations:\n\n";
for ($i=0 ; $i < 10 ; ++$i) {
$mutant = mutate($mutant);
print "$mutant\n";
}
exit;
################################################################################
# Subroutines for Example 7-2
################################################################################
# Notice, now that we have a fair number of subroutines, we
# list them alphabetically
# A subroutine to perform a mutation in a string of DNA
#
# WARNING: make sure you call srand to seed the
# random number generator before you call this function.
sub mutate {
my($dna) = @_;
my(@nucleotides) = ('A', 'C', 'G', 'T');
# Pick a random position in the DNA
my($position) = randomposition($dna);
# Pick a random nucleotide
my($newbase) = randomnucleotide(@nucleotides);
# Insert the random nucleotide into the random position in the DNA
# The substr arguments mean the following:
# In the string $dna at position $position change 1 character to
# the string in $newbase
substr($dna,$position,1,$newbase);
return $dna;
}
# A subroutine to randomly select an element from an array
#
# WARNING: make sure you call srand to seed the
# random number generator before you call this function.
sub randomelement {
my(@array) = @_;
return $array[rand @array];
}
# randomnucleotide
#
# A subroutine to select at random one of the four nucleotides
#
# WARNING: make sure you call srand to seed the
# random number generator before you call this function.
sub randomnucleotide {
my(@nucleotides) = ('A', 'C', 'G', 'T');
# scalar returns the size of an array.
# The elements of the array are numbered 0 to size-1
return randomelement(@nucleotides);
}
# randomposition
#
# A subroutine to randomly select a position in a string.
#
# WARNING: make sure you call srand to seed the
# random number generator before you call this function.
sub randomposition {
my($string) = @_;
# Notice the "nested" arguments:
#
# $string is the argument to length
# length($string) is the argument to rand
# rand(length($string))) is the argument to int
# int(rand(length($string))) is the argument to return
# But we write it without parentheses, as permitted.
#
# rand returns a decimal number between 0 and its argument.
# int returns the integer portion of a decimal number.
#
# The whole expression returns a random number between 0 and length-1,
# which is how the positions in a string are numbered in Perl.
#
return int rand length $string;
}
Here's some typical output from Example 7-2:
Mutate DNA
Here is the original DNA:
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
Here is the mutant DNA:
AAAAAAAAAAAAAAAAAAAAGAAAAAAAAA
Here are 10 more successive mutations:
AAAAAAAAAAAAAAAAAAAAGACAAAAAAA
AAAAAAAAAAAAAAAAAAAAGACAAAAAAA
AAAAAAAAAAAAAAAAAAAAGACAAAAAAA
AAAAAAAAAAAAAACAAAAAGACAAAAAAA
AAAAAAAAAAAAAACAACAAGACAAAAAAA
AAAAAAAAAAAAAACAACAAGACAAAAAAA
AAAAAAAAAGAAAACAACAAGACAAAAAAA
AAAAAATAAGAAAACAACAAGACAAAAAAA
AAAAAATAAGAAAACAACAAGACAAAAAAA
AAAAAATTAGAAAACAACAAGACAAAAAAA
Example 7-2 was something of a programming
challenge, but you end up with the satisfaction of seeing your
(simulated) DNA mutate. How about writing a
graphical display for this, so that
every time a base gets mutated, it makes a little explosion and the
color gets highlighted, so you can watch it happening in real-time?
Before you scoff, you should know how important good graphical
displays are for the success of most programs. This may be a
trivial-sounding graphic, but if you can demonstrate the most common
mutations in, for instance, the BRCA breast cancer genes in this way,
it might be useful.
7.3.4
A Bug in Your Program?
To return to the business at hand, you may have noticed
something when you looked over the output from Example 7-2. Look at the first two lines of the "10
more successive mutations." They are exactly the same! Could it
be that after patting yourself on the back and telling yourself what
a good bit of work you'd done, you've discovered a bug?
How can you track it down? You may want to step through the running
of the program with the Perl debugger, which you saw in Chapter 6. However, this time, you stop and think about
your design instead. You're replacing the bases at random
positions with randomly chosen bases. Aha! Sometimes the base at the
position you randomly choose is exactly the same as the base you
randomly choose to plug into its place! You're replacing a base
with itself on occasion![3]
Let's say you decide that behavior is not useful. At each
successive mutation, you need to see one base change. How can you
alter your code to ensure that? Let's start with some
pseudocode for the
mutate
subroutine:
Select a random position in the string of DNA
Repeat:
Choose a random nucleotide
Until: random nucleotide differs from the nucleotide in the random position
Substitute the random nucleotide into the random position in the DNA
This seems like something that should work, so you alter the
mutate subroutine, calling it the
mutate_better
subroutine:
# mutate_better
#
# Subroutine to perform a mutation in a string of DNA--version 2, in which
# it is guaranteed that one base will change on each call
#
# WARNING: make sure you call srand to seed the
# random number generator before you call this function.
sub mutate_better {
my($dna) = @_;
my(@nucleotides) = ('A', 'C', 'G', 'T');
# Pick a random position in the DNA
my($position) = randomposition($dna);
# Pick a random nucleotide
my($newbase);
do {
$newbase = randomnucleotide(@nucleotides);
# Make sure it's different than the nucleotide we're mutating
}until ( $newbase ne substr($dna, $position,1) );
# Insert the random nucleotide into the random position in the DNA
# The substr arguments mean the following:
# In the string $dna at position $position change 1 character to
# the string in $newbase
substr($dna,$position,1,$newbase);
return $dna;
}
When you plug this subroutine in place of mutate
and run the code, you get the following output:
Mutate DNA
Here is the original DNA:
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
Here is the mutant DNA:
AAAAAAAAAAAAATAAAAAAAAAAAAAAAA
Here are 10 more successive mutations:
AAAAAAAAAAAAATAAAAAAAACAAAAAAA
AAAAATAAAAAAATAAAAAAAACAAAAAAA
AAATATAAAAAAATAAAAAAAACAAAAAAA
AAATATAAAAAAATAAAAAAAACAACAAAA
AATTATAAAAAAATAAAAAAAACAACAAAA
AATTATTAAAAAATAAAAAAAACAACAAAA
AATTATTAAAAAATAAAAAAAACAACACAA
AATTATTAAAAAGTAAAAAAAACAACACAA
AATTATTAAAAAGTGAAAAAAACAACACAA
AATTATTAAAAAGTGATAAAAACAACACAA
which seems to indeed make a real change on every iteration.
Notice one more thing about
declaring variables.
In this code for mutate_better, if you'd
declared $newbase within the loop, since the loop
is enclosed in a block, the variable $newbase
would not then be visible outside of that loop. In particular, it
wouldn't be available in the substr call
that does the actual base change for the mutation. So, in
mutate_better, you had to declare the variable
outside of the loop.
This is a frequent source of confusion for programmers who like to
declare variables on the fly and a powerful argument for getting into
the habit of collecting variable definitions together at the top of
the program.
Even so, there are often times when you want to hide a variable
within a block, because that's the only place where you will
use
it. Then you may want to do the declaration in
the
block . (Perhaps at the top of the block,
if it's a long one?)