< BACKCONTINUE >

7.3 A Program to Simulate DNA Mutation

Example 7-1 gave you the tools you'll need to mutate DNA. In the following examples, you'll represent DNA, as usual, by a string made out of the alphabet A, C, G, and T. You'll randomly select positions in the string and then use the substr function to alter the DNA.

This time, let's go about things a little differently and first compose some of the useful subroutines you'll need before showing the whole program.

7.3.1 Pseudocode Design

Starting with simple pseudocode, here's a design for a subroutine that mutates a random position in DNA to a random nucleotide:

  1. Select a random position in the string of DNA.

  2. Choose a random nucleotide.

  3. Substitute the random nucleotide into the random position in the DNA.

This seems short and to the point. So you decide to make each of the first two sentences into a subroutine.

7.3.1.1 Select a random position in a string

How can you randomly select a position in a string? Recall that the built-in function length returns the length of a string. Also recall that positions in strings are numbered from 0 to length-1, just like positions in arrays. So you can use the same general idea as in Example 7-1, and make a subroutine:

# randomposition
#
# A subroutine to randomly select a position in a string.
#
# WARNING: make sure you call srand to seed the
#  random number generator before you call this function.

sub randomposition {

    my($string) = @_;

    # This expression returns a random number between 0 and length-1,
    # which is how the positions in a string are numbered in Perl.

    return int(rand(length($string)));
}

randomposition is really a short function, if you don't count the comments. It's just like the idea in Example 7-1 to select a random array element.

Of course, if you were really writing this code, you'd make a little test to see if your subroutine worked:

#!/usr/bin/perl -w
# Test the randomposition subroutine

my $dna = 'AACCGTTAATGGGCATCGATGCTATGCGAGCT';

srand(time|$$);

for (my $i=0 ; $i < 20 ; ++$i ) {
    print randomposition($dna), " ";
}

print "\n";

exit;

sub randomposition {
    my($string) = @_;
    return int rand length $string;
}

Here's some representative output of the test (your results should vary):

28 26 20 1 29 7 1 27 2 24 8 1 23 7 13 14 2 12 13 27 

Notice the new look of the for loop:

for (my $i=0 ; $i < 20 ; ++$i ) {

This shows how you can localize the counter variables (in this case, $i) to the loop by declaring them with my inside the for loop.

7.3.1.2 Choose a random nucleotide

Next, let's write a subroutine that randomly chooses one of the four nucleotides:

# randomnucleotide
#
# A subroutine to randomly select a nucleotide
#
# WARNING: make sure you call srand to seed the
#  random number generator before you call this function.

sub randomnucleotide {

  my(@nucs) = @_;

  # scalar returns the size of an array. 
  # The elements of the array are numbered 0 to size-1
  return $nucs[rand @nucs];
}

Again, this subroutine is short and sweet. (Most useful subroutines are; although writing a short subroutine is no guarantee it will be useful. In fact, you'll see in a bit how you can improve this one.)

Let's test this one too:

#!/usr/bin/perl -w
# Test the randomnucleotide subroutine

my @nucleotides = ('A', 'C', 'G', 'T');

srand(time|$$);

for (my $i=0 ; $i < 20 ; ++$i ) {
    print randomnucleotide(@nucleotides), " ";
}

print "\n";

exit;

sub randomnucleotide {
    my(@nucs) = @_;

    return $nucs[rand @nucs];
}

Here's some typical output (it's random, of course, so there's a high probability your output will differ):

C A A A A T T T T T A C A C T A A G G G 
7.3.1.3 Place a random nucleotide into a random position

Now for the third and final subroutine, that actually does the mutation. Here's the code:

# mutate
#
# A subroutine to perform a mutation in a string of DNA
#

sub mutate {

    my($dna) = @_;
    my(@nucleotides) = ('A', 'C', 'G', 'T');

    # Pick a random position in the DNA
    my($position) = randomposition($dna);

    # Pick a random nucleotide
    my($newbase) = randomnucleotide(@nucleotides);

    # Insert the random nucleotide into the random position in the DNA.
    # The substr arguments mean the following:
    #  In the string $dna at position $position change 1 character to
    #  the string in $newbase
    substr($dna,$position,1,$newbase);

    return $dna;
}

Here, again, is a short program. As you look it over, notice that it's relatively easy to read and understand. You mutate by picking a random position then selecting a nucleotide at random and substituting that nucleotide at that position in the string. (If you've forgotten how substr works, refer to Appendix B or other Perl documentation. If you're like me, you probably have to do that a lot, especially to get the order of the arguments right.)

There's a slightly different style used here for declaring variables. Whereas you've been declaring them at the beginning of a program, here you're declaring each variable the first time it's used. There are pros and cons for each programming style. Having all the variables at the top of the program gives good organization and can help in reading; declaring them on-the-fly can seem like a more natural way to write. The choice is yours.

Also, notice how this subroutine is mostly built from other subroutines, with a little bit added. That has a lot to do with its readability. At this point, you may be thinking that you've actually decomposed the problem pretty well, and the pieces are fairly easy to build and, in the end, they fit together well. But do they?

7.3.2 Improving the Design

You're about to pat yourself on the back for writing the program so quickly, but you notice something. You keep having to declare that pesky @nucleotides array and then pass it in to the randomnucleotide subroutine. But the only place you use the array is inside the randomnucleotide subroutine. So why not change your design a little? Here's a new try:

# randomnucleotide
#
# A subroutine to randomly select a nucleotide
#
# WARNING: make sure you call srand to seed the
#  random number generator before you call this function.

sub randomnucleotide {
    my(@nucs) = ('A', 'C', 'G', 'T');

    # scalar returns the size of an array. 
    # The elements of the array are numbered 0 to size-1
    return $nucs[rand @nucs];
}

Notice that this function now has no arguments. It's called like so:

$randomnucleotide = randomnucleotide(  );

It's asking for a random element from a very specific set. Of course, you're always thinking, and you say, "It'd be handy to have a subroutine that randomly selects an element from any array. I might not need it right now, but I bet I'll need it soon!" So you define two subroutines instead of one:

# randomnucleotide
#
# A subroutine to randomly select a nucleotide
#
# WARNING: make sure you call srand to seed the
#  random number generator before you call this function.

sub randomnucleotide {
    my(@nucleotides) = ('A', 'C', 'G', 'T');

    # scalar returns the size of an array. 
    # The elements of the array are numbered 0 to size-1
    return randomelement(@nucleotides);
}

# randomelement
#
# A subroutine to randomly select an element from an array
#
# WARNING: make sure you call srand to seed the
#  random number generator before you call this function.

sub randomelement {

    my(@array) = @_;

    return $array[rand @array];
}

Look back and notice that you didn't have to change your subroutine mutate; just the internal workings of randomnucleotide changed, not its behavior.

7.3.3 Combining the Subroutines to Simulate Mutation

Now you've got all your ducks in place, so you write your main program as in Example 7-2 and see if your new subroutine works.

Example 7-2. Mutate DNA
#!/usr/bin/perl
# Mutate DNA
#  using a random number generator to randomly select bases to mutate

use strict;
use warnings;

# Declare the variables

# The DNA is chosen to make it easy to see mutations:
my $DNA = 'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA';

# $i is a common name for a counter variable, short for "integer"
my $i;

my $mutant;

# Seed the random number generator.
# time|$$ combines the current time with the current process id
srand(time|$$);

# Let's test it, shall we?
$mutant = mutate($DNA);

print "\nMutate DNA\n\n";

print "\nHere is the original DNA:\n\n";
print "$DNA\n";

print "\nHere is the mutant DNA:\n\n";
print "$mutant\n";

# Let's put it in a loop and watch that bad boy accumulate mutations:
print "\nHere are 10 more successive mutations:\n\n";

for ($i=0 ; $i < 10 ; ++$i) {
    $mutant = mutate($mutant);
    print "$mutant\n";
}

exit;
################################################################################
# Subroutines for Example 7-2
################################################################################

#  Notice, now that we have a fair number of subroutines, we
#  list them alphabetically

# A subroutine to perform a mutation in a string of DNA
#
# WARNING: make sure you call srand to seed the
#  random number generator before you call this function.

sub mutate {

    my($dna) = @_;

    my(@nucleotides) = ('A', 'C', 'G', 'T');

    # Pick a random position in the DNA
    my($position) = randomposition($dna);

    # Pick a random nucleotide
    my($newbase) = randomnucleotide(@nucleotides);

    # Insert the random nucleotide into the random position in the DNA
    # The substr arguments mean the following:
    #  In the string $dna at position $position change 1 character to
    #  the string in $newbase
    substr($dna,$position,1,$newbase);

    return $dna;
}

# A subroutine to randomly select an element from an array
#
# WARNING: make sure you call srand to seed the
#  random number generator before you call this function.

sub randomelement {

    my(@array) = @_;

    return $array[rand @array];
}

# randomnucleotide
#
# A subroutine to select at random one of the four nucleotides
#
# WARNING: make sure you call srand to seed the
#  random number generator before you call this function.

sub randomnucleotide {

    my(@nucleotides) = ('A', 'C', 'G', 'T');

    # scalar returns the size of an array. 
    # The elements of the array are numbered 0 to size-1
    return randomelement(@nucleotides);
}

# randomposition
#
# A subroutine to randomly select a position in a string.
#
# WARNING: make sure you call srand to seed the
#  random number generator before you call this function.

sub randomposition {

    my($string) = @_;

    # Notice the "nested" arguments:
    #
    # $string is the argument to length
    # length($string) is the argument to rand
    # rand(length($string))) is the argument to int
    # int(rand(length($string))) is the argument to return
    # But we write it without parentheses, as permitted.
    #
    # rand returns a decimal number between 0 and its argument.
    # int returns the integer portion of a decimal number.
    #
    # The whole expression returns a random number between 0 and length-1,
    #  which is how the positions in a string are numbered in Perl.
    #

    return int rand length $string;
}

Here's some typical output from Example 7-2:

Mutate DNA

Here is the original DNA:

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

Here is the mutant DNA:

AAAAAAAAAAAAAAAAAAAAGAAAAAAAAA

Here are 10 more successive mutations:

AAAAAAAAAAAAAAAAAAAAGACAAAAAAA
AAAAAAAAAAAAAAAAAAAAGACAAAAAAA
AAAAAAAAAAAAAAAAAAAAGACAAAAAAA
AAAAAAAAAAAAAACAAAAAGACAAAAAAA
AAAAAAAAAAAAAACAACAAGACAAAAAAA
AAAAAAAAAAAAAACAACAAGACAAAAAAA
AAAAAAAAAGAAAACAACAAGACAAAAAAA
AAAAAATAAGAAAACAACAAGACAAAAAAA
AAAAAATAAGAAAACAACAAGACAAAAAAA
AAAAAATTAGAAAACAACAAGACAAAAAAA

Example 7-2 was something of a programming challenge, but you end up with the satisfaction of seeing your (simulated) DNA mutate. How about writing a graphical display for this, so that every time a base gets mutated, it makes a little explosion and the color gets highlighted, so you can watch it happening in real-time?

Before you scoff, you should know how important good graphical displays are for the success of most programs. This may be a trivial-sounding graphic, but if you can demonstrate the most common mutations in, for instance, the BRCA breast cancer genes in this way, it might be useful.

7.3.4 A Bug in Your Program?

To return to the business at hand, you may have noticed something when you looked over the output from Example 7-2. Look at the first two lines of the "10 more successive mutations." They are exactly the same! Could it be that after patting yourself on the back and telling yourself what a good bit of work you'd done, you've discovered a bug?

How can you track it down? You may want to step through the running of the program with the Perl debugger, which you saw in Chapter 6. However, this time, you stop and think about your design instead. You're replacing the bases at random positions with randomly chosen bases. Aha! Sometimes the base at the position you randomly choose is exactly the same as the base you randomly choose to plug into its place! You're replacing a base with itself on occasion![3]

[3] How often? In DNA that's all one base, it's happening 1/4 of the time. In DNA that's equally populated with the four bases, it's happening...1/4 of the time!

Let's say you decide that behavior is not useful. At each successive mutation, you need to see one base change. How can you alter your code to ensure that? Let's start with some pseudocode for the mutate subroutine:

Select a random position in the string of DNA

Repeat:

    Choose a random nucleotide

Until: random nucleotide differs from the nucleotide in the random position

Substitute the random nucleotide into the random position in the DNA

This seems like something that should work, so you alter the mutate subroutine, calling it the mutate_better subroutine:

# mutate_better
#
# Subroutine to perform a mutation in a string of DNA--version 2, in which
#  it is guaranteed that one base will change on each call
#
# WARNING: make sure you call srand to seed the
#  random number generator before you call this function.

sub mutate_better {

    my($dna) = @_;
    my(@nucleotides) = ('A', 'C', 'G', 'T');

    # Pick a random position in the DNA
    my($position) = randomposition($dna);

    # Pick a random nucleotide
    my($newbase);

    do {
        $newbase = randomnucleotide(@nucleotides);

    # Make sure it's different than the nucleotide we're mutating
    }until ( $newbase ne substr($dna, $position,1) );

    # Insert the random nucleotide into the random position in the DNA
    # The substr arguments mean the following:
    #  In the string $dna at position $position change 1 character to
    #  the string in $newbase
    substr($dna,$position,1,$newbase);

    return $dna;
}

When you plug this subroutine in place of mutate and run the code, you get the following output:

Mutate DNA

Here is the original DNA:

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

Here is the mutant DNA:

AAAAAAAAAAAAATAAAAAAAAAAAAAAAA

Here are 10 more successive mutations:

AAAAAAAAAAAAATAAAAAAAACAAAAAAA
AAAAATAAAAAAATAAAAAAAACAAAAAAA
AAATATAAAAAAATAAAAAAAACAAAAAAA
AAATATAAAAAAATAAAAAAAACAACAAAA
AATTATAAAAAAATAAAAAAAACAACAAAA
AATTATTAAAAAATAAAAAAAACAACAAAA
AATTATTAAAAAATAAAAAAAACAACACAA
AATTATTAAAAAGTAAAAAAAACAACACAA
AATTATTAAAAAGTGAAAAAAACAACACAA
AATTATTAAAAAGTGATAAAAACAACACAA

which seems to indeed make a real change on every iteration.

Notice one more thing about declaring variables. In this code for mutate_better, if you'd declared $newbase within the loop, since the loop is enclosed in a block, the variable $newbase would not then be visible outside of that loop. In particular, it wouldn't be available in the substr call that does the actual base change for the mutation. So, in mutate_better, you had to declare the variable outside of the loop.

This is a frequent source of confusion for programmers who like to declare variables on the fly and a powerful argument for getting into the habit of collecting variable definitions together at the top of the program.

Even so, there are often times when you want to hide a variable within a block, because that's the only place where you will use it. Then you may want to do the declaration in the block . (Perhaps at the top of the block, if it's a long one?)

< BACKCONTINUE >

Index terms contained in this section

arrays
     randomly selecting elements from
            subroutines for
debugging
      Mutate DNA program (example)
declaring
     variables
            beginning of program vs. first use 2nd
DNA
     mutations, investigating with randomization
            program simulating mutation
graphical user interfaces (GUIs)
      displays for programs
length function
mutate subroutine (example)
      debugging
mutate_better subroutine (example)
mutations, investigating with randomization
      program simulating mutation
            mutate DNA (example)
            random position in string, selecting
nucleotides
      choosing random, subroutine for
randomelement subroutine (example)
randomization
      program simulating mutation
            mutate DNA (example)
            random nucleotide, choosing
            random nucleotide, placing in random position
            random nucleotide, selecting
            random position in string, selecting
randomnucleotide subroutine (example)
      improving design
simulating DNA mutation
      program (mutate DNA)
     program, designing
            random nucleotide, placing into random position
            random nucleotide, selecting 2nd
            random position in string, selecting
strings
      selecting random position in
subroutines
      mutate (example)
            debugging
      mutate_better (example)
      randomelement (example)
      randomnucleotide (example)
            improving design
      randomposition (example)
substr function
variables
     declaring
            beginning of program vs. first use 2nd

© 2002, O'Reilly & Associates, Inc.