< BACKCONTINUE >

7.4 Generating Random DNA

It's often useful to generate random data for test purposes. Random DNA can also be used to study the organization of actual DNA from an organism. In this section, we'll write some programs to generate random DNA sequences.

Such random DNA sequences have proved useful in several ways. For instance, the popular BLAST program (see Chapter 12) depends on the properties of random DNA for the analytic and empirical results that underpin the sequence similarity scores, statistics that are used to rank the "hits" that BLAST returns to the user.

Let's assume what's needed is a set of random DNA fragments of varying length. Your program will have to specify a maximum and a minimum length, as well as how many fragments to generate.

7.4.1 Bottom-up Versus Top-down

In Example 7-2, you wrote the basic subroutines, then a subroutine that called the basic subroutines, and finally the main program. If you ignore the pseudocode, this is an example of bottom-up design; start with the building blocks, then assemble them into a larger structure.

Now let's see what it's like to start with the main program, with its subroutine calls, and write the subroutines after you find a need for them. This is called top-down design.

7.4.2 Subroutines for Generating a Set of Random DNA

Given our goal of generating random DNA, perhaps what you want is a data-generating subroutine:

@random_DNA = make_random_DNA_set( $minimum_length, $maximum_length, $size_of_set );

This looks okay, but of course, it begs the question of how to actually accomplish the overall task. (That's top-down design for you!) So you need to move down and write pseudocode for the make_random_DNA_set subroutine:

repeat $size_of_set times:

    $length = random number between minimum and maximum length

    $dna = make_random_DNA ( $length );

    add $dna to @set
}

return @set

Now, continuing the top-down design, you need some pseudocode for the make_random_DNA subroutine:

from 1 to $size

    $base = randomnucleotide

    $dna .= $base
}

return $dna

Don't go any further: you've already got a randomnucleotide subroutine from Example 7-2.

(Are you bothered by the absence of balanced curly braces in the pseudocode? Here, you're relying on indentation and lining up the right braces to indicate the blocks. Since it's pseudocode, anything is allowed as long as it works.)

7.4.3 Turning the Design into Code

Now that we've got a top-down design, how to proceed with the coding? Let's follow the top-down design, just to see how it works.

Example 7-3 starts with the main program and proceeds, following the order of the top-down design you did in pseudocode, then followed by the subroutines.

Example 7-3. Generate random DNA
#!/usr/bin/perl
# Generate random DNA
#  using a random number generator to randomly select bases

use strict;
use warnings;

# Declare and initialize the variables
my $size_of_set = 12;
my $maximum_length = 30;
my $minimum_length = 15;

# An array, initialized to the empty list, to store the DNA in
my @random_DNA = (  );

# Seed the random number generator.
# time|$$ combines the current time with the current process id
srand(time|$$);

# And here's the subroutine call to do the real work
@random_DNA = make_random_DNA_set( $minimum_length, $maximum_length, $size_of_set );

# Print the results, one per line
print "Here is an array of $size_of_set randomly generated DNA sequences\n";
print "  with lengths between $minimum_length and $maximum_length:\n\n";

foreach my $dna (@random_DNA) {

    print "$dna\n";
}

print "\n";

exit;

################################################################################
# Subroutines
################################################################################

# make_random_DNA_set
#
# Make a set of random DNA
#
#   Accept parameters setting the maximum and minimum length of
#     each string of DNA, and the number of DNA strings to make
#
# WARNING: make sure you call srand to seed the
#  random number generator before you call this function.

sub make_random_DNA_set {

    # Collect arguments, declare variables
    my($minimum_length, $maximum_length, $size_of_set) = @_;

    # length of each DNA fragment
    my $length;
    
    # DNA fragment
    my $dna;

    # set of DNA fragments
    my @set;

    # Create set of random DNA
    for (my $i = 0; $i < $size_of_set ; ++$i) {

        # find a random length between min and max
        $length = randomlength ($minimum_length, $maximum_length);

        # make a random DNA fragment
        $dna = make_random_DNA ( $length );

        # add $dna fragment to @set
        push( @set, $dna );
    }

    return @set;
}

# Notice that we've just discovered a new subroutine that's
# needed: randomlength, which will return a random
# number between (or including) the min and max values.
# Let's write that first, then do make_random_DNA

# randomlength
#
# A subroutine that will pick a random number from
# $minlength to $maxlength, inclusive.
#
# WARNING: make sure you call srand to seed the
#  random number generator before you call this function.

sub randomlength {

    # Collect arguments, declare variables
    my($minlength, $maxlength) = @_;

    # Calculate and return a random number within the
    #  desired interval.
    # Notice how we need to add one to make the endpoints inclusive,
    #  and how we first subtract, then add back, $minlength to
    #  get the random number in the correct interval.
    return ( int(rand($maxlength - $minlength + 1)) + $minlength );
}

# make_random_DNA
#
# Make a string of random DNA of specified length.
#
# WARNING: make sure you call srand to seed the
#  random number generator before you call this function.

sub make_random_DNA {

    # Collect arguments, declare variables
    my($length) = @_;

    my $dna;

    for (my $i=0 ; $i < $length ; ++$i) {

        $dna .= randomnucleotide(  );
    }

    return $dna;
}

# We also need to include the previous subroutine
# randomnucleotide.
# Here it is again for completeness.

# randomnucleotide
#
# Select at random one of the four nucleotides
#
# WARNING: make sure you call srand to seed the
#  random number generator before you call this function.

sub randomnucleotide {

    my(@nucleotides) = ('A', 'C', 'G', 'T');

    # scalar returns the size of an array. 
    # The elements of the array are numbered 0 to size-1
    return randomelement(@nucleotides);
}

# randomelement
#
# randomly select an element from an array
#
# WARNING: make sure you call srand to seed the
#  random number generator before you call this function.

sub randomelement {

    my(@array) = @_;

    return $array[rand @array];
}

Here's the output from Example 7-3:

Here is an array of 12 randomly generated DNA sequences
  with lengths between 15 and 30:

TACGCTTGTGTTTTCGGGGGAC
GGGGTGTGGTAAGGCTGTCTCAGATGTGC
TGAACGACAACCTCCTGGACTTTACT
ATCTATGCTTTGCCATGCTAGT
CCGCTCATTCCTCTTCCTCGGC
TGTACCCCTAATACACTTTAGCCGAATTTA
ATAGGTCGGGGCGACAGCGCCGG
GATTGACCTCTGTAA
AAAATCTCTAGGATCGAGC
GTATGTGCTTGGGTAAAT
ATGGAGTTGCGAGGAAGTAGCTGAGT
GGCCCATGACCAGCATCCAGACAGCA
< BACKCONTINUE >

Index terms contained in this section

bottom-up program design
designing programs
      top-down vs. bottom up
DNA
     mutations, investigating with randomization
            generating random DNA
generate random DNA program (example)
make_random_DNA_set subroutine (example)
mutations, investigating with randomization
      generating random DNA
            bottom-up vs. top-down design
            generate random DNA program
            subroutines for
randomization
      generating random DNA
            bottom-up vs. top-down design
            generate random DNA program
            subroutines for
randomnucleotide subroutine (example)
subroutines
      random DNA set, generating
      randomnucleotide (example)
top-down program design

© 2002, O'Reilly & Associates, Inc.