7.4
Generating Random DNA
It's often useful to generate random data for test purposes. Random DNA
can also be used to study the organization of actual DNA from an
organism. In this section, we'll write some programs to
generate random DNA sequences.
Such random DNA sequences have proved useful in several ways. For
instance, the popular BLAST program (see Chapter 12) depends on the properties of random DNA for
the analytic and empirical results that underpin the sequence
similarity scores, statistics that are used to rank the
"hits" that BLAST returns to the user.
Let's assume what's needed is a set of random DNA
fragments of varying length. Your program will have to specify a
maximum and a minimum length, as well as how many fragments to
generate.
7.4.1
Bottom-up Versus Top-down
In Example 7-2, you wrote the basic
subroutines, then a subroutine that called
the basic subroutines, and finally the main program. If you ignore
the pseudocode, this is an example of bottom-up
design; start with the building blocks, then assemble them
into a larger structure.
Now let's see what it's like to start with the main
program, with its subroutine calls, and write the subroutines after
you find a need for them. This is called top-down
design.
7.4.2
Subroutines for Generating a Set of Random DNA
Given our goal of generating random DNA, perhaps
what you want is a data-generating
subroutine:
@random_DNA = make_random_DNA_set( $minimum_length, $maximum_length, $size_of_set );
This looks okay, but of course, it begs the question of how to
actually accomplish the overall task. (That's top-down design
for you!) So you need to move down and write pseudocode for the
make_random_DNA_set subroutine:
repeat $size_of_set times:
$length = random number between minimum and maximum length
$dna = make_random_DNA ( $length );
add $dna to @set
}
return @set
Now, continuing the top-down design, you need some pseudocode for the
make_random_DNA subroutine:
from 1 to $size
$base = randomnucleotide
$dna .= $base
}
return $dna
Don't go any further: you've already got a
randomnucleotide
subroutine from Example 7-2.
(Are you bothered by the absence of balanced curly braces in the
pseudocode? Here, you're relying on indentation and lining up
the right braces to indicate the blocks. Since it's pseudocode,
anything is allowed as long as it works.)
7.4.3
Turning the Design into Code
Now that we've got a top-down design, how to proceed with
the coding? Let's follow the top-down design, just to see how
it works.
Example 7-3 starts with the main program and
proceeds, following the order of the top-down design you did in
pseudocode, then followed by the subroutines.
Example 7-3. Generate random DNA
#!/usr/bin/perl
# Generate random DNA
# using a random number generator to randomly select bases
use strict;
use warnings;
# Declare and initialize the variables
my $size_of_set = 12;
my $maximum_length = 30;
my $minimum_length = 15;
# An array, initialized to the empty list, to store the DNA in
my @random_DNA = ( );
# Seed the random number generator.
# time|$$ combines the current time with the current process id
srand(time|$$);
# And here's the subroutine call to do the real work
@random_DNA = make_random_DNA_set( $minimum_length, $maximum_length, $size_of_set );
# Print the results, one per line
print "Here is an array of $size_of_set randomly generated DNA sequences\n";
print " with lengths between $minimum_length and $maximum_length:\n\n";
foreach my $dna (@random_DNA) {
print "$dna\n";
}
print "\n";
exit;
################################################################################
# Subroutines
################################################################################
# make_random_DNA_set
#
# Make a set of random DNA
#
# Accept parameters setting the maximum and minimum length of
# each string of DNA, and the number of DNA strings to make
#
# WARNING: make sure you call srand to seed the
# random number generator before you call this function.
sub make_random_DNA_set {
# Collect arguments, declare variables
my($minimum_length, $maximum_length, $size_of_set) = @_;
# length of each DNA fragment
my $length;
# DNA fragment
my $dna;
# set of DNA fragments
my @set;
# Create set of random DNA
for (my $i = 0; $i < $size_of_set ; ++$i) {
# find a random length between min and max
$length = randomlength ($minimum_length, $maximum_length);
# make a random DNA fragment
$dna = make_random_DNA ( $length );
# add $dna fragment to @set
push( @set, $dna );
}
return @set;
}
# Notice that we've just discovered a new subroutine that's
# needed: randomlength, which will return a random
# number between (or including) the min and max values.
# Let's write that first, then do make_random_DNA
# randomlength
#
# A subroutine that will pick a random number from
# $minlength to $maxlength, inclusive.
#
# WARNING: make sure you call srand to seed the
# random number generator before you call this function.
sub randomlength {
# Collect arguments, declare variables
my($minlength, $maxlength) = @_;
# Calculate and return a random number within the
# desired interval.
# Notice how we need to add one to make the endpoints inclusive,
# and how we first subtract, then add back, $minlength to
# get the random number in the correct interval.
return ( int(rand($maxlength - $minlength + 1)) + $minlength );
}
# make_random_DNA
#
# Make a string of random DNA of specified length.
#
# WARNING: make sure you call srand to seed the
# random number generator before you call this function.
sub make_random_DNA {
# Collect arguments, declare variables
my($length) = @_;
my $dna;
for (my $i=0 ; $i < $length ; ++$i) {
$dna .= randomnucleotide( );
}
return $dna;
}
# We also need to include the previous subroutine
# randomnucleotide.
# Here it is again for completeness.
# randomnucleotide
#
# Select at random one of the four nucleotides
#
# WARNING: make sure you call srand to seed the
# random number generator before you call this function.
sub randomnucleotide {
my(@nucleotides) = ('A', 'C', 'G', 'T');
# scalar returns the size of an array.
# The elements of the array are numbered 0 to size-1
return randomelement(@nucleotides);
}
# randomelement
#
# randomly select an element from an array
#
# WARNING: make sure you call srand to seed the
# random number generator before you call this function.
sub randomelement {
my(@array) = @_;
return $array[rand @array];
}
Here's the output from Example 7-3:
Here is an array of 12 randomly generated DNA sequences
with lengths between 15 and 30:
TACGCTTGTGTTTTCGGGGGAC
GGGGTGTGGTAAGGCTGTCTCAGATGTGC
TGAACGACAACCTCCTGGACTTTACT
ATCTATGCTTTGCCATGCTAGT
CCGCTCATTCCTCTTCCTCGGC
TGTACCCCTAATACACTTTAGCCGAATTTA
ATAGGTCGGGGCGACAGCGCCGG
GATTGACCTCTGTAA
AAAATCTCTAGGATCGAGC
GTATGTGCTTGGGTAAAT
ATGGAGTTGCGAGGAAGTAGCTGAGT
GGCCCATGACCAGCATCCAGACAGCA