7.5
Analyzing DNA
In this final
example dealing with
randomization, you'll collect some statistics on DNA in order
to answer the question: on average, what percentage of bases are the
same between two random DNA sequences? Although some simple
mathematics can answer the question for you, the point of the program
is to show that you now have the necessary programming ability to ask
and answer questions about your DNA sequences. (If you were using
real DNA, say a collection of some particular gene as it appears in
several organisms in slightly different forms, the answer would be
somewhat more interesting. You may want to try that later.)
So let's generate a set of random DNA, all the same length,
then ask the following question about the set. What's the
average percentage of positions that are the same between pairs of
DNA sequences in this set?
As usual, let's try to sketch an idea of the program in
pseudocode:
Generate a set of random DNA sequences, all the same length
For each pair of DNA sequences
How many positions in the two sequences are identical as a fraction?
}
Report the mean of the preceding calculations as a percentage
Clearly, to write this code, you can reuse at least some of the work
you've already done. You certainly know how to generate a set
of random DNA sequences. Also, although you don't have a
subroutine that compares, position by position, the bases in two
sequences, you know how to look at the positions in DNA strings. So
that subroutine shouldn't be hard to write. In fact,
let's write some pseudocode that compares each nucleotide in
one sequence with the nucleotide in the same position in another
sequence:
assuming DNA1 is the same length as DNA2,
for each position from 1 to length(DNA)
if the character at that position is the same in DNA_1 and DNA_2
++$count
}
}
return count/length
The whole problem now seems eminently do-able. You also have to write
the code that picks each pair of sequences, collects the results, and
finally takes the mean of the results and report it as a percentage.
That can all go into the main program.
Example 7-4 gives it a try, all in one shot.
Example 7-4. Calculate average % identity between pairs of random DNA sequences
#!/usr/bin/perl
# Calculate the average percentage of positions that are the same
# between two random DNA sequences, in a set of 10 sequences.
use strict;
use warnings;
# Declare and initialize the variables
my $percent;
my @percentages;
my $result;
# An array, initialized to the empty list, to store the DNA in
my @random_DNA = ( );
# Seed the random number generator.
# time|$$ combines the current time with the current process id
srand(time|$$);
# Generate the data set of 10 DNA sequences.
@random_DNA = make_random_DNA_set( 10, 10, 10 );
# Iterate through all pairs of sequences
for (my $k = 0 ; $k < scalar @random_DNA - 1 ; ++$k) {
for (my $i = ($k + 1) ; $i < scalar @random_DNA ; ++$i) {
# Calculate and save the matching percentage
$percent = matching_percentage($random_DNA[$k], $random_DNA[$i]);
push(@percentages, $percent);
}
}
# Finally, the average result:
$result = 0;
foreach $percent (@percentages) {
$result += $percent;
}
$result = $result / scalar(@percentages);
#Turn result into a true percentage
$result = int ($result * 100);
print "In this run of the experiment, the average percentage of \n";
print "matching positions is $result%\n\n";
exit;
################################################################################
# Subroutines
################################################################################
# matching_percentage
#
# Subroutine to calculate the percentage of identical bases in two
# equal length DNA sequences
sub matching_percentage {
my($string1, $string2) = @_;
# we assume that the strings have the same length
my($length) = length($string1);
my($position);
my($count) = 0;
for ($position=0; $position < $length ; ++$position) {
if(substr($string1,$position,1) eq substr($string2,$position,1)) {
++$count;
}
}
return $count / $length;
}
# make_random_DNA_set
#
# Subroutine to make a set of random DNA
#
# Accept parameters setting the maximum and minimum length of
# each string of DNA, and the number of DNA strings to make
#
# WARNING: make sure you call srand to seed the
# random number generator before you call this function.
sub make_random_DNA_set {
# Collect arguments, declare variables
my($minimum_length, $maximum_length, $size_of_set) = @_;
# length of each DNA fragment
my $length;
# DNA fragment
my $dna;
# set of DNA fragments
my @set;
# Create set of random DNA
for (my $i = 0; $i < $size_of_set ; ++$i) {
# find a random length between min and max
$length = randomlength ($minimum_length, $maximum_length);
# make a random DNA fragment
$dna = make_random_DNA ( $length );
# add $dna fragment to @set
push( @set, $dna );
}
return @set;
}
# randomlength
#
# A subroutine that will pick a random number from
# $minlength to $maxlength, inclusive.
#
# WARNING: make sure you call srand to seed the
# random number generator before you call this function.
sub randomlength {
# Collect arguments, declare variables
my($minlength, $maxlength) = @_;
# Calculate and return a random number within the
# desired interval.
# Notice how we need to add one to make the endpoints inclusive,
# and how we first subtract, then add back, $minlength to
# get the random number in the correct interval.
return ( int(rand($maxlength - $minlength + 1)) + $minlength );
}
# make_random_DNA
#
# Make a string of random DNA of specified length.
#
# WARNING: make sure you call srand to seed the
# random number generator before you call this function.
sub make_random_DNA {
# Collect arguments, declare variables
my($length) = @_;
my $dna;
for (my $i=0 ; $i < $length ; ++$i) {
$dna .= randomnucleotide( );
}
return $dna;
}
# randomnucleotide
#
# Select at random one of the four nucleotides
#
# WARNING: make sure you call srand to seed the
# random number generator before you call this function.
sub randomnucleotide {
my(@nucleotides) = ('A', 'C', 'G', 'T');
# scalar returns the size of an array.
# The elements of the array are numbered 0 to size-1
return randomelement(@nucleotides);
}
# randomelement
#
# randomly select an element from an array
#
# WARNING: make sure you call srand to seed the
# random number generator before you call this function.
sub randomelement {
my(@array) = @_;
return $array[rand @array];
}
If the code in Example 7-4 seems somewhat repetitive
of code from previous examples, it is. In the interest of
presentation, I included the subroutine code in the program.
(You'll start using modules in Chapter 8 as
a way to avoid this repetition.)
Here's the output of Example 7-4:
In this run of the experiment, the average number of
matching positions is 0.24%
Well, that seems reasonable. You might say, it's obvious: a
quarter of the positions match, and there are four bases. But the
point isn't to verify elementary probability, it's to
show you have enough programming under your belt to write some
programs that ask and answer questions about DNA sequences.
7.5.1
Some Notes About the Code
Notice in the main program that when it calls:
@random_DNA = make_random_DNA_set( 10, 10, 10 );
you don't need to declare and initialize variables such as
$minimum_length. You can just fill in the actual
numbers when you call the subroutine. (However it's often a
good idea to put such things in variables declared at the top of the
program, where it's easy to find and change them.) Here, you
set the maximum and minimum lengths to 10 and ask for 10 sequences.
Let's restate the problem we just solved. You have to compare
all pairs of DNA, and for each pair, calculate the
percentage of positions that have the
same nucleotides. Then, you have to take the mean of these
percentages.
Here's the code that accomplishes this in the main program of
Example 7-4:
# Iterate through all pairs of sequences
for (my $k = 0 ; $k < scalar @random_DNA - 1 ; ++$k) {
for (my $i = ($k + 1) ; $i < scalar @random_DNA ; ++$i) {
# Calculate and save the matching percentage
$percent = matching_percentage($random_DNA[$k], $random_DNA[$i]);
push(@percentages, $percent);
}
}
To look at each pair, you use a nested loop. A nested
loop
is simply a loop within another loop. These are fairly common in
programming but must be handled with care. They may seem a little
complex; take some time to see how the nested loop works, because
it's common to have to select all combinations of two (or more)
elements from a set.
The nested loop involves looking at (n * (n-1)) / 2 pairs of
sequences, which is a square function of the size of the data set.
This can get very big! Try gradually increasing the size of the data
set and rerunning the program, and you'll see your compute time
increase, and more than gradually.
See how the looping works? First sequence 0 (indexed by
$K) is paired with sequences 1,2,3,...,9, in turn
(indexed by $i). Then sequence 1 is paired with
2,3,...,9, etc. Finally, 8 is paired with 9. (Recall that array
elements are numbered starting at 0, so the last element of an array
with 10 elements is numbered 9. Also recall that scalar
@random_DNA returns the number of elements in the
array.)
You might find it a worthwhile exercise to let the number of
sequences be some small value, say 3 or 4, and think through (paper
and pencil in hand) how the nested loops and the variables
$k and $i evolve during the
running of the program. Or you can use the Perl debugger to
watch how it happens.