5.4 Counting Nucleotides
There are many things you might want to know about a piece of DNA. Is
it coding or noncoding?[3] Does it contain a regulatory element? Is it
related to some other known DNA, and if so, how? How many of each of
the four nucleotides does the DNA contain? In fact, in some species the
coding regions have a specific nucleotide bias, so this last question
can be important in finding the genes. Also, different species have
different patterns of nucleotide usage. So counting nucleotides can be
interesting and useful.
The following sections present two programs, Examples 5-4 and 5-6,
that count each type of nucleotide in some DNA. They introduce a few
new parts of Perl.
To get the count of each type of nucleotide in some DNA, you have to
look at each base, see what it is, and then keep four counts, one for
each nucleotide. We'll do this in two ways:
- Explode the DNA into an array of single bases, and iterate over the
  array (that is, deal with the elements of the array one by one)
- Use the substr Perl function to iterate over the positions in the
  string of DNA while counting
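Before any counting is added, the two ways of walking through the DNA can be sketched side by side. This is only an illustrative sketch, not one of the book's numbered examples: the short sequence and the variable names ($dna, @from_array, @from_substr) are made up for the purpose. It uses split with an empty pattern to explode the string, and substr with a length of 1 to pick out one base at a time.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A short, made-up DNA string used only for illustration
my $dna = 'ACGTTA';

# Approach 1: explode the string into an array of single bases,
# then deal with the elements of the array one by one
my @bases = split //, $dna;
my @from_array;
foreach my $base (@bases) {
    push @from_array, $base;
}

# Approach 2: leave the string alone and visit each position in it,
# using substr to extract the single base at that position
my @from_substr;
for (my $position = 0; $position < length($dna); ++$position) {
    push @from_substr, substr($dna, $position, 1);
}

# Both approaches see the same bases in the same order
print "array:  @from_array\n";
print "substr: @from_substr\n";
```

Both loops visit the same bases in the same order. The array version makes a copy of the data as a list of one-character strings, while the substr version works directly on the original string.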
First, let's start with some pseudocode of the task. Afterwards,
we'll make more detailed pseudocode, and finally write the Perl
program for both approaches.
The following pseudocode describes generally what is needed:
for each base in the DNA

    if base is A
        count_of_A = count_of_A + 1
    if base is C
        count_of_C = count_of_C + 1
    if base is G
        count_of_G = count_of_G + 1
    if base is T
        count_of_T = count_of_T + 1
done

print count_of_A, count_of_C, count_of_G, count_of_T
As you can see, this is a pretty simple idea, mirroring what
you'd do by hand if you had to. (If you want to count the
relative frequencies of the bases in all human genes, you can't
do it by hand—there are too many of them—and you have to
use such a program. Thus bioinformatics.) Now let's see how it
can be coded in Perl.
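As a first rough sketch, and not the book's own Example 5-4 or 5-6, the general pseudocode above might translate to Perl along these lines. The sample sequence is a made-up placeholder, and the counter names are simply borrowed from the pseudocode. The sketch takes the "explode into an array" route and uses an if/elsif chain instead of four separate if tests, which comes to the same thing here because a base can match only one of the four letters.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample sequence; a real program would read DNA from a file
my $dna = 'AACGTTTG';

# One counter per nucleotide, all starting at zero
my ($count_of_A, $count_of_C, $count_of_G, $count_of_T) = (0, 0, 0, 0);

# Look at each base in turn and increment the matching counter
foreach my $base (split //, $dna) {
    if    ($base eq 'A') { ++$count_of_A }
    elsif ($base eq 'C') { ++$count_of_C }
    elsif ($base eq 'G') { ++$count_of_G }
    elsif ($base eq 'T') { ++$count_of_T }
    # Anything else (lowercase letters, N, whitespace) is silently skipped
}

print "A = $count_of_A\n";
print "C = $count_of_C\n";
print "G = $count_of_G\n";
print "T = $count_of_T\n";
```

As written, the sketch counts only uppercase A, C, G, and T. Characters outside that set are skipped without comment, which a fuller program would probably want to report.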