5.4 Counting Nucleotides

There are many things you might want to know about a piece of DNA. Is it coding or noncoding?[3] Does it contain a regulatory element? Is it related to some other known DNA, and if so, how? How many of each of the four nucleotides does the DNA contain? This last question matters more than it might seem: in some species the coding regions have a specific nucleotide bias, so base counts can help in finding genes. Different species also have different patterns of nucleotide usage. So counting nucleotides can be both interesting and useful.

[3] Coding DNA is DNA that codes for a protein; that is, it is part of a gene. In many organisms, including humans, a large part of the DNA is noncoding: it doesn't code for proteins. In humans, about 98-99% of DNA is noncoding.

The following sections present two programs, Example 5-4 and Example 5-6, that count each type of nucleotide in some DNA. They introduce a few new parts of Perl:

  • "Exploding" a string

  • Looking at specific locations in strings

  • Iterating over an array

  • Iterating over the length of a string

To get the count of each type of nucleotide in some DNA, you have to look at each base, see what it is, and then keep four counts, one for each nucleotide. We'll do this in two ways:

  • Explode the DNA into an array of single bases, and iterate over the array (that is, deal with the elements of the array one by one)

  • Use the substr Perl function to iterate over the positions in the string of DNA while counting
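
As a rough sketch of the two building blocks behind these approaches (this is not the text of Example 5-4 or 5-6), the snippet below uses Perl's built-in split with an empty pattern to explode a string into single bases, and substr to read one position at a time. The variable names and the sample sequence are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample sequence, for illustration only
my $dna = 'ACGTTGCA';

# Approach 1: explode the string into an array of single bases.
# split with an empty pattern returns one element per character.
my @bases = split //, $dna;
print "Exploded: @bases\n";

# Approach 2: look at each position with substr(STRING, OFFSET, LENGTH).
# Positions start at 0 and run to length($dna) - 1.
for my $position (0 .. length($dna) - 1) {
    my $base = substr($dna, $position, 1);
    print "Position $position: $base\n";
}
```

Both loops visit the same bases in the same order. The first builds an array up front, while the second works directly on the original string.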

Let's start with some pseudocode for the task. Afterwards, we'll make more detailed pseudocode, and finally write the Perl programs for both approaches.

The following pseudocode describes generally what is needed:

for each base in the DNA
    if base is A
        count_of_A = count_of_A + 1
    if base is C
        count_of_C = count_of_C + 1
    if base is G
        count_of_G = count_of_G + 1
    if base is T
        count_of_T = count_of_T + 1
done

print count_of_A, count_of_C, count_of_G, count_of_T

As you can see, this is a pretty simple idea, mirroring what you'd do by hand if you had to. (If you want to count the relative frequencies of the bases in all human genes, you can't do it by hand; there are too many of them, so you have to use a program like this one. Hence bioinformatics.) Now let's see how it can be coded in Perl.
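
Before turning to the full examples, here is a minimal sketch of how the pseudocode above might map onto Perl. It is not the book's Example 5-4. The sample sequence is invented, the counter names simply follow the pseudocode, and the $unrecognized counter is an added assumption for characters that are not A, C, G, or T.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up sample sequence, for illustration only
my $dna = 'CGACGTCTTCTAAGGCGA';

# One counter per nucleotide, as in the pseudocode
my ($count_of_A, $count_of_C, $count_of_G, $count_of_T) = (0, 0, 0, 0);

# Assumption: also track characters that are not A, C, G, or T
my $unrecognized = 0;

# "for each base in the DNA": explode the string and walk the array
foreach my $base (split //, $dna) {
    if    ($base eq 'A') { $count_of_A++ }
    elsif ($base eq 'C') { $count_of_C++ }
    elsif ($base eq 'G') { $count_of_G++ }
    elsif ($base eq 'T') { $count_of_T++ }
    else                 { $unrecognized++ }
}

print "A=$count_of_A C=$count_of_C G=$count_of_G T=$count_of_T\n";
print "Unrecognized characters: $unrecognized\n";
```

The if/elsif chain tests each base against the four possibilities in turn, so every character lands in exactly one counter. The comparisons are case-sensitive, so lowercase bases would be counted as unrecognized in this sketch.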

© 2002, O'Reilly & Associates, Inc.