5.5 Exploding Strings into Arrays

Let's say you decide to explode the string of DNA into an array. By explode I mean separating out each letter in the string—sort of like blowing the string into bits. In other words, the letters representing the bases of the DNA in the string are separated, and each letter becomes its own scalar value in an array. Then you can look at the array elements (each of which is a single character) one by one, making the count as you go along. This is the inverse of the join function in Section 5.3.2, which takes an array of strings and makes a single scalar value out of them. (After exploding a string into an array, you could then join the array back into an identical string using join, if you so desire.)

I'm also adding to this version of the pseudocode the instructions to get the DNA from a file and manipulate that file data until it's a single string of DNA sequence. So first, you join the data from the array of lines of the original file data, clean it up by removing whitespace until only sequence is left, and then explode it back into an array. But, of course, the point is that the last array has exactly what is needed, the data in a convenient form to use in the counting loop. Instead of an array of lines, with newlines and possibly other unwanted characters, there's an exact array of the individual bases.

read in the DNA from a file

join the lines of the file into a single string $DNA

remove whitespace from $DNA

# make an array out of the bases of $DNA
@DNA = explode $DNA

# initialize the counts
count_of_A = 0
count_of_C = 0
count_of_G = 0
count_of_T = 0

for each base in @DNA

    if base is A
        count_of_A = count_of_A + 1
    if base is C
        count_of_C = count_of_C + 1
    if base is G
        count_of_G = count_of_G + 1
    if base is T
        count_of_T = count_of_T + 1
done

print count_of_A, count_of_C, count_of_G, count_of_T

As promised, this version of the pseudocode is a bit more detailed. It suggests a method to look at each of the bases by exploding the string of DNA into an array of single characters. It also initializes the counts to zero to ensure they start off right. It's easier to see what's happening if you spell out the initialization in the program, and it can prevent certain kinds of errors from creeping into your code. (It's not a rule, however; sometimes, you may prefer to leave the values of variables undefined until they are used.) Perl assumes that an uninitialized variable has the value 0 if you try to use it as a number, for instance by adding another number to it. But if warnings are turned on, as they are with the -w flag in this book's programs, you'll most likely get a warning when that happens.
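To see that behavior for yourself, here is a minimal sketch. It assumes use warnings in place of the -w flag, and the variable $warnings and the warning handler are only scaffolding for this illustration; they are not part of the chapter's programs.

```perl
#!/usr/bin/perl
# An uninitialized count is treated as 0 in arithmetic, but Perl warns about it

use warnings;    # assumption: used here in place of the -w flag

# Collect any warnings in a string so they can be printed in order
# ($warnings and the handler below exist only for this illustration)
$warnings = '';
$SIG{__WARN__} = sub { $warnings = $warnings . $_[0] };

# $count_of_A has not been initialized, so the right-hand side is 0 + 1
$count_of_A = $count_of_A + 1;

print "count_of_A = $count_of_A\n";
print "Perl warned: $warnings" if $warnings ne '';
```

The count comes out as 1, and the warning text names the uninitialized value. Initializing $count_of_A to 0 first makes the warning go away.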

Now that we have a design for the program, let's turn it into Perl code. Example 5-4 is a workable program; you'll see other ways to accomplish the same task more quickly as you proceed in this chapter, but speed is not the main concern at this point.

Example 5-4. Determining frequency of nucleotides
#!/usr/bin/perl -w
# Determining frequency of nucleotides

# Get the name of the file with the DNA sequence data
print "Please type the filename of the DNA sequence data: ";

$dna_filename = <STDIN>;

# Remove the newline from the DNA filename
chomp $dna_filename;

# open the file, or exit
unless ( open(DNAFILE, $dna_filename) ) {

    print "Cannot open file \"$dna_filename\"\n\n";
    exit;
}

# Read the DNA sequence data from the file, and store it
# into the array variable @DNA
@DNA = <DNAFILE>;

# Close the file
close DNAFILE;

# From the lines of the DNA file,
# put the DNA sequence data into a single string.
$DNA = join( '', @DNA);

# Remove whitespace
$DNA =~ s/\s//g;

# Now explode the DNA into an array where each letter of the
# original string is now an element in the array.
# This will make it easy to look at each position.
# Notice that we're reusing the variable @DNA for this purpose.
@DNA = split( '', $DNA );

# Initialize the counts.
# Notice that we can use scalar variables to hold numbers.
$count_of_A = 0;
$count_of_C = 0;
$count_of_G = 0;
$count_of_T = 0;
$errors     = 0;

# In a loop, look at each base in turn, determine which of the
# four types of nucleotides it is, and increment the
# appropriate count.
foreach $base (@DNA) {

    if     ( $base eq 'A' ) {
        ++$count_of_A;
    } elsif ( $base eq 'C' ) {
        ++$count_of_C;
    } elsif ( $base eq 'G' ) {
        ++$count_of_G;
    } elsif ( $base eq 'T' ) {
        ++$count_of_T;
    } else {
        print "!!!!!!!! Error - I don\'t recognize this base: $base\n";
        ++$errors;
    }
}

# print the results
print "A = $count_of_A\n";
print "C = $count_of_C\n";
print "G = $count_of_G\n";
print "T = $count_of_T\n";
print "errors = $errors\n";

# exit the program
exit;

To demonstrate Example 5-4, I have created the following small file of DNA and called it small.dna:

AAAAAAAAAAAAAAGGGGGGGTTTTCCCCCCCC
CCCCCGTCGTAGTAAAGTATGCAGTAGCVG
CCCCCCCCCCGGGGGGGGAAAAAAAAAAAAAAATTTTTTAT
AAACG

The file small.dna can be typed into your computer using your favorite text editor, or you can download it from this book's web site.

Notice that there is a V in the file, an error.[4] Here is the output of Example 5-4:

[4] Files of DNA sequence data sometimes include such characters as N, meaning "some undetermined base," or other special characters. You sometimes have to look at the documentation for the source, say an ABI sequencer or a GenBank file or whatever, to discover which characters are used and what they mean.

Please type the filename of the DNA sequence data: small.dna
!!!!!!!! Error - I don't recognize this base: V
A = 40
C = 27
G = 24
T = 17
errors = 1

Now let's look at the new stuff in this program. Opening and reading the sequence data is the same as in previous programs. The first new thing is at this line:

@DNA = split( '', $DNA);

which the comments say will explode the string $DNA into an array of single characters @DNA.

split is the companion to join, and it's a good idea to take a little while to look over the documentation for these two commands. Calling split with an empty string as the first argument causes the string to explode into individual characters; that's just what we want.[5]

[5] As you'll see in the documentation for the split function, the first argument can be any regular expression, such as /\s+/ (one or more adjacent whitespace characters).
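The following short sketch shows split and join working as companions, and also tries the /\s+/ pattern from the footnote. The sample strings and the variable names @bases, $rejoined, $line, and @words are made up for this illustration.

```perl
#!/usr/bin/perl -w
# split and join as companions

$DNA = 'ACGT';

# An empty string as the first argument explodes $DNA into single characters
@bases = split( '', $DNA );
print "Exploded: @bases\n";

# join with an empty separator puts the string back together
$rejoined = join( '', @bases );
print "Rejoined: $rejoined\n";

# A regular expression as the first argument splits on whatever it matches,
# here one or more adjacent whitespace characters
$line  = "ACGT   GGCC\tTTAA";
@words = split( /\s+/, $line );
print "Pieces: ", join( ', ', @words ), "\n";
```

The rejoined string is identical to the original, and the whitespace pattern yields the three pieces ACGT, GGCC, and TTAA.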

Next, there are five scalar variables initialized to 0, the variables $count_of_A and so forth. Initializing means assigning an initial value, in this case, the value 0.

Example 5-4 illustrates the concepts of type and initialization. The type of a variable determines what kind of data it can hold, for instance, strings or numbers. Up to now we've been using scalar variables such as $DNA to store strings of letters such as A, C, G, and T. Example 5-4 shows that you can also use scalar variables to store numbers. For example, the variable $count_of_A keeps a running count of the character A.

Scalar variables can store integers (0, 1, -1, 2, -2, ...), decimal or floating-point numbers such as 6.544, and numbers in scientific notation such as 6.544E6, which translates as 6.544 x 10^6, or 6,544,000. (See Appendix B for more details on types of numbers.)
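As a quick illustration, the sketch below stores one number of each kind in a scalar variable and prints it. The variable names are invented for this example.

```perl
#!/usr/bin/perl -w
# Scalar variables holding different kinds of numbers

$integer    = -2;
$decimal    = 6.544;
$scientific = 6.544E6;    # 6.544 times 10 to the 6th power

print "integer    = $integer\n";
print "decimal    = $decimal\n";
print "scientific = $scientific\n";    # prints scientific = 6544000
```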

In Example 5-4, the variables $count_of_A through $count_of_T are initialized to 0. Initializing a variable means giving it a value after it's declared. If you don't initialize your variables, they assume the value of 'undef'. In Perl, an undefined variable is 0 if it is asked for in numerical context; it's an empty string if used in a string operation. Although Perl programmers often choose not to initialize variables, it's a critical step in many other languages. In C for instance, uninitialized variables have unpredictable values. This can wreak havoc with your output. You should get in the habit of initializing variables; it makes the program easier to read and maintain, and that's important.
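Here is a small sketch of both contexts. It deliberately leaves off the -w flag so that Perl stays quiet about the uninitialized variables; $unset_number and $unset_string are hypothetical names that are never assigned a value.

```perl
#!/usr/bin/perl
# Undefined variables in numeric and string contexts
# (no -w here, so Perl does not warn about them)

# Numeric context: the undefined value acts as 0
$sum = $unset_number + 10;

# String context: the undefined value acts as the empty string
$text = 'DNA' . $unset_string . 'RNA';

print "sum  = $sum\n";     # prints sum  = 10
print "text = $text\n";    # prints text = DNARNA
```

With warnings turned on, the same program still produces these results, but Perl also complains about each use of an uninitialized value.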

To declare a variable means to specify its name and other attributes, such as an initial value and a scope (for scoping, see Chapter 6 and the discussion of my variables). Many languages require you to declare all variables before using them. For this book, up to now, declarations have been an unnecessary complication; the next chapter begins to require them. Many languages also require you to declare the type of a variable, for example "integer" or "string," but Perl does not.

Perl is written to be smart about what's in a scalar variable. For instance, you can assign the number 1234 (without quotes) to a variable, or you can assign the string '1234' (with quotes). Perl treats the variable as a string for printing, and as a number for using in arithmetic operations, without your having to worry about it. In other words, Perl isn't strict about specifying the type of data a variable is used for. Example 5-5 demonstrates this ability.

Example 5-5. Demonstration of Perl's built-in knowledge about numbers and strings
#!/usr/bin/perl -w
# Demonstration of Perl's built-in knowledge about numbers and strings

$num = 1234;

$str = '1234';

# print the variables
print $num, " ", $str, "\n";

# add the variables as numbers
$num_or_str = $num + $str;

print $num_or_str, "\n";

# concatenate the variables as strings
$num_or_str = $num . $str;

print $num_or_str, "\n";

exit;

Example 5-5 produces the output:

1234 1234
2468
12341234

Example 5-5 illustrates the smart way Perl determines the datatype of a scalar variable, whether it's a string or a number, and whether you're trying to add or subtract it like a number or concatenate it like a string. Perl behaves accordingly, which makes your job as a programmer a little bit easier; Perl "does the right thing" for you.

Next is a new kind of loop, the foreach loop. This loop works over the elements of an array. The line:

foreach $base (@DNA) {

loops over the elements of the array @DNA, and each time through the loop, the scalar variable $base (or whatever name you choose) is set to the next element of the array.
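A stripped-down sketch of the same kind of loop, using a four-element array, shows $base taking each value in order. The variables $position and $seen are invented here just to record what the loop does.

```perl
#!/usr/bin/perl -w
# Visit each element of an array in order with foreach

@DNA = ( 'A', 'C', 'G', 'T' );

$position = 0;
$seen     = '';

foreach $base (@DNA) {
    # $base holds the current element on each pass through the loop
    $position = $position + 1;
    $seen     = $seen . $base;
    print "Element $position is $base\n";
}

print "Visited $position elements in the order $seen\n";
```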

The body of the loop checks for each base and increments the count for that base if found. There are four ways to add 1 to a number in Perl. Here, you put a ++ in front of the variable, like this:

++$count; 

You can also put the ++ after the variable:

$count++;

You can spell it out like this, a combination of adding and assignment:

$count = $count + 1;

or, as a shorthand of that, you can say:

$count += 1;

Almost an embarrassment of riches. The plus-plus (++) notation is convenient for incrementing counts, as we're doing here. The plus-equals (+=) notation saves some typing and is very popular for adding other numbers besides 1.
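The four forms can be tried side by side in a short sketch like this one, which applies each of them once to the same variable:

```perl
#!/usr/bin/perl -w
# Four ways to add 1 to a count

$count = 0;

++$count;               # autoincrement in front of the variable
$count++;               # autoincrement after the variable
$count = $count + 1;    # addition combined with assignment
$count += 1;            # shorthand for the line above

print "count = $count\n";    # prints count = 4

# += also adds numbers other than 1
$count += 10;
print "count = $count\n";    # prints count = 14
```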

The foreach loop in Example 5-4 could have been written like this:

foreach (@DNA) {

    if     ( /A/ ) {
        ++$count_of_A;
    } elsif ( /C/ ) {
        ++$count_of_C;
    } elsif ( /G/ ) {
        ++$count_of_G;
    } elsif ( /T/ ) {
        ++$count_of_T;
    } else {
        print "!!!!!!!! Error - I don\'t recognize this base: ";
        print;
        print "\n";
        ++$errors;
    }
}

This version of the foreach loop:

foreach (@DNA) {

doesn't name a scalar variable. In a foreach loop, if you don't specify a scalar variable to hold the scalars that are being read from the array ($base served that function in the version of this loop in Example 5-4), Perl uses the special variable $_.

Furthermore, many Perl built-in functions operate on this special variable if no argument is provided to them. Here, the conditional tests are simply patterns; Perl assumes you're doing a pattern match on the $_ variable, so it behaves as if you had said $_ =~ /A/, for instance. Finally, in the error message, the statement print; prints the value of the $_ variable.

This special variable $_ that doesn't have to be named appears in many Perl programs, although I don't use it extensively in this book.
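For a compact look at $_ in action, here is a sketch that uses it both implicitly, as the target of a pattern match, and by name. The short sequence and the variable $others are made up for this illustration. Note that a pattern such as /A/ matches any string that contains an A, so it works as a test for a single base only because each array element here is one character long.

```perl
#!/usr/bin/perl -w
# $_ as the default loop variable and the default target of a pattern match

@DNA = split( '', 'ACGAAV' );

$count_of_A = 0;
$others     = '';

foreach (@DNA) {
    if ( /A/ ) {                  # behaves like: $_ =~ /A/
        ++$count_of_A;
    } else {
        $others = $others . $_;   # $_ can also be used by name
    }
}

print "A = $count_of_A\n";      # prints A = 3
print "Not A: $others\n";       # prints Not A: CGV
```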

© 2002, O'Reilly & Associates, Inc.