< BACKCONTINUE >

5.6 Operating on Strings

It's not necessary to explode a string into an array in order to look at each character. In fact, sometimes you'd want to avoid that. A large string takes up a large amount of memory in your computer. So does a large array. When you explode a string into an array, the original string is still there, and you also have to make a copy of each character for the elements of the new array you're creating. If you have a large string, that already uses a good portion of available memory, creating an additional array can cause you to run out of memory. When you run out of memory, your computer performs poorly; it can slow to a crawl, crash, or freeze ("hang"). These haven't been worrisome considerations up to now, but if you use large data sets (such as the human genome), you have to take these things into account.

So let's say you'd like to avoid making a copy of the DNA sequence data into another variable. Is there a way to just look at the string $DNA and count the bases from it directly? Yes. Here's some pseudocode, followed by a Perl program:

read in the DNA from a file

join the lines of the file into a single string of $DNA

# initialize the counts
count_of_A = 0
count_of_C = 0
count_of_G = 0
count_of_T = 0

for each base at each position in $DNA

    if base is A
        count_of_A = count_of_A + 1
    if base is C
        count_of_C = count_of_C + 1
    if base is G
        count_of_G = count_of_G + 1
    if base is T
        count_of_T = count_of_T + 1
done

print count_of_A, count_of_C, count_of_G, count_of_T

Example 5-6 shows a program that examines each base in a string of DNA.

Example 5-6. Determining frequency of nucleotides, take 2
#!/usr/bin/perl -w
# Determining frequency of nucleotides, take 2

# Get the DNA sequence data
print "Please type the filename of the DNA sequence data: ";

$dna_filename = <STDIN>;

chomp $dna_filename;

# Does the file exist?
unless ( -e $dna_filename) {

    print "File \"$dna_filename\" doesn\'t seem to exist!!\n";
    exit;
}

# Can we open the file?
unless ( open(DNAFILE, $dna_filename) ) {

    print "Cannot open file \"$dna_filename\"\n\n";
    exit;
}

@DNA = <DNAFILE>;

close DNAFILE;

$DNA = join( '', @DNA);

# Remove whitespace
$DNA =~ s/\s//g;

# Initialize the counts.
# Notice that we can use scalar variables to hold numbers.
$count_of_A = 0;
$count_of_C = 0;
$count_of_G = 0;
$count_of_T = 0;
$errors     = 0;

# In a loop, look at each base in turn, determine which of the
# four types of nucleotides it is, and increment the
# appropriate count.
for ( $position = 0 ; $position < length $DNA ; ++$position ) {

    $base = substr($DNA, $position, 1);

    if     ( $base eq 'A' ) {
        ++$count_of_A;
    } elsif ( $base eq 'C' ) {
        ++$count_of_C;
    } elsif ( $base eq 'G' ) {
        ++$count_of_G;
    } elsif ( $base eq 'T' ) {
        ++$count_of_T;
    } else {
        print "!!!!!!!! Error - I don\'t recognize this base: $base\n";
        ++$errors;
    }
}

# print the results
print "A = $count_of_A\n";
print "C = $count_of_C\n";
print "G = $count_of_G\n";
print "T = $count_of_T\n";
print "errors = $errors\n";

# exit the program
exit;

Here's the output of Example 5-6:

Please type the filename of the DNA sequence data: small.dna
!!!!!!!! Error - I don't recognize this vase: V
A = 40
C = 27
G = 24
T = 17
errors = 1

In Example 5-6, I added a line of code to see if the file exists:

unless ( -e $dna_filename) {

There are file test operators for several conditions; see Appendix B or Perl documentation under -X. Note that files have several attributes, such as size, permission, location in the filesystem, and type of file, and that many of these things can be tested for easily with the file test operators.

Notice, also, that I have kept the detailed comments about the regular expression, because regular expressions can be hard to read, and a little commenting here helps a reader to skim the code.

Everything else is familiar, until you hit the for loop; it requires a little explanation:

for ( $position = 0 ; $position < length $DNA ; ++$position ) {

    # the statements in the block
}

This for loop is the equivalent of this while loop:

$position = 0;

while( $position < length $DNA ) {

    # the same statements in the block, plus ...

    ++$position;
}

Take a moment and compare these two loops. You'll see the same statements but in different locations.

As you can see, the for loop brings the initialization and increment of a counter ($position) into the loop statement, whereas in the while loop, they are separate statements. In the for loop, both the initialization and the increment statement are placed between parentheses, whereas you find only the conditional test in the while loop. In the for loop, you can put initializations before the first semicolon and increment statements after the second semicolon. The initialization statement is done just once before starting the loop, and the increment statement is done at the end of each iteration through the block before going back to the conditional test. It's really just a shorthand for the equivalent while loop as just shown.

The conditional test checks to see if the position reached in the string is less than the length of the string. It uses the length Perl function. Obviously, you don't want to check characters beyond the length of the string. But a word is in order here about the numbering of positions in strings and arrays.

By default, Perl assumes that a string begins at position 0 and its last character is at a position that's numbered one less than the length of the string. Why do it this way instead of numbering the positions from 1 up to and including the length of the string? There are reasons, but they're somewhat abstruse; see the documentation for enlightenment. If it's any comfort, many other programming languages make the same choice. (However, many do it the intuitive way, starting at 1. Ah well.)

This way of numbering is important to biologists because they are used to numbering sequences beginning with 1, not with 0 the way Perl does it. You sometimes have to add 1 to a position before printing out results so they'll make sense to nonprogrammers. It's mildly annoying, but you'll get used to it.

The same holds true for numbering the elements of an array. The first element of an array is element 0; the last is element $length-1.

Anyway, you see that the conditional test evaluates to true while the value of $position is length-1 or less and fails when $position reaches the same value as the length of the string. For example, say you have a string that contains the text "seeing." This has a length of six characters. The "s" is at position 0, and the "g" is at position 5, which is one less than the string length 6.

Back in the block, you call the substr function to look into the string:

$base = substr($DNA, $position, 1);

This is a fairly general-purpose function for working with strings; you can also insert and delete things. Here, you look at just one character, so you call substr on the string $DNA, ask it to look in position $position for one character, and save the result in scalar variable $base. Then you proceed to accumulate the count as in the preceding version of the program, Example 5-4.

< BACKCONTINUE >

Index terms contained in this section

arrays
     elements
            numbering positions of
attributes, file
bases
     counting
            directly from $DNA string
counting bases
      directly from $DNA string
DNA
     bases
            counting in strings
file test operators
files
      attributes, testing for
for loops
      initialization and increment of counter
incrementing variables
      for loop
length function
loops
      for and while
numbering positions in Perl strings and arrays
performance
      exploded strings and
strings
      length of
      operating on
substr function
while loops
      counter initialization and increment

© 2002, O'Reilly & Associates, Inc.