5.6
Operating on Strings
It's not necessary to explode a string into an array in order to look at
each character. In fact, sometimes you'd want to
avoid that. A large string takes up a
large amount of memory in your computer. So does a large array. When
you explode a string into an array, the original string is still
there, and you also have to make a copy of each character for the
elements of the new array you're creating. If you have a large
string, that already uses a good portion of available memory,
creating an additional array can cause you to run out of memory. When
you run out of memory, your computer performs poorly; it can slow to
a crawl, crash, or freeze ("hang"). These haven't
been worrisome considerations up to now, but if you use large data
sets (such as the human genome), you have to take these things into
account.
So let's say you'd like to avoid making a copy of the DNA
sequence data into another variable. Is there a way to just look at
the string $DNA and count the bases from it directly?
Yes. Here's some pseudocode, followed by a Perl program:
read in the DNA from a file
join the lines of the file into a single string of $DNA
# initialize the counts
count_of_A = 0
count_of_C = 0
count_of_G = 0
count_of_T = 0
for each base at each position in $DNA
if base is A
count_of_A = count_of_A + 1
if base is C
count_of_C = count_of_C + 1
if base is G
count_of_G = count_of_G + 1
if base is T
count_of_T = count_of_T + 1
done
print count_of_A, count_of_C, count_of_G, count_of_T
Example 5-6 shows a program that examines each base
in a string of DNA.
Example 5-6. Determining frequency of nucleotides, take 2
#!/usr/bin/perl -w
# Determining frequency of nucleotides, take 2
# Get the DNA sequence data
print "Please type the filename of the DNA sequence data: ";
$dna_filename = <STDIN>;
chomp $dna_filename;
# Does the file exist?
unless ( -e $dna_filename) {
print "File \"$dna_filename\" doesn\'t seem to exist!!\n";
exit;
}
# Can we open the file?
unless ( open(DNAFILE, $dna_filename) ) {
print "Cannot open file \"$dna_filename\"\n\n";
exit;
}
@DNA = <DNAFILE>;
close DNAFILE;
$DNA = join( '', @DNA);
# Remove whitespace
$DNA =~ s/\s//g;
# Initialize the counts.
# Notice that we can use scalar variables to hold numbers.
$count_of_A = 0;
$count_of_C = 0;
$count_of_G = 0;
$count_of_T = 0;
$errors = 0;
# In a loop, look at each base in turn, determine which of the
# four types of nucleotides it is, and increment the
# appropriate count.
for ( $position = 0 ; $position < length $DNA ; ++$position ) {
$base = substr($DNA, $position, 1);
if ( $base eq 'A' ) {
++$count_of_A;
} elsif ( $base eq 'C' ) {
++$count_of_C;
} elsif ( $base eq 'G' ) {
++$count_of_G;
} elsif ( $base eq 'T' ) {
++$count_of_T;
} else {
print "!!!!!!!! Error - I don\'t recognize this base: $base\n";
++$errors;
}
}
# print the results
print "A = $count_of_A\n";
print "C = $count_of_C\n";
print "G = $count_of_G\n";
print "T = $count_of_T\n";
print "errors = $errors\n";
# exit the program
exit;
Here's the output of Example 5-6:
Please type the filename of the DNA sequence data: small.dna
!!!!!!!! Error - I don't recognize this vase: V
A = 40
C = 27
G = 24
T = 17
errors = 1
In Example 5-6, I added a line of code to see if the
file exists:
unless ( -e $dna_filename) {
There are file test operators for several
conditions; see Appendix B or Perl documentation
under -X. Note that files have several
attributes, such as size, permission,
location in the filesystem, and type of file, and that many of these
things can be tested for easily with the file test operators.
Notice, also, that I have kept the detailed comments about the
regular expression, because regular expressions can be hard to read,
and a little commenting here helps a reader to skim the code.
Everything else is familiar,
until
you hit the for loop; it requires a little
explanation:
for ( $position = 0 ; $position < length $DNA ; ++$position ) {
# the statements in the block
}
This for loop is the equivalent of this
while loop:
$position = 0;
while( $position < length $DNA ) {
# the same statements in the block, plus ...
++$position;
}
Take a moment and compare these two loops. You'll see the same
statements but in different locations.
As you can see, the for loop brings the
initialization and increment of a counter
($position) into the loop statement, whereas in
the while loop, they are separate statements. In
the for loop, both the initialization and the
increment statement are placed between parentheses, whereas you find
only the conditional test in the while loop. In
the for loop, you can put initializations before
the first semicolon and increment statements after the second
semicolon. The initialization statement is done just once before
starting the loop, and the increment statement is done at the end of
each iteration through the block before going back to the conditional
test. It's really just a shorthand for the equivalent
while loop as just shown.
The conditional test checks to see if the position reached in the
string is less than the length of the string. It uses the
length
Perl function. Obviously, you don't
want to check characters beyond the length of the string. But a word
is in order here about the numbering of positions
in strings and
arrays.
By default, Perl assumes that a string begins at position
0 and its last character is at a position
that's numbered one less than the length of the string. Why do
it this way instead of numbering the positions from 1 up to and
including the length of the string? There are reasons, but
they're somewhat abstruse; see the documentation for
enlightenment. If it's any comfort, many other programming
languages make the same choice. (However, many do it the intuitive
way, starting at 1. Ah well.)
This way of numbering is important to biologists because they are
used to numbering sequences beginning with 1, not with
0 the way Perl does it. You sometimes have to add
1 to a position before printing out results so they'll make
sense to nonprogrammers. It's mildly annoying, but you'll
get used to it.
The same holds true for numbering the elements of an array. The first
element of an array is element 0; the last is
element $length-1.
Anyway, you see that the conditional test evaluates to
true while the value of
$position is length-1 or less
and fails when $position reaches the same value as
the length of the string. For example, say you have a string that
contains the text "seeing." This has a length of six
characters. The "s" is at position 0, and the
"g" is at position 5, which is one less than the string
length 6.
Back in the block, you call the
substr function to look into the string:
$base = substr($DNA, $position, 1);
This is a fairly general-purpose function for working with strings;
you can also insert and delete things. Here, you look at just one
character, so you call substr on the string
$DNA, ask it to look in position
$position for one character, and save the result
in scalar variable $base. Then you proceed to
accumulate the count as in the preceding version
of
the program, Example 5-4.