5.5 Exploding Strings into Arrays
Let's say you decide to explode the string of DNA into an
array. By explode I mean separating out each
letter in the string—sort of like blowing the string into bits.
In other words, the letters representing the bases of the DNA in the
string are separated, and each letter becomes its own scalar value in
an array. Then you can look at the array elements (each of which is a
single character) one by one, making the count as you go along. This
is the inverse of the join function in Section 5.3.2, which takes an array of strings
and makes a single scalar value out of them. (After exploding a
string into an array, you could then join the array back into an
identical string using join, if you so desire.)
I'm also adding to this version of the pseudocode the
instructions to get the DNA from a file and manipulate that file data
until it's a single string of DNA sequence. So first, you join
the data from the array of lines of the original file data, clean it
up by removing whitespace until only sequence is left, and then
explode it back into an array. But, of course, the point is that the
last array has exactly what is needed, the data in a convenient form
to use in the counting loop. Instead of an array of lines, with
newlines and possibly other unwanted characters, there's an
exact array of the individual bases.
read in the DNA from a file
join the lines of the file into a single string $DNA
# make an array out of the bases of $DNA
@DNA = explode $DNA
# initialize the counts
count_of_A = 0
count_of_C = 0
count_of_G = 0
count_of_T = 0
for each base in @DNA
    if base is A
        count_of_A = count_of_A + 1
    if base is C
        count_of_C = count_of_C + 1
    if base is G
        count_of_G = count_of_G + 1
    if base is T
        count_of_T = count_of_T + 1
done
print count_of_A, count_of_C, count_of_G, count_of_T
As promised, this version of the pseudocode is a bit more detailed.
It suggests a method to look at each of the bases by exploding the
string of DNA into an array of single characters. It also initializes
the counts to zero to ensure they start off right. It's easier
to see what's happening if you spell out the initialization in
the program, and it can prevent certain kinds of errors from creeping
into your code. (It's not a rule, however; sometimes, you may
prefer to leave the values of variables undefined until they are
used.) Perl assumes that an uninitialized variable has the value 0 if
you try to use it as a number, for instance by adding another number
to it. But you'll most likely get a warning if that is the
case.
Now that we have a design for the program, let's turn it into Perl
code. Example 5-4 is a workable program;
you'll see other ways to accomplish the same task more quickly
as you proceed in this chapter, but speed is not the main concern at
this point.
Example 5-4. Determining frequency of nucleotides
#!/usr/bin/perl -w
# Determining frequency of nucleotides
# Get the name of the file with the DNA sequence data
print "Please type the filename of the DNA sequence data: ";
$dna_filename = <STDIN>;
# Remove the newline from the DNA filename
chomp $dna_filename;
# open the file, or exit
unless ( open(DNAFILE, $dna_filename) ) {
    print "Cannot open file \"$dna_filename\"\n\n";
    exit;
}
# Read the DNA sequence data from the file, and store it
# into the array variable @DNA
@DNA = <DNAFILE>;
# Close the file
close DNAFILE;
# From the lines of the DNA file,
# put the DNA sequence data into a single string.
$DNA = join( '', @DNA);
# Remove whitespace
$DNA =~ s/\s//g;
# Now explode the DNA into an array where each letter of the
# original string is now an element in the array.
# This will make it easy to look at each position.
# Notice that we're reusing the variable @DNA for this purpose.
@DNA = split( '', $DNA );
# Initialize the counts.
# Notice that we can use scalar variables to hold numbers.
$count_of_A = 0;
$count_of_C = 0;
$count_of_G = 0;
$count_of_T = 0;
$errors = 0;
# In a loop, look at each base in turn, determine which of the
# four types of nucleotides it is, and increment the
# appropriate count.
foreach $base (@DNA) {
    if ( $base eq 'A' ) {
        ++$count_of_A;
    } elsif ( $base eq 'C' ) {
        ++$count_of_C;
    } elsif ( $base eq 'G' ) {
        ++$count_of_G;
    } elsif ( $base eq 'T' ) {
        ++$count_of_T;
    } else {
        print "!!!!!!!! Error - I don\'t recognize this base: $base\n";
        ++$errors;
    }
}
# print the results
print "A = $count_of_A\n";
print "C = $count_of_C\n";
print "G = $count_of_G\n";
print "T = $count_of_T\n";
print "errors = $errors\n";
# exit the program
exit;
To demonstrate Example 5-4, I have created the
following small file of DNA and called it small.dna:
AAAAAAAAAAAAAAGGGGGGGTTTTCCCCCCCC
CCCCCGTCGTAGTAAAGTATGCAGTAGCVG
CCCCCCCCCCGGGGGGGGAAAAAAAAAAAAAAATTTTTTAT
AAACG
The file small.dna can be typed into your
computer using your favorite text editor, or you can download it from
this book's web site.
Notice that there is a V in the file, an error.[4]
Here is the output of Example 5-4:
Please type the filename of the DNA sequence data: small.dna
!!!!!!!! Error - I don't recognize this base: V
A = 40
C = 27
G = 24
T = 17
errors = 1
Now let's look at the new stuff in this program. Opening and
reading the sequence data is the same as in previous programs. The first
new thing is at this line:
@DNA = split( '', $DNA );
which the comments say will explode the string $DNA into an array
of single characters @DNA.
split is the companion to join,
and it's a good idea to take a little while to look over the
documentation for these two commands. Calling
split with an empty string as the first argument
causes the string to explode into individual characters; that's
just what we want.[5]
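As a minimal sketch of how split and join mirror each other, the short program below explodes a string and then reassembles it. The sequence and the variable names $dna, @bases, and $rejoined are made up for illustration; they are not part of Example 5-4.

```perl
#!/usr/bin/perl -w
# Explode a string into single characters, then join it back together.
# $dna holds a made-up sequence used only for this illustration.
$dna = 'ACGTTGCA';

# split with an empty string as the first argument yields
# one array element per character
@bases = split( '', $dna );

# join with an empty separator reverses the operation
$rejoined = join( '', @bases );

print "Number of bases: ", scalar(@bases), "\n";
print "Bases: @bases\n";
print "Rejoined: $rejoined\n";
```

The rejoined string is identical to the original, which is the sense in which the two functions are companions.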
Next, there are five scalar variables initialized to 0: the four
counts $count_of_A through $count_of_T, and $errors.
Initializing means assigning an initial value, in this case, the value 0.
Example 5-4 illustrates the concepts of
type and initialization.
The type of a variable determines what kind of data it can hold, for
instance, strings or numbers. Up to now we've been using scalar
variables such as $DNA to store strings of letters
such as A, C, G, and T. Example 5-4 shows that you
can also use scalar variables to store numbers. For example, the
variable $count_of_A keeps a running count of the
character A.
Scalar variables can store
integers (0, 1, -1, 2, -2, ...), decimal or
floating-point numbers such as 6.544, and numbers in scientific
notation such as 6.544E6, which translates as 6.544 x 10^6, or
6,544,000. (See Appendix B for more details on types
of numbers.)
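The following sketch stores each of these kinds of number in a scalar variable and prints it. The variable names are chosen only for illustration.

```perl
#!/usr/bin/perl -w
# Scalar variables holding different kinds of numbers.
# The variable names are illustrative only.
$integer    = -2;
$decimal    = 6.544;
$scientific = 6.544E6;    # scientific notation for 6.544 x 10^6

print "integer = $integer\n";
print "decimal = $decimal\n";
print "scientific = $scientific\n";
```

Perl prints the last value as 6544000, the ordinary decimal form of 6.544E6.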
In Example 5-4, the variables
$count_of_A through $count_of_T
are initialized to 0. Initializing a variable
means giving it a value after it's declared. If you don't
initialize your variables, they assume the value of
'undef'. In Perl, an undefined variable is 0 if it
is asked for in numerical context; it's an empty string if used
in a string operation. Although Perl programmers often choose not to
initialize variables, it's a critical step in many other
languages. In C for instance, uninitialized variables have
unpredictable values. This can wreak havoc with your output. You
should get in the habit of initializing variables; it makes the
program easier to read and maintain, and that's important.
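To see this behavior for yourself, here is a small sketch in which a variable is deliberately never initialized. The names $count, $sum, and $text are hypothetical. Because the program runs with -w, Perl should also report the use of an uninitialized value on the error output for both operations.

```perl
#!/usr/bin/perl -w
# $count is deliberately never given a value, so it is undefined.

# In a numeric context, the undefined value is treated as 0.
$sum = $count + 5;

# In a string operation, the undefined value is treated as an empty string.
$text = "[" . $count . "]";

print "sum = $sum\n";
print "text = $text\n";
```

The program prints sum = 5 and text = [], and the warnings point at the lines where $count was used without a value. Initializing $count to 0 first makes the warnings go away.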
To declare a variable means to specify its name and other attributes,
such as an initial value and a scope. Many languages require you to
declare all variables before using them. For this book, up to now,
declarations have been an unnecessary complication. The next chapter
begins to require declarations. In Perl, a declaration can give a
variable a scope (see Chapter 6 and the discussion of
my variables) as well as an initial value.
Many languages also require you to declare the type of a variable,
for example "integer" or "string," but Perl
does not.
Perl is written to be smart about what's in a scalar variable.
For instance, you can assign the number 1234
(without quotes) to a variable, or you can assign the string
'1234' (with quotes). Perl treats the variable as
a string for printing, and as a number for using in arithmetic
operations, without your having to worry about it. Example 5-5 demonstrates this ability. In other words,
Perl isn't strict about specifying the type of data a variable
is used for.
Example 5-5. Demonstration of Perl's built-in knowledge about numbers and strings
#!/usr/bin/perl -w
# Demonstration of Perl's built-in knowledge about numbers and strings
$num = 1234;
$str = '1234';
# print the variables
print $num, " ", $str, "\n";
# add the variables as numbers
$num_or_str = $num + $str;
print $num_or_str, "\n";
# concatenate the variables as strings
$num_or_str = $num . $str;
print $num_or_str, "\n";
exit;
Example 5-5 produces the output:
1234 1234
2468
12341234
Example 5-5 illustrates the smart way Perl determines the datatype of
a scalar variable, whether it's a string or a number, and
whether you're trying to add or subtract it like a number or
concatenate it like a string. Perl behaves accordingly, which makes
your job as a programmer a little bit easier; Perl "does the
right thing" for you.
Next is a new kind of loop, the foreach loop. This loop works over
the elements of an array. The line:
foreach $base (@DNA) {
loops over the elements of the array @DNA, and
each time through the loop, the scalar variable
$base (or whatever name you choose) is set to the
next element of the array.
The body of the loop checks for each base and
increments the count for that base if found.
There are four ways to add 1 to a number in Perl. Here, you put a
++ in front of the variable, like this:
++$count;
You can also put the ++ after the variable:
$count++;
You can spell it out like this, a combination of adding and
assignment:
$count = $count + 1;
or, as a shorthand of that, you can say:
$count += 1;
Almost an embarrassment of riches. The plus-plus
(++) notation is convenient for incrementing
counts, as we're doing here. The plus-equals
(+=) notation saves some typing and is
very popular for adding other numbers besides 1.
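As a quick check that the four forms are equivalent, this sketch applies each of them once to the same variable and then uses += to add a number other than 1. The starting value and the amount 10 are arbitrary.

```perl
#!/usr/bin/perl -w
# Four ways to add 1 to a number, applied one after another.
$count = 0;

++$count;               # plus-plus in front of the variable
$count++;               # plus-plus after the variable
$count = $count + 1;    # addition and assignment spelled out
$count += 1;            # plus-equals shorthand

print "count = $count\n";

# plus-equals also adds numbers other than 1
$count += 10;
print "count = $count\n";
```

The first print shows count = 4, one for each increment, and the second shows count = 14.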
The foreach loop in Example 5-4
could have been written like this:
foreach (@DNA) {
    if ( /A/ ) {
        ++$count_of_A;
    } elsif ( /C/ ) {
        ++$count_of_C;
    } elsif ( /G/ ) {
        ++$count_of_G;
    } elsif ( /T/ ) {
        ++$count_of_T;
    } else {
        print "!!!!!!!! Error - I don\'t recognize this base: ";
        print;
        print "\n";
        ++$errors;
    }
}
This version of the foreach loop:
foreach (@DNA) {
doesn't name a scalar variable. In a foreach
loop, if you don't specify a scalar variable to hold the
scalars that are being read from the array ($base
served that function in the version of this loop in Example 5-4),
Perl uses the special variable $_.
Furthermore, many Perl built-in functions operate on
this special variable if no argument is provided to them. Here, the
conditional tests are simply patterns; Perl assumes you're
doing a pattern match on the $_ variable, so it
behaves as if you had said $_ =~ /A/, for
instance. Finally, in the error message, the statement
print; prints the value of the $_ variable.
This special variable $_ that doesn't have
to be named appears in many Perl programs, although I don't use
it extensively in this book.
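For a compact illustration of $_ at work, the sketch below loops over a short, made-up list of bases without naming a loop variable. The array @bases and the counter $recognized are hypothetical names, and the single pattern /[ACGT]/ stands in for the four separate tests used above.

```perl
#!/usr/bin/perl -w
# Loop over an array without naming a loop variable; Perl sets $_
# to each element in turn. The list of bases is made up.
@bases = ( 'A', 'C', 'G', 'V' );
$recognized = 0;

foreach (@bases) {
    if ( /[ACGT]/ ) {       # same as $_ =~ /[ACGT]/
        ++$recognized;
    } else {
        print "Unrecognized base: ";
        print;              # with no argument, prints $_
        print "\n";
    }
}

print "recognized = $recognized\n";
```

Run on this list, the program reports V as unrecognized and counts three recognized bases.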