5.7
Writing to Files
Example 5-7 shows one more way to
count nucleotides in a string
of DNA. It uses a Perl trick that was designed with exactly this kind
of job in mind. It puts a global regular expression search in the
test for a while
loop, and as you'll see,
it's a compact way of counting characters in a string.
One of the nice things about Perl is that if you need to do something
fairly regularly, the language has probably got a relatively succinct
way to do it. (The downside of this is that Perl has a lot of things
about it to learn.)
The results of Example 5-7, besides being printed to
the screen, will also be written to a file. The code that
accomplishes this writing to a file is as follows:
# Also write the results to a file called "countbase"
$outputfile = "countbase";
(
unless ( open(COUNTBASE, ">$outputfile") ) {
print "Cannot open file \"$outputfile\" to write to!!\n\n";
exit;
}
print COUNTBASE "A=$a C=$c G=$g T=$t errors=$e\n";
close(COUNTBASE);
As you see, to write to a file, you do an
open call, just as when reading from a
file, but with a difference: you prepend a greater-than sign > to
the filename. The filehandle becomes a first argument to a
print statement (but
without a comma following it). This makes the
print statement direct its output into the
file.[6]
Example 5-7 is the third version of the Perl program
that examines each base in a string of DNA.
Example 5-7. Determining frequency of nucleotides, take 3
#!/usr/bin/perl -w
# Determining frequency of nucleotides, take 3
# Get the DNA sequence data
print "Please type the filename of the DNA sequence data: ";
$dna_filename = <STDIN>;
chomp $dna_filename;
# Does the file exist?
unless ( -e $dna_filename) {
print "File \"$dna_filename\" doesn\'t seem to exist!!\n";
exit;
}
# Can we open the file?
unless ( open(DNAFILE, $dna_filename) ) {
print "Cannot open file \"$dna_filename\"\n\n";
exit;
}
@DNA = <DNAFILE>;
close DNAFILE;
$DNA = join( '', @DNA);
# Remove whitespace
$DNA =~ s/\s//g;
# Initialize the counts.
# Notice that we can use scalar variables to hold numbers.
$a = 0; $c = 0; $g = 0; $t = 0; $e = 0;
# Use a regular expression "trick", and five while loops,
# to find the counts of the four bases plus errors
while($DNA =~ /a/ig){$a++}
while($DNA =~ /c/ig){$c++}
while($DNA =~ /g/ig){$g++}
while($DNA =~ /t/ig){$t++}
while($DNA =~ /[^acgt]/ig){$e++}
print "A=$a C=$c G=$g T=$t errors=$e\n";
# Also write the results to a file called "countbase"
$outputfile = "countbase";
unless ( open(COUNTBASE, ">$outputfile") ) {
print "Cannot open file \"$outputfile\" to write to!!\n\n";
exit;
}
print COUNTBASE "A=$a C=$c G=$g T=$t errors=$e\n";
close(COUNTBASE);
# exit the program
exit;
Example 5-7 looks like this when you run it:
Please type the filename of the DNA sequence data: small.dna
A=40 C=27 G=24 T=17 errors=1
The output file countbase has the following
contents after you run Example 5-7:
A=40 C=27 G=24 T=17 errors=1
The while loop:
while($dna =~ /a/ig){$a++}
has as its conditional test, within the parentheses, a
string-matching expression:
$dna =~ /a/ig
This expression is looking for the regular expression
/a/, that is, the letter a.
Since it has the i modifier, it's a
case-insensitive match, which means it matches a
or A. It also has the global modifier, which means
match all the a's in the string. (Without
the global modifier, it just keeps returning true
every time through the loop, if there is an "a" in
$dna.)
Now, this string-matching expression, in the context of a
while loop, causes the
while loop to execute its block on every match
of the regular expression. So, append the one-statement block:
{$a++}
to increment the counter at each match of the regular expression; in
other words, you're counting all the
a's.
One other point should be made about this third version of the
program. You'll notice some of the statements have been changed
and shortened this time around. Some variables have shorter names,
some statements are lined up on one line, and the
print statement at the end is more concise.
These are just alternative ways of writing. As you program,
you'll find yourself experimenting with different
approaches: try some on for size.
The way to count bases in this third version is
flexible; for instance, it allows you to count non-ACGT characters
without specifying them individually. In later chapters, you'll
use those while loops to good effect. However,
there's an even faster way to count bases. You
can
use the tr transliteration
function from Chapter 4; it's faster, which
is helpful if you have a lot of DNA to count:
$a = ($dna =~ tr/Aa//);
$c = ($dna =~ tr/Cc//);
$g = ($dna =~ tr/Gg//);
$t = ($dna =~ tr/Tt//);
The tr function returns the count of the
specified characters it finds in the string, and if the set of
replacement characters is empty, it doesn't actually change the
string. So it makes a good character counter. Notice that with
tr, you have to spell out the upper- and
lowercase letters. Also, because tr
doesn't accept character classes, there's no direct way
to count nonbases. You could, however, say:
$basecount = ($dna = ~ tr/ACGTacgt//);
$ nonbase = (length $dna) - $basecount)
The program however, runs faster using tr than
using the while loops of Example 5-7.
You may find it a bit much to have three (really, four) versions of
this base-counting program, especially since much of the code in each
version is identical. The only part of the program that really
changed was the part that did the counting of the bases.
Wouldn't it have been convenient to have a way to just alter
the part that counts the bases? In Chapter 6,
you'll see how subroutines allow you to partition your programs
in just such a way.