4.2
A Program to Store a DNA Sequence
Let's write a small program that stores some
DNA in a variable and prints it to the
screen. The DNA is written in the usual fashion, as a string made of
the letters A, C, G, and T, and we'll call the
variable $DNA. In
other words, $DNA is the name of the DNA sequence
data used in the program. Note that in Perl, a variable is really the
name for some data you wish to use. The name gives you full access to
the data. Example 4-1 shows the entire program.
Example 4-1. Putting DNA into the computer
#!/usr/bin/perl -w
# Storing DNA in a variable, and printing it out

# First we store the DNA in a variable called $DNA
$DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';

# Next, we print the DNA onto the screen
print $DNA;

# Finally, we'll specifically tell the program to exit.
exit;
Using what you've already learned about text editors and
running Perl programs in Chapter 2, enter the code
(or copy it from the book's web site) and save it to a file.
Remember to save the program as ASCII or text-only format, or Perl
may have trouble reading the resulting file.
The second step is to run the program. The details of how to run a
program depend on the type of computer you have (see Chapter 2). Let's say the program is on your
computer in a file called example4-1. As you
recall from Chapter 2, if you are running this
program on Unix or Linux, you type the following in a shell window:
perl example4-1
On a Mac, open the file with the MacPerl application and save it as a
droplet,
then just double-click on the droplet. On Windows, type the following
in an MS-DOS command window:
perl example4-1
If you've successfully run the program, you'll see the
output printed on your computer screen.
4.2.1
Control Flow
Example 4-1 illustrates many of the ideas all our
Perl programs will rely on. One of these ideas is control flow, or
the order in which the statements in the program are executed by the
computer.
Every program starts at the first line and executes the statements
one after the other until it reaches the end, unless it is explicitly
told to do otherwise. Example 4-1 simply proceeds
from top to bottom, with no detours.
In later chapters, you'll learn how programs can control the
flow of execution.
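As a small illustration of top-to-bottom execution, the following sketch (not one of the book's numbered examples) assigns to $DNA twice and prints it after each assignment. The shortened sequences and the "\n" newline string are assumptions made for the illustration.

```perl
#!/usr/bin/perl -w
# Statements run in the order they appear, from top to bottom

$DNA = 'ACGGGAGG';      # 1st statement: store a short sequence
print $DNA;             # 2nd: prints ACGGGAGG
print "\n";             # 3rd: start a new line on the screen

$DNA = 'TTACTACG';      # 4th: replace the stored sequence
print $DNA;             # 5th: prints TTACTACG, because the 4th statement already ran
print "\n";
```

Swapping the order of any two of these statements changes what appears on the screen, which is what is meant by control flow.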
4.2.2
Comments Revisited
Now let's take a look at the parts of Example 4-1. You'll notice lots of blank lines.
They're there to make the program easy for a human to read.
Next, notice the
comments that begin with the #
sign. Remember from Chapter 3 that when Perl runs,
it throws these away along with the blank lines. In fact, to Perl,
the following is exactly the same program as Example 4-1:
#!/usr/bin/perl -w
$DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC'; print $DNA; exit;
In Example 4-1, I've made liberal use of
comments. Comments at the beginning of the code can make clear what
the program is for and who wrote it, and can present other information
that helps someone who needs to understand the code. Comments
also explain what each section of the code is for and sometimes
explain how the code achieves its goals.
It's tempting to belabor the point about the importance of
comments. Suffice it to say that in most university-level
computer science class assignments, a program without comments
typically gets a low or failing grade; likewise, a programmer on the
job who doesn't comment code is liable to have a short and
unsuccessful career.
4.2.3
Command Interpretation
Because it starts with a # sign, the first line of the program looks
like a comment, but it doesn't seem like a very informative
comment:
#!/usr/bin/perl -w
This special line, called command interpretation, tells a
computer running Unix or Linux that this is a Perl program. It may
look slightly different on different computers. On some machines,
it's also unnecessary because the computer recognizes Perl from
other information. A Windows machine is usually configured to assume
that any program whose name ends in .pl is a Perl
program. In Unix or Linux, a Windows command window, or a Mac OS X
shell, you can type perl my_program, and your Perl
program my_program won't need the special
line. However, it's commonly used, so we'll have it at
the start of all our programs.
Notice that the first line of code uses a flag -w.
The "w" stands for warnings, and it causes Perl to print
warning messages when it finds something questionable in your
program, such as a variable that is used before it has been given a
value. Very often a warning or error message suggests the line number
where Perl thinks the problem began. Sometimes the line number is wrong,
but the problem is usually on or just before the line the message
suggests. Later in the book, you'll also see the statement
use warnings as an alternative
to -w.
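A minimal sketch of that alternative: the program below is a shortened version of Example 4-1 with the -w flag removed from the first line and a use warnings statement added instead. For a short, single-file program like this one, the two approaches should report the same warnings. The final print of "\n" is an addition for readability.

```perl
#!/usr/bin/perl
# Turning on warnings with a statement instead of the -w flag
use warnings;

$DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';
print $DNA;
print "\n";
```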
4.2.4
Statements
The next line of Example 4-1 stores the
DNA
in a variable:
$DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';
This is a very common, very important thing to do in a computer
language, so let's take a leisurely look at it. You'll
see some basic features about Perl and about programming languages in
general, so this is a good place to stop skimming and actually read.
This line of code is called a statement. In
Perl, statements end in a semicolon (;). The use of the
semicolon is similar to the use of the period in the English
language.
To be more accurate, this line of code is an
assignment
statement. Its purpose in this program is to store some DNA into a
variable called $DNA. There are several
fundamental things happening here as you will see in the next
sections.
4.2.4.1
Variables
First, let's look at the
variable
$DNA. Its name is somewhat arbitrary. You can pick
another name for it, and the program behaves the same way. For
instance, if you replace the two lines:
$DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';
print $DNA;
with these:
$A_poem_by_Seamus_Heaney = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';
print $A_poem_by_Seamus_Heaney;
the program behaves in exactly the same way, printing out the DNA to
the computer screen. The point is that the names of variables in a
computer program are your choice. (Within certain restrictions: in
Perl, a variable name must be composed from upper- or lowercase
letters, digits, and the underscore _ character. Also the first
character must not be a digit.)
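The naming rules can be sketched with a few examples. The names and short sequences below are hypothetical, chosen only to illustrate the rules. The sketch also relies on one fact not stated above: Perl treats upper- and lowercase letters as different, so $DNA and $dna are two separate variables.

```perl
#!/usr/bin/perl -w
# Legal variable names: letters, digits, and underscores,
# with a first character that is not a digit

$DNA = 'ACGG';
$dna = 'TTAC';              # a different variable: names are case-sensitive
$DNA_fragment_2 = 'GCAT';   # digits are fine after the first character

# $2nd_fragment = 'GCAT';   # not a legal name: it starts with a digit

print $DNA;
print "\n";
print $dna;
print "\n";
print $DNA_fragment_2;
print "\n";
```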
This is another important point along the same lines as the remarks
I've already made about using blank lines and comments to make
your code more easily read by humans. The computer attaches no
meaning to the use of the variable name $DNA
instead of $A_poem_by_Seamus_Heaney, but whoever
reads the program certainly will. One name makes perfect sense,
clearly indicates what the variable is for in the program, and eases
the chore of understanding the program. The other name makes it
unclear what the program is doing or what the variable is for. Using
well-chosen variable names is part of what's called
self-documenting
code. You'll still need comments, but perhaps not as many, if
you pick your variable names well.
You've noticed that the variable name $DNA
starts with a dollar sign. In Perl this kind of variable is called a
scalar variable, which is a variable that
holds a single item of data. Scalar variables are used for such data
as strings or various kinds of numbers (e.g., the string
hello or numbers such as 25, 6.234, 3.5E10,
-0.8373). A scalar variable holds just one item of data at a time.
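As a brief sketch, the values just mentioned can each be stored in a scalar variable of their own. The variable names are hypothetical, and the "\n" newline string is used only to separate the output lines.

```perl
#!/usr/bin/perl -w
# Scalar variables, each holding one string or one number

$greeting = 'hello';
$count    = 25;
$ratio    = 6.234;
$large    = 3.5E10;
$negative = -0.8373;

print $greeting;
print "\n";
print $ratio;
print "\n";
print $large;
print "\n";
print $negative;
print "\n";

# A scalar holds one item at a time: assigning again replaces the old value
$count = 26;
print $count;
print "\n";
```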
4.2.4.2
Strings
In Example 4-1, the scalar variable
$DNA is holding
some DNA, represented in the usual way
by the letters A, C, G, and T. As stated earlier, in computer science
a sequence of letters is called a string. In Perl you designate a
string by putting it in quotes. You can use single quotes, as in
Example 4-1, or double quotes. (You'll learn
the difference later.) The DNA is thus represented by:
'ACGGGAGGACGGGAAAATTACTACGGCATTAGC'
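For a string made up only of the letters A, C, G, and T, either kind of quote stores the same characters, as this short sketch suggests. The variable names $DNA_single and $DNA_double are hypothetical.

```perl
#!/usr/bin/perl -w
# The same DNA string written with single and with double quotes

$DNA_single = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';
$DNA_double = "ACGGGAGGACGGGAAAATTACTACGGCATTAGC";

print $DNA_single;
print "\n";
print $DNA_double;
print "\n";
```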
4.2.4.3
Assignment
In Perl, to set a variable to a certain value, you use the
= sign. The = sign is called the assignment operator.
In Example 4-1, the value:
'ACGGGAGGACGGGAAAATTACTACGGCATTAGC'
is assigned to the variable $DNA. After the
assignment, you can use the name of the variable to get the value, as
in the print statement in Example 4-1.
The order of the parts is important in an
assignment statement. The value
assigned to something appears to the right of the assignment
operator. The variable that is assigned a value is always to the left
of the assignment operator. In programming manuals, you sometimes
come across the terms
lvalue and rvalue to
refer to the left and right sides of the assignment operator.
This use of the = sign has a long
history in programming languages. However, it can be a source of
confusion: for instance, in most mathematics, using
= means that the two things on either side of
the sign are equal. So it's important to note that in Perl, the
= sign doesn't mean equality. It assigns a
value to a variable. (Later, we'll see how to represent
equality.)
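A short sketch may make the difference concrete. The variables $DNA_copy and $count are hypothetical. The line that adds 1 to $count would be nonsense as a mathematical equation, but it is a perfectly good assignment: Perl first works out the value on the right, then stores it in the variable on the left.

```perl
#!/usr/bin/perl -w
# The = sign stores a value; it does not state that two things are equal

$DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';
$DNA_copy = $DNA;       # the value on the right is stored in the variable on the left

$count = 1;
$count = $count + 1;    # compute 1 + 1 first, then store the result in $count
print $count;           # prints 2
print "\n";

print $DNA_copy;
print "\n";
```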
So, to summarize what we've learned so far about this statement:
$DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';
It's an assignment statement that sets the value of the
scalar
variable $DNA to a string representing some DNA.
4.2.4.4
Print
The statement:
print $DNA;
prints ACGGGAGGACGGGAAAATTACTACGGCATTAGC out to
the computer screen. Notice that the
print statement deals with
scalar variables by printing out their
values—in this case, the string that the variable
$DNA contains. You'll see more about
printing later.
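As a small preview, and only as a sketch, print can also be given several items separated by commas, which it prints one after another. The label text and the "\n" newline string are assumptions added for the illustration.

```perl
#!/usr/bin/perl -w
# Printing the value of a scalar variable

$DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';

print $DNA;             # prints the value stored in $DNA, not the name
print "\n";             # move to a new line on the screen

# print also accepts several items separated by commas
print 'The DNA is: ', $DNA, "\n";
```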
4.2.4.5
Exit
Finally, the statement
exit; tells the computer to exit the
program.
Perl doesn't require an exit statement at
the end of a program; once you get to the end, the program exits
automatically. But it doesn't hurt to put one in, and it
clearly indicates the program is over. You'll see other
programs that exit if something goes wrong before the program
normally finishes, so the exit statement is
definitely useful.
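The following is a minimal sketch of such an early exit, not code from the book. It uses an if test and the eq comparison, both of which are covered in later chapters, and the check for an empty sequence is a hypothetical example of "something going wrong." Passing a number to exit is optional; by convention, a nonzero value tells the operating system that the program ended because of a problem.

```perl
#!/usr/bin/perl -w
# Exiting early if something is wrong

$DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';

# Hypothetical check: stop the program if there is no DNA to print
if ($DNA eq '') {
    print "No DNA sequence to print\n";
    exit 1;     # by convention, a nonzero value signals a problem
}

print $DNA;
print "\n";

# Reaching the end of the file ends the program, so no exit is needed here
```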