3.5
The Programming Process
You've been assigned to write a program that counts the
regulatory elements in DNA. If you've
never programmed, you probably have no idea how to start.
Let's talk about what you need to know to write the program.
Here's a summary of the steps we'll cover:
- Identify the required inputs, such as data or information given by
  the user.
- Make an overall design for the program, including the general
  method (the algorithm) by which the program computes the
  output.
- Decide how the outputs will print; for example, to files or displayed
  graphically.
- Refine the overall design by specifying more detail.
- Write the Perl program code.
These steps may be different for shorter or longer programs, but this
is the general approach you will take for most of your programming.
3.5.1
The Design Phase
First, you need to conceive a
plan for how the program is going to work. This is the overall design
of the program, an important step that's usually done before
the actual writing of the program begins. Programs are often compared
to kitchen recipes, in that they are specific instructions on how to
accomplish some task. As part of the design, you need an idea of what inputs
and outputs the program will have. In our example, the input would be
the new DNA. You then need a strategy for how the program will do the
necessary computing to calculate the desired output from the input.
In our example, the program first needs to collect information from
the user: namely, where is the DNA? (This information can be the name
of a file that contains the computer representation of the DNA
sequence.) The program needs to allow the user to type in the name of
a datafile, maybe from the computer screen or from a web page. Then
the program has to check if the file exists (and complain if not, as
might happen, for instance, if the user misspelled the name) and
finally open the file and read in the DNA before continuing.
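The file-handling step just described can be sketched in Perl roughly as follows. This is only an illustration, not the program developed later in the book: the subroutine name read_dna_file is made up for this sketch, and it assumes the file holds plain sequence text, one or more lines, with nothing else in it.

```perl
use strict;
use warnings;

# Hypothetical helper (name chosen for this sketch): read a DNA sequence
# from a named file. Returns the sequence as one string, or undef if the
# file is missing or can't be opened.
sub read_dna_file {
    my ($filename) = @_;

    # Complain if the file doesn't exist (for example, a misspelled name).
    unless (-e $filename) {
        warn "File \"$filename\" doesn't seem to exist!\n";
        return;
    }

    open(my $fh, '<', $filename) or do {
        warn "Can't open \"$filename\": $!\n";
        return;
    };

    # Join all lines of the file into a single string, dropping newlines.
    my $dna = '';
    while (my $line = <$fh>) {
        chomp $line;
        $dna .= $line;
    }
    close $fh;

    return $dna;
}
```

In a complete program, the filename would first be collected from the user, for instance by reading a line from STDIN and removing the trailing newline with chomp, and then passed to this subroutine.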
This simple step deserves some comment. You can put the DNA directly
into the program code and avoid having to write this whole part of
the program. But a program designed to read in the DNA
is more useful, because you won't have to rewrite the
program every time you get some new DNA. It's a simple, even
obvious idea, but very powerful.
The data your program uses to compute is called the
input. Input can come from files, from other
programs, from users running the program, from forms filled out on
web sites, from email messages, and so forth. Most programs read in
some form of input; some programs don't.
Let's add the list of regulatory elements to the actual program
code. You can ask for a file that contains this list, as we did with
the DNA, and have the program be capable of searching different lists
of regulatory elements. However, in this case, the list you will use
isn't going to change, so why bother the user with inputting
the name of another file?
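Putting the list directly into the program code might look something like the sketch below. The variable name and the three sequences are placeholders chosen for illustration; they are not the list the text has in mind.

```perl
use strict;
use warnings;

# Placeholder list of regulatory elements, written directly in the code.
# The sequences are examples only; substitute the list you actually need.
my @regulatory_elements = ('TATAAA', 'GGCCAATCT', 'GGGCGG');

print "Searching for ", scalar(@regulatory_elements), " regulatory elements\n";
```

Because the list lives in the code, changing it means editing the program, which is acceptable here only because the list isn't expected to change.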
Now that we have the DNA and the list of regulatory elements, you have
to decide in general terms how the program is actually going to
search for each regulatory element in the DNA. This step is obviously
the critical one, so make sure you get it right. For instance, you
want the program to run quickly enough, if the speed of the program
is an important consideration.
This is the problem of choosing the correct algorithm for the job. An
algorithm is a design for computing a problem (I'll say more
about it in a minute). For instance, you may decide to take each
regulatory element in turn and search through the DNA from beginning
to end for that element before going on to the next one. Or perhaps
you may decide to go through the DNA only once, and at each position
check each of the regulatory elements to see if it is present. Is
there any advantage to one way or the other? Can you sort the list
of regulatory elements so your search can proceed more quickly? For
now, let's just say that your choice of algorithm is important.
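To make the two strategies concrete, here is a rough sketch of each in Perl. The subroutine names are invented for this illustration, and both versions assume the elements in the list are distinct and non-empty. Each returns the number of elements that appear at least once in the DNA.

```perl
use strict;
use warnings;

# Strategy 1: take each element in turn and scan the whole DNA for it.
sub count_by_element {
    my ($dna, @elements) = @_;
    my $count = 0;
    foreach my $element (@elements) {
        # index returns -1 when the element is not found anywhere.
        $count++ if index($dna, $element) != -1;
    }
    return $count;
}

# Strategy 2: go through the DNA once, and at each position check
# whether any of the elements starts there.
sub count_by_position {
    my ($dna, @elements) = @_;
    my %found;    # elements seen so far
    for my $pos (0 .. length($dna) - 1) {
        foreach my $element (@elements) {
            $found{$element} = 1
                if substr($dna, $pos, length $element) eq $element;
        }
    }
    return scalar keys %found;
}
```

Both subroutines give the same answer on the same data; what differs is the order in which the comparisons are made, and that is where questions of speed come in.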
The final part of the design is to provide some form of
output for the results. Perhaps
you want the results displayed on a web page, as a simple list on the
computer screen, in a printable file, or perhaps all of the above. At
this stage, you may need to ask the user for a filename to save the
output.
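One simple way to handle output is to print to the screen unless the user has supplied a filename. The sketch below assumes a hypothetical subroutine named report_count and an invented message format; it only illustrates the choice between the two destinations.

```perl
use strict;
use warnings;

# Hypothetical helper: report the count either to a file (if a filename
# is given) or to the screen. Returns the message that was written.
sub report_count {
    my ($count, $outfile) = @_;
    my $message = "Regulatory elements found: $count\n";

    if (defined $outfile) {
        open(my $out, '>', $outfile)
            or die "Can't write to \"$outfile\": $!\n";
        print $out $message;
        close $out;
    } else {
        print $message;
    }
    return $message;
}
```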
This brings up the problem of how to display results. This question
is actually a critically important one. The ideal solution is to
display the results in a way that shows the user at a glance the
salient features of the computation. You can use graphics, color,
maps, little bouncing balls over the unexpected result: there are
many options. A program that outputs results that are hard to read is
clearly not doing a good job. In fact, output that makes the salient
results hard to find or understand can completely negate all the
effort you put into writing an elegant program. Enough said for now.
There are several strategies employed by programmers to help create
good overall designs. Usually, any program but the smallest is
written in several small but interconnecting parts. (We'll see
lots of this as we proceed in later chapters.) What will the parts
be, and how will they interconnect? The field of software engineering
addresses these kinds of issues. At this point I only want to point
out that they are very important and mention some of the ways
programmers address the need for design.
There are many design methodologies; each has its dedicated
adherents. The best approach is to learn what is available and use
the best methodology for the job at hand. For instance, in this book
I'm teaching a style of programming called imperative
programming, which relies on dividing a problem into interacting
procedures or subroutines (see Chapter 6), an approach known as
structured design. Another popular style is called object-oriented
programming, which is also supported by Perl.
If you're working in a large group of programmers on a big
project, the design phase can be very formal and may even be done by
different people than the programmers themselves. On the other end of
the scale, you will find solitary programmers who just start writing,
developing a plan as they write the code. There is no one best way
that works for everyone. But no matter how you approach it, as a
beginner you still need to have some sort of design in mind before
you start writing code.
3.5.2
Algorithms
An algorithm
is the design, or plan, for the computation done by a computer
program. (It's actually a tricky term to define, outside of a
formal mathematical system, but this is a reasonable definition.) An
algorithm is implemented by coding it in a specific computer
language, but the algorithm is the idea of the computation.
It's often well represented in
pseudocode, which
gives the idea of a program without actually being a real computer
program.
Most programs do simple things. They get filenames from users, open
the files, and read in the data. They perform simple calculations and
display the results. These are the types of algorithms you'll
learn here.
However, the science of algorithms is a deep and fruitful one, with
many important implications for bioinformatics. Algorithms can be
designed to find new ways of analyzing biological data and of
discovering new scientific results. There are certainly many problems
in biology whose solutions could be, and will be, substantially
advanced by inventing new algorithms.
The science of algorithms includes many clever techniques. As a
beginning programmer, you needn't worry about them just yet. At
this stage, an introductory chapter in a beginning tutorial on
programming, it's not reasonable to go into details about
algorithmic methods. Your first task is just to learn how to write in
some programming language. But if you keep at it, you'll start
to learn the techniques. A decent textbook to keep around as a
reference is a good investment for a serious programmer (see Appendix A).
In the current example that counts regulatory elements in DNA, I
suggest a way of proceeding. Take each regulatory element in turn,
and search through the DNA for it, before proceeding to the next
regulatory element. Other algorithms are also possible; in fact, this
is one example of the general problem called string matching,
which is one of the most important for bioinformatics, and the study
of which has resulted in a variety of clever algorithms.
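As a small taste of string matching, the following sketch finds every position at which one element occurs in the DNA by calling Perl's built-in index function repeatedly. The subroutine name find_positions is made up for this example, and this is the plain, straightforward approach rather than one of the clever algorithms just mentioned.

```perl
use strict;
use warnings;

# Hypothetical helper: return the (0-based) positions at which $element
# occurs in $dna, including overlapping occurrences.
sub find_positions {
    my ($dna, $element) = @_;
    return () if $element eq '';    # nothing sensible to search for

    my @positions;
    my $pos = index($dna, $element);
    while ($pos != -1) {
        push @positions, $pos;
        # Resume the search one character past the last match.
        $pos = index($dna, $element, $pos + 1);
    }
    return @positions;
}
```

The number of positions returned tells you how many times the element occurs; an empty list means it isn't present at all.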
Algorithms are usually grouped by such problems or by technique, and
there is a wealth of material available. For the practical
programmer, some of the most valuable materials are collections of
algorithms written in specific languages that can be incorporated
into your programs. Use Appendix A as a starting
place. Using the collections of code and books given there,
it's possible to incorporate many algorithmic techniques in
your Perl code with relative ease.
3.5.3
Pseudocode and Code
Now you
have an overall design, including input, algorithm, and output. How
do you actually turn this general idea into a design for a program?
A common implementation strategy is to begin by writing what is
called pseudocode. Pseudocode is an informal
program, in which details are left out and formal syntax isn't
followed.[2]
It doesn't actually run as a
program; its purpose is to flesh out an idea of the overall design of
a program in a quick and informal way.
For example, in an actual Perl program you might write a bit of
code
called a subroutine (see Chapter 6), in this case,
a subroutine that gets an answer from a user typing at the keyboard.
Such a subroutine may look like this:
sub getanswer {
    print "Type in your answer here :";
    my $answer = <STDIN>;
    chomp $answer;
    return $answer;
}
But in pseudocode, you might just say:
getanswer
and worry about the details later.
Here's an example of
pseudocode for the program
I've been discussing:
get the name of DNAfile from the user
read in the DNA from the DNAfile
for each regulatory element
  if element is in DNA, then
    add one to the count
print count
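For comparison, here is one way the pseudocode above might eventually be expanded into Perl, with each pseudocode line kept as a comment. Treat it as a sketch under assumptions: the element list is a placeholder, the subroutine name is invented, and the filename is taken from the command line rather than by prompting the user.

```perl
use strict;
use warnings;

# Placeholder list of regulatory elements (illustrative sequences only).
my @elements = ('TATAAA', 'GGCCAATCT', 'GGGCGG');

sub count_elements_in_file {
    my ($dnafile) = @_;

    # read in the DNA from the DNAfile
    open(my $fh, '<', $dnafile) or die "Can't open \"$dnafile\": $!\n";
    my $dna = '';
    while (my $line = <$fh>) {
        chomp $line;
        $dna .= $line;
    }
    close $fh;

    # for each regulatory element
    my $count = 0;
    foreach my $element (@elements) {
        # if element is in DNA, then add one to the count
        # (\Q...\E makes the element match literally)
        if ($dna =~ /\Q$element\E/) {
            $count++;
        }
    }
    return $count;
}

# get the name of DNAfile from the user (here, from the command line)
if (@ARGV) {
    # print count
    print count_elements_in_file($ARGV[0]), "\n";
}
```

Notice how each line of pseudocode turned into a few lines of Perl, and how the pseudocode itself survives as comments that explain the code.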
3.5.4
Comments
Comments are parts of Perl
source code that
are used as an aid to understanding what the program does. Anything
from a # sign to the
end of a line is considered a comment and is ignored by the Perl
interpreter. (The exception is the first line of many Perl programs,
which looks something like this: #!/usr/bin/perl;
see Section 4.2.3 in Chapter 4.)
Comments are of considerable importance in keeping
code useful. They typically include a discussion of the overall
purpose and design of the program, examples of how to use the
program, and detailed notes interspersed throughout the code
explaining why that code is there and what it does. In general, a
good programmer writes good comments as an integral part of the
program. You'll see comments in all the programming examples in
this book.
This is important: your code has to be readable by humans as well as
computers.
Comments can also be useful when
debugging misbehaving programs. If
you're having trouble figuring out where a program is going
wrong, you can try to selectively comment out different parts of the
code. If you find a section that, when commented out, removes the
problem, you can then narrow down the part you've commented out
until you have a fairly short section of code in which you know where
the problem is. This is often a useful debugging approach.
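Here is a small, contrived illustration of commenting out a suspect line. The sequence and the steps are invented for this sketch; the point is only that the program still runs with the suspect line disabled, so you can see whether the problem goes away.

```perl
use strict;
use warnings;

my $dna = 'ACGTACGTACGT';

# Step 1: make sure the sequence is in uppercase.
$dna = uc $dna;

# Step 2: suspected of causing trouble, so commented out for this test run.
# If the problem disappears, the bug is somewhere in this line.
# $dna = substr($dna, 0, 10);

# Step 3: report the length of the sequence.
my $len = length $dna;
print "DNA length: $len\n";
```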
Comments can be used when you turn
pseudocode into Perl source code.
Pseudocode is not Perl code, so the Perl interpreter will complain
about any pseudocode that is not commented out. You can
comment out the pseudocode by placing
# signs at the beginning of all pseudocode lines:
#get the name of DNAfile from the user
#read in the DNA from the DNAfile
#for each regulatory element
#  if element is in DNA, then
#    add one to the count
#print count
As you expand your pseudocode design into Perl code, you can
uncomment the Perl code by removing the # signs.
In this way you may have a mixture of Perl and pseudocode, but you
can run and test the Perl parts; the Perl interpreter simply ignores
commented-out lines.
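A partly expanded program might look like the following sketch, in which only the first step has been written in Perl and the rest is still commented-out pseudocode. The default filename sample.dna and the placeholder count are assumptions made purely so that the fragment runs on its own.

```perl
use strict;
use warnings;

#get the name of DNAfile from the user
# (assumption: taken from the command line, with a made-up default name)
my $dnafile = @ARGV ? $ARGV[0] : 'sample.dna';
print "Will read DNA from $dnafile\n";

#read in the DNA from the DNAfile
#for each regulatory element
#  if element is in DNA, then
#    add one to the count

#print count
my $count = 0;    # placeholder until the steps above are written
print "$count\n";
```

The program already runs, even though most of the work remains to be done, so the finished part can be tested before moving on to the next step.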
You can even leave the complete pseudocode design, commented out,
intact in the program. This leaves an outline of the program's
design that may come in handy when you or someone else tries to read
or modify the code.
We've now reached the point where we're ready for actual
Perl programming. In Chapter 4 you will learn Perl
syntax and begin programming in Perl. As you do, remember the initial
phase of designing your program, followed by the cycle you will spend
most of your time in: editing the program, running the
program, and revising the program.