Safari | Beginning Perl for Bioinformatics -> 3.5 The Programming Process


3.5 The Programming Process

You've been assigned to write a program that counts the regulatory elements in DNA. If you've never programmed, you probably have no idea how to start. Let's talk about what you need to know to write the program.

Here's a summary of the steps we'll cover:

Identify the required inputs, such as data or information given by the user.
Make an overall design for the program, including the general method—the algorithm—by which the program computes the output.
Decide how the output will be presented; for example, written to files or displayed graphically.
Refine the overall design by specifying more detail.
Write the Perl program code.

These steps may be different for shorter or longer programs, but this is the general approach you will take for most of your programming.

3.5.1 The Design Phase

First, you need to conceive a plan for how the program is going to work. This is the overall design of the program and an important step that's usually done before the actual writing of the program begins. Programs are often compared to kitchen recipes, in that they are specific instructions on how to accomplish some task. As part of the plan, you need an idea of what inputs and outputs the program will have. In our example, the input would be the new DNA. You then need a strategy for how the program will do the necessary computing to calculate the desired output from the input.

In our example, the program first needs to collect information from the user: namely, where is the DNA? (This information can be the name of a file that contains the computer representation of the DNA sequence.) The program needs to allow the user to type in the name of a datafile, maybe from the computer screen or from a web page. Then the program has to check if the file exists (and complain if not, as might happen, for instance, if the user misspelled the name) and finally open the file and read in the DNA before continuing.
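The input step described above might be sketched in Perl as follows. This is only an illustration under stated assumptions: the subroutine names prompt_for_filename and read_dna are invented for this sketch, the file is assumed to hold plain sequence text, and the demonstration writes a small throwaway file so that the code runs without anyone typing at the keyboard.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Ask the user to type a filename at the keyboard.
# (Hypothetical helper; not called in the unattended demonstration below.)
sub prompt_for_filename {
    print "Please type the name of the DNA file: ";
    my $filename = <STDIN>;
    die "No filename given\n" unless defined $filename;
    chomp $filename;
    return $filename;
}

# Check that the file exists (and complain if not), then open it
# and return the DNA it contains as one long string.
sub read_dna {
    my ($filename) = @_;
    die "Cannot find file '$filename'\n" unless -e $filename;
    open(my $fh, '<', $filename) or die "Cannot open '$filename': $!\n";
    my $dna = '';
    while (my $line = <$fh>) {
        $line =~ s/\s+//g;    # drop newlines and any other whitespace
        $dna .= $line;
    }
    close $fh;
    return $dna;
}

# Demonstration: create a small made-up DNA file, then read it back.
# An interactive program would call prompt_for_filename() instead.
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
print $tmp_fh "ACGTTTGA\nCCATGG\n";
close $tmp_fh;

my $dna = read_dna($tmp_name);
print "Read ", length($dna), " bases: $dna\n";
```

Keeping the existence check inside read_dna means a misspelled name produces a clear complaint before any computing starts.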

This simple step deserves some comment. You can put the DNA directly into the program code and avoid having to write this whole part of the program. But a program designed to read in the DNA is more useful, because you won't have to rewrite it every time you get some new DNA. It's a simple, even obvious idea, but very powerful.

The data your program uses to compute is called the input. Input can come from files, from other programs, from users running the program, from forms filled out on web sites, from email messages, and so forth. Most programs read in some form of input; some programs don't.

Let's add the list of regulatory elements to the actual program code. You can ask for a file that contains this list, as we did with the DNA, and have the program be capable of searching different lists of regulatory elements. However, in this case, the list you will use isn't going to change, so why bother the user with inputting the name of another file?

Now that we have the DNA and the list of regulatory elements, you have to decide in general terms how the program is actually going to search for each regulatory element in the DNA. This step is obviously the critical one, so make sure you get it right. For instance, if speed is an important consideration, you want the program to run quickly enough.

This is the problem of choosing the correct algorithm for the job. An algorithm is a design for computing the solution to a problem (I'll say more about it in a minute). For instance, you may decide to take each regulatory element in turn and search through the DNA from beginning to end for that element before going on to the next one. Or you may decide to go through the DNA only once, and at each position check each of the regulatory elements to see if it is present. Is there any advantage to one way or the other? Can you sort the list of regulatory elements so your search can proceed more quickly? For now, let's just say that your choice of algorithm is important.
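The two strategies just described can be compared side by side in a short sketch. Everything named here is an assumption made for illustration: the subroutine names count_by_element and count_by_position, the sample DNA string, and the three short motifs, which are placeholders rather than a curated list of regulatory elements. Both versions count every occurrence, including overlapping ones, so they should agree.

```perl
use strict;
use warnings;

# Strategy 1: take each element in turn and scan the whole DNA for it.
sub count_by_element {
    my ($dna, @elements) = @_;
    my %count;
    for my $element (@elements) {
        $count{$element} = 0;
        my $pos = index($dna, $element);
        while ($pos != -1) {
            $count{$element}++;
            $pos = index($dna, $element, $pos + 1);   # resume one base later
        }
    }
    return %count;
}

# Strategy 2: walk through the DNA once and, at each position,
# check whether any of the elements starts there.
sub count_by_position {
    my ($dna, @elements) = @_;
    my %count = map { $_ => 0 } @elements;
    for my $pos (0 .. length($dna) - 1) {
        for my $element (@elements) {
            $count{$element}++
                if substr($dna, $pos, length $element) eq $element;
        }
    }
    return %count;
}

my $dna      = 'GGTATAAACCAATGGGCGGTATAAA';       # made-up sample sequence
my @elements = ('TATAAA', 'CCAAT', 'GGGCGG');     # placeholder motifs

my %first  = count_by_element($dna, @elements);
my %second = count_by_position($dna, @elements);

for my $element (@elements) {
    print "$element: $first{$element} (by element), ",
          "$second{$element} (by position)\n";
}
```

Both sketches do roughly the same amount of comparing on a small input like this; which one is faster on real data depends on the length of the DNA and the number of elements, which is the kind of question the choice of algorithm raises.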

The final part of the design is to provide some form of output for the results. Perhaps you want the results displayed on a web page, as a simple list on the computer screen, in a printable file, or perhaps all of the above. At this stage, you may need to ask the user for a filename to save the output.
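One way to keep the output step flexible is to write the results through a filehandle, so the same code serves the screen and a file. The sketch below assumes a hypothetical subroutine name, write_report, and made-up counts; it uses a temporary file in place of a filename typed by the user so that it runs unattended.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Print one line per regulatory element to whatever filehandle is given.
sub write_report {
    my ($out, %count) = @_;
    for my $element (sort keys %count) {
        printf {$out} "%-10s %d\n", $element, $count{$element};
    }
}

# Made-up results, standing in for the output of the search step.
my %count = (TATAAA => 2, CCAAT => 1, GGGCGG => 0);

# Display the results as a simple list on the screen ...
write_report(\*STDOUT, %count);

# ... and save the same list in a file. A real program might ask the
# user for the output filename here instead of using a temporary file.
my ($fh, $filename) = tempfile(UNLINK => 1);
write_report($fh, %count);
close $fh;
```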

This brings up the problem of how to display results. This question is actually a critically important one. The ideal solution is to display the results in a way that shows the user at a glance the salient features of the computation. You can use graphics, color, maps, little bouncing balls over the unexpected result: there are many options. A program that outputs results that are hard to read is clearly not doing a good job. In fact, output that makes the salient results hard to find or understand can completely negate all the effort you put into writing an elegant program. Enough said for now.

There are several strategies employed by programmers to help create good overall designs. Usually, any program but the smallest is written in several small but interconnecting parts. (We'll see lots of this as we proceed in later chapters.) What will the parts be, and how will they interconnect? The field of software engineering addresses these kinds of issues. At this point I only want to point out that they are very important and mention some of the ways programmers address the need for design.

There are many design methodologies; each has its dedicated adherents. The best approach is to learn what is available and use the best methodology for the job at hand. For instance, in this book I'm teaching a style of programming called imperative programming, relying on dividing a problem into interacting procedures or subroutines (see Chapter 6), known as structured design. Another popular style is called object-oriented programming, which is also supported by Perl.

If you're working in a large group of programmers on a big project, the design phase can be very formal and may even be done by different people than the programmers themselves. On the other end of the scale, you will find solitary programmers who just start writing, developing a plan as they write the code. There is no one best way that works for everyone. But no matter how you approach it, as a beginner you still need to have some sort of design in mind before you start writing code.

3.5.2 Algorithms

An algorithm is the design, or plan, for the computation done by a computer program. (It's actually a tricky term to define, outside of a formal mathematical system, but this is a reasonable definition.) An algorithm is implemented by coding it in a specific computer language, but the algorithm is the idea of the computation. It's often well represented in pseudocode, which gives the idea of a program without actually being a real computer program.

Most programs do simple things. They get filenames from users, open the files, and read in the data. They perform simple calculations and display the results. These are the types of algorithms you'll learn here.

However, the science of algorithms is a deep and fruitful one, with many important implications for bioinformatics. Algorithms can be designed to find new ways of analyzing biological data and of discovering new scientific results. There are certainly many problems in biology whose solutions could be, and will be, substantially advanced by inventing new algorithms.

The science of algorithms includes many clever techniques. As a beginning programmer, you needn't worry about them just yet. At this stage, an introductory chapter in a beginning tutorial on programming, it's not reasonable to go into details about algorithmic methods. Your first task is just to learn how to write in some programming language. But if you keep at it, you'll start to learn the techniques. A decent textbook to keep around as a reference is a good investment for a serious programmer (see Appendix A).

In the current example that counts regulatory elements in DNA, I suggest a way of proceeding. Take each regulatory element in turn, and search through the DNA for it, before proceeding to the next regulatory element. Other algorithms are also possible; in fact, this is one example from the general problem called string matching, which is one of the most important for bioinformatics, and the study of which has resulted in a variety of clever algorithms.

Algorithms are usually grouped by such problems or by technique, and there is a wealth of material available. For the practical programmer, some of the most valuable materials are collections of algorithms written in specific languages, that can be incorporated into your programs. Use Appendix A as a starting place. Using the collections of code and books given there, it's possible to incorporate many algorithmic techniques in your Perl code with relative ease.

3.5.3 Pseudocode and Code

Now you have an overall design, including input, algorithm, and output. How do you actually turn this general idea into a design for a program?

A common implementation strategy is to begin by writing what is called pseudocode. Pseudocode is an informal program, in which details are left out and formal syntax isn't followed.^[2] It doesn't actually run as a program; its purpose is to flesh out an idea of the overall design of a program in a quick and informal way.

^[2] Syntax refers to the rules of grammar. English syntax decrees, "Go to school" not "School go to." Programming languages also have syntax rules.

For example, in an actual Perl program you might write a bit of code called a subroutine (see Chapter 6), in this case, a subroutine that gets an answer from a user typing at the keyboard. Such a subroutine may look like this:

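# Prompt the user and return the line typed at the keyboard.
sub getanswer {
    print "Type in your answer here: ";
    my $answer = <STDIN>;    # read one line from standard input
    chomp $answer;           # remove the trailing newline
    return $answer;
}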

But in pseudocode, you might just say:

getanswer

and worry about the details later.

Here's an example of pseudocode for the program I've been discussing:

get the name of DNAfile from the user

read in the DNA from the DNAfile

for each regulatory element
    if element is in DNA, then
        add one to the count

print count

3.5.4 Comments

Comments are parts of Perl source code that are used as an aid to understanding what the program does. Anything from a # sign to the end of a line is considered a comment and is ignored by the Perl interpreter, unless the # sits inside a quoted string or a pattern. (The exception is the first line of many Perl programs, which looks something like this: #!/usr/bin/perl; see Section 4.2.3 in Chapter 4.)
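A few lines of Perl show where comments can appear. The variable names and values are made up for this sketch; the one point it adds is that a # inside a quoted string is part of the string, not the start of a comment.

```perl
#!/usr/bin/perl
# The line above is the special first line, not an ordinary comment.
use strict;
use warnings;

# A comment can fill a whole line ...
my $dna = 'ACGT';    # ... or follow a statement on the same line.

# A # inside a quoted string is data, so this string keeps its # character.
my $label = 'sequence #1';

print "$label is $dna\n";
```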

Comments are of considerable importance in keeping code useful. They typically include a discussion of the overall purpose and design of the program, examples of how to use the program, and detailed notes interspersed throughout the code explaining why that code is there and what it does. In general, a good programmer writes good comments as an integral part of the program. You'll see comments in all the programming examples in this book.

This is important: your code has to be readable by humans as well as computers.

Comments can also be useful when debugging misbehaving programs. If you're having trouble figuring out where a program is going wrong, you can try to selectively comment out different parts of the code. If you find a section that, when commented out, removes the problem, you can then narrow down the part you've commented out until you have a fairly short section of code in which you know where the problem is. This is often a useful debugging approach.
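As a small illustration of this debugging approach, suppose a program never finds any regulatory element and one step looks suspicious. The sketch below is a made-up case: the lowercase conversion is an invented bug, and commenting it out lets you confirm that the search itself works.

```perl
use strict;
use warnings;

my $dna     = 'GGTATAAACC';    # made-up sample sequence
my $element = 'TATAAA';        # placeholder motif

# Suspect step, commented out while debugging. With this line active,
# the sequence becomes lowercase and the uppercase element never matches.
# $dna = lc $dna;

# 1 if the element occurs somewhere in the DNA, 0 otherwise.
my $found = index($dna, $element) != -1 ? 1 : 0;
print "Element $element found: $found\n";
```

Once the problem disappears with the line commented out, you know where to look, and you can decide whether to fix the line or delete it.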

Comments can be used when you turn pseudocode into Perl source code. Pseudocode is not Perl code, so the Perl interpreter will complain about any pseudocode that is not commented out. You can comment out the pseudocode by placing # signs at the beginning of all pseudocode lines:

#get the name of DNAfile from the user

#read in the DNA from the DNAfile

#for each regulatory element
#    if element is in DNA, then
#          add one to the count

#print count

As you expand your pseudocode design into Perl code, you replace the commented-out pseudocode lines, a few at a time, with real Perl code that has no leading # signs. In this way you may have a mixture of Perl and pseudocode, but you can run and test the Perl parts; the Perl interpreter simply ignores commented-out lines.
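Partway through this process, the program might look something like the following sketch. The search loop and the final print have been written in Perl, while the two input steps are still only pseudocode in comments. The stand-in DNA string and the three motifs are assumptions made so the fragment can run; they are placeholders, not real data.

```perl
use strict;
use warnings;

# get the name of DNAfile from the user

# read in the DNA from the DNAfile
my $dna = 'GGTATAAACCAATGG';    # temporary stand-in until the steps above are coded

# The list of elements is written into the program, as the design calls for.
# These motifs are illustrative placeholders only.
my @elements = ('TATAAA', 'CCAAT', 'GGGCGG');

# for each regulatory element
#     if element is in DNA, then
#         add one to the count
my $count = 0;
for my $element (@elements) {
    if (index($dna, $element) != -1) {
        $count++;
    }
}

# print count
print "Regulatory elements found: $count\n";
```

The pseudocode lines that remain as comments double as an outline of what is finished and what is still to be written.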

You can even leave the complete pseudocode design, commented out, intact in the program. This leaves an outline of the program's design that may come in handy when you or someone else tries to read or modify the code.

We've now reached the point where we're ready for actual Perl programming. In Chapter 4 you will learn Perl syntax and begin programming in Perl. As you do, remember the initial phase of designing your program, followed by the cycle you will spend most of your time in: editing the program, running the program, and revising the program.

