3.4 Programming Strategies
To give you, the beginning programmer, an idea of how programming is
done, let's see how an experienced programmer goes about solving
problems by looking at a couple of instructive case studies.
Imagine that you want to count all the regulatory elements[1] in a large
chunk of DNA that you just got from the sequencing lab. You're
a professional bioinformatics programmer. What do you do? There are
two possible solutions: find a program or write one yourself.
It's likely there is already a perfectly good, working, and
maybe even free program that does the job. Very often,
you can find exactly what you need on the Web and avoid the
cost of reinventing the wheel. This is programming at its
best: minimal work for maximal effect. It's the classic
case of the experimentalist's adage: a day in the library can
save you six months in the lab.
An important part of the art of programming is to stay aware of the
collections of programs that are available. Then you can simply use
the code if it does exactly what you need, or you can take an
existing program and alter it to suit your own needs. Of course,
copyright laws must be observed, but much is available at no cost,
especially to educational and nonprofit organizations. Most Perl
module code is copyrighted, but you are allowed to use and modify
it under certain restrictions. Details are available at the Perl web
site and with the particular modules.
How do you find this wonderful, free, and already existing program?
The Perl community has an organized collection of such programming
code at the Comprehensive Perl Archive Network (CPAN) web site,
http://www.CPAN.org. Try exploring: you'll find it's organized by
topic, so it's possible to quickly find, for example, web, statistics, or
graphics programs. In our case, you will find the Bioperl module,
which includes several useful bioinformatics functions. A
module is a collection of Perl code that can
be easily loaded and used by your Perl programs.
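To make the idea of loading a module concrete, here is a minimal sketch. It uses List::Util, a module that ships with Perl, as a stand-in for Bioperl so that it runs without installing anything. The fragment lengths are made-up values for illustration only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Load a module and import two of its functions.
# List::Util ships with Perl, so nothing needs to be installed.
use List::Util qw(sum max);

# Hypothetical DNA fragment lengths, invented for this example.
my @fragment_lengths = (120, 45, 300, 87);

# The imported functions can now be called like built-in ones.
my $total   = sum(@fragment_lengths);
my $longest = max(@fragment_lengths);

print "Total bases: $total\n";
print "Longest fragment: $longest\n";
```

The single `use` line is all it takes to make someone else's code available to your program; a module from CPAN is loaded the same way once it has been installed.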
The most useful kinds of code are convenient libraries or modules
that package a suite of functions. These packages offer a great deal
of flexibility in creating new programs. Although you still have to
program, the job may be only a small fraction of the work of writing
the whole program from scratch. For instance, to continue our example
of looking for regulatory elements, your search may turn up a
convenient module that lists the regulatory elements plus code that
takes a list of elements and searches for them in a DNA library. Then
all you have to do is combine the existing code, provide the DNA
library, and with a little bit of programming, you're done.
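The "little bit of programming" that combines a list of elements with a search might look something like the following sketch. It is only an illustration under stated assumptions: the three motifs, the DNA string, and the subroutine name count_element are all hypothetical, and a real list of regulatory elements would come from a module or database rather than being typed in by hand.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical list of regulatory element motifs, for illustration only.
my @elements = ('TATAAA', 'CCAAT', 'GGGCGG');

# Hypothetical stretch of DNA, standing in for the sequence from the lab.
my $dna = 'GGCCAATCTTATAAAGGCGGGCGGTTCCAATAGTATAAACC';

# Count how many times one element occurs in a sequence.
# The lookahead (?=...) matches without consuming characters,
# so overlapping occurrences are counted as well.
sub count_element {
    my ($element, $sequence) = @_;
    my $count = 0;
    $count++ while $sequence =~ /(?=\Q$element\E)/g;
    return $count;
}

# Report the count for each element, then the grand total.
my $total = 0;
foreach my $element (@elements) {
    my $count = count_element($element, $dna);
    print "$element: $count\n";
    $total += $count;
}
print "Total regulatory elements found: $total\n";
```

In practice the list of elements and the search routine would each come from existing code, and your own contribution would be little more than the loop at the end.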
There are lots of other places to look for already existing code. You
can search the Internet with your favorite search engines. You can
browse collections of links for bioinformatics, looking for programs.
You can also search the other sources we've already covered,
such as newsgroups, relevant experts, etc.
If you haven't hit paydirt yet, and you know that the program
will take a significant amount of time to write yourself, you may
want to search the literature in the library, and perhaps enlist the
aid of a librarian. You can search Medline for articles about
regulatory elements, since an article will often advertise code (an
actual program in a language like Perl) that the authors will
send on request. You can consult conference proceedings, books, and journals.
Conferences and trade shows are also great places to look around,
meet people, and ask questions.
In many cases you will succeed and, despite the effort involved, will have
saved yourself and your laboratory days, weeks, or months of work.
However, one big warning about modifying existing code: depending on
how much alteration is required, it can sometimes be more difficult
to modify existing code than to write a whole program from scratch.
Why? Well, depending on who wrote the program, it may be difficult
just to see what the different parts of the code do. You can't
make modifications if you can't understand what methods the
program uses in the first place. (We'll talk more about writing
readable code, and the importance of comments in code, later.) This
factor alone accounts for a large part of the expense of programming;
many programs can't be easily read or understood, so they
can't be maintained. Also, testing the program may be difficult
for various reasons, and it may take a lot of time and effort to
assure yourself that your modifications are working correctly.
Okay, let's say that you spent three days looking for an
existing program, and there really wasn't anything available.
(Well, there was one program, but it cost $30,000, which is way
outside your budget, and your local programming expert was too busy
to write one for you.) So you absolutely have to write the program
yourself.
How do you start from scratch and come up with a program that counts
the regulatory elements in some DNA? Read on.