8.2
Data Structures and Algorithms for Biology
Biologists explore biological data
and try to figure out how to do things with it based on its existing
structure in living systems. Bioinformatics is often used to model
that existing structure as closely as possible. (Bear with me;
I'm speaking in generalities!)
Bioinformatics also can take a slightly different approach. It thinks
about what it wants to do with the data and then tries to figure out
how to organize it to accomplish that goal. In other words, it tries
to produce an algorithm by representing the data in a convenient data
structure.
Now that you've got the three datatypes of Perl in
hand—namely scalars, arrays, and hashes—it's time
to take a look at these interrelated topics of
algorithms and data structures.
We've already talked about algorithms in Chapter 3. The present discussion highlights the
importance of the organization of the data for algorithms, in other
words, the data structures for the algorithm.
The most important point here is that different algorithms often
require different data structures.
8.2.1
A Gene Expression Database
Let's consider a typical problem. Say you're studying
an organism that has a total of about
30,000 genes. (Yep, you're right, it's human.) Say
you're looking at a type of cell that's never been well
characterized under certain interesting environmental conditions, and
you are determining, for each gene, whether it's being
expressed.[1]
You have a
nice microarray facility that has given you the expression
information for that cell. Now, for each gene, you need to look up
whether it's expressed in the cell. You have to put this
look-up capability on your web site, so visitors who read your
results in your upcoming paper can find the expression data for the
genes.
There are several ways to proceed. Let's look at a few
alternatives as a short and gentle introduction to the art and
science of algorithms and data structures.
What is your data? For simplicity, let's say you have the names
for all the genes in the organism and a number for the expressed
genes indicating the level of the expression in your experiment; the
unexpressed genes have the number 0.
8.2.2
Gene Expression Data Using Unsorted Arrays
Now let's suppose you want to know if the genes were
expressed, but not the expression levels, and you want to solve this
programming problem using
arrays. After all, you are
somewhat familiar with arrays by this point. How do you proceed?
You might store in the array only the names of the genes that are
being expressed and discard the other gene names. Say there were
8,000 expressed genes. Then, for any query, the answer requires
looking through the array and comparing the query with each gene in
the array until either you find it or get to the end of the array
without finding it.
That works, but there are problems. Mainly, it's kind of slow.
This isn't bad if you just do it now and then, but if
you've got a lot of people hitting your web site asking
questions about this new expression data, it can be a problem. On
average, a lookup for an expressed gene requires looking through
4,000 gene names. A lookup for an unexpressed gene takes 8,000
comparisons.
Also, if someone asked about a gene missing from your study, you
couldn't respond, since you discarded the unexpressed gene
names. The query gives a negative response, not an error message
saying the gene being searched for isn't part of your
experiment. This might even be a false negative if the query gene
that wasn't part of your study actually is expressed in the
cell type (but you just missed it). You'd prefer it if your
program would report to the user that no gene by that name was
studied.
So you decide to keep all 30,000 genes in the array. (Of course, now
a search will be slower.) But how to distinguish the expressed from
the unexpressed genes? You can load each gene's name into the
array and then append the expression measurement after the name of
each gene. Then you will definitely know if a gene is missing from
your experiment.
However, the program is still a bit slow. You still have to search
through the entire array until you find the gene or determine that it
wasn't studied. You may find it right away if it's the
first element in the array, or you may have to wait until the last
element. On average, you have to search through half of the array.
Plus, you have to compare the name of the searched-for gene with the
names of the genes in the array one by one. It will average 15,000
comparisons per query: slow. (Actually, on a modern computer, not too
horribly slow, really, but I'm making a point. These sorts of
things do add up with a program that runs too slowly.)
Another problem is that you're now keeping two values in one
scalar: the gene name and the expression measurement. To do anything
with this data, you have to also separate the gene name from the
measurement of the expression of the gene.
Despite these drawbacks, this method will work. Now, let's
think about alternatives.
8.2.3
Gene Expression Data Using Sorted Arrays and Binary Search
You might try
storing all the
gene names in alphabetical order in the
array and then use the following search technique. First, look at the
middle element. (You can tell the size of the array, as we've seen,
with the expression scalar
@array). If your gene name is before that middle
element alphabetically, you ignore the second half of the array and
pick the middle element of the remaining half of the array. You
continue, at each step narrowing the search to half the previous
number of elements, until you find a match or discover there is none.
Here it is in pseudocode:
Given a sorted array, and an element:
Until you find the element or discover it's not there,
Pick the midpoint of the array, $array[scalar(@array)/2]
Compare your element with the element at the midpoint
If that matches your element, you're done.
Else, ignore the half of the array that your element is not in
}
To compare two strings alphabetically in Perl, you use the
cmp
operator, which returns 0 if the
two strings are the same, -1 if they are in alphabetical order, and 1
if they are in reverse alphabetical order. For example, the following
returns 0:
'ZZZ' cmp 'ZZZ';
This returns -1:
'AAA' cmp 'ZZZ';
Finally, this returns 1:
'ZZZ' cmp 'AAA';
This algorithm is called a binary
search, and it considerably speeds up the process of
searching in an array, for example, to search 30,000 genes takes only
about 15 times through the loop, maximum. (As compared to 15,000
comparisons, average, for the unsorted array.) Of course, you also
have to sort the list, which might take awhile. If you need to keep
adding elements, you have to either insert them in the right place or
add them to the end and sort the array again. All that inserting or
sorting might slow things down considerably. But if you're just
sorting it once and then doing lots of lookups, a binary search might
be worth doing.
While we're at it, let's look at how to sort an array. Here's
how to
sort an array of strings alphabetically:
@array = sort @array;
Here's how to
sort an array of numbers in ascending
order:
@array = sort { $a <=> $b } @array;
Many other kinds of sorting can be done, but these are the most
common. For more details, see the Perl documentation for the
sort function.
8.2.4
Gene Expression Data Using Hashes
You can also use hashes to find a gene in your data. To do so,
you
can load the hash so that the keys are the gene names and the values
are the expression measurement. Then a single call on the hash, with
the name of the desired gene as a key, returns the results of the
experiment for that gene, and you've got your answer. This
process is also cleaner than storing the gene name and the expression
result in one scalar string; here the key is a scalar, and the value
is a separate scalar.
Furthermore, due to how hashes are made, you get an answer back very
quickly, because decent hashes don't have to search hard to
find the value of a key. Using hashes is typically faster than binary
searches. Plus, you'd know if the gene being searched for was
in the data, because you can explicitly ask if a hash value is
defined by saying something like:
if( defined $myhash{'mykey'} ) { ... }
Also, you'll get an error message if you have warnings turned
on, and you refer to an undefined value.
Another advantage of hashes over binary searching is that you can add
or subtract elements to hashes without resorting the entire array.
Finally, because hashes are built into Perl as a basic datatype, they
are easy to use, and you won't have to do much programming to
accomplish your goal. It is usually the case that it's more
important to save time writing a program then it is to save time
running it. I mention this in Chapter 3, but
it's worth emphasizing. To a programmer, the lazy way is often
the most efficient way: let the machine do the work!
Don't get the idea that
hashes are always the right way to go,
however. For instance, they don't store their elements in a
sorted order, so if you need to look at the data that way, you have
to explicitly sort it, like so:
@sorted_keys = sort keys %my_hash;
This is do-able, but it can be a bit slow on a large array. (You
could also sort the values, of course.)
To conclude the discussion of
data
structures for our expression data example, here's an informal
survey of the properties of some different data structures in Perl
for searching, adding and deleting, and maintaining sorted order in a
set of gene names:
-
Use a hash if you just need to see if
something is in a set and don't need to list the set in order.
-
A
sorted array combined with a binary
search algorithm will do if you need an ordered set and pretty fast
lookup and don't need to add or subtract elements very often.
-
An array, in conjunction with the Perl
functions push and pop,
works well if you don't need to sort the elements but do need
to quickly get at the most recently added element.
-
A Perl
array with
the functions push and
shift will serve if you don't need the
elements sorted but need to add elements. It's especially
useful to always remove the "oldest" element (the element
that has been in the array the longest).
For more information, see Appendix A and especially
Mastering Algorithms with
Perl (published by O'Reilly).
8.2.5
Relational Databases
Databases are
programs
that store and retrieve large amounts of data. They provide the most
common forms of datatypes to use in algorithms. There are several
popular databases. Some good ones that are free of charge (the best
ones are very expensive), and Perl provides access to all the most
popular ones. The Perl/DBI modules, for instance, provide convenient
access to relational databases from Perl programs.
Most databases are called relational, which
describes how they store data. Another common name for these types of
databases is relational database management
systems, or RDMS.
Relational databases store data organized in tables. The data is
usually entered and extracted with a query language called
Structured Query
Language
, or SQL, which is
a fairly simple language for accessing the data in the tables and
following links between the tables.
Relational databases are the most popular way to store and retrieve
large amounts of data, but they do require a fair bit of learning.
Programming with relational databases is beyond the scope of this
book, but if you end up doing a lot of programming with Perl,
you'll find that knowing the basics of using a database is a
valuable skill. See the discussion in Chapter 13.
In particular, it's perfectly reasonable to store your gene
expression data in a relational database and use that in your program
to respond to queries made on your web site.
8.2.6
DBM
Perl has a simple, built-in way
to store
hash data, called
database management (DBM). It's simple
to use: after starting up, it "ties" a hash to a file on
your computer disk, so you can save a hash to reuse at a later date.
This is, in effect, a simple (and very useful) database. Apart from
the initialization, you use it as you would any other hash. You can
store your genes and expression data in a DBM file and then use it as
a hash. There's more on DBM in Chapter 10