9.2
Restriction Maps and Restriction Enzymes
One of the great discoveries in molecular
biology, which paved the way for the current golden age in biological
research, was the discovery of restriction enzymes. For the
nonbiologist, and to help set up the programming material that
follows, here's a short overview.
9.2.1
Background
Restriction
enzymes are proteins that cut DNA at short,
specific sequences; for example, the popular restriction enzymes
EcoRI and HindIII are widely used in the
lab. EcoRI cuts where it finds GAATTC, between the
G and A. Actually, it cuts both
complementary strands, leaving an overhang on each end. These
"sticky ends" of a few bases in single strands make it
possible for the fragments to re-form, making possible the insertion
of
DNA into vectors for cloning and
sequencing, for instance. HindIII cuts at AAGCTT
and cuts between the As. Some restriction enzymes
cut in the middle and result in "blunt ends" with no
overhang. About 1,000 restriction enzymes are known.
If you look at the reverse complement of the restriction
enzyme EcoRI, you see it's GAATTC, the same
sequence. This is a biological version of a palindrome, a word that
reads the same in reverse. Many restriction sites are palindromes.
Computing restriction
maps is a common and practical bioinformatics calculation in the
laboratory. Restriction maps are computed to plan experiments, to
find the best way to cut DNA to insert a gene, to make a
site-specific mutation, or for several other applications of
recombinant DNA techniques. By computing first, the laboratory
scientist saves considerably on the necessary trial-and-error at the
laboratory bench. Look for more about restriction enzymes at http://www.neb.com/rebase/rebase.html.
We'll now write a program that does something useful in the
lab: it will look for restriction enzymes in a sequence of DNA and
report back with a restriction map of exactly where in the DNA the
restriction enzymes appear.
9.2.2
Planning the Program
Back in Chapter 5, you saw how to look for regular
expressions in text. So you've an idea of how to find motifs in
sequences with Perl. Now let's think about how to use those
techniques to create restriction maps. Here are some questions to
ask:
-
Where do I find restriction enzyme data?
-
Restriction enzyme data can be found at the
Restriction Enzyme
Database, (REBASE), which is on the Web at http://www.neb.com/rebase/rebase.html.
-
How do I represent restriction enzymes in regular expressions?
-
Exploring that site, you'll see that restriction enzymes are
represented in their own language. We'll try to translate that
language into the language of regular expressions.
-
How do I store restriction enzyme data?
-
There are about 1,000
restriction enzymes with names and
definitions. This makes them candidates for the fast key-value type
of lookup hashes provide. When you write a real
application, say for the Web, it's a good idea to create a DBM
file to store the information, ready to use when a program needs a
lookup. I will cover DBM files in Chapter 10; here,
I'll just demonstrate the principle. We'll keep only a
few restriction enzyme definitions in the program.
-
How do I accept queries from the user?
-
You can ask for a restriction enzyme name, or you can
allow the user to type in a regular expression directly. We'll
do the first. Also, you want to let the user specify which sequence
to use. Again, to simplify matters, you'll just read in the
data from a sample DNA file.
-
How do I report back the restriction map to the user?
-
This is an
important question. The simplest way is to generate a list of
positions with the names of the restriction enzymes found there. This
is useful for further processing, as it presents the information very
simply.
But what if you don't want to do further processing; you just
want to communicate the restriction map to the user? Then, perhaps
it'd be more useful to present a graphical display, perhaps
print out the sequence with a line above it that flags the presence
of the enzymes.
There are lots of fancy bells and whistles you can use, but
let's do it the simple way for now and output a list.
So, the plan is to write a program that includes restriction enzyme
data translated into regular expressions, stored as the values of the
keys of the restriction enzyme names. DNA sequence data will be used
from the file, and the user will be prompted for names of restriction
enzymes. The appropriate regular expression will be retrieved from
the hash, and we'll search for all instances of that regular
expression, plus their locations. Finally, the list of locations
found will be returned.
9.2.3
Restriction Enzyme Data
The restriction enzyme data is
available in a variety of formats, as a visit to the REBASE web site
will show you. After looking around, you decide to get the
information from the
bionet file, which has a fairly
simple layout. Here's the header and a few restriction enzymes
from that file:
REBASE version 104 bionet.104
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
REBASE, The Restriction Enzyme Database http://rebase.neb.com
Copyright (c) Dr. Richard J. Roberts, 2001. All rights reserved.
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Rich Roberts Mar 30 2001
AaaI (XmaIII) C^GGCCG
AacI (BamHI) GGATCC
AaeI (BamHI) GGATCC
AagI (ClaI) AT^CGAT
AaqI (ApaLI) GTGCAC
AarI CACCTGCNNNN^
AarI ^NNNNNNNNGCAGGTG
AatI (StuI) AGG^CCT
AatII GACGT^C
AauI (Bsp1407I) T^GTACA
AbaI (BclI) T^GATCA
AbeI (BbvCI) CC^TCAGC
AbeI (BbvCI) GC^TGAGG
AbrI (XhoI) C^TCGAG
AcaI (AsuII) TTCGAA
AcaII (BamHI) GGATCC
AcaIII (MstI) TGCGCA
AcaIV (HaeIII) GGCC
AccI GT^MKAC
AccII (FnuDII) CG^CG
AccIII (BspMII) T^CCGGA
Acc16I (MstI) TGC^GCA
Acc36I (BspMI) ACCTGCNNNN^
Acc36I (BspMI) ^NNNNNNNNGCAGGT
Acc38I (EcoRII) CCWGG
Acc65I (KpnI) G^GTACC
Acc113I (ScaI) AGT^ACT
AccB1I (HgiCI) G^GYRCC
AccB2I (HaeII) RGCGC^Y
AccB7I (PflMI) CCANNNN^NTGG
AccBSI (BsrBI) CCG^CTC
AccBSI (BsrBI) GAG^CGG
AccEBI (BamHI) G^GATCC
AceI (TseI) G^CWGC
AceII (NheI) GCTAG^C
AceIII CAGCTCNNNNNNN^
AceIII ^NNNNNNNNNNNGAGCTG
AciI C^CGC
AciI G^CGG
AclI AA^CGTT
AclNI (SpeI) A^CTAGT
AclWI (BinI) GGATCNNNN^
Your first task is to
read this file and get the names and the
recognition site (or
restriction site) for each
enzyme.To simplify matters for now, simply discard the parenthesized
enzyme names.
How can this data be read?
Discard header lines
For each data line:
remove parenthesized names, for simplicity's sake
get and store the name and the recognition site
Translate the recognition sites to regular expressions
--but keep the recognition site, for printing out results
}
return the names, recognition sites, and the regular expressions
This is high-level undetailed pseudocode, so let's refine and
expand it. (Notice that the curly brace isn't properly matched.
That's okay, because there are no syntax rules for pseudocode;
do whatever works for you!) Here's some pseudocode that
discards the
header lines:
foreach line
if /Rich Roberts/
break out of the foreach loop
}
This is based on the format of the file, in which the string
you're looking for is the last text before the data lines
start. (Of course, if the format of the file should change, this
might no longer work.)
Now let's further expand the pseudocode, thinking how to do the
tasks involved:
# Discard header lines
# This keeps reading lines, up to a line containing "Rich Roberts"
foreach line
if /Rich Roberts/
break out of the foreach loop
}
For each data line:
# Split the two or three (if there's a parenthesized name) fields
@fields = split( " ", $_);
# Get and store the name and the recognition site
$name = shift @fields;
$site = pop @fields;
# Translate the recognition sites to regular expressions
--but keep the recognition site, for printing out results
}
return the names, recognition sites, and the regular expressions
This isn't the translation, but let's look at what
you've done.
First, you want to extract the name and recognition site data from a
string. The most common way to separate words in a line of Perl,
especially if the string is nicely formatted, is with the Perl
built-in function split
.
If you have two or three per
line that have whitespace and are separated from each other by
whitespace, you can get them into an array with the following simple
call to split (which acts on the line as stored in the special
variable @_.:
($name, $site) = split(" ")
The @fields array may have two or three elements
depending on whether there was a parenthesized alternate enzyme
named. But you always want the first and the last elements:
$name = shift@fields;
$site = pop@fields;
You now have the problem of translating the recognition site to a
regular expression.
Looking over the
recognition sites and having read
the documentation on REBASE you found on its web site, you know that
the cut site is represented by the caret (^). This doesn't help
make a regular expression that finds the site in sequence, so you
should remove it (see Exercise 9.6 in the Section 9.4 section).
Also notice that the
bases
given in the recognition sites are not just the bases A, C, G, and T,
but they also use the more extended alphabet presented in Table 4-1. These additional letters include a letter for
every possible group of two, three, or four bases. They're
really like abbreviations for character classes in that respect. Aha!
Let's write a subroutine that substitutes character classes for
these codes, and then we'll have our regular expression.
Of course, REBASE uses them, because a given restriction enzyme might
well match a few different recognition sites.
Example 9-1 is a subroutine that, given a string,
translates these codes into
character classes.
Example 9-1. Translate IUB ambiguity codes to regular expressions
# IUB_to_regexp
#
# A subroutine that, given a sequence with IUB ambiguity codes,
# outputs a translation with IUB codes changed to regular expressions
#
# These are the IUB ambiguity codes
# (Eur. J. Biochem. 150: 1-5, 1985):
# R = G or A
# Y = C or T
# M = A or C
# K = G or T
# S = G or C
# W = A or T
# B = not A (C or G or T)
# D = not C (A or G or T)
# H = not G (A or C or T)
# V = not T (A or C or G)
# N = A or C or G or T
sub IUB_to_regexp {
my($iub) = @_;
my $regular_expression = '';
my %iub2character_class = (
A => 'A',
C => 'C',
G => 'G',
T => 'T',
R => '[GA]',
Y => '[CT]',
M => '[AC]',
K => '[GT]',
S => '[GC]',
W => '[AT]',
B => '[CGT]',
D => '[AGT]',
H => '[ACT]',
V => '[ACG]',
N => '[ACGT]',
);
# Remove the ^ signs from the recognition sites
$iub =~ s/\^//g;
# Translate each character in the iub sequence
for ( my $i = 0 ; $i < length($iub) ; ++$i ) {
$regular_expression
.= $iub2character_class{substr($iub, $i, 1)};
}
return $regular_expression;
}
It seems you're almost ready to write a subroutine to get the
data from the REBASE datafile. But there's one important item
you haven't addressed: what exactly is the data you want to
return?
You plan to return three data items per line of the original REBASE
file: the enzyme name, the recognition site, and the regular
expression. This doesn't fit easily into a hash. You can return
an array that stores these three data items in three consecutive
slots. This can work: to read the data, you'd have to read
groups of three items from the array. It's doable but might
make lookup a little difficult. As you get into more advanced Perl,
you'll find that you can create your own complex data
structures.
Since you've learned about split, maybe
you can have a
hash in which the key is the enzyme name,
and the value is a string with the recognition site and the regular
expression separated by whitespace. Then you can look up the data
fast and just extract the desired values
using split.
Example 9-2 shows this method.
Example 9-2. Subroutine to parse a REBASE datafile
# parseREBASE--Parse REBASE bionet file
#
# A subroutine to return a hash where
# key = restriction enzyme name
# value = whitespace-separated recognition site and regular expression
sub parseREBASE {
my($rebasefile) = @_;
use strict;
use warnings;
use BeginPerlBioinfo; # see Chapter 6 about this module
# Declare variables
my @rebasefile = ( );
my %rebase_hash = ( );
my $name;
my $site;
my $regexp;
# Read in the REBASE file
@rebasefile = get_file_data($rebasefile);
foreach ( @rebasefile ) {
# Discard header lines
( 1 .. /Rich Roberts/ ) and next;
# Discard blank lines
/^\s*$/ and next;
# Split the two (or three if includes parenthesized name) fields
my @fields = split( " ", $_);
# Get and store the name and the recognition site
# Remove parenthesized names, for simplicity's sake,
# by not saving the middle field, if any,
# just the first and last
$name = shift @fields;
$site = pop @fields;
# Translate the recognition sites to regular expressions
$regexp = IUB_to_regexp($site);
# Store the data into the hash
$rebase_hash{$name} = "$site $regexp";
}
# Return the hash containing the reformatted REBASE data
return %rebase_hash;
}
This parseREBASE subroutine does quite a lot. Is
there, however, too much in one subroutine; should it be rewritten?
It's a good question to ask yourself as you're writing
code. In this case, let's leave it as it is. However, in
addition to doing a lot, it also does it in a few new ways, which
we'll look at now.
9.2.4
Logical Operators and the Range Operator
You're using
a
foreach loop to process the lines of the
bionet file stored in the
@rebasefile array.
Within that loop you use a new feature of Perl to skip the header
lines, called the range operator (..), which
is used in this line:
( 1 .. /Rich Roberts/ ) and next;
This has the effect of skipping everything from the first line up to
and including the line with "Rich Roberts," in other
words, the header lines. (Range operators must have at least one of
their endpoints given as a number to work like this.)
The and
function is a logical operator. Logical
operators are available in most programming languages. In Perl
they've become very popular, so although we haven't used
them a great deal in this book, you'll often come across code
that does. In fact, you'll start to see them a bit more as the
book continues.
Logical operators can test if two conditions are both
true, for instance:
if( $string eq 'kinase' and $num == 3) {
...
}
Only if both the conditions are true is the entire
statement true.
Similarly, with logical operators you can test if at least one of the
conditions is true using the
or operator, for
instance:
if( $string eq 'kinase' or $num == 3) {
...
}
Here, the if statement is true
if either or both of the conditionals are true.
There is also the not
logical operator, a negation operator with which you can test if
something is false:
if( not 6 == 9 ) {
...
}
6 == 9
returns false, which is negated by the
not operator, so the entire conditional returns
true.
There are also the closely related operators,
&& for
and, || for
or, and ! for
not. These have slightly different behavior
(actually, different
precedence);
most Perl code uses the versions I've shown, but both are
common.
When in doubt about precedence, you can always parenthesize
expressions to ensure your statement means what you intend it to
mean. (See Section 9.3.1 later in
this chapter.)
Logical operators also have an order of
evaluation, which makes them useful for controlling the flow of
programs. Let's take a look at how the and
operator evaluates its two arguments. It first evaluates the left
argument, and if it's true, evaluates and
returns the right. If the left argument evaluates to
false, the right argument is never touched. So the
and operator can act like a mini
if statement. For instance, the following two
examples are equivalent:
if( $verbose ) {
print $helpful_but_verbose_message;
}
$verbose and print $helpful_but_verbose_message;
Of course, the if statement is more flexible,
because it allows you to easily add more statements to the block, and
elsif and else conditions
to their own blocks. But for simple situations, the
and operator works well.[1]
The logical operator or evaluates and returns
the left argument if it's true; if the left
argument doesn't evaluate to true, the
or operator then evaluates and returns the right
argument. So here's another way to write a one-line statement
that you'll often see in Perl programs:
open(MYFILE, $file) or die "I cannot open file $file: $!";
This is basically equivalent to our frequent:
unless(open(MYFILE, $file)) {
print "I cannot open file $file\n";
exit;
}
Let's go back and take a look at the
parseREBASE
subroutine with the line:
( 1 .. /Rich Roberts/ ) and next;
The left argument is the range 1 .. /Rich
Roberts/. When you're in that range of lines, the
range operator returns a true value. Because
it's true, the and
boolean operator goes on to see if the value on the other side is
true and finds the next
function, which evaluates to true, even as it
takes you back to the "next" iteration of the enclosing
foreach loop. So if you're between the
first line and the Rich Roberts
line, you skip the rest of the loop.
Similarly, the line:
/^\s*$/ and next;
takes you back to the next iteration of the
foreach if the left argument, which matches a
blank line, is true.
The other parts of this parseREBASE subroutine
have
already been discussed, during the
design phase.
9.2.5
Finding the Restriction Sites
So now it's time to write a main program
and see our code in action. Let's start with a little
pseudocode to see what still needs to be done:
#
# Get DNA
#
get_file_data
extract_sequence_from_fasta_data
#
# Get the REBASE data into a hash, from file "bionet"
#
parseREBASE('bionet');
for each user query
If query is defined in the hash
Get positions of query in DNA
Report on positions, if any
}
You now need to write a subroutine that finds the positions of the
query in the DNA. Remember that trick of putting a global search in a
while
loop from Example 5-7 and take heart. No sooner said
than:
Given arguments $query and $dna
while ( $dna =~ /$query/ig ) {
save the position of the match
}
return @positions
When you used this trick before, you just counted how many matches
there were, not what the positions were. Let's check the
documentation for clues, specifically the list of built-in functions
in the documentation. It looks like the
pos function will solve the problem. It
gives the location of the last match of a variable in an
m//g search. Example 9-3 shows
the main program followed by the required subroutine. It's a
simple subroutine, given the Perl functions like
pos that make it
easy.
Example 9-3. Make restriction map from user queries
#!/usr/bin/perl
# Make restriction map from user queries on names of restriction enzymes
use strict;
use warnings;
use BeginPerlBioinfo; # see Chapter 6 about this module
# Declare and initialize variables
my %rebase_hash = ( );
my @file_data = ( );
my $query = '';
my $dna = '';
my $recognition_site = '';
my $regexp = '';
my @locations = ( );
# Read in the file "sample.dna"
@file_data = get_file_data("sample.dna");
# Extract the DNA sequence data from the contents of the file "sample.dna"
$dna = extract_sequence_from_fasta_data(@file_data);
# Get the REBASE data into a hash, from file "bionet"
%rebase_hash = parseREBASE('bionet');
# Prompt user for restriction enzyme names, create restriction map
do {
print "Search for what restriction site for (or quit)?: ";
$query = <STDIN>;
chomp $query;
# Exit if empty query
if ($query =~ /^\s*$/ ) {
exit;
}
# Perform the search in the DNA sequence
if ( exists $rebase_hash{$query} ) {
($recognition_site, $regexp) = split ( " ", $rebase_hash{$query});
# Create the restriction map
@locations = match_positions($regexp, $dna);
# Report the restriction map to the user
if (@locations) {
print "Searching for $query $recognition_site $regexp\n";
print "A restriction site for $query at locations:\n";
print join(" ", @locations), "\n";
} else {
print "A restriction site for $query is not in the DNA:\n";
}
}
print "\n";
} until ( $query =~ /quit/ );
exit;
################################################################################
#
# Subroutine
#
# Find locations of a match of a regular expression in a string
#
#
# return an array of positions where the regular expression
# appears in the string
#
sub match_positions {
my($regexp, $sequence) = @_;
use strict;
use BeginPerlBioinfo; # see Chapter 6 about this module
#
# Declare variables
#
my @positions = ( );
#
# Determine positions of regular expression matches
#
while ( $sequence =~ /$regexp/ig ) {
push ( @positions, pos($sequence) - length($&) + 1);
}
return @positions;
}
Here is some sample output from Example 9-3:
Search for what restriction enzyme (or quit)?: AceI
Searching for AceI G^CWGC GC[AT]GC
A restriction site for AceI at locations:
54 94 582 660 696 702 840 855 957
Search for what restriction enzyme (or quit)?: AccII
Searching for AccII CG^CG CGCG
A restriction site for AccII at locations:
181
Search for what restriction enzyme (or quit)?: AaeI
A restriction site for AaeI is not in the DNA:
Search for what restriction enzyme (or quit)?: quit
Notice the length($&) in the
subroutine
match_positions. That $&
is a special variable that's set after a successful
regular-expression match. It stands for the sequence that matched the
regular expression. Since pos gives the position
of the first base following the match, you have
to subtract the length of the matching sequences, plus one (to make
the bases start at position 1 instead of position 0) to report the
starting position of the match. Other special variables include
$` which contains everything in the string before
the successful match; and
$´,
which contains everything in the string after the successful match.
So, for example: '123456' =~
/34/ succeeds at setting these special
variables like so:
$`=
'12', $&
= '34', and
$´ =
'56'.
What we have here is admittedly bare bones, but it does work. See the
exercises at the end of the chapter for ways to extend this
code.