< BACKCONTINUE >

9.2 Restriction Maps and Restriction Enzymes

One of the great discoveries in molecular biology, which paved the way for the current golden age in biological research, was the discovery of restriction enzymes. For the nonbiologist, and to help set up the programming material that follows, here's a short overview.

9.2.1 Background

Restriction enzymes are proteins that cut DNA at short, specific sequences; for example, the popular restriction enzymes EcoRI and HindIII are widely used in the lab. EcoRI cuts where it finds GAATTC, between the G and A. Actually, it cuts both complementary strands, leaving an overhang on each end. These "sticky ends" of a few bases in single strands make it possible for the fragments to re-form, making possible the insertion of DNA into vectors for cloning and sequencing, for instance. HindIII cuts at AAGCTT and cuts between the As. Some restriction enzymes cut in the middle and result in "blunt ends" with no overhang. About 1,000 restriction enzymes are known.

If you look at the reverse complement of the restriction enzyme EcoRI, you see it's GAATTC, the same sequence. This is a biological version of a palindrome, a word that reads the same in reverse. Many restriction sites are palindromes.

Computing restriction maps is a common and practical bioinformatics calculation in the laboratory. Restriction maps are computed to plan experiments, to find the best way to cut DNA to insert a gene, to make a site-specific mutation, or for several other applications of recombinant DNA techniques. By computing first, the laboratory scientist saves considerably on the necessary trial-and-error at the laboratory bench. Look for more about restriction enzymes at http://www.neb.com/rebase/rebase.html.

We'll now write a program that does something useful in the lab: it will look for restriction enzymes in a sequence of DNA and report back with a restriction map of exactly where in the DNA the restriction enzymes appear.

9.2.2 Planning the Program

Back in Chapter 5, you saw how to look for regular expressions in text. So you've an idea of how to find motifs in sequences with Perl. Now let's think about how to use those techniques to create restriction maps. Here are some questions to ask:

Where do I find restriction enzyme data?

Restriction enzyme data can be found at the Restriction Enzyme Database, (REBASE), which is on the Web at http://www.neb.com/rebase/rebase.html.

How do I represent restriction enzymes in regular expressions?

Exploring that site, you'll see that restriction enzymes are represented in their own language. We'll try to translate that language into the language of regular expressions.

How do I store restriction enzyme data?

There are about 1,000 restriction enzymes with names and definitions. This makes them candidates for the fast key-value type of lookup hashes provide. When you write a real application, say for the Web, it's a good idea to create a DBM file to store the information, ready to use when a program needs a lookup. I will cover DBM files in Chapter 10; here, I'll just demonstrate the principle. We'll keep only a few restriction enzyme definitions in the program.

How do I accept queries from the user?

You can ask for a restriction enzyme name, or you can allow the user to type in a regular expression directly. We'll do the first. Also, you want to let the user specify which sequence to use. Again, to simplify matters, you'll just read in the data from a sample DNA file.

How do I report back the restriction map to the user?

This is an important question. The simplest way is to generate a list of positions with the names of the restriction enzymes found there. This is useful for further processing, as it presents the information very simply.

But what if you don't want to do further processing; you just want to communicate the restriction map to the user? Then, perhaps it'd be more useful to present a graphical display, perhaps print out the sequence with a line above it that flags the presence of the enzymes.

There are lots of fancy bells and whistles you can use, but let's do it the simple way for now and output a list.

So, the plan is to write a program that includes restriction enzyme data translated into regular expressions, stored as the values of the keys of the restriction enzyme names. DNA sequence data will be used from the file, and the user will be prompted for names of restriction enzymes. The appropriate regular expression will be retrieved from the hash, and we'll search for all instances of that regular expression, plus their locations. Finally, the list of locations found will be returned.

9.2.3 Restriction Enzyme Data

The restriction enzyme data is available in a variety of formats, as a visit to the REBASE web site will show you. After looking around, you decide to get the information from the bionet file, which has a fairly simple layout. Here's the header and a few restriction enzymes from that file:

REBASE version 104                                              bionet.104
 
    =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
    REBASE, The Restriction Enzyme Database   http://rebase.neb.com
    Copyright (c)  Dr. Richard J. Roberts, 2001.   All rights reserved.
    =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
 
Rich Roberts                                                    Mar 30 2001
 
AaaI (XmaIII)                     C^GGCCG
AacI (BamHI)                      GGATCC
AaeI (BamHI)                      GGATCC
AagI (ClaI)                       AT^CGAT
AaqI (ApaLI)                      GTGCAC
AarI                              CACCTGCNNNN^
AarI                              ^NNNNNNNNGCAGGTG
AatI (StuI)                       AGG^CCT
AatII                             GACGT^C
AauI (Bsp1407I)                   T^GTACA
AbaI (BclI)                       T^GATCA
AbeI (BbvCI)                      CC^TCAGC
AbeI (BbvCI)                      GC^TGAGG
AbrI (XhoI)                       C^TCGAG
AcaI (AsuII)                      TTCGAA
AcaII (BamHI)                     GGATCC
AcaIII (MstI)                     TGCGCA
AcaIV (HaeIII)                    GGCC
AccI                              GT^MKAC
AccII (FnuDII)                    CG^CG
AccIII (BspMII)                   T^CCGGA
Acc16I (MstI)                     TGC^GCA
Acc36I (BspMI)                    ACCTGCNNNN^
Acc36I (BspMI)                    ^NNNNNNNNGCAGGT
Acc38I (EcoRII)                   CCWGG
Acc65I (KpnI)                     G^GTACC
Acc113I (ScaI)                    AGT^ACT
AccB1I (HgiCI)                    G^GYRCC
AccB2I (HaeII)                    RGCGC^Y
AccB7I (PflMI)                    CCANNNN^NTGG
AccBSI (BsrBI)                    CCG^CTC
AccBSI (BsrBI)                    GAG^CGG
AccEBI (BamHI)                    G^GATCC
AceI (TseI)                       G^CWGC
AceII (NheI)                      GCTAG^C
AceIII                            CAGCTCNNNNNNN^
AceIII                            ^NNNNNNNNNNNGAGCTG
AciI                              C^CGC
AciI                              G^CGG
AclI                              AA^CGTT
AclNI (SpeI)                      A^CTAGT
AclWI (BinI)                      GGATCNNNN^

Your first task is to read this file and get the names and the recognition site (or restriction site) for each enzyme.To simplify matters for now, simply discard the parenthesized enzyme names.

How can this data be read?

Discard header lines 

For each data line:

    remove parenthesized names, for simplicity's sake

    get and store the name and the recognition site

    Translate the recognition sites to regular expressions
        --but keep the recognition site, for printing out results
}

return the names, recognition sites, and the regular expressions

This is high-level undetailed pseudocode, so let's refine and expand it. (Notice that the curly brace isn't properly matched. That's okay, because there are no syntax rules for pseudocode; do whatever works for you!) Here's some pseudocode that discards the header lines:

foreach line 

    if /Rich Roberts/ 
            
        break out of the foreach loop

}

This is based on the format of the file, in which the string you're looking for is the last text before the data lines start. (Of course, if the format of the file should change, this might no longer work.)

Now let's further expand the pseudocode, thinking how to do the tasks involved:

# Discard header lines
# This keeps reading lines, up to a line containing "Rich Roberts"
foreach line 
    if /Rich Roberts/ 
        break out of the foreach loop
}

For each data line:

    # Split the two or three (if there's a parenthesized name) fields
    @fields = split( " ", $_);

    # Get and store the name and the recognition site
    $name = shift @fields;

    $site = pop @fields;

    # Translate the recognition sites to regular expressions
        --but keep the recognition site, for printing out results
}

return the names, recognition sites, and the regular expressions

This isn't the translation, but let's look at what you've done.

First, you want to extract the name and recognition site data from a string. The most common way to separate words in a line of Perl, especially if the string is nicely formatted, is with the Perl built-in function split .

If you have two or three per line that have whitespace and are separated from each other by whitespace, you can get them into an array with the following simple call to split (which acts on the line as stored in the special variable @_.:

($name, $site) = split(" ")

The @fields array may have two or three elements depending on whether there was a parenthesized alternate enzyme named. But you always want the first and the last elements:

$name = shift@fields;
$site = pop@fields;

You now have the problem of translating the recognition site to a regular expression.

Looking over the recognition sites and having read the documentation on REBASE you found on its web site, you know that the cut site is represented by the caret (^). This doesn't help make a regular expression that finds the site in sequence, so you should remove it (see Exercise 9.6 in the Section 9.4 section).

Also notice that the bases given in the recognition sites are not just the bases A, C, G, and T, but they also use the more extended alphabet presented in Table 4-1. These additional letters include a letter for every possible group of two, three, or four bases. They're really like abbreviations for character classes in that respect. Aha! Let's write a subroutine that substitutes character classes for these codes, and then we'll have our regular expression.

Of course, REBASE uses them, because a given restriction enzyme might well match a few different recognition sites.

Example 9-1 is a subroutine that, given a string, translates these codes into character classes.

Example 9-1. Translate IUB ambiguity codes to regular expressions
# IUB_to_regexp
#
# A subroutine that, given a sequence with IUB ambiguity codes,
# outputs a translation with IUB codes changed to regular expressions
#
# These are the IUB ambiguity codes
# (Eur. J. Biochem. 150: 1-5, 1985):
# R = G or A
# Y = C or T
# M = A or C
# K = G or T
# S = G or C
# W = A or T
# B = not A (C or G or T)
# D = not C (A or G or T)
# H = not G (A or C or T)
# V = not T (A or C or G)
# N = A or C or G or T 

sub IUB_to_regexp {

    my($iub) = @_;

    my $regular_expression = '';

    my %iub2character_class = (
    
        A => 'A',
        C => 'C',
        G => 'G',
        T => 'T',
        R => '[GA]',
        Y => '[CT]',
        M => '[AC]',
        K => '[GT]',
        S => '[GC]',
        W => '[AT]',
        B => '[CGT]',
        D => '[AGT]',
        H => '[ACT]',
        V => '[ACG]',
        N => '[ACGT]',
    );

    # Remove the ^ signs from the recognition sites
    $iub =~ s/\^//g;

    # Translate each character in the iub sequence
    for ( my $i = 0 ; $i < length($iub) ; ++$i ) {
        $regular_expression
          .= $iub2character_class{substr($iub, $i, 1)};
    }

    return $regular_expression;
}

It seems you're almost ready to write a subroutine to get the data from the REBASE datafile. But there's one important item you haven't addressed: what exactly is the data you want to return?

You plan to return three data items per line of the original REBASE file: the enzyme name, the recognition site, and the regular expression. This doesn't fit easily into a hash. You can return an array that stores these three data items in three consecutive slots. This can work: to read the data, you'd have to read groups of three items from the array. It's doable but might make lookup a little difficult. As you get into more advanced Perl, you'll find that you can create your own complex data structures.

Since you've learned about split, maybe you can have a hash in which the key is the enzyme name, and the value is a string with the recognition site and the regular expression separated by whitespace. Then you can look up the data fast and just extract the desired values using split. Example 9-2 shows this method.

Example 9-2. Subroutine to parse a REBASE datafile
# parseREBASE--Parse REBASE bionet file
#
# A subroutine to return a hash where
#    key   = restriction enzyme name
#    value = whitespace-separated recognition site and regular expression

sub parseREBASE {

    my($rebasefile) = @_;

    use strict;
    use warnings;
    use BeginPerlBioinfo;     # see Chapter 6 about this module

    # Declare variables
    my @rebasefile = (  );
    my %rebase_hash = (  );
    my $name;
    my $site;
    my $regexp;

    # Read in the REBASE file
    @rebasefile = get_file_data($rebasefile);

    foreach ( @rebasefile ) {

        # Discard header lines
        ( 1 .. /Rich Roberts/ ) and next;

        # Discard blank lines
        /^\s*$/ and next;
    
        # Split the two (or three if includes parenthesized name) fields
        my @fields = split( " ", $_);

        # Get and store the name and the recognition site

        # Remove parenthesized names, for simplicity's sake,
        # by not saving the middle field, if any,
        # just the first and last
        $name = shift @fields;

        $site = pop @fields;

        # Translate the recognition sites to regular expressions
        $regexp = IUB_to_regexp($site);

        # Store the data into the hash
        $rebase_hash{$name} = "$site $regexp";
    }

    # Return the hash containing the reformatted REBASE data
    return %rebase_hash;
}

This parseREBASE subroutine does quite a lot. Is there, however, too much in one subroutine; should it be rewritten? It's a good question to ask yourself as you're writing code. In this case, let's leave it as it is. However, in addition to doing a lot, it also does it in a few new ways, which we'll look at now.

9.2.4 Logical Operators and the Range Operator

You're using a foreach loop to process the lines of the bionet file stored in the @rebasefile array.

Within that loop you use a new feature of Perl to skip the header lines, called the range operator (..), which is used in this line:

( 1 .. /Rich Roberts/ ) and next;

This has the effect of skipping everything from the first line up to and including the line with "Rich Roberts," in other words, the header lines. (Range operators must have at least one of their endpoints given as a number to work like this.)

The and function is a logical operator. Logical operators are available in most programming languages. In Perl they've become very popular, so although we haven't used them a great deal in this book, you'll often come across code that does. In fact, you'll start to see them a bit more as the book continues.

Logical operators can test if two conditions are both true, for instance:

if( $string eq 'kinase'  and   $num == 3) {
  ... 
}

Only if both the conditions are true is the entire statement true.

Similarly, with logical operators you can test if at least one of the conditions is true using the or operator, for instance:

if( $string eq 'kinase'  or   $num == 3) {
  ... 
}

Here, the if statement is true if either or both of the conditionals are true.

There is also the not logical operator, a negation operator with which you can test if something is false:

if( not  6 == 9 ) {
  ...
}

6 == 9 returns false, which is negated by the not operator, so the entire conditional returns true.

There are also the closely related operators, && for and, || for or, and ! for not. These have slightly different behavior (actually, different precedence); most Perl code uses the versions I've shown, but both are common.

When in doubt about precedence, you can always parenthesize expressions to ensure your statement means what you intend it to mean. (See Section 9.3.1 later in this chapter.)

Logical operators also have an order of evaluation, which makes them useful for controlling the flow of programs. Let's take a look at how the and operator evaluates its two arguments. It first evaluates the left argument, and if it's true, evaluates and returns the right. If the left argument evaluates to false, the right argument is never touched. So the and operator can act like a mini if statement. For instance, the following two examples are equivalent:

if( $verbose ) {
    print $helpful_but_verbose_message;
}

$verbose and print $helpful_but_verbose_message;

Of course, the if statement is more flexible, because it allows you to easily add more statements to the block, and elsif and else conditions to their own blocks. But for simple situations, the and operator works well.[1]

[1] You can even chain logical operators one after the other to build up more complicated expressions and use parentheses to group them. Personally, I don't like that style much, but in Perl, there's more than one way to do it!

The logical operator or evaluates and returns the left argument if it's true; if the left argument doesn't evaluate to true, the or operator then evaluates and returns the right argument. So here's another way to write a one-line statement that you'll often see in Perl programs:

open(MYFILE, $file) or die "I cannot open file $file: $!";

This is basically equivalent to our frequent:

unless(open(MYFILE, $file)) {
    print "I cannot open file $file\n";
    exit;
}

Let's go back and take a look at the parseREBASE subroutine with the line:

( 1 .. /Rich Roberts/ ) and next;

The left argument is the range 1 .. /Rich Roberts/. When you're in that range of lines, the range operator returns a true value. Because it's true, the and boolean operator goes on to see if the value on the other side is true and finds the next function, which evaluates to true, even as it takes you back to the "next" iteration of the enclosing foreach loop. So if you're between the first line and the Rich Roberts line, you skip the rest of the loop.

Similarly, the line:

/^\s*$/ and next;

takes you back to the next iteration of the foreach if the left argument, which matches a blank line, is true.

The other parts of this parseREBASE subroutine have already been discussed, during the design phase.

9.2.5 Finding the Restriction Sites

So now it's time to write a main program and see our code in action. Let's start with a little pseudocode to see what still needs to be done:

#
# Get DNA
#
get_file_data

extract_sequence_from_fasta_data

#
# Get the REBASE data into a hash, from file "bionet"
#
parseREBASE('bionet');

for each user query

    If query is defined in the hash
        Get positions of query in DNA

    Report on positions, if any

}

You now need to write a subroutine that finds the positions of the query in the DNA. Remember that trick of putting a global search in a while loop from Example 5-7 and take heart. No sooner said than:

Given arguments $query and $dna

while ( $dna =~ /$query/ig ) {
    save the position of the match
}

return @positions

When you used this trick before, you just counted how many matches there were, not what the positions were. Let's check the documentation for clues, specifically the list of built-in functions in the documentation. It looks like the pos function will solve the problem. It gives the location of the last match of a variable in an m//g search. Example 9-3 shows the main program followed by the required subroutine. It's a simple subroutine, given the Perl functions like pos that make it easy.

Example 9-3. Make restriction map from user queries
#!/usr/bin/perl
# Make restriction map from user queries on names of restriction enzymes

use strict;
use warnings;
use BeginPerlBioinfo;     # see Chapter 6 about this module

# Declare and initialize variables
my %rebase_hash = (  );
my @file_data = (  );
my $query = '';
my $dna = '';
my $recognition_site = '';
my $regexp = '';
my @locations = (  );

# Read in the file "sample.dna"
@file_data = get_file_data("sample.dna");

# Extract the DNA sequence data from the contents of the file "sample.dna"
$dna = extract_sequence_from_fasta_data(@file_data);

# Get the REBASE data into a hash, from file "bionet"
%rebase_hash = parseREBASE('bionet');

# Prompt user for restriction enzyme names, create restriction map
do {
    print "Search for what restriction site for (or quit)?: ";
    
    $query = <STDIN>;

    chomp $query;

    # Exit if empty query
    if ($query =~ /^\s*$/ ) {

        exit;
    }

    # Perform the search in the DNA sequence
    if ( exists $rebase_hash{$query} ) {

        ($recognition_site, $regexp) = split ( " ", $rebase_hash{$query});

        # Create the restriction map
        @locations = match_positions($regexp, $dna);

        # Report the restriction map to the user
        if (@locations) {
            print "Searching for $query $recognition_site $regexp\n";
            print "A restriction site for $query at locations:\n";
            print join(" ", @locations), "\n";
        } else {
            print "A restriction site for $query is not in the DNA:\n";
        }
    }
    print "\n";
} until ( $query =~ /quit/ );

exit;

################################################################################
#
# Subroutine
#
# Find locations of a match of a regular expression in a string
#
# 
# return an array of positions where the regular expression
#  appears in the string
#

sub match_positions {

    my($regexp, $sequence) = @_;

    use strict;

    use BeginPerlBioinfo;     # see Chapter 6 about this module

    #
    # Declare variables
    #

    my @positions = (  );

    #
    # Determine positions of regular expression matches
    #
    
    while ( $sequence =~ /$regexp/ig ) {

        push ( @positions, pos($sequence) - length($&) + 1);
    }

    return @positions;
}

Here is some sample output from Example 9-3:

Search for what restriction enzyme (or quit)?: AceI
Searching for AceI G^CWGC GC[AT]GC
A restriction site for AceI at locations:
54 94 582 660 696 702 840 855 957

Search for what restriction enzyme (or quit)?: AccII
Searching for AccII CG^CG CGCG
A restriction site for AccII at locations:
181

Search for what restriction enzyme (or quit)?: AaeI
A restriction site for AaeI is not in the DNA:

Search for what restriction enzyme (or quit)?: quit

Notice the length($&) in the subroutine match_positions. That $& is a special variable that's set after a successful regular-expression match. It stands for the sequence that matched the regular expression. Since pos gives the position of the first base following the match, you have to subtract the length of the matching sequences, plus one (to make the bases start at position 1 instead of position 0) to report the starting position of the match. Other special variables include $` which contains everything in the string before the successful match; and $´, which contains everything in the string after the successful match. So, for example: '123456' =~ /34/ succeeds at setting these special variables like so: $`= '12', $& = '34', and = '56'.

What we have here is admittedly bare bones, but it does work. See the exercises at the end of the chapter for ways to extend this code.

< BACKCONTINUE >

Index terms contained in this section

! (bang)
      ! (logical negation) operator
$ (dollar sign)
      $& variables
      $' variables 2nd
      $` variables
& (ampersand)
      && (logical and) operator
. (dot)
      .. (range) operator
^ (caret)
      restriction enzyme cut site (REBASE)
| (vertical bar)
      || (logical or) operator
and operator
      logical and 2nd
bases
      restriction enzymes, recognition sites
bionet file, restriction enzymes
chaining, logical operators
character classes
      IUB ambiguity codes, translating to
characters
      letters representing nucleic acids
DBM (database management)
      restriction enzymes, storing in
DNA
      cutting to insert a gene
EcoRI restriction enzyme
genes
      cutting DNA to insert
global searches in loops
hashes
      recognition enzymes
      of restriction enzymes
headers
      discarding from restriction enzyme files
HindIII restriction enzyme
IUB (nucleic acid) ambiguity codes, translating to regular expressions
letters representing nucleic acids
logical operators
      range operator, use with
            precedence
match_positions subroutine (example)
not operator
      logical negation
operators
     logical
            range operator, using with
or operator
      logical or
order of evaluation
parsing
      REBASE files 2nd
patterns (and regular expressions)
      IUB ambiguity codes, translating to
      restriction enzymes, representing in
      translating restriction enzymes into
pos function
precedence
      logical operators
range operator (..), using with logical operators
reading
      bionet file for restriction sites
      restriction enzyme data
REBASE (Restriction Enzyme Database)
      data file, subroutine for parsing
recognition sites
      codes for
      IUB nucleic acid codes, translating to regular expressions
Restriction Enzyme Database (REBASE)
      data file, subroutine for parsing
restriction enzymes 2nd
      data, bionet file format
      reading data from file
      storing data
      user requests for names
      web site information on
restriction maps
      computing
      recognition sites (IUB nucleic acid codes), translating to regular expressions
      reporting to user
     restriction sites
            finding
            reading bionet file for
reverse complements
      restriction enzyme EcoRI
split function
      hash values, using on
      strings, extracting data from
strings
      split function, extracting data with
subroutines
      IUB ambiguity codes, translating to regular expressions
      match_positions (example)
      parsing REBASE data file
      parsing REBASE files
user queries on restriction enzymes, making restriction map from
variables
      $&, $`, and $'
      split function, separating with
web sites
      restriction enzymes
while loops
      global searches in

© 2002, O'Reilly & Associates, Inc.