Safari | Beginning Perl for Bioinformatics -> 4.3 Concatenating DNA Fragments

Beginning Perl for Bioinformatics
	Copyright
	Table of Contents
	Preface
	1. Biology and Computer Science
	2. Getting Started with Perl
	3. The Art of Programming
	4. Sequences and Strings
		4.1 Representing Sequence Data
		4.2 A Program to Store a DNA Sequence
		4.3 Concatenating DNA Fragments
		4.4 Transcription: DNA to RNA
		4.5 Using the Perl Documentation
		4.6 Calculating the Reverse Complement in Perl
		4.7 Proteins, Files, and Arrays
		4.8 Reading Proteins in Files
		4.9 Arrays
		4.10 Scalar and List Context
		4.11 Exercises
	5. Motifs and Loops
	6. Subroutines and Bugs
	7. Mutations and Randomization
	8. The Genetic Code
	9. Restriction Maps and Regular Expressions
	10. GenBank
	11. Protein Data Bank
	12. BLAST
	13. Further Topics
	A. Resources
	B. Perl Summary
	Colophon
	Index

Beginning Perl for Bioinformatics > 4. Sequences and Strings > 4.3 Concatenating DNA Fragments

< BACK

CONTINUE >

4.3 Concatenating DNA Fragments

Now we'll make a simple modification of Example 4-1 to show how to concatenate two DNA fragments. Concatenation is attaching something to the end of something else. A biologist is well aware that joining DNA sequences is a common task in the biology lab, for instance when a clone is inserted into a cell vector or when splicing exons together during the expression of a gene. Many bioinformatics software packages have to deal with such operations; hence its choice as an example.

Example 4-2 demonstrates a few more things to do with strings, variables, and print statements.

Example 4-2. Concatenating DNA

#!/usr/bin/perl -w
# Concatenating DNA

# Store two DNA fragments into two variables called $DNA1 and $DNA2
$DNA1 = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';
$DNA2 = 'ATAGTGCCGTGAGAGTGATGTAGTA';

# Print the DNA onto the screen
print "Here are the original two DNA fragments:\n\n";

print $DNA1, "\n";

print $DNA2, "\n\n";

# Concatenate the DNA fragments into a third variable and print them
# Using "string interpolation"
$DNA3 = "$DNA1$DNA2";

print "Here is the concatenation of the first two fragments (version 1):\n\n";

print "$DNA3\n\n";

# An alternative way using the "dot operator":
# Concatenate the DNA fragments into a third variable and print them
$DNA3 = $DNA1 . $DNA2;

print "Here is the concatenation of the first two fragments (version 2):\n\n";

print "$DNA3\n\n";

# Print the same thing without using the variable $DNA3
print "Here is the concatenation of the first two fragments (version 3):\n\n";

print $DNA1, $DNA2, "\n";

exit;

As you can see, there are three variables here, $DNA1, $DNA2, and $DNA3. I've added print statements for a running commentary, so that the output of the program that appears on the computer screen makes more sense and isn't simply some DNA fragments one after the other.

Here's what the output of Example 4-2 looks like:

Here are the original two DNA fragments:

ACGGGAGGACGGGAAAATTACTACGGCATTAGC
ATAGTGCCGTGAGAGTGATGTAGTA

Here is the concatenation of the first two fragments (version 1):

ACGGGAGGACGGGAAAATTACTACGGCATTAGCATAGTGCCGTGAGAGTGATGTAGTA

Here is the concatenation of the first two fragments (version 2):

ACGGGAGGACGGGAAAATTACTACGGCATTAGCATAGTGCCGTGAGAGTGATGTAGTA

Here is the concatenation of the first two fragments (version 3):

ACGGGAGGACGGGAAAATTACTACGGCATTAGCATAGTGCCGTGAGAGTGATGTAGTA

Example 4-2 has many similarities to Example 4-1. Let's look at the differences. To start with, the print statements have some extra, unintuitive parts:

print $DNA1, "\n";

print $DNA2, "\n\n";

The print statements have variables containing the DNA, as before, but now they also have a comma and then "\n" or "\n\n". These are instructions to print newlines. A newline is invisible on the page or screen, but it tells the computer to go on to the beginning of the next line for subsequent printing. One newline, "\n", simply positions you at the beginning of the next line. Two new lines, "\n\n", moves to the next line and then positions you at the beginning of the line after that, leaving a blank line in between.

Look at the code for Example 4-2 and to make sure you see what these newline directives do to the output. A blank line is a line with nothing printed on it. Depending on your operating system, it may be just a newline character or a combination formfeed and carriage return (in which cases, it may also be called an empty line), or it may include nonprinting whitespace characters such as spaces and tabs. Notice that the newlines are enclosed in double quotes, which means they are parts of strings. (Here's one difference between single and double quotes, as mentioned earlier: "\n" prints a newline; '\n' prints \n as written.)

Notice the comma in the print statement. A comma separates items in a list. The print statement prints all the items that are listed. Simple as that.

Now let's look at the statement that concatenates the two DNA fragments $DNA1 and $DNA2 into the variable $DNA3:

$DNA3 = "$DNA1$DNA2";

The assignment to $DNA3 is just a typical assignment as you saw in Example 4-1, a variable name followed by the = sign, followed by a value to be assigned.

The value to the right of the assignment statement is a string enclosed in double quotes. The double quotes allow the variables in the string to be replaced with their values. This is called string interpolation .^[2] So, in effect, the string here is just the DNA of variable $DNA1, followed directly by the DNA of variable $DNA2. That concatenation of the two DNA fragments is then assigned to variable $DNA3.

^[2] There are occasions when you might add curly braces during string interpolation. The extra curly braces make sure the variable names aren't confused with anything else in the double-quoted string. For example, if you had variable $prefix and tried to interpolate it into the string I am $prefixinterested, Perl might not recognize the variable, confusing it with a nonexistent variable $prefixinterested. But the string I am ${prefix}interested is unambiguous to Perl.

After assigning the concatenated DNA to variable $DNA3, you print it out, followed by a blank line:

print "$DNA3\n\n";

One of the Perl catch phrases is, "There's more than one way to do it." So, the next part of the program shows another way to concatenate two strings, using the dot operator. The dot operator, when placed between two strings, creates a single string that concatenates the two original strings. So the line:

$DNA3 = $DNA1 . $DNA2;

illustrates the use of this operator.

An operator in a computer language takes some arguments—in this case, the strings $DNA1 and $DNA2—and does something to them, returning a value—in this case, the concatenated string placed in the variable $DNA3. The most familiar operators from arithmetic—plus, minus, multiply, and divide—are all operators that take two numbers as arguments and return a number as a value.

Finally, just to exercise the different parts of the language, let's accomplish the same concatenation using only the print statement:

print $DNA1, $DNA2, "\n";

Here the print statement has three parts, separated by commas: the two DNA fragments in the two variables and a newline. You can achieve the same result with the following print statement:

print "$DNA1$DNA2\n";

Maybe the Perl slogan should be, "There are more than two ways to do it."

Before leaving this section, let's look ahead to other uses of Perl variables. You've seen the use of variables to hold strings of DNA sequence data. There are other types of data, and programming languages need variables for them, too. In Perl, a scalar variable such as $DNA can hold a string, an integer, a floating-point number (with a decimal point), a boolean (true or false) value, and more. When it's required, Perl figures out what kind of data is in the variable. For now, try adding the following lines to Example 4-1 or Example 4-2, storing a number in a scalar variable and printing it out:

$number = 17;
print $number,"\n";

< BACK

CONTINUE >

Index terms contained in this section

" (quotes, double)
      string interpolation
, (comma)
      separating items in lists
. (dot)
      string concatenation operator
\\\\n or \\\\n\\\\n (newlines)
{} (curly braces)
      in string interpolation
assignment
      in concatenation of DNA fragments
concatenating DNA fragments
      print statement, using
concatenating strings
      . (dot) operator
DNA
     sequences
            concatenating
dot (.) operator
interpolating variables
joining DNA sequences
lists, separating items with commas
newlines
numbers
     storing in scalar variables
            and printing
operators
      concatenation
print statements
      concatenating DNA fragments with
      newlines
printing
      number stored in scalar variable
scalar variables
      datatypes held in
sequences
     DNA
            concatenating
string interpolation
strings
      concatenating
            . (dot) operator, using
            print statement, using
variables
      interpolating
      scalar