4.8 Reading Proteins in Files
Programs interact with files stored on a computer. These files can be
on a hard disk, CD, floppy disk, Zip drive, magnetic tape, or any
other kind of permanent storage.
Let's take a look at how to read protein sequence data from a
file. First, create a file on your computer (use your text editor)
and put some protein sequence data into it. Call the file
NM_021964fragment.pep (you can download it from this
book's web site). You will be using the following data (part of
the human zinc finger protein NM_021964):
MNIDDKLEGLFLKCGGIDEMQSSRTMVVMGGVSGQSTVSGELQD
SVLQDRSMPHQEILAADEVLQESEMRQQDMISHDELMVHEETVKNDEEQMETHERLPQ
GLQYALNVPISVKQEITFTDVSEQLMRDKKQIR
You can use any name, except one that's already in use in the
same folder.
Just as well-chosen variable names can be critical to understanding a
program, well-chosen file and folder names can also be critical. If
you have a project that generates lots of computer files, you need to
carefully consider how to name and organize the files and folders.
This is as true for individual researchers as for large,
multi-national teams. It's important to put some effort into
assigning informative names to files.
The filename NM_021964fragment.pep is taken from
the GenBank ID of the record where this protein is found. It also
indicates the fragmentary nature of the data and contains the
filename extension .pep to remind you that the
file contains peptide or protein
sequence data. Of course, some other scheme might work better for
you; the point is to get some idea of what's in the file
without having to look into it.
Now that you've created or downloaded a file with protein
sequence data in it, let's develop a program that reads the
protein sequence data from the file and stores it into a variable.
Example 4-5 shows a first attempt, which will be
added to as we progress.
Example 4-5. Reading protein sequence data from a file
#!/usr/bin/perl -w
# Reading protein sequence data from a file
# The filename of the file containing the protein sequence data
$proteinfilename = 'NM_021964fragment.pep';
# First we have to "open" the file, and associate
# a "filehandle" with it. We choose the filehandle
# PROTEINFILE for readability.
open(PROTEINFILE, $proteinfilename);
# Now we do the actual reading of the protein sequence data from the file,
# by using the angle brackets < and > to get the input from the
# filehandle. We store the data into our variable $protein.
$protein = <PROTEINFILE>;
# Now that we've got our data, we can close the file.
close PROTEINFILE;
# Print the protein onto the screen
print "Here is the protein:\n\n";
print $protein;
exit;
Here's the output of Example 4-5:
Here is the protein:
MNIDDKLEGLFLKCGGIDEMQSSRTMVVMGGVSGQSTVSGELQD
Notice that only the first line of the file prints out. I'll
show why in a moment.
Let's look at Example 4-5 in more detail.
After putting a filename into the variable $proteinfilename, the file
is opened with the following statement:
open(PROTEINFILE, $proteinfilename);
After opening the file, you can do various things with it, such as
reading, writing, searching, going to a specific location in the
file, erasing everything in the file, and so on. Notice that the
program assumes the file named in the variable
$proteinfilename exists and can be opened.
You'll see shortly how to check for that, but here's something to
try: change the filename in $proteinfilename so that there's no file
of that name on your computer, and then run the program. Because the
open fails and the program runs with warnings turned on (the -w
flag), you'll get some error messages.
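As a preview of that check, here is a minimal sketch rather than the way the book's later examples must do it. It relies on the fact that open returns a true value on success and a false value on failure, and that the special variable $! then holds the system's error message. The filename below is hypothetical and is assumed not to exist on your computer. The sketch also declares its variables with my and spells out the read mode '<', which the examples in this section don't do.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical filename, assumed NOT to exist on your computer
my $proteinfilename = 'no_such_file_hypothetical.pep';

# open returns true on success and false on failure.
# The '<' argument asks explicitly for read access.
my $opened = open(PROTEINFILE, '<', $proteinfilename);

if ($opened) {
    # The file exists after all: read and show its first line
    my $protein = <PROTEINFILE>;
    close PROTEINFILE;
    print "First line: $protein";
} else {
    # $! holds the reason the open failed
    print "Cannot open file \"$proteinfilename\": $!\n";
}
```

Run as is, the program should take the else branch and report why the open failed instead of going on to read from a file that was never opened.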
If you look at the documentation for the
open function, you'll see many
options. Mostly, they enable you to specify exactly what the file
will be used for after it's opened.
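To make those options a little more concrete, here is a hedged sketch of the three most common ones, using the three-argument form of open: '>' for writing, '>>' for appending, and '<' for reading. The scratch filename and the short sequence fragments are made up for the illustration, and the "or die" after each open simply stops the program with a message if the open fails.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical scratch file, used only for this illustration
my $filename = 'open_modes_demo.txt';

# '>' opens for writing: the file is created, or emptied if it exists
open(OUT, '>', $filename) or die "Cannot write $filename: $!";
print OUT "MNIDDKLEGLF\n";
close OUT;

# '>>' opens for appending: new output goes after the existing contents
open(OUT, '>>', $filename) or die "Cannot append to $filename: $!";
print OUT "SVLQDRSMPHQ\n";
close OUT;

# '<' opens for reading
open(IN, '<', $filename) or die "Cannot read $filename: $!";
my $first  = <IN>;
my $second = <IN>;
close IN;

print "Line 1: $first";
print "Line 2: $second";

# Remove the scratch file again
unlink $filename;
```

The two lines come back in the order they were written, which shows that the append really did add to the end of the file rather than replace it.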
Let's examine the term PROTEINFILE, which is called a filehandle.
With filehandles, it's not important to understand what they really
are. They're just things you use when you're dealing with files. They
don't have to have capital letters, but it's a widely followed
convention. After the open statement assigns a filehandle, all the
interaction with a file is done by naming the filehandle.
The data is actually read into the program with the statement:
$protein = <PROTEINFILE>;
Why is the filehandle PROTEINFILE enclosed within angle brackets?
These angle brackets are called input operators; a filehandle within
angle brackets is how you bring in data from some source outside the
program. Here, we're reading the file called NM_021964fragment.pep,
whose name is stored in the variable $proteinfilename, and which has
a filehandle associated with it by the open statement. The data is
stored in the variable $protein and then printed out.
However, as you've already noticed, only the first line of this
multiline file is printed out. Why? Because reading from a filehandle
into a scalar variable brings in just one line at a time, so there
are a few more things to learn about reading in files.
There are several ways to read in a whole file. Example 4-6 shows one way.
Example 4-6. Reading protein sequence data from a file, take 2
#!/usr/bin/perl -w
# Reading protein sequence data from a file, take 2
# The filename of the file containing the protein sequence data
$proteinfilename = 'NM_021964fragment.pep';
# First we have to "open" the file, and associate
# a "filehandle" with it. We choose the filehandle
# PROTEINFILE for readability.
open(PROTEINFILE, $proteinfilename);
# Now we do the actual reading of the protein sequence data from the file,
# by using the angle brackets < and > to get the input from the
# filehandle. We store the data into our variable $protein.
#
# Since the file has three lines, and since each read returns
# only one line, we'll read a line and print it, three times.
# First line
$protein = <PROTEINFILE>;
# Print the protein onto the screen
print "\nHere is the first line of the protein file:\n\n";
print $protein;
# Second line
$protein = <PROTEINFILE>;
# Print the protein onto the screen
print "\nHere is the second line of the protein file:\n\n";
print $protein;
# Third line
$protein = <PROTEINFILE>;
# Print the protein onto the screen
print "\nHere is the third line of the protein file:\n\n";
print $protein;
# Now that we've got our data, we can close the file.
close PROTEINFILE;
exit;
Here's the output of Example 4-6:
Here is the first line of the protein file:
MNIDDKLEGLFLKCGGIDEMQSSRTMVVMGGVSGQSTVSGELQD
Here is the second line of the protein file:
SVLQDRSMPHQEILAADEVLQESEMRQQDMISHDELMVHEETVKNDEEQMETHERLPQ
Here is the third line of the protein file:
GLQYALNVPISVKQEITFTDVSEQLMRDKKQIR
The interesting thing about this program is that it shows how reading
from a file works. Every time you read into a scalar variable such as
$protein, the next line of the file is read. The filehandle keeps
track of where the previous read ended, and the next read picks up
from there.
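If you'd like to watch that bookkeeping happen, the following sketch is one way to do it. It uses Perl's built-in tell function, which reports the current position within an open file, and it shows that a read past the last line returns the undefined value. The scratch filename and the short three-line file it creates are made up for the illustration; the exact positions printed can differ from one operating system to another, so only their steady increase matters.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical scratch file with three short lines
my $filename = 'position_demo.pep';
open(OUT, '>', $filename) or die "Cannot write $filename: $!";
print OUT "MNIDDKLEGLF\nSVLQDRSMPHQ\nGLQYALNVPIS\n";
close OUT;

open(PROTEINFILE, '<', $filename) or die "Cannot read $filename: $!";

my $start        = tell(PROTEINFILE);  # position before any read
my $line1        = <PROTEINFILE>;
my $after_first  = tell(PROTEINFILE);  # position after the first line
my $line2        = <PROTEINFILE>;
my $after_second = tell(PROTEINFILE);  # position after the second line
my $line3        = <PROTEINFILE>;
my $line4        = <PROTEINFILE>;      # nothing is left to read

close PROTEINFILE;

print "Position before reading: $start\n";
print "Position after line 1:   $after_first\n";
print "Position after line 2:   $after_second\n";

if (defined $line4) {
    print "Unexpected fourth line: $line4";
} else {
    print "A fourth read returns the undefined value: end of file\n";
}

# Remove the scratch file again
unlink $filename;
```

Each read moves the position forward by one line, which is why three identical-looking statements in Example 4-6 return three different lines.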
On the other hand, the drawbacks of this program are obvious. Having
to write a few lines of code for each line of an input file
isn't convenient. However, there are two Perl features that can
handle this nicely: arrays (in the next section) and loops (in Chapter 5).
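As a brief preview only, and not a substitute for the fuller explanations to come, here is a sketch of both features. A while loop keeps reading one line at a time until the file runs out, and reading into an array variable brings in all the lines at once. The scratch filename and its three short lines are made up for the illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical scratch file with three short lines
my $filename = 'loop_demo.pep';
open(OUT, '>', $filename) or die "Cannot write $filename: $!";
print OUT "MNIDDKLEGLF\nSVLQDRSMPHQ\nGLQYALNVPIS\n";
close OUT;

# A loop: read one line per pass until no lines are left
open(PROTEINFILE, '<', $filename) or die "Cannot read $filename: $!";
my $count = 0;
while (my $line = <PROTEINFILE>) {
    $count++;
    print "Line $count: $line";
}
close PROTEINFILE;

# An array: reading into an array variable gets every line at once
open(PROTEINFILE, '<', $filename) or die "Cannot read $filename: $!";
my @protein = <PROTEINFILE>;
close PROTEINFILE;

print "The array holds ", scalar(@protein), " lines\n";

# Remove the scratch file again
unlink $filename;
```

Either approach works no matter how many lines the file has, so the program no longer needs to know the line count in advance.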