11.2
Files and Folders
The PDB is distributed as files
within directories. Each protein structure occupies its own file. PDB
contains a huge amount of data, and it can be a challenge to deal
with it. So in this section, you'll learn to deal with large
numbers of files organized in directories and subdirectories.
You'll frequently find a need to write programs that manipulate
large numbers of files. For example: perhaps you keep all your
sequencing runs in a directory, organized into subdirectories labeled
by the dates of the sequencing runs and containing whatever the
sequencer produced on those days. After a few years, you could have
quite a number of files.
Then, one day you discover a new sequence of DNA that seems to be
implicated in cell division. You do a BLAST search (see Chapter 12) but find no significant hits for your new
DNA. At that point you want to know whether you've seen this
DNA before in any previous sequencing runs.[1]
What you need to do is run a comparison subroutine on each of the hundreds or
thousands of files in all your various sequencing run subdirectories.
But that's going to take several days of repetitive, boring
work sitting at the computer screen.
You can write a program in much less time than that! Then all you
have to do is sit back and examine the results of any significant
matches your program finds. To write the program, however, you have
to know how to manipulate all the files and folders in Perl. The
following sections show you how to do it.
11.2.1
Opening Directories
A filesystem is organized in a tree structure.
The metaphor is apt. Starting from anyplace on the tree, you can
proceed up the branches and get to any leaves that stem from your
starting place. If you start from the root of the tree, you can reach
all the leaves. Similarly, in a filesystem, if you start at a certain
directory, you can reach all the files in all the subdirectories that
stem from your starting place, and if you start at the root (which,
strangely enough, is also called the "top") of the
filesystem, you can reach all the files.
You've already had plenty of practice opening, reading from,
writing to, and closing files. I will show a simple method with which
you can open a folder (also called a directory) and get the filenames
of all the files in that folder. Following that, you'll see how
to get the names of all files from all directories and subdirectories
from a certain starting point.
Let's look at the Perlish way to list all the files in a
folder, beginning with some pseudocode:
open folder
read contents of folder (files and subfolders)
print their names
Example 11-1 shows the actual Perl code.
Example 11-1. Listing the contents of a folder (or directory)
#!/usr/bin/perl
# Demonstrating how to open a folder and list its contents
use strict;
use warnings;
use BeginPerlBioinfo; # see Chapter 6 about this module
my @files = ( );
my $folder = 'pdb';
# open the folder
unless(opendir(FOLDER, $folder)) {
    print "Cannot open folder $folder!\n";
    exit;
}
# read the contents of the folder (i.e. the files and subfolders)
@files = readdir(FOLDER);
# close the folder
closedir(FOLDER);
# print them out, one per line
print join( "\n", @files), "\n";
exit;
Since you're running this program on a folder that contains PDB
files, this is what you'll see:
.
..
3c
44
pdb1a4o.ent
If you want to list the files in the current directory, you can give
the directory name the special name "." for the current
directory, like so:
my $folder = '.';
On Unix or Linux systems, the special files "." and
".." refer to the current directory and the parent
directory, respectively. These aren't "really"
files, at least not files you'd want to read; you can avoid
listing them with the wonderful and amazing grep
function. grep allows you to select elements
from an array based on a test, such as a regular expression.
Here's how to filter out the array entries "." and
"..":
@files = grep( !/^\.\.?$/, @files);
grep selects all array elements that don't match
the regular expression, due to the negation operator written as the
exclamation mark. The regular expression /^\.\.?$/
is looking for a string that begins with (the beginning of the string is
indicated with the ^ metacharacter) a period
\. (escaped with a backslash since a period is a
metacharacter) followed by 0 or 1 periods \.? (the
? matches 0 or 1 of the preceding item), and
nothing more (indicated by the $ end-of-string
metacharacter).
In fact, this is so often used when reading a directory that
it's usually combined into one step:
@files = grep (!/^\.\.?$/, readdir(FOLDER));
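To see exactly what this filter keeps and drops, you can run it on a hand-written list instead of a real directory. The following is a minimal sketch; the entry names are made up for illustration. Note that names such as ".hidden" or "..." survive, because the pattern matches only a lone "." or "..".

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical directory listing, written by hand for illustration
my @entries = ('.', '..', '...', '.hidden', '3c', '44', 'pdb1a4o.ent');

# Keep every entry that is not exactly "." or ".."
my @kept = grep( !/^\.\.?$/, @entries);

# Print the surviving entries, one per line
print join("\n", @kept), "\n";
```

Only the first two entries are removed; everything else, including the other names that begin with a period, is printed.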
Okay, now all the files are listed. But wait: what if some of these
files aren't files at all but are subfolders? You can use the
handy file test operators to test each filename and then even open
each subfolder and list the files in them. First, some pseudocode:
open folder
for each item in the folder
    if it's a file
        print its name
    else if it's a folder
        open the folder
        print the names of the contents of the folder
Example 11-2 shows the program.
Example 11-2. List contents of a folder and its subfolders
#!/usr/bin/perl
# Demonstrating how to open a folder and list its contents
# --distinguishing between files and subfolders, which
# are themselves listed
use strict;
use warnings;
use BeginPerlBioinfo; # see Chapter 6 about this module
my @files = ( );
my $folder = 'pdb';
# Open the folder
unless(opendir(FOLDER, $folder)) {
    print "Cannot open folder $folder!\n";
    exit;
}
# Read the folder, ignoring special entries "." and ".."
@files = grep (!/^\.\.?$/, readdir(FOLDER));
closedir(FOLDER);
# If file, print its name
# If folder, print its name and contents
#
# Notice that we need to prepend the folder name!
foreach my $file (@files) {
    # If the folder entry is a regular file
    if (-f "$folder/$file") {
        print "$folder/$file\n";
    # If the folder entry is a subfolder
    }elsif( -d "$folder/$file") {
        my $folder = "$folder/$file";
        # open the subfolder and list its contents
        unless(opendir(FOLDER, "$folder")) {
            print "Cannot open folder $folder!\n";
            exit;
        }
        my @files = grep (!/^\.\.?$/, readdir(FOLDER));
        closedir(FOLDER);
        foreach my $file (@files) {
            print "$folder/$file\n";
        }
    }
}
exit;
Here's the output of Example 11-2:
pdb/3c/pdb43c9.ent
pdb/3c/pdb43ca.ent
pdb/44/pdb144d.ent
pdb/44/pdb144l.ent
pdb/44/pdb244d.ent
pdb/44/pdb244l.ent
pdb/44/pdb344d.ent
pdb/44/pdb444d.ent
pdb/pdb1a4o.ent
Notice how variable names such as $file and
@files have been reused in this code, using
lexical scoping in the inner blocks with my. If
the overall structure of the program weren't so short and
simple, this could get really hard to read. When the program says
$file, does it mean this $file
or that $file? This code is an example of how to
get into trouble. It works, but it's hard to read, despite its
brevity.
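One way out of the naming trouble is to give each level its own variable name and to move the repeated open-read-close steps into a helper. The following is a sketch of one possible arrangement, not the code of Example 11-2: read_entries and list_two_levels are hypothetical names, the bareword FOLDER is replaced by a lexical directory handle, the entries are sorted so the order is predictable, and the paths are returned instead of being printed directly.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# read_entries (hypothetical helper): return the names found in a
# directory, sorted, without the special entries "." and ".."
sub read_entries {
    my ($dir) = @_;
    opendir(my $dh, $dir) or die "Cannot open folder $dir: $!\n";
    my @entries = sort grep( !/^\.\.?$/, readdir($dh));
    closedir($dh);
    return @entries;
}

# list_two_levels (hypothetical name): return the paths of the files in
# $top, plus the contents of each immediate subfolder of $top
sub list_two_levels {
    my ($top) = @_;
    my @paths = ();
    foreach my $entry (read_entries($top)) {
        my $path = "$top/$entry";
        if (-f $path) {
            push @paths, $path;
        } elsif (-d $path) {
            foreach my $subentry (read_entries($path)) {
                push @paths, "$path/$subentry";
            }
        }
    }
    return @paths;
}

# Run on the 'pdb' folder only if it exists where the program is run
if (-d 'pdb') {
    print "$_\n" foreach list_two_levels('pdb');
}
```

With $entry, $subentry, and $path each meaning one thing, there is no longer any question of which variable is which. The design problem remains, however: this version still stops after one level of subfolders.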
In fact, there's a deeper problem with Example 11-2. It's not well designed. It extends
Example 11-1 so that it can list one level of subdirectories. But
what if there are further levels of subdirectories?
11.2.2
Recursion
If you have a subroutine that lists the contents of directories and
recursively calls itself to list the contents of any subdirectories
it finds, you can call it on the top-level directory, and it
eventually lists all the files.
Let's write another program that does just that. A
recursive subroutine is defined simply as a
subroutine that calls itself. Here is the pseudocode and the code
(Example 11-3) followed by a discussion of how
recursion works:
subroutine list_recursively
    open folder
    for each item in the folder
        if it's a file
            print its name
        else if it's a folder
            list_recursively
Example 11-3. A recursive subroutine to list a filesystem
#!/usr/bin/perl
# Demonstrate a recursive subroutine to list a subtree of a filesystem
use strict;
use warnings;
use BeginPerlBioinfo; # see Chapter 6 about this module
list_recursively('pdb');
exit;
################################################################################
# Subroutine
################################################################################
# list_recursively
#
# list the contents of a directory,
# recursively listing the contents of any subdirectories
sub list_recursively {
    my($directory) = @_;
    my @files = ( );
    # Open the directory
    unless(opendir(DIRECTORY, $directory)) {
        print "Cannot open directory $directory!\n";
        exit;
    }
    # Read the directory, ignoring special entries "." and ".."
    @files = grep (!/^\.\.?$/, readdir(DIRECTORY));
    closedir(DIRECTORY);
    # If file, print its name
    # If directory, recursively print its contents
    # Notice that we need to prepend the directory name!
    foreach my $file (@files) {
        # If the directory entry is a regular file
        if (-f "$directory/$file") {
            print "$directory/$file\n";
        # If the directory entry is a subdirectory
        }elsif( -d "$directory/$file") {
            # Here is the recursive call to this subroutine
            list_recursively("$directory/$file");
        }
    }
}
Here's the output of Example 11-3 (notice that
it's the same as the output of Example 11-2):
pdb/3c/pdb43c9.ent
pdb/3c/pdb43ca.ent
pdb/44/pdb144d.ent
pdb/44/pdb144l.ent
pdb/44/pdb244d.ent
pdb/44/pdb244l.ent
pdb/44/pdb344d.ent
pdb/44/pdb444d.ent
pdb/pdb1a4o.ent
Look over the code for Example 11-3 and compare it to
Example 11-2. As you can see, the programs are
largely identical. Example 11-2 is all one main
program; Example 11-3 has almost identical code but
has packaged it up as a subroutine that is called by a short main
program. The main program of Example 11-3 simply
calls a recursive function, giving it a directory name (for a
directory that exists on my computer; you may need to change the
directory name when you attempt to run this program on your own
computer). Here is the call:
list_recursively('pdb');
I don't know if you feel let down, but I do. This looks just
like any other subroutine call. Clearly, the recursion must be
defined within the subroutine. It's not until the very end of
the list_recursively subroutine, where the
program finds (using the -d file test operator)
that one of the contents of the directory that it's listing is
itself a directory, that there's a significant difference in
the code as compared with Example 11-2. At that
point, Example 11-2 has code to once again look for
regular files or for directories. But this subroutine in Example 11-3 simply calls a subroutine, which happens to be
itself, namely, list_recursively:
list_recursively("$directory/$file");
That's recursion.
As you've seen here, there are times when the data (for
instance, the hierarchical structure of a filesystem) is well
matched by the capabilities of recursive programs. Using recursion
can result in clean, short, easy-to-understand programs, although it
can be slow, due to all the subroutine calls it can create. A special
type of recursion called tail recursion, in which the recursive call
is the very last thing the subroutine does, can be optimized by many
compilers to run much faster. Note that the recursive call in
list_recursively is not strictly a tail call: it sits inside the
foreach loop, which still has entries to process after the call
returns. In any case, Perl doesn't perform this optimization
automatically. (At the time of writing, plans for Perl 6 included
support for optimizing tail recursion.)
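If the cost of the subroutine calls ever matters, the same traversal can be written without recursion by keeping a list of directories that are still waiting to be read. The following is a minimal sketch under stated assumptions: list_iteratively is a hypothetical name, it uses a lexical directory handle, it skips (with a warning) any directory it cannot open instead of exiting, and it returns the sorted paths instead of printing them as it goes.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# list_iteratively (hypothetical name): return the paths of all regular
# files under $top, using a list of pending directories in place of
# recursive subroutine calls
sub list_iteratively {
    my ($top) = @_;
    my @pending = ($top);   # directories still to be read
    my @found   = ();       # regular files seen so far

    while (@pending) {
        my $directory = shift @pending;

        my $dh;
        unless (opendir($dh, $directory)) {
            warn "Cannot open directory $directory: $!\n";
            next;
        }
        # Read the directory, ignoring special entries "." and ".."
        my @entries = grep( !/^\.\.?$/, readdir($dh));
        closedir($dh);

        foreach my $entry (@entries) {
            my $path = "$directory/$entry";
            if (-f $path) {
                push @found, $path;
            } elsif (-d $path) {
                push @pending, $path;   # read it later; no recursive call
            }
        }
    }
    my @sorted = sort @found;
    return @sorted;
}

# Run on the 'pdb' directory only if it exists where the program is run
if (-d 'pdb') {
    print "$_\n" foreach list_iteratively('pdb');
}
```

The @pending array does the bookkeeping that the chain of subroutine calls does in the recursive version. Whether this is clearer than Example 11-3 is a matter of taste; for a directory tree, the recursive form is usually the easier one to read.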
11.2.3
Processing Many Files
Perl has modules for a variety of tasks. Some come standard with
Perl; more can be installed after obtaining them from CPAN or
elsewhere: http://www.CPAN.org/.
Example 11-3 in the previous section showed how to
locate all files and directories under a given directory.
There's a module that is standard in any recent version of Perl
called File::Find. You can find it in your
manual pages: on Unix or Linux, for instance, you issue the command
perldoc File::Find. This module
makes it easy, and efficient, to process all files under a
given directory, performing whatever operations you specify.
Example 11-4 shows the same functionality as Example 11-3 but now
uses File::Find: it simply lists the regular files under a directory.
Consult the documentation for more examples of this useful module.
Notice how much less code
you have to write if you find a good module, ready to use!
Example 11-4. Demonstrate File::Find
#!/usr/bin/perl
# Demonstrate File::Find
use strict;
use warnings;
use BeginPerlBioinfo; # see Chapter 6 about this module
use File::Find;
find ( \&my_sub, ('pdb') );
sub my_sub {
    -f and (print $File::Find::name, "\n");
}
exit;
Notice that a reference to the my_sub subroutine is passed to find;
you create the reference by prefacing the subroutine name with the
backslash character. You also need to preface the name with the
ampersand character, as mentioned in Chapter 6.
The call to find can also be done like this:
find sub { -f and (print $File::Find::name, "\n") }, ('pdb');
This puts an anonymous subroutine in place of the reference to the
my_sub subroutine, and it's a convenience
for these types of short subroutines.
Here's the output:
pdb/pdb1a4o.ent
pdb/44/pdb144d.ent
pdb/44/pdb144l.ent
pdb/44/pdb244d.ent
pdb/44/pdb244l.ent
pdb/44/pdb344d.ent
pdb/44/pdb444d.ent
pdb/3c/pdb43c9.ent
pdb/3c/pdb43ca.ent
As a final example of processing files with Perl, here's the
same functionality as the preceding programs, with a one-line
program, issued at the command line:
perl -e 'use File::Find;find sub{-f and (print $File::Find::name,"\n")},("pdb")'
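Pretty cool, for those who admire terseness, although it
doesn't really eschew obfuscation. Also note that for those on
Unix systems, ls -R pdb and find pdb
-print do much the same thing with even less typing, although both
also list the directories themselves; find pdb -type f prints only
the regular files.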
The reason for using a subroutine that you define is that it enables
you to perform any arbitrary tests on the files you find and then
take any actions with those files. It's another case of
modularization: the File::Find module makes it
easy to recurse over all the files and directories in a file
structure and lets you do as you wish with the files and directories
you find.
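To make this concrete, here is a sketch of a subroutine that tests the contents of each file instead of just printing its name, in the spirit of the sequencing-run search described at the start of this section. It is an illustration under stated assumptions, not a real comparison subroutine: files_containing is a hypothetical name, the query is treated as a literal string, and each file is read whole, so a match that is split across two lines of a file is missed.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;

# files_containing (hypothetical name): return the paths of the regular
# files under $top whose contents include the literal string $query
sub files_containing {
    my ($top, $query) = @_;
    my @hits = ();

    find(
        sub {
            return unless -f $_;               # skip directories and the like
            open(my $fh, '<', $_) or return;   # skip files that can't be read
            local $/;                          # read the whole file at once
            my $contents = <$fh>;
            close($fh);
            push @hits, $File::Find::name
                if defined $contents and index($contents, $query) >= 0;
        },
        $top
    );
    my @sorted = sort @hits;
    return @sorted;
}

# Run on the 'pdb' directory only if it exists where the program is run;
# 'HEADER' is just an example query
if (-d 'pdb') {
    print "$_\n" foreach files_containing('pdb', 'HEADER');
}
```

File::Find does the walking; the anonymous subroutine decides what counts as a hit. Replacing the call to index with a regular expression, or with a proper sequence comparison, changes the test without touching the traversal.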