Safari | Beginning Perl for Bioinformatics -> 11.2 Files and Folders

11.2 Files and Folders

The PDB is distributed as files within directories. Each protein structure occupies its own file. The PDB contains a huge amount of data, and handling it can be a challenge, so in this section you'll learn to deal with large numbers of files organized in directories and subdirectories.

You'll frequently find a need to write programs that manipulate large numbers of files. For example: perhaps you keep all your sequencing runs in a directory, organized into subdirectories labeled by the dates of the sequencing runs and containing whatever the sequencer produced on those days. After a few years, you could have quite a number of files.

Then, one day you discover a new sequence of DNA that seems to be implicated in cell division. You do a BLAST search (see Chapter 12) but find no significant hits for your new DNA. At that point you want to know whether you've seen this DNA before in any previous sequencing runs.^[1] What you need to do is run a comparison subroutine on each of the hundreds or thousands of files in all your various sequencing run subdirectories. But doing that by hand, one file at a time, is going to take several days of repetitive, boring work sitting at the computer screen.

^[1] You may do a comparison by keeping copies of all your sequencing runs in one large BLAST library; building such a BLAST library can be done using the techniques shown in this section.

You can write a program in much less time than that! Then all you have to do is sit back and examine the results of any significant matches your program finds. To write the program, however, you have to know how to manipulate all the files and folders in Perl. The following sections show you how to do it.

11.2.1 Opening Directories

A filesystem is organized in a tree structure. The metaphor is apt. Starting from anyplace on the tree, you can proceed up the branches and get to any leaves that stem from your starting place. If you start from the root of the tree, you can reach all the leaves. Similarly, in a filesystem, if you start at a certain directory, you can reach all the files in all the subdirectories that stem from your starting place, and if you start at the root (which, strangely enough, is also called the "top") of the filesystem, you can reach all the files.

You've already had plenty of practice opening, reading from, writing to, and closing files. I will show a simple method with which you can open a folder (also called a directory) and get the filenames of all the files in that folder. Following that, you'll see how to get the names of all files from all directories and subdirectories from a certain starting point.

Let's look at the Perlish way to list all the files in a folder, beginning with some pseudocode:

open folder

read contents of folder (files and subfolders)

print their names

Example 11-1 shows the actual Perl code.

Example 11-1. Listing the contents of a folder (or directory)

#!/usr/bin/perl
#   Demonstrating how to open a folder and list its contents

use strict;
use warnings;
use BeginPerlBioinfo;     # see Chapter 6 about this module

my @files = (  );
my $folder = 'pdb';

# open the folder
unless(opendir(FOLDER, $folder)) {
    print "Cannot open folder $folder!\n";
    exit;
}

# read the contents of the folder (i.e. the files and subfolders)
@files = readdir(FOLDER);

# close the folder
closedir(FOLDER);

# print them out, one per line
print join( "\n", @files), "\n";

exit;

Since you're running this program on a folder that contains PDB files, this is what you'll see:

.
..
3c
44
pdb1a4o.ent

If you want to list the files in the current directory, you can use the special name "." as the directory name, like so:

my $folder = '.';

On Unix or Linux systems, the special files "." and ".." refer to the current directory and the parent directory, respectively. These aren't "really" files, at least not files you'd want to read; you can avoid listing them with the wonderful and amazing grep function. grep allows you to select elements from an array based on a test, such as a regular expression. Here's how to filter out the array entries "." and "..":

@files = grep( !/^\.\.?$/, @files);

grep selects all array elements that don't match the regular expression, due to the negation operator written as the exclamation mark. The regular expression /^\.\.?$/ looks for a string that begins (the beginning of the string is indicated with the ^ metacharacter) with a period \. (escaped with a backslash since a period is a metacharacter), followed by 0 or 1 periods \.? (the ? matches 0 or 1 of the preceding item), and nothing more (indicated by the $ end-of-string metacharacter).
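As a quick check of that regular expression, here is a minimal sketch that applies the same grep to a hard-coded list instead of a real directory. The names in @entries are invented for illustration. Notice that a name such as .hidden survives the filter, because the pattern matches only the exact strings "." and "..".

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical directory entries, invented for this illustration
my @entries = ('.', '..', '.hidden', '3c', '44', 'pdb1a4o.ent');

# Keep every element that is NOT exactly "." or ".."
my @kept = grep( !/^\.\.?$/, @entries );

# Prints .hidden, 3c, 44 and pdb1a4o.ent, one per line
print join("\n", @kept), "\n";
```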

In fact, this is so often used when reading a directory that it's usually combined into one step:

@files = grep (!/^\.\.?$/, readdir(FOLDER));
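The one-step idiom is easy to wrap in a small subroutine. The following is only a sketch: list_folder is a hypothetical name (it isn't part of BeginPerlBioinfo), and it uses a lexical directory handle and die with the $! error variable, a style you'll often see in more recent Perl code, in place of the bareword FOLDER handle. The result is sorted so that the output order is predictable.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# list_folder is a hypothetical helper name, not part of BeginPerlBioinfo
sub list_folder {
    my ($folder) = @_;

    # A lexical directory handle ($dh) instead of the bareword FOLDER;
    # $! holds the system's reason for the failure
    opendir(my $dh, $folder)
        or die "Cannot open folder $folder: $!\n";

    # Read and filter in one step, then sort the names
    my @files = sort(grep { !/^\.\.?$/ } readdir($dh));

    closedir($dh);
    return @files;
}

# List the current directory
print join("\n", list_folder('.')), "\n";
```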

Okay, now all the files are listed. But wait: what if some of these entries aren't files at all but are subfolders? You can use the handy file test operators to test each entry and then even open each subfolder and list the files in it. First, some pseudocode:

open folder

for each item in the folder {

    if it's a file {
        print its name

    } else if it's a folder {
        open the folder
        print the names of the contents of the folder
    }
}

Example 11-2 shows the program.

Example 11-2. List contents of a folder and its subfolders

#!/usr/bin/perl
#   Demonstrating how to open a folder and list its contents
#    --distinguishing between files and subfolders, which
#         are themselves listed

use strict;
use warnings;
use BeginPerlBioinfo;     # see Chapter 6 about this module

my @files = (  );
my $folder = 'pdb';

# Open the folder
unless(opendir(FOLDER, $folder)) {
    print "Cannot open folder $folder!\n";
    exit;
}

# Read the folder, ignoring special entries "." and ".."
@files = grep (!/^\.\.?$/, readdir(FOLDER));

closedir(FOLDER);

# If file, print its name
# If folder, print its name and contents
#
# Notice that we need to prepend the folder name!
foreach my $file (@files) {

    # If the folder entry is a regular file
    if (-f "$folder/$file") {
        print "$folder/$file\n";

    # If the folder entry is a subfolder
    }elsif( -d "$folder/$file") {

        my $folder = "$folder/$file";

        # open the subfolder and list its contents
        unless(opendir(FOLDER, "$folder")) {
            print "Cannot open folder $folder!\n";
            exit;
        }
        
        my @files = grep (!/^\.\.?$/, readdir(FOLDER));
        
        closedir(FOLDER);
        
        foreach my $file (@files) {
            print "$folder/$file\n";
        }
    }
}

exit;

Here's the output of Example 11-2:

pdb/3c/pdb43c9.ent
pdb/3c/pdb43ca.ent
pdb/44/pdb144d.ent
pdb/44/pdb144l.ent
pdb/44/pdb244d.ent
pdb/44/pdb244l.ent
pdb/44/pdb344d.ent
pdb/44/pdb444d.ent
pdb/pdb1a4o.ent

Notice how variable names such as $file and @files have been reused in this code, using lexical scoping in the inner blocks with my. If the overall structure of the program weren't so short and simple, this could get really hard to read. When the program says $file, does it mean this $file or that $file? This code is an example of how to get into trouble. It works, but it's hard to read, despite its brevity.
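To see exactly what that reuse does, here is a small sketch of the same shadowing, stripped of the directory handling. The @seen array exists only to record values for this illustration, and the subfolder names are hard-coded.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $folder = 'pdb';              # the outer $folder
my @seen   = ($folder);

foreach my $file ('3c', '44') {  # hard-coded sample subfolder names

    # This "my" creates a NEW $folder visible only inside the loop body;
    # the right-hand side still refers to the outer $folder
    my $folder = "$folder/$file";
    push @seen, $folder;
}

# Outside the block, $folder is the outer variable again, unchanged
push @seen, $folder;

# Prints pdb, pdb/3c, pdb/44 and pdb, one per line
print join("\n", @seen), "\n";
```

The inner declaration never changes the outer variable, which is why Example 11-2 works; it is also why a reader has to track which $folder is in scope at each line.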

In fact, there's a deeper problem with Example 11-2. It's not well designed. By extending Example 11-1, it can now list subdirectories. But what if there are further levels of subdirectories?

11.2.2 Recursion

If you have a subroutine that lists the contents of directories and recursively calls itself to list the contents of any subdirectories it finds, you can call it on the top-level directory, and it eventually lists all the files.

Let's write another program that does just that. A recursive subroutine is defined simply as a subroutine that calls itself. Here is the pseudocode and the code (Example 11-3) followed by a discussion of how recursion works:

subroutine list_recursively {

    open folder

    for each item in the folder {

        if it's a file {
            print its name

        } else if it's a folder {
            list_recursively
        }
    }
}

Example 11-3. A recursive subroutine to list a filesystem

#!/usr/bin/perl
#  Demonstrate a recursive subroutine to list a subtree of a filesystem

use strict;
use warnings;
use BeginPerlBioinfo;     # see Chapter 6 about this module

list_recursively('pdb');

exit;

################################################################################
# Subroutine
################################################################################

# list_recursively
#
#   list the contents of a directory,
#              recursively listing the contents of any subdirectories

sub list_recursively {

    my($directory) = @_;

    my @files = (  );
    
    # Open the directory
    unless(opendir(DIRECTORY, $directory)) {
        print "Cannot open directory $directory!\n";
        exit;
    }
    
    # Read the directory, ignoring special entries "." and ".."
    @files = grep (!/^\.\.?$/, readdir(DIRECTORY));
    
    closedir(DIRECTORY);
    
    # If file, print its name
    # If directory, recursively print its contents

    # Notice that we need to prepend the directory name!
    foreach my $file (@files) {
    
        # If the directory entry is a regular file
        if (-f "$directory/$file") {
    
            print "$directory/$file\n";
        
        # If the directory entry is a subdirectory
        }elsif( -d "$directory/$file") {

            # Here is the recursive call to this subroutine
            list_recursively("$directory/$file");
        }
    }
}

Here's the output of Example 11-3 (notice that it's the same as the output of Example 11-2):

pdb/3c/pdb43c9.ent
pdb/3c/pdb43ca.ent
pdb/44/pdb144d.ent
pdb/44/pdb144l.ent
pdb/44/pdb244d.ent
pdb/44/pdb244l.ent
pdb/44/pdb344d.ent
pdb/44/pdb444d.ent
pdb/pdb1a4o.ent

Look over the code for Example 11-3 and compare it to Example 11-2. As you can see, the programs are largely identical. Example 11-2 is all one main program; Example 11-3 has almost identical code but has packaged it up as a subroutine that is called by a short main program. The main program of Example 11-3 simply calls a recursive function, giving it a directory name (for a directory that exists on my computer; you may need to change the directory name when you attempt to run this program on your own computer). Here is the call:

list_recursively('pdb');

I don't know if you feel let down, but I do. This looks just like any other subroutine call. Clearly, the recursion must be defined within the subroutine. It's not until the very end of the list_recursively subroutine, where the program finds (using the -d file test operator) that one of the contents of the directory that it's listing is itself a directory, that there's a significant difference in the code as compared with Example 11-2. At that point, Example 11-2 has code to once again look for regular files or for directories. But this subroutine in Example 11-3 simply calls a subroutine, which happens to be itself, namely, list_recursively:

list_recursively("$directory/$file");

That's recursion.
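A recursive subroutine can also return its results instead of printing them, which is handy when you want to do something else with the filenames later. The following is a sketch of that variation, not the book's code: collect_files is a hypothetical name, and it uses a lexical directory handle so that each level of the recursion has its own handle. Entries are sorted at each level so that the order of the result is predictable.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# collect_files is a hypothetical name; it is not part of BeginPerlBioinfo
sub collect_files {
    my ($directory) = @_;

    # Open, read, and close the directory before recursing
    opendir(my $dh, $directory)
        or die "Cannot open directory $directory: $!\n";
    my @entries = sort(grep { !/^\.\.?$/ } readdir($dh));
    closedir($dh);

    my @found = ();
    foreach my $entry (@entries) {
        my $path = "$directory/$entry";

        if (-f $path) {
            push @found, $path;

        } elsif (-d $path) {
            # Recursive call: append whatever the subdirectory contains
            push @found, collect_files($path);
        }
    }
    return @found;
}

# Run as, for example:  perl collect_files.pl pdb
if (@ARGV) {
    print "$_\n" foreach collect_files($ARGV[0]);
}
```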

As you've seen here, there are times when the data (for instance, the hierarchical structure of a filesystem) is well matched by the capabilities of recursive programs, and using recursion can result in clean, short, easy-to-understand programs. Recursion can be slow, due to all the subroutine calls it creates. One special case, called tail recursion, is easier to speed up: when the recursive call is the very last thing a subroutine does, many compilers can optimize the code to make it run much faster. Note that although the recursive call in Example 11-3 appears at the end of the subroutine's text, it sits inside a foreach loop, so the loop still has work to do after each call returns; strictly speaking, it is not a tail call. (Perl doesn't optimize tail recursion automatically in any case; at the time of writing, plans for Perl 6 included support for it.)

11.2.3 Processing Many Files

Perl has modules for a variety of tasks. Some come standard with Perl; more can be installed after obtaining them from CPAN or elsewhere: http://www.CPAN.org/.

Example 11-3 in the previous section showed how to locate all files and directories under a given directory. There's a module that is standard in any recent version of Perl called File::Find. You can find it in your manual pages: on Unix or Linux, for instance, you issue the command perldoc File::Find. This module makes it easy—and efficient—to process all files under a given directory, performing whatever operations you specify.

Example 11-4 provides the same functionality as Example 11-3 but uses File::Find: it simply lists the regular files found under the starting directory and its subdirectories. Consult the documentation for more examples of this useful module. Notice how much less code you have to write if you find a good module, ready to use!

Example 11-4. Demonstrate File::Find

#!/usr/bin/perl
#  Demonstrate File::Find

use strict;
use warnings;
use BeginPerlBioinfo;     # see Chapter 6 about this module

use File::Find;

find ( \&my_sub, ('pdb') );

sub my_sub {
    -f and (print $File::Find::name, "\n");
}

exit;

Notice that a reference to the my_sub subroutine is passed to find by prefacing the subroutine's name with the backslash character. You also need to preface the name with the ampersand character, as mentioned in Chapter 6.
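If subroutine references are new to you, this short sketch shows the two ways of making one and how a reference is then called. The shout subroutine and both variable names are made up for the illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# shout is a hypothetical subroutine, used only to have something to refer to
sub shout {
    my ($word) = @_;
    return uc($word) . '!';
}

my $named_ref = \&shout;                    # reference to a named subroutine
my $anon_ref  = sub { return lc($_[0]) };   # reference to an anonymous one

# Either kind of reference is called with the arrow syntax
print $named_ref->('pdb'), "\n";   # prints PDB!
print $anon_ref->('PDB'), "\n";    # prints pdb
```

A module such as File::Find receives a reference like these and calls it once for each file or directory it visits.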

The call to find can also be done like this:

find sub { -f and (print $File::Find::name, "\n") }, ('pdb');

This puts an anonymous subroutine in place of the reference to the my_sub subroutine, and it's a convenience for these types of short subroutines.

Here's the output:

pdb/pdb1a4o.ent
pdb/44/pdb144d.ent
pdb/44/pdb144l.ent
pdb/44/pdb244d.ent
pdb/44/pdb244l.ent
pdb/44/pdb344d.ent
pdb/44/pdb444d.ent
pdb/3c/pdb43c9.ent
pdb/3c/pdb43ca.ent

As a final example of processing files with Perl, here's the same functionality as the preceding programs, with a one-line program, issued at the command line:

perl -e 'use File::Find;find sub{-f and (print $File::Find::name,"\n")},("pdb")'

Pretty cool, for those who admire terseness, although it doesn't really eschew obfuscation. Also note that for those on Unix systems, ls -R pdb and find pdb -print do much the same thing with even less typing, although both list the directories as well as the files.

The reason for using a subroutine that you define is that it enables you to perform any arbitrary tests on the files you find and then take any actions with those files. It's another case of modularization: the File::Find module makes it easy to recurse over all the files and directories in a file structure and lets you do as you wish with the files and directories you find.
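To illustrate the point, here is a sketch of a subroutine that applies one extra test and takes one action: it keeps only regular files whose names end in .ent and records the size of each. The name find_ent_files, the .ent-only rule, and the %size_of hash are assumptions made for this example; find, $_, $File::Find::name, and the -f and -s file test operators are standard Perl.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;

# find_ent_files is a hypothetical helper, made up for this sketch
sub find_ent_files {
    my ($top) = @_;
    my %size_of = ();

    find(
        sub {
            # Inside the callback, $_ is the bare filename (File::Find has
            # changed into the file's directory) and $File::Find::name is
            # the path starting from $top
            return unless -f $_ and /\.ent$/;

            # Any action could go here; this one records the size in bytes
            # (-s returns a false value for an empty file, hence the || 0)
            $size_of{$File::Find::name} = (-s $_) || 0;
        },
        $top
    );
    return %size_of;
}

# Run as, for example:  perl find_ent_files.pl pdb
if (@ARGV) {
    my %size_of = find_ent_files($ARGV[0]);
    foreach my $path (sort keys %size_of) {
        print "$path\t$size_of{$path}\n";
    }
}
```

Replacing the size lookup with a call to a comparison subroutine would give you the skeleton of the sequencing-run search described at the start of this section.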
