12.6 Bioperl

The Bioperl project is an important collection of Perl code for bioinformatics that has been in development since 1998. Although Bioperl uses the more advanced object-oriented style of Perl program design, it's possible to take an introductory look here at how it's organized and used.

The main focus of Bioperl modules is to perform sequence manipulation, provide access to various biology databases (both local and web-based), and parse the output of various programs.

Bioperl is available at http://www.bioperl.org/. Some of its features rely on having additional Perl modules—available from CPAN (http://www.cpan.org/)—installed. This situation is quite common, and as you do more Perl programming, you'll become familiar with installing modules from CPAN. The Bioperl tutorials include information on installing Bioperl and additional modules for the three major operating systems: Unix or Linux, Mac, and Windows.

Bioperl doesn't provide complete programs. Rather, it provides a fairly large—and growing—set of modules for accomplishing common tasks, including some tasks you've seen in this book. You're responsible for writing the code that holds the modules together. By providing these ready and (usually) easy-to-use modules, Bioperl makes developing bioinformatics applications in Perl faster and easier. There are example programs for most of the modules, which can be examined and modified to get started.
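To make the idea of "code that holds the modules together" concrete, here is a minimal sketch that builds a Bio::Seq object and calls a few of its methods. It assumes the argument names used in current Bioperl releases (-id and -seq); Release 0.7 may differ in details. The subroutine name describe_sequence and the test sequence are made up for illustration. The module is loaded at run time so that the script reports a missing Bioperl installation instead of failing to compile.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Load Bio::Seq at run time; $have_bioperl is false if Bioperl is missing.
my $have_bioperl = eval { require Bio::Seq; 1 };

# Hypothetical helper: wrap a DNA string in a Bio::Seq object and
# summarize it using the object's own methods.
sub describe_sequence {
    my ($id, $dna) = @_;
    my $seq = Bio::Seq->new(-id => $id, -seq => $dna);
    return sprintf "%s: %d bases, reverse complement %s",
        $seq->display_id, $seq->length, $seq->revcom->seq;
}

if ($have_bioperl) {
    print describe_sequence('Test1', 'AGCTTTTCATTC'), "\n";
}
else {
    print "Bioperl does not appear to be installed.\n";
}
```

The script itself contains no sequence-handling logic; it only creates the object and asks it for results, which is the division of labor described above.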

Like many open source projects, Bioperl has suffered from fragmentation and uneven documentation, due to its strictly volunteer and geographically dispersed group of contributors. But the work leading up to Release 0.7 in March 2001 has significantly improved matters. In particular, there is now enough tutorial information on using the modules to enable you to make good use of the code.

Some difficulties remain. Most of the code has been developed on Unix or Linux systems. Not all of it works on Mac or Windows operating systems, although most of it does. There are some documents available at the Bioperl web site that discuss using Bioperl on non-Unix computers, but the bottom line is that you might find that some things don't work.

If you're going to give Bioperl a try (and I strongly recommend you do), you should make sure you have a fairly recent version of Perl installed. You'll need at least Version 5.004; it would be much better to install the latest stable release from the Perl web site http://www.perl.com.
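One quick way to check the version requirement is from inside Perl itself. The following sketch uses the built-in variable $], which holds the version of the running interpreter as a decimal number. The subroutine name version_ok is made up for illustration, and the script deliberately avoids newer features so that it also runs on old interpreters.

```perl
#!/usr/bin/perl
use strict;

# $] holds the interpreter version as a number, e.g. 5.006001 for Perl 5.6.1.
my $MINIMUM = 5.004;    # the minimum version named in the text

# Return 1 if $have is at least $need, otherwise 0.
sub version_ok {
    my ($have, $need) = @_;
    return $have >= $need ? 1 : 0;
}

if (version_ok($], $MINIMUM)) {
    printf "Perl %s meets the minimum of %s\n", $], $MINIMUM;
}
else {
    printf "Perl %s is older than %s; please upgrade\n", $], $MINIMUM;
}
```

You can get the same information from the command line with perl -v.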

12.6.1 Sample Modules

To give you an idea of what tasks Bioperl can make easier for you, Table 12-1 displays a representative sample of some of the most useful modules available.

Table 12-1. Bioperl modules

Module                                Description
------                                -----------
Bio::Seq                              Sequence object, with features
Bio::SimpleAlign                      Multiple alignments held as a set of sequences
Bio::Species                          Generic species object
Bio::DB::Ace                          Database object interface to ACeDB servers
Bio::DB::GDB                          Database object interface to GDB HTTP query
Bio::DB::GenBank                      Database object interface to GenBank
Bio::DB::GenPept                      Database object interface to GenPept
Bio::DB::NCBIHelper                   A collection of routines useful for queries to NCBI databases
Bio::DB::SwissProt                    Database object interface to SWISS-PROT retrieval
Bio::Index::Fasta                     Interface for indexing FASTA files
Bio::Index::GenBank                   Interface for indexing GenBank seq files, that is, flat files in GenBank format
Bio::Location::Simple                 Implementation of a simple location on a sequence
Bio::Location::Split                  Implementation of a location on a sequence that has multiple locations
Bio::SeqFeature::FeaturePair          Holds pair feature information, e.g., BLAST hits
Bio::SeqFeature::Generic              Generic SeqFeature
Bio::SeqFeature::Similarity           Sequence feature based on similarity
Bio::SeqFeature::SimilarityPair       Sequence feature based on the similarity of two sequences
Bio::SeqFeature::Gene::Exon           Feature representing an exon
Bio::SeqFeature::Gene::GeneStructure  Feature representing an arbitrarily complex structure of a gene
Bio::SeqFeature::Gene::Transcript     Feature representing a transcript
Bio::SeqFeature::Gene::TranscriptI    Interface for a feature representing a transcript of exons, promoter, UTR, and a poly-adenylation site
Bio::Tools::Blast                     Bioperl BLAST sequence analysis object
Bio::Tools::BPbl2seq                  Lightweight BLAST parser for pair-wise sequence alignment using the BLAST algorithm
Bio::Tools::BPlite                    Lightweight BLAST parser
Bio::Tools::BPpsilite                 Lightweight BLAST parser for PSIBLAST reports
Bio::Tools::CodonTable                Bioperl codon table object
Bio::Tools::Fasta                     Bioperl FASTA utility object
Bio::Tools::IUPAC                     Generates unique seq objects from an ambiguous seq object
Bio::Tools::RestrictionEnzyme         Bioperl object for a restriction endonuclease object
Bio::Tools::SeqPattern                Bioperl object for a sequence pattern or motif
Bio::Tools::SeqStats                  Object holding statistics for one particular sequence
Bio::Tools::SeqWords                  Object holding n-mer statistics for one sequence
Bio::Tools::Blast::HSP                Bioperl BLAST high-scoring segment pair object
Bio::Tools::Blast::HTML               Bioperl utility module for HTML-formatting BLAST reports
Bio::Tools::Blast::Sbjct              Bioperl BLAST "hit" object
Bio::Tools::Blast::Run::LocalBlast    Bioperl module for running BLAST analyses locally
Bio::Tools::Blast::Run::Webblast      Bioperl module for running BLAST analyses using an HTTP interface
Bio::Tools::Prediction::Exon          Predicted exon feature
Bio::Tools::Prediction::Gene          Predicted gene structure feature
Bio::Variation::AAChange              Sequence change class for polypeptides
Bio::Variation::AAReverseMutate       Point mutation and codon information from single amino acid changes
Bio::Variation::Allele                Sequence object with allele-specific attributes
Bio::Variation::DNAMutation           DNA-level mutation class
Bio::Variation::IO                    Handler for sequence variation I/O formats

12.6.2 Bioperl Tutorial Script

Bioperl has a tutorial script to help you try out various parts of the package. In this section, I'll show how to start up and run some example computations.

I've mentioned already that you should learn how to download code from CPAN in order to add modules such as Bioperl. A great deal of the usefulness of the Perl programming environment now resides in these modules available on CPAN. This was a design decision: by concentrating on the core Perl language, the Perl designers can focus on making the language as good as they can. The Perl module developers can then concentrate on their many modules. By all means, take a look around the CPAN web site for an idea of the wealth of Perl modules available to you.

I won't give the details of how to install Bioperl here: as mentioned, they are available at the Bioperl web site, or you can visit the CPAN web site for information.

So, let's assume you've installed the Bioperl module and looked over the tutorial at the Bioperl web site. Now, let's see how to try out some Bioperl programs.

Go to the directory where the Bioperl software has been built on your system. For instance, on my Linux computer, I put the download file bioperl-0.7.0.tar.gz into the directory /usr/local/src, and then unpacked it with the command:

tar xvzf bioperl-0.7.0.tar.gz

which creates the source directory /usr/local/src/bioperl-0.7.0. After installing the module (check the documentation), you're ready to run the tutorial script.

Change to the source directory and type perl bptutorial.pl. Here's the result (I've shown the head of the tutorial to give the author and copyright information):

% head bptutorial.pl 
# $Id: ch12,v 1.44 2001/10/10 20:37:42 troutman Exp mam $

=head1  BioPerl Tutorial

  Cared for by Peter Schattner <schattner@alum.mit.edu>

  Copyright Peter Schattner

   This tutorial includes "snippets" of code and text from various
   Bioperl documents including module documentation, example scripts
% perl bptutorial.pl 

The following numeric arguments can be passed to run the corresponding demo-script.
1 => access_remote_db ,
2 => index_local_db ,
3 => fetch_local_db ,               (# NOTE: needs to be run with demo 2)
4 => sequence_manipulations ,
5 => seqstats_and_seqwords ,
6 => restriction_and_sigcleave ,
7 => other_seq_utilities ,
8 => run_standaloneblast ,
9 => blast_parser ,
10 => bplite_parsing ,
11 => hmmer_parsing ,
12 => run_clustalw_tcoffee ,
13 => run_psw_bl2seq ,
14 => simplealign_univaln ,
15 => gene_prediction_parsing ,
16 => sequence_annotation ,
17 => largeseqs ,
18 => liveseqs ,
19 => demo_variations ,
20 => demo_xml ,

In addition the argument "100" followed by the name of a single
bioperl object will display a list of all the public methods
available from that object and from what object they are inherited.

Using the parameter "0" will run all tests.
Using any other argument (or no argument) will run this display.

So typical command lines might be:
To run all demo scripts:
 > perl -w  bptutorial.pl 0
or to just run the local indexing demos:
 > perl -w  bptutorial.pl 2 3
or to list all the methods available for object Bio::Tools::SeqStats -
 > perl -w  bptutorial.pl 100 Bio::Tools::SeqStats

%
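A script that maps numeric arguments to routines, as this one does, is often written in Perl with a dispatch table: a hash whose values are subroutine references. The sketch below shows the idiom only. It is not the actual code of bptutorial.pl, and the three placeholder routines simply return their names.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-ins for three of the demos; each just returns its name.
my %demo_for = (
    1 => sub { return "access_remote_db" },
    4 => sub { return "sequence_manipulations" },
    9 => sub { return "blast_parser" },
);

# Look up each numeric argument in the table and call the matching routine.
# Returns one result line per argument, in the order given.
sub run_demos {
    my @args = @_;
    my @results;
    for my $arg (@args) {
        if (exists $demo_for{$arg}) {
            push @results, "demo $arg: " . $demo_for{$arg}->();
        }
        else {
            push @results, "demo $arg: not available in this sketch";
        }
    }
    return @results;
}

if (@ARGV) {
    print "$_\n" for run_demos(@ARGV);
}
else {
    # With no arguments, list the choices.
    print "Available demos: ",
        join(', ', sort { $a <=> $b } keys %demo_for), "\n";
}
```

Adding a new demo to a script organized this way means adding one entry to the hash; the lookup code doesn't change.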

Now let's try option 9, the BLAST parser, and option 1, access_remote_db. So here goes, starting with the BLAST parser:

% perl bptutorial.pl 9

Beginning blast.pm parser example... 

QUERY NAME     : gi|1401126
QUERY DESC     : UNKNOWN
LENGTH         : 504
FILE           : t/blast.report
DATE           : Thu, 16 Apr 1998 18:56:18 -0400
PROGRAM        : TBLASTN
VERSION        : 2.0.4 [Feb-24-1998]
DB-NAME        : Non-redundant GenBank+EMBL+DDBJ+PDB sequences
DB-RELEASE     : Apr 16, 1998  9:38 AM
DB-LETTERS     : 677679054
DB-SEQUENCES   : 336723
GAPPED         : YES
TOTAL HITS     : 100
CHECKED ALL    : YES
FILT FUNC      : NO
SIGNIF HITS    : 4
SIGNIF CUTOFF  : 1.0e-05 (EXPECT-VALUE)
LOWEST EXPECT  : 0.0
HIGHEST EXPECT : 1e-05
HIGHEST EXPECT : 7.6 (OVERALL)
MATRIX         : BLOSUM62
FILTER         : NONE
EXPECT         : 10
LAMBDA, K, H   : 0.270, 0.0470, 0.230 (SHARED STATS)
WORD SIZE      : 13
S              : 42, 74 (SHARED STATS)
GAP CREATION   : 11
GAP EXTENSION  : 1

Number of hits is 4 
Fraction identical for hit 1 is 0.25 
Sequence identities for hsp of hit 1 are 66-68 70 73 76 79 80 87-89 114 117
119 131 144 146 149 150 152 156 162 165 168 170 171 176 178-182 184 187 190
191 205-207 211 214 217 222 226 241 244 245 249 256 266-268 270 278 284 291
296 304 306 309 311 316 319 324 
%
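The list of sequence identities above uses a compact notation: judging from the listing, runs of three or more consecutive positions are written as a range (66-68), while shorter runs are written out (79 80). Here is a plain Perl sketch of that formatting. The three-or-more rule is inferred from the output shown and is not taken from the Bioperl source; the subroutine name collapse_ranges is made up, and the input is assumed to contain distinct positions.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Turn a list of distinct integer positions into a string in which
# runs of three or more consecutive values appear as "first-last".
sub collapse_ranges {
    my @positions = sort { $a <=> $b } @_;
    my @pieces;
    while (@positions) {
        my $start = my $end = shift @positions;
        # Extend the current run while the next position is adjacent.
        while (@positions && $positions[0] == $end + 1) {
            $end = shift @positions;
        }
        if ($end - $start >= 2) {
            push @pieces, "$start-$end";      # run of three or more
        }
        else {
            push @pieces, $start .. $end;     # one or two values, listed singly
        }
    }
    return join ' ', @pieces;
}

print collapse_ranges(66, 67, 68, 70, 73, 76, 79, 80, 87, 88, 89), "\n";
```

Given the positions in this example, the subroutine produces the same text as the start of the list in the tutorial output.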

This is an interesting way to parse BLAST output! Now let's look at the access of the remote DB:

% perl bptutorial.pl 1
Beginning remote database access example... 
seq1 display id is MUSIGHBA1 
seq2 display id is AF303112 
Display id of first sequence in stream is AF041456
% 

Well, that output was less informative, but you can infer that the remote DB access was successful. (By the way, if you're unsuccessful with this, you may be behind a firewall that is denying access, which is not uncommon in universities and large companies.)

The documentation suggests running the bptutorial.pl script under the Perl debugger to watch what happens step by step. I concur with that suggestion but won't include the output here. Try it yourself!

Since that last example wasn't much fun, let's try one more: here's the sequence manipulation tutorial:

% perl bptutorial.pl 4

Beginning sequence_manipulations and SeqIO example... 
First sequence in fasta format... 
>Test1
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTC
TGATAGCAGCTTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGG
TCACTAAATACTTTAACCAATATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTAC
ACAACATCCATGAAACGCATTAGCACCACC
Seq object display id is Test1
Sequence is AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAG
CAGCTTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCAATATA
GGCATAGCGCACAGACAGATAAAAATTACAGAGTACACAACATCCATGAAACGCATTAGCACCACC 
Sequence from 5 to 10 is TTTCAT 
Acc num is unknown 
Moltype is dna 
Primary id is Test1 
Truncated Seq object sequence is TTTCAT 
Reverse complemented sequence 5 to 10  is GTGCTA  
Translated sequence 6 to 15 is LQRAICLCVD 

Beginning 3-frame and alternate codon translation example... 
ctgagaaaataa translated using method defaults   : LRK*
ctgagaaaataa translated as a coding region (CDS): MRK

Translating in all six frames:
 frame: 0 forward: LRK*
 frame: 0 reverse-complement: LFSQ
 frame: 1 forward: *ENX
 frame: 1 reverse-complement: YFLX
 frame: 2 forward: EKI
 frame: 2 reverse-complement: IFS
Translating with all codon tables using method defaults:
1 : LRK*
2 : L*K*
3 : TRK*
4 : LRK*
5 : LSK*
6 : LRKQ
9 : LSN*
10 : LRK*
11 : LRK*
12 : SRK*
13 : LGK*
14 : LSNY
15 : LRK*
16 : LRK*
21 : LSN*
% 

That was more fun, because this part of Bioperl is doing several things we've done in this book.
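To connect this output with the techniques from earlier chapters, here is a plain Perl sketch that reproduces the subsequence and the six translation frames shown above. It does not use Bioperl and is not how Bioperl implements these operations. It assumes the standard genetic code only, and it assumes, based on the listing, that two leftover bases at the end of a frame print as X while a single leftover base is dropped. The CDS translation (MRK) and the alternate codon tables are not covered.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Standard genetic code; codons are ordered by base in the order T, C, A, G.
my $AMINO_ACIDS = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG';
my %BASE_INDEX  = (T => 0, C => 1, A => 2, G => 3);

# Subsequence using 1-based, inclusive coordinates, as in the demo output.
sub subseq {
    my ($dna, $start, $end) = @_;
    return substr($dna, $start - 1, $end - $start + 1);
}

# Reverse complement of a DNA string (unambiguous bases only).
sub revcom {
    my ($dna) = @_;
    my $rc = scalar reverse $dna;
    $rc =~ tr/ACGTacgt/TGCAtgca/;
    return $rc;
}

# Translate DNA beginning at a 0-based frame offset (0, 1, or 2).
sub translate {
    my ($dna, $frame) = @_;
    $dna = uc substr($dna, $frame || 0);
    my $protein = '';
    while (length $dna >= 3) {
        my $codon = substr($dna, 0, 3, '');    # remove the next codon
        my @i = map { $BASE_INDEX{$_} } split //, $codon;
        $protein .= (grep { !defined $_ } @i)
            ? 'X'                              # codon contains a non-ACGT base
            : substr($AMINO_ACIDS, 16 * $i[0] + 4 * $i[1] + $i[2], 1);
    }
    # Assumption from the listing: two leftover bases give X, one is dropped.
    $protein .= 'X' if length $dna == 2;
    return $protein;
}

print "Sequence from 5 to 10 is ", subseq('AGCTTTTCATTC', 5, 10), "\n";

my $dna = 'ctgagaaaataa';
for my $frame (0 .. 2) {
    printf " frame: %d forward: %s\n", $frame, translate($dna, $frame);
    printf " frame: %d reverse-complement: %s\n",
        $frame, translate(revcom($dna), $frame);
}
```

Note the coordinate convention: the demo counts bases from 1 and includes both endpoints, whereas Perl's substr counts from 0 and takes a length, which is why subseq adjusts both arguments.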

I hope this brief look at Bioperl has whetted your appetite for more. It's a good idea to explore this set of modules. A Perl module for parsing BLAST output called BPlite.pm may also be of interest: it's now part of the Bioperl project (see Bio::Tools::BPlite in Table 12-1).

© 2002, O'Reilly & Associates, Inc.