6 BioPerlTutorial - a tutorial for bioperl
10 Written by Peter Schattner <schattner@alum.mit.edu>
12 Copyright Peter Schattner
14 Contributions, additions and corrections have been made
15 to this document by the following individuals:
26 This tutorial includes "snippets" of code and text from various
27 Bioperl documents including module documentation, example scripts
28 and "t" test scripts. You may distribute this tutorial under the
29 same terms as perl itself.
31 This document is written in Perl POD (plain old documentation)
32 format. You can run this file through your favorite pod translator
33 (pod2html, pod2man, pod2text, etc.) if you would like a more
34 convenient formatting.
40 I.2 Software requirements
41 I.2.1 For minimal bioperl installation
42 I.2.2 For complete installation
43 I.3 Installation procedures
44 I.4 Additional comments for non-unix users
46 II. Brief overview to bioperl's objects
47 II.1 Sequence objects:
48 (Seq, PrimarySeq, LocatableSeq, LiveSeq, LargeSeq, RichSeq, SeqWithQuality, SeqI)
49 II.2 Alignment objects (SimpleAlign)
50 II.3 Location objects (Simple, Split, Fuzzy)
51 II.4 Interface objects and implementation objects
54 III.1 Accessing sequence data from local and remote databases
55 III.1.1 Accessing remote databases (Bio::DB::GenBank, etc)
56 III.1.2 Indexing and accessing local databases (Bio::Index::*, bpindex.pl,
58 III.2 Transforming formats of database/ file records
59 III.2.1 Transforming sequence files (SeqIO)
60 III.2.2 Transforming alignment files (AlignIO)
61 III.3 Manipulating sequences
62 III.3.1 Manipulating sequence data with Seq methods (Seq)
63 III.3.2 Obtaining basic sequence statistics- MW, residue &codon frequencies
65 III.3.3 Identifying restriction enzyme sites (RestrictionEnzyme)
66 III.3.4 Identifying amino acid cleavage sites (Sigcleave)
67 III.3.5 Miscellaneous sequence utilities: OddCodes, SeqPattern
68 III.3.6 Sequence manipulation using the Bioperl EMBOSS interface
69 III.3.7 Sequence manipulation without creating Bioperl "objects"
70 III.4 Searching for "similar" sequences
71 III.4.1 Running BLAST locally (StandAloneBlast)
72 III.4.2 Running BLAST remotely (using RemoteBlast.pm)
73 III.4.3 Parsing BLAST and FASTA reports with Search and SearchIO
74 III.4.4 Parsing BLAST reports with BPlite, BPpsilite, and BPbl2seq
75 III.4.5 Parsing HMM reports (HMMER::Results, SearchIO)
76 III.5 Creating and manipulating sequence alignments
77 III.5.1 Aligning 2 sequences with Smith-Waterman (pSW)
78 III.5.2 Aligning 2 sequences with Blast using bl2seq and AlignIO
79 III.5.3 Aligning multiple sequences (Clustalw.pm, TCoffee.pm)
80 III.5.4 Manipulating and displaying alignments (SimpleAlign)
81 III.6 Searching for genes and other structures on genomic DNA
82 (Genscan, Sim4, ESTScan, MZEF, Grail, Genemark, EPCR)
83 III.7 Developing machine readable sequence annotations
84 III.7.1 Representing sequence annotations (Annotation, SeqFeature, RichSeq)
85 III.7.2 Representing and large and/or changing sequences (LiveSeq,LargeSeq)
86 III.7.3 Representing related sequences - mutations, polymorphisms etc (Allele,
88 III.7.4 Incorpotating quality data in sequence annotation (SeqWithQuality)
89 III.7.5 Sequence XML representations - generation and parsing (SeqIO::game)
90 III.8 Representing non-sequence data in Bioperl: structures, trees, maps, graphics
91 and bibliographic text
92 III.8.1 Using 3D structure objects and reading PDB files (StructureI,
94 III.8.2 Tree objects and phylogenetic trees (Tree::Tree, TreeIO)
95 III.8.3 Map objects for manipulating genetic maps (Map::MapI, MapIO)
96 III.8.4 Bibliographic objects for querying bibliographic databases (Biblio)
97 III.8.5 Graphics objects for representing sequence objects as images (Graphics)
99 III.9 Bioperl alphabets
100 III.9.1 Extended DNA / RNA alphabet
101 III.9.2 Amino Acid alphabet
103 IV. Related projects - biocorba, biopython, biojava, EMBOSS, Ensembl, GFF, Genquire
105 IV.2 Biopython and biojava
107 IV.4 Ensembl and bioperl-db
108 IV.5 GFF format and Bio::DB::GFF*
109 IV.6 Genquire, the Annotation Workbench and bioperl-gui
112 V.1 Finding out which methods are used by which Bioperl Objects
113 V.2 Tutorial Demo Scripts
115 =head1 I. Introduction
119 Bioperl is a collection of perl modules that facilitate the
120 development of perl scripts for bioinformatics applications. As
121 such, it does not include ready to use programs in the sense that many
122 commercial packages and free web-based interfaces (eg Entrez, SRS) do.
123 On the other hand, bioperl does provide reusable perl modules that
124 facilitate writing perl scripts for sequence manipulation, accessing
125 of databases using a range of data formats and execution and parsing
126 of the results of various molecular biology programs including Blast,
127 clustalw, TCoffee, genscan, ESTscan and HMMER. Consequently, bioperl
128 enables developing scripts that can analyze large quantities of
129 sequence data in ways that are typically difficult or impossible with
132 In order to take advantage of bioperl, the user needs a basic
133 understanding of the perl programming language including an
134 understanding of how to use perl references, modules, objects and
135 methods. If these concepts are unfamiliar the user is referred to any
136 of the various introductory / intermediate books on perl. (I've liked
137 S. Holzmer's Perl Core Language, Coriolis Technology Press, for
138 example). This tutorial is not intended to teach the fundamentals of
139 perl to those with little or no experience in the perl language. On
140 the other hand, advanced knowledge of perl - such as how to write a
141 perl object - is not required for successfully using bioperl.
143 Bioperl is open source software that is still under active
144 development. The advantages of open source software are well known.
145 They include the ability to freely examine and modify source code and
146 exemption from software licensing fees. However, since open source
147 software is typically developed by a large number of volunteer
148 programmers, the resulting code is often not as clearly organized and
149 its user interface not as standardized as in a mature commercial
150 product. In addition, in any project under active development,
151 documentation may not keep up with the development of new features.
152 Consequently the learning curve for actively developed, open source
153 source software is sometimes steep.
155 This tutorial is intended to ease the learning curve for new users of
156 bioperl. To that end the tutorial includes:
162 Descriptions of what bioinformatics tasks can be handled with bioperl
166 Directions on where to find the methods to accomplish these tasks
167 within the bioperl package
171 Recommendations on where to go for additional information.
175 The POD documentation should contain runnable code in the SYNOPSIS section
176 which is meant to illustrate the use of a module and its methods.
180 Running the bptutorial.pl script while going through this tutorial - or
181 better yet, stepping through it with an interactive debugger - is a
182 good way of learning bioperl. The tutorial script is also a good
183 place from which to cut-and-paste code for your scripts (rather than
184 using the code snippets in this tutorial). Most of the scripts in the
185 tutorial script should work on your machine - and if they don't it would
186 probably be a good idea to find out why, before getting too involved
189 This tutorial does not intend to be a comprehensive description of all
190 the objects and methods available in bioperl. For that the reader is
191 directed to the documentation included with each of the modules. A
192 very useful interface for finding one's way within all the module
193 documentation can be found at http://doc.bioperl.org/bioperl-live/.
194 This interface lists all bioperl modules and descriptions of all
197 One potential problem in locating the correct documentation is that
198 multiple methods in different modules may all share the same name.
199 Moreover, because of perl's complex method of "inheritance",
200 it is not often clear which of the identically named methods is being
201 called by a given object. One way to resolve this question is by using
202 the software described in Appendix V.1.
204 For those who prefer more visual descriptions,
205 http://bioperl.org/Core/Latest/modules.html also offers links to three
206 PDF files which contain schematics that describe how many of the bioperl
207 objects related to one another.
209 In addition, a bioperl online course is available on the web at
210 http://www.pasteur.fr/recherche/unites/sis/formation/bioperl. The
211 user is also referred to numerous bioperl scripts in the scripts/
212 directory (see bioscripts.pod for a description of these scripts).
214 =head2 I.2 Software requirements
218 =head2 I.2.1 Minimal bioperl installation
220 For a "minimal" installation of bioperl, you will need to have perl
221 itself installed as well as the bioperl "core modules". Bioperl has
222 been tested primarily using perl 5.005 and perl 5.6.
223 The minimal bioperl installation should still work under perl 5.004.
224 However, as increasing numbers of bioperl objects are using modules
225 from CPAN (see below), problems have been observed for bioperl running
226 under perl 5.004. So if you are having trouble running bioperl under
227 perl 5.004, you should probably upgrade your version of perl.
229 In addition to a current version of perl, the new user of bioperl is
230 encouraged to have access to, and familiarity with, an interactive
231 perl debugger. Bioperl is a large collection of complex interacting
232 software objects. Stepping through a script with an interactive
233 debugger is a very helpful way of seeing what is happening in such a
234 complex software system - especially when the software is not behaving
235 in the way that you expect. The free graphical debugger ptkdb
236 (available as Devel::ptkdb from CPAN) is highly recommended. Active
237 State, from http://www.activestate.com , offers a commercial graphical
238 debugger for windows systems. The standard perl distribution also
239 contains a powerful interactive debugger - though with a more cumbersome
240 command-line interface. The Perl tool Data::Dumper used with the
244 printer Dumper($seq);
246 can also be helpful for obtaining debugging information on perl objects.
248 =head2 I.2.2 Complete installation
250 Taking full advantage of bioperl requires software beyond that for the
251 minimal installation. This additional software includes perl modules
252 from CPAN, bioperl perl extensions, a bioperl xs-extension, and
253 several standard compiled bioinformatics programs.
257 For a complete listing of external Perl modules required by bioperl
258 please see the L<INSTALL> file.
260 B<Bioperl C extensions & external bioinformatics programs>
262 Bioperl also uses several C programs for sequence alignment and local
263 blast searching. To use these features of bioperl you will need an
264 ANSI C or Gnu C compiler as well as the actual program available from
267 for Smith-Waterman alignments- bioperl-ext-0.6 from
268 http://bioperl.org/Core/external.shtml
270 for clustalw alignments-
271 ftp://ftp-igbmc.u-strasbg.fr/pub/ClustalW/
273 for tcoffee alignments-
274 http://igs-server.cnrs-mrs.fr/~cnotred/Projects_home_page/t_coffee_home_page.html
276 for local blast searching- ftp://ftp.ncbi.nlm.nih.gov/blast/server/current_release/
278 for EMBOSS applications - http://www.hgmp.mrc.ac.uk/Software/EMBOSS/download.html
280 =for html <A NAME ="i.3"></A>
282 =head2 I.3 Installation
284 The actual installation of the various system components is
285 accomplished in the standard manner:
291 Locate the package on the network
299 Decompress (with gunzip or a similiar utility)
303 Remove the file archive (eg with tar -xvf)
307 Create a "makefile" (with "perl Makefile.PL" for perl modules or a
308 supplied "install" or "configure" program for non-perl program
312 Run "make", "make test" and "make install" This procedure must be
313 repeated for every CPAN module, bioperl-extension and external
314 module to be installed. A helper module CPAN.pm is available from
315 CPAN which automates the process for installing the perl modules.
317 The CPAN module can also be used to install all of the modules
318 listed above in a single step as a "bundle" of modules,
321 $>perl -MCPAN -e shell
322 cpan>install Bundle::BioPerl
323 <installation details....>
324 cpan>install B/BI/BIRNEY/bioperl-1.0.1.tar.gz
325 <installation details....>
328 Be advised that version numbers change regularly, so the number used
329 above may not apply. The disadvantage of this approach is that if
330 there's a problem installing any individual module it may be a bit
331 more difficult to address.
335 For the external programs (clustal, Tcoffee, ncbi-blast), there is an
342 Set the relevant environmental variable (CLUSTALDIR, TCOFFEEDIR or
343 BLASTDIR) to the directory holding the executable in your startup
344 file - eg in .bashrc. (For running local blasts, it is also
345 necessary that the name of local-blast database directory is known
346 to bioperl. This will typically happen automatically, but in case
347 of difficulty, refer to the documentation in
348 L<Bio::Tools::Run::StandAloneBlast>)
352 The only likely complication (at least on unix systems) that may occur
353 is if you are unable to obtain system level writing privileges. For
354 instructions on modifying the installation in this case and for more
355 details on the overall installation procedure, see the README file in
356 the bioperl distribution as well as the README files in the external
357 programs you want to use (eg bioperl-ext, clustalw, TCoffee,
360 =head2 I.4 Additional comments for non-unix users
362 Bioperl has mainly been developed and tested under various unix
363 environments (including Linux) and this tutorial is intended primarily
364 for unix users. The minimal installation of bioperl should work
365 under other OS's (NT, windows,Mac). However, bioperl has not been
366 widely tested under these OS's.
368 Todd Richmond has written of his experiences with BioPerl on MacOS 9
369 at http://bioperl.org/Core/mac-bioperl.html. There is also a
370 description of bioperl on windows by Jurgen Pletinckx at
371 http://www.bioperl.org/Core/windows-bioperl.html. (Note that currently
372 these documents describe release 0.7.x of bioperl.) Minimal bioperl
373 does run without problems on Mac OS X since it is a Unix system. However,
374 external precompiled programs (eg NCBI local Blast) and other useful
375 auxiliary programs such as perl-TK and ptkdb are in many cases not yet
376 available under OS X.
378 Steve Cannon has compiled installation notes and suggestions for Bioperl
379 on OS X online at http://www.tc.umn.edu/~cann0010/Bioperl_OSX_install.html.
381 Many bioperl features require the use of CPAN modules, compiled
382 extensions or external programs. These features will probably will
383 not work under some or all of these other operating systems. If a
384 script attempts to access these features from a non-unix OS, bioperl
385 is designed to simply report that the desired capability is not available.
386 However, since the testing of bioperl in these environments has been
387 limited, the script may well crash in a less "graceful" manner.
389 =head1 II. Brief introduction to bioperl's objects
391 The purpose of this tutorial is to get you using bioperl to solve
392 real-life bioinformatics problems as quickly as possible. The aim is
393 not to explain the structure of bioperl objects or perl
394 object-oriented programming in general. Indeed, the relationships
395 among the bioperl objects is not simple; however, understanding them
396 in detail is fortunately not necessary for successfully using the
399 Nevertheless, a little familiarity with the bioperl object "bestiary"
400 can be very helpful even to the casual user of bioperl. For example
401 there are (at least) seven different "sequence objects" - Seq,
402 PrimarySeq, LocatableSeq, LiveSeq, LargeSeq, SeqI, and SeqWithQuality.
403 Understanding the relationships among these objects - and why there are
404 so many of them - will help you select the appropriate one to use in your
407 =for html <A NAME ="ii.1"></A>
409 =head2 II.1 Sequence objects: (Seq, RichSeq, SeqWithQuality, PrimarySeq, LocatableSeq, LiveSeq, LargeSeq, SeqI)
411 Seq is the central sequence object in bioperl. When in doubt this is
412 probably the object that you want to use to describe a DNA, RNA or
413 protein sequence in bioperl. Most common sequence manipulations can
414 be performed with Seq. These capabilities are described in sections
415 L<"III.3.1"> and L<"III.7.1">, or in L<Bio::Seq>.
417 Seq objects can be created explicitly (see section L<"III.2.1"> for an
418 example). However usually Seq objects will be created for you
419 automatically when you read in a file containing sequence data using
420 the SeqIO object. This procedure is described in section L<"III.2.1">.
421 In addition to storing its identification labels and the sequence itself,
422 a Seq object can store multiple annotations and associated "sequence
423 features". This capability can be very useful - especially in
424 development of automated genome annotation systems, see section
427 RichSeq objects store additional annotations beyond those used by
428 standard Seq objects. If you are using sources with very rich
429 sequence annotation, you may want to consider using these objects
430 which are described in section L<"III.7.1">. SeqWithQuality objects are
431 used to manipulate sequences with quality data, like those produced by
432 phred. These objects are described in section L<"III.7.4">,
433 L<Bio::Seq::RichSeqI>, and in L<Bio::Seq::SeqWithQuality>.
435 On the other hand, if you need a script capable of simultaneously
436 handling many (hundreds or thousands) sequences at a time, then the
437 overhead of adding annotations to each sequence can be significant.
438 For such applications, you will want to use the PrimarySeq
439 object. PrimarySeq is basically a "stripped down" version of Seq.
440 It contains just the sequence data itself and a few identifying labels
441 (id, accession number, molecule type = dna, rna, or protein). For
442 applications with hundreds or thousands or sequences, using PrimarySeq
443 objects can significantly speed up program execution and decrease the
444 amount of RAM the program requires. See L<Bio::PrimarySeq> for more
447 What is (for historical reasons) called a LocatableSeq object
448 might be more appropriately called an "AlignedSeq" object. It is a Seq
449 object which is part of a multiple sequence alignment. It has "start" and
450 "end" positions indicating from where in a larger sequence it may have
451 been extracted. It also may have "gap" symbols corresponding to the
452 alignment to which it belongs. It is used by the alignment
453 object SimpleAlign and other modules that use SimpleAlign objects
454 (eg AlignIO.pm, pSW.pm). In general you don't have to
455 worry about creating LocatableSeq objects because they will be made for
456 you automatically when you create an alignment (using pSW, Clustalw,
457 Tcoffee or bl2seq) or when you input an alignment data file using AlignIO.
458 However if you need to input a sequence alignment by hand (eg to build
459 a SimpleAlign object), you will need to input the sequences as
460 LocatableSeqs. Other sources of information include L<Bio::LocatableSeq>,
461 L<Bio::SimpleAlign>, L<Bio::AlignIO>, and L<Bio::Tools::pSW>.
463 A LargeSeq object is a special type of Seq object used for
464 handling very long ( eg E<gt> 100 MB) sequences. If you need to
465 manipulate such long sequences see section L<"III.7.2"> which describes
466 LargeSeq objects, or L<Bio::Seq::LargeSeq>.
468 A LiveSeq object is another specialized object for storing
469 sequence data. LiveSeq addresses the problem of features whose location
470 on a sequence changes over time. This can happen, for example, when
471 sequence feature objects are used to store gene locations on newly
472 sequenced genomes - locations which can change as higher quality
473 sequencing data becomes available. Although a LiveSeq object is not
474 implemented in the same way as a Seq object, LiveSeq does implement
475 the SeqI interface (see below). Consequently, most methods available
476 for Seq objects will work fine with LiveSeq objects. Section L<"III.7.2">
477 and L<Bio::LiveSeq> contain further discussion of LiveSeq objects.
479 SeqI objects are Seq "interface objects" (see section L<"II.4"> and
480 L<Bio::SeqI>). They are used to ensure bioperl's compatibility with
481 other software packages. SeqI and other interface objects are not
482 likely to be relevant to the casual bioperl user.
484 *** Having described these other types of sequence objects, the
485 "bottom line" still is that if you store your sequence data in Seq
486 objects (which is where they'll be if you read them in with
487 SeqIO), you will usually do just fine. ***
489 =for html <A NAME ="ii.2"></A>
491 =head2 II.2 Alignment objects (SimpleAlign)
493 This module allows the user to convert between alignment formats
494 as well as more sophisticated operations, like extracting specific regions
495 of the alignment and generating consensus sequences. For more information
496 see section L<"III.5.4"> and L<Bio::SimpleAlign>.
499 =for html <A NAME ="ii.3"></A>
501 =head2 II.3 Location objects
503 A Location object is designed to be associated with a Sequence
504 Feature object to indicate where on a larger structure (eg a chromosome or
505 contig) the feature can be found. The reason why this simple concept has
506 evolved in a collection of rather complicated objects is that
508 1) Some objects have multiple locations or sub-locations (e.g. a gene's exons
509 may have multiple start and stop locations)
510 2) In unfinished genomes, the precise locations of features is not known with
513 Bioperl's various Location objects address these complications. In addition
514 there are "CoordinatePolicy" objects that allow the user to specify how to
515 measure the "length" of a feature if its precise start and end coordinates are
516 not know. In most cases, you will not need to worry about these complications
517 if you are using bioperl to handle simple features with well-defined start
518 and stop locations. However, if you are using bioperl to annotate partially
519 or unfinished genomes or to read annotations of such genomes with bioperl,
520 understanding the various Location objects will be important. See the
521 documentation of the various modules in the Bio::Locations directory or
522 L<Bio::Location::CoordinatePolicyI> for more information.
524 =for html <A NAME ="ii.4"></A>
526 =head2 II.4 Interface objects and implementation objects
528 Since release 0.6, bioperl has been moving to separate interface and
529 implementation objects. An interface is solely the definition of what
530 methods one can call on an object, without any knowledge of how it is
531 implemented. An implementation is an actual, working implementation of
532 an object. In languages like Java, interface definition is part of the
533 language. In Perl, you have to roll your own.
535 In bioperl, the interface objects usually have names like
536 Bio::MyObjectI, with the trailing I indicating it is an interface
537 object. The interface objects mainly provide documentation on what the
538 interface is, and how to use it, without any implementations (though
539 there are some exceptions). Although interface objects are not of
540 much direct utility to the casual bioperl user, being aware of their
541 existence is useful since they are the basis to understanding how
542 bioperl programs can communicate with other bioinformatics projects
543 such as Ensembl and the Annotation Workbench (see section IV).
545 For more discussion of design and development issues please see the
548 =head1 III. Using bioperl
551 Bioperl provides software modules for many of the typical tasks of
552 bioinformatics programming. These include:
556 =item * Accessing sequence data from local and remote databases
558 =item * Transforming formats of database/ file records
560 =item * Manipulating individual sequences
562 =item * Searching for "similar" sequences
564 =item * Creating and manipulating sequence alignments
566 =item * Searching for genes and other structures on genomic DNA
568 =item * Developing machine readable sequence annotations
572 The following sections describe how bioperl can help perform all of
575 =head2 III.1 Accessing sequence data from local and remote databases
577 Much of bioperl is focused on sequence manipulation. However, before
578 bioperl can manipulate sequences, it needs to have access to sequence
579 data. Now one can directly enter data sequence data into a bioperl
582 $seq = Bio::Seq->new('-seq'=>'actgtggcgtcaact',
583 '-desc'=>'Sample Bio::Seq object',
584 '-display_id' => 'something',
585 '-accession_number' => 'accnum',
586 '-alphabet' => 'dna' );
588 However, in most cases, it is preferable to access sequence data from
589 some online data file or database (Note that in common with
590 conventional bioinformatics usage we will call a "database" what might
591 be more appropriately referred to as an "indexed flat file".) Bioperl
592 supports accessing remote databases as well as developing indices for
593 setting up local databases.
595 =head2 III.1.1 Accessing remote databases (Bio::DB::GenBank, etc)
597 Accessing sequence data from the principal molecular biology databases
598 is straightforward in bioperl. Data can be accessed by means of the
599 sequence's accession number or id. Batch mode access is also
600 supported to facilitate the efficient retrieval of multiple sequences.
601 For retrieving data from genbank, for example, the code could be as
604 $gb = new Bio::DB::GenBank();
605 # this returns a Seq object :
606 $seq1 = $gb->get_Seq_by_id('MUSIGHBA1');
607 # this returns a Seq object :
608 $seq2 = $gb->get_Seq_by_acc('AF303112'))
609 # this returns a SeqIO object :
610 $seqio = $gb->get_Stream_by_batch("J00522","AF303112","2981014");
612 Bioperl currently supports sequence data retrieval from the genbank,
613 genpept, RefSeq, swissprot, and EMBL databases. See L<Bio::DB::GenBank>,
614 L<Bio::DB::GenPept>, L<Bio::DB::SwissProt>, L<Bio::DB::RefSeq> and
615 L<Bio::DB::EMBL> for more information. A user can also specify a different
616 database mirror for a database - this is especially relevent for the SwissProt
617 resource where there are many ExPaSy mirrors. There are also configuration
618 options for specifying local proxy servers for those behind firewalls.
620 The retrieval of NCBI RefSeqs sequences is supported through a special
621 module called Bio::DB::RefSeq which actually queries an EBI server.
622 Please see L<Bio::DB::RefSeq> before using it as there are some caveats
623 with RefSeq retrieval. RefSeq ids in Genbank begin with "NT_", "NC_",
624 "NG_", "NM_", "NP_", "XM_", "XR_", or "XP_" (for more information see
625 http://www.ncbi.nlm.nih.gov/LocusLink/refseq.html). Bio::DB::GenBank
626 can be used to retrieve entries corresponding to these ids but bear in
627 mind that these are not Genbank entries, strictly speaking. See
628 L<Bio::DB::GenBank> for special details on retrieving entries beginning
629 with "NT_", these are specially formatted "CONTIG" entries.
631 Bioperl also supports retrieval from a remote Ace database. This
632 capability requires the presence of the external AcePerl module. You
633 need to download and install the aceperl module from
634 http://stein.cshl.org/AcePerl/.
636 An additional module is available for accessing remote databases, BioFetch,
637 which queries the dbfetch script at EBI. The available databases are EMBL,
638 GenBank, or SWALL, and the entries can be retrieved in different formats
639 as objects or streams (SeqIO objects), or as "tempfiles". See
640 L<Bio::DB::BioFetch> for the details.
643 =head2 III.1.2 Indexing and accessing local databases (Bio::Index::*, bpindex.pl, bpfetch.pl, Bio::DB::*)
645 Alternately, bioperl permits indexing local sequence data files by
646 means of the Bio::Index or Bio::DB::Fasta objects. The following sequence
647 data formats are supported by Bio::Index: genbank, swissprot, pfam, embl and
648 fasta. Once the set of sequences have been indexed using Bio::Index,
649 individual sequences can be accessed using syntax very similar to that
650 described above for accessing remote databases. For example, if one wants to
651 set up an indexed flat-file database of fasta files, and later wants then to
652 retrieve one file, one could write scripts like:
654 # script 1: create the index
655 use Bio::Index::Fasta; # using fasta file format
656 use strict; # some users have reported that this is required
658 my $Index_File_Name = shift;
659 my $inx = Bio::Index::Fasta->new(
660 -filename => $Index_File_Name,
662 $inx->make_index(@ARGV);
664 # script 2: retrieve some files
665 use Bio::Index::Fasta;
666 use strict; # some users have reported that this is required
668 my $Index_File_Name = shift;
669 my $inx = Bio::Index::Fasta->new($Index_File_Name);
670 foreach my $id (@ARGV) {
671 my $seq = $inx->fetch($id); # Returns Bio::Seq object
672 # do something with the sequence
675 To facilitate the creation and use of more complex or flexible
676 indexing systems, the bioperl distribution includes two sample scripts
677 in the scripts/ directory, bpindex.pl and bpfetch.pl. These scripts
678 can be used as templates to develop customized local data-file indexing
681 Bioperl also supplies Bio::DB::Fasta as a means to index and query Fasta
682 format files. It's similar in spirit to Bio::Index::Fasta but offers more
688 my $db = Bio::DB::Fasta->new($file); # one file or many files
689 my $seqstring = $db->seq($id); # get a sequence as string
690 my $seqobj = $db->get_Seq_by_id($id); # get a PrimarySeq obj
691 my $desc = $db->header($id); # get the header, or description, line
693 This module also offers the user the ability to designate a specific string
694 within the fasta header as the desired id, such as the gi number within the
695 string "gi|4556644|gb|X45555" (use the -makeid option for this capability).
696 See L<Bio::DB::Fasta> for more information on this fully-featured module.
699 =head2 III.2 Transforming formats of database/ file records
703 =for html <A NAME ="iii.2.1"></A>
705 =head2 III.2.1 Transforming sequence files (SeqIO)
707 A common - and tedious - bioinformatics task is that of converting
708 sequence data among the many widely used data formats. Bioperl's
709 SeqIO object, however, makes this chore a breeze. SeqIO can
710 read a stream of sequences - located in a single or in multiple files -
711 in a number of formats: Fasta, EMBL, GenBank, Swissprot, PIR, GCG, SCF,
712 phd/phred, Ace, fastq, or raw (plain sequence). Once the sequence data
713 has been read in with SeqIO, it is available to bioperl in the form of
714 Seq objects. Moreover, the Seq objects can then be written to another
715 file (again using SeqIO) in any of the supported data formats making
716 data converters simple to implement, for example:
719 $in = Bio::SeqIO->new('-file' => "inputfilename",
720 '-format' => 'Fasta');
721 $out = Bio::SeqIO->new('-file' => ">outputfilename",
722 '-format' => 'EMBL');
723 while ( my $seq = $in->next_seq() ) {$out->write_seq($seq); }
725 In addition, perl "tied filehandle" syntax is available to SeqIO,
726 allowing you to use the standard E<lt>E<gt> and print operations to read
727 and write sequence objects, eg:
729 $in = Bio::SeqIO->newFh('-file' => "inputfilename" ,
730 '-format' => 'Fasta');
731 $out = Bio::SeqIO->newFh('-format' => 'EMBL');
732 print $out $_ while <$in>;
734 If the "-format" argument isn't used then Bioperl will guess the format
735 based on the file's suffix in a case-insensitive manner. Here are the
736 current interpretations:
740 fasta fasta|fast|seq|fa|fsa|nt|aa
741 genbank gb|gbank|genbank
744 embl embl|ebl|emb|dat
753 For more information see L<Bio::SeqIO>.
755 =for html <A NAME ="iii.2.2"></A>
757 =head2 III.2.2 Transforming alignment files (AlignIO)
759 Data files storing multiple sequence alignments also appear in varied
760 formats. AlignIO is the bioperl object for data conversion of
761 alignment files. AlignIO is patterned on the SeqIO object and shares
762 most of SeqIO's features. AlignIO currently supports input in the
773 water (needle and water from EMBOSS, see L<"III.3.6">)
781 AlignIO supports output in these formats: fasta, mase, selex, clustalw, msf/gcg,
782 and phylip (interleaved). One significant difference between AlignIO and SeqIO
783 is that AlignIO handles IO for only a single alignment at a time but
784 SeqIO.pm handles IO for multiple sequences in a single stream. Syntax
785 for AlignIO is almost identical to that of SeqIO:
788 $in = Bio::AlignIO->new('-file' => "inputfilename" ,
789 '-format' => 'fasta');
790 $out = Bio::AlignIO->new('-file' => ">outputfilename",
791 '-format' => 'pfam');
792 while ( my $aln = $in->next_aln() ) { $out->write_aln($aln); }
794 The only difference is that here, the returned object reference, $aln,
795 is to a SimpleAlign object rather than a Seq object.
797 AlignIO also supports the tied filehandle syntax described above for
798 SeqIO. Note that currently AlignIO is usable only with SimpleAlign
799 alignment objects. See L<Bio::AlignIO> and section L<"III.5.4"> for more
802 =head2 III.3 Manipulating sequences
804 Bioperl contains many modules with functions for sequence analysis. And
805 if you cannot find the function you want in bioperl you may be able to
806 find it in EMBOSS, which is accessible through bioperl (see L<"III.3.6">).
810 =for html <A NAME ="iii.3.1"></A>
812 =head2 III.3.1 Manipulating sequence data with Seq methods
814 OK, so we know how to retrieve sequences and access them as Seq
815 objects. Let's see how we can use the Seq objects to manipulate our
816 sequence data and retrieve information. Seq provides multiple
817 methods for performing many common (and some not-so-common) tasks of
818 sequence manipulation and data retrieval. Here are some of the most
821 The following methods return strings
823 $seqobj->display_id(); # the human read-able id of the sequence
824 $seqobj->seq(); # string of sequence
825 $seqobj->subseq(5,10); # part of the sequence as a string
826 $seqobj->accession_number(); # when there, the accession number
827 $seqobj->alphabet(); # one of 'dna','rna','protein'
828 $seqobj->primary_id(); # a unique id for this sequence irregardless
829 # of its display_id or accession number
830 $seqobj->desc() # a description of the sequence
832 It is worth mentioning that some of these values correspond to specific
833 fields of given formats. For example, the display_id method returns
834 the LOCUS name of a Genbank entry, the (\S+) following the E<gt> character
835 in a Fasta file, the ID from a SwissProt file, and so on. The desc()
836 method will return the DEFINITION line of a Genbank file, the line
837 following the display_id in a Fasta file, and the DE field in a SwissProt
840 The following methods return an array of Bio::SeqFeature objects
842 $seqobj->top_SeqFeatures # The 'top level' sequence features
843 $seqobj->all_SeqFeatures # All sequence features, including sub
846 Sequence features will be discussed further in section L<"III.7"> on
847 machine-readable sequence annotation. A general description of the
848 object can be found in L<Bio::SeqFeature::Generic>, and a description
849 of related, top-level "annotation" is found in L<Bio::Annotation>.
851 The following methods returns new sequence objects, but do not transfer
854 $seqobj->trunc(5,10) # truncation from 5 to 10 as new object
855 $seqobj->revcom # reverse complements sequence
856 $seqobj->translate # translation of the sequence
858 Note that some methods return strings, some return arrays and some
859 return references to objects. See L<Bio::Seq> for more information.
861 Many of these methods are self-explanatory. However, bioperl's flexible
862 translation methods warrant further comment. Translation in bioinformatics
863 can mean two slightly different things:
867 =item 1 Translating a nucleotide sequence from start to end.
869 =item 2 Taking into account the constraints of real coding regions in mRNAs.
873 For historical reasons the bioperl implementation of translation does
874 the first of these tasks easily. Any sequence object which is not of type
875 'protein' can be translated by simply calling the method which returns
876 a protein sequence object:
878 $translation1 = $my_seq_object->translate;
880 However, the translate method can also be passed several optional
881 parameters to modify its behavior. For example, the first two
882 arguments to "translate" can be used to modify the characters used to
883 represent stop (default '*') and unknown amino acid ('X'). (These are
884 normally best left untouched.) The third argument determines the
885 frame of the translation. The default frame is "0". To get
886 translations in the other two forward frames, we would write:
888 $translation2 = $my_seq_object->translate(undef,undef,1);
889 $translation3 = $my_seq_object->translate(undef,undef,2);
891 The fourth argument to "translate" makes it possible to use
892 alternative genetic codes. There are currently 16 codon tables
893 defined, including tables for 'Verterbate Mitochondrial', 'Bacterial',
894 'Alternative Yeast Nuclear' and 'Ciliate, Dasycladacean and Hexamita
895 Nuclear' translation. These tables are located in the object
896 Bio::Tools::CodonTable which is used by the translate method. For
897 example, for mitochondrial translation:
899 $human_mitochondrial_translation =
900 $my_seq_object->translate(undef,undef,undef, 2);
902 If we want to translate full coding regions (CDS) the way major
903 nucleotide databanks EMBL, GenBank and DDBJ do it, the translate
904 method has to perform more tricks. Specifically, 'translate' needs to
905 confirm that the sequence has appropriate start and terminator codons
906 at the beginning and the end of the sequence and that there are no
907 terminator codons present within the sequence. In addition, if the
908 genetic code being used has an atypical (non-ATG) start codon, the
909 translate method needs to convert the initial amino acid to
910 methionine. These checks and conversions are triggered by setting the
911 fifth argument of the translate method to evaluate to "true".
913 If argument 5 is set to true and the criteria for a proper CDS are
914 not met, the method, by default, issues a warning. By setting the
915 sixth argument to evaluate to "true", one can instead instruct
916 the program to die if an improper CDS is found, e.g.
919 $cds->translate(undef,undef,undef,undef,1,'die_if_errors');
921 See L<Bio::Tools::CodonTable> for related details.
923 =head2 III.3.2 Obtaining basic sequence statistics- MW, residue &codon
924 frequencies(SeqStats, SeqWord)
926 In addition to the methods directly available in the Seq object,
927 bioperl provides various "helper" objects to determine additional
928 information about a sequence. For example, SeqStats
929 object provides methods for obtaining the molecular weight of the
930 sequence as well the number of occurrences of each of the component
931 residues (bases for a nucleic acid or amino acids for a protein.)
932 For nucleic acids, SeqStats also returns counts of the number of codons
936 $seq_stats = Bio::Tools::SeqStats->new($seqobj);
937 $weight = $seq_stats->get_mol_wt();
938 $monomer_ref = $seq_stats->count_monomers();
939 $codon_ref = $seq_stats->count_codons(); # for nucleic acid sequence
941 Note: sometimes sequences will contain "ambiguous" codes. For this
942 reason, get_mol_wt() returns (a reference to) a two element array
943 containing a greatest lower bound and a least upper bound of the
946 The SeqWords object is similar to SeqStats and provides
947 methods for calculating frequencies of "words" (eg tetramers or hexamers)
948 within the sequence. See L<Bio::Tools::SeqStats> and L<Bio::Tools::SeqWords>
949 for more information.
951 =head2 III.3.3 Identifying restriction enzyme sites (RestrictionEnzyme)
953 Another common sequence manipulation task for nucleic acid sequences
954 is locating restriction enzyme cutting sites. Bioperl provides the
955 RestrictionEnzyme object for this purpose. Bioperl's
956 standard RestrictionEnzyme object comes with data for more than 150
957 different restriction enzymes. A list of the available enzymes can be
958 accessed using the available_list() method. For example to select all
959 available enzymes that with cutting patterns that are six bases long one
962 $re = new Bio::Tools::RestrictionEnzyme('-name'=>'EcoRI');
963 @sixcutters = $re->available_list(6);
965 Once an appropriate enzyme has been selected, the sites for that
966 enzyme on a given nucleic acid sequence can be obtained using the
967 cut_seq() method. The syntax for performing this task is:
969 $re1 = new Bio::Tools::RestrictionEnzyme(-name=>'EcoRI');
970 # $seqobj is the Seq object for the dna to be cut
971 @fragments = $re1->cut_seq($seqobj);
973 Adding an enzyme not in the default list is easily accomplished:
975 $re2 = new Bio::Tools::RestrictionEnzyme('-NAME' =>'EcoRV--GAT^ATC',
978 Once the custom enzyme object has been created, cut_seq() can be
979 called in the usual manner. See L<Bio::Tools::RestrictionEnzyme> for
983 =head2 III.3.4 Identifying amino acid cleavage sites (Sigcleave)
985 For amino acid sequences we may be interested to know whether the
986 amino acid sequence contains a cleavable signal sequence for
987 directing the transport of the protein within the cell. SigCleave is
988 a program (originally part of the EGCG molecular biology package) to
989 predict signal sequences, and to identify the cleavage site based on
990 the von Heijne algorithm.
992 The "threshold" setting controls the score reporting. If no value for
993 threshold is passed in by the user, the code defaults to a reporting
994 value of 3.5. SigCleave will only return score/position
995 pairs which meet the threshold limit.
997 There are 2 accessor methods for this object. "signals" will return a
998 perl hash containing the sigcleave scores keyed by amino acid
999 position. "pretty_print" returns a formatted string similar to the
1000 output of the original sigcleave utility.
1002 The syntax for using Sigcleave is as follows:
1004 # doesn't currently accept a Seq object
1005 $seq = "AALLHHHHHHGGGGPPRTTTTTVVVVVVVVVVVVVVV";
1007 use Bio::Tools::Sigcleave;
1008 $sigcleave_object = new Bio::Tools::Sigcleave
1009 ( -seq => 'sigtest.aa',
1011 -desc => 'test sigcleave protein seq',
1014 %raw_results = $sigcleave_object->signals;
1015 $formatted_output = $sigcleave_object->pretty_print;
1017 Note that Sigcleave is passed a raw sequence rather than a sequence
1018 object. Also note that the "type" in the Sigcleave object is "amino"
1019 whereas in a Seq object it would be called "protein". Please see
1020 L<Bio::Tools::Sigcleave> for details.
1023 =head2 III.3.5 Miscellaneous sequence utilities: OddCodes, SeqPattern
1027 For some purposes it's useful to have a listing of an amino acid
1028 sequence showing where the hydrophobic amino acids are located or
1029 where the positively charged ones are. Bioperl provides this
1030 capability via the module Bio::Tools::OddCodes.
1032 For example, to quickly see where the charged amino acids are located
1033 along the sequence we perform:
1035 use Bio::Tools::OddCodes;
1036 $oddcode_obj = Bio::Tools::OddCodes->new($amino_obj);
1037 $output = $oddcode_obj->charge();
1039 The sequence will be transformed into a three-letter sequence (A,C,N)
1040 for negative (acidic), positive (basic), and neutral amino acids. For
1041 example the ACDEFGH would become NNAANNC.
1043 For a more complete chemical description of the sequence one can call
1044 the chemical() method which turns sequence into one with an 8-letter
1045 chemical alphabet { A (acidic), L (aliphatic), M (amide), R
1046 (aromatic), C (basic), H (hydroxyl), I (imino), S (sulfur) }:
1048 $output = $oddcode_obj->chemical();
1050 In this case the sample sequence ACDEFGH would become LSAARAC.
1052 OddCodes also offers translation into alphabets showing alternate
1053 characteristics of the amino acid sequence such as hydrophobicity,
1054 "functionality" or grouping using Dayhoff's definitions. See the
1055 documentation in L<Bio::Tools::OddCodes> for further details.
1059 The SeqPattern object is used to manipulate sequences
1060 that include perl "regular expressions". A key motivation for
1061 SeqPattern is to have a way of generating a reverse complement of a
1062 nucleic acid sequence pattern that includes ambiguous bases and/or
1063 regular expressions. This capability leads to significant performance
1064 gains when pattern matching on both the sense and anti-sense strands
1065 of a query sequence are required. Typical syntax for using SeqPattern
1066 is shown below. For more information, there are several interesting
1067 examples in the script seq_pattern.pl in the scripts/tools directory.
1069 Use Bio::Tools::SeqPattern;
1070 $pattern = '(CCCCT)N{1,200}(agggg)N{1,200}(agggg)';
1071 $pattern_obj = new Bio::Tools::SeqPattern('-SEQ' => $pattern,
1073 $pattern_obj2 = $pattern_obj->revcom();
1074 $pattern_obj->revcom(1); # returns expanded rev complement pattern.
1076 More detail can be found in L<Bio::Tools::SeqPattern>.
1079 =for html <A NAME ="iii.3.6"></A>
1081 =head2 III.3.6 Sequence manipulation using the Bioperl EMBOSS interface (Tools::Run::EMBOSSApplication)
1083 EMBOSS (European Molecular Biology Open Source Software) is an extensive
1084 collection of sequence analysis programs written in the C
1085 programming language, from http://www.uk.embnet.org/Software/EMBOSS.
1086 There are a number of algorithms in EMBOSS that are not found in "Bioperl
1087 proper" (eg. calculating DNA melting temperature, finding repeats,
1088 identifying prospective antigenic sites) so if you if you cannot find
1089 the function you want in bioperl you might be able to find it in EMBOSS.
1091 EMBOSS programs are usually called from the command line but bioperl
1092 provides a Perl "wrapper" for EMBOSS function calls so that they can be
1093 executed from within a Perl script. Of course, the EMBOSS package must
1094 be installed for the Bioperl wrapper to function.
1096 In the future, it is planned that Bioperl EMBOSS objects will return
1097 appropriate Bioperl objects to the calling script in addition to
1098 generating standard EMBOSS reports. This functionality is
1099 being initially implemented with the EMBOSS sequence alignment
1100 programs, so that they will return SimpleAlign objects in a manner
1101 similar to the way the Bioperl modules TCoffee.pm and Clustalw.pm
1102 work (see section L<"III.5.4"> for a discussion of SimpleAlign).
1104 An example of the Bioperl EMBOSS wrapper where a file is returned
1107 $factory = new Bio::Factory::EMBOSS;
1108 $compseqapp = $factory->program('compseq');
1109 %input = ( -word => 4,
1110 -sequence => $seqObj,
1111 -outfile => $compseqoutfile );
1112 $compseqapp->run(\%input);
1113 $seqio = Bio::SeqIO->new( -file => $compseqoutfile ); # etc...
1115 Note that a Seq object was used as input. The EMBOSS object can also
1116 accept a file name as input, eg
1118 -sequence => "inputfasta.fa"
1120 Some EMBOSS programs will return strings, others will create files that
1121 can be read directly using Bio::SeqIO (section L<"III.2.1">), as in the
1122 example above. It's worth mentioning that the AlignIO module can use files
1123 from EMBOSS's water and needle as input (see L<"III.2.2">) to create AlignIO
1127 =head2 III.3.7 Sequence manipulation without creating Bioperl "objects" (Perl.pm)
1129 Using the Bio::Perl.pm module, it is possible to manipulate sequence
1130 data in Bioperl without explicitly creating Seq or SeqIO objects.
1131 This feature may ease the Bioperl learning curve for new users
1132 unfamiliar or uncomfortable with using Perl objects. However, only
1133 limited data manipulation are supported in this mode. In addition,
1134 each method (i.e. function) that will be used by the script must be
1135 explicitly declared in the initial "use directive". For example a
1136 simple data format conversion and sequence manipulation could be
1137 performed as follows - note that no "new" methods are called and that
1138 no Seq or SeqIO objects are created:
1140 use Bio::Perl qw( get_sequence );
1141 # get a sequence from a database (assummes internet connection)
1142 $seq_object = get_sequence('swissprot',"ROA1_HUMAN");
1143 # $seq_object is Bio::Seq object, so the following methods work
1144 $seq_id = $seq_object->display_id;
1145 $seq_as_string = $seq_object->seq();
1147 For more details see L<Bio::Perl>.
1149 =head2 III.4 Searching for "similar" sequences
1151 One of the basic tasks in molecular biology is identifying sequences
1152 that are, in some way, similar to a sequence of interest. The Blast
1153 programs, originally developed at the NCBI, are widely used for
1154 identifying such sequences. Bioperl offers a number of modules to
1155 facilitate running Blast as well as to parse the often voluminous
1156 reports produced by Blast.
1158 =head2 III.4.1 Running BLAST locally (StandAloneBlast)
1160 There are several reasons why one might want to run the Blast programs
1161 locally - speed, data security, immunity to network problems, being
1162 able to run large batch runs etc. The NCBI provides a downloadable
1163 version of blast in a stand-alone format, and running blast locally
1164 without any use of perl or bioperl is completely straightforward.
1165 However, there are situations where having a perl interface for running
1166 the blast programs locally is convenient.
1168 The module Bio::Tools::Run::StandAloneBlast offers the ability to
1169 wrap local calls to blast from within perl. All of the currently
1170 available options of NCBI Blast (eg PSIBLAST, PHIBLAST, bl2seq) are
1171 available from within the bioperl StandAloneBlast interface. Of course,
1172 to use StandAloneBlast, one needs to have installed locally ncbi-blast
1173 as well as one or more blast-readable databases.
1175 Basic usage of the StandAloneBlast.pm module is simple. Initially, a
1176 local blast "factory object" is created.
1178 @params = ('program' => 'blastn',
1179 'database' => 'ecoli.nt');
1180 $factory = Bio::Tools::Run::StandAloneBlast->new(@params);
1182 Any parameters not explicitly set will remain as the BLAST defaults.
1183 Once the factory has been created and the appropriate parameters set,
1184 one can call one of the supported blast executables. The input
1185 sequence(s) to these executables may be fasta file(s), a Seq
1186 object or an array of Seq objects, eg
1188 $input = Bio::Seq->new(-id =>"test query",
1189 -seq =>"ACTAAGTGGGGG");
1190 $blast_report = $factory->blastall($input);
1192 The returned blast report will be in the form of a bioperl
1193 parsed-blast object. The report object may be either a BPlite,
1194 BPpsilite, BPbl2seq or Blast object depending on the type of blast
1195 search. The "raw" blast report is also available.
1197 The syntax for running PHIBLAST, PSIBLAST and bl2seq searches via
1198 StandAloneBlast is also straightforward. See
1199 L<Bio::Tools::Run::StandAloneBlast> documentation for details. In
1200 addition, the script standaloneblast.pl in the scripts/tools directory
1201 contains descriptions of various possible applications of the
1202 StandAloneBlast object. This script shows how the blast report object
1203 can access a blast parser directly, eg
1205 while (my $sbjct = $blast_report->nextSbjct){
1206 while (my $hsp = $sbjct->nextHSP){
1207 print $hsp->score . " " . $hsp->subject->seqname . "\n";
1211 See the section L<"III.4.4"> on parsing BLAST reports with Bio::Tools::BPlite,
1212 below, or L<Bio::Tools::BPlite> for details.
1214 =head2 III.4.2 Running BLAST remotely (using RemoteBlast.pm)
1216 Bioperl supports remote execution of blasts at NCBI by means of the
1219 A skeleton script to run a remote blast might look as follows:
1221 $remote_blast = Bio::Tools::Run::RemoteBlast->new(
1222 '-prog' => 'blastp','-data' => 'ecoli','-expect' => '1e-10' );
1223 $r = $remote_blast->submit_blast("t/data/ecolitst.fa");
1224 while (@rids = $remote_blast->each_rid ) {
1225 foreach $rid ( @rids ) {$rc = $remote_blast->retrieve_blast($rid);}}
1227 You may want to change some parameter of the remote job and this example
1228 shows how to change the matrix:
1230 $Bio::Tools::Run::RemoteBlast::HEADER{'MATRIX_NAME'} = 'BLOSUM25';
1232 For a description of the many CGI parameters see:
1234 http://www.ncbi.nlm.nih.gov/BLAST/Doc/urlapi.html
1236 Note that the script has to be broken into two parts. The actual Blast
1237 submission and the subsequent retrieval of the results. At times when the
1238 NCBI Blast is being heavily used, the interval between when a Blast
1239 submission is made and when the results are available can be substantial.
1241 The object $rc would contain the blast report that could then be parsed with
1242 Bio::Tools::BPlite or Bio::SearchIO. The default object is BPlite
1243 in version 1.0 or earlier and it's SearchIO after version 1.0. The
1244 object type can be changed using the -readmethod parameter but bear
1245 in mind that the favored Blast parser is Bio::SearchIO, others won't be
1246 supported in later versions.
1248 Note that to make this script actually useful, one should add details
1249 such as checking return codes from the Blast to see if it succeeded and
1250 a "sleep" loop to wait between consecutive requests to the NCBI server.
1251 See example 26 in the demonstration script in the appendix to see some
1252 working code you could use, or L<Bio::Tools::Run::RemoteBlast> for
1255 It should also be noted that the syntax for creating a remote blast factory
1256 is slightly different from that used in creating StandAloneBlast, Clustalw,
1257 and T-Coffee factories. Specifically RemoteBlast requires parameters to
1258 be passed with a leading hyphen, as in '-prog' =E<gt> 'blastp', while the
1259 other programs do not pass parameters with a leading hyphen.
1262 =for html <A NAME ="iii.4.3"></A>
1264 =head2 III.4.3 Parsing BLAST and FASTA reports with Search and SearchIO
1266 No matter how Blast searches are run (locally or remotely, with or
1267 without a perl interface), they return large quantities of data that
1268 are tedious to sift through. Bioperl offers several different objects
1269 - Search.pm/SearchIO.pm, and BPlite.pm (along with its minor
1270 modifications, BPpsilite and BPbl2seq) for parsing Blast reports. Search
1271 and SearchIO which are new in Bioperl 1.0 and are now the principal Bioperl
1272 interfaces for Blast (and FASTA) report parsing are described in this section.
1273 The older BPlite is described in section L<"III.4.4">. We recommend you use
1274 SearchIO, it's certain to be supported in future releases.
1276 The Search and SearchIO modules provide a uniform interface for
1277 parsing sequence-similarity-search reports generated by BLAST (in
1278 standard and BLAST XML formats), PSI-BLAST, RPS-BLAST and FASTA. In the future,
1279 it is envisioned that the Search/SearchIO syntax will be extended to
1280 provide a uniform interface to a wider range of report parsers
1281 including parsers for Genscan.
1283 Parsing sequence-similarity reports with Search and SearchIO is
1284 straightforward. Initially a SearchIO object specifies a file
1285 containing the report(s). The method next_result reads the next report
1286 into a Search object in just the same way that the next_seq method of
1287 SeqIO reads in the next sequence in a file into a Seq object.
1289 Once a report (i.e. a Search object) has been read in and is available
1290 to the script, the report's overall attributes (e.g. the query) can be
1291 determined and its individual "hits" can be accessed with the
1292 next_hit method. Individual high-scoring-pairs for each hit
1293 can then be accessed with the next_hsp method. Except for
1294 the additional syntax required to enable the reading of multiple
1295 reports in a single file, the remainder of the Search/SearchIO parsing
1296 syntax is very similar to that of the BPlite object it is intended to replace.
1297 Sample code to read a BLAST report might look like this:
1300 $searchio = new Bio::SearchIO (-format => 'blast',
1301 -file => $blast_report);
1302 $result = $searchio->next_result;
1304 # Get info about the entire report
1305 $result->database_name;
1306 $algorithm_type = $result->algorithm;
1308 # get info about the first hit
1309 $hit = $result->next_hit;
1310 $hit_name = $hit->name ;
1312 # get info about the first hsp of the first hit
1313 $hsp = $hit->next_hsp;
1314 $hsp_start = $hsp->query->start;
1316 For more details on parsing with Search/SearchIO see the Search and SearchIO
1317 documentation: L<Bio::SearchIO::blast>, L<Bio::SearchIO::psiblast>,
1318 L<Bio::SearchIO::blastxml>, L<Bio::SearchIO::fasta>, and L<Bio::SearchIO>.
1319 There is also sample code in the Bio/scripts/searchio directory which
1320 illustrates how to use SearchIO.
1322 =for html <A NAME ="iii.4.4"></A>
1324 =head2 III.4.4 Parsing BLAST reports with BPlite, BPpsilite, and BPbl2seq
1326 Bioperl's older BLAST report parsers - BPlite, BPpsilite, BPbl2seq and
1327 Blast.pm - are expected to be phased out over a period of time. Since a
1328 considerable amount of legacy Bioperl scripts has been written which
1329 heavily use these objects, they are likely to remain within Bioperl
1332 Much of the user interface of BPlite is very similar to that of Search.
1333 However accessing the next hit or HSP uses methods called next_Sbjct and
1334 next_HSP, respectively - in contrast to Search's next_hit and next_hsp.
1339 The syntax for using BPlite is as follows where the method
1340 for retrieving hits is now called "nextSbjct" (for "subject"), while the
1341 method for retrieving high-scoring-pairs is called "nextHSP":
1343 use Bio::Tools::BPlite;
1344 $report = new Bio::Tools::BPlite(-fh=>\*STDIN);
1346 while(my $sbjct = $report->nextSbjct) {
1348 while (my $hsp = $sbjct->nextHSP) { $hsp->score; }
1351 A complete description of the module can be found in L<Bio::Tools::BPlite>.
1355 BPpsilite and BPbl2seq are objects for parsing (multiple iteration)
1356 PSIBLAST reports and Blast bl2seq reports, respectively. They are
1357 both minor variations on the BPlite object. See L<Bio::Tools::BPbl2seq>
1358 and L<Bio::Tools::BPpsilite> for details.
1360 The syntax for parsing a multiple iteration PSIBLAST report is as
1361 shown below. The only significant additions to BPlite are methods to
1362 determine the number of iterated blasts and to access the results from
1363 each iteration. The results from each iteration are parsed in the
1364 same manner as a (complete) BPlite object.
1366 use Bio::Tools::BPpsilite;
1367 $report = new Bio::Tools::BPpsilite(-fh=>\*STDIN);
1368 $total_iterations = $report->number_of_iterations;
1369 $last_iteration = $report->round($total_iterations)
1370 while(my $sbjct = $last_iteration ->nextSbjct) {
1372 while (my $hsp = $sbjct->nextHSP) {$hsp->score; }
1375 See L<Bio::Tools::BPpsilite> for details.
1379 BLAST bl2seq is a program for comparing and aligning two sequences
1380 using BLAST. Although the report format is similar to that of a
1381 conventional BLAST, there are a few differences. Consequently, the
1382 standard bioperl parser BPlite ia unable to read bl2seq
1383 reports directly. From the user's perspective, one difference
1384 between bl2seq and other blast reports is that the bl2seq report does
1385 not print out the name of the first of the two aligned sequences.
1386 Consequently, BPbl2seq has no way of identifying the name of one of
1387 the initial sequence unless it is explicitly passed to constructor as
1388 a second argument as in:
1390 use Bio::Tools::BPbl2seq;
1391 $report = Bio::Tools::BPbl2seq->new(-file => "t/data/dblseq.out",
1392 -queryname => "ALEU_HORVU");
1393 $hsp = $report->next_feature;
1394 $answer=$hsp->score;
1396 In addition, since there will only be (at most) one "subject" (hit) in a
1397 bl2seq report one should use the method $report-E<gt>next_feature,
1398 rather than $report-E<gt>nextSbjct-E<gt>nextHSP to obtain the next high
1399 scoring pair. See L<Bio::Tools::BPbl2seq> for more details.
1403 The Bio::Tools::Blast parser has been removed from Bioperl as of version
1404 1.1. Consequently, the BPlite parser (described in the
1405 section L<"III.4.4">) or the Search/SearchIO parsers (section L<"III.4.3">)
1406 should be use for BLAST parsing within bioperl (SearchIO is the preferred
1407 approach and will be formally supported in subsequent releases).
1410 =head2 III.4.5 Parsing HMM reports (HMMER::Results, SearchIO)
1412 Blast is not the only sequence-similarity-searching program supported
1413 by bioperl. HMMER is a Hidden Markov Model (HMM) program that
1414 (among other capabilities) enables sequence similarity searching, from
1415 http://hmmer.wustl.edu. Bioperl does not currently provide a perl interface
1416 for running HMMER. However, bioperl does provide HMMER report parsers,
1417 one with the (perhaps not too descriptive) name of Results, the other
1418 within the SearchIO framework.
1420 Results can parse reports generated both by the HMMER program
1421 hmmsearch - which searches a sequence database for sequences similar
1422 to those generated by a given HMM - and the program hmmpfam - which
1423 searches a HMM database for HMMs which match domains of a given
1424 sequence. For hmmsearch, a series of HMMER::Set objects
1425 are made, one for each sequence. For hmmpfam searches, only one Set
1426 object is made. Sample usage for parsing a hmmsearch report might be:
1428 use Bio::Tools::HMMER::Results;
1429 $res = new Bio::Tools::HMMER::Results(-file => 'output.hmm',
1430 -type => 'hmmsearch' );
1431 foreach $seq ( $res->each_Set ) {
1432 print "Sequence bit score is ", $seq->bits, "\n";
1433 foreach $domain ( $seq->each_Domain ) {
1434 print " Domain start ", $domain->start, " end ",
1435 $domain->end," score ",$domain->bits,"\n";
1439 Additional methods are described in L<Bio::Tools::HMMER::Results>.
1441 In addition one can parse HMMER output files using Bio::SearchIO. This
1442 approach is the approved and supported Bioperl method. An example:
1446 $in = new Bio::SearchIO(-format => 'hmmer',-file => '123.hmmsearch');
1447 while ( $res = $in->next_result ){
1448 # get a Bio::Search::Result::HMMERResult object
1449 print $res->query_name, " for HMM ", $res->hmm_name, "\n";
1450 while ( $hit = $res->next_hit ){
1451 print $hit->name, "\n";
1452 while ( $hsp = $hit->next_hsp ){
1453 print "length is ", $hsp->length, "\n";
1458 Purists may insist that the term "hsp" is not applicable to hmmsearch or
1459 hmmpfam results and they may be correct - this is an unintended
1460 consequence of using the flexible and extensible SearchIO approach. See
1461 L<Bio::Search::Result::HMMERResult> for more information.
1464 =head2 III.5 Creating and manipulating sequence alignments
1466 Once one has identified a set of similar sequences, one often needs to
1467 create an alignment of those sequences. Bioperl offers several perl
1468 objects to facilitate sequence alignment: pSW, Clustalw.pm, TCoffee.pm
1469 and the bl2seq option of StandAloneBlast. All of these objects take
1470 as arguments a reference to an array of (unaligned) Seq objects. All
1471 (except bl2seq) return a reference to a SimpleAlign object. bl2seq
1472 can also produce a SimpleAlign object when it is combined with
1473 Bio::AlignIO (see section below, L<"III.5.2">).
1476 =head2 III.5.1 Aligning 2 sequences with Smith-Waterman (pSW)
1478 The Smith-Waterman (SW) algorithm is the standard method for producing
1479 an optimal local alignment of two sequences. Bioperl supports the
1480 computation of SW alignments via the pSW object. Note that pSW only
1481 supports the alignment of protein sequences, not nucleotide.
1483 The SW algorithm itself is implemented in C and incorporated into bioperl using
1484 an XS extension. This has significant efficiency advantages but means that
1485 pSW will B<not> work unless you have compiled the bioperl-ext
1486 package. If you have compiled the bioperl-ext package, usage is
1487 simple, where the method align_and_show displays the alignment while
1488 pairwise_alignment produces a (reference to) a SimpleAlign object.
1490 use Bio::Tools::pSW;
1491 $factory = new Bio::Tools::pSW( '-matrix' => 'blosum62.bla',
1494 $factory->align_and_show($seq1, $seq2, STDOUT);
1495 $aln = $factory->pairwise_alignment($seq1, $seq2);
1497 SW matrix, gap and extension parameters can be adjusted as shown.
1498 Bioperl comes standard with blosum62 and gonnet250 matrices. Others
1499 can be added by the user. For additional information on accessing the
1500 SW algorithm via pSW see the script psw.pl in the scripts/tools directory and
1501 the documentation in L<Bio::Tools::pSW>.
1503 An alternative way to get Smith-Waterman alignments is the EMBOSS
1504 program 'water' (see Section L<"IV.3"> for more information on the EMBOSS
1505 package). This can produce an output file that bioperl can read in with
1509 my $in = new Bio::AlignIO(-format => 'emboss', -file => 'filename');
1510 my $aln = $in->next_aln();
1512 =for html <A NAME ="iii.5.2"></A>
1514 =head2 III.5.2 Aligning 2 sequences with Blast using bl2seq and AlignIO
1516 As an alternative to Smith-Waterman, two sequences can also be aligned
1517 in Bioperl using the bl2seq option of Blast within the StandAloneBlast
1518 object. To get an alignment - in the form of a SimpleAlign object -
1519 using bl2seq, you need to parse the bl2seq report with the Bio::AlignIO
1520 file format reader as follows:
1522 $factory = Bio::Tools::Run::StandAloneBlast->new('outfile' => 'bl2seq.out');
1523 $bl2seq_report = $factory->bl2seq($seq1, $seq2);
1524 # Use AlignIO.pm to create a SimpleAlign object from the bl2seq report
1525 $str = Bio::AlignIO->new('-file '=>' bl2seq.out',
1526 '-format' => 'bl2seq');
1527 $aln = $str->next_aln();
1529 =head2 III.5.3 Aligning multiple sequences (Clustalw.pm, TCoffee.pm)
1531 For aligning multiple sequences (ie two or more), bioperl offers a
1532 perl interface to the bioinformatics-standard clustalw and tcoffee
1533 programs. Clustalw has been a leading program in global multiple
1534 sequence alignment (MSA) for several years. TCoffee is a relatively
1535 recent program - derived from clustalw - which has been shown to
1536 produce better results for local MSA.
1538 To use these capabilities, the clustalw and/or tcoffee programs
1539 themselves need to be installed on the host system. In addition, the
1540 environmental variables CLUSTALDIR and TCOFFEEDIR need to be set to
1541 the directories containg the executables. See section L<"I.3"> and the
1542 L<Bio::Tools::Run::Alignment::Clustalw> and
1543 L<Bio::Tools::Run::Alignment::TCoffee> for information on downloading
1544 and installing these programs.
1546 From the user's perspective, the bioperl syntax for calling
1547 Clustalw.pm or TCoffee.pm is almost identical. The only differences
1548 are the names of the modules themselves appearing in the initial "use"
1549 and constructor statements and the names of the some of the individual
1550 program options and parameters.
1552 In either case, initially, a "factory object" must be created. The
1553 factory may be passed most of the parameters or switches of the
1554 relevant program. In addition, alignment parameters can be changed
1555 and/or examined after the factory has been created. Any parameters
1556 not explicitly set will remain as the underlying program's
1557 defaults. Clustalw.pm/TCoffee.pm output is returned in the form of a
1558 SimpleAlign object. It should be noted that some Clustalw and TCoffee
1559 parameters and features (such as those corresponding to tree
1560 production) have not been implemented yet in the Perl interface.
1562 Once the factory has been created and the appropriate parameters set,
1563 one can call the method align() to align a set of unaligned sequences,
1564 or profile_align() to add one or more sequences or a second alignment
1565 to an initial alignment. Input to align() consists of a set of
1566 unaligned sequences in the form of the name of file containing the
1567 sequences or a reference to an array of Seq objects. Typical
1568 syntax is shown below. (We illustrate with Clustalw.pm, but the same
1569 syntax - except for the module name - would work for TCoffee.pm)
1571 use Bio::Tools::Run::Alignment::Clustalw;
1572 @params = ('ktuple' => 2, 'matrix' => 'BLOSUM');
1573 $factory = Bio::Tools::Run::Alignment::Clustalw->new(@params);
1575 $factory->ktuple($ktuple); # change the parameter before executing
1576 $seq_array_ref = \@seq_array;
1577 # where @seq_array is an array of Bio::Seq objects
1578 $aln = $factory->align($seq_array_ref);
1580 Clustalw.pm/TCoffee.pm can also align two (sub)alignments to each
1581 other or add a sequence to a previously created alignment by using the
1582 profile_align method. For further details on the required syntax and
1583 options for the profile_align method, the user is referred to
1584 L<Bio::Tools::Run::Alignment::Clustalw> and
1585 L<Bio::Tools::Run::Alignment::TCoffee>. The user is also
1586 encouraged to examine the script clustalw.pl in the scripts/align directory.
1588 =for html <A NAME ="iii.5.4"></A>
1590 =head2 III.5.4 Manipulating and displaying alignments (SimpleAlign)
1592 SimpleAlign objects are produced by bioperl alignment creation objects
1593 (eg Clustalw.pm, BLAST's bl2seq, and pSW) and they can read and write multiple
1594 alignment formats via AlignIO.
1596 Some of the manipulations possible with SimpleALign include:
1602 slice(): Obtaining an alignment "slice", that is, a subalignment inclusive of
1603 specified start and end columns. Sequences with no residues in the slice are
1604 excluded from the new alignment and a warning is printed.
1608 column_from_residue_number(): Finding column in an alignment where a specified
1609 residue of a specified sequence is located.
1613 consensus_string(): Making a consensus string. This method includes an
1614 optional threshold parameter, so that positions in the alignment with lower
1615 percent-identity than the threshold are marked by "?"'s in the consensus
1619 percentage_identity(): A fast method for calculating the average percentage
1620 identity of the alignment
1624 consensus_iupac(): Making a consensus using IUPAC ambiguity codes from
1630 Skeleton code for using some of these features is shown below. More detailed,
1631 working code is in Demo example 14 and in align_on_codons.pl in the scripts
1632 directory. Additional documentation on methods can be found in
1633 L<Bio::SimpleAlign> and L<Bio::LocatableSeq>.
1635 use Bio::SimpleAlign;
1636 $aln = Bio::SimpleAlign->new('t/data/testaln.dna');
1637 $threshold_percent = 60;
1638 $consensus_with_threshold = $aln->consensus_string($threshold_percent);
1639 $iupac_consensus = $aln->consensus_iupac(); # dna/rna alignments only
1640 $percent_ident = $aln->percentage_identity;
1641 $seqname = '1433_LYCES';
1642 $pos = $aln->column_from_residue_number($seqname, 14);
1644 =head2 III.6 Searching for genes and other structures on genomic DNA
1645 (Genscan, Sim4, Grail, Genemark, ESTScan, MZEF, EPCR)
1647 Automated searching for putative genes, coding sequences,
1648 sequence-tagged-sites (STS's) and other functional
1649 units in genomic and expressed sequence tag (EST) data has
1650 become very important as the available quantity of sequence data has
1651 rapidly increased. Many feature searching programs currently exist.
1652 Each produces reports containing predictions that must be read
1653 manually or parsed by automated report readers.
1655 Parsers for six widely used gene prediction programs - Genscan, Sim4,
1656 Genemark, Grail, ESTScan and MZEF - are currently available or under active
1657 development in bioperl. The interfaces for the four parsers are similar.
1658 We illustrate the usage for Genscan and Sim4 here. The syntax is relatively
1659 self-explanatory; see L<Bio::Tools::Genscan>, L<Bio::Tools::Genemark>,
1660 L<Bio::Tools::Grail>, L<Bio::Tools::ESTScan>, L<Bio::Tools::MZEF>, and
1661 L<Bio::Tools::Sim4::Results> for further details.
1663 use Bio::Tools::Genscan;
1664 $genscan = Bio::Tools::Genscan->new(-file => 'result.genscan');
1665 # $gene is an instance of Bio::Tools::Prediction::Gene
1666 # $gene->exons() returns an array of Bio::Tools::Prediction::Exon objects
1667 while($gene = $genscan->next_prediction())
1668 { @exon_arr = $gene->exons(); }
1671 See L<Bio::Tools::Prediction::Gene> and L<Bio::Tools::Prediction::Exon>
1674 use Bio::Tools::Sim4::Results;
1675 $sim4 = new Bio::Tools::Sim4::Results(-file => 't/data/sim4.rev',
1677 # $exonset is-a Bio::SeqFeature::Generic with Bio::Tools::Sim4::Exons
1679 $exonset = $sim4->next_exonset;
1680 @exons = $exonset->sub_SeqFeature();
1681 # $exon is-a Bio::SeqFeature::FeaturePair
1683 $exonstart = $exons[$exon]->start();
1684 $estname = $exons[$exon]->est_hit()->seqname();
1687 See L<Bio::SeqFeature::Generic> and L<Bio::Tools::Sim4::Exons> for more
1690 A parser for the ePCR program is also available. The ePCR program identifies
1691 potential PCR-based sequence tagged sites (STSs)
1692 For more details see the documentation in L<Bio::Tools::EPCR>.
1693 A sample skeleton script for parsing an ePCR report and using the data to
1694 annotate a genomic sequence might look like this:
1696 use Bio::Tools::EPCR;
1698 $parser = new Bio::Tools::EPCR(-file => 'seq1.epcr');
1699 $seqio = new Bio::SeqIO(-format => 'fasta', -file => 'seq1.fa');
1700 $seq = $seqio->next_seq;
1701 while( $feat = $parser->next_feature ) {
1702 # add EPCR annotation to a sequence
1703 $seq->add_SeqFeature($feat);}
1705 =for html <A NAME ="iii.7"></A>
1707 =head2 III.7 Developing machine readable sequence annotations
1709 Historically, annotations for sequence data have been entered and read
1710 manually in flat-file or relational databases with relatively little
1711 concern for machine readability. More recent projects - such as EBI's
1712 Ensembl project and the efforts to develop an XML molecular biology
1713 data specification - have begun to address this limitation. Because
1714 of its strengths in text processing and regular-expression handling,
1715 perl is a natural choice for the computer language to be used for this
1716 task. And bioperl offers numerous tools to facilitate this process -
1717 several of which are described in the following sub-sections.
1719 =for html <A NAME ="iii.7.1"></A>
1721 =head2 III.7.1 Representing sequence annotations (Annotation,SeqFeature)
1723 As of the 0.7 release of bioperl, the fundamental sequence object,
1724 Seq, can have multiple sequence feature (SeqFeature) objects - eg
1725 Gene, Exon, Promoter objects - associated with it. A Seq object can
1726 also have an Annotation object (used to store database links,
1727 literature references and comments) associated with it. Creating a
1728 new SeqFeature and Annotation and associating it with a Seq is
1729 accomplished with syntax like:
1731 $feat = new Bio::SeqFeature::Generic('-start' => 40,
1734 '-primary' => 'exon',
1735 '-source' => 'internal' );
1736 $seqobj->add_SeqFeature($feat); # Add the SeqFeature to the parent
1737 $seqobj->annotation(new Bio::Annotation
1738 ('-description' => 'desc-here'));
1740 Once the features and annotations have been associated with the Seq,
1741 they can be with retrieved, eg:
1743 @topfeatures = $seqobj->top_SeqFeatures(); # just top level, or
1744 @allfeatures = $seqobj->all_SeqFeatures(); # descend into sub features
1745 $ann = $seqobj->annotation(); # annotation object
1747 The individual components of a SeqFeature can also be set or retrieved
1748 with methods including:
1750 # attributes which return numbers
1751 $feat->start # start position
1752 $feat->end # end position
1754 $feat->strand # 1 means forward, -1 reverse, 0 not relevant
1756 # attributes which return strings
1757 $feat->primary_tag # the main 'name' of the sequence feature,
1759 $feat->source_tag # where the feature comes from, eg'BLAST'
1761 # attributes which return Bio::PrimarySeq objects
1762 $feat->seq # the sequence between start,end
1763 $feat->entire_seq # the entire sequence
1765 # other useful methods include
1766 $feat->overlap($other) # do SeqFeature $feat and SeqFeature $other overlap?
1767 $feat->contains($other) # is $other completely within $feat?
1768 $feat->equals($other) # do $feat and $other completely agree?
1769 $feat->sub_SeqFeatures # create/access an array of subsequence features
1771 See L<Bio::Annotation> and L<Bio::SeqFeature::Generic> as starting points
1772 for further exploration, and see the scripts/gff2ps.pl script.
1774 It is worth mentioning that one can also retrieve the start and end
1775 positions of a feature using a Bio::LocationI object:
1777 $location = $feat->location # $location is a Bio::LocationI object
1778 $location->start; # start position
1779 $location->end; # end position
1781 This is useful because one needs a Bio::Location::SplitLocationI object
1782 in order to retrieve the split coordinates inside the Genbank or EMBL join()
1783 statements (e.g. "CDS join(51..142,273..495,1346..1474)"):
1785 if ( $feat->location->isa('Bio::Location::SplitLocationI') &&
1786 $feat->primary_tag eq 'CDS' ) {
1787 foreach $loc ( $feat->location->sub_Location ) {
1788 print $loc->start . ".." . $loc->end . "\n";
1792 See L<Bio::LocationI> and L<Bio::Location::SplitLocationI> for more
1795 If more detailed annotation is required than is currently available in Seq
1796 objects the RichSeq object may be used. It is applicable in particular to
1797 database sequences (EMBL, GenBank and Swissprot) with detailed annotations.
1798 Sample usage might be:
1800 @secondary = $richseq->get_secondary_accessions;
1801 $division = $richseq->division;
1802 @dates = $richseq->get_dates;
1803 $seq_version = $richseq->seq_version;
1805 See L<Bio::Seq::RichSeqI> for more details.
1808 =for html <A NAME ="iii.7.2"></A>
1810 =head2 III.7.2 Representing and large and/or changing sequences (LiveSeq,LargeSeq)
1812 This interface extends the Bio::SeqI interface to give additional functionality
1813 to sequences with richer data sources, in particular from database sequences
1814 (EMBL, GenBank and Swissprot).
1816 Very large sequences and/or data files with sequences that are frequently being
1817 updated present special problems to automated sequence-annotation storage and
1818 retrieval projects. Bioperl's LargeSeq and LiveSeq objects are designed to
1819 address these two situations.
1823 A LargeSeq object is a SeqI compliant object that stores
1824 a sequence as a series of files in a temporary directory (see sect L<"II.1">
1825 or L<Bio::SeqI> for a definition of SeqI objects). The aim is to enable
1826 storing very large sequences (eg, E<gt> 100MBases) without running out of
1827 memory and, at the same time, preserving the familiar bioperl Seq object
1828 interface. As a result, from the users perspective, using a LargeSeq
1829 object is almost identical to using a Seq object. The principal
1830 difference is in the format used in the SeqIO calls. Another difference
1831 is that the user must remember to only read in small chunks of the
1832 sequence at one time. These differences are illustrated in the following
1835 $seqio = new Bio::SeqIO('-format'=>'largefasta',
1836 '-file' =>'t/data/genomic-seq.fasta');
1837 $pseq = $seqio->next_seq();
1838 $plength = $pseq->length();
1839 $last_4 = $pseq->subseq($plength-3,$plength); # this is OK
1841 # On the other hand, the next statement would
1842 # probably cause the machine to run out of memory
1843 # $lots_of_data = $pseq->seq(); # NOT OK for a large LargeSeq object
1847 The LiveSeq object addresses the need for a sequence object capable of
1848 handling sequence data that may be changing over time. In such a
1849 sequence, the precise locations of features along the sequence may
1850 change. LiveSeq deals with this issue by re-implementing the sequence
1851 object internally as a "double linked chain." Each element of the
1852 chain is connected to other two elements (the PREVious and the NEXT
1853 one). There is no absolute position (like in an array), hence if
1854 positions are important, they need to be computed (methods are
1855 provided). Otherwise it's easy to keep track of the elements with
1856 their "LABELs". There is one LABEL (think of it as a pointer) to each
1857 ELEMENT. The labels won't change after insertions or deletions of the
1858 chain. So it's always possible to retrieve an element even if the
1859 chain has been modified by successive insertions or deletions.
1861 Although the implementation of the LiveSeq object is novel, its
1862 bioperl user interface is unchanged since LiveSeq implements a
1863 PrimarySeqI interface (recall PrimarySeq is the subset of Seq without
1864 annotations or SeqFeatures - see section L<"II.1"> or L<Bio::PrimarySeq>).
1865 Consequently syntax for using LiveSeq objects is familiar although a
1866 modified version of SeqIO called Bio::LiveSeq::IO::Bioperl needs to be
1867 used to actually load the data, eg:
1869 $loader=Bio::LiveSeq::IO::BioPerl->load('-db'=>"EMBL",
1870 '-file'=>"t/data/factor7.embl");
1871 $gene=$loader->gene2liveseq('-gene_name' => "factor7");
1872 $id = $gene->get_DNA->display_id ;
1873 $maxstart = $gene->maxtranscript->start;
1875 See L<Bio::LiveSeq::IO::BioPerl> for more details.
1877 Creating, maintaining and querying of LiveSeq genes is quite memory
1878 and processor intensive. Consequently, any additional information
1879 relating to mutational changes in a gene need to be stored separately
1880 from the sequence data itself. The next section describes the mutation
1881 and polymorphism objects used to accomplish this.
1884 =head2 III.7.3 Representing related sequences - mutations,
1885 polymorphisms etc (Allele, SeqDiff)
1887 The Mutation object allows for a basic description of a sequence change
1888 in the DNA sequence of a gene. The Mutator object takes in mutations,
1889 applies them to a LiveSeq gene and returns a set of Bio::Variation
1890 objects describing the net effect of the mutation on the gene at the DNA,
1891 RNA and protein level.
1893 The objects in Bio::Variation and Bio::LiveSeq directory were
1894 originally designed for the "Computational Mutation Expression
1895 Toolkit" project at European Bioinformatics Institute (EBI). The
1896 result of using them to mutate a gene is a holder object, 'SeqDiff',
1897 that can be printed out or queried for specific information. For
1898 example, to find out if restriction enzyme changes caused by a
1899 mutation are exactly the same in DNA and RNA sequences, we can write:
1901 use Bio::LiveSeq::IO::BioPerl;
1902 use Bio::LiveSeq::Mutator;
1903 use Bio::LiveSeq::Mutation;
1905 $loader = Bio::LiveSeq::IO::BioPerl->load('-file' => "$filename");
1906 $gene = $loader->gene2liveseq('-gene_name' => $gene_name);
1907 $mutation = new Bio::LiveSeq::Mutation ('-seq' =>'G',
1909 $mutate = Bio::LiveSeq::Mutator->new('-gene' => $gene,
1910 '-numbering' => "coding" );
1911 $mutate->add_Mutation($mutation);
1912 $seqdiff = $mutate->change_gene();
1913 $DNA_re_changes = $seqdiff->DNAMutation->restriction_changes;
1914 $RNA_re_changes = $seqdiff->RNAChange->restriction_changes;
1915 $DNA_re_changes eq $RNA_re_changes or print "Different!\n";
1917 For a complete working script, see the change_gene.pl script
1918 in the scripts/liveseq directory. For more details on the use of these objects
1919 see L<Bio::LiveSeq::Mutator> and L<Bio::LiveSeq::Mutation> as well as
1920 the original documentation for the "Computational Mutation Expression
1921 Toolkit" project at http://www.ebi.ac.uk/mutations/toolkit/.
1923 =for html <A NAME ="iii.7.4"></A>
1925 =head2 III.7.4 Incorporating quality data in sequence annotation (SeqWithQuality)
1927 SeqWithQuality objects are used to describe sequences with very
1928 specific annotations - that is, data quality annotaions. Data quality
1929 information is important for documenting the reliability of base
1930 "calls" in newly sequenced or otherwise questionable sequence
1931 data. The quality data is contained within a Bio::Seq::PrimaryQual object.
1932 Syntax for using SeqWithQuality objects is as follows:
1934 # first, make a PrimarySeq object
1935 $seqobj = Bio::PrimarySeq->new
1936 ( -seq => 'atcgatcg', -id => 'GeneFragment-12',
1937 -accession_number => 'X78121', -alphabet => 'dna');
1938 # now make a PrimaryQual object
1939 $qualobj = Bio::Seq::PrimaryQual->new
1940 ( -qual => '10 20 30 40 50 50 20 10', -id => 'GeneFragment-12',
1941 -accession_number => 'X78121', -alphabet => 'dna');
1942 # now make the SeqWithQuality object
1943 $swqobj = Bio::Seq::SeqQithQuality->new
1944 ( -seq => $seqobj, -qual => $qualobj);
1945 # Now we access the sequence with quality object
1946 $swqobj->id(); # the id of the SeqWithQuality object may not match the
1947 # id of the sequence or of the quality
1948 $swqobj->seq(); # the sequence of the SeqWithQuality object
1949 $swqobj->qual(); # the quality of the SeqWithQuality object
1951 A SeqWithQuality object is created automatically when phred output, a *phd
1952 file, is read by Seqio, eg
1954 $seqio = Bio::SeqIO->new(-file=>"my.phd",-format=>"phd");
1955 # or just 'Bio::SeqIO->new(-file=>"my.phd")'
1956 $seqWithQualObj = $seqio->next_seq;
1958 See L<Bio::Seq::SeqWithQuality> for a detailed description of the methods,
1959 L<Bio::Seq::PrimaryQual>, and L<Bio::SeqIO::phd>.
1962 =head2 III.7.5 Sequence XML representations - generation and parsing (SeqIO::game, SeqIO::bsml)
1964 The previous subsections have described tools for automated sequence
1965 annotation by the creation of an "object layer" on top of a
1966 traditional database structure. XML takes a somewhat different
1967 approach. In XML, the data structure is unmodified, but machine
1968 readability is facilitated by using a data-record syntax with special
1969 flags and controlled vocabulary.
1971 Bioperl supports a set of XML flags and vocabulary words for molecular
1972 biology - called bioxml - detailed at
1973 http://www.bioxml.org/dtds/current/. The idea is that any bioxml
1974 features can be turned into bioperl Seq annotations. Conversely
1975 Seq object features and annotations can be converted to XML so that
1976 they become available to any other systems that are XML (and bioxml)
1977 compliant. Typical usage is shown below. No special syntax is
1978 required by the user. Note that some Seq annotation will be lost when
1979 using bioxml in this manner since in its current implementation
1980 bioxml does not support all the annotation information available in
1983 $str = Bio::SeqIO->new('-file'=> 't/data/test.game',
1984 '-format' => 'game');
1985 $seq = $str->next_primary_seq();
1987 @feats = $seq->all_SeqFeatures();
1988 $first_primary_tag = $feats[0]->primary_tag;
1990 Additional XML formats to describe sequences and their annotations
1991 have been created. BSML and AGAVE are two additional formats that
1992 have been created in the last year. Bioperl currently only supports
1993 BSML through the SeqIO system at this time. Usage is similar to other
1996 $str = Bio::SeqIO->new('-file'=> 'bsmlfile.xml',
1997 '-format' => 'bsml');
1998 $seq = $str->next_primary_seq();
2000 @feats = $seq->all_SeqFeatures();
2001 $first_primary_tag = $feats[0]->primary_tag;
2004 =head2 III.8 Representing non-sequence data in Bioperl: structures, trees and maps
2006 Though bioperl has its roots in describing and searching nucleotide and protein
2007 sequences it has also branched out into related fields of study,
2008 such as protein structure, phylogenetic trees and genetic maps.
2011 =head2 III.8.1 Using 3D structure objects and reading PDB files
2012 (StructureI, Structure::IO)
2014 A StructureIO object can be created from one or more 3D structures
2015 represented in Protein Data Bank, or pdb, format (see
2016 http://www.rcsb.org/pdb for details).
2018 StructureIO objects allow access to a variety of related Bio:Structure
2019 objects. An Entry object consist of one or more Model objects, which
2020 in turn consist of one or more Chain objects. A Chain is composed of
2021 Residue objects, which in turn consist of Atom objects. There's a
2022 wealth of methods, here are just a few:
2024 $structio = Bio::Structure::IO->new( -file => "1XYZ.pdb");
2025 $struc = $structio->next_structure; # returns an Entry object
2026 $ann = $struc->annotation; # returns a Bio::Annotation object
2027 $pseq = $struc->seqres; # returns a PrimarySeq object, thus
2028 $pseq->subseq(1,20); # returns a sequence string
2029 @atoms = $struc->get_atoms($res); # Atom objects, given a Residue
2030 @xyz = $atom->xyz; # the 3D coordinates of the atom
2032 These lines show how one has access to a number of related objects and methods.
2033 For examples of typical usage of these modules, see the scripts in the
2034 scripts/structure subdirectory. Also see L<Bio::Structure::IO>,
2035 L<Bio::Structure::Entry>, L<Bio::Structure::Model>,
2036 L<Bio::Structure::Chain>, L<Bio::Structure::Residue>, and
2037 L<Bio::Structure::Atom> for more information.
2039 =head2 III.8.2 Tree objects and phylogenetic trees (Tree::Tree, TreeIO)
2041 Bioperl Tree objects can store data for all kinds of computer trees
2042 and are intended especially for phylogenetic trees. Nodes and
2043 branches of trees can be individually manipulated. The TreeIO object
2044 is used for stream I/O of tree objects. Currently only phylip/newick
2045 tree format is supported. Sample code might be:
2047 $treeio = new Bio::TreeIO( -format => 'newick', -file => $treefile);
2048 $tree = $treeio->next_tree; # get the tree
2049 @nodes = $tree->get_nodes; # get all the nodes
2050 $tree->get_root_node()->each_Descendent(); # get descendents of root node
2052 See L<Bio::TreeIO> and L<Bio::Tree::Tree> for details.
2055 =head2 III.8.3 Map objects for manipulating genetic maps (Map::MapI,
2058 Bioperl map objects can be used to describe any type of biological map
2059 data including genetic maps, STS maps etc. Map I/O is performed with
2060 the MapIO object which works in a similar manner to the SeqIO,
2061 SearchIO and similar I/O objects described previously. In principle,
2062 Map I/O with various map data formats can be performed. However
2063 currently only "mapmaker" format is supported. Manipulation of
2064 genetic map data with Bioperl Map objects might look like this:
2066 $mapio = new Bio::MapIO( '-format' => 'mapmaker', '-file' => $mapfile);
2067 $map = $mapio->next_map; # get a map
2068 $maptype = $map->type ;
2069 foreach $marker ( $map->each_element ) {
2070 $marker_name = $marker->name ; # get the name of each map marker
2073 See L<Bio::MapIO> and L<Bio::Map::SimpleMap> for more information.
2076 =head2 III.8.4 Bibliographic objects for querying bibliographic databases (Biblio)
2078 Bio::Biblio objects are used to query bibliographic databases, such as MEDLINE.
2079 The associated modules are built to work with OpenBQS-compatible databases
2080 (see http://industry.ebi.ac.uk/openBQS). A Bio::Biblio object can execute a query
2083 my $collection = $biblio->find ('brazma', 'authors');
2084 while ( $collection->has_next ) {
2085 print $collection->get_next;
2088 See L<Bio::Biblio> or the scripts/biblio/biblio.pl script for details.
2091 =for html <A NAME ="iii.8.5"></A>
2093 =head2 III.8.5 Graphics objects for representing sequence objects as images (Graphics)
2095 A user may want to represent Seq objects and their SeqFeatures graphically. The
2096 Bio::Graphics::* modules use Perl's GD.pm module to create a PNG or GIF image
2097 given the SeqFeatures (Section L<"III.7.1">) contained within a Seq object.
2099 These modules contain numerous methods to dictate the sizes, colors, labels,
2100 and line formats within the image. See L<Bio::Graphics>, L<Bio::Graphics::Panel>,
2101 or the scripts/render_sequence.pl script for more information.
2103 The Genquire application also provides ways to graphically represent Seq objects
2104 (see Section L<"IV.6">).
2107 =head2 III.9 Bioperl alphabets
2109 Bioperl modules use the standard extended single-letter genetic
2110 alphabets to represent nucleotide and amino acid sequences.
2112 In addition to the standard alphabet, the following symbols
2113 are also acceptable in a biosequence:
2115 ? (a missing nucleotide or amino acid)
2119 =head2 III.9.1 Extended DNA / RNA alphabet
2121 (includes symbols for nucleotide ambiguity)
2122 ------------------------------------------
2123 Symbol Meaning Nucleic Acid
2124 ------------------------------------------
2144 IUPAC-IUB SYMBOLS FOR NUCLEOTIDE NOMENCLATURE:
2145 Cornish-Bowden (1985) Nucl. Acids Res. 13: 3021-3030.
2148 =head2 III.9.2 Amino Acid alphabet
2150 ------------------------------------------
2152 ------------------------------------------
2154 B Aspartic Acid, Asparagine
2175 Z Glutamic Acid, Glutamine
2179 IUPAC-IUP AMINO ACID SYMBOLS:
2180 Biochem J. 1984 Apr 15; 219(2): 345-373
2181 Eur J Biochem. 1993 Apr 1; 213(1): 2
2184 =for html <A NAME ="IV."></A>
2186 =head1 IV. Related projects - biocorba, biopython, biojava, Ensembl,
2187 Genquire /AnnotationWorkbench / bioperl-gui
2189 There are several "sister projects" to bioperl currently under
2190 development. These include biocorba, biopython, biojava, EMBOSS, Ensembl,
2191 and Genquire / Annotation Workbench (which includes Bioperl-gui). These are all
2192 large complex projects and describing them in detail here will not be
2193 attempted. However a brief introduction seems appropriate since, in
2194 the future, they may each provide significant added utility to the
2197 =head2 IV.1 Biocorba
2199 Interface objects have facilitated interoperability between bioperl
2200 and other perl packages such as Ensembl and the Annotation Workbench.
2201 However, interoperability between bioperl and packages written in
2202 other languages requires additional support software. CORBA is one
2203 such framework for interlanguage support, and the biocorba project is
2204 currently implementing a CORBA interface for bioperl. With biocorba,
2205 objects written within bioperl will be able to communicate with
2206 objects written in biopython and biojava (see the next subsection).
2207 For more information, see the biocorba project website at
2208 http://biocorba.org/. The Bioperl BioCORBA server and client bindings are
2209 available in the bioperl-corba-server and bioperl-corba-client bioperl CVS
2210 repositories respecitively. (see http://cvs.bioperl.org for more information).
2212 =head2 IV.2 Biopython and biojava
2214 Biopython and biojava are open source projects with very similar goals
2215 to bioperl. However their code is implemented in python and java,
2216 respectively. With the development of interface objects and biocorba,
2217 it is possible to write java or python objects which can be accessed
2218 by a bioperl script, or to call bioperl objects from java or python
2219 code. Since biopython and biojava are more recent projects than
2220 bioperl, most effort to date has been to port bioperl functionality to
2221 biopython and biojava rather than the other way around. However, in
2222 the future, some bioinformatics tasks may prove to be more effectively
2223 implemented in java or python in which case being able to call them
2224 from within bioperl will become more important. For more information,
2225 go to the biojava http://biojava.org/ and biopython http://biopython.org/
2229 =for html <A NAME ="iv.3"></A>
2233 EMBOSS is another open source project with similar goals
2234 to bioperl. However EMBOSS code is implemented in C and has been
2235 designed for standalone execution on the Unix command line, rather
2236 than for incorporation into a user script or program. EMBOSS includes
2237 a wide array of useful bioinformatics functions similar to those of
2238 the GCG package after which it was designed. For more information on EMBOSS
2239 refer to http://www.hgmp.mrc.ac.uk/Software/EMBOSS/. The principal bioperl
2240 interface object to EMBOSS is described in L<Bio::Tools::Run::EMBOSSApplication>
2241 (and see section L<"III.3.6">).
2243 =head2 IV.4 Ensembl and bioperl-db
2245 Ensembl is an ambitious automated-genome-annotation project at EBI.
2246 Much of Ensembl's code is based on bioperl, and Ensembl developers, in
2247 turn, have contributed significant pieces of code to bioperl. In
2248 particular, the bioperl code for automated sequence annotation has
2249 been largely contributed by Ensembl developers.
2250 Describing Ensembl and its capabilities is far beyond the scope of
2251 this tutorial The interested reader is referred to the Ensembl website
2252 at http://www.ensembl.org/.
2254 Bioperl-db is a relatively new project intended to transfer some of
2255 Ensembl's capability of integrating bioperl syntax with a standalone
2256 Mysql database (http://www.mysql.com) to the bioperl code-base. More details
2257 on bioperl-db can be found in the bioperl-db CVS directory at
2258 http://cvs.bioperl.org/cgi-bin/viewcvs/viewcvs.cgi/bioperl-db/?cvsroot=bioperl.
2259 It is worth mentioning that most of the bioperl objects mentioned above
2260 map directly to tables in the bioperl-db schema. Therefore object data such
2261 as sequences, their features, and annotations can be easily loaded into the
2264 $loader->store($newid,$seqobj)
2266 Similarly one can query the database in a variety of ways and retrieve
2267 arrays of Seq objects. See biodatabases.pod, L<Bio::DB::SQL::SeqAdaptor>,
2268 L<Bio::DB::SQL::QueryConstraint>, and L<Bio::DB::SQL::BioQuery> for examples.
2271 =head2 IV.5 GFF format and Bio::DB::GFF*
2273 The Bio::DB::GFF module provides access to relational databases constructed
2274 from data files in GFF format. This file type is well suited to sequence
2275 annotation because it allows the ability to describe entries in terms of
2276 parent-child relationships (see http://www.sanger.ac.uk/software/GFF for
2277 details). Like bioperl-db, above, the current implementation uses mysql
2278 (http://www.mysql.com).
2280 The module accesses not only by id but by annotation type and position or
2281 range. Those who would like to explore bioperl as a means to overlay nucleotide
2282 sequence, protein sequence, features, and annotations should take a close
2283 look at L<Bio::DB::GFF>and the load_gff.pl, bulk_load_gff.pl, gadfly_to_gff.pl,
2284 and sgd_to_gff.pl scripts in the scripts/Bio-DB-GFF directory.
2287 =for html <A NAME ="iv.6"></A>
2289 =head2 IV.6 Genquire, the Annotation Workbench and bioperl-gui
2291 The Annotation Workbench and Genquire were developed at the Plant
2292 Biotechnology Institute of the National Research Council of Canada.
2293 This is an integrated graphical suite of tools in Perl for examining a sequence,
2294 predicting gene structure, and creating annotations. Information about
2295 Genquire is available at http://bioinformatics.org/project/?group_id=99.
2296 With Genquire and bioperl-gui one can display a Bioperl Seq object
2297 graphically. You can download the current version of
2298 the gui software from the bioperl-gui CVS directory at
2299 http://cvs.bioperl.org/cgi-bin/viewcvs/viewcvs.cgi/bioperl-gui/?cvsroot=bioperl.
2300 You will also need Tcl/Tk.
2303 =head2 V.1 Appendix: Finding out which methods are used by which
2306 At numerous places in the tutorial, the reader is directed to the
2307 "documentation included with each of the modules." As was mentioned in
2308 the introduction, it is sometimes not easy in perl to determine the
2309 appropriate documentation to look for, because objects inherit methods
2310 from other objects (and the relevant documentation will be stored in the
2311 object from which the method was inherited.)
2313 For example, say you wanted to find documentation on the "parse" method
2314 of the object Genscan.pm. You would not find this documentation in
2315 the code for Genscan.pm, but rather in the code for AnalysisResult.pm
2316 from which Genscan.pm inherits the parse method!
2318 So how would you know to look in AnalysisResult.pm
2319 for this documentation? The easy way is to use the special function
2320 "option 100" in the bptutorial script. Specifically if you run:
2322 > perl -w bptutorial.pl 100 Bio::Tools::Genscan
2324 you will receive the following output:
2326 ***Methods for Object Bio::Tools::Genscan ********
2328 Methods taken from package Bio::Root::IO
2329 catfile close gensym new qualify qualify_to_ref
2330 rmtree tempdir tempfile ungensym
2332 Methods taken from package Bio::Root::RootI
2333 DESTROY stack_trace stack_trace_dump throw verbose warn
2335 Methods taken from package Bio::SeqAnalysisParserI
2336 carp confess croak next_feature
2338 Methods taken from package Bio::Tools::AnalysisResult
2339 analysis_date analysis_method analysis_method_version analysis_query analysis_subject parse
2341 Methods taken from package Bio::Tools::Genscan
2344 From this output, it is clear exactly from which object each method
2345 of Genscan.pm is taken, and, in particular that "parse" is
2346 taken from the package Bio::Tools::AnalysisResult.
2348 With this approach you can easily determine the source of any method
2349 in any bioperl object.
2351 =head2 V.2 Appendix: Tutorial demo scripts
2353 The following scripts demonstrate many of the features of bioperl. To
2354 run all the demos run:
2356 > perl -w bptutorial.pl 0
2358 To run a subset of the scripts do
2360 > perl -w bptutorial.pl
2362 and use the displayed help screen.
2368 # PROGRAM : bptutorial.pl
2369 # PURPOSE : Demonstrate various uses of the bioperl package
2370 # AUTHOR : Peter Schattner schattner@alum.mit.edu
2371 # CREATED : Dec 15 2000
2375 use Bio
::SimpleAlign
;
2380 # subroutine references
2382 my ($access_remote_db, $index_local_db, $fetch_local_db,
2383 $sequence_manipulations, $seqstats_and_seqwords,
2384 $restriction_and_sigcleave, $other_seq_utilities, $run_remoteblast,
2385 $run_standaloneblast, $blast_parser, $bplite_parsing, $hmmer_parsing,
2386 $run_clustalw_tcoffee, $run_psw_bl2seq, $simplealign,
2387 $gene_prediction_parsing, $sequence_annotation, $largeseqs,
2388 $run_tree, $run_map, $run_struct, $run_perl, $searchio_parsing,
2389 $liveseqs, $demo_variations, $demo_xml, $display_help, $bpinspect1 );
2391 # global variable file names. Edit these if you want to try
2392 #out a tutorial script on a different file
2394 Bio
::Root
::IO
->catfile("t","data","ecolitst.fa");
2396 my $dna_seq_file = Bio
::Root
::IO
->catfile("t","data","dna1.fa"); # used in $sequence_manipulations
2397 my $amino_seq_file = Bio
::Root
::IO
->catfile("t","data","cysprot1.fa"); # used in $other_seq_utilities and in $run_perl
2398 #and $sequence_annotation
2399 my $blast_report_file = Bio
::Root
::IO
->catfile("t","data","blast.report"); # used in $blast_parser
2400 my $bp_parse_file1 = Bio
::Root
::IO
->catfile("t","data","blast.report"); # used in $bplite_parsing
2401 my $bp_parse_file2 = Bio
::Root
::IO
->catfile("t","data","psiblastreport.out"); # used in $bplite_parsing
2402 my $bp_parse_file3 = Bio
::Root
::IO
->catfile("t","data","bl2seq.out"); # used in $bplite_parsing
2403 my $unaligned_amino_file = Bio
::Root
::IO
->catfile("t","data","cysprot1a.fa"); # used in $run_clustalw_tcoffee
2404 my $aligned_amino_file = Bio
::Root
::IO
->catfile("t","data","testaln.pfam"); # used in $simplealign
2406 # other global variables
2410 ##############################
2412 $display_help = sub {
2415 # Prints usage information for tutorial script.
2417 print STDERR
<<"QQ_PARAMS_QQ";
2419 The following numeric arguments can be passed to run the corresponding demo-script.
2420 1 => access_remote_db ,
2421 2 => index_local_db ,
2422 3 => fetch_local_db , (NOTE: needs to be run with demo 2)
2423 4 => sequence_manipulations ,
2424 5 => seqstats_and_seqwords ,
2425 6 => restriction_and_sigcleave ,
2426 7 => other_seq_utilities ,
2427 8 => run_standaloneblast ,
2428 9 => bplite_parsing ,
2429 10 => hmmer_parsing ,
2430 11 => run_clustalw_tcoffee ,
2431 12 => run_psw_bl2seq ,
2433 14 => gene_prediction_parsing ,
2434 15 => sequence_annotation ,
2437 18 => demo_variations ,
2443 24 => searchio_parsing ,
2444 25 => run_remoteblast ,
2446 In addition the argument "100" followed by the name of a single
2447 bioperl object will display a list of all the public methods
2448 available from that object and from what object they are inherited.
2450 Using the parameter "0" will run all tests.
2451 Using any other argument (or no argument) will run this display.
2453 So typical command lines might be:
2454 To run all demo scripts:
2455 > perl -w bptutorial.pl 0
2456 or to just run the local indexing demos:
2457 > perl -w bptutorial.pl 2 3
2458 or to list all the methods available for object Bio::Tools::SeqStats -
2459 > perl -w bptutorial.pl 100 Bio::Tools::SeqStats
2467 ## Note: "main" routine is at very end of script
2469 #################################################
2470 # access_remote_db():
2473 $access_remote_db = sub {
2475 use Bio::DB::GenBank;
2477 print "Beginning remote database access example... \n";
2479 # III.2.1 Accessing remote databases (Bio::DB::GenBank, etc)
2481 my ($gb, $seq1, $seq2, $seq1_id, $seqobj, $seq2_id,$seqio);
2484 $gb = new Bio::DB::GenBank();
2485 $seq1 = $gb->get_Seq_by_id('MUSIGHBA1');
2489 warn "Warning: Couldn't connect to Genbank with Bio::DB::GenBank.pm!\nProbably no network access.\n Skipping Test\n";
2492 $seq1_id = $seq1->display_id();
2493 print "seq1 display id is $seq1_id \n";
2494 $seq2 = $gb->get_Seq_by_acc('AF303112');
2495 $seq2_id = $seq2->display_id();
2496 print "seq2 display id is $seq2_id \n";
2497 $seqio = $gb->get_Stream_by_batch([ qw(2981014 J00522 AF303112)]);
2498 $seqobj = $seqio->next_seq();
2499 print "Display id of first sequence in stream is ", $seqobj->display_id() ,"\n";
2504 #################################################
2505 # index_local_db ():
2508 $index_local_db = sub {
2510 use Bio
::Index
::Fasta
; # using fasta file format
2511 use strict
; # some users have reported that this is required
2513 my ( $Index_File_Name, $inx1, $inx2, $id, $dir, $key,
2514 $keyfound, $seq, $indexhash);
2515 print "\nBeginning indexing local_db example... \n";
2516 print "This subroutine unlikely to run unless OS = unix\n";
2519 # III.2.2 Indexing and accessing local databases
2520 # (Bio::Index::*, bpindex.pl, bpfetch.pl)
2522 # first create the index
2523 $Index_File_Name = 'bptutorial.indx';
2525 eval { use Cwd
; $dir = cwd
; };
2526 # CWD not installed, revert to unix behavior, best we can do
2527 if( $@
) { $dir = `pwd`;}
2530 $inx1 = Bio
::Index
::Fasta
->new
2531 ('-FILENAME' => "$dir/$Index_File_Name",
2532 '-write_flag' => 1);
2534 $inx1->make_index(Bio
::Root
::IO
->catfile("$dir","t","data","multifa.seq") );
2535 $inx1->make_index(Bio
::Root
::IO
->catfile("$dir","t","data","seqs.fas") );
2538 # $inx1->make_index("$dir/t/data/multifa.seq");
2539 # $inx1->make_index("$dir/t/data/seqs.fas");
2540 print "Finished indexing local_db example... \n";
2546 #################################################
2547 # fetch_local_db ():
2550 $fetch_local_db = sub {
2552 use Bio
::Index
::Fasta
; # using fasta file format
2553 use strict
; # some users have reported that this is required
2555 my ( $Index_File_Name, $inx2, $id, $dir, $key,
2556 $keyfound, $seq, $indexhash,$value );
2557 print "\nBeginning retrieving local_db example... \n";
2559 # then retrieve some files
2560 #$Index_File_Name = shift;
2561 $Index_File_Name = 'bptutorial.indx';
2563 eval { use Cwd
; $dir = cwd
; };
2564 # CWD not installed, revert to unix behavior, best we can do
2565 if( $@
) { $dir = `pwd`;}
2567 $inx2 = Bio
::Index
::Abstract
->new
2568 ('-FILENAME' => Bio
::Root
::IO
->catfile("$dir","$Index_File_Name") );
2569 # ('-FILENAME' => "$dir/$Index_File_Name");
2571 $indexhash = $inx2->db();
2573 while ( ($key,$value) = each %$indexhash ) {
2574 if ( $key =~ /97/ ) { $keyfound = 'true'; $id = $key; last; }
2577 $seq = $inx2->fetch($id);
2578 print "Sequence ", $seq->display_id(),
2579 " has length ", $seq->length()," \n";
2582 unlink "$dir/$Index_File_Name" ;
2589 #################################################
2590 # sequence_manipulations ():
2593 $sequence_manipulations = sub {
2595 my ($infile, $in, $out, $seqobj);
2596 $infile = $dna_seq_file;
2598 print "\nBeginning sequence_manipulations and SeqIO example... \n";
2601 # III.3.1 Transforming sequence files (SeqIO)
2603 $in = Bio
::SeqIO
->new('-file' => $infile ,
2604 '-format' => 'Fasta');
2605 $seqobj = $in->next_seq();
2607 # perl "tied filehandle" syntax is available to SeqIO,
2608 # allowing you to use the standard <> and print operations
2609 # to read and write sequence objects, eg:
2610 #$out = Bio::SeqIO->newFh('-format' => 'EMBL');
2612 $out = Bio
::SeqIO
->newFh('-format' => 'fasta');
2614 print "First sequence in fasta format... \n";
2617 # III.4 Manipulating individual sequences
2619 # The following methods return strings
2622 print "Seq object display id is ",
2623 $seqobj->display_id(), "\n"; # the human read-able id of the sequence
2624 print "Sequence is ",
2625 $seqobj->seq()," \n"; # string of sequence
2626 print "Sequence from 5 to 10 is ",
2627 $seqobj->subseq(5,10)," \n"; # part of the sequence as a string
2628 print "Acc num is ",
2629 $seqobj->accession_number(), " \n"; # when there, the accession number
2630 print "Moltype is ",
2631 $seqobj->alphabet(), " \n"; # one of 'dna','rna','protein'
2632 print "Primary id is ", $seqobj->primary_seq->primary_id()," \n";
2633 # a unique id for this sequence irregardless
2634 #print "Primary id is ", $seqobj->primary_id(), " \n";
2635 # a unique id for this sequence irregardless
2636 # of its display_id or accession number
2638 # The following methods return an array of Bio::SeqFeature objects
2639 $seqobj->top_SeqFeatures; # The 'top level' sequence features
2640 $seqobj->all_SeqFeatures; # All sequence features, including sub
2643 # The following methods returns new sequence objects,
2644 # but do not transfer features across
2645 # truncation from 5 to 10 as new object
2646 print "Truncated Seq object sequence is ",
2647 $seqobj->trunc(5,10)->seq(), " \n";
2648 # reverse complements sequence
2649 print "Reverse complemented sequence 5 to 10 is ",
2650 $seqobj->trunc(5,10)->revcom->seq, " \n";
2651 # translation of the sequence
2652 print "Translated sequence 6 to 15 is ",
2653 $seqobj->translate->subseq(6,15), " \n";
2657 $c ||= 'ctgagaaaataa';
2659 print "\nBeginning 3-frame and alternate codon translation example... \n";
2661 my $seq = new Bio
::PrimarySeq
('-SEQ' => $c,
2663 print "$c translated using method defaults : ",
2664 $seq->translate->seq, "\n";
2666 # Bio::Seq uses same sequence methods as PrimarySeq
2667 my $seq2 = new Bio
::Seq
('-SEQ' => $c, '-ID' => 'no.Two');
2668 print "$c translated as a coding region (CDS): ",
2669 $seq2->translate(undef, undef, undef, undef, 1)->seq, "\n";
2671 print "\nTranslating in all six frames:\n";
2672 my @frames = (0, 1, 2);
2673 foreach my $frame (@frames) {
2674 print " frame: ", $frame, " forward: ",
2675 $seq->translate(undef, undef, $frame)->seq, "\n";
2676 print " frame: ", $frame, " reverse-complement: ",
2677 $seq->revcom->translate(undef, undef, $frame)->seq, "\n";
2680 print "Translating with all codon tables using method defaults:\n";
2681 my @codontables = qw( 1 2 3 4 5 6 9 10 11 12 13 14 15 16 21 );
2682 foreach my $ct (@codontables) {
2684 $seq->translate(undef, undef, undef, $ct)->seq, "\n";
2690 ################################################# ;
2691 # seqstats_and_seqwords ():
2694 $seqstats_and_seqwords = sub {
2696 use Bio
::Tools
::SeqStats
;
2697 use Bio
::Tools
::SeqWords
;
2699 print "\nBeginning seqstats_and_seqwords example... \n";
2702 my ($seqobj, $weight, $monomer_ref, $codon_ref,
2703 $seq_stats, $words, $hash);
2704 $seqobj = Bio
::Seq
->new
2705 ('-seq'=>'ACTGTGGCGTCAACTGACTGTGGCGTCAACTGACTGTGGGCGTCAACTGACTGTGGCGTCAACTG',
2710 # III.4.1 Obtaining basic sequence statistics- MW,
2711 # residue &codon frequencies(SeqStats, SeqWord)
2714 $seq_stats = Bio
::Tools
::SeqStats
->new($seqobj);
2715 $weight = $seq_stats->get_mol_wt();
2716 $monomer_ref = $seq_stats->count_monomers();
2717 $codon_ref = $seq_stats->count_codons(); # for nucleic acid sequence
2718 print "Sequence \'test\' is ", $seqobj->seq(), " \n";
2719 print "Sequence ", $seqobj->id()," molecular weight is $$weight[0] \n";
2720 # Note, $weight is an array reference
2721 print "Number of A\'s in sequence is $$monomer_ref{'A'} \n";
2722 print "Number of ACT codon\'s in sequence is $$codon_ref{'ACT'} \n";
2724 $words = Bio
::Tools
::SeqWords
->new($seqobj);
2725 $hash = $words->count_words(6);
2726 print "Number of 6-letter \'words\' of type ACTGTG in",
2727 " sequence is $$hash{'ACTGTG'} \n";
2733 #################################################
2734 # restriction_and_sigcleave ():
2737 $restriction_and_sigcleave = sub {
2739 use Bio
::Tools
::RestrictionEnzyme
;
2740 use Bio
::Tools
::Sigcleave
;
2742 my ($re, $re1, $re2, @sixcutters, @fragments1,
2743 @fragments2, $seqobj, $dna);
2744 print "\nBeginning restriction enzyme example... \n";
2746 # III.4.4 Identifying restriction enzyme sites (RestrictionEnzyme)
2748 $dna = 'CCTCCGGGGACTGCCGTGCCGGGCGGGAATTCGCCATGGCGACCCTGGAAAAGCTGATATCGAAGGCCTTCGA';
2750 # Build sequence and restriction enzyme objects.
2751 $seqobj = new Bio
::Seq
('-ID' => 'test_seq',
2754 #$re = new Bio::Tools::RestrictionEnzyme();
2755 $re = new Bio
::Tools
::RestrictionEnzyme
('-name'=>'EcoRI');
2756 @sixcutters = $re->available_list(6);
2758 print "The following 6-cutters are available\n";
2759 print join(" ",@sixcutters),"\n";
2761 $re1 = new Bio
::Tools
::RestrictionEnzyme
('-name'=>'EcoRI');
2762 @fragments1 = $re1->cut_seq($seqobj);
2763 #$seqobj is the Seq object for the dna to be cut
2765 print "\nThe sequence of " . $seqobj->display_id . " is " .
2766 $seqobj->seq . "\n";
2767 print "When cutting " . $seqobj->display_id() . " with " .
2768 $re1->seq->id . " the initial fragment is\n" . $fragments1[0];
2770 $re2 = new Bio
::Tools
::RestrictionEnzyme
2771 ('-NAME' =>'EcoRV--GAT^ATC',
2772 '-MAKE' =>'custom');
2773 @fragments2 = $re2->cut_seq($seqobj);
2775 print "\nWhen cutting ", $seqobj->display_id(),
2776 " with ", $re2->seq->id;
2777 print " the second fragment is\n", $fragments2[1], " \n";
2779 # III.4.7 Identifying amino acid cleavage sites (Sigcleave)
2781 print "\nBeginning signal cleaving site location example... \n";
2783 my ( $sigcleave_object, %raw_results , $location,
2784 $formatted_output, $protein, $in, $seqobj2);
2787 "MKVILLFVLAVFTVFVSSRGIPPEEQSQFLEFQDKFNKKYSHEEYLERFEIFKSNLGKIEELNLIAINHKADTKFGVNKFADLSSDEFKNYYLNNKEAIFTDDLPVADYLDDEFINSIPTAFDWRTRGAVTPVKNQGQCGSCWSFSTTGNVEGQHFISQNKLVSLSEQNLVDCDHECMEYEGEEACDEGCNGGLQPNAYNYIIKNGGIQTESSYPYTAETGTQCNFNSANIGAKISNFTMIPKNETVMAGYIVSTGPLAIAADAVEWQFYIGGVFDIPCNPNSLDHGILIVGYSAKNTIFRKNMPYWIVKNSWGADWGEQGYIYLRRGKNTCGVSNFVSTSII";
2788 $formatted_output = "";
2791 # Note that Sigcleave is passed a raw sequence
2792 # rather than a sequence object when it is created.
2793 $sigcleave_object = new Bio
::Tools
::Sigcleave
2794 ('-id' => 'test_sigcleave_seq',
2796 '-threshold' => 3.0,
2797 '-seq' => $protein);
2799 %raw_results = $sigcleave_object->signals;
2800 $formatted_output = $sigcleave_object->pretty_print;
2802 print " SigCleave \'raw results\': \n";
2803 foreach $location (sort keys %raw_results) {
2804 print " Cleave site found at location $location ",
2805 "with value of $raw_results{$location} \n";
2807 print "\nSigCleave formatted output: \n $formatted_output \n";
2812 #################################################
2813 # run_standaloneblast ():
2816 $run_standaloneblast = sub {
2817 use Bio
::Tools
::Run
::StandAloneBlast
;
2818 use Bio
::Tools
::BPlite
;
2820 print "\nBeginning run_standaloneblast example... \n";
2822 my (@params, $factory, $input, $blast_report, $blast_present,
2823 $database, $sbjct, $str, $seq1);
2825 $database = $_[0] || 'ecoli.nt'; # user can select local nt database
2827 $blast_present = Bio
::Tools
::Run
::StandAloneBlast
->exists_blast();
2828 unless ($blast_present) {
2829 warn "blast program not found. Skipping StandAloneBlast example\n";
2832 #@params = ('program' => 'blastn', 'database' => 'ecoli.nt');
2833 @params = ('program' => 'blastn', 'database' => $database);
2834 $factory = Bio
::Tools
::Run
::StandAloneBlast
->new(@params);
2837 $str = Bio
::SeqIO
->new('-file'=> Bio
::Root
::IO
->catfile("t","data","dna2.fa") ,
2838 # $str = Bio::SeqIO->new('-file'=>'t/data/dna2.fa' ,
2839 '-format' => 'Fasta', );
2840 $seq1 = $str->next_seq();
2842 $blast_report = $factory->blastall($seq1);
2843 $sbjct = $blast_report->nextSbjct;
2845 print " Hit name is ", $sbjct->name, " \n";
2850 #################################################
2851 # run_remoteblast ():
2854 $run_remoteblast = sub {
2855 print "\nBeginning run_remoteblast example... \n";
2856 eval { require Bio
::Tools
::Run
::RemoteBlast
; };
2859 print STDERR
"Cannot load Bio::Tools::Run::RemoteBlast\n";
2860 print STDERR
"Cannot run run_remoteblast example:\n$@\n";
2862 my (@params, $remote_blast_object, $blast_file, $r, $rc,
2865 $database = 'ecoli';
2866 @params = ('-prog' => 'blastp',
2867 '-data' => $database,
2868 '-expect' => '1e-10',
2869 '-readmethod' => 'BPlite' );
2871 $remote_blast_object = Bio
::Tools
::Run
::RemoteBlast
->new(@params);
2872 $blast_file = Bio
::Root
::IO
->catfile("t","data","ecolitst.fa");
2873 $r = $remote_blast_object->submit_blast( $blast_file);
2874 print "submitted Blast job\n";
2875 while ( my @rids = $remote_blast_object->each_rid ) {
2876 foreach my $rid ( @rids ) {
2877 $rc = $remote_blast_object->retrieve_blast($rid);
2878 print "retrieving results...\n";
2879 if( !ref($rc) ) { # $rc not a reference => either error
2880 # or job not yet finished
2882 $remote_blast_object->remove_rid($rid);
2883 print "Error return code for BlastID code $rid ... \n";
2887 $remote_blast_object->remove_rid($rid);
2888 while ( my $sbjct = $rc->nextSbjct ) {
2889 print "sbjct name is ", $sbjct->name, "\n";
2890 while ( my $hsp = $sbjct->nextHSP ) {
2891 print "score is ", $hsp->score, "\n";
2902 #################################################
2903 # searchio_parsing ():
2908 $searchio_parsing = sub {
2910 my ($searchio, $result,$hit,$hsp);
2917 print "\nBeginning searchio-parser example... \n";
2919 $searchio = new Bio
::SearchIO
('-format' => 'blast',
2920 '-file' => Bio
::Root
::IO
->catfile('t','data','ecolitst.bls'));
2922 $result = $searchio->next_result;
2924 print "Database name is ", $result->database_name , "\n";
2925 print "Algorithm is ", $result->algorithm , "\n";
2926 print "Query length used is ", $result->query_length , "\n";
2927 print "Kappa value is ", $result->get_statistic('kappa') , "\n";
2928 print "Name of matrix used is ", $result->get_parameter('matrix') , "\n";
2931 $hit = $result->next_hit;
2932 print "First hit name is ", $hit->name , "\n";
2933 print "First hit length is ", $hit->length , "\n";
2934 print "First hit accession number is ", $hit->accession , "\n";
2936 $hsp = $hit->next_hsp;
2937 print "First hsp query start is ", $hsp->query->start , "\n";
2938 print "First hsp query end is ", $hsp->query->end , "\n";
2939 print "First hsp hit start is ", $hsp->hit->start , "\n";
2940 print "First hsp hit end is ", $hsp->hit->end , "\n";
2946 #################################################
2947 # bplite_parsing ():
2950 $bplite_parsing = sub {
2952 use Bio
::Tools
::BPlite
;
2953 my ($file1, $file2, $file3, $report,$report2 , $report3 ,
2954 $last_iteration, $sbjct, $total_iterations, $hsp,
2957 print "\nBeginning bplite, bppsilite, bpbl2seq parsing example... \n";
2959 ($file1, $file2, $file3) = ($bp_parse_file1,
2960 $bp_parse_file2 ,$bp_parse_file3 );
2961 #open FH, "t/data/blast.report";
2962 $report = Bio
::Tools
::BPlite
->new('-file'=>$file1);
2963 $sbjct = $report->nextSbjct;
2965 print " Hit name is ", $sbjct->name, " \n";
2966 while ($hsp = $sbjct->nextHSP) {
2970 use Bio
::Tools
::BPpsilite
;
2971 $report2 = Bio
::Tools
::BPpsilite
->new('-file'=>$file2);
2972 $total_iterations = $report2->number_of_iterations;
2973 $last_iteration = $report2->round($total_iterations);
2975 # print first 10 psiblast hits
2976 for ($loop = 1; $loop <= 10; $loop++) {
2977 $sbjct = $last_iteration->nextSbjct;
2978 $sbjct && print " PSIBLAST Hit is ", $sbjct->name, " \n";
2982 use Bio
::Tools
::BPbl2seq
;
2983 $report3 = Bio
::Tools
::BPbl2seq
->new('-file' => $file3,
2984 '-queryname' => "ALEU_HORVU");
2985 $hsp = $report3->next_feature;
2986 $matches = $hsp->match;
2987 print " Number of Blast2seq matches for first hsp equals $matches \n";
2994 #################################################
2998 $hmmer_parsing = sub {
3000 print "\nBeginning hmmer_parsing example... \n";
3002 # III.5.4 Parsing HMM reports (HMMER::Results)
3004 use Bio
::Tools
::HMMER
::Domain
;
3005 use Bio
::Tools
::HMMER
::Set
;
3006 use Bio
::Tools
::HMMER
::Results
;
3008 my ($res, @seq, @domain);
3010 $res = new Bio
::Tools
::HMMER
::Results
('-file' => Bio
::Root
::IO
->catfile("t","data","hmmsearch.out"),
3011 '-type' => 'hmmsearch');
3013 @seq = $res->each_Set;
3014 print "Initial sequence bit score is ",$seq[0]->bits,"\n";
3015 @domain = $seq[0]->each_Domain;
3016 print " For initial domain, start = ",$domain[0]->start,
3017 " end = ",$domain[0]->end,"and score =",$domain[0]->bits,"\n";
3019 #foreach $seq ( $res->each_Set ) {
3020 # print "Sequence bit score is",$seq->bits,"\n";
3021 # foreach $domain ( $seq->each_Domain ) {
3022 # print " Domain start ",$domain->start," end ",
3023 # $domain->end," score ",$domain->bits,"\n";
3029 #################################################
3030 # run_clustalw_tcoffee ():
3033 $run_clustalw_tcoffee = sub {
3035 use Bio
::Tools
::Run
::Alignment
::TCoffee
;
3036 use Bio
::Tools
::Run
::Alignment
::Clustalw
;
3037 my (@params, $factory, $ktuple, $aln, $seq_array_ref, $strout);
3038 my (@seq_array, $seq, $str, $infile);
3040 $infile = $unaligned_amino_file;
3042 # III.6.3 Aligning multiple sequences (Clustalw.pm, TCoffee.pm)
3043 print "\nBeginning run_clustalw example... \n";
3045 # put unaligned sequences in a Bio::Seq array
3046 $str = Bio
::SeqIO
->new('-file'=> $infile,
3047 '-format' => 'Fasta');
3049 while ($seq = $str->next_seq() ) { push (@seq_array, $seq) ;}
3050 $seq_array_ref = \
@seq_array;
3051 # where @seq_array is an array of Bio::Seq objects
3053 my $clustal_present =
3054 Bio
::Tools
::Run
::Alignment
::Clustalw
->exists_clustal();
3055 if ($clustal_present) {
3056 @params = ('ktuple' => 2, 'matrix' => 'BLOSUM', 'quiet' => 1);
3057 $factory = Bio
::Tools
::Run
::Alignment
::Clustalw
->new(@params);
3059 $factory->ktuple($ktuple); # change the parameter before executing
3060 $aln = $factory->align($seq_array_ref);
3061 $strout = Bio
::AlignIO
->newFh('-format' => 'msf');
3062 print "Output of clustalw alignment... \n";
3065 warn "Clustalw program not found. Skipping run_clustalw example....\n";
3067 print "\nBeginning run_tcoffee example... \n";
3069 my $tcoffee_present =
3070 Bio
::Tools
::Run
::Alignment
::TCoffee
->exists_tcoffee();
3071 if ($tcoffee_present) {
3072 @params = ('ktuple' => 2, 'matrix' => 'BLOSUM', 'quiet' => 1);
3073 $factory = Bio
::Tools
::Run
::Alignment
::TCoffee
->new(@params);
3075 $factory->ktuple($ktuple); # change the parameter before executing
3076 $aln = $factory->align($seq_array_ref);
3077 $strout = Bio
::AlignIO
->newFh('-format' => 'msf');
3078 print "Output of TCoffee alignment... \n";
3081 warn "TCoffee program not found. Skipping run_TCoffee example....\n";
3086 #################################################
3090 $simplealign = sub {
3092 my ($in, $out1,$out2, $aln, $threshold_percent, $infile);
3093 my ( $str, $resSlice1, $resSlice2, $resSlice3, $tmpfile);
3095 print "\nBeginning simplealign and alignio example... \n";
3097 # III.3.2 Transforming alignment files (AlignIO)
3098 $infile = $aligned_amino_file;
3099 #$tmpfile = "test.tmp";
3100 $in = Bio
::AlignIO
->new('-file' => $infile ,
3101 '-format' => 'pfam');
3102 $out1 = Bio
::AlignIO
->newFh('-format' => 'msf');
3104 # while ( my $aln = $in->next_aln() ) { $out->write_aln($aln); }
3105 $aln = $in->next_aln() ;
3106 print "Alignment to display in msf format... \n";
3109 $threshold_percent = 60;
3110 $str = $aln->consensus_string($threshold_percent) ;
3111 print "Consensus string with threshold = $threshold_percent is $str\n";
3113 print "The average percentage identity of the alignment is ", $aln->percentage_identity. "\n";
3115 my $seqname = '1433_LYCES';
3116 my $residue_number = 14;
3117 my $pos = $aln->column_from_residue_number($seqname, $residue_number);
3118 print "Residue number $residue_number of the sequence $seqname in the alignment occurs at column $pos. \n";
3120 my $s1 = new Bio
::LocatableSeq
(-id
=> 'AAA', -seq
=> 'aawtat-tn-', -start
=> 1, -end
=> 8,
3121 -alphabet
=> 'dna' );
3122 my $s2 = new Bio
::LocatableSeq
(-id
=> 'BBB', -seq
=> '-aaaat-tt-', -start
=> 1,
3123 -end
=> 7, -alphabet
=> 'dna');
3124 my $a = new Bio
::SimpleAlign
;
3128 print "The first sequence in the dna alignment is... \t", $s1->seq , "\n";
3129 print "The second sequence in the dna alignment is... \t", $s2->seq , "\n";
3130 my $iupac_consensus = $a->consensus_iupac;
3131 print "The IUPAC consensus of the dna alignment is... \t",$iupac_consensus , "\n";
3136 #################################################
3137 # run_psw_bl2seq ():
3140 $run_psw_bl2seq = sub {
3142 print "\nBeginning run_psw_bl2seq example... \n";
3143 eval { require Bio
::Tools
::pSW
; };
3145 print STDERR
("\nThe C-compiled engine for Smith Waterman alignments (Bio::Ext::Align) has not been installed.\n Please read the install the bioperl-ext package\n\n");
3149 #use Bio::Tools::pSW;
3150 use Bio
::Tools
::Run
::StandAloneBlast
;
3152 # III.6.1 Aligning 2 sequences with Smith-Waterman (pSW)
3154 my ($factory, $aln, $out1, $out2, $str, $seq1, $seq2, @params);
3156 # Get protein sequences from file
3158 $str = Bio
::SeqIO
->new('-file'=>Bio
::Root
::IO
->catfile("t","data","amino.fa") ,
3159 '-format' => 'Fasta', );
3160 $seq1 = $str->next_seq();
3161 $seq2 = $str->next_seq();
3163 $factory = new Bio
::Tools
::pSW
( '-matrix' => 'scripts/tools/blosum62.bla',
3167 $factory->align_and_show($seq1,$seq2,'STDOUT');
3168 $aln = $factory->pairwise_alignment($seq1,$seq2);
3170 $out1 = Bio
::AlignIO
->newFh('-format' => 'fasta');
3171 print "The Smith-Waterman alignment in fasta format... \n";
3175 # III.6.2 Aligning 2 sequences with Blast using bl2seq and AlignIO
3178 # first create factory for bl2seq
3179 @params = ('program' => 'blastp', 'outfile' => 'bl2seq.out');
3180 $factory = Bio
::Tools
::Run
::StandAloneBlast
->new(@params);
3181 $factory->bl2seq($seq1, $seq2);
3183 # Use AlignIO.pm to create a SimpleAlign object from the bl2seq report
3184 $str = Bio
::AlignIO
->new('-file'=> 'bl2seq.out',
3185 '-format' => 'bl2seq');
3186 $aln = $str->next_aln();
3188 $out2 = Bio
::AlignIO
->newFh('-format' => 'fasta');
3189 print "The Blast 2 sequence alignment in fasta format... \n";
3196 #################################################
3197 # gene_prediction_parsing ():
3200 $gene_prediction_parsing = sub {
3202 use Bio
::Tools
::Genscan
;
3204 print "\nBeginning genscan_result_parsing example... \n";
3206 my ($genscan, $gene, @exon_arr, $first_exon);
3208 # III.7 Searching for genes and other structures
3209 # on genomic DNA (Genscan, ESTScan, MZEF)
3211 use Bio
::Tools
::Genscan
;
3213 $genscan = Bio
::Tools
::Genscan
->new('-file' => Bio
::Root
::IO
->catfile("t","data","genomic-seq.genscan"));
3214 # $gene is an instance of Bio::Tools::Prediction::Gene
3215 # $gene->exons() returns an array of Bio::Tools::Prediction::Exon objects
3216 while($gene = $genscan->next_prediction()) {
3217 print "Protein corresponding to current gene prediction is ",
3218 $gene->predicted_protein->seq()," \n";
3219 @exon_arr = $gene->exons();
3220 $first_exon = $exon_arr[0];
3221 print "Start location of first exon of current predicted gene is ",
3222 $first_exon->start," \n";
3226 use Bio
::Tools
::Sim4
::Results
;
3227 print "\nBeginning Sim4_result_parsing example... \n";
3228 my ($sim4,$exonset, @exons, $exon, $homol);
3230 $sim4 = new Bio
::Tools
::Sim4
::Results
(-file
=> Bio
::Root
::IO
->catfile("t","data","sim4.rev"), -estisfirst
=>0);
3231 $exonset = $sim4->next_exonset;
3232 @exons = $exonset->sub_SeqFeature();
3234 print "Start location of first exon of first exon set is ",$exon->start(), " \n";
3235 print "Name of est sequence is ",$exon->est_hit()->seqname(), " \n";
3236 print "The length of the est hit is ",$exon->est_hit()->seqlength(), " \n";
3243 #################################################
3244 # sequence_annotation ():
3247 $sequence_annotation = sub {
3249 print "\nBeginning sequence_annotation example... \n";
3250 my ($feat, $feature1, $feature2,$seqobj, @topfeatures, @allfeatures,
3251 $ann, $in, $infile, $other, $answer);
3252 # III.8.1 Representing sequence annotations (Annotation, SeqFeature)
3254 $infile = $amino_seq_file;
3256 $in = Bio
::SeqIO
->new('-file' => $infile ,
3257 '-format' => 'Fasta');
3258 $seqobj = $in->next_seq();
3260 $feature1 = new Bio
::SeqFeature
::Generic
3264 '-primary' => 'exon',
3265 '-source' => 'internal', );
3267 $feature2 = new Bio
::SeqFeature
::Generic
3271 '-primary' => 'exon',
3272 '-source' => 'internal', );
3274 $seqobj->add_SeqFeature($feature1); # Add the SeqFeature to the parent
3275 $seqobj->add_SeqFeature($feature2); # Add the SeqFeature to the parent
3276 $seqobj->annotation(new Bio
::Annotation
('-description' => 'This is the annotation'));
3278 # Once the features and annotations have been associated
3279 # with the Seq, they can be with retrieved, eg:
3280 @topfeatures = $seqobj->top_SeqFeatures(); # just top level, or
3281 @allfeatures = $seqobj->all_SeqFeatures(); # descend into sub features
3282 $ann = $seqobj->annotation(); # annotation object
3283 $feat = $allfeatures[0];
3284 $other = $allfeatures[1];
3286 print " Total number of sequence features is: ", scalar @allfeatures, " \n";
3287 print " The sequence annotation is: ", $ann->description," \n";
3289 # The individual components of a SeqFeature can also be set
3290 # or retrieved with methods including:
3291 # attributes which return numbers
3292 $feat->start; # start position
3293 $feat->end; # end position
3294 $feat->strand; # 1 means forward, -1 reverse, 0 not relevant
3295 # attributes which return strings
3296 $feat->primary_tag; # the main 'name' of the sequence feature, eg, 'exon'
3297 $feat->source_tag; # where the feature comes from, eg'BLAST'
3298 # attributes which return Bio::PrimarySeq objects
3299 $feat->seq; # the sequence between start,end
3300 $feat->entire_seq; # the entire sequence
3302 print " The primary tag of the feature is: ", $feat->primary_tag, " \n";
3303 print " The first feature begins at location ", $feat->start, " \n";
3304 print " ends at location ", $feat->end, " \n";
3305 print " and is located on strand ", $feat->strand, " \n";
3307 # other useful methods include
3308 $answer = $feat->overlaps($other); # do SeqFeature $feat and SeqFeature $other overlap?
3309 print " Feature 1 does ", ($answer)?
"":"not ", " overlap Feature2. \n";
3311 $answer = $feat->contains($other); # is $other completely within $feat?
3312 print " Feature 1 does ", ($answer)?
"":"not ", " contain Feature2. \n";
3314 $answer = $feat->equals($other); # do $feat and $other completely $agree?
3315 print " Feature 1 does ", ($answer)?
"":"not ", " equal Feature2. \n";
3317 $feat->sub_SeqFeature(); # create/access an array of subsequence features
3323 #################################################
3329 print "\nBeginning largeseqs example... \n";
3331 # III.8.2 Representing and large sequences
3332 my ( $tmpfile, $seqio, $pseq, $plength, $last_4);
3335 $tmpfile = Bio
::Root
::IO
->catfile("t","data","largefastatest.out") ;
3336 $seqio = new Bio
::SeqIO
('-format'=>'largefasta',
3337 '-file' =>Bio
::Root
::IO
->catfile("t","data","genomic-seq.fasta"));
3338 $pseq = $seqio->next_seq();
3339 $plength = $pseq->length();
3341 print " The length of the sequence ",
3342 $pseq->display_id()," is $plength \n";
3343 $last_4 = $pseq->subseq($plength-3,$plength);
3344 print " The final four residues are $last_4 \n";
3346 #END { unlink $tmpfile; }
3353 #################################################
3359 use Bio
::LiveSeq
::IO
::BioPerl
;
3360 my ($loader, $gene, $id, $maxstart);
3362 print "\nBeginning liveseqs example... \n";
3364 # III.8.2 Representing changing sequences (LiveSeq)
3366 $loader=Bio
::LiveSeq
::IO
::BioPerl
->load('-db'=>"EMBL",
3367 '-file'=>Bio
::Root
::IO
->catfile("t","data","factor7.embl"));
3368 $gene=$loader->gene2liveseq('-gene_name' => "factor7");
3369 $id = $gene->get_DNA->display_id ;
3370 print " The id of the gene is $id , \n";
3371 $maxstart = $gene->maxtranscript->start;
3372 print " The start position of the max transcript is $maxstart , \n";
3378 #################################################
3384 eval { require Bio
::Structure
::Entry
;
3385 require Bio
::Structure
::IO
;
3388 print STDERR
"Cannot load Bio::Structure modules\n";
3389 print STDERR
"Cannot run run_struct:\n$@\n";
3391 print "\nBeginning Structure object example... \n";
3393 # testing PDB format
3394 my $pdb_file = Bio
::Root
::IO
->catfile("t","data","pdb1bpt.ent");
3395 my $structio = Bio
::Structure
::IO
->new(-file
=> $pdb_file,
3397 my $struc = $structio->next_structure;
3398 my ($chain) = $struc->chain;
3399 print " The current chain is ", $chain->id ," \n";
3400 my $pseq = $struc->seqres;
3401 print " The first 20 residues of the sequence corresponding " .
3402 "to this structure are " . $pseq->subseq(1,20) . " \n";
3411 #################################################
3420 print "\nBeginning MapIO example... \n";
3423 my $mapio = new Bio
::MapIO
(
3424 '-format' => 'mapmaker',
3425 '-file' => Bio
::Root
::IO
->catfile('t','data',
3428 my $map = $mapio->next_map;
3430 print " The type of the map is " , $map->type ," \n";
3431 print " The units of the map are " , $map->units ," \n";
3434 foreach my $marker ( $map->each_element ) {
3435 print " The marker ", $marker->name," is at ordered position " , $marker->position->order ," \n";
3436 if ($count++ >= 5) {return 1};
3443 #################################################
3452 print "\nBeginning phylogenetic tree example... \n";
3455 my $treeio = new Bio
::TreeIO
( -format
=> 'newick',
3456 -file
=> Bio
::Root
::IO
->catfile('t','data',
3457 'cysprot1b.newick'));
3459 my $tree = $treeio->next_tree;
3460 my @nodes = $tree->get_nodes;
3462 foreach my $node ( $tree->get_root_node()->each_Descendent() ) {
3463 print "node id and branch length: ", $node->to_string(), "\n";
3464 my @ch = $node->each_Descendent();
3466 print "\tchildren are: \n";
3467 foreach my $node ( $node->each_Descendent() ) {
3468 print "\t\t id and length: ", $node->to_string(), "\n";
3477 #################################################
3483 use Bio
::Perl
qw( read_sequence
3489 print "\nBeginning example of sequence manipulation without explicit Seq objects... \n";
3492 # getting a sequence from a database (assummes internet connection)
3494 my $seq_object = get_sequence
('swissprot',"ROA1_HUMAN");
3496 # sequences are Bio::Seq objects, so the following methods work
3497 # (for more info see Bio::Seq documentation - try perldoc Bio::Seq)
3499 print "Name of sequence retrieved from swissprot is ",$seq_object->display_id,"\n";
3500 print "Sequence acc is ",$seq_object->accession_number,"\n";
3501 print "First 5 residues are ",$seq_object->subseq(1,5),"\n";
3503 # getting sequence data from disk
3505 $seq_object = read_sequence
($amino_seq_file,'fasta');
3506 print "Name of sequence retrieved from disk is ",$seq_object->display_id,"\n";
3512 #################################################
3513 # demo_variations ():
3516 $demo_variations = sub {
3518 print "\nBeginning demo_variations example... \n";
3520 # III.8.3 Representing related sequences
3521 # - mutations, polymorphisms etc (Allele, SeqDiff,)
3523 use Bio
::Variation
::SeqDiff
;
3524 use Bio
::Variation
::Allele
;
3525 use Bio
::Variation
::DNAMutation
;
3526 my ($a1, $a2, $dnamut, $seqDiff, @current_variants);
3528 $a1 = Bio
::Variation
::Allele
->new;
3530 $a2 = Bio
::Variation
::Allele
->new;
3533 $dnamut = Bio
::Variation
::DNAMutation
->new
3536 '-upStreamSeq' => 'actggg',
3537 '-proof' => 'experimental',
3538 '-isMutation' => 1, );
3539 $dnamut->allele_ori($a1);
3540 $dnamut->add_Allele($a2);
3541 $seqDiff = Bio
::Variation
::SeqDiff
->new
3543 '-alphabet' => 'rna',
3544 #$seqDiff = Bio::Variation::SeqDiff->new
3545 #( '-id' => 'M20132', '-alphabet' => 'RNA',
3546 '-gene_symbol' => 'AR',
3547 '-chromosome' => 'X',
3548 '-numbering' => 'coding');
3549 # add it to a SeqDiff container object
3550 $seqDiff->add_Variant($dnamut);
3551 @current_variants = $seqDiff->each_Variant();
3552 print " The original sequence at current variation is ",
3553 $current_variants[0]->allele_ori->seq ," \n";
3554 print " The mutated sequence is ",
3555 $current_variants[0]->allele_mut->seq ," \n";
3556 print " The upstream sequence is ",
3557 $current_variants[0]->upStreamSeq ," \n";
3563 #################################################
3569 print "\nBeginning demo_xml example... \n";
3570 my ($str, $seqobj, $id, @feats, $first_primary_tag);
3572 # III.8.4 Sequence XML representations
3573 # - generation and parsing (SeqIO::game)
3575 eval { $str = Bio
::SeqIO
->new('-file'=>
3576 Bio
::Root
::IO
->catfile("t","data","test.game"),
3577 '-format' => 'game'); };
3579 print STDERR
"Cannot run demo_xml\n";
3580 print STDERR
"Problem parsing GAME format:\n$@\n";
3582 $seqobj = $str->next_seq();
3583 # $seq = $str->next_primary_seq();
3586 print "Seq object display id is ", $seqobj->display_id(),
3587 "\n"; # the human read-able id of the sequence
3588 print "The sequence description is: \n";
3589 print " ", $seqobj->desc(), " \n";
3590 print "Acc num is ", $seqobj->accession_number(),
3591 " \n"; # when there, the accession number
3592 print "Moltype is ", $seqobj->alphabet(),
3593 " \n"; # one of 'dna','rna','protein'
3595 @feats = $seqobj->all_SeqFeatures();
3596 $first_primary_tag = $feats[0]->primary_tag;
3597 print " Total number of sequence features is: ", scalar @feats, "
3599 print " The primary tag of the first feature is:
3600 $first_primary_tag \n";
3601 print " The first feature begins at location ", $feats[0]->start,
3603 print " and ends at location ", $feats[0]->end, " \n";
3610 ################################################
3611 # other_seq_utilities ():
3614 $other_seq_utilities = sub {
3616 use Bio
::Tools
::OddCodes
;
3617 use Bio
::Tools
::SeqPattern
;
3618 use Bio
::Tools
::CodonTable
;
3620 my ( $oddcode_obj, $amino_obj, $in, $infile, $chargeseq, $chemseq);
3621 my (@ii, $i, @res, $myCodonTable);
3622 my ( $pattern,$pattern_obj, $pattern_obj2, $pattern_obj3);
3624 print "\nBeginning sequence utilities example... \n";
3626 # III.4.6 Identifying amino acid characteristics
3627 # - eg charge, hydrophobicity (OddCodes)
3628 $infile = $amino_seq_file;
3629 $in = Bio
::SeqIO
->new('-file' => $infile ,
3630 '-format' => 'Fasta');
3631 $amino_obj = $in->next_seq->trunc(1,20);
3632 $oddcode_obj = Bio
::Tools
::OddCodes
->new($amino_obj);
3633 print "Initial sequence is: \n ", $amino_obj->seq(), " \n";
3634 $chargeseq = $oddcode_obj->charge;
3635 # Note OddCodes returns a reference to the desired
3636 # sequence, not the sequence itself.
3637 print "Sequence in \'charge\' alphabet is : \n ",$$chargeseq, " \n";
3638 $chemseq = $oddcode_obj->chemical;
3639 print "Sequence in \'chemical\' alphabet is : \n ",$$chemseq, " \n";
3641 # III.4.2.1 non-standard nucleotide translation tables (CodonTable)
3643 # create a table object by giving an ID
3644 $myCodonTable = Bio
::Tools
::CodonTable
-> new
( '-id' => 16);
3645 print "\nThe name of the current codon table is ",
3646 $myCodonTable->name() , " \n";
3648 # translate some codons
3649 @ii = qw(act acc att ggg ttt aaa ytr yyy);
3651 for $i (0..$#ii) { $res[$i] = $myCodonTable->translate($ii[$i]); }
3652 print "\nWith the current codon table, the sequence \n ",
3653 @ii, "\n is translated as \n" , @res, " \n";
3655 # III.4.2.2 utilities for ambiguous residues (SeqPattern, IUPAC)
3656 # III.4.2.3 regex-like nucleotide pattern manipulation
3658 print "\nBeginning demo of SeqPattern object \n ";
3660 $pattern = '(CCCCT)N{1,200}(agyyg)N{1,80}(agggg)';
3661 print "\nInput regular expression = $pattern \n ";
3662 $pattern_obj = new Bio
::Tools
::SeqPattern
('-SEQ' =>$pattern,
3664 $pattern_obj2 = $pattern_obj->revcom();
3665 print "\nReverse-complemented expression = ",$pattern_obj2->str, "\n ";
3666 $pattern_obj3 = $pattern_obj->revcom(1); ## returns expanded rev complement pattern.
3667 print "\nExpanded reverse-complemented expression = ",$pattern_obj3->str, "\n ";
3668 #$pattern_obj->revcom(1); ## returns expanded rev complement pattern.
3674 #################################################
3677 # Subroutine that identifies from which ancestor module each method
3678 # of an object is derived
3682 my ($object, $object_class,$package );
3683 my @package_list = ();
3685 ($object_class = $object) =~ s/::/\//g; # replace "::" with "/"
3686 $object_class = $object_class . ".pm
";
3687 require $object_class;
3689 my $method_hash = $object->methods('all');
3691 print " \n ***Methods
for Object
$object ******** \n";
3692 # get all the *unique* package names providing methods
3694 foreach $package (values %$method_hash) {
3695 push(@package_list, $package) unless $seen{$package}++;
3697 for $package (sort @package_list) {
3698 print " \n\n Methods taken from
package $package \n";
3700 for my $method (sort keys %$method_hash) {
3701 if ($$method_hash{$method} eq $package ) {
3704 if ($out_count == 7) {
3705 print " \n"; # keeps output line from getting too long
3713 # The following subroutine is based closely on Mark Summerfield's and Piers
3714 # Cawley's clever Class_Methods_Introspection program
3718 my ($class, $types) = @_;
3719 $class = ref $class || $class;
3723 my @class = ($class);
3726 while ($class = shift @class) {
3727 next if $classes_seen{$class}++;
3728 unshift @class, @{"${class}::ISA
"} if $types eq 'all';
3729 # Based on methods_via() in perl5db.pl
3730 for my $method (grep {not /^[(_]/ and
3731 defined &{${"${class}::"}{$_}}}
3732 keys %{"${class}::"}) {
3733 $methods{$method} = $class; #ps
3734 # $methods{$method} = wantarray ? undef : $class->can($method);
3738 wantarray ? keys %methods : \%methods;
3747 ## "main
" program follows
3751 if (scalar(@runlist)==0) {&$display_help;}; # display help if no option
3752 if ($runlist[0] == 0) {@runlist = (1..26); }; # argument = 0 means run all tests
3753 foreach $n (@runlist) {
3754 if ($n ==100) {my $object = $runlist[1]; &$bpinspect1($object); last;}
3755 if ($n ==1) {&$access_remote_db; next;}
3756 if ($n ==2) {&$index_local_db; next;}
3757 if ($n ==3) {&$fetch_local_db; next;}
3758 if ($n ==4) {&$sequence_manipulations; next;}
3759 if ($n ==5) {&$seqstats_and_seqwords; next;}
3760 if ($n ==6) {&$restriction_and_sigcleave; next;}
3761 if ($n ==7) {&$other_seq_utilities; next;}
3762 if ($n ==8) {&$run_standaloneblast; next;}
3763 if ($n ==9) {&$bplite_parsing; next;}
3764 if ($n ==10) {&$hmmer_parsing; next;}
3765 if ($n ==11) {&$run_clustalw_tcoffee; next;}
3766 if ($n ==12) {&$run_psw_bl2seq; next;}
3767 if ($n ==13) {&$simplealign ; next;}
3768 if ($n ==14) {&$gene_prediction_parsing; next;}
3769 if ($n ==15) {&$sequence_annotation; next;}
3770 if ($n ==16) {&$largeseqs; next;}
3771 if ($n ==17) {&$liveseqs; next;}
3772 if ($n ==18) {&$demo_variations; next;}
3773 if ($n ==19) {&$demo_xml; next;}
3774 if ($n ==20) {&$run_tree; next;}
3775 if ($n ==21) {&$run_map; next;}
3776 if ($n ==22) {&$run_struct; next;}
3777 if ($n ==23) {&$run_perl; next;}
3778 if ($n ==24) {&$searchio_parsing; next;}
3779 if ($n ==25) {&$run_remoteblast; next;}