bptutorial.pl

   1 #!/usr/bin/perl
   2 # $Id$
   3
   4 =head1 NAME
   5
   6 BioPerlTutorial - a tutorial for bioperl
   7
   8 =head1 AUTHOR
   9
  10   Written by Peter Schattner <schattner@alum.mit.edu>
  11
  12   Copyright Peter Schattner
  13
  14   Contributions, additions and corrections have been made
  15   to this document by the following individuals:
  16
  17   Jason Stajich
  18   Heikki Lehvaslaiho
  19   Brian Osborne
  20   Hilmar Lapp
  21   Chris Dagdigian
  22
  23
  24 =head1 DESCRIPTION
  25
  26    This tutorial includes "snippets" of code and text from various
  27    Bioperl documents including module documentation, example scripts
  28    and "t" test scripts. You may distribute this tutorial under the
  29    same terms as perl itself.
  30
  31    This document is written in Perl POD (plain old documentation)
  32    format.  You can run this file through your favorite pod translator
  33    (pod2html, pod2man, pod2text, etc.) if you would like a more
  34    convenient formatting.
  35
  36   Table of Contents
  37
  38   I. Introduction
  39   I.1 Overview
  40   I.2 Software requirements
  41     I.2.1 For minimal bioperl installation
  42     I.2.2 For complete installation
  43   I.3 Installation procedures
  44   I.4 Additional comments for non-unix users
  45
  46   II. Brief overview to bioperl's objects
  47   II.1 Sequence objects:
  48          (Seq, PrimarySeq, LocatableSeq, LiveSeq, LargeSeq, RichSeq, SeqWithQuality, SeqI)
  49   II.2 Alignment objects (SimpleAlign)
  50   II.3  Location objects (Simple, Split, Fuzzy)
  51   II.4  Interface objects and implementation objects
  52
  53   III. Using bioperl
  54   III.1 Accessing sequence data from local and remote databases
  55      III.1.1 Accessing remote databases (Bio::DB::GenBank, etc)
  56      III.1.2 Indexing and accessing local databases (Bio::Index::*,  bpindex.pl,
  57              bpfetch.pl)
  58   III.2 Transforming formats of database/ file records
  59      III.2.1 Transforming sequence files (SeqIO)
  60      III.2.2 Transforming alignment files (AlignIO)
  61   III.3 Manipulating sequences
  62     III.3.1 Manipulating sequence data with Seq methods (Seq)
  63     III.3.2 Obtaining basic sequence statistics- MW, residue &codon frequencies
  64             (SeqStats)
  65     III.3.3 Identifying restriction enzyme sites (RestrictionEnzyme)
  66     III.3.4 Identifying amino acid cleavage sites (Sigcleave)
  67     III.3.5 Miscellaneous sequence utilities: OddCodes, SeqPattern
  68     III.3.6 Sequence manipulation using the Bioperl EMBOSS interface
  69     III.3.7 Sequence manipulation without creating Bioperl "objects"
  70   III.4 Searching for "similar" sequences
  71      III.4.1 Running BLAST locally  (StandAloneBlast)
  72      III.4.2 Running BLAST remotely (using RemoteBlast.pm)
  73      III.4.3 Parsing BLAST and FASTA reports with Search and SearchIO
  74      III.4.4 Parsing BLAST reports with BPlite, BPpsilite, and BPbl2seq
  75      III.4.5 Parsing HMM reports (HMMER::Results, SearchIO)
  76   III.5 Creating and manipulating sequence alignments
  77      III.5.1 Aligning 2 sequences with Smith-Waterman (pSW)
  78      III.5.2 Aligning 2 sequences with Blast using  bl2seq and AlignIO
  79      III.5.3 Aligning multiple sequences (Clustalw.pm, TCoffee.pm)
  80      III.5.4 Manipulating and displaying alignments (SimpleAlign)
  81   III.6 Searching for genes and other structures on genomic DNA
  82                         (Genscan, Sim4, ESTScan, MZEF, Grail, Genemark, EPCR)
  83   III.7 Developing machine readable sequence annotations
  84      III.7.1 Representing sequence annotations (Annotation, SeqFeature, RichSeq)
  85      III.7.2 Representing and large and/or changing sequences (LiveSeq,LargeSeq)
  86      III.7.3 Representing related sequences - mutations, polymorphisms etc (Allele,
  87              SeqDiff)
  88      III.7.4 Incorpotating quality data in sequence annotation (SeqWithQuality)
  89      III.7.5 Sequence XML representations - generation and parsing (SeqIO::game)
  90   III.8 Representing non-sequence data in Bioperl: structures, trees, maps, graphics
  91         and bibliographic text
  92      III.8.1 Using 3D structure objects and reading PDB files (StructureI,
  93              Structure::IO)
  94      III.8.2 Tree objects and phylogenetic trees (Tree::Tree, TreeIO)
  95      III.8.3 Map objects for manipulating genetic maps (Map::MapI, MapIO)
  96      III.8.4 Bibliographic objects for querying bibliographic databases (Biblio)
  97      III.8.5 Graphics objects for representing sequence objects as images (Graphics)
  98
  99   III.9 Bioperl alphabets
 100      III.9.1 Extended DNA / RNA alphabet
 101      III.9.2 Amino Acid alphabet
 102
 103   IV.  Related projects - biocorba, biopython, biojava, EMBOSS, Ensembl, GFF, Genquire
 104      IV.1 Biocorba
 105      IV.2 Biopython and biojava
 106      IV.3 EMBOSS
 107      IV.4 Ensembl and bioperl-db
 108      IV.5 GFF format and Bio::DB::GFF*
 109      IV.6 Genquire, the Annotation Workbench and bioperl-gui
 110
 111   V.  Appendices
 112      V.1 Finding out which methods are used by which Bioperl Objects
 113      V.2 Tutorial Demo Scripts
 114
 115 =head1 I. Introduction
 116
 117 =head2  I.1 Overview
 118
 119 Bioperl is a collection of perl modules that facilitate the
 120 development of perl scripts for bioinformatics applications.  As
 121 such, it does not include ready to use programs in the sense that many
 122 commercial packages and free web-based interfaces (eg Entrez, SRS) do.
 123 On the other hand, bioperl does provide reusable perl modules that
 124 facilitate writing perl scripts for sequence manipulation, accessing
 125 of databases using a range of data formats and execution and parsing
 126 of the results of various molecular biology programs including Blast,
 127 clustalw, TCoffee, genscan, ESTscan and HMMER.  Consequently, bioperl
 128 enables developing scripts that can analyze large quantities of
 129 sequence data in ways that are typically difficult or impossible with
 130 web based systems.
 131
 132 In order to take advantage of bioperl, the user needs a basic
 133 understanding of the perl programming language including an
 134 understanding of how to use perl references, modules, objects and
 135 methods. If these concepts are unfamiliar the user is referred to any
 136 of the various introductory / intermediate books on perl. (I've liked
 137 S. Holzmer's Perl Core Language, Coriolis Technology Press, for
 138 example).  This tutorial is not intended to teach the fundamentals of
 139 perl to those with little or no experience in the perl language.  On
 140 the other hand, advanced knowledge of perl - such as how to write a
 141 perl object - is not required for successfully using bioperl.
 142
 143 Bioperl is open source software that is still under active
 144 development.  The advantages of open source software are well known.
 145 They include the ability to freely examine and modify source code and
 146 exemption from software licensing fees.  However, since open source
 147 software is typically developed by a large number of volunteer
 148 programmers, the resulting code is often not as clearly organized and
 149 its user interface not as standardized as in a mature commercial
 150 product.  In addition, in any project under active development,
 151 documentation may not keep up with the development of new features.
 152 Consequently the learning curve for actively developed, open source
 153 source software is sometimes steep.
 154
 155 This tutorial is intended to ease the learning curve for new users of
 156 bioperl.  To that end the tutorial includes:
 157
 158 =over 4
 159
 160 =item *
 161
 162 Descriptions of what bioinformatics tasks can be handled with bioperl
 163
 164 =item *
 165
 166 Directions on where to find the methods to accomplish these tasks
 167 within the bioperl package
 168
 169 =item *
 170
 171 Recommendations on where to go for additional information.
 172
 173 =item *
 174
 175 The POD documentation should contain runnable code in the SYNOPSIS section
 176 which is meant to illustrate the use of a module and its methods.
 177
 178 =back
 179
 180 Running the bptutorial.pl script while going through this tutorial - or
 181 better yet, stepping through it with an interactive debugger - is a
 182 good way of learning bioperl.  The tutorial script is also a good
 183 place from which to cut-and-paste code for your scripts (rather than
 184 using the code snippets in this tutorial). Most of the scripts in the
 185 tutorial script should work on your machine - and if they don't it would
 186 probably be a good idea to find out why, before getting too involved
 187 with bioperl!
 188
 189 This tutorial does not intend to be a comprehensive description of all
 190 the objects and methods available in bioperl.  For that the reader is
 191 directed to the documentation included with each of the modules. A
 192 very useful interface for finding one's way within all the module
 193 documentation can be found at http://doc.bioperl.org/bioperl-live/.
 194 This interface lists all bioperl modules and descriptions of all
 195 of their methods.
 196
 197 One potential problem in locating the correct documentation is that
 198 multiple methods in different modules may all share the same name.
 199 Moreover, because of perl's complex method of "inheritance",
 200 it is not often clear which of the identically named methods is being
 201 called by a given object. One way to resolve this question is by using
 202 the software described in Appendix V.1.
 203
 204 For those who prefer more visual descriptions,
 205 http://bioperl.org/Core/Latest/modules.html also offers links to three
 206 PDF files which contain schematics that describe how many of the bioperl
 207 objects related to one another.
 208
 209 In addition, a bioperl online course is available on the web at
 210 http://www.pasteur.fr/recherche/unites/sis/formation/bioperl. The
 211 user is also referred to numerous bioperl scripts in the scripts/
 212 directory (see bioscripts.pod for a description of these scripts).
 213
 214 =head2 I.2 Software requirements
 215
 216 =cut
 217
 218 =head2   I.2.1 Minimal bioperl installation
 219
 220 For a "minimal" installation of bioperl, you will need to have perl
 221 itself installed as well as the bioperl "core modules".  Bioperl has
 222 been tested primarily using perl 5.005 and perl 5.6.
 223 The minimal bioperl installation should still work under perl 5.004.
 224 However, as increasing numbers of bioperl objects are using modules
 225 from CPAN (see below), problems have been observed for bioperl running
 226 under perl 5.004.  So if you are having trouble running bioperl under
 227 perl 5.004, you should probably upgrade your version of perl.
 228
 229 In addition to a current version of perl, the new user of bioperl is
 230 encouraged to have access to, and familiarity with, an interactive
 231 perl debugger.  Bioperl is a large collection of complex interacting
 232 software objects.  Stepping through a script with an interactive
 233 debugger is a very helpful way of seeing what is happening in such a
 234 complex software system - especially when the software is not behaving
 235 in the way that you expect.  The free graphical debugger ptkdb
 236 (available as Devel::ptkdb from CPAN) is highly recommended.  Active
 237 State, from http://www.activestate.com , offers a commercial graphical
 238 debugger for windows systems.  The standard perl distribution also
 239 contains a powerful interactive debugger - though with a more cumbersome
 240 command-line interface. The Perl tool Data::Dumper used with the
 241 syntax:
 242
 243   use Data::Dumper;
 244   printer Dumper($seq);
 245
 246 can also be helpful for obtaining debugging information on perl objects.
 247
 248 =head2     I.2.2 Complete installation
 249
 250 Taking full advantage of bioperl requires software beyond that for the
 251 minimal installation.  This additional software includes perl modules
 252 from CPAN, bioperl perl extensions, a bioperl xs-extension, and
 253 several standard compiled bioinformatics programs.
 254
 255 B<Perl - extensions>
 256
 257 For a complete listing of external Perl modules required by bioperl
 258 please see the L<INSTALL> file.
 259
 260 B<Bioperl C extensions & external bioinformatics programs>
 261
 262 Bioperl also uses several C programs for sequence alignment and local
 263 blast searching. To use these features of bioperl you will need an
 264 ANSI C or Gnu C compiler as well as the actual program available from
 265 sources such as:
 266
 267 for Smith-Waterman alignments- bioperl-ext-0.6 from
 268 http://bioperl.org/Core/external.shtml
 269
 270 for clustalw alignments-
 271 ftp://ftp-igbmc.u-strasbg.fr/pub/ClustalW/
 272
 273 for tcoffee alignments-
 274 http://igs-server.cnrs-mrs.fr/~cnotred/Projects_home_page/t_coffee_home_page.html
 275
 276 for local blast searching- ftp://ftp.ncbi.nlm.nih.gov/blast/server/current_release/
 277
 278 for EMBOSS applications - http://www.hgmp.mrc.ac.uk/Software/EMBOSS/download.html
 279
 280 =for html <A NAME ="i.3"></A>
 281
 282 =head2  I.3 Installation
 283
 284 The actual installation of the various system components is
 285 accomplished in the standard manner:
 286
 287 =over 6
 288
 289 =item *
 290
 291 Locate the package on the network
 292
 293 =item *
 294
 295 Download
 296
 297 =item *
 298
 299 Decompress (with gunzip or a similiar utility)
 300
 301 =item *
 302
 303 Remove the file archive (eg with tar -xvf)
 304
 305 =item *
 306
 307 Create a "makefile" (with "perl Makefile.PL" for perl modules or a
 308 supplied "install" or "configure" program for non-perl program
 309
 310 =item *
 311
 312 Run "make", "make test" and "make install" This procedure must be
 313 repeated for every CPAN module, bioperl-extension and external
 314 module to be installed. A helper module CPAN.pm is available from
 315 CPAN which automates the process for installing the perl modules.
 316
 317 The CPAN module can also be used to install all of the modules
 318 listed above in a single step as a "bundle" of modules,
 319 Bundle::BioPerl, eg
 320
 321   $>perl -MCPAN -e shell
 322   cpan>install Bundle::BioPerl
 323   <installation details....>
 324   cpan>install B/BI/BIRNEY/bioperl-1.0.1.tar.gz
 325   <installation details....>
 326   cpan>quit
 327
 328 Be advised that version numbers change regularly, so the number used
 329 above may not apply. The disadvantage of this approach is that if
 330 there's a problem installing any individual module it may be a bit
 331 more difficult to address.
 332
 333 =back
 334
 335 For the external programs (clustal, Tcoffee, ncbi-blast), there is an
 336 extra step:
 337
 338 =over 1
 339
 340 =item *
 341
 342 Set the relevant environmental variable (CLUSTALDIR, TCOFFEEDIR or
 343 BLASTDIR) to the directory holding the executable in your startup
 344 file - eg in .bashrc. (For running local blasts, it is also
 345 necessary that the name of local-blast database directory is known
 346 to bioperl.  This will typically happen automatically, but in case
 347 of difficulty, refer to the documentation in
 348 L<Bio::Tools::Run::StandAloneBlast>)
 349
 350 =back
 351
 352 The only likely complication (at least on unix systems) that may occur
 353 is if you are unable to obtain system level writing privileges.  For
 354 instructions on modifying the installation in this case and for more
 355 details on the overall installation procedure, see the README file in
 356 the bioperl distribution as well as the README files in the external
 357 programs you want to use (eg bioperl-ext, clustalw, TCoffee,
 358 NCBI-blast).
 359
 360 =head2 I.4 Additional comments for non-unix users
 361
 362 Bioperl has mainly been developed and tested under various unix
 363 environments (including Linux) and this tutorial is intended primarily
 364 for unix users.  The minimal installation of bioperl should work
 365 under other OS's (NT, windows,Mac).  However, bioperl has not been
 366 widely tested under these OS's.
 367
 368 Todd Richmond has written of his experiences with BioPerl on MacOS 9
 369 at http://bioperl.org/Core/mac-bioperl.html.  There is also a
 370 description of bioperl on windows by Jurgen Pletinckx at
 371 http://www.bioperl.org/Core/windows-bioperl.html. (Note that currently
 372 these documents describe release 0.7.x of bioperl.)  Minimal bioperl
 373 does run without problems on Mac OS X since it is a Unix system.  However,
 374 external precompiled programs (eg NCBI local Blast) and other useful
 375 auxiliary programs such as perl-TK and ptkdb are in many cases not yet
 376 available under OS X.
 377
 378 Steve Cannon has compiled installation notes and suggestions for Bioperl
 379 on OS X online at http://www.tc.umn.edu/~cann0010/Bioperl_OSX_install.html.
 380
 381 Many bioperl features require the use of CPAN modules, compiled
 382 extensions or external programs.  These features will probably will
 383 not work under some or all of these other operating systems.  If a
 384 script attempts to access these features from a non-unix OS, bioperl
 385 is designed to simply report that the desired capability is not available.
 386 However, since the testing of bioperl in these environments has been
 387 limited, the script may well crash in a less "graceful" manner.
 388
 389 =head1 II. Brief introduction to bioperl's objects
 390
 391 The purpose of this tutorial is to get you using bioperl to solve
 392 real-life bioinformatics problems as quickly as possible.  The aim is
 393 not to explain the structure of bioperl objects or perl
 394 object-oriented programming in general.  Indeed, the relationships
 395 among the bioperl objects is not simple; however, understanding them
 396 in detail is fortunately not necessary for successfully using the
 397 package.
 398
 399 Nevertheless, a little familiarity with the bioperl object "bestiary"
 400 can be very helpful even to the casual user of bioperl. For example
 401 there are (at least) seven different "sequence objects" - Seq,
 402 PrimarySeq, LocatableSeq, LiveSeq, LargeSeq, SeqI, and SeqWithQuality.
 403 Understanding the relationships among these objects - and why there are
 404 so many of them - will help you select the appropriate one to use in your
 405 script.
 406
 407 =for html <A NAME ="ii.1"></A>
 408
 409 =head2 II.1 Sequence objects: (Seq, RichSeq, SeqWithQuality, PrimarySeq, LocatableSeq, LiveSeq, LargeSeq, SeqI)
 410
 411 Seq is the central sequence object in bioperl.  When in doubt this is
 412 probably the object that you want to use to describe a DNA, RNA or
 413 protein sequence in bioperl.  Most common sequence manipulations can
 414 be performed with Seq.  These capabilities are described in sections
 415 L<"III.3.1"> and L<"III.7.1">, or in L<Bio::Seq>.
 416
 417 Seq objects can be created explicitly (see section L<"III.2.1"> for an
 418 example).  However usually Seq objects will be created for you
 419 automatically when you read in a file containing sequence data using
 420 the SeqIO object.  This procedure is described in section L<"III.2.1">.
 421 In addition to storing its identification labels and the sequence itself,
 422 a Seq object can store multiple annotations and associated "sequence
 423 features".  This capability can be very useful - especially in
 424 development of automated genome annotation systems, see section
 425 L<"III.7.1">.
 426
 427 RichSeq objects store additional annotations beyond those used by
 428 standard Seq objects.  If you are using sources with very rich
 429 sequence annotation, you may want to consider using these objects
 430 which are described in section L<"III.7.1">. SeqWithQuality objects are
 431 used to manipulate sequences with quality data, like those produced by
 432 phred.  These objects are described in section L<"III.7.4">,
 433 L<Bio::Seq::RichSeqI>, and in L<Bio::Seq::SeqWithQuality>.
 434
 435 On the other hand, if you need a script capable of simultaneously
 436 handling many (hundreds or thousands) sequences at a time, then the
 437 overhead of adding annotations to each sequence can be significant.
 438 For such applications, you will want to use the PrimarySeq
 439 object. PrimarySeq is basically a "stripped down" version of Seq.
 440 It contains just the sequence data itself and a few identifying labels
 441 (id, accession number, molecule type = dna, rna, or protein).  For
 442 applications with hundreds or thousands or sequences, using PrimarySeq
 443 objects can significantly speed up program execution and decrease the
 444 amount of RAM the program requires. See L<Bio::PrimarySeq> for more
 445 details.
 446
 447 What is (for historical reasons) called a LocatableSeq object
 448 might be more appropriately called an "AlignedSeq" object.  It is a Seq
 449 object which is part of a multiple sequence alignment.  It has "start" and
 450 "end" positions indicating from where in a larger sequence it may have
 451 been extracted.  It also may have "gap" symbols corresponding to the
 452 alignment to which it belongs.  It is used by the alignment
 453 object SimpleAlign and other modules that use SimpleAlign objects
 454 (eg AlignIO.pm, pSW.pm).  In general you don't have to
 455 worry about creating LocatableSeq objects because they will be made for
 456 you automatically when you create an alignment (using pSW, Clustalw,
 457 Tcoffee or bl2seq) or when you input an alignment data file using AlignIO.
 458 However if you need to input a sequence alignment by hand (eg to build
 459 a SimpleAlign object), you will need to input the sequences as
 460 LocatableSeqs. Other sources of information include L<Bio::LocatableSeq>,
 461 L<Bio::SimpleAlign>, L<Bio::AlignIO>, and L<Bio::Tools::pSW>.
 462
 463 A LargeSeq object is a special type of Seq object used for
 464 handling very long ( eg E<gt> 100 MB) sequences.  If you need to
 465 manipulate such long sequences see section L<"III.7.2"> which describes
 466 LargeSeq objects, or L<Bio::Seq::LargeSeq>.
 467
 468 A LiveSeq object is another specialized object for storing
 469 sequence data. LiveSeq addresses the problem of features whose location
 470 on a sequence changes over time.  This can happen, for example, when
 471 sequence feature objects are used to store gene locations on newly
 472 sequenced genomes - locations which can change as higher quality
 473 sequencing data becomes available.  Although a LiveSeq object is not
 474 implemented in the same way as a Seq object, LiveSeq does implement
 475 the SeqI interface (see below).  Consequently, most methods available
 476 for Seq objects will work fine with LiveSeq objects. Section L<"III.7.2">
 477 and L<Bio::LiveSeq> contain further discussion of LiveSeq objects.
 478
 479 SeqI objects are Seq "interface objects" (see section L<"II.4"> and
 480 L<Bio::SeqI>). They are used to ensure bioperl's compatibility with
 481 other software packages. SeqI and other interface objects are not
 482 likely to be relevant to the casual bioperl user.
 483
 484 *** Having described these other types of sequence objects, the
 485     "bottom line" still is that if you store your sequence data in Seq
 486     objects (which is where they'll be if you read them in with
 487     SeqIO), you will usually do just fine. ***
 488
 489 =for html <A NAME ="ii.2"></A>
 490
 491 =head2 II.2 Alignment objects (SimpleAlign)
 492
 493 This module allows the user to convert between alignment formats
 494 as well as more sophisticated operations, like extracting specific regions
 495 of the alignment and generating consensus sequences. For more information
 496 see section L<"III.5.4"> and L<Bio::SimpleAlign>.
 497
 498
 499 =for html <A NAME ="ii.3"></A>
 500
 501 =head2 II.3 Location objects
 502
 503 A Location object is designed to be associated with a Sequence
 504 Feature object to indicate where on a larger structure (eg a chromosome or
 505 contig) the feature can be found.  The reason why this simple concept has
 506 evolved in a collection of rather complicated objects is that
 507
 508 1) Some objects have multiple locations or sub-locations (e.g. a gene's exons
 509 may have multiple start and stop locations)
 510 2) In unfinished genomes, the precise locations of features is not known with
 511 certainty.
 512
 513 Bioperl's various Location objects address these complications.  In addition
 514 there are "CoordinatePolicy" objects that allow the user to specify how to
 515 measure the "length" of a feature if its precise start and end coordinates are
 516 not know. In most cases, you will not need to worry about these complications
 517 if you are using bioperl to handle simple features with well-defined start
 518 and stop locations.  However, if you are using bioperl to annotate partially
 519 or unfinished genomes or to read annotations of such genomes with bioperl,
 520 understanding the various Location objects will be important.  See the
 521 documentation of the various modules in the Bio::Locations directory or
 522 L<Bio::Location::CoordinatePolicyI> for more information.
 523
 524 =for html <A NAME ="ii.4"></A>
 525
 526 =head2 II.4 Interface objects and implementation objects
 527
 528 Since release 0.6, bioperl has been moving to separate interface and
 529 implementation objects.  An interface is solely the definition of what
 530 methods one can call on an object, without any knowledge of how it is
 531 implemented. An implementation is an actual, working implementation of
 532 an object. In languages like Java, interface definition is part of the
 533 language. In Perl, you have to roll your own.
 534
 535 In bioperl, the interface objects usually have names like
 536 Bio::MyObjectI, with the trailing I indicating it is an interface
 537 object. The interface objects mainly provide documentation on what the
 538 interface is, and how to use it, without any implementations (though
 539 there are some exceptions).  Although interface objects are not of
 540 much direct utility to the casual bioperl user, being aware of their
 541 existence is useful since they are the basis to understanding how
 542 bioperl programs can communicate with other bioinformatics projects
 543 such as Ensembl and the Annotation Workbench (see section IV).
 544
 545 For more discussion of design and development issues please see the
 546 biodesign.pod file.
 547
 548 =head1 III. Using bioperl
 549
 550
 551 Bioperl provides software modules for many of the typical tasks of
 552 bioinformatics programming.  These include:
 553
 554 =over 7
 555
 556 =item * Accessing sequence data from local and remote databases
 557
 558 =item * Transforming formats of database/ file records
 559
 560 =item * Manipulating individual sequences
 561
 562 =item * Searching for "similar" sequences
 563
 564 =item * Creating and manipulating sequence alignments
 565
 566 =item * Searching for genes and other structures on genomic DNA
 567
 568 =item * Developing machine readable sequence annotations
 569
 570 =back
 571
 572 The following sections describe how bioperl can help perform all of
 573 these tasks.
 574
 575 =head2 III.1 Accessing sequence data from local and remote databases
 576
 577 Much of bioperl is focused on sequence manipulation.  However, before
 578 bioperl can manipulate sequences, it needs to have access to sequence
 579 data.  Now one can directly enter data sequence data into a bioperl
 580 Seq object, eg:
 581
 582   $seq = Bio::Seq->new('-seq'=>'actgtggcgtcaact',
 583                        '-desc'=>'Sample Bio::Seq object',
 584                        '-display_id' => 'something',
 585                        '-accession_number' => 'accnum',
 586                        '-alphabet' => 'dna' );
 587
 588 However, in most cases, it is preferable to access sequence data from
 589 some online data file or database (Note that in common with
 590 conventional bioinformatics usage we will call a "database" what might
 591 be more appropriately referred to as an "indexed flat file".)  Bioperl
 592 supports accessing remote databases as well as developing indices for
 593 setting up local databases.
 594
 595 =head2   III.1.1 Accessing remote databases (Bio::DB::GenBank, etc)
 596
 597 Accessing sequence data from the principal molecular biology databases
 598 is straightforward in bioperl.  Data can be accessed by means of the
 599 sequence's accession number or id.  Batch mode access is also
 600 supported to facilitate the efficient retrieval of multiple sequences.
 601 For retrieving data from genbank, for example, the code could be as
 602 follows:
 603
 604   $gb = new Bio::DB::GenBank();
 605   # this returns a Seq object :
 606   $seq1 = $gb->get_Seq_by_id('MUSIGHBA1');
 607   # this returns a Seq object :
 608   $seq2 = $gb->get_Seq_by_acc('AF303112'))
 609   # this returns a SeqIO object :
 610   $seqio = $gb->get_Stream_by_batch("J00522","AF303112","2981014");
 611
 612 Bioperl currently supports sequence data retrieval from the genbank,
 613 genpept, RefSeq, swissprot, and EMBL databases. See L<Bio::DB::GenBank>,
 614 L<Bio::DB::GenPept>, L<Bio::DB::SwissProt>, L<Bio::DB::RefSeq> and
 615 L<Bio::DB::EMBL> for more information. A user can also specify a different
 616 database mirror for a database - this is especially relevent for the SwissProt
 617 resource where there are many ExPaSy mirrors.  There are also configuration
 618 options for specifying local proxy servers for those behind firewalls.
 619
 620 The retrieval of NCBI RefSeqs sequences is supported through a special
 621 module called Bio::DB::RefSeq which actually queries an EBI server.
 622 Please see L<Bio::DB::RefSeq> before using it as there are some caveats
 623 with RefSeq retrieval. RefSeq ids in Genbank begin with "NT_", "NC_",
 624 "NG_", "NM_", "NP_", "XM_", "XR_", or "XP_" (for more information see
 625 http://www.ncbi.nlm.nih.gov/LocusLink/refseq.html). Bio::DB::GenBank
 626 can be used to retrieve entries corresponding to these ids but bear in
 627 mind that these are not Genbank entries, strictly speaking. See
 628 L<Bio::DB::GenBank> for special details on retrieving entries beginning
 629 with "NT_", these are specially formatted "CONTIG" entries.
 630
 631 Bioperl also supports retrieval from a remote Ace database. This
 632 capability requires the presence of the external AcePerl module. You
 633 need to download and install the aceperl module from
 634 http://stein.cshl.org/AcePerl/.
 635
 636 An additional module is available for accessing remote databases, BioFetch,
 637 which queries the dbfetch script at EBI. The available databases are EMBL,
 638 GenBank, or SWALL, and the entries can be retrieved in different formats
 639 as objects or streams (SeqIO objects), or as "tempfiles". See
 640 L<Bio::DB::BioFetch> for the details.
 641
 642
 643 =head2  III.1.2 Indexing and accessing local databases (Bio::Index::*, bpindex.pl, bpfetch.pl, Bio::DB::*)
 644
 645 Alternately, bioperl permits indexing local sequence data files by
 646 means of the Bio::Index or Bio::DB::Fasta objects.  The following sequence
 647 data formats are supported by Bio::Index: genbank, swissprot, pfam, embl and
 648 fasta.  Once the set of sequences have been indexed using Bio::Index,
 649 individual sequences can be accessed using syntax very similar to that
 650 described above for accessing remote databases.  For example, if one wants to
 651 set up an indexed flat-file database of fasta files, and later wants then to
 652 retrieve one file, one could write scripts like:
 653
 654   # script 1: create the index
 655   use Bio::Index::Fasta; # using fasta file format
 656   use strict; # some users have reported that this is required
 657
 658   my $Index_File_Name = shift;
 659   my $inx = Bio::Index::Fasta->new(
 660       -filename => $Index_File_Name,
 661       -write_flag => 1);
 662   $inx->make_index(@ARGV);
 663
 664   # script 2: retrieve some files
 665   use Bio::Index::Fasta;
 666   use strict; # some users have reported that this is required
 667
 668   my $Index_File_Name = shift;
 669   my $inx = Bio::Index::Fasta->new($Index_File_Name);
 670   foreach  my $id (@ARGV) {
 671       my $seq = $inx->fetch($id);  # Returns Bio::Seq object
 672       # do something with the sequence
 673   }
 674
 675 To facilitate the creation and use of more complex or flexible
 676 indexing systems, the bioperl distribution includes two sample scripts
 677 in the scripts/ directory, bpindex.pl and bpfetch.pl.  These scripts
 678 can be used as templates to develop customized local data-file indexing
 679 systems.
 680
 681 Bioperl also supplies Bio::DB::Fasta as a means to index and query Fasta
 682 format files. It's similar in spirit to Bio::Index::Fasta but offers more
 683 methods, eg
 684
 685   use Bio::DB::Fasta;
 686   use strict;
 687
 688   my $db = Bio::DB::Fasta->new($file);  # one file or many files
 689   my $seqstring = $db->seq($id);        # get a sequence as string
 690   my $seqobj = $db->get_Seq_by_id($id); # get a PrimarySeq obj
 691   my $desc = $db->header($id);          # get the header, or description, line
 692
 693 This module also offers the user the ability to designate a specific string
 694 within the fasta header as the desired id, such as the gi number within the
 695 string "gi|4556644|gb|X45555" (use the -makeid option for this capability).
 696 See L<Bio::DB::Fasta> for more information on this fully-featured module.
 697
 698
 699 =head2 III.2 Transforming formats of database/ file records
 700
 701 =cut
 702
 703 =for html <A NAME ="iii.2.1"></A>
 704
 705 =head2   III.2.1 Transforming sequence files (SeqIO)
 706
 707 A common - and tedious - bioinformatics task is that of converting
 708 sequence data among the many widely used data formats.  Bioperl's
 709 SeqIO object, however, makes this chore a breeze.  SeqIO can
 710 read a stream of sequences - located in a single or in multiple files -
 711 in a number of formats: Fasta, EMBL, GenBank, Swissprot, PIR, GCG, SCF,
 712 phd/phred, Ace, fastq, or raw (plain sequence). Once the sequence data
 713 has been read in with SeqIO, it is available to bioperl in the form of
 714 Seq objects.  Moreover, the Seq objects can then be written to another
 715 file (again using SeqIO) in any of the supported data formats making
 716 data converters simple to implement, for example:
 717
 718   use Bio::SeqIO;
 719   $in  = Bio::SeqIO->new('-file' => "inputfilename",
 720                          '-format' => 'Fasta');
 721   $out = Bio::SeqIO->new('-file' => ">outputfilename",
 722                          '-format' => 'EMBL');
 723   while ( my $seq = $in->next_seq() ) {$out->write_seq($seq); }
 724
 725 In addition, perl "tied filehandle" syntax is available to SeqIO,
 726 allowing you to use the standard E<lt>E<gt> and print operations to read
 727 and write sequence objects, eg:
 728
 729   $in  = Bio::SeqIO->newFh('-file' => "inputfilename" ,
 730                            '-format' => 'Fasta');
 731   $out = Bio::SeqIO->newFh('-format' => 'EMBL');
 732   print $out $_ while <$in>;
 733
 734 If the "-format" argument isn't used then Bioperl will guess the format
 735 based on the file's suffix in a case-insensitive manner. Here are the
 736 current interpretations:
 737
 738    Format   Suffixes
 739
 740    fasta    fasta|fast|seq|fa|fsa|nt|aa
 741    genbank  gb|gbank|genbank
 742    scf      scf
 743    pir      pir
 744    embl     embl|ebl|emb|dat
 745    raw      txt
 746    gcg      gcg
 747    ace      ace
 748    bsml     bsm|bsml
 749    swiss    swiss|sp
 750    phd      phd|phred
 751    fastq    fastq
 752
 753 For more information see L<Bio::SeqIO>.
 754
 755 =for html <A NAME ="iii.2.2"></A>
 756
 757 =head2 III.2.2 Transforming alignment files (AlignIO)
 758
 759 Data files storing multiple sequence alignments also appear in varied
 760 formats.  AlignIO is the bioperl object for data conversion of
 761 alignment files. AlignIO is patterned on the SeqIO object and shares
 762 most of SeqIO's features.  AlignIO currently supports input in the
 763 following formats:
 764
 765    fasta
 766    mase (Seaview)
 767    stockholm
 768    prodom
 769    selex (HMMER))
 770    bl2seq
 771    clustalw (.aln)
 772    msf (GCG)
 773    water (needle and water from EMBOSS, see L<"III.3.6">)
 774    phylip (interleaved)
 775    stockholm
 776    nexus
 777    mega
 778    meme
 779    psi (PSI-BLAST)
 780
 781 AlignIO supports output in these formats: fasta, mase, selex, clustalw, msf/gcg,
 782 and phylip (interleaved).  One significant difference between AlignIO and SeqIO
 783 is that AlignIO handles IO for only a single alignment at a time but
 784 SeqIO.pm handles IO for multiple sequences in a single stream. Syntax
 785 for AlignIO is almost identical to that of SeqIO:
 786
 787   use Bio::AlignIO;
 788   $in  = Bio::AlignIO->new('-file' => "inputfilename" ,
 789                            '-format' => 'fasta');
 790   $out = Bio::AlignIO->new('-file' => ">outputfilename",
 791                            '-format' => 'pfam');
 792   while ( my $aln = $in->next_aln() ) { $out->write_aln($aln); }
 793
 794 The only difference is that here, the returned object reference, $aln,
 795 is to a SimpleAlign object rather than a Seq object.
 796
 797 AlignIO also supports the tied filehandle syntax described above for
 798 SeqIO.  Note that currently AlignIO is usable only with SimpleAlign
 799 alignment objects. See L<Bio::AlignIO> and section L<"III.5.4"> for more
 800 information.
 801
 802 =head2 III.3 Manipulating sequences
 803
 804 Bioperl contains many modules with functions for sequence analysis. And
 805 if you cannot find the function you want in bioperl you may be able to
 806 find it in EMBOSS, which is accessible through bioperl (see L<"III.3.6">).
 807
 808 =cut
 809
 810 =for html <A NAME ="iii.3.1"></A>
 811
 812 =head2 III.3.1  Manipulating sequence data with Seq methods
 813
 814 OK, so we know how to retrieve sequences and access them as Seq
 815 objects.  Let's see how we can use the Seq objects to manipulate our
 816 sequence data and retrieve information.  Seq provides multiple
 817 methods for performing many common (and some not-so-common) tasks of
 818 sequence manipulation and data retrieval.  Here are some of the most
 819 useful:
 820
 821 The following methods return strings
 822
 823   $seqobj->display_id(); # the human read-able id of the sequence
 824   $seqobj->seq();        # string of sequence
 825   $seqobj->subseq(5,10); # part of the sequence as a string
 826   $seqobj->accession_number(); # when there, the accession number
 827   $seqobj->alphabet();   # one of 'dna','rna','protein'
 828   $seqobj->primary_id(); # a unique id for this sequence irregardless
 829                          # of its display_id or accession number
 830   $seqobj->desc()        # a description of the sequence
 831
 832 It is worth mentioning that some of these values correspond to specific
 833 fields of given formats. For example, the display_id method returns
 834 the LOCUS name of a Genbank entry, the (\S+) following the E<gt> character
 835 in a Fasta file, the ID from a SwissProt file, and so on. The desc()
 836 method will return the DEFINITION line of a Genbank file, the line
 837 following the display_id in a Fasta file, and the DE field in a SwissProt
 838 file.
 839
 840 The following methods return an array of Bio::SeqFeature objects
 841
 842    $seqobj->top_SeqFeatures # The 'top level' sequence features
 843    $seqobj->all_SeqFeatures # All sequence features, including sub
 844                             # seq features
 845
 846 Sequence features will be discussed further in section L<"III.7"> on
 847 machine-readable sequence annotation. A general description of the
 848 object can be found in L<Bio::SeqFeature::Generic>, and a description
 849 of related, top-level "annotation" is found in L<Bio::Annotation>.
 850
 851 The following methods returns new sequence objects, but do not transfer
 852 features across:
 853
 854   $seqobj->trunc(5,10)  # truncation from 5 to 10 as new object
 855   $seqobj->revcom       # reverse complements sequence
 856   $seqobj->translate    # translation of the sequence
 857
 858 Note that some methods return strings, some return arrays and some
 859 return references to objects.  See L<Bio::Seq> for more information.
 860
 861 Many of these methods are self-explanatory. However, bioperl's flexible
 862 translation methods warrant further comment. Translation in bioinformatics
 863 can mean two slightly different things:
 864
 865 =over 2
 866
 867 =item 1 Translating a nucleotide sequence from start to end.
 868
 869 =item 2 Taking into account the constraints of real coding regions in mRNAs.
 870
 871 =back
 872
 873 For historical reasons the bioperl implementation of translation does
 874 the first of these tasks easily. Any sequence object which is not of type
 875 'protein' can be translated by simply calling the method which returns
 876 a protein sequence object:
 877
 878   $translation1 = $my_seq_object->translate;
 879
 880 However, the translate method can also be passed several optional
 881 parameters to modify its behavior. For example, the first two
 882 arguments to "translate" can be used to modify the characters used to
 883 represent stop (default '*') and unknown amino acid ('X'). (These are
 884 normally best left untouched.)  The third argument determines the
 885 frame of the translation. The default frame is "0".  To get
 886 translations in the other two forward frames, we would write:
 887
 888   $translation2 = $my_seq_object->translate(undef,undef,1);
 889   $translation3 = $my_seq_object->translate(undef,undef,2);
 890
 891 The fourth argument to "translate" makes it possible to use
 892 alternative genetic codes. There are currently 16 codon tables
 893 defined, including tables for 'Verterbate Mitochondrial', 'Bacterial',
 894 'Alternative Yeast Nuclear' and 'Ciliate, Dasycladacean and Hexamita
 895 Nuclear' translation. These tables are located in the object
 896 Bio::Tools::CodonTable which is used by the translate method. For
 897 example, for mitochondrial translation:
 898
 899   $human_mitochondrial_translation =
 900       $my_seq_object->translate(undef,undef,undef, 2);
 901
 902 If we want to translate full coding regions (CDS) the way major
 903 nucleotide databanks EMBL, GenBank and DDBJ do it, the translate
 904 method has to perform more tricks. Specifically, 'translate' needs to
 905 confirm that the sequence has appropriate start and terminator codons
 906 at the beginning and the end of the sequence and that there are no
 907 terminator codons present within the sequence.  In addition, if the
 908 genetic code being used has an atypical (non-ATG) start codon, the
 909 translate method needs to convert the initial amino acid to
 910 methionine.  These checks and conversions are triggered by setting the
 911 fifth argument of the translate method to evaluate to "true".
 912
 913 If argument 5 is set to true and the criteria for a proper CDS are
 914 not met, the method, by default, issues a warning. By setting the
 915 sixth argument to evaluate to "true", one can instead instruct
 916 the program to die if an improper CDS is found, e.g.
 917
 918   $protein_object =
 919       $cds->translate(undef,undef,undef,undef,1,'die_if_errors');
 920
 921 See L<Bio::Tools::CodonTable> for related details.
 922
 923 =head2 III.3.2 Obtaining basic sequence statistics- MW, residue &codon
 924 frequencies(SeqStats, SeqWord)
 925
 926 In addition to the methods directly available in the Seq object,
 927 bioperl provides various "helper" objects to determine additional
 928 information about a sequence.  For example, SeqStats
 929 object provides methods for obtaining the molecular weight of the
 930 sequence as well the number of occurrences of each of the component
 931 residues (bases for a nucleic acid or amino acids for a protein.)
 932 For nucleic acids, SeqStats also returns counts of the number of codons
 933 used.  For example:
 934
 935   use SeqStats;
 936   $seq_stats  =  Bio::Tools::SeqStats->new($seqobj);
 937   $weight = $seq_stats->get_mol_wt();
 938   $monomer_ref = $seq_stats->count_monomers();
 939   $codon_ref = $seq_stats->count_codons();  # for nucleic acid sequence
 940
 941 Note: sometimes sequences will contain "ambiguous" codes.  For this
 942 reason, get_mol_wt() returns (a reference to) a two element array
 943 containing a greatest lower bound and a least upper bound of the
 944 molecular weight.
 945
 946 The SeqWords object is similar to SeqStats and provides
 947 methods for calculating frequencies of "words" (eg tetramers or hexamers)
 948 within the sequence. See L<Bio::Tools::SeqStats> and L<Bio::Tools::SeqWords>
 949 for more information.
 950
 951 =head2 III.3.3 Identifying restriction enzyme sites (RestrictionEnzyme)
 952
 953 Another common sequence manipulation task for nucleic acid sequences
 954 is locating restriction enzyme cutting sites.  Bioperl provides the
 955 RestrictionEnzyme object for this purpose. Bioperl's
 956 standard RestrictionEnzyme object comes with data for more than 150
 957 different restriction enzymes. A list of the available enzymes can be
 958 accessed using the available_list() method. For example to select all
 959 available enzymes that with cutting patterns that are six bases long one
 960 would write:
 961
 962   $re  = new Bio::Tools::RestrictionEnzyme('-name'=>'EcoRI');
 963   @sixcutters = $re->available_list(6);
 964
 965 Once an appropriate enzyme has been selected, the sites for that
 966 enzyme on a given nucleic acid sequence can be obtained using the
 967 cut_seq() method.  The syntax for performing this task is:
 968
 969   $re1 = new Bio::Tools::RestrictionEnzyme(-name=>'EcoRI');
 970   # $seqobj is the Seq object for the dna to be cut
 971   @fragments =  $re1->cut_seq($seqobj);
 972
 973 Adding an enzyme not in the default list is easily accomplished:
 974
 975   $re2 = new Bio::Tools::RestrictionEnzyme('-NAME' =>'EcoRV--GAT^ATC',
 976                                            '-MAKE' =>'custom');
 977
 978 Once the custom enzyme object has been created, cut_seq() can be
 979 called in the usual manner. See L<Bio::Tools::RestrictionEnzyme> for
 980 details.
 981
 982
 983 =head2    III.3.4 Identifying amino acid cleavage sites (Sigcleave)
 984
 985 For amino acid sequences we may be interested to know whether the
 986 amino acid sequence contains a cleavable signal sequence for
 987 directing the transport of the protein within the cell.  SigCleave is
 988 a program (originally part of the EGCG molecular biology package) to
 989 predict signal sequences, and to identify the cleavage site based on
 990 the von Heijne algorithm.
 991
 992 The "threshold" setting controls the score reporting.  If no value for
 993 threshold is passed in by the user, the code defaults to a reporting
 994 value of 3.5.  SigCleave will only return score/position
 995 pairs which meet the threshold limit.
 996
 997 There are 2 accessor methods for this object. "signals" will return a
 998 perl hash containing the sigcleave scores keyed by amino acid
 999 position. "pretty_print" returns a formatted string similar to the
1000 output of the original sigcleave utility.
1001
1002 The syntax for using Sigcleave is as follows:
1003
1004   # doesn't currently accept a Seq object
1005   $seq = "AALLHHHHHHGGGGPPRTTTTTVVVVVVVVVVVVVVV";
1006
1007   use Bio::Tools::Sigcleave;
1008   $sigcleave_object = new Bio::Tools::Sigcleave
1009       ( -seq       => 'sigtest.aa',
1010         -threshold => 3.5,
1011         -desc      => 'test sigcleave protein seq',
1012         -type      => 'AMINO'
1013       );
1014   %raw_results      = $sigcleave_object->signals;
1015   $formatted_output = $sigcleave_object->pretty_print;
1016
1017 Note that Sigcleave is passed a raw sequence rather than a sequence
1018 object. Also note that the "type" in the Sigcleave object is "amino"
1019 whereas in a Seq object it would be called "protein". Please see
1020 L<Bio::Tools::Sigcleave> for details.
1021
1022
1023 =head2 III.3.5 Miscellaneous sequence utilities: OddCodes, SeqPattern
1024
1025 OddCodes:
1026
1027 For some purposes it's useful to have a listing of an amino acid
1028 sequence showing where the hydrophobic amino acids are located or
1029 where the positively charged ones are.  Bioperl provides this
1030 capability via the module Bio::Tools::OddCodes.
1031
1032 For example, to quickly see where the charged amino acids are located
1033 along the sequence we perform:
1034
1035   use Bio::Tools::OddCodes;
1036   $oddcode_obj = Bio::Tools::OddCodes->new($amino_obj);
1037   $output = $oddcode_obj->charge();
1038
1039 The sequence will be transformed into a three-letter sequence (A,C,N)
1040 for negative (acidic), positive (basic), and neutral amino acids.  For
1041 example the ACDEFGH would become NNAANNC.
1042
1043 For a more complete chemical description of the sequence one can call
1044 the chemical() method which turns sequence into one with an 8-letter
1045 chemical alphabet { A (acidic), L (aliphatic), M (amide), R
1046 (aromatic), C (basic), H (hydroxyl), I (imino), S (sulfur) }:
1047
1048   $output = $oddcode_obj->chemical();
1049
1050 In this case the sample sequence ACDEFGH would become LSAARAC.
1051
1052 OddCodes also offers translation into alphabets showing alternate
1053 characteristics of the amino acid sequence such as hydrophobicity,
1054 "functionality" or grouping using Dayhoff's definitions.  See the
1055 documentation in L<Bio::Tools::OddCodes> for further details.
1056
1057 SeqPattern:
1058
1059 The SeqPattern object is used to manipulate sequences
1060 that include perl "regular expressions".  A key motivation for
1061 SeqPattern is to have a way of generating a reverse complement of a
1062 nucleic acid sequence pattern that includes ambiguous bases and/or
1063 regular expressions.  This capability leads to significant performance
1064 gains when pattern matching on both the sense and anti-sense strands
1065 of a query sequence are required. Typical syntax for using SeqPattern
1066 is shown below.  For more information, there are several interesting
1067 examples in the script seq_pattern.pl in the scripts/tools directory.
1068
1069   Use Bio::Tools::SeqPattern;
1070   $pattern     = '(CCCCT)N{1,200}(agggg)N{1,200}(agggg)';
1071   $pattern_obj = new Bio::Tools::SeqPattern('-SEQ'  => $pattern,
1072                                             '-TYPE' => 'dna');
1073   $pattern_obj2  = $pattern_obj->revcom();
1074   $pattern_obj->revcom(1); # returns expanded rev complement pattern.
1075
1076 More detail can be found in L<Bio::Tools::SeqPattern>.
1077
1078
1079 =for html <A NAME ="iii.3.6"></A>
1080
1081 =head2  III.3.6 Sequence manipulation using the Bioperl EMBOSS interface (Tools::Run::EMBOSSApplication)
1082
1083 EMBOSS (European Molecular Biology Open Source Software) is an extensive
1084 collection of sequence analysis programs written in the C
1085 programming language, from http://www.uk.embnet.org/Software/EMBOSS.
1086 There are a number of algorithms in EMBOSS that are not found in "Bioperl
1087 proper" (eg. calculating DNA melting temperature, finding repeats,
1088 identifying prospective antigenic sites) so if you if you cannot find
1089 the function you want in bioperl you might be able to find it in EMBOSS.
1090
1091 EMBOSS programs are usually called from the command line but bioperl
1092 provides a Perl "wrapper" for EMBOSS function calls so that they can be
1093 executed from within a Perl script.  Of course, the EMBOSS package must
1094 be installed for the Bioperl wrapper to function.
1095
1096 In the future, it is planned that Bioperl EMBOSS objects will return
1097 appropriate Bioperl objects to the calling script in addition to
1098 generating standard EMBOSS reports.  This functionality is
1099 being initially implemented with the EMBOSS sequence alignment
1100 programs, so that they will return SimpleAlign objects in a manner
1101 similar to the way the Bioperl modules TCoffee.pm and Clustalw.pm
1102 work (see section L<"III.5.4"> for a discussion of SimpleAlign).
1103
1104 An example of the Bioperl EMBOSS wrapper where a file is returned
1105 would be:
1106
1107   $factory = new Bio::Factory::EMBOSS;
1108   $compseqapp = $factory->program('compseq');
1109   %input = ( -word     => 4,
1110              -sequence => $seqObj,
1111              -outfile  => $compseqoutfile );
1112   $compseqapp->run(\%input);
1113   $seqio = Bio::SeqIO->new( -file => $compseqoutfile ); # etc...
1114
1115 Note that a Seq object was used as input. The EMBOSS object can also
1116 accept a file name as input, eg
1117
1118   -sequence => "inputfasta.fa"
1119
1120 Some EMBOSS programs will return strings, others will create files that
1121 can be read directly using Bio::SeqIO (section L<"III.2.1">), as in the
1122 example above. It's worth mentioning that the AlignIO module can use files
1123 from EMBOSS's water and needle as input (see L<"III.2.2">) to create AlignIO
1124 objects.
1125
1126
1127 =head2 III.3.7 Sequence manipulation without creating Bioperl "objects" (Perl.pm)
1128
1129 Using the Bio::Perl.pm module, it is possible to manipulate sequence
1130 data in Bioperl without explicitly creating Seq or SeqIO objects.
1131 This feature may ease the Bioperl learning curve for new users
1132 unfamiliar or uncomfortable with using Perl objects.  However, only
1133 limited data manipulation are supported in this mode.  In addition,
1134 each method (i.e. function) that will be used by the script must be
1135 explicitly declared in the initial "use directive".  For example a
1136 simple data format conversion and sequence manipulation could be
1137 performed as follows - note that no "new" methods are called and that
1138 no Seq or SeqIO objects are created:
1139
1140   use Bio::Perl qw( get_sequence );
1141   # get a sequence from a database (assummes internet connection)
1142   $seq_object = get_sequence('swissprot',"ROA1_HUMAN");
1143   # $seq_object is Bio::Seq object, so the following methods work
1144   $seq_id  = $seq_object->display_id;
1145   $seq_as_string = $seq_object->seq();
1146
1147 For more details see L<Bio::Perl>.
1148
1149 =head2 III.4 Searching for "similar" sequences
1150
1151 One of the basic tasks in molecular biology is identifying sequences
1152 that are, in some way, similar to a sequence of interest.  The Blast
1153 programs, originally developed at the NCBI, are widely used for
1154 identifying such sequences.  Bioperl offers a number of modules to
1155 facilitate running Blast as well as to parse the often voluminous
1156 reports produced by Blast.
1157
1158 =head2   III.4.1 Running BLAST locally (StandAloneBlast)
1159
1160 There are several reasons why one might want to run the Blast programs
1161 locally - speed, data security, immunity to network problems, being
1162 able to run large batch runs etc.  The NCBI provides a downloadable
1163 version of blast in a stand-alone format, and running blast locally
1164 without any use of perl or bioperl is completely straightforward.
1165 However, there are situations where having a perl interface for running
1166 the blast programs locally is convenient.
1167
1168 The module Bio::Tools::Run::StandAloneBlast offers the ability to
1169 wrap local calls to blast from within perl.  All of the currently
1170 available options of NCBI Blast (eg PSIBLAST, PHIBLAST, bl2seq) are
1171 available from within the bioperl StandAloneBlast interface.  Of course,
1172 to use StandAloneBlast, one needs to have installed locally ncbi-blast
1173 as well as one or more blast-readable databases.
1174
1175 Basic usage of the StandAloneBlast.pm module is simple.  Initially, a
1176 local blast "factory object" is created.
1177
1178   @params = ('program'  => 'blastn',
1179              'database' => 'ecoli.nt');
1180   $factory = Bio::Tools::Run::StandAloneBlast->new(@params);
1181
1182 Any parameters not explicitly set will remain as the BLAST defaults.
1183 Once the factory has been created and the appropriate parameters set,
1184 one can call one of the supported blast executables.  The input
1185 sequence(s) to these executables may be fasta file(s), a Seq
1186 object or an array of Seq objects, eg
1187
1188   $input = Bio::Seq->new(-id  =>"test query",
1189                          -seq =>"ACTAAGTGGGGG");
1190   $blast_report = $factory->blastall($input);
1191
1192 The returned blast report will be in the form of a bioperl
1193 parsed-blast object.  The report object may be either a BPlite,
1194 BPpsilite, BPbl2seq or Blast object depending on the type of blast
1195 search.  The "raw" blast report is also available.
1196
1197 The syntax for running PHIBLAST, PSIBLAST and bl2seq searches via
1198 StandAloneBlast is also straightforward.  See
1199 L<Bio::Tools::Run::StandAloneBlast> documentation for details. In
1200 addition, the script standaloneblast.pl in the scripts/tools directory
1201 contains descriptions of various possible applications of the
1202 StandAloneBlast object. This script shows how the blast report object
1203 can access a blast parser directly, eg
1204
1205   while (my $sbjct = $blast_report->nextSbjct){
1206      while (my $hsp = $sbjct->nextHSP){
1207         print $hsp->score . " " . $hsp->subject->seqname . "\n";
1208      }
1209   }
1210
1211 See the section L<"III.4.4"> on parsing BLAST reports with Bio::Tools::BPlite,
1212 below, or L<Bio::Tools::BPlite> for details.
1213
1214 =head2   III.4.2 Running BLAST remotely (using RemoteBlast.pm)
1215
1216 Bioperl supports remote execution of blasts at NCBI by means of the
1217 RemoteBlast object.
1218
1219 A skeleton script to run a remote blast might look as follows:
1220
1221   $remote_blast = Bio::Tools::Run::RemoteBlast->new(
1222            '-prog' => 'blastp','-data' => 'ecoli','-expect' => '1e-10' );
1223   $r = $remote_blast->submit_blast("t/data/ecolitst.fa");
1224   while (@rids = $remote_blast->each_rid ) {
1225       foreach $rid ( @rids ) {$rc = $remote_blast->retrieve_blast($rid);}}
1226
1227 You may want to change some parameter of the remote job and this example
1228 shows how to change the matrix:
1229
1230 $Bio::Tools::Run::RemoteBlast::HEADER{'MATRIX_NAME'} = 'BLOSUM25';
1231
1232 For a description of the many CGI parameters see:
1233
1234 http://www.ncbi.nlm.nih.gov/BLAST/Doc/urlapi.html
1235
1236 Note that the script has to be broken into two parts. The actual Blast
1237 submission and the subsequent retrieval of the results. At times when the
1238 NCBI Blast is being heavily used, the interval between when a Blast
1239 submission is made and when the results are available can be substantial.
1240
1241 The object $rc would contain the blast report that could then be parsed with
1242 Bio::Tools::BPlite or Bio::SearchIO. The default object is BPlite
1243 in version 1.0 or earlier and it's SearchIO after version 1.0. The
1244 object type can be changed using the -readmethod parameter but bear
1245 in mind that the favored Blast parser is Bio::SearchIO, others won't be
1246 supported in later versions.
1247
1248 Note that to make this script actually useful, one should add details
1249 such as checking return codes from the Blast to see if it succeeded and
1250 a "sleep" loop to wait between consecutive requests to the NCBI server.
1251 See example 26 in the demonstration script in the appendix to see some
1252 working code you could use, or L<Bio::Tools::Run::RemoteBlast> for
1253 details.
1254
1255 It should also be noted that the syntax for creating a remote blast factory
1256 is slightly different from that used in creating StandAloneBlast, Clustalw,
1257 and T-Coffee factories.  Specifically RemoteBlast requires parameters to
1258 be passed with a leading hyphen, as in '-prog' =E<gt> 'blastp', while the
1259 other programs do not pass parameters with a leading hyphen.
1260
1261
1262 =for html <A NAME ="iii.4.3"></A>
1263
1264 =head2    III.4.3 Parsing BLAST and FASTA reports with Search and SearchIO
1265
1266 No matter how Blast searches are run (locally or remotely, with or
1267 without a perl interface), they return large quantities of data that
1268 are tedious to sift through.  Bioperl offers several different objects
1269 - Search.pm/SearchIO.pm, and BPlite.pm (along with its minor
1270 modifications, BPpsilite and BPbl2seq) for parsing Blast reports.  Search
1271 and SearchIO which are new in Bioperl 1.0 and are now the principal Bioperl
1272 interfaces for Blast (and FASTA) report parsing are described in this section.
1273 The older BPlite is described in section L<"III.4.4">. We recommend you use
1274 SearchIO, it's certain to be supported in future releases.
1275
1276 The Search and SearchIO modules provide a uniform interface for
1277 parsing sequence-similarity-search reports generated by BLAST (in
1278 standard and BLAST XML formats), PSI-BLAST, RPS-BLAST and FASTA. In the future,
1279 it is envisioned that the Search/SearchIO syntax will be extended to
1280 provide a uniform interface to a wider range of report parsers
1281 including parsers for Genscan.
1282
1283 Parsing sequence-similarity reports with Search and SearchIO is
1284 straightforward.  Initially a SearchIO object specifies a file
1285 containing the report(s). The method next_result reads the next report
1286 into a Search object in just the same way that the next_seq method of
1287 SeqIO reads in the next sequence in a file into a Seq object.
1288
1289 Once a report (i.e. a Search object) has been read in and is available
1290 to the script, the report's overall attributes (e.g. the query) can be
1291 determined and its individual "hits" can be accessed with the
1292 next_hit method.  Individual high-scoring-pairs for each hit
1293 can then be accessed with the next_hsp method. Except for
1294 the additional syntax required to enable the reading of multiple
1295 reports in a single file, the remainder of the Search/SearchIO parsing
1296 syntax is very similar to that of the BPlite object it is intended to replace.
1297 Sample code to read a BLAST report might look like this:
1298
1299   # Get the report
1300   $searchio = new Bio::SearchIO (-format => 'blast',
1301                                  -file   => $blast_report);
1302   $result = $searchio->next_result;
1303
1304   # Get info about the entire report
1305   $result->database_name;
1306   $algorithm_type =  $result->algorithm;
1307
1308   # get info about the first hit
1309   $hit = $result->next_hit;
1310   $hit_name = $hit->name ;
1311
1312   # get info about the first hsp of the first hit
1313   $hsp = $hit->next_hsp;
1314   $hsp_start = $hsp->query->start;
1315
1316 For more details on parsing with Search/SearchIO see the Search and SearchIO
1317 documentation: L<Bio::SearchIO::blast>, L<Bio::SearchIO::psiblast>,
1318 L<Bio::SearchIO::blastxml>, L<Bio::SearchIO::fasta>, and L<Bio::SearchIO>.
1319 There is also sample code in the Bio/scripts/searchio directory which
1320 illustrates how to use SearchIO.
1321
1322 =for html <A NAME ="iii.4.4"></A>
1323
1324 =head2 III.4.4 Parsing BLAST reports with BPlite, BPpsilite, and BPbl2seq
1325
1326 Bioperl's older BLAST report parsers - BPlite, BPpsilite, BPbl2seq and
1327 Blast.pm - are expected to be phased out over a period of time. Since a
1328 considerable amount of legacy Bioperl scripts has been written which
1329 heavily use these objects, they are likely to remain within Bioperl
1330 for some time.
1331
1332 Much of the user interface of BPlite is very similar to that of Search.
1333 However accessing the next hit or HSP uses methods called next_Sbjct and
1334 next_HSP, respectively - in contrast to Search's next_hit and next_hsp.
1335
1336
1337 BPlite
1338
1339 The syntax for using BPlite is as follows where the method
1340 for retrieving hits is now called "nextSbjct" (for "subject"), while the
1341 method for retrieving high-scoring-pairs is called "nextHSP":
1342
1343   use Bio::Tools::BPlite;
1344   $report = new Bio::Tools::BPlite(-fh=>\*STDIN);
1345   $report->query;
1346   while(my $sbjct = $report->nextSbjct) {
1347        $sbjct->name;
1348        while (my $hsp = $sbjct->nextHSP) { $hsp->score; }
1349   }
1350
1351 A complete description of the module can be found in L<Bio::Tools::BPlite>.
1352
1353 BPpsilite
1354
1355 BPpsilite and BPbl2seq are objects for parsing (multiple iteration)
1356 PSIBLAST reports and Blast bl2seq reports, respectively.  They are
1357 both minor variations on the BPlite object. See L<Bio::Tools::BPbl2seq>
1358 and L<Bio::Tools::BPpsilite> for details.
1359
1360 The syntax for parsing a multiple iteration PSIBLAST report is as
1361 shown below.  The only significant additions to BPlite are methods to
1362 determine the number of iterated blasts and to access the results from
1363 each iteration.  The results from each iteration are parsed in the
1364 same manner as a (complete) BPlite object.
1365
1366   use Bio::Tools::BPpsilite;
1367   $report = new Bio::Tools::BPpsilite(-fh=>\*STDIN);
1368   $total_iterations = $report->number_of_iterations;
1369   $last_iteration = $report->round($total_iterations)
1370   while(my $sbjct =  $last_iteration ->nextSbjct) {
1371        $sbjct->name;
1372        while (my $hsp = $sbjct->nextHSP) {$hsp->score; }
1373   }
1374
1375 See L<Bio::Tools::BPpsilite> for details.
1376
1377 BPbl2seq
1378
1379 BLAST bl2seq is a program for comparing and aligning two sequences
1380 using BLAST.  Although the report format is similar to that of a
1381 conventional BLAST, there are a few differences.  Consequently, the
1382 standard bioperl parser BPlite ia unable to read bl2seq
1383 reports directly. From the user's perspective, one difference
1384 between bl2seq and other blast reports is that the bl2seq report does
1385 not print out the name of the first of the two aligned sequences.
1386 Consequently, BPbl2seq has no way of identifying the name of one of
1387 the initial sequence unless it is explicitly passed to constructor as
1388 a second argument as in:
1389
1390   use Bio::Tools::BPbl2seq;
1391   $report = Bio::Tools::BPbl2seq->new(-file => "t/data/dblseq.out",
1392                                       -queryname => "ALEU_HORVU");
1393   $hsp = $report->next_feature;
1394   $answer=$hsp->score;
1395
1396 In addition, since there will only be (at most) one "subject" (hit) in a
1397 bl2seq report one should use the method $report-E<gt>next_feature,
1398 rather than $report-E<gt>nextSbjct-E<gt>nextHSP to obtain the next high
1399 scoring pair. See L<Bio::Tools::BPbl2seq> for more details.
1400
1401 Blast.pm
1402
1403 The Bio::Tools::Blast parser has been removed from Bioperl as of version
1404 1.1. Consequently, the BPlite parser (described in the
1405 section L<"III.4.4">) or the Search/SearchIO parsers (section L<"III.4.3">)
1406 should be use for BLAST parsing within bioperl (SearchIO is the preferred
1407 approach and will be formally supported in subsequent releases).
1408
1409
1410 =head2 III.4.5 Parsing HMM reports (HMMER::Results, SearchIO)
1411
1412 Blast is not the only sequence-similarity-searching program supported
1413 by bioperl. HMMER is a Hidden Markov Model (HMM) program that
1414 (among other capabilities) enables sequence similarity searching, from
1415 http://hmmer.wustl.edu. Bioperl does not currently provide a perl interface
1416 for running HMMER.  However, bioperl does provide HMMER report parsers,
1417 one with the (perhaps not too descriptive) name of Results, the other
1418 within the SearchIO framework.
1419
1420 Results can parse reports generated both by the HMMER program
1421 hmmsearch - which searches a sequence database for sequences similar
1422 to those generated by a given HMM - and the program hmmpfam - which
1423 searches a HMM database for HMMs which match domains of a given
1424 sequence. For hmmsearch, a series of HMMER::Set objects
1425 are made, one for each sequence. For hmmpfam searches, only one Set
1426 object is made. Sample usage for parsing a hmmsearch report might be:
1427
1428   use Bio::Tools::HMMER::Results;
1429   $res = new Bio::Tools::HMMER::Results(-file => 'output.hmm',
1430                                         -type => 'hmmsearch' );
1431   foreach $seq ( $res->each_Set ) {
1432       print "Sequence bit score is ", $seq->bits, "\n";
1433       foreach $domain ( $seq->each_Domain ) {
1434           print " Domain start ", $domain->start, " end ",
1435               $domain->end," score ",$domain->bits,"\n";
1436       }
1437   }
1438
1439 Additional methods are described in L<Bio::Tools::HMMER::Results>.
1440
1441 In addition one can parse HMMER output files using Bio::SearchIO. This
1442 approach is the approved and supported Bioperl method. An example:
1443
1444   use Bio::SearchIO;
1445
1446   $in = new Bio::SearchIO(-format => 'hmmer',-file => '123.hmmsearch');
1447   while ( $res = $in->next_result ){
1448     # get a Bio::Search::Result::HMMERResult object
1449     print $res->query_name, " for HMM ", $res->hmm_name, "\n";
1450     while ( $hit = $res->next_hit ){
1451       print $hit->name, "\n";
1452       while ( $hsp = $hit->next_hsp ){
1453         print "length is ", $hsp->length, "\n";
1454       }
1455     }
1456   }
1457
1458 Purists may insist that the term "hsp" is not applicable to hmmsearch or
1459 hmmpfam results and they may be correct - this is an unintended
1460 consequence of using the flexible and extensible SearchIO approach. See
1461 L<Bio::Search::Result::HMMERResult> for more information.
1462
1463
1464 =head2 III.5 Creating and manipulating sequence alignments
1465
1466 Once one has identified a set of similar sequences, one often needs to
1467 create an alignment of those sequences. Bioperl offers several perl
1468 objects to facilitate sequence alignment: pSW, Clustalw.pm, TCoffee.pm
1469 and the bl2seq option of StandAloneBlast. All of these objects take
1470 as arguments a reference to an array of (unaligned) Seq objects.  All
1471 (except bl2seq) return a reference to a SimpleAlign object. bl2seq
1472 can also produce a SimpleAlign object when it is combined with
1473 Bio::AlignIO (see section below, L<"III.5.2">).
1474
1475
1476 =head2    III.5.1 Aligning 2 sequences with Smith-Waterman (pSW)
1477
1478 The Smith-Waterman (SW) algorithm is the standard method for producing
1479 an optimal local alignment of two sequences.  Bioperl supports the
1480 computation of SW alignments via the pSW object. Note that pSW only
1481 supports the alignment of protein sequences, not nucleotide.
1482
1483 The SW algorithm itself is implemented in C and incorporated into bioperl using
1484 an XS extension. This has significant efficiency advantages but means that
1485 pSW will B<not> work unless you have compiled the bioperl-ext
1486 package.  If you have compiled the bioperl-ext package, usage is
1487 simple, where the method align_and_show displays the alignment while
1488 pairwise_alignment produces a (reference to) a SimpleAlign object.
1489
1490   use Bio::Tools::pSW;
1491   $factory = new Bio::Tools::pSW( '-matrix' => 'blosum62.bla',
1492                                   '-gap' => 12,
1493                                   '-ext' => 2, );
1494   $factory->align_and_show($seq1, $seq2, STDOUT);
1495   $aln = $factory->pairwise_alignment($seq1, $seq2);
1496
1497 SW matrix, gap and extension parameters can be adjusted as shown.
1498 Bioperl comes standard with blosum62 and gonnet250 matrices.  Others
1499 can be added by the user.  For additional information on accessing the
1500 SW algorithm via pSW see the script psw.pl in the scripts/tools directory and
1501 the documentation in L<Bio::Tools::pSW>.
1502
1503 An alternative way to get Smith-Waterman alignments is the EMBOSS
1504 program 'water' (see Section L<"IV.3"> for more information on the EMBOSS
1505 package).  This can produce an output file that bioperl can read in with
1506 the AlignIO system
1507
1508   use Bio::AlignIO;
1509   my $in = new Bio::AlignIO(-format => 'emboss', -file => 'filename');
1510   my $aln = $in->next_aln();
1511
1512 =for html <A NAME ="iii.5.2"></A>
1513
1514 =head2   III.5.2 Aligning 2 sequences with Blast using bl2seq and AlignIO
1515
1516 As an alternative to Smith-Waterman, two sequences can also be aligned
1517 in Bioperl using the bl2seq option of Blast within the StandAloneBlast
1518 object.  To get an alignment - in the form of a SimpleAlign object -
1519 using bl2seq, you need to parse the bl2seq report with the Bio::AlignIO
1520 file format reader as follows:
1521
1522   $factory = Bio::Tools::Run::StandAloneBlast->new('outfile' => 'bl2seq.out');
1523   $bl2seq_report = $factory->bl2seq($seq1, $seq2);
1524   # Use AlignIO.pm to create a SimpleAlign object from the bl2seq report
1525   $str = Bio::AlignIO->new('-file '=>' bl2seq.out',
1526                            '-format' => 'bl2seq');
1527   $aln = $str->next_aln();
1528
1529 =head2   III.5.3 Aligning multiple sequences (Clustalw.pm, TCoffee.pm)
1530
1531 For aligning multiple sequences (ie two or more), bioperl offers a
1532 perl interface to the bioinformatics-standard clustalw and tcoffee
1533 programs.  Clustalw has been a leading program in global multiple
1534 sequence alignment (MSA) for several years.  TCoffee is a relatively
1535 recent program - derived from clustalw - which has been shown to
1536 produce better results for local MSA.
1537
1538 To use these capabilities, the clustalw and/or tcoffee programs
1539 themselves need to be installed on the host system.  In addition, the
1540 environmental variables CLUSTALDIR and TCOFFEEDIR need to be set to
1541 the directories containg the executables.  See section L<"I.3"> and the
1542 L<Bio::Tools::Run::Alignment::Clustalw> and
1543 L<Bio::Tools::Run::Alignment::TCoffee> for information on downloading
1544 and installing these programs.
1545
1546 From the user's perspective, the bioperl syntax for calling
1547 Clustalw.pm or TCoffee.pm is almost identical.  The only differences
1548 are the names of the modules themselves appearing in the initial "use"
1549 and constructor statements and the names of the some of the individual
1550 program options and parameters.
1551
1552 In either case, initially, a "factory object" must be created. The
1553 factory may be passed most of the parameters or switches of the
1554 relevant program.  In addition, alignment parameters can be changed
1555 and/or examined after the factory has been created.  Any parameters
1556 not explicitly set will remain as the underlying program's
1557 defaults. Clustalw.pm/TCoffee.pm output is returned in the form of a
1558 SimpleAlign object.  It should be noted that some Clustalw and TCoffee
1559 parameters and features (such as those corresponding to tree
1560 production) have not been implemented yet in the Perl interface.
1561
1562 Once the factory has been created and the appropriate parameters set,
1563 one can call the method align() to align a set of unaligned sequences,
1564 or profile_align() to add one or more sequences or a second alignment
1565 to an initial alignment.  Input to align() consists of a set of
1566 unaligned sequences in the form of the name of file containing the
1567 sequences or a reference to an array of Seq objects. Typical
1568 syntax is shown below. (We illustrate with Clustalw.pm, but the same
1569 syntax - except for the module name - would work for TCoffee.pm)
1570
1571   use Bio::Tools::Run::Alignment::Clustalw;
1572   @params = ('ktuple' => 2, 'matrix' => 'BLOSUM');
1573   $factory = Bio::Tools::Run::Alignment::Clustalw->new(@params);
1574   $ktuple = 3;
1575   $factory->ktuple($ktuple);  # change the parameter before executing
1576   $seq_array_ref = \@seq_array;
1577       # where @seq_array is an array of Bio::Seq objects
1578   $aln = $factory->align($seq_array_ref);
1579
1580 Clustalw.pm/TCoffee.pm can also align two (sub)alignments to each
1581 other or add a sequence to a previously created alignment by using the
1582 profile_align method. For further details on the required syntax and
1583 options for the profile_align method, the user is referred to
1584 L<Bio::Tools::Run::Alignment::Clustalw> and
1585 L<Bio::Tools::Run::Alignment::TCoffee>. The user is also
1586 encouraged to examine the script clustalw.pl in the scripts/align directory.
1587
1588 =for html <A NAME ="iii.5.4"></A>
1589
1590 =head2 III.5.4 Manipulating and displaying alignments (SimpleAlign)
1591
1592 SimpleAlign objects are produced by bioperl alignment creation objects
1593 (eg Clustalw.pm, BLAST's bl2seq, and pSW) and they can read and write multiple
1594 alignment formats via AlignIO.
1595
1596 Some of the manipulations possible with SimpleALign include:
1597
1598 =over 4
1599
1600 =item *
1601
1602 slice(): Obtaining an alignment "slice", that is, a subalignment inclusive of
1603 specified start and end columns.  Sequences with no residues in the slice are
1604 excluded from the new alignment and a warning is printed.
1605
1606 =item *
1607
1608 column_from_residue_number(): Finding column in an alignment where a specified
1609 residue of a specified sequence is located.
1610
1611 =item *
1612
1613 consensus_string(): Making a consensus string. This method includes an
1614 optional threshold parameter, so that positions in the alignment with lower
1615 percent-identity than the threshold are marked by "?"'s in the consensus
1616
1617 =item *
1618
1619 percentage_identity(): A fast method for calculating the average percentage
1620 identity of the alignment
1621
1622 =item *
1623
1624 consensus_iupac(): Making a consensus using IUPAC ambiguity codes from
1625 DNA and RNA.
1626
1627
1628 =back
1629
1630 Skeleton code for using some of these features is shown below.  More detailed,
1631 working code is in Demo example 14 and in align_on_codons.pl in the scripts
1632 directory. Additional documentation on methods can be found in
1633 L<Bio::SimpleAlign> and L<Bio::LocatableSeq>.
1634
1635   use Bio::SimpleAlign;
1636   $aln = Bio::SimpleAlign->new('t/data/testaln.dna');
1637   $threshold_percent = 60;
1638   $consensus_with_threshold = $aln->consensus_string($threshold_percent);
1639   $iupac_consensus = $aln->consensus_iupac();   # dna/rna alignments only
1640   $percent_ident = $aln->percentage_identity;
1641   $seqname = '1433_LYCES';
1642   $pos = $aln->column_from_residue_number($seqname, 14);
1643
1644 =head2 III.6 Searching for genes and other structures on genomic DNA
1645 (Genscan, Sim4, Grail, Genemark, ESTScan, MZEF, EPCR)
1646
1647 Automated searching for putative genes, coding sequences,
1648 sequence-tagged-sites (STS's) and other functional
1649 units in genomic and expressed sequence tag (EST) data has
1650 become very important as the available quantity of sequence data has
1651 rapidly increased.  Many feature searching programs currently exist.
1652 Each produces reports containing predictions that must be read
1653 manually or parsed by automated report readers.
1654
1655 Parsers for six widely used gene prediction programs - Genscan, Sim4,
1656 Genemark, Grail, ESTScan and MZEF - are currently available or under active
1657 development in bioperl. The interfaces for the four parsers are similar.
1658 We illustrate the usage for Genscan and Sim4 here.  The syntax is relatively
1659 self-explanatory; see L<Bio::Tools::Genscan>, L<Bio::Tools::Genemark>,
1660 L<Bio::Tools::Grail>, L<Bio::Tools::ESTScan>, L<Bio::Tools::MZEF>, and
1661 L<Bio::Tools::Sim4::Results> for further details.
1662
1663   use Bio::Tools::Genscan;
1664   $genscan = Bio::Tools::Genscan->new(-file => 'result.genscan');
1665   # $gene is an instance of Bio::Tools::Prediction::Gene
1666   # $gene->exons() returns an array of Bio::Tools::Prediction::Exon objects
1667   while($gene = $genscan->next_prediction())
1668       { @exon_arr = $gene->exons(); }
1669   $genscan->close();
1670
1671 See L<Bio::Tools::Prediction::Gene> and L<Bio::Tools::Prediction::Exon>
1672 for more details.
1673
1674   use Bio::Tools::Sim4::Results;
1675   $sim4 = new Bio::Tools::Sim4::Results(-file => 't/data/sim4.rev',
1676                                         -estisfirst => 0);
1677   # $exonset is-a Bio::SeqFeature::Generic with Bio::Tools::Sim4::Exons
1678   # as sub features
1679   $exonset = $sim4->next_exonset;
1680   @exons = $exonset->sub_SeqFeature();
1681   # $exon is-a Bio::SeqFeature::FeaturePair
1682   $exon = 1;
1683   $exonstart = $exons[$exon]->start();
1684   $estname = $exons[$exon]->est_hit()->seqname();
1685   $sim4->close();
1686
1687 See L<Bio::SeqFeature::Generic> and L<Bio::Tools::Sim4::Exons> for more
1688 information.
1689
1690 A parser for the ePCR program is also available. The ePCR program identifies
1691 potential PCR-based sequence tagged sites (STSs)
1692 For more details see the documentation in L<Bio::Tools::EPCR>.
1693 A sample skeleton script for parsing an ePCR report and using the data to
1694 annotate a genomic sequence might look like this:
1695
1696   use Bio::Tools::EPCR;
1697   use Bio::SeqIO;
1698   $parser = new Bio::Tools::EPCR(-file => 'seq1.epcr');
1699   $seqio = new Bio::SeqIO(-format => 'fasta', -file => 'seq1.fa');
1700   $seq = $seqio->next_seq;
1701   while( $feat = $parser->next_feature ) {
1702         # add EPCR annotation to a sequence
1703         $seq->add_SeqFeature($feat);}
1704
1705 =for html <A NAME ="iii.7"></A>
1706
1707 =head2 III.7 Developing machine readable sequence annotations
1708
1709 Historically, annotations for sequence data have been entered and read
1710 manually in flat-file or relational databases with relatively little
1711 concern for machine readability.  More recent projects - such as EBI's
1712 Ensembl project and the efforts to develop an XML molecular biology
1713 data specification - have begun to address this limitation.  Because
1714 of its strengths in text processing and regular-expression handling,
1715 perl is a natural choice for the computer language to be used for this
1716 task.  And bioperl offers numerous tools to facilitate this process -
1717 several of which are described in the following sub-sections.
1718
1719 =for html <A NAME ="iii.7.1"></A>
1720
1721 =head2 III.7.1 Representing sequence annotations (Annotation,SeqFeature)
1722
1723 As of the 0.7 release of bioperl, the fundamental sequence object,
1724 Seq, can have multiple sequence feature (SeqFeature) objects - eg
1725 Gene, Exon, Promoter objects - associated with it.  A Seq object can
1726 also have an Annotation object (used to store database links,
1727 literature references and comments) associated with it.  Creating a
1728 new SeqFeature and Annotation and associating it with a Seq is
1729 accomplished with syntax like:
1730
1731   $feat = new Bio::SeqFeature::Generic('-start'   => 40,
1732                                        '-end'     => 80,
1733                                        '-strand'  => 1,
1734                                        '-primary' => 'exon',
1735                                        '-source'  => 'internal' );
1736   $seqobj->add_SeqFeature($feat); # Add the SeqFeature to the parent
1737   $seqobj->annotation(new Bio::Annotation
1738       ('-description' => 'desc-here'));
1739
1740 Once the features and annotations have been associated with the Seq,
1741 they can be with retrieved, eg:
1742
1743   @topfeatures = $seqobj->top_SeqFeatures(); # just top level, or
1744   @allfeatures = $seqobj->all_SeqFeatures(); # descend into sub features
1745   $ann = $seqobj->annotation(); # annotation object
1746
1747 The individual components of a SeqFeature can also be set or retrieved
1748 with methods including:
1749
1750   # attributes which return numbers
1751   $feat->start          # start position
1752   $feat->end            # end position
1753
1754   $feat->strand         # 1 means forward, -1 reverse, 0 not relevant
1755
1756   # attributes which return strings
1757   $feat->primary_tag    # the main 'name' of the sequence feature,
1758                         # eg, 'exon'
1759   $feat->source_tag     # where the feature comes from, eg'BLAST'
1760
1761   # attributes which return Bio::PrimarySeq objects
1762   $feat->seq            # the sequence between start,end
1763   $feat->entire_seq     # the entire sequence
1764
1765   # other useful methods include
1766   $feat->overlap($other)  # do SeqFeature $feat and SeqFeature $other overlap?
1767   $feat->contains($other) # is $other completely within $feat?
1768   $feat->equals($other)   # do $feat and $other completely agree?
1769   $feat->sub_SeqFeatures  # create/access an array of subsequence features
1770
1771 See L<Bio::Annotation> and L<Bio::SeqFeature::Generic> as starting points
1772 for further exploration, and see the scripts/gff2ps.pl script.
1773
1774 It is worth mentioning that one can also retrieve the start and end
1775 positions of a feature using a Bio::LocationI object:
1776
1777   $location = $feat->location # $location is a Bio::LocationI object
1778   $location->start;           # start position
1779   $location->end;             # end position
1780
1781 This is useful because one needs a Bio::Location::SplitLocationI object
1782 in order to retrieve the split coordinates inside the Genbank or EMBL join()
1783 statements (e.g. "CDS    join(51..142,273..495,1346..1474)"):
1784
1785   if ( $feat->location->isa('Bio::Location::SplitLocationI') &&
1786                $feat->primary_tag eq 'CDS' )  {
1787     foreach $loc ( $feat->location->sub_Location ) {
1788       print $loc->start . ".." . $loc->end . "\n";
1789     }
1790   }
1791
1792 See L<Bio::LocationI> and L<Bio::Location::SplitLocationI> for more
1793 information.
1794
1795 If more detailed annotation is required than is currently available in Seq
1796 objects the RichSeq object may be used. It is applicable in particular to
1797 database sequences (EMBL, GenBank and Swissprot) with detailed annotations.
1798 Sample usage might be:
1799
1800     @secondary   = $richseq->get_secondary_accessions;
1801     $division    = $richseq->division;
1802     @dates       = $richseq->get_dates;
1803     $seq_version = $richseq->seq_version;
1804
1805 See L<Bio::Seq::RichSeqI> for more details.
1806
1807
1808 =for html <A NAME ="iii.7.2"></A>
1809
1810 =head2 III.7.2 Representing and large and/or changing sequences (LiveSeq,LargeSeq)
1811
1812 This interface extends the Bio::SeqI interface to give additional functionality
1813 to sequences with richer data sources, in particular from database sequences
1814 (EMBL, GenBank and Swissprot).
1815
1816 Very large sequences and/or data files with sequences that are frequently being
1817 updated present special problems to automated sequence-annotation storage and
1818 retrieval projects.  Bioperl's LargeSeq and LiveSeq objects are designed to
1819 address these two situations.
1820
1821 LargeSeq
1822
1823 A LargeSeq object is a SeqI compliant object that stores
1824 a sequence as a series of files in a temporary directory (see sect L<"II.1">
1825 or L<Bio::SeqI> for a definition of SeqI objects). The aim is to enable
1826 storing very large sequences (eg, E<gt> 100MBases) without running out of
1827 memory and, at the same time, preserving the familiar bioperl Seq object
1828 interface. As a result, from the users perspective, using a LargeSeq
1829 object is almost identical to using a Seq object. The principal
1830 difference is in the format used in the SeqIO calls. Another difference
1831 is that the user must remember to only read in small chunks of the
1832 sequence at one time.  These differences are illustrated in the following
1833 code:
1834
1835   $seqio = new Bio::SeqIO('-format'=>'largefasta',
1836                           '-file'  =>'t/data/genomic-seq.fasta');
1837   $pseq = $seqio->next_seq();
1838   $plength = $pseq->length();
1839   $last_4 = $pseq->subseq($plength-3,$plength);  # this is OK
1840
1841   # On the other hand, the next statement would
1842   # probably cause the machine to run out of memory
1843   # $lots_of_data = $pseq->seq();  # NOT OK for a large LargeSeq object
1844
1845 LiveSeq
1846
1847 The LiveSeq object addresses the need for a sequence object capable of
1848 handling sequence data that may be changing over time.  In such a
1849 sequence, the precise locations of features along the sequence may
1850 change.  LiveSeq deals with this issue by re-implementing the sequence
1851 object internally as a "double linked chain." Each element of the
1852 chain is connected to other two elements (the PREVious and the NEXT
1853 one). There is no absolute position (like in an array), hence if
1854 positions are important, they need to be computed (methods are
1855 provided). Otherwise it's easy to keep track of the elements with
1856 their "LABELs". There is one LABEL (think of it as a pointer) to each
1857 ELEMENT. The labels won't change after insertions or deletions of the
1858 chain. So it's always possible to retrieve an element even if the
1859 chain has been modified by successive insertions or deletions.
1860
1861 Although the implementation of the LiveSeq object is novel, its
1862 bioperl user interface is unchanged since LiveSeq implements a
1863 PrimarySeqI interface (recall PrimarySeq is the subset of Seq without
1864 annotations or SeqFeatures - see section L<"II.1"> or L<Bio::PrimarySeq>).
1865 Consequently syntax for using LiveSeq objects is familiar although a
1866 modified version of SeqIO called Bio::LiveSeq::IO::Bioperl needs to be
1867 used to actually load the data, eg:
1868
1869   $loader=Bio::LiveSeq::IO::BioPerl->load('-db'=>"EMBL",
1870                                           '-file'=>"t/data/factor7.embl");
1871   $gene=$loader->gene2liveseq('-gene_name' => "factor7");
1872   $id = $gene->get_DNA->display_id ;
1873   $maxstart = $gene->maxtranscript->start;
1874
1875 See L<Bio::LiveSeq::IO::BioPerl> for more details.
1876
1877 Creating, maintaining and querying of LiveSeq genes is quite memory
1878 and processor intensive.  Consequently, any additional information
1879 relating to mutational changes in a gene need to be stored separately
1880 from the sequence data itself. The next section describes the mutation
1881 and polymorphism objects used to accomplish this.
1882
1883
1884 =head2 III.7.3 Representing related sequences - mutations,
1885 polymorphisms etc (Allele, SeqDiff)
1886
1887 The Mutation object allows for a basic description of a sequence change
1888 in the DNA sequence of a gene. The Mutator object takes in mutations,
1889 applies them to a LiveSeq gene and returns a set of Bio::Variation
1890 objects describing the net effect of the mutation on the gene at the DNA,
1891 RNA and protein level.
1892
1893 The objects in Bio::Variation and Bio::LiveSeq directory were
1894 originally designed for the "Computational Mutation Expression
1895 Toolkit" project at European Bioinformatics Institute (EBI). The
1896 result of using them to mutate a gene is a holder object, 'SeqDiff',
1897 that can be printed out or queried for specific information. For
1898 example, to find out if restriction enzyme changes caused by a
1899 mutation are exactly the same in DNA and RNA sequences, we can write:
1900
1901   use Bio::LiveSeq::IO::BioPerl;
1902   use Bio::LiveSeq::Mutator;
1903   use Bio::LiveSeq::Mutation;
1904
1905   $loader = Bio::LiveSeq::IO::BioPerl->load('-file' => "$filename");
1906   $gene = $loader->gene2liveseq('-gene_name' => $gene_name);
1907   $mutation = new Bio::LiveSeq::Mutation ('-seq' =>'G',
1908                                           '-pos' => 100 );
1909   $mutate = Bio::LiveSeq::Mutator->new('-gene'      => $gene,
1910                                        '-numbering' => "coding"  );
1911   $mutate->add_Mutation($mutation);
1912   $seqdiff = $mutate->change_gene();
1913   $DNA_re_changes = $seqdiff->DNAMutation->restriction_changes;
1914   $RNA_re_changes = $seqdiff->RNAChange->restriction_changes;
1915   $DNA_re_changes eq $RNA_re_changes or print "Different!\n";
1916
1917 For a complete working script, see the change_gene.pl script
1918 in the scripts/liveseq directory. For more details on the use of these objects
1919 see L<Bio::LiveSeq::Mutator> and L<Bio::LiveSeq::Mutation> as well as
1920 the original documentation for the "Computational Mutation Expression
1921 Toolkit" project at http://www.ebi.ac.uk/mutations/toolkit/.
1922
1923 =for html <A NAME ="iii.7.4"></A>
1924
1925 =head2 III.7.4 Incorporating quality data in sequence annotation (SeqWithQuality)
1926
1927 SeqWithQuality objects are used to describe sequences with very
1928 specific annotations - that is, data quality annotaions.  Data quality
1929 information is important for documenting the reliability of base
1930 "calls" in newly sequenced or otherwise questionable sequence
1931 data. The quality data is contained within a Bio::Seq::PrimaryQual object.
1932 Syntax for using SeqWithQuality objects is as follows:
1933
1934   # first, make a PrimarySeq object
1935   $seqobj = Bio::PrimarySeq->new
1936           ( -seq => 'atcgatcg',            -id  => 'GeneFragment-12',
1937             -accession_number => 'X78121', -alphabet => 'dna');
1938   # now make a PrimaryQual object
1939   $qualobj = Bio::Seq::PrimaryQual->new
1940          ( -qual => '10 20 30 40 50 50 20 10', -id  => 'GeneFragment-12',
1941            -accession_number => 'X78121',      -alphabet => 'dna');
1942   # now make the SeqWithQuality object
1943   $swqobj = Bio::Seq::SeqQithQuality->new
1944           ( -seq  => $seqobj, -qual => $qualobj);
1945   # Now we access the sequence with quality object
1946   $swqobj->id(); # the id of the SeqWithQuality object may not match the
1947                  # id of the sequence or of the quality
1948   $swqobj->seq(); # the sequence of the SeqWithQuality object
1949   $swqobj->qual(); # the quality of the SeqWithQuality object
1950
1951 A SeqWithQuality object is created automatically when phred output, a *phd
1952 file, is read by Seqio, eg
1953
1954   $seqio = Bio::SeqIO->new(-file=>"my.phd",-format=>"phd");
1955   # or just 'Bio::SeqIO->new(-file=>"my.phd")'
1956   $seqWithQualObj = $seqio->next_seq;
1957
1958 See L<Bio::Seq::SeqWithQuality> for a detailed description of the methods,
1959 L<Bio::Seq::PrimaryQual>, and L<Bio::SeqIO::phd>.
1960
1961
1962 =head2 III.7.5 Sequence XML representations - generation and parsing (SeqIO::game, SeqIO::bsml)
1963
1964 The previous subsections have described tools for automated sequence
1965 annotation by the creation of an "object layer" on top of a
1966 traditional database structure.  XML takes a somewhat different
1967 approach.  In XML, the data structure is unmodified, but machine
1968 readability is facilitated by using a data-record syntax with special
1969 flags and controlled vocabulary.
1970
1971 Bioperl supports a set of XML flags and vocabulary words for molecular
1972 biology - called bioxml - detailed at
1973 http://www.bioxml.org/dtds/current/. The idea is that any bioxml
1974 features can be turned into bioperl Seq annotations.  Conversely
1975 Seq object features and annotations can be converted to XML so that
1976 they become available to any other systems that are XML (and bioxml)
1977 compliant.  Typical usage is shown below. No special syntax is
1978 required by the user. Note that some Seq annotation will be lost when
1979 using bioxml in this manner since in its current implementation
1980 bioxml does not support all the annotation information available in
1981 Seq objects.
1982
1983   $str = Bio::SeqIO->new('-file'=> 't/data/test.game',
1984                          '-format' => 'game');
1985   $seq = $str->next_primary_seq();
1986   $id = $seq->id;
1987   @feats = $seq->all_SeqFeatures();
1988   $first_primary_tag = $feats[0]->primary_tag;
1989
1990 Additional XML formats to describe sequences and their annotations
1991 have been created.  BSML and AGAVE are two additional formats that
1992 have been created in the last year.  Bioperl currently only supports
1993 BSML through the SeqIO system at this time.  Usage is similar to other
1994 SeqIO parsing.
1995
1996   $str = Bio::SeqIO->new('-file'=> 'bsmlfile.xml',
1997                          '-format' => 'bsml');
1998   $seq = $str->next_primary_seq();
1999   $id = $seq->id;
2000   @feats = $seq->all_SeqFeatures();
2001   $first_primary_tag = $feats[0]->primary_tag;
2002
2003
2004 =head2  III.8 Representing non-sequence data in Bioperl: structures, trees and maps
2005
2006 Though bioperl has its roots in describing and searching nucleotide and protein
2007 sequences it has also branched out into related fields of study,
2008 such as protein structure, phylogenetic trees and genetic maps.
2009
2010
2011 =head2 III.8.1 Using 3D structure objects and reading PDB files
2012 (StructureI, Structure::IO)
2013
2014 A StructureIO object can be created from one or more 3D structures
2015 represented in Protein Data Bank, or pdb, format (see
2016 http://www.rcsb.org/pdb for details).
2017
2018 StructureIO objects allow access to a variety of related Bio:Structure
2019 objects. An Entry object consist of one or more Model objects, which
2020 in turn consist of one or more Chain objects. A Chain is composed of
2021 Residue objects, which in turn consist of Atom objects. There's a
2022 wealth of methods, here are just a few:
2023
2024   $structio = Bio::Structure::IO->new( -file => "1XYZ.pdb");
2025   $struc = $structio->next_structure; # returns an Entry object
2026   $ann = $struc->annotation; # returns a Bio::Annotation object
2027   $pseq = $struc->seqres;    # returns a PrimarySeq object, thus
2028   $pseq->subseq(1,20);              # returns a sequence string
2029   @atoms = $struc->get_atoms($res); # Atom objects, given a Residue
2030   @xyz = $atom->xyz;                # the 3D coordinates of the atom
2031
2032 These lines show how one has access to a number of related objects and methods.
2033 For examples of typical usage of these modules, see the scripts in the
2034 scripts/structure subdirectory. Also see L<Bio::Structure::IO>,
2035 L<Bio::Structure::Entry>, L<Bio::Structure::Model>,
2036 L<Bio::Structure::Chain>, L<Bio::Structure::Residue>, and
2037 L<Bio::Structure::Atom> for more information.
2038
2039 =head2  III.8.2 Tree objects and phylogenetic trees (Tree::Tree, TreeIO)
2040
2041 Bioperl Tree objects can store data for all kinds of computer trees
2042 and are intended especially for phylogenetic trees.  Nodes and
2043 branches of trees can be individually manipulated.  The TreeIO object
2044 is used for stream I/O of tree objects.  Currently only phylip/newick
2045 tree format is supported.  Sample code might be:
2046
2047   $treeio = new Bio::TreeIO( -format => 'newick', -file   => $treefile);
2048   $tree = $treeio->next_tree;   # get the tree
2049   @nodes = $tree->get_nodes;    # get all the nodes
2050   $tree->get_root_node()->each_Descendent();  # get descendents of root node
2051
2052 See L<Bio::TreeIO> and L<Bio::Tree::Tree> for details.
2053
2054
2055 =head2 III.8.3 Map objects for manipulating genetic maps (Map::MapI,
2056 MapIO)
2057
2058 Bioperl map objects can be used to describe any type of biological map
2059 data including genetic maps, STS maps etc.  Map I/O is performed with
2060 the MapIO object which works in a similar manner to the SeqIO,
2061 SearchIO and similar I/O objects described previously. In principle,
2062 Map I/O with various map data formats can be performed.  However
2063 currently only "mapmaker" format is supported.  Manipulation of
2064 genetic map data with Bioperl Map objects might look like this:
2065
2066   $mapio = new Bio::MapIO( '-format' => 'mapmaker', '-file' => $mapfile);
2067   $map = $mapio->next_map;  # get a map
2068   $maptype =  $map->type ;
2069   foreach  $marker ( $map->each_element ) {
2070     $marker_name =  $marker->name ;  # get the name of each map marker
2071   }
2072
2073 See L<Bio::MapIO> and L<Bio::Map::SimpleMap> for more information.
2074
2075
2076 =head2 III.8.4 Bibliographic objects for querying bibliographic databases (Biblio)
2077
2078 Bio::Biblio objects are used to query bibliographic databases, such as MEDLINE.
2079 The associated modules are built to work with OpenBQS-compatible databases
2080 (see http://industry.ebi.ac.uk/openBQS). A Bio::Biblio object can execute a query
2081 like:
2082
2083   my $collection = $biblio->find ('brazma', 'authors');
2084   while ( $collection->has_next ) {
2085       print $collection->get_next;
2086   }
2087
2088 See L<Bio::Biblio> or the scripts/biblio/biblio.pl script for details.
2089
2090
2091 =for html <A NAME ="iii.8.5"></A>
2092
2093 =head2 III.8.5 Graphics objects for representing sequence objects as images (Graphics)
2094
2095 A user may want to represent Seq objects and their SeqFeatures graphically. The
2096 Bio::Graphics::* modules use Perl's GD.pm module to create a PNG or GIF image
2097 given the SeqFeatures (Section L<"III.7.1">) contained within a Seq object.
2098
2099 These modules contain numerous methods to dictate the sizes, colors, labels,
2100 and line formats within the image. See L<Bio::Graphics>, L<Bio::Graphics::Panel>,
2101 or the scripts/render_sequence.pl script for more information.
2102
2103 The Genquire application also provides ways to graphically represent Seq objects
2104 (see Section L<"IV.6">).
2105
2106
2107 =head2 III.9 Bioperl alphabets
2108
2109 Bioperl modules use the standard extended single-letter genetic
2110 alphabets to represent nucleotide and amino acid sequences.
2111
2112 In addition to the standard alphabet, the following symbols
2113 are also acceptable in a biosequence:
2114
2115  ?  (a missing nucleotide or amino acid)
2116  -  (gap in sequence)
2117
2118
2119 =head2 III.9.1 Extended DNA / RNA alphabet
2120
2121  (includes symbols for nucleotide ambiguity)
2122  ------------------------------------------
2123  Symbol       Meaning      Nucleic Acid
2124  ------------------------------------------
2125   A            A           Adenine
2126   C            C           Cytosine
2127   G            G           Guanine
2128   T            T           Thymine
2129   U            U           Uracil
2130   M          A or C
2131   R          A or G
2132   W          A or T
2133   S          C or G
2134   Y          C or T
2135   K          G or T
2136   V        A or C or G
2137   H        A or C or T
2138   D        A or G or T
2139   B        C or G or T
2140   X      G or A or T or C
2141   N      G or A or T or C
2142
2143
2144  IUPAC-IUB SYMBOLS FOR NUCLEOTIDE NOMENCLATURE:
2145    Cornish-Bowden (1985) Nucl. Acids Res. 13: 3021-3030.
2146
2147
2148 =head2 III.9.2 Amino Acid alphabet
2149
2150  ------------------------------------------
2151  Symbol   Meaning
2152  ------------------------------------------
2153  A        Alanine
2154  B        Aspartic Acid, Asparagine
2155  C        Cystine
2156  D        Aspartic Acid
2157  E        Glutamic Acid
2158  F        Phenylalanine
2159  G        Glycine
2160  H        Histidine
2161  I        Isoleucine
2162  K        Lysine
2163  L        Leucine
2164  M        Methionine
2165  N        Asparagine
2166  P        Proline
2167  Q        Glutamine
2168  R        Arginine
2169  S        Serine
2170  T        Threonine
2171  V        Valine
2172  W        Tryptophan
2173  X        Unknown
2174  Y        Tyrosine
2175  Z        Glutamic Acid, Glutamine
2176  *        Terminator
2177
2178
2179    IUPAC-IUP AMINO ACID SYMBOLS:
2180    Biochem J. 1984 Apr 15; 219(2): 345-373
2181    Eur J Biochem. 1993 Apr 1; 213(1): 2
2182
2183
2184 =for html <A NAME ="IV."></A>
2185
2186 =head1 IV.  Related projects - biocorba, biopython, biojava, Ensembl,
2187  Genquire /AnnotationWorkbench / bioperl-gui
2188
2189 There are several "sister projects" to bioperl currently under
2190 development. These include biocorba, biopython, biojava, EMBOSS, Ensembl,
2191 and Genquire / Annotation Workbench (which includes Bioperl-gui).  These are all
2192 large complex projects and describing them in detail here will not be
2193 attempted.  However a brief introduction seems appropriate since, in
2194 the future, they may each provide significant added utility to the
2195 bioperl user.
2196
2197 =head2 IV.1 Biocorba
2198
2199 Interface objects have facilitated interoperability between bioperl
2200 and other perl packages such as Ensembl and the Annotation Workbench.
2201 However, interoperability between bioperl and packages written in
2202 other languages requires additional support software.  CORBA is one
2203 such framework for interlanguage support, and the biocorba project is
2204 currently implementing a CORBA interface for bioperl.  With biocorba,
2205 objects written within bioperl will be able to communicate with
2206 objects written in biopython and biojava (see the next subsection).
2207 For more information, see the biocorba project website at
2208 http://biocorba.org/.  The Bioperl BioCORBA server and client bindings are
2209 available in the bioperl-corba-server and bioperl-corba-client bioperl CVS
2210 repositories respecitively. (see http://cvs.bioperl.org for more information).
2211
2212 =head2 IV.2 Biopython and biojava
2213
2214 Biopython and biojava are open source projects with very similar goals
2215 to bioperl.  However their code is implemented in python and java,
2216 respectively.  With the development of interface objects and biocorba,
2217 it is possible to write java or python objects which can be accessed
2218 by a bioperl script, or to call bioperl objects from java or python
2219 code.  Since biopython and biojava are more recent projects than
2220 bioperl, most effort to date has been to port bioperl functionality to
2221 biopython and biojava rather than the other way around.  However, in
2222 the future, some bioinformatics tasks may prove to be more effectively
2223 implemented in java or python in which case being able to call them
2224 from within bioperl will become more important.  For more information,
2225 go to the biojava http://biojava.org/ and biopython http://biopython.org/
2226 websites.
2227
2228
2229 =for html <A NAME ="iv.3"></A>
2230
2231 =head2 IV.3 EMBOSS
2232
2233 EMBOSS is another open source project with similar goals
2234 to bioperl.  However EMBOSS code is implemented in C and has been
2235 designed for standalone execution on the Unix command line, rather
2236 than for incorporation into a user script or program.  EMBOSS includes
2237 a wide array of useful bioinformatics functions similar to those of
2238 the GCG package after which it was designed.  For more information on EMBOSS
2239 refer to http://www.hgmp.mrc.ac.uk/Software/EMBOSS/.  The principal bioperl
2240 interface object to EMBOSS is described in L<Bio::Tools::Run::EMBOSSApplication>
2241 (and see section L<"III.3.6">).
2242
2243 =head2 IV.4  Ensembl and bioperl-db
2244
2245 Ensembl is an ambitious automated-genome-annotation project at EBI.
2246 Much of Ensembl's code is based on bioperl, and Ensembl developers, in
2247 turn, have contributed significant pieces of code to bioperl.  In
2248 particular, the bioperl code for automated sequence annotation has
2249 been largely contributed by Ensembl developers.
2250 Describing Ensembl and its capabilities is far beyond the scope of
2251 this tutorial The interested reader is referred to the Ensembl website
2252 at http://www.ensembl.org/.
2253
2254 Bioperl-db is a relatively new project intended to transfer some of
2255 Ensembl's capability of integrating bioperl syntax with a standalone
2256 Mysql database (http://www.mysql.com) to the bioperl code-base. More details
2257 on bioperl-db can be found in the bioperl-db CVS directory at
2258 http://cvs.bioperl.org/cgi-bin/viewcvs/viewcvs.cgi/bioperl-db/?cvsroot=bioperl.
2259 It is worth mentioning that most of the bioperl objects mentioned above
2260 map directly to tables in the bioperl-db schema. Therefore object data such
2261 as sequences, their features, and annotations can be easily loaded into the
2262 databases, as in
2263
2264   $loader->store($newid,$seqobj)
2265
2266 Similarly one can query the database in a variety of ways and retrieve
2267 arrays of Seq objects. See biodatabases.pod, L<Bio::DB::SQL::SeqAdaptor>,
2268 L<Bio::DB::SQL::QueryConstraint>, and L<Bio::DB::SQL::BioQuery> for examples.
2269
2270
2271 =head2 IV.5 GFF format and Bio::DB::GFF*
2272
2273 The Bio::DB::GFF module provides access to relational databases constructed
2274 from data files in GFF format. This file type is well suited to sequence
2275 annotation because it allows the ability to describe entries in terms of
2276 parent-child relationships (see http://www.sanger.ac.uk/software/GFF for
2277 details). Like bioperl-db, above, the current implementation uses mysql
2278 (http://www.mysql.com).
2279
2280 The module accesses not only by id but by annotation type and position or
2281 range. Those who would like to explore bioperl as a means to overlay nucleotide
2282 sequence, protein sequence, features, and annotations should take a close
2283 look at L<Bio::DB::GFF>and the load_gff.pl, bulk_load_gff.pl, gadfly_to_gff.pl,
2284 and sgd_to_gff.pl scripts in the scripts/Bio-DB-GFF directory.
2285
2286
2287 =for html <A NAME ="iv.6"></A>
2288
2289 =head2 IV.6 Genquire, the Annotation Workbench and bioperl-gui
2290
2291 The Annotation Workbench and Genquire were developed at the Plant
2292 Biotechnology Institute of the National Research Council of Canada.
2293 This is an integrated graphical suite of tools in Perl for examining a sequence,
2294 predicting gene structure, and creating annotations.  Information about
2295 Genquire is available at http://bioinformatics.org/project/?group_id=99.
2296 With Genquire and bioperl-gui one can display a Bioperl Seq object
2297 graphically.  You can download the current version of
2298 the gui software from the bioperl-gui CVS directory at
2299 http://cvs.bioperl.org/cgi-bin/viewcvs/viewcvs.cgi/bioperl-gui/?cvsroot=bioperl.
2300 You will also need Tcl/Tk.
2301
2302
2303 =head2 V.1 Appendix: Finding out which methods are used by which
2304 Bioperl Objects
2305
2306 At numerous places in the tutorial, the reader is directed to the
2307 "documentation included with each of the modules."  As was mentioned in
2308 the introduction, it is sometimes not easy in perl to determine the
2309 appropriate documentation to look for, because objects inherit methods
2310 from other objects (and the relevant documentation will be stored in the
2311 object from which the method was inherited.)
2312
2313 For example, say you wanted to find documentation on the "parse" method
2314 of the object Genscan.pm.  You would not find this documentation in
2315 the code for Genscan.pm, but rather in the code for AnalysisResult.pm
2316 from which Genscan.pm inherits the parse method!
2317
2318 So how would you know to look in AnalysisResult.pm
2319 for this documentation? The easy way is to use the special function
2320 "option 100" in the bptutorial script. Specifically if you run:
2321
2322  > perl -w bptutorial.pl 100 Bio::Tools::Genscan
2323
2324 you will receive the following output:
2325
2326  ***Methods for Object Bio::Tools::Genscan ********
2327
2328  Methods taken from package Bio::Root::IO
2329  catfile   close   gensym   new   qualify   qualify_to_ref
2330  rmtree   tempdir   tempfile   ungensym
2331
2332  Methods taken from package Bio::Root::RootI
2333  DESTROY   stack_trace   stack_trace_dump   throw   verbose   warn
2334
2335  Methods taken from package Bio::SeqAnalysisParserI
2336  carp   confess   croak   next_feature
2337
2338  Methods taken from package Bio::Tools::AnalysisResult
2339  analysis_date   analysis_method   analysis_method_version   analysis_query   analysis_subject   parse
2340
2341  Methods taken from package Bio::Tools::Genscan
2342  next_prediction
2343
2344 From this output, it is clear exactly from which object each method
2345 of Genscan.pm is taken, and, in particular that "parse" is
2346 taken from the package Bio::Tools::AnalysisResult.
2347
2348 With this approach you can easily determine the source of any method
2349 in any bioperl object.
2350
2351 =head2 V.2 Appendix: Tutorial demo scripts
2352
2353 The following scripts demonstrate many of the features of bioperl. To
2354 run all the demos run:
2355
2356  > perl -w  bptutorial.pl 0
2357
2358 To run a subset of the scripts do
2359
2360  > perl -w  bptutorial.pl
2361
2362 and use the displayed help screen.
2363
2364 =cut
2365
2366 #!/usr/bin/perl
2367
2368 # PROGRAM  : bptutorial.pl
2369 # PURPOSE  : Demonstrate various uses of the bioperl package
2370 # AUTHOR   : Peter Schattner schattner@alum.mit.edu
2371 # CREATED  : Dec 15 2000
2372 # REVISION : $Id$
2373
2374 use strict;
2375 use Bio::SimpleAlign;
2376 use Bio::AlignIO;
2377 use Bio::SeqIO;
2378 use Bio::Seq;
2379
2380 # subroutine references
2381
2382 my ($access_remote_db, $index_local_db, $fetch_local_db,
2383     $sequence_manipulations, $seqstats_and_seqwords,
2384     $restriction_and_sigcleave, $other_seq_utilities, $run_remoteblast,
2385     $run_standaloneblast,  $blast_parser, $bplite_parsing, $hmmer_parsing,
2386     $run_clustalw_tcoffee, $run_psw_bl2seq, $simplealign,
2387     $gene_prediction_parsing, $sequence_annotation, $largeseqs,
2388     $run_tree, $run_map, $run_struct, $run_perl, $searchio_parsing,
2389     $liveseqs, $demo_variations, $demo_xml, $display_help, $bpinspect1 );
2390
2391 # global variable file names.  Edit these if you want to try
2392 #out a tutorial script on a different file
2393
2394  Bio::Root::IO->catfile("t","data","ecolitst.fa");
2395
2396 my $dna_seq_file = Bio::Root::IO->catfile("t","data","dna1.fa");      # used in $sequence_manipulations
2397 my $amino_seq_file = Bio::Root::IO->catfile("t","data","cysprot1.fa");  # used in $other_seq_utilities and in $run_perl
2398                                        #and $sequence_annotation
2399 my $blast_report_file = Bio::Root::IO->catfile("t","data","blast.report");   # used in $blast_parser
2400 my $bp_parse_file1 = Bio::Root::IO->catfile("t","data","blast.report");       # used in $bplite_parsing
2401 my $bp_parse_file2 = Bio::Root::IO->catfile("t","data","psiblastreport.out"); # used in $bplite_parsing
2402 my $bp_parse_file3 = Bio::Root::IO->catfile("t","data","bl2seq.out");        # used in $bplite_parsing
2403 my $unaligned_amino_file = Bio::Root::IO->catfile("t","data","cysprot1a.fa"); # used in $run_clustalw_tcoffee
2404 my $aligned_amino_file = Bio::Root::IO->catfile("t","data","testaln.pfam");    # used in $simplealign
2405
2406 # other global variables
2407 my (@runlist, $n );
2408
2409
2410 ##############################
2411 #  display_help
2412 $display_help = sub {
2413
2414
2415 # Prints usage information for tutorial script.
2416
2417     print STDERR <<"QQ_PARAMS_QQ";
2418
2419 The following numeric arguments can be passed to run the corresponding demo-script.
2420 1  => access_remote_db ,
2421 2  => index_local_db ,
2422 3  => fetch_local_db ,               (NOTE: needs to be run with demo 2)
2423 4  => sequence_manipulations ,
2424 5  => seqstats_and_seqwords ,
2425 6  => restriction_and_sigcleave ,
2426 7  => other_seq_utilities ,
2427 8  => run_standaloneblast ,
2428 9  => bplite_parsing ,
2429 10 => hmmer_parsing ,
2430 11 => run_clustalw_tcoffee ,
2431 12 => run_psw_bl2seq ,
2432 13 => simplealign ,
2433 14 => gene_prediction_parsing ,
2434 15 => sequence_annotation ,
2435 15 => largeseqs ,
2436 17 => liveseqs ,
2437 18 => demo_variations ,
2438 19 => demo_xml ,
2439 20 => run_tree ,
2440 21 => run_map,
2441 22 => run_struct,
2442 23 => run_perl,
2443 24 => searchio_parsing ,
2444 25 => run_remoteblast ,
2445
2446 In addition the argument "100" followed by the name of a single
2447 bioperl object will display a list of all the public methods
2448 available from that object and from what object they are inherited.
2449
2450 Using the parameter "0" will run all tests.
2451 Using any other argument (or no argument) will run this display.
2452
2453 So typical command lines might be:
2454 To run all demo scripts:
2455  > perl -w  bptutorial.pl 0
2456 or to just run the local indexing demos:
2457  > perl -w  bptutorial.pl 2 3
2458 or to list all the methods available for object Bio::Tools::SeqStats -
2459  > perl -w  bptutorial.pl 100 Bio::Tools::SeqStats
2460
2461 QQ_PARAMS_QQ
2462
2463 exit(0);
2464 };
2465
2466
2467 ## Note: "main" routine is at very end of script
2468
2469 #################################################
2470 #   access_remote_db():
2471 #
2472
2473 $access_remote_db = sub {
2474
2475     use Bio::DB::GenBank;
2476
2477     print "Beginning remote database access example... \n";
2478
2479     # III.2.1 Accessing remote databases (Bio::DB::GenBank, etc)
2480
2481     my ($gb, $seq1, $seq2, $seq1_id, $seqobj, $seq2_id,$seqio);
2482
2483     eval {
2484         $gb = new Bio::DB::GenBank();
2485         $seq1 = $gb->get_Seq_by_id('MUSIGHBA1');
2486     };
2487
2488 if ($@ || !$seq1) {
2489     warn "Warning: Couldn't connect to Genbank with Bio::DB::GenBank.pm!\nProbably no network access.\n Skipping Test\n";
2490     return 0;
2491 }
2492     $seq1_id =  $seq1->display_id();
2493     print "seq1 display id is $seq1_id \n";
2494     $seq2 = $gb->get_Seq_by_acc('AF303112');
2495     $seq2_id =  $seq2->display_id();
2496     print "seq2 display id is $seq2_id \n";
2497     $seqio = $gb->get_Stream_by_batch([ qw(2981014 J00522 AF303112)]);
2498     $seqobj = $seqio->next_seq();
2499     print "Display id of first sequence in stream is ", $seqobj->display_id()  ,"\n";
2500     return 1;
2501 } ;
2502
2503
2504 #################################################
2505 #  index_local_db ():
2506 #
2507
2508 $index_local_db = sub {
2509
2510     use Bio::Index::Fasta; # using fasta file format
2511   use strict; # some users have reported that this is required
2512
2513     my ( $Index_File_Name, $inx1, $inx2, $id, $dir, $key,
2514          $keyfound, $seq, $indexhash);
2515     print "\nBeginning indexing local_db example... \n";
2516     print "This subroutine unlikely to run unless OS = unix\n";
2517
2518
2519     # III.2.2 Indexing and accessing local databases
2520     # (Bio::Index::*,  bpindex.pl,  bpfetch.pl)
2521
2522     # first create the index
2523     $Index_File_Name = 'bptutorial.indx';
2524
2525     eval { use Cwd; $dir = cwd; };
2526     # CWD not installed, revert to unix behavior, best we can do
2527     if( $@) { $dir = `pwd`;}
2528
2529
2530     $inx1 = Bio::Index::Fasta->new
2531         ('-FILENAME' => "$dir/$Index_File_Name",
2532          '-write_flag' => 1);
2533
2534     $inx1->make_index(Bio::Root::IO->catfile("$dir","t","data","multifa.seq") );
2535     $inx1->make_index(Bio::Root::IO->catfile("$dir","t","data","seqs.fas") );
2536
2537
2538 #    $inx1->make_index("$dir/t/data/multifa.seq");
2539 #    $inx1->make_index("$dir/t/data/seqs.fas");
2540     print "Finished indexing local_db example... \n";
2541
2542     return 1;
2543 } ;
2544
2545
2546 #################################################
2547 #  fetch_local_db ():
2548 #
2549
2550 $fetch_local_db = sub {
2551
2552     use Bio::Index::Fasta; # using fasta file format
2553     use strict; # some users have reported that this is required
2554
2555     my ( $Index_File_Name, $inx2, $id, $dir, $key,
2556          $keyfound, $seq, $indexhash,$value );
2557     print "\nBeginning retrieving local_db example... \n";
2558
2559     # then retrieve some files
2560     #$Index_File_Name = shift;
2561     $Index_File_Name = 'bptutorial.indx';
2562
2563     eval { use Cwd; $dir = cwd; };
2564     # CWD not installed, revert to unix behavior, best we can do
2565     if( $@) { $dir = `pwd`;}
2566
2567     $inx2 = Bio::Index::Abstract->new
2568         ('-FILENAME'  => Bio::Root::IO->catfile("$dir","$Index_File_Name") );
2569     #   ('-FILENAME'  => "$dir/$Index_File_Name");
2570
2571     $indexhash = $inx2->db();
2572     $keyfound = "";
2573     while ( ($key,$value) = each %$indexhash ) {
2574         if ( $key =~ /97/ ) { $keyfound = 'true'; $id = $key; last; }
2575     }
2576     if ($keyfound) {
2577         $seq = $inx2->fetch($id);
2578         print "Sequence ", $seq->display_id(),
2579         " has length  ", $seq->length()," \n";
2580     }
2581
2582     unlink "$dir/$Index_File_Name" ;
2583
2584     return 1;
2585 } ;
2586
2587
2588
2589 #################################################
2590 # sequence_manipulations  ():
2591 #
2592
2593 $sequence_manipulations = sub {
2594
2595     my ($infile, $in, $out, $seqobj);
2596     $infile = $dna_seq_file;
2597
2598     print "\nBeginning sequence_manipulations and SeqIO example... \n";
2599
2600
2601     # III.3.1 Transforming sequence files (SeqIO)
2602
2603     $in  = Bio::SeqIO->new('-file' => $infile ,
2604                            '-format' => 'Fasta');
2605     $seqobj = $in->next_seq();
2606
2607     # perl "tied filehandle" syntax is available to SeqIO,
2608     # allowing you to use the standard <> and print operations
2609     # to read and write sequence objects, eg:
2610     #$out = Bio::SeqIO->newFh('-format' => 'EMBL');
2611
2612     $out = Bio::SeqIO->newFh('-format' => 'fasta');
2613
2614     print "First sequence in fasta format... \n";
2615     print $out $seqobj;
2616
2617     # III.4 Manipulating individual sequences
2618
2619     # The following methods return strings
2620
2621
2622     print "Seq object display id is ",
2623     $seqobj->display_id(), "\n"; # the human read-able id of the sequence
2624     print "Sequence is ",
2625     $seqobj->seq()," \n";        # string of sequence
2626     print "Sequence from 5 to 10 is ",
2627     $seqobj->subseq(5,10)," \n"; # part of the sequence as a string
2628     print "Acc num is ",
2629     $seqobj->accession_number(), " \n"; # when there, the accession number
2630     print "Moltype is ",
2631     $seqobj->alphabet(), " \n";    # one of 'dna','rna','protein'
2632     print "Primary id is ", $seqobj->primary_seq->primary_id()," \n";
2633     # a unique id for this sequence irregardless
2634     #print "Primary id is ", $seqobj->primary_id(), " \n";
2635     # a unique id for this sequence irregardless
2636     # of its display_id or accession number
2637
2638     # The following methods return an array of  Bio::SeqFeature objects
2639     $seqobj->top_SeqFeatures; # The 'top level' sequence features
2640     $seqobj->all_SeqFeatures; # All sequence features, including sub
2641     # seq features
2642
2643     # The following methods returns new sequence objects,
2644     # but do not transfer features across
2645     # truncation from 5 to 10 as new object
2646     print "Truncated Seq object sequence is ",
2647     $seqobj->trunc(5,10)->seq(), " \n";
2648     # reverse complements sequence
2649     print "Reverse complemented sequence 5 to 10  is ",
2650     $seqobj->trunc(5,10)->revcom->seq, "  \n";
2651     # translation of the sequence
2652     print "Translated sequence 6 to 15 is ",
2653     $seqobj->translate->subseq(6,15), " \n";
2654
2655
2656     my $c = shift;
2657     $c ||= 'ctgagaaaataa';
2658
2659     print "\nBeginning 3-frame and alternate codon translation example... \n";
2660
2661     my $seq = new Bio::PrimarySeq('-SEQ' => $c,
2662                                   '-ID' => 'no.One');
2663     print "$c translated using method defaults   : ",
2664     $seq->translate->seq, "\n";
2665
2666     # Bio::Seq uses same sequence methods as PrimarySeq
2667     my $seq2 = new Bio::Seq('-SEQ' => $c, '-ID' => 'no.Two');
2668     print "$c translated as a coding region (CDS): ",
2669     $seq2->translate(undef, undef, undef, undef, 1)->seq, "\n";
2670
2671     print "\nTranslating in all six frames:\n";
2672     my @frames = (0, 1, 2);
2673     foreach my $frame (@frames) {
2674         print  " frame: ", $frame, " forward: ",
2675         $seq->translate(undef, undef, $frame)->seq, "\n";
2676         print  " frame: ", $frame, " reverse-complement: ",
2677         $seq->revcom->translate(undef, undef, $frame)->seq, "\n";
2678     }
2679
2680     print "Translating with all codon tables using method defaults:\n";
2681     my @codontables = qw( 1 2 3 4 5 6 9 10 11 12 13 14 15 16 21 );
2682     foreach my $ct (@codontables) {
2683         print $ct, " : ",
2684         $seq->translate(undef, undef, undef, $ct)->seq, "\n";
2685     }
2686
2687     return 1;
2688 } ;
2689
2690 #################################################                                      ;
2691 #  seqstats_and_seqwords ():
2692 #
2693
2694 $seqstats_and_seqwords = sub {
2695
2696     use Bio::Tools::SeqStats;
2697     use Bio::Tools::SeqWords;
2698
2699     print "\nBeginning seqstats_and_seqwords example... \n";
2700
2701
2702     my ($seqobj, $weight, $monomer_ref, $codon_ref,
2703         $seq_stats, $words, $hash);
2704     $seqobj = Bio::Seq->new
2705         ('-seq'=>'ACTGTGGCGTCAACTGACTGTGGCGTCAACTGACTGTGGGCGTCAACTGACTGTGGCGTCAACTG',
2706          '-alphabet'=>'dna',
2707          '-id'=>'test');
2708
2709
2710     # III.4.1 Obtaining basic sequence statistics- MW,
2711     # residue &codon frequencies(SeqStats, SeqWord)
2712
2713
2714     $seq_stats  =  Bio::Tools::SeqStats->new($seqobj);
2715     $weight = $seq_stats->get_mol_wt();
2716     $monomer_ref = $seq_stats->count_monomers();
2717     $codon_ref = $seq_stats->count_codons();  # for nucleic acid sequence
2718     print "Sequence \'test\' is ", $seqobj->seq(), " \n";
2719     print "Sequence ", $seqobj->id()," molecular weight is $$weight[0] \n";
2720     # Note, $weight is an array reference
2721     print "Number of A\'s in sequence is $$monomer_ref{'A'} \n";
2722     print "Number of ACT codon\'s in sequence is $$codon_ref{'ACT'} \n";
2723
2724     $words = Bio::Tools::SeqWords->new($seqobj);
2725     $hash = $words->count_words(6);
2726     print "Number of 6-letter \'words\' of type ACTGTG in",
2727     " sequence is $$hash{'ACTGTG'} \n";
2728
2729     return 1;
2730     } ;
2731
2732
2733 #################################################
2734 # restriction_and_sigcleave  ():
2735 #
2736
2737 $restriction_and_sigcleave = sub {
2738
2739     use Bio::Tools::RestrictionEnzyme;
2740     use Bio::Tools::Sigcleave;
2741
2742     my ($re, $re1, $re2, @sixcutters, @fragments1,
2743         @fragments2, $seqobj, $dna);
2744     print "\nBeginning restriction enzyme example... \n";
2745
2746     # III.4.4 Identifying restriction enzyme sites (RestrictionEnzyme)
2747
2748     $dna = 'CCTCCGGGGACTGCCGTGCCGGGCGGGAATTCGCCATGGCGACCCTGGAAAAGCTGATATCGAAGGCCTTCGA';
2749
2750     # Build sequence and restriction enzyme objects.
2751     $seqobj = new Bio::Seq('-ID'  => 'test_seq',
2752                            '-SEQ' =>$dna);
2753
2754     #$re  = new Bio::Tools::RestrictionEnzyme();
2755     $re  = new Bio::Tools::RestrictionEnzyme('-name'=>'EcoRI');
2756     @sixcutters = $re->available_list(6);
2757
2758     print "The following 6-cutters are available\n";
2759     print join(" ",@sixcutters),"\n";
2760
2761     $re1  = new Bio::Tools::RestrictionEnzyme('-name'=>'EcoRI');
2762     @fragments1 =  $re1->cut_seq($seqobj);
2763     #$seqobj is the Seq object for the dna to be cut
2764
2765     print "\nThe sequence of " . $seqobj->display_id . " is " .
2766     $seqobj->seq . "\n";
2767     print "When cutting " . $seqobj->display_id() . " with " .
2768     $re1->seq->id . " the initial fragment is\n" . $fragments1[0];
2769
2770     $re2 = new Bio::Tools::RestrictionEnzyme
2771         ('-NAME' =>'EcoRV--GAT^ATC',
2772          '-MAKE' =>'custom');
2773     @fragments2 =  $re2->cut_seq($seqobj);
2774
2775     print "\nWhen cutting ", $seqobj->display_id(),
2776     " with ", $re2->seq->id;
2777     print " the second fragment is\n", $fragments2[1], " \n";
2778
2779     # III.4.7 Identifying amino acid cleavage sites (Sigcleave)
2780
2781     print "\nBeginning  signal cleaving site location example... \n";
2782
2783     my ( $sigcleave_object, %raw_results , $location,
2784          $formatted_output, $protein, $in, $seqobj2);
2785
2786     $protein =
2787 "MKVILLFVLAVFTVFVSSRGIPPEEQSQFLEFQDKFNKKYSHEEYLERFEIFKSNLGKIEELNLIAINHKADTKFGVNKFADLSSDEFKNYYLNNKEAIFTDDLPVADYLDDEFINSIPTAFDWRTRGAVTPVKNQGQCGSCWSFSTTGNVEGQHFISQNKLVSLSEQNLVDCDHECMEYEGEEACDEGCNGGLQPNAYNYIIKNGGIQTESSYPYTAETGTQCNFNSANIGAKISNFTMIPKNETVMAGYIVSTGPLAIAADAVEWQFYIGGVFDIPCNPNSLDHGILIVGYSAKNTIFRKNMPYWIVKNSWGADWGEQGYIYLRRGKNTCGVSNFVSTSII";
2788     $formatted_output = "";
2789
2790     # Build object
2791     # Note that Sigcleave is passed a raw sequence
2792     # rather than a sequence object when it is created.
2793     $sigcleave_object = new Bio::Tools::Sigcleave
2794         ('-id' => 'test_sigcleave_seq',
2795          '-type' => 'amino',
2796          '-threshold' => 3.0,
2797          '-seq' => $protein);
2798
2799     %raw_results      = $sigcleave_object->signals;
2800     $formatted_output = $sigcleave_object->pretty_print;
2801
2802     print " SigCleave \'raw results\': \n";
2803     foreach $location (sort keys %raw_results) {
2804         print " Cleave site found at location $location ",
2805         "with value of $raw_results{$location} \n";
2806     }
2807     print "\nSigCleave formatted output: \n $formatted_output \n";
2808     return 1;
2809 } ;
2810
2811
2812 #################################################
2813 #  run_standaloneblast ():
2814 #
2815
2816 $run_standaloneblast = sub {
2817     eval {Bio::Tools::Run::StandAloneBlast; }
2818
2819
2820     print "\nBeginning run_standaloneblast example... \n";
2821
2822     my (@params, $factory, $input, $blast_report, $blast_present,
2823         $database, $sbjct, $str, $seq1);
2824
2825     $database = $_[0] || 'ecoli.nt'; # user can select local nt database
2826
2827     $blast_present = Bio::Tools::Run::StandAloneBlast->exists_blast();
2828     unless ($blast_present) {
2829         warn "blast program not found. Skipping StandAloneBlast example\n";
2830         return 0;
2831     }
2832     #@params = ('program' => 'blastn', 'database' => 'ecoli.nt');
2833     @params = ('program' => 'blastn', 'database' => $database);
2834     $factory = Bio::Tools::Run::StandAloneBlast->new(@params);
2835
2836
2837     $str = Bio::SeqIO->new('-file'=> Bio::Root::IO->catfile("t","data","dna2.fa") ,
2838 #    $str = Bio::SeqIO->new('-file'=>'t/data/dna2.fa' ,
2839                            '-format' => 'Fasta', );
2840     $seq1 = $str->next_seq();
2841
2842     $blast_report = $factory->blastall($seq1);
2843     $sbjct = $blast_report->nextSbjct;
2844
2845     print " Hit name is ", $sbjct->name, " \n";
2846
2847     return 1;
2848 } ;
2849
2850 #################################################
2851 #  run_remoteblast ():
2852 #
2853
2854 $run_remoteblast = sub {
2855     print "\nBeginning run_remoteblast example... \n";
2856     eval { require Bio::Tools::Run::RemoteBlast; };
2857
2858     if ( $@ ) {
2859       print STDERR "Cannot load Bio::Tools::Run::RemoteBlast\n";
2860       print STDERR "Cannot run run_remoteblast example:\n$@\n";
2861     } else {
2862       my (@params, $remote_blast_object, $blast_file, $r, $rc,
2863          $database);
2864
2865       $database =  'ecoli';
2866       @params = ('-prog'   => 'blastp',
2867                  '-data'   => $database,
2868                  '-expect' => '1e-10',
2869                  '-readmethod' => 'BPlite' );
2870
2871       $remote_blast_object = Bio::Tools::Run::RemoteBlast->new(@params);
2872       $blast_file = Bio::Root::IO->catfile("t","data","ecolitst.fa");
2873       $r = $remote_blast_object->submit_blast( $blast_file);
2874       print "submitted Blast job\n";
2875       while ( my @rids = $remote_blast_object->each_rid ) {
2876         foreach my $rid ( @rids ) {
2877           $rc = $remote_blast_object->retrieve_blast($rid);
2878           print "retrieving results...\n";
2879           if( !ref($rc) ) {   # $rc not a reference => either error
2880                               # or job not yet finished
2881             if( $rc < 0 ) {
2882               $remote_blast_object->remove_rid($rid);
2883               print "Error return code for BlastID code $rid ... \n";
2884             }
2885             sleep 5;
2886           } else {
2887             $remote_blast_object->remove_rid($rid);
2888             while ( my $sbjct = $rc->nextSbjct ) {
2889               print "sbjct name is ", $sbjct->name, "\n";
2890               while ( my $hsp = $sbjct->nextHSP ) {
2891                 print "score is ", $hsp->score, "\n";
2892               }
2893             }
2894           }
2895         }
2896       }
2897     }
2898   return 1;
2899 } ;
2900
2901
2902 #################################################
2903 #  searchio_parsing ():
2904 #
2905
2906
2907
2908 $searchio_parsing = sub {
2909
2910 my ($searchio, $result,$hit,$hsp);
2911
2912
2913 use lib '.';
2914 use Bio::SearchIO;
2915 use Bio::Root::IO;
2916
2917 print "\nBeginning searchio-parser example... \n";
2918
2919 $searchio = new Bio::SearchIO ('-format' => 'blast',
2920   '-file' => Bio::Root::IO->catfile('t','data','ecolitst.bls'));
2921
2922 $result = $searchio->next_result;
2923
2924 print "Database name is ", $result->database_name , "\n";
2925 print "Algorithm is ", $result->algorithm , "\n";
2926 print "Query length used is ", $result->query_length , "\n";
2927 print "Kappa value is ", $result->get_statistic('kappa') , "\n";
2928 print "Name of matrix used is ", $result->get_parameter('matrix') , "\n";
2929
2930
2931 $hit = $result->next_hit;
2932 print "First hit name is ", $hit->name , "\n";
2933 print "First hit length is ", $hit->length , "\n";
2934 print "First hit accession number is ", $hit->accession , "\n";
2935
2936 $hsp = $hit->next_hsp;
2937 print "First hsp query start is ", $hsp->query->start , "\n";
2938 print "First hsp query end is ", $hsp->query->end , "\n";
2939 print "First hsp hit start is ", $hsp->hit->start , "\n";
2940 print "First hsp hit end is ", $hsp->hit->end , "\n";
2941
2942     return 1;
2943 } ;
2944
2945
2946 #################################################
2947 #  bplite_parsing ():
2948 #
2949
2950 $bplite_parsing = sub {
2951
2952     use Bio::Tools::BPlite;
2953     my ($file1, $file2, $file3, $report,$report2 , $report3 ,
2954         $last_iteration, $sbjct,  $total_iterations, $hsp,
2955         $matches, $loop);
2956
2957     print "\nBeginning bplite, bppsilite, bpbl2seq parsing example... \n";
2958
2959     ($file1, $file2, $file3) = ($bp_parse_file1,
2960                                 $bp_parse_file2 ,$bp_parse_file3 );
2961     #open FH, "t/data/blast.report";
2962     $report = Bio::Tools::BPlite->new('-file'=>$file1);
2963     $sbjct = $report->nextSbjct;
2964
2965     print " Hit name is ", $sbjct->name, " \n";
2966     while ($hsp = $sbjct->nextHSP) {
2967         $hsp->score;
2968     }
2969
2970     use Bio::Tools::BPpsilite;
2971     $report2 = Bio::Tools::BPpsilite->new('-file'=>$file2);
2972     $total_iterations = $report2->number_of_iterations;
2973     $last_iteration = $report2->round($total_iterations);
2974
2975     # print first 10 psiblast hits
2976     for ($loop = 1; $loop <= 10; $loop++) {
2977         $sbjct =  $last_iteration->nextSbjct;
2978         $sbjct && print " PSIBLAST Hit is ", $sbjct->name, " \n";
2979     }
2980
2981     #
2982     use Bio::Tools::BPbl2seq;
2983     $report3 = Bio::Tools::BPbl2seq->new('-file'  => $file3,
2984                                          '-queryname' => "ALEU_HORVU");
2985     $hsp = $report3->next_feature;
2986     $matches = $hsp->match;
2987     print " Number of Blast2seq matches for first hsp equals $matches \n";
2988     #
2989
2990     return 1;
2991 } ;
2992
2993
2994 #################################################
2995 #  hmmer_parsing ():
2996 #
2997
2998 $hmmer_parsing = sub {
2999
3000     print "\nBeginning hmmer_parsing example... \n";
3001
3002     # III.5.4 Parsing HMM reports (HMMER::Results)
3003
3004     use Bio::Tools::HMMER::Domain;
3005     use Bio::Tools::HMMER::Set;
3006     use Bio::Tools::HMMER::Results;
3007
3008     my ($res, @seq, @domain);
3009
3010     $res = new Bio::Tools::HMMER::Results('-file' =>  Bio::Root::IO->catfile("t","data","hmmsearch.out"),
3011                                           '-type' => 'hmmsearch');
3012
3013     @seq = $res->each_Set;
3014     print "Initial sequence bit score is ",$seq[0]->bits,"\n";
3015     @domain = $seq[0]->each_Domain;
3016     print " For initial domain, start = ",$domain[0]->start,
3017     " end = ",$domain[0]->end,"and score =",$domain[0]->bits,"\n";
3018
3019     #foreach $seq ( $res->each_Set ) {
3020     #    print "Sequence bit score is",$seq->bits,"\n";
3021     #    foreach $domain ( $seq->each_Domain ) {
3022     #        print " Domain start ",$domain->start," end ",
3023     #        $domain->end," score ",$domain->bits,"\n";
3024     #    }
3025     #}
3026     return 1;
3027 } ;
3028
3029 #################################################
3030 # run_clustalw_tcoffee  ():
3031 #
3032
3033 $run_clustalw_tcoffee = sub {
3034
3035     use Bio::Tools::Run::Alignment::TCoffee;
3036     use Bio::Tools::Run::Alignment::Clustalw;
3037     my (@params, $factory, $ktuple, $aln, $seq_array_ref, $strout);
3038     my (@seq_array, $seq, $str, $infile);
3039
3040     $infile = $unaligned_amino_file;
3041
3042     # III.6.3 Aligning multiple sequences (Clustalw.pm, TCoffee.pm)
3043     print "\nBeginning run_clustalw example... \n";
3044
3045     # put unaligned sequences in a Bio::Seq array
3046     $str = Bio::SeqIO->new('-file'=> $infile,
3047                            '-format' => 'Fasta');
3048     @seq_array =();
3049     while ($seq = $str->next_seq() ) { push (@seq_array, $seq) ;}
3050     $seq_array_ref = \@seq_array;
3051     # where @seq_array is an array of Bio::Seq objects
3052
3053     my $clustal_present =
3054       Bio::Tools::Run::Alignment::Clustalw->exists_clustal();
3055     if ($clustal_present) {
3056         @params = ('ktuple' => 2, 'matrix' => 'BLOSUM', 'quiet' => 1);
3057         $factory = Bio::Tools::Run::Alignment::Clustalw->new(@params);
3058         $ktuple = 3;
3059         $factory->ktuple($ktuple);  # change the parameter before executing
3060         $aln = $factory->align($seq_array_ref);
3061         $strout = Bio::AlignIO->newFh('-format' => 'msf');
3062         print "Output of clustalw alignment... \n";
3063         print $strout $aln;
3064     } else {
3065         warn "Clustalw program not found. Skipping run_clustalw example....\n";
3066     }
3067     print "\nBeginning run_tcoffee example... \n";
3068
3069     my $tcoffee_present =
3070       Bio::Tools::Run::Alignment::TCoffee->exists_tcoffee();
3071     if ($tcoffee_present) {
3072         @params = ('ktuple' => 2, 'matrix' => 'BLOSUM', 'quiet' => 1);
3073         $factory = Bio::Tools::Run::Alignment::TCoffee->new(@params);
3074         $ktuple = 3;
3075         $factory->ktuple($ktuple);  # change the parameter before executing
3076         $aln = $factory->align($seq_array_ref);
3077         $strout = Bio::AlignIO->newFh('-format' => 'msf');
3078         print "Output of TCoffee alignment... \n";
3079         print $strout $aln;
3080     } else {
3081         warn "TCoffee program not found. Skipping run_TCoffee example....\n";
3082     }
3083     return 1;
3084 } ;
3085
3086 #################################################
3087 # simplealign  ():
3088 #
3089
3090 $simplealign = sub {
3091
3092     my ($in, $out1,$out2, $aln, $threshold_percent, $infile);
3093     my ( $str, $resSlice1, $resSlice2, $resSlice3, $tmpfile);
3094
3095     print "\nBeginning simplealign and alignio example... \n";
3096
3097     # III.3.2 Transforming alignment files (AlignIO)
3098     $infile = $aligned_amino_file;
3099     #$tmpfile = "test.tmp";
3100     $in  = Bio::AlignIO->new('-file' => $infile ,
3101                              '-format' => 'pfam');
3102     $out1 = Bio::AlignIO->newFh('-format' => 'msf');
3103
3104     # while ( my $aln = $in->next_aln() ) { $out->write_aln($aln);  }
3105     $aln = $in->next_aln() ;
3106     print "Alignment to display in msf format... \n";
3107     print $out1 $aln;
3108
3109     $threshold_percent = 60;
3110     $str = $aln->consensus_string($threshold_percent) ;
3111     print "Consensus string with threshold = $threshold_percent is $str\n";
3112
3113     print "The average percentage identity of the alignment is ", $aln->percentage_identity. "\n";
3114
3115     my $seqname = '1433_LYCES';
3116     my $residue_number = 14;
3117     my $pos = $aln->column_from_residue_number($seqname, $residue_number);
3118     print "Residue number $residue_number of the sequence $seqname in the alignment occurs at column $pos. \n";
3119
3120     my $s1 = new Bio::LocatableSeq (-id => 'AAA',  -seq => 'aawtat-tn-', -start => 1, -end => 8,
3121                             -alphabet => 'dna' );
3122     my $s2 = new Bio::LocatableSeq (-id => 'BBB',  -seq => '-aaaat-tt-', -start => 1,
3123                             -end => 7,  -alphabet => 'dna');
3124     my $a = new Bio::SimpleAlign;
3125     $a->add_seq($s1);
3126     $a->add_seq($s2);
3127
3128     print "The first sequence in the dna alignment is... \t", $s1->seq , "\n";
3129     print "The second sequence in the dna alignment is... \t", $s2->seq , "\n";
3130     my $iupac_consensus =  $a->consensus_iupac;
3131     print "The IUPAC consensus of the dna alignment is... \t",$iupac_consensus  , "\n";
3132
3133     return 1;
3134 } ;
3135
3136 #################################################
3137 # run_psw_bl2seq  ():
3138 #
3139
3140 $run_psw_bl2seq = sub {
3141
3142     print "\nBeginning run_psw_bl2seq example... \n";
3143     eval { require Bio::Tools::pSW; };
3144     if( $@ ) {
3145         print STDERR ("\nThe C-compiled engine for Smith Waterman alignments (Bio::Ext::Align) has not been installed.\n Please read the install the bioperl-ext package\n\n");
3146         return 0;
3147     }
3148
3149     #use Bio::Tools::pSW;
3150     use Bio::Tools::Run::StandAloneBlast;
3151
3152     # III.6.1 Aligning 2 sequences with Smith-Waterman (pSW)
3153
3154     my ($factory, $aln, $out1, $out2, $str, $seq1, $seq2, @params);
3155
3156     # Get protein sequences from file
3157
3158     $str = Bio::SeqIO->new('-file'=>Bio::Root::IO->catfile("t","data","amino.fa") ,
3159                            '-format' => 'Fasta', );
3160     $seq1 = $str->next_seq();
3161     $seq2 = $str->next_seq();
3162
3163     $factory = new Bio::Tools::pSW( '-matrix' => 'scripts/tools/blosum62.bla',
3164                                     '-gap' => 12,
3165                                     '-ext' => 2, );
3166
3167     $factory->align_and_show($seq1,$seq2,'STDOUT');
3168     $aln = $factory->pairwise_alignment($seq1,$seq2);
3169
3170     $out1 = Bio::AlignIO->newFh('-format' => 'fasta');
3171     print "The Smith-Waterman alignment in fasta format... \n";
3172     print $out1 $aln;
3173
3174
3175     # III.6.2 Aligning 2 sequences with Blast using  bl2seq and AlignIO
3176
3177     # Bl2seq testing
3178     # first create factory for bl2seq
3179     @params = ('program' => 'blastp', 'outfile' => 'bl2seq.out');
3180     $factory = Bio::Tools::Run::StandAloneBlast->new(@params);
3181     $factory->bl2seq($seq1, $seq2);
3182
3183     # Use AlignIO.pm to create a SimpleAlign object from the bl2seq report
3184     $str = Bio::AlignIO->new('-file'=> 'bl2seq.out',
3185                              '-format' => 'bl2seq');
3186     $aln = $str->next_aln();
3187
3188     $out2 = Bio::AlignIO->newFh('-format' => 'fasta');
3189     print "The Blast 2 sequence alignment in fasta format... \n";
3190     print $out2 $aln;
3191
3192     return 1;
3193 } ;
3194
3195
3196 #################################################
3197 #  gene_prediction_parsing ():
3198 #
3199
3200 $gene_prediction_parsing = sub {
3201
3202     use Bio::Tools::Genscan;
3203
3204     print "\nBeginning genscan_result_parsing example... \n";
3205
3206     my ($genscan, $gene, @exon_arr, $first_exon);
3207
3208     # III.7 Searching for genes and other structures
3209     # on genomic DNA (Genscan, ESTScan, MZEF)
3210
3211     use Bio::Tools::Genscan;
3212
3213     $genscan = Bio::Tools::Genscan->new('-file' => Bio::Root::IO->catfile("t","data","genomic-seq.genscan"));
3214     # $gene is an instance of Bio::Tools::Prediction::Gene
3215     # $gene->exons() returns an array of Bio::Tools::Prediction::Exon objects
3216     while($gene = $genscan->next_prediction()) {
3217         print "Protein corresponding to current gene prediction is ",
3218         $gene->predicted_protein->seq()," \n";
3219         @exon_arr = $gene->exons();
3220         $first_exon = $exon_arr[0];
3221         print "Start location of first exon of current predicted gene is ",
3222         $first_exon->start," \n";
3223     }
3224     $genscan->close();
3225
3226     use Bio::Tools::Sim4::Results;
3227     print "\nBeginning Sim4_result_parsing example... \n";
3228     my ($sim4,$exonset, @exons, $exon, $homol);
3229
3230    $sim4 = new Bio::Tools::Sim4::Results(-file=>  Bio::Root::IO->catfile("t","data","sim4.rev"), -estisfirst=>0);
3231    $exonset = $sim4->next_exonset;
3232    @exons = $exonset->sub_SeqFeature();
3233    $exon = $exons[1];
3234     print "Start location of first exon of first exon set is ",$exon->start(), " \n";
3235    print "Name of est sequence is ",$exon->est_hit()->seqname(), " \n";
3236    print "The length of the est hit is ",$exon->est_hit()->seqlength(), " \n";
3237    $sim4->close();
3238
3239     return 1;
3240 } ;
3241
3242
3243 #################################################
3244 #  sequence_annotation ():
3245 #
3246
3247 $sequence_annotation = sub {
3248
3249     print "\nBeginning sequence_annotation example... \n";
3250     my ($feat, $feature1, $feature2,$seqobj, @topfeatures, @allfeatures,
3251         $ann, $in, $infile, $other, $answer);
3252     # III.8.1 Representing sequence annotations (Annotation, SeqFeature)
3253
3254     $infile = $amino_seq_file;
3255
3256     $in  = Bio::SeqIO->new('-file' => $infile ,
3257                            '-format' => 'Fasta');
3258     $seqobj = $in->next_seq();
3259
3260     $feature1 = new Bio::SeqFeature::Generic
3261         ('-start' => 10,
3262          '-end' => 40,
3263          '-strand' => 1,
3264          '-primary' => 'exon',
3265          '-source'  => 'internal', );
3266
3267     $feature2 = new Bio::SeqFeature::Generic
3268         ( '-start' => 20,
3269           '-end' => 30,
3270           '-strand' => 1,
3271           '-primary' => 'exon',
3272           '-source'  => 'internal', );
3273
3274     $seqobj->add_SeqFeature($feature1);   # Add the SeqFeature to the parent
3275     $seqobj->add_SeqFeature($feature2);   # Add the SeqFeature to the parent
3276     $seqobj->annotation(new Bio::Annotation('-description' => 'This is the annotation'));
3277
3278     # Once the features and annotations have been associated
3279     # with the Seq, they can be with retrieved, eg:
3280     @topfeatures = $seqobj->top_SeqFeatures(); # just top level, or
3281     @allfeatures = $seqobj->all_SeqFeatures(); # descend into sub features
3282     $ann = $seqobj->annotation(); # annotation object
3283     $feat = $allfeatures[0];
3284     $other = $allfeatures[1];
3285
3286     print " Total number of sequence features is: ", scalar @allfeatures, " \n";
3287     print " The sequence annotation is: ", $ann->description," \n";
3288
3289     # The individual components of a SeqFeature can also be set
3290     # or retrieved with methods including:
3291     # attributes which return numbers
3292     $feat->start;          # start position
3293     $feat->end;            # end position
3294     $feat->strand;         # 1 means forward, -1 reverse, 0 not relevant
3295     # attributes which return strings
3296     $feat->primary_tag;    # the main 'name' of the sequence feature,  eg, 'exon'
3297     $feat->source_tag;     # where the feature comes from, eg'BLAST'
3298     # attributes which return Bio::PrimarySeq objects
3299     $feat->seq;            # the sequence between start,end
3300     $feat->entire_seq;     # the entire sequence
3301
3302     print " The primary tag of the feature is: ", $feat->primary_tag, "   \n";
3303     print " The first feature begins at location ", $feat->start, " \n";
3304     print "  ends at location ", $feat->end, " \n";
3305     print "  and is located on strand ", $feat->strand, " \n";
3306
3307     # other useful methods include
3308     $answer = $feat->overlaps($other);  # do SeqFeature $feat and SeqFeature $other overlap?
3309     print " Feature 1 does ", ($answer)? "":"not ", " overlap Feature2. \n";
3310
3311     $answer = $feat->contains($other); # is $other completely within $feat?
3312     print " Feature 1 does ", ($answer)? "":"not ", " contain Feature2. \n";
3313
3314     $answer = $feat->equals($other);   # do $feat and $other completely $agree?
3315     print " Feature 1 does ", ($answer)? "":"not ", " equal Feature2. \n";
3316
3317     $feat->sub_SeqFeature();  # create/access an array of subsequence features
3318
3319     return 1;
3320 } ;
3321
3322
3323 #################################################
3324 #  largeseqs ():
3325 #
3326
3327 $largeseqs = sub {
3328
3329     print "\nBeginning largeseqs example... \n";
3330
3331     # III.8.2 Representing and large sequences
3332     my ( $tmpfile, $seqio, $pseq, $plength, $last_4);
3333
3334
3335     $tmpfile = Bio::Root::IO->catfile("t","data","largefastatest.out") ;
3336     $seqio = new Bio::SeqIO('-format'=>'largefasta',
3337                             '-file'  =>Bio::Root::IO->catfile("t","data","genomic-seq.fasta"));
3338     $pseq = $seqio->next_seq();
3339     $plength = $pseq->length();
3340
3341     print " The length of the sequence ",
3342     $pseq->display_id()," is $plength  \n";
3343     $last_4 = $pseq->subseq($plength-3,$plength);
3344     print " The final four residues are $last_4 \n";
3345
3346     #END { unlink $tmpfile; }
3347     unlink $tmpfile;
3348
3349     return 1;
3350 } ;
3351
3352
3353 #################################################
3354 # liveseqs  ():
3355 #
3356
3357 $liveseqs = sub {
3358
3359     use Bio::LiveSeq::IO::BioPerl;
3360     my ($loader, $gene, $id, $maxstart);
3361
3362     print "\nBeginning liveseqs example... \n";
3363
3364     # III.8.2 Representing changing sequences (LiveSeq)
3365
3366     $loader=Bio::LiveSeq::IO::BioPerl->load('-db'=>"EMBL",
3367                                             '-file'=>Bio::Root::IO->catfile("t","data","factor7.embl"));
3368     $gene=$loader->gene2liveseq('-gene_name' => "factor7");
3369     $id = $gene->get_DNA->display_id ;
3370     print " The id of the gene is $id , \n";
3371     $maxstart = $gene->maxtranscript->start;
3372     print " The start position of the max transcript is $maxstart , \n";
3373
3374     return 1;
3375 } ;
3376
3377
3378 #################################################
3379 # run_struct  ():
3380 #
3381
3382 $run_struct = sub {
3383   use Bio::Root::IO;
3384   eval { require Bio::Structure::Entry;
3385          require Bio::Structure::IO;
3386        };
3387   if ( $@ ){
3388     print STDERR "Cannot load Bio::Structure modules\n";
3389     print STDERR "Cannot run run_struct:\n$@\n";
3390   } else {
3391     print "\nBeginning Structure object example... \n";
3392
3393     # testing PDB format
3394     my $pdb_file = Bio::Root::IO->catfile("t","data","pdb1bpt.ent");
3395     my $structio = Bio::Structure::IO->new(-file  => $pdb_file,
3396                                            -format=> 'PDB');
3397     my $struc = $structio->next_structure;
3398     my ($chain) = $struc->chain;
3399     print " The current chain is ",  $chain->id ," \n";
3400     my $pseq = $struc->seqres;
3401     print " The first 20 residues of the sequence corresponding " .
3402     "to this structure are " . $pseq->subseq(1,20) . " \n";
3403     return 1;
3404   }
3405 } ;
3406
3407
3408
3409
3410
3411 #################################################
3412 # run_map  ():
3413 #
3414
3415 $run_map = sub {
3416
3417 use Bio::MapIO;
3418 use Bio::Root::IO;
3419
3420  print "\nBeginning MapIO example... \n";
3421
3422
3423 my $mapio = new Bio::MapIO(
3424                             '-format' => 'mapmaker',
3425                             '-file'   => Bio::Root::IO->catfile('t','data',
3426                             'mapmaker.out'));
3427
3428 my $map = $mapio->next_map;
3429
3430 print " The type of the map is " , $map->type  ," \n";
3431 print " The units of the map are " , $map->units  ," \n";
3432
3433 my $count = 0;
3434 foreach my $marker ( $map->each_element ) {
3435     print " The marker ", $marker->name," is at ordered position " ,  $marker->position->order  ," \n";
3436     if ($count++ >= 5) {return 1};
3437 }
3438
3439
3440     return 1;
3441 } ;
3442
3443 #################################################
3444 # run_tree  ():
3445 #
3446
3447 $run_tree = sub {
3448
3449 use Bio::TreeIO;
3450 use Bio::Root::IO;
3451
3452 print "\nBeginning phylogenetic tree example... \n";
3453
3454
3455 my $treeio = new Bio::TreeIO( -format => 'newick',
3456                             -file   => Bio::Root::IO->catfile('t','data',
3457                                                                'cysprot1b.newick'));
3458
3459 my $tree = $treeio->next_tree;
3460 my @nodes = $tree->get_nodes;
3461
3462  foreach my $node ( $tree->get_root_node()->each_Descendent() ) {
3463         print "node id and branch length: ", $node->to_string(), "\n";
3464         my @ch = $node->each_Descendent();
3465         if( @ch ) {
3466             print "\tchildren are: \n";
3467             foreach my $node ( $node->each_Descendent() ) {
3468                 print "\t\t id and length: ", $node->to_string(), "\n";
3469             }
3470         }
3471     }
3472     return 1;
3473 } ;
3474
3475
3476
3477 #################################################
3478 # run_perl  ():
3479 #
3480
3481 $run_perl = sub {
3482
3483 use Bio::Perl qw( read_sequence
3484                   read_all_sequences
3485                   write_sequence
3486                   new_sequence
3487                   get_sequence );
3488
3489 print "\nBeginning example of sequence manipulation without explicit Seq objects... \n";
3490
3491
3492   # getting a sequence from a database (assummes internet connection)
3493
3494    my $seq_object = get_sequence('swissprot',"ROA1_HUMAN");
3495
3496    # sequences are Bio::Seq objects, so the following methods work
3497    # (for more info see Bio::Seq documentation - try perldoc Bio::Seq)
3498
3499    print "Name of sequence retrieved from swissprot is ",$seq_object->display_id,"\n";
3500    print "Sequence acc  is ",$seq_object->accession_number,"\n";
3501    print "First 5 residues are ",$seq_object->subseq(1,5),"\n";
3502
3503   # getting sequence data from disk
3504
3505    $seq_object = read_sequence($amino_seq_file,'fasta');
3506    print "Name of sequence retrieved from disk is ",$seq_object->display_id,"\n";
3507
3508    return 1;
3509 } ;
3510
3511
3512 #################################################
3513 # demo_variations  ():
3514 #
3515
3516 $demo_variations = sub {
3517
3518     print "\nBeginning demo_variations example... \n";
3519
3520     # III.8.3 Representing related sequences
3521     # - mutations, polymorphisms etc (Allele, SeqDiff,)
3522
3523     use Bio::Variation::SeqDiff;
3524     use Bio::Variation::Allele;
3525     use Bio::Variation::DNAMutation;
3526     my ($a1, $a2, $dnamut, $seqDiff, @current_variants);
3527
3528     $a1 = Bio::Variation::Allele->new;
3529     $a1->seq('aaaa');
3530     $a2 = Bio::Variation::Allele->new;
3531     $a2->seq('ttat');
3532
3533     $dnamut = Bio::Variation::DNAMutation->new
3534         ('-start'  => 1000,
3535          '-length'  => 4,
3536          '-upStreamSeq'  => 'actggg',
3537          '-proof'  => 'experimental',
3538          '-isMutation' => 1, );
3539     $dnamut->allele_ori($a1);
3540     $dnamut->add_Allele($a2);
3541     $seqDiff = Bio::Variation::SeqDiff->new
3542         ('-id' => 'M20132',
3543          '-alphabet' => 'rna',
3544          #$seqDiff = Bio::Variation::SeqDiff->new
3545          #( '-id' => 'M20132', '-alphabet' => 'RNA',
3546          '-gene_symbol' => 'AR',
3547          '-chromosome' => 'X',
3548          '-numbering' => 'coding');
3549     # add it to a SeqDiff container object
3550     $seqDiff->add_Variant($dnamut);
3551     @current_variants = $seqDiff->each_Variant();
3552     print " The original sequence at current variation is ",
3553     $current_variants[0]->allele_ori->seq ," \n";
3554     print " The mutated sequence is ",
3555     $current_variants[0]->allele_mut->seq ," \n";
3556     print " The upstream sequence is ",
3557     $current_variants[0]->upStreamSeq ," \n";
3558
3559     return 1;
3560 } ;
3561
3562
3563 #################################################
3564 #  demo_xml ():
3565 #
3566
3567 $demo_xml = sub {
3568
3569     print "\nBeginning demo_xml example... \n";
3570     my ($str, $seqobj, $id, @feats, $first_primary_tag);
3571
3572     # III.8.4 Sequence XML representations
3573     # - generation and parsing (SeqIO::game)
3574
3575     eval { $str = Bio::SeqIO->new('-file'=>
3576            Bio::Root::IO->catfile("t","data","test.game"),
3577                            '-format' => 'game'); };
3578     if ( $@ ){
3579       print STDERR "Cannot run demo_xml\n";
3580       print STDERR "Problem parsing GAME format:\n$@\n";
3581     } else {
3582       $seqobj = $str->next_seq();
3583       # $seq = $str->next_primary_seq();
3584       # $id = $seq->id;
3585
3586       print "Seq object display id is ", $seqobj->display_id(),
3587       "\n"; # the human read-able id of the sequence
3588       print "The sequence description is: \n";
3589       print "   ", $seqobj->desc(), " \n";
3590       print "Acc num is ", $seqobj->accession_number(),
3591       " \n"; # when there, the accession number
3592       print "Moltype is ", $seqobj->alphabet(),
3593       " \n";    # one of 'dna','rna','protein'
3594
3595       @feats = $seqobj->all_SeqFeatures();
3596       $first_primary_tag = $feats[0]->primary_tag;
3597       print " Total number of sequence features is: ", scalar @feats, "
3598 \n";
3599       print " The primary tag of the first feature is:
3600 $first_primary_tag \n";
3601       print " The first feature begins at location ", $feats[0]->start,
3602 " \n";
3603       print "  and ends at location ", $feats[0]->end, " \n";
3604
3605       return 1;
3606    }
3607 } ;
3608
3609
3610 ################################################
3611 #  other_seq_utilities ():
3612 #
3613
3614 $other_seq_utilities = sub {
3615
3616     use Bio::Tools::OddCodes;
3617     use Bio::Tools::SeqPattern;
3618     use Bio::Tools::CodonTable;
3619
3620     my ( $oddcode_obj, $amino_obj, $in, $infile, $chargeseq, $chemseq);
3621     my (@ii, $i, @res, $myCodonTable);
3622     my ( $pattern,$pattern_obj, $pattern_obj2, $pattern_obj3);
3623
3624     print "\nBeginning sequence utilities example... \n";
3625
3626     # III.4.6 Identifying amino acid characteristics
3627     # - eg charge, hydrophobicity (OddCodes)
3628     $infile = $amino_seq_file;
3629     $in  = Bio::SeqIO->new('-file' => $infile ,
3630                            '-format' => 'Fasta');
3631     $amino_obj = $in->next_seq->trunc(1,20);
3632     $oddcode_obj  =  Bio::Tools::OddCodes->new($amino_obj);
3633     print "Initial sequence is: \n ", $amino_obj->seq(), " \n";
3634     $chargeseq = $oddcode_obj->charge;
3635     # Note OddCodes returns a reference to the desired
3636     # sequence, not the sequence itself.
3637     print "Sequence in \'charge\' alphabet is : \n ",$$chargeseq, " \n";
3638     $chemseq = $oddcode_obj->chemical;
3639     print "Sequence in \'chemical\' alphabet is : \n ",$$chemseq, " \n";
3640
3641     # III.4.2.1 non-standard nucleotide translation tables (CodonTable)
3642
3643     # create a table object by giving an ID
3644     $myCodonTable = Bio::Tools::CodonTable -> new ( '-id' => 16);
3645     print "\nThe name of the current codon table is ",
3646     $myCodonTable->name() , " \n";
3647
3648     # translate some codons
3649     @ii  = qw(act acc att ggg ttt aaa ytr yyy);
3650     @res =();
3651     for $i (0..$#ii) { $res[$i] = $myCodonTable->translate($ii[$i]); }
3652     print "\nWith the current codon table, the sequence \n ",
3653     @ii, "\n is translated as \n" , @res, " \n";
3654
3655     # III.4.2.2 utilities for ambiguous residues (SeqPattern, IUPAC)
3656     # III.4.2.3 regex-like nucleotide pattern manipulation
3657
3658     print "\nBeginning demo of SeqPattern object \n ";
3659
3660     $pattern     = '(CCCCT)N{1,200}(agyyg)N{1,80}(agggg)';
3661     print "\nInput regular expression = $pattern \n ";
3662     $pattern_obj = new Bio::Tools::SeqPattern('-SEQ' =>$pattern,
3663                                               '-TYPE' =>'dna');
3664     $pattern_obj2  = $pattern_obj->revcom();
3665     print "\nReverse-complemented expression = ",$pattern_obj2->str, "\n ";
3666     $pattern_obj3 = $pattern_obj->revcom(1); ## returns expanded rev complement pattern.
3667     print "\nExpanded reverse-complemented expression = ",$pattern_obj3->str, "\n ";
3668     #$pattern_obj->revcom(1); ## returns expanded rev complement pattern.
3669
3670     return 1;
3671 } ;
3672
3673
3674 #################################################
3675 # bpinspect1  ():
3676 #
3677 #  Subroutine that identifies from which ancestor module each method
3678 #  of an object is derived
3679 #
3680
3681 $bpinspect1 = sub {
3682     my ($object, $object_class,$package );
3683     my @package_list = ();
3684     $object = shift;
3685     ($object_class = $object) =~ s/::/\//g;  # replace "::" with "/"
3686     $object_class = $object_class . ".pm";
3687     require $object_class;
3688
3689     my $method_hash = $object->methods('all');
3690
3691     print " \n ***Methods for Object $object ********  \n";
3692 # get all the *unique* package names providing methods
3693     my %seen=();
3694     foreach $package  (values %$method_hash) {
3695         push(@package_list, $package) unless $seen{$package}++;
3696     }
3697     for $package (sort @package_list) {
3698         print " \n\n Methods taken from package $package  \n";
3699         my $out_count = 1;
3700         for my $method (sort keys %$method_hash) {
3701             if ($$method_hash{$method} eq $package ) {
3702             print " $method  ";
3703             $out_count++;
3704                 if ($out_count == 7) {
3705                         print " \n";    # keeps output line from getting too long
3706                         $out_count = 1;
3707                 }
3708             }
3709          }
3710      }
3711     print "\n";
3712
3713 # The following subroutine is based closely on Mark Summerfield's and Piers
3714 # Cawley's clever Class_Methods_Introspection program
3715 #
3716     package UNIVERSAL;
3717     sub methods {
3718         my ($class, $types) = @_;
3719         $class = ref $class || $class;
3720         $types ||= '';
3721         my %classes_seen;
3722         my %methods;
3723         my @class = ($class);
3724
3725         no strict 'refs';
3726         while ($class = shift @class) {
3727             next if $classes_seen{$class}++;
3728             unshift @class, @{"${class}::ISA"} if $types eq 'all';
3729             # Based on methods_via() in perl5db.pl
3730             for my $method (grep {not /^[(_]/ and
3731                                   defined &{${"${class}::"}{$_}}}
3732                             keys %{"${class}::"}) {
3733                 $methods{$method} = $class;        #ps
3734 #                $methods{$method} = wantarray ? undef : $class->can($method);
3735             }
3736         }
3737
3738         wantarray ? keys %methods : \%methods;
3739     }
3740
3741     1;
3742
3743     return 1;
3744 } ;
3745
3746
3747 ## "main" program follows
3748 #no strict 'refs';
3749
3750     @runlist = @ARGV;
3751     if (scalar(@runlist)==0) {&$display_help;}; # display help if no option
3752     if ($runlist[0] == 0) {@runlist = (1..26); }; # argument = 0 means run all tests
3753     foreach $n  (@runlist) {
3754         if ($n ==100) {my $object = $runlist[1]; &$bpinspect1($object); last;}
3755         if ($n ==1) {&$access_remote_db; next;}
3756         if ($n ==2) {&$index_local_db; next;}
3757         if ($n ==3) {&$fetch_local_db; next;}
3758         if ($n ==4) {&$sequence_manipulations; next;}
3759         if ($n ==5) {&$seqstats_and_seqwords; next;}
3760         if ($n ==6) {&$restriction_and_sigcleave; next;}
3761         if ($n ==7) {&$other_seq_utilities; next;}
3762         if ($n ==8) {&$run_standaloneblast; next;}
3763         if ($n ==9) {&$bplite_parsing; next;}
3764         if ($n ==10) {&$hmmer_parsing; next;}
3765         if ($n ==11) {&$run_clustalw_tcoffee; next;}
3766         if ($n ==12) {&$run_psw_bl2seq; next;}
3767         if ($n ==13) {&$simplealign ; next;}
3768         if ($n ==14) {&$gene_prediction_parsing; next;}
3769         if ($n ==15) {&$sequence_annotation; next;}
3770         if ($n ==16) {&$largeseqs; next;}
3771         if ($n ==17) {&$liveseqs; next;}
3772         if ($n ==18) {&$demo_variations; next;}
3773         if ($n ==19) {&$demo_xml; next;}
3774         if ($n ==20) {&$run_tree; next;}
3775         if ($n ==21) {&$run_map; next;}
3776         if ($n ==22) {&$run_struct; next;}
3777         if ($n ==23) {&$run_perl; next;}
3778         if ($n ==24) {&$searchio_parsing; next;}
3779         if ($n ==25) {&$run_remoteblast; next;}
3780         &$display_help;
3781     }
3782
3783 ## End of "main"