bptutorial.pl

   1 # $Id$
   2
   3 =pod
   4
   5 =head1  BioPerl Tutorial
   6
   7   Cared for by Peter Schattner <schattner@alum.mit.edu>
   8
   9   Copyright Peter Schattner
  10
  11    This tutorial includes "snippets" of code and text from various
  12    Bioperl documents including module documentation, example scripts
  13    and "t" test scripts. You may distribute this tutorial under the
  14    same terms as perl itself.
  15
  16    This document is written in Perl POD (plain old documentation)
  17    format.  You can run this file through your favorite pod translator
  18    (pod2html, pod2man, pod2text, etc.) if you would like a more
  19    convenient formatting.
  20
  21   Table of Contents
  22
  23   I. Introduction
  24   I.1 Overview
  25   I.2 Software requirements
  26     I.2.1 For minimal bioperl installation
  27     I.2.2 For complete installation
  28   I.3 Installation procedures
  29   I.4 Additional comments for non-unix users
  30
  31   II. Brief overview to bioperl's objects
  32   II.1 Sequence objects: (Seq, PrimarySeq, LocatableSeq, LiveSeq, LargeSeq, SeqI)
  33   II.2 Alignment objects (SimpleAlign, UnivAln)
  34   II.3  Interface objects and implementation objects
  35
  36   III. Using bioperl
  37   III.1 Accessing sequence data from local and remote databases
  38      III.1.1 Accessing remote databases (Bio::DB::GenBank, etc)
  39      III.1.2 Indexing and accessing local databases (Bio::Index::*,  bpindex.pl,  bpfetch.pl)
  40   III.2 Transforming formats of database/ file records
  41      III.2.1 Transforming sequence files (SeqIO)
  42      III.2.2 Transforming alignment files (AlignIO)
  43   III.3 Manipulating sequences
  44     III.3.1 Manipulating sequence data with Seq methods (Seq)
  45     III.3.2 Obtaining basic sequence statistics- MW, residue &codon frequencies (SeqStats)
  46     III.3.3 Identifying restriction enzyme sites (RestrictionEnzyme)
  47     III.3.4 Identifying amino acid cleavage sites (Sigcleave)
  48     III.3.5 Miscellaneous sequence utilities: OddCodes, SeqPattern
  49   III.4 Searching for "similar" sequences
  50      III.4.1 Running BLAST locally  (StandAloneBlast)
  51      III.4.2 Running BLAST remotely (using Blast.pm)
  52      III.4.3 Parsing BLAST reports with Blast.pm
  53      III.4.4 Parsing BLAST reports with BPlite, BPpsilite and BPbl2seq
  54      III.4.5  Parsing HMM reports (HMMER::Results)
  55   III.5 Creating and manipulating sequence alignments
  56      III.5.1 Aligning 2 sequences with Smith-Waterman (pSW)
  57      III.5.2 Aligning 2 sequences with Blast using  bl2seq and AlignIO
  58      III.5.3 Aligning multiple sequences (Clustalw.pm, TCoffee.pm)
  59      III.5.4 Manipulating / displaying alignments (SimpleAlign, UnivAln)
  60   III.6 Searching for genes and other structures on genomic DNA (Genscan, Sim4, ESTScan, MZEF)
  61   III.7 Developing machine readable sequence annotations
  62      III.7.1 Representing sequence annotations (Annotation, SeqFeature)
  63      III.7.2 Representing and large and/or changing sequences (LiveSeq,LargeSeq)
  64      III.7.3 Representing related sequences - mutations, polymorphisms etc (Allele, SeqDiff,)
  65      III.7.4 Sequence XML representations - generation and parsing (SeqIO::game)
  66
  67   IV.  Related projects - biocorba, biopython, biojava, Ensembl, AnnotationWorkbench
  68      IV.1 Biocorba
  69      IV.2 Biopython and biojava
  70      IV.3  Ensembl
  71      IV.4 The Annotation Workbench and bioperl-gui
  72
  73   V.  Appendices
  74      V.1 Public Methods of Principal Bioperl Objects
  75      V.2 Tutorial Demo Scripts
  76
  77 =head1 I. Introduction
  78
  79 =head2  I.1 Overview
  80
  81 Bioperl is a collection of perl modules that facilitate the
  82 development of perl scripts for bio-informatics applications.  As
  83 such, it does not include ready to use programs in the sense that may
  84 commercial packages and free web-based interfaces (eg Entrez, SRS) do.
  85 On the other hand, bioperl does provide reusable perl modules that
  86 facilitate writing perl scripts for sequence manipulation, accessing
  87 of databases using a range of data formats and execution and parsing
  88 of the results of various molecular biology programs including Blast,
  89 clustalw, TCoffee, genscan, ESTscan and HMMER.  Consequently, bioperl
  90 enables developing scripts that can analyze large quantities of
  91 sequence data in ways that are typically difficult or impossible with
  92 web based systems.
  93
  94 In order to take advantage of bioperl, the user needs a basic
  95 understanding of the perl programming language including an
  96 understanding of how to use perl references, modules, objects and
  97 methods. If these concepts are unfamiliar the user is referred to any
  98 of the various introductory / intermediate books on perl. (I've liked
  99 S. Holzmer's Perl Core Language, Coriolis Technology Press, for
 100 example).  This tutorial is not intended to teach the fundamentals of
 101 perl to those with little or no experience in the perl language.  On
 102 the other hand, advanced knowledge of perl - such as how to write a
 103 perl object - is not required for successfully using bioperl.
 104
 105 Bioperl is open source software that is still under active
 106 development.  The advantages of open source software are well known.
 107 They include the ability to freely examine and modify source code and
 108 exemption from software licensing fees.  However, since open source
 109 software is typically developed by a large number of volunteer
 110 programmers, the resulting code is often not as clearly organized and
 111 its user interface not as standardized as in a mature commercial
 112 product.  In addition, in any project under active development,
 113 documentation may not keep up with the development of new features.
 114 Consequently the learning curve for actively developed, open source
 115 source software is sometimes steep.
 116
 117 This tutorial is intended to ease the learning curve for new users of
 118 bioperl.  To that end the tutorial includes:
 119
 120 =over 4
 121
 122 =item *
 123
 124 Descriptions of what bio-informatics tasks can be handled with bioperl
 125
 126 =item *
 127
 128 Directions on where to find the methods to accomplish these tasks
 129 within the bioperl package
 130
 131 =item *
 132
 133 Recommendations on where to go for additional information.
 134
 135 =item *
 136
 137 A separate tutorial script (tutorial.pl - located in the top bioperl
 138   directory) with examples of many of methods described in the
 139   tutorial.
 140
 141 =back
 142
 143 Running the tutorial.pl script while going through this tutorial - or
 144 better yet, stepping through it with an interactive debugger - is a
 145 good way of learning bioperl.  The tutorial script is also a good
 146 place from which to cut-and-paste code for your scripts(rather than
 147 using the code snippets in this tutorial). The tutorial script should
 148 work on your machine - and if it doesn't it would probably be a good
 149 idea to find out why, before getting too involved with bioperl!
 150
 151 This tutorial does not intend to be a comprehensive description of all
 152 the objects and methods available in bioperl.  For that the reader is
 153 directed to the documentation included with each of the modules as
 154 well as the additional documentation referred to below.
 155
 156 =head2 I.2 Software requirements
 157
 158 =head2   I.2.1 Minimal bioperl installation
 159
 160 For a "minimal" installation of bioperl, you will need to have perl
 161 itself installed as well as the bioperl "core modules".  Bioperl has
 162 been tested primarily using perl 5.005 and more recently perl 5.6.
 163 The minimal bioperl installation should still work under perl 5.004.
 164 However, as increasing numbers of bioperl objects are using modules
 165 from CPAN (see below), problems have been observed for bioperl running
 166 under perl 5.004.  So if you are having trouble running bioperl under
 167 perl 5.004, you should probably upgrade your version of perl.
 168
 169 In addition to a current version of perl, the new user of bioperl is
 170 encouraged to have access to, and familiarity with, an interactive
 171 perl debugger.  Bioperl is a large collection of complex interacting
 172 software objects.  Stepping through a script with an interactive
 173 debugger is a very helpful way of seeing what is happening in such a
 174 complex software system - especially when the software is not behaving
 175 in the way that you expect.  The free graphical debugger ptkdb
 176 (available as Devel::ptkdb from CPAN) is highly recommended.  Active
 177 State offers a commercial graphical debugger for windows systems.  The
 178 standard perl distribution also contains a powerful interactive
 179 debugger - though with a more cumbersome (command line) interface.
 180
 181 =head2     I.2.2 Complete installation
 182
 183 Taking full advantage of bioperl requires software beyond that for the
 184 minimal installation.  This additional software includes perl modules
 185 from CPAN, bioperl perl extensions, a bioperl xs-extension, and
 186 several standard compiled bioinformatics programs.
 187
 188 B<Perl - extensions>
 189
 190 The following perl modules are available from bioperl
 191 (http://bioperl.org/Core/external.shtml)or from CPAN
 192 (http://www.perl.com/CPAN/) are used by bioperl.  The listing also
 193 indicates what bioperl features will not be available if the
 194 corresponding CPAN module is not downloaded.  If these modules are not
 195 available (eg non-unix operating systems), the remainder of bioperl
 196 should still function correctly.
 197
 198 For accessing remote databases you will need:
 199
 200 =over 2
 201
 202 =item *
 203
 204 File-Temp-0.09
 205
 206 =item *
 207
 208 IO-String-1.01
 209
 210 =back
 211
 212 For accessing Ace databases you will need:
 213
 214 =over 1
 215
 216 =item *
 217
 218 AcePerl-1.68.
 219
 220 =back
 221
 222 For remote blast searches you will need:
 223
 224 =over 7
 225
 226 =item *
 227
 228 libwww-perl-5.48
 229
 230 =item *
 231
 232 Digest-MD5-2.12.
 233
 234 =item *
 235
 236 HTML-Parser-3.13
 237
 238 =item *
 239
 240 libnet-1.0703
 241
 242 =item *
 243
 244 MIME-Base64-2.11
 245
 246 =item *
 247
 248 URI-1.09
 249
 250 =item *
 251
 252 IO-stringy-1.216
 253
 254 =back
 255
 256 For xml parsing you will need:
 257
 258 =over 5
 259
 260 =item *
 261
 262 libxml-perl-0.07
 263
 264 =item *
 265
 266 XML-Node-0.10
 267
 268 =item *
 269
 270 XML-Parser.2.30
 271
 272 =item *
 273
 274 XML-Writer-0.4
 275
 276 =item *
 277
 278 expat-1.95.1 from http://sourceforge.net/projects/expat/
 279
 280 =back
 281
 282 For more current and additional information on external modules
 283 required by bioperl, check http://bioperl.org/Core/external.shtml
 284
 285 B<Bioperl c extensions & external bio-informatics programs>
 286
 287 Bioperl also uses several c-programs for sequence alignment and local
 288 blast searching. To use these features of bioperl you will need an
 289 ANSI C or Gnu C compiler as well as the actual program available from
 290 sources such as:
 291
 292 for smith-waterman alignments- bioperl-ext-0.6 from
 293 http://bioperl.org/Core/external.shtml
 294
 295 for clustalw alignments-
 296 http://corba.ebi.ac.uk/Biocatalog/Alignment_Search_software.html/
 297
 298 for tcoffee alignments-
 299   http://igs-server.cnrs-mrs.fr/~cnotred/Projects_home_page/t_coffee_home_page.html
 300
 301 for local blast searching- ftp://ncbi.nlm.nih.gov/blast
 302
 303 =head2  I.3 Installation
 304
 305 The actual installation of the various system components is
 306 accomplished in the standard manner:
 307
 308 =over 6
 309
 310 =item *
 311
 312 Locate the package on the network
 313
 314 =item *
 315
 316 Download
 317
 318 =item *
 319
 320 Decompress (with gunzip or a similiar utility)
 321
 322 =item *
 323
 324 Remove the file archive (eg with tar -xvf)
 325
 326 =item *
 327
 328 Create a "makefile" (with "perl Makefile.PL" for perl modules or a
 329 supplied "install" or "configure" program for non-perl program
 330
 331 =item *
 332
 333 Run "make", "make test" and "make install" This procedure must be
 334 repeated for every CPAN module, bioperl-extension and external
 335 module to be installed. A helper module CPAN.pm is available from
 336 CPAN which automates the process for installing the perl modules.
 337
 338 =back
 339
 340 For the external programs (clustal, Tcoffee, ncbi-blast), there is an
 341 extra step:
 342
 343 =over 1
 344
 345 =item *
 346
 347 Set the relevant environmental variable (CLUSTALDIR, TCOFFEEDIR or
 348 BLASTDIR) to the directory holding the executable in your startup
 349 file - eg in .bashrc. (For running local blasts, it is also
 350 necessary that the name of local-blast database directory is known
 351 to bioperl.  This will typically happen automatically, but in case
 352 of difficulty, refer to the documentation for StandAloneBlast.pm)
 353
 354 =back
 355
 356 The only likely complication (at least on unix systems) that may occur
 357 is if you are unable to obtain system level writing privileges.  For
 358 instructions on modifying the installation in this case and for more
 359 details on the overall installation procedure, see the README file in
 360 the bioperl distribution as well as the README files in the external
 361 programs you want to use (eg bioperl-ext, clustalw, TCoffee,
 362 NCBI-blast).
 363
 364 =head2 I.4 Additional comments for non-unix users
 365
 366 Bioperl has mainly been developed and tested under various unix
 367 environments (including Linux) and this tutorial is intended primarily
 368 for unix users.  The minimal installation of bioperl *should* work
 369 under other OS's (NT, windows, perl).  However, bioperl has not been
 370 widely tested under these OS's and problems have been noted in the
 371 bioperl mailing lists.  In addition, many bioperl features require the
 372 use of CPAN modules, compiled extensions or external programs.  These
 373 features will probably will not work under some or all of these other
 374 operating systems.  If a script attempts to access these features from
 375 a non-unix OS, bioperl is designed to simply report that the desired
 376 capability is not available.  However, since the testing of bioperl in
 377 these environments has been limited, the script may well crash in a
 378 less "graceful" manner.
 379
 380 Todd Richmond has written of his experiences with BioPerl on MacOs
 381 at http://bioperl.org/Core/mac-bioperl.html
 382
 383 =head1 II. Brief introduction to bioperl's objects
 384
 385 The purpose of this tutorial is to get you using bioperl to solve
 386 real-life bioinformatics problems as quickly as possible.  The aim is
 387 not to explain the structure of bioperl objects or perl
 388 object-oriented programming in general.  Indeed, the relationships
 389 among the bioperl objects is not simple; however, understanding them
 390 in detail is fortunately not necessary for successfully using the
 391 package.
 392
 393 Nevertheless, a little familiarity with the bioperl object "bestiary"
 394 can be very helpful even to the casual user of bioperl. For example
 395 there are (at least) six different "sequence objects" - Seq,
 396 PrimarySeq, LocatableSeq, LiveSeq, LargeSeq, SeqI.  Understanding the
 397 relationships among these objects - and why there are so many of them
 398 - will help you select the appropriate one to use in your script.
 399
 400 =head2 II.2 Sequence objects: (Seq, PrimarySeq, LocatableSeq, LiveSeq, LargeSeq, SeqI)
 401
 402 Seq is the central sequence object in bioperl.  When in doubt this is
 403 probably the object that you want to use to describe a dna, rna or
 404 protein sequence in bioperl.  Most common sequence manipulations can
 405 be performed with Seq.  These capabilities are described in sections
 406 III.3.1 and III.7.1.
 407
 408 Seq objects can be created explicitly (see section III.2.1 for an
 409 example).  However usually Seq objects will be created for you
 410 automatically when you read in a file containing sequence data using
 411 the SeqIO object.  This procedure is described in section III.2.1.  In
 412 addition to storing its identification labels and the sequence itself,
 413 a Seq object can store multiple annotations and associated "sequence
 414 features".  This capability can be very useful - especially in
 415 development of automated genome annotation systems, see section
 416 III.7.1.
 417
 418 On the other hand, if you need a script capable of simultaneously
 419 handling many (hundreds or thousands) sequences at a time, then the
 420 overhead of adding annotations to each sequence can be significant.
 421 For such applications, you will want to use the PrimarySeq object.
 422 PrimarySeq is basically a "stripped down" version of Seq.  It contains
 423 just the sequence data itself and a few identifying labels (id,
 424 accession number, molecule type = dna, rna, or protein).  For
 425 applications with hundreds or thousands or sequences, using PrimarySeq
 426 objects can significantly speed up program execution and decrease the
 427 amount of RAM the program requires.
 428
 429 The LocatableSeq object is just a Seq object which has "start" and
 430 "end" positions associated with it.  It is used by the alignment
 431 object SimpleAlign and other modules that use SimpleAlign objects (eg
 432 AlignIO, pSW).  In general you don't have to worry about creating
 433 LocatableSeq objects because they will be made for you automatically
 434 when you create an alignment (using pSW, Clustalw, Tcoffee or bl2seq)
 435 or when input an alignment data file using AlignIO.  However if you
 436 need to input a sequence alignment by hand (ieg to build a SimpleAlign
 437 object), you will need to input the sequences as LocatableSeqs.
 438
 439 A LargeSeq object is a special type of Seq object used for handling
 440 very long ( eg L<gt> 100 MB) sequences.  If you need to manipulate such
 441 long sequences see section III.7.2 which describes LargeSeq objects.
 442
 443 A LiveSeq object is another specialized object for storing sequence
 444 data.  LiveSeq addresses the problem of features whose location on a
 445 sequence changes over time.  This can happen, for example, when
 446 sequence feature objects are used to store gene locations on newly
 447 sequenced genomes - locations which can change as higher quality
 448 sequencing data becomes available.  Although a LiveSeq object is not
 449 implemented in the same way as a Seq object, LargeSeq does implement
 450 the SeqI interface (see below).  Consequently, most methods available
 451 for Seq objects will work fine with LiveSeq objects. Section III.7.2
 452 contains further discussion of LiveSeq objects.
 453
 454 SeqI objects are Seq "interface objects" (see section II.4) They are
 455 used to ensure bioperl's compatibility with other software packages.
 456 SeqI and other interface objects are not likely to be relevant to the
 457 casual bioperl user.
 458
 459 *** Having described these other types of sequence objects, the
 460     "bottom line" still is that if you store your sequence data in Seq
 461     objects (which is where they'll be if you read them in with
 462     SeqIO), you will usually do just fine. ***
 463
 464 =head2 II.3 Alignment objects (SimpleAlign, UnivAln)
 465
 466 There are two "alignment objects" in bioperl: SimpleAlign and UnivAln.
 467 Both store an array of sequences as an alignment.  However their
 468 internal data structures are quite different and converting between
 469 them - though certainly possible - is rather awkward.  In contrast to
 470 the sequence objects - where there are good reasons for having 6
 471 different classes of objects, the presence of two alignment objects is
 472 just an unfortunate relic of the two systems having been designed
 473 independently at different times.
 474
 475 Since each object has some capabilities that the other lacks it has
 476 not yet been feasible to unify bioperl's sequence alignment methods
 477 into a single object (see section III.5.4 for a description of
 478 SimpleAlign's and UnivAln's features) .  However, recent development
 479 in bioperl involving alignments has been focused on using SimpleAlign
 480 and the new user should generally use SimpleAlign where possible.
 481
 482 =head2 II.4 Interface objects and implementation objects
 483
 484 Since release 0.6, bioperl has been moving to separate interface and
 485 implementation objects.  An interface is solely the definition of what
 486 methods one can call on an object, without any knowledge of how it is
 487 implemented. An implementation is an actual, working implementation of
 488 an object. In languages like Java, interface definition is part of the
 489 language. In Perl, you have to roll your own.
 490
 491 In bioperl, the interface objects usually have names like
 492 Bio::MyObjectI, with the trailing I indicating it is an interface
 493 object. The interface objects mainly provide documentation on what the
 494 interface is, and how to use it, without any implementations (though
 495 there are some exceptions).  Although interface objects are not of
 496 much direct utility to the casual bioperl user, being aware of their
 497 existence is useful since they are the basis to understanding how
 498 bioperl programs can communicate with other bioinformatics projects
 499 such as Ensembl and the Annotation Workbench (see section IV)
 500
 501
 502 =head1 III. Using bioperl
 503
 504 Bioperl provides software modules for many of the typical tasks of
 505 bioinformatics programming.  These include:
 506
 507 =over 7
 508
 509 =item * Accessing sequence data from local and remote databases
 510
 511 =item * Transforming formats of database/ file records
 512
 513 =item * Manipulating individual sequences
 514
 515 =item * Searching for "similar" sequences
 516
 517 =item * Creating and manipulating sequence alignments
 518
 519 =item * Searching for genes and other structures on genomic DNA
 520
 521 =item * Developing machine readable sequence annotations
 522
 523 =back
 524
 525 The following sections describe how bioperl can help perform all of
 526 these tasks.
 527
 528 =head2 III.1 Accessing sequence data from local and remote databases
 529
 530 Much of bioperl is focused on sequence manipulation.  However, before
 531 bioperl can manipulate sequences, it needs to have access to sequence
 532 data.  Now one can directly enter data sequence data into a bioperl
 533 Seq object, eg:
 534
 535   $seq = Bio::Seq->new('-seq'=>'actgtggcgtcaact',
 536                        '-desc'=>'Sample Bio::Seq object',
 537                        '-display_id' => 'something',
 538                        '-accession_number' => 'accnum',
 539                        '-moltype' => 'dna' );
 540
 541 However, in most cases, it is preferable to access sequence data from
 542 some online data file or database (Note that in common with
 543 conventional bioinformatics usage we will call a "database" what might
 544 be more appropriately referred to as an "indexed flat file".)  Bioperl
 545 supports accessing remote databases as well as developing indices for
 546 setting up local databases.
 547
 548 =head2   III.1.1 Accessing remote databases (Bio::DB::GenBank, etc)
 549
 550 Accessing sequence data from the principal molecular biology databases
 551 is straightforward in bioperl.  Data can be accessed by means of the
 552 sequence's accession number or id.  Batch mode access is also
 553 supported to facilitate the efficient retrieval of multiple sequences.
 554 For retrieving data from genbank, for example, the code could be as
 555 follows:
 556
 557   $gb = new Bio::DB::GenBank();
 558   $seq1 = $gb->get_Seq_by_id('MUSIGHBA1');
 559   $seq2 = $gb->get_Seq_by_acc('AF303112'))
 560   $seqio = $gb->get_Stream_by_batch([ qw(J00522 AF303112 2981014)]));
 561
 562 Bioperl currently supports sequence data retrieval from the genbank,
 563 genpept, swissprot and gdb databases. Bioperl also supports retrieval
 564 from a remote Ace database. This capability requires the presence of
 565 the external AcePerl module. You need to download and install the aceperl
 566 module from http://stein.cshl.org/AcePerl/.
 567
 568 =head2  III.1.2 Indexing and accessing local databases (Bio::Index::*,
 569    bpindex.pl, bpfetch.pl)
 570
 571 Alternately, bioperl permits indexing local sequence data files by
 572 means of the Bio::Index objects.  The following sequence data formats
 573 are supported: genbank, swissprot, pfam, embl and fasta.  Once the set
 574 of sequences have been indexed using Bio::Index, individual sequences
 575 can be accessed using syntax very similar to that described above for
 576 accessing remote databases.  For example, if one wants to set up an
 577 indexed (flat-file) database of fasta files, and later wants then to
 578 retrieve one file, one could write a scripts like:
 579
 580   # script 1: create the index
 581   use Bio::Index::Fasta; # using fasta file format
 582   $Index_File_Name = shift;
 583   $inx = Bio::Index::Fasta->new(
 584       -filename => $Index_File_Name,
 585       -write_flag => 1);
 586   $inx->make_index(@ARGV);
 587
 588   # script 2: retrieve some files
 589   use Bio::Index::Fasta;
 590   $Index_File_Name = shift;
 591   $inx = Bio::Index::Fasta->new($Index_File_Name);
 592   foreach  $id (@ARGV) {
 593       $seq = $inx->fetch($id);  # Returns Bio::Seq object
 594       # do something with the sequence
 595   }
 596
 597 To facilitate the creation and use of more complex or flexible
 598 indexing systems, the bioperl distribution includes two sample scripts
 599 bpindex.pl and bpfetch.pl.  These scripts can be used as templates to
 600 develop customized local data-file indexing systems.
 601
 602 =head2 III.2 Transforming formats of database/ file records
 603
 604 =head2   III.2.1 Transforming sequence files (SeqIO)
 605
 606 A common - and tedious - bioinformatics task is that of converting
 607 sequence data among the many widely used data formats.  Bioperl's
 608 SeqIO object, however, makes this chore a breeze.  SeqIO can read a
 609 stream of sequences - located in a single or in multiple files - in
 610 any of six formats: Fasta, EMBL. GenBank, Swissprot, PIR and GCG.
 611 Once the sequence data has been read in with SeqIO, it is available to
 612 bioperl in the form of Seq objects.  Moreover, the Seq objects can
 613 then be written to another file (again using SeqIO) in any of the
 614 supported data formats making data converters simple to implement, for
 615 example:
 616
 617   use Bio::SeqIO;
 618   $in  = Bio::SeqIO->new('-file' => "inputfilename",
 619                          '-format' => 'Fasta');
 620   $out = Bio::SeqIO->new('-file' => ">outputfilename",
 621                          '-format' => 'EMBL');
 622   while ( my $seq = $in->next_seq() ) {$out->write_seq($seq); }
 623
 624 In addition, perl "tied filehandle" syntax is available to SeqIO,
 625 allowing you to use the standard E<lt>E<gt> and print operations to read and
 626 write sequence objects, eg:
 627
 628   $in  = Bio::SeqIO->newFh('-file' => "inputfilename" ,
 629                            '-format' => 'Fasta');
 630   $out = Bio::SeqIO->newFh('-format' => 'EMBL');
 631   print $out $_ while <$in>;
 632
 633 =head2   III.2.2 Transforming alignment files (AlignIO)
 634
 635 Data files storing multiple sequence alignments also appear in varied
 636 formats.  AlignIO is the bioperl object for data conversion of
 637 alignment files. AlignIO is patterned on the SeqIO object and shares
 638 most of SeqIO's features.  AlignIO currently supports input in the
 639 following formats: fasta, mase, stockholm, prodom, selex, bl2seq,
 640 msf/gcg and output in these formats: : fasta, mase, selex, clustalw,
 641 msf/gcg.  One significant difference between AlignIO and SeqIO is that
 642 AlignIO handles IO for only a single alignment at a time (SeqIO.pm
 643 handles IO for multiple sequences in a single stream.)  Syntax for
 644 AlignIO is almost identical to that of SeqIO: use Bio::AlignIO;
 645
 646   $in  = Bio::AlignIO->new('-file' => "inputfilename" ,
 647                            '-format' => 'fasta');
 648   $out = Bio::AlignIO->new('-file' => ">outputfilename",
 649                            '-format' => 'pfam');
 650   while ( my $aln = $in->next_aln() ) { $out->write_aln($aln);  }
 651
 652 The only difference is that here, the returned object reference, $aln,
 653 is to a SimpleAlign object rather than a Seq object.
 654
 655 AlignIO also supports the tied filehandle syntax described above for
 656 SeqIO.  (Note that currently AlignIO is usable only with SimpleAlign
 657 alignment objects.  IO for UnivAln objects can only be done for files
 658 in fasta data format.)
 659
 660 =head2 III.3 Manipulating sequences
 661
 662 =head2 III.3.1  Manipulating sequence data with Seq methods
 663
 664 OK, so we know how to retrieve sequences and access them as Seq
 665 objects.  Let's see how we can use the Seq objects to manipulate our
 666 sequence data and retrieve information.  Seq provides multiple methods
 667 for performing many common (and some not-so-common) tasks of sequence
 668 manipulation and data retrieval.  Here are some of the most useful:
 669
 670 The following methods return strings
 671
 672   $seqobj->display_id(); # the human read-able id of the sequence
 673   $seqobj->seq();        # string of sequence
 674   $seqobj->subseq(5,10); # part of the sequence as a string
 675   $seqobj->accession_number(); # when there, the accession number
 676   $seqobj->moltype();    # one of 'dna','rna','protein'
 677   $seqobj->primary_id(); # a unique id for this sequence irregardless
 678                          # of its display_id or accession number
 679
 680 The following methods return an array of  Bio::SeqFeature objects
 681
 682    $seqobj->top_SeqFeatures # The 'top level' sequence features
 683    $seqobj->all_SeqFeatures # All sequence features, including sub
 684                             # seq features
 685
 686 Sequence features will be discussed further in section III.7 on
 687 machine-readable sequence annotation.
 688
 689 The following methods returns new sequence objects, but do not transfer features across
 690
 691   $seqobj->trunc(5,10)  # truncation from 5 to 10 as new object
 692   $seqobj->revcom       # reverse complements sequence
 693   $seqobj->translate    # translation of the sequence
 694
 695 Note that some methods return strings, some return arrays and some return
 696 references to objects.  Here (as elsewhere in perl and bioperl) it is the user's
 697 responsibility to check the relevant documentation so they know the
 698 format of the data being returned.
 699
 700 Many of these methods are self-explanatory. However, bioperl's flexible
 701 translation methods warrant further comment. Translation in bioinformatics
 702 can mean two slightly different things:
 703
 704 =over 2
 705
 706 =item 1 Translating a nucleotide sequence from start to end.
 707
 708 =item 2 Taking into account the constraints of real coding regions in mRNAs.
 709
 710 =back
 711
 712 For historical reasons the bioperl implementation of translation does
 713 the first of these tasks easily. Any sequence object which is not of moltype
 714 'protein' can be translated by simply calling the method which returns
 715 a protein sequence object:
 716
 717   $translation1 = $my_seq_object->translate;
 718
 719 However, the translate method can also be passed several optional parameters
 720 to modify its behavior. For example, the first two arguments to "translate"
 721 can be used to modify the characters used to represent stop (default '*')
 722 and unknown amino acid ('X'). (These are normally best left untouched.)
 723 The third argument determines the frame of the translation. The default
 724 frame is "0".  To get translations in the other two forward frames,
 725 we would write:
 726
 727   $translation2 = $my_seq_object->translate(undef,undef,1);
 728   $translation3 = $my_seq_object->translate(undef,undef,2);
 729
 730 The fourth argument to "translate" makes it possible to use alternative
 731 genetic codes. There are currently 16 codon tables defined, including tables for
 732 'Verterbate Mitochondrial', 'Bacterial', 'Alternative Yeast Nuclear'
 733 and 'Ciliate, Dasycladacean and Hexamita Nuclear' translation. These
 734 tables are located in the object Bio::Tools::CodonTable which is used
 735 by the translate method. For example, for mitochondrial translation:
 736
 737   $human_mitochondrial_translation =
 738       $my_seq_object->translate(undef,undef,undef, 2);
 739
 740 If we want to translate full coding regions (CDS) the way major
 741 nucleotide databanks EMBL, GenBank and DDBJ do it, the translate
 742 method has to perform more tricks. Specifically, 'translate' needs
 743 to confirm that the sequence has appropriate start and terminator codons
 744 at the beginning and the end of the sequence and that there are no terminator
 745 codons present within the sequence.  In addition, if the genetic code being used has
 746 an atypical (non-ATG) start codon, the translate method needs to convert
 747 the initial amino acid to methionine.  These checks and conversions are triggered by
 748 setting the fifth argument of the translate method to evaluate to "true".
 749
 750 If argument 5 is set to true and the criteria for a proper CDS are
 751 not met, the method, by default, issues a warning. By setting the
 752 sixth argument to evaluate to "true", one can instead instruct
 753 the program to die if an improper CDS is found, e.g.
 754
 755   $protein_object =
 756       $cds->translate(undef,undef,undef,undef,1,'die_if_errors');
 757
 758 =head2 III.3.2 Obtaining basic sequence statistics- MW, residue &codon
 759 frequencies(SeqStats, SeqWord)
 760
 761 In addition to the methods directly available in the Seq object,
 762 bioperl provides various "helper" objects to determine additional
 763 information about a sequence.  For example, the SeqStats object
 764 provides methods for obtaining the molecular weight of the sequence as
 765 well the number of occurrences of each of the component residues
 766 (bases for a nucleic acid or amino acids for a protein.)  For nucleic
 767 acids, SeqStats also returns counts of the number of codons used.  For
 768 example:
 769
 770   use SeqStats
 771   $seq_stats  =  Bio::Tools::SeqStats->new($seqobj);
 772   $weight = $seq_stats->get_mol_wt();
 773   $monomer_ref = $seq_stats->count_monomers();
 774   $codon_ref = $seq_stats->count_codons();  # for nucleic acid sequence
 775
 776 Note: sometimes sequences will contain "ambiguous" codes.  For this
 777 reason, get_mol_wt() returns (a reference to) a two element array
 778 containing a greatest lower bound and a least upper bound of the
 779 molecular weight.
 780
 781 The SeqWords object is similar to SeqStats and provides methods for
 782 calculating frequencies of "words" (eg tetramers or hexamers) within
 783 the sequence.
 784
 785 =head2 III.3.3 Identifying restriction enzyme sites (RestrictionEnzyme)
 786
 787 Another common sequence manipulation task for nucleic acid sequences
 788 is locating restriction enzyme cutting sites.  Bioperl provides the
 789 RestrictionEnzyme object for this purpose. Bioperl's standard
 790 RestrictionEnzyme object comes with data for XXX different restriction
 791 enzymes. A list of the available enzymes can be accessed using the
 792 available_list() method.  For example to select all available enzymes
 793 that with cutting patterns that are six bases long one would write:
 794
 795   $re  = new Bio::Tools::RestrictionEnzyme('-name'=>'EcoRI');
 796   @sixcutters = $re->available_list(6);
 797
 798 Once an appropriate enzyme has been selected, the sites for that
 799 enzyme on a given nucleic acid sequence can be obtained using the
 800 cut_seq() method.  The syntax for performing this task is:
 801
 802   $re1 = new Bio::Tools::RestrictionEnzyme(-name=>'EcoRI');
 803   # $seqobj is the Seq object for the dna to be cut
 804   @fragments =  $re1->cut_seq($seqobj);
 805
 806 Adding an enzyme not in the default list is easily accomplished:
 807
 808   $re2 = new Bio::Tools::RestrictionEnzyme('-NAME' =>'EcoRV--GAT^ATC',
 809                                            '-MAKE' =>'custom');
 810
 811 Once the custom enzyme object has been created, cut_seq() can be
 812 called in the usual manner.
 813
 814 =head2    III.3.4 Identifying amino acid cleavage sites (Sigcleave)
 815
 816 For amino acid sequences we may be interested to know whether the
 817 amino acid sequence contains a cleavable "signal sequence" for
 818 directing the transport of the protein within the cell.  SigCleave is
 819 a program (originally part of the EGCG molecular biology package) to
 820 predict signal sequences, and to identify the cleavage site.
 821
 822 The "threshold" setting controls the score reporting.  If no value for
 823 threshold is passed in by the user, the code defaults to a reporting
 824 value of 3.5.  SigCleave will only return score/position pairs which
 825 meet the threshold limit.
 826
 827 There are 2 accessor methods for this object. "signals" will return a
 828 perl hash containing the sigcleave scores keyed by amino acid
 829 position. "pretty_print" returns a formatted string similar to the
 830 output of the original sigcleave utility.
 831
 832 Syntax for using the modules is as follows:
 833
 834   use Bio::Tools::Sigcleave;
 835   $sigcleave_object = new Bio::Tools::Sigcleave
 836       ('-file'=>'sigtest.aa',
 837        '-threshold'=>'3.5'
 838        '-desc'=>'test sigcleave protein seq',
 839        '-type'=>'AMINO
 840       ');
 841   %raw_results      = $sigcleave_object->signals;
 842   $formatted_output = $sigcleave_object->pretty_print;
 843
 844 Note that Sigcleave is passed a raw sequence (or file containing a
 845 sequence) rather than a sequence object when it is created. Also note
 846 that the "type" in the Sigcleave object is "amino" whereas in a Seq
 847 object it is "protein".
 848
 849 =head2 III.3.5 Miscellaneous sequence utilities: OddCodes, SeqPattern
 850
 851 OddCodes:
 852
 853 For some purposes it's useful to have a listing of an amino acid
 854 sequence showing where the hydrophobic amino acids are located or
 855 where the positively charged ones are.  Bioperl provides this
 856 capability via the module OddCodes.pm.
 857
 858 For example, to quickly see where the charged amino acids are located
 859 along the sequence we perform:
 860
 861   use Bio::Tools::OddCodes;
 862   $oddcode_obj = Bio::Tools::OddCodes->new($amino_obj);
 863   $output = $oddcode_obj->charge();
 864
 865 The sequence will be transformed into a three-letter sequence (A,C,N)
 866 for negative (acidic), positive (basic), and neutral amino acids.  For
 867 example the ACDEFGH would become NNAANNC.
 868
 869 For a more complete chemical description of the sequence one can call
 870 the chemical() method which turns sequence into one with an 8-letter
 871 chemical alphabet { A (acidic), L (aliphatic), M (amide), R
 872 (aromatic), C (basic), H (hydroxyl), I (imino), S (sulfur) }:
 873
 874   $output = $oddcode_obj->chemical();
 875
 876 In this case the sample sequence ACDEFGH would become LSAARAC.
 877
 878 OddCodes also offers translation into alphabets showing alternate
 879 characteristics of the amino acid sequence such as hydrophobicity,
 880 "functionality" or grouping using Dayhoff's definitions.  See the
 881 documentation for OddCodes.pm for further details.
 882
 883 SeqPattern:
 884
 885 The SeqPattern object is used to manipulate sequences that include
 886 perl "regular expressions".  A key motivation for SeqPattern is to
 887 have a way of generating a reverse complement of a nucleic acid
 888 sequence pattern that includes ambiguous bases and/or regular
 889 expressions.  This capability leads to significant performance gains
 890 when pattern matching on both the sense and anti-sense strands of a
 891 query sequence are required.  Typical syntax for using SeqPattern is
 892 shown below.  For more information, there are several interesting
 893 examples in the script SeqPattern.pl in the examples directory.
 894
 895   Use Bio::Tools::SeqPattern;
 896   $pattern     = '(CCCCT)N{1,200}(agggg)N{1,200}(agggg)';
 897   $pattern_obj = new Bio::Tools::SeqPattern('-SEQ' =>$pattern,
 898                                             '-TYPE' =>'dna');
 899   $pattern_obj2  = $pattern_obj->revcom();
 900   $pattern_obj->revcom(1); ## returns expanded rev complement pattern.
 901
 902 =head2 III.4 Searching for "similar" sequences
 903
 904 One of the basic tasks in molecular biology is identifying sequences
 905 that are, in some way, similar to a sequence of interest.  The Blast
 906 programs, originally developed at the NCBI, are widely used for
 907 identifying such sequences.  Bioperl offers a number of modules to
 908 facilitate running Blast as well as to parse the often voluminous
 909 reports produced by Blast.
 910
 911 =head2   III.4.1 Running BLAST locally  (StandAloneBlast)
 912
 913 There are several reasons why one might want to run the Blast programs
 914 locally - speed, data security, immunity to network problems, being
 915 able to run large batch runs etc.  The NCBI provides a downloadable
 916 version of blast in a stand-alone format, and running blast locally
 917  without any use of perl or bioperl - is completely
 918 straightforward.  However, there are situations where having a perl
 919 interface for running the blast programs locally is convenient.
 920
 921 The module StandAloneBlast.pm offers the ability to wrap local calls
 922 to blast from within perl.  All of the currently available options of
 923 NCBI Blast (eg PSIBLAST, PHIBLAST, bl2seq) are available from within
 924 the bioperl StandAloneBlast interface.  Of course, to use
 925 StandAloneBlast, one needs to have installed locally ncbi-blast as
 926 well as one or more blast-readable databases.
 927
 928 Basic usage of the StandAloneBlast.pm module is simple.  Initially, a
 929 local blast "factory object" is created.
 930
 931   @params = ('program' => 'blastn',
 932              'database' => 'ecoli.nt');
 933   $factory = Bio::Tools::StandAloneBlast->new(@params);
 934
 935 Any parameters not explicitly set will remain as the BLAST defaults.
 936 Once the factory has been created and the appropriate parameters set,
 937 one can call one of the supported blast executables.  The input
 938 sequence(s) to these executables may be fasta file(s), a Bio::Seq
 939 object or an array of Bio::Seq objects, eg
 940
 941   $input = Bio::Seq->new('-id'=>"test query",
 942                          '-seq'=>"ACTAAGTGGGGG");
 943   $blast_report = $factory->blastall($input);
 944
 945 The returned blast report will be in the form of a bioperl
 946 parsed-blast object.  The report object may be either a BPlite,
 947 BPpsilite, BPbl2seq or Blast object depending on the type of blast
 948 search.  The "raw" blast report is also available.
 949
 950 The syntax for running PHIBLAST, PSIBLAST and bl2seq searches via
 951 StandAloneBlast is also straightforward.  See the StandAloneBlast.pm
 952 documentation for details.  In addition, the script standaloneblast.pl
 953 in the examples directory contains descriptions of various possible
 954 applications of the StandAloneBlast object.
 955
 956 =head2   III.4.2 Running BLAST remotely (using Blast.pm)
 957
 958 Bioperl supports remote execution of blasts at NCBI by means of the
 959 Blast.pm object.  (Note: the bioperl Blast object is referred to here
 960 as Blast.pm to distinguish it from the actual Blast program).
 961 Blast.pm is capable of both running Blasts and parsing the report
 962 results. Blast.pm supports a wide array of modes, options and
 963 parameters.  As a result, using Blast.pm directly can be somewhat
 964 complicated. Consequently, it is recommended to use, and if necessary
 965 modify, the supplied scripts - run_blast_remote.pl and
 966 retrieve_blast.pl in the examples/blast/ subdirectory - rather than to
 967 use Blast.pm directly.  Sample syntax looks like this:
 968
 969    run_blast_remote.pl seq/yel009c.fasta -prog blastp -db swissprot
 970    retrieve_blast.pl < YEL009C.blastp2.swissprot.temp.html
 971
 972 The NCBI blast server will respond with an ID number indicating the
 973 file in which the blast results are stored (with a line like "Obtained
 974 request ID: 940912366-18156-27559").  That result file will then be
 975 stored locally with a name like 940905064-15626-17267.txt, and can
 976 subsequently be read with Blast.pm or BPlite as described below.
 977
 978 Run the scripts run_blast_remote.pl, retrieve_blast.pl and
 979 blast_config.pl with the options "-h" or "-eg" for more examples on
 980 how to use Blast.pm to perform remote blasts.
 981
 982 =head2   III.4.3 Parsing BLAST reports with Blast.pm
 983
 984 No matter how Blast searches are run (locally or remotely, with or
 985 without a perl interface), they return large quantities of data that
 986 are tedious to sift through.  Bioperl offers two different objects -
 987 Blast.pm and BPlite.pm (along with its minor modifications, BPpsilite
 988 and BPbl2seq) for parsing Blast reports.
 989
 990 The parser contained within the Blast.pm module is the original Blast
 991 parser developed for Bioperl.  It is very full featured and has a
 992 large array of options and output formats.  Typical syntax for parsing
 993 a blast report with Blast.pm is:
 994
 995   use Bio::Tools::Blast;
 996   $blast = Bio::Tools::Blast->new(-file   =>'t/blast.report',
 997                                   -signif => 1e-5,
 998                                   -parse  => 1,
 999                                   -stats  => 1,
1000                                   -check_all_hits => 1, );
1001   $blast->display();
1002   $num_hits =  $blast->num_hits;
1003   @hits  = $blast->hits;
1004   $frac1 = $hits[1]->frac_identical;
1005   @inds = $hits[1]->hsp->seq_inds('query', 'iden', 1);
1006
1007 Here the method "hits" returns an object containing the names of the
1008 sequences which produced a match and the "hsp" method returns a "high
1009 scoring pair" object containing the actual sequence alignments that
1010 each of the hits produced.
1011
1012 One very nice feature of the Blast.pm parser is being able to define
1013 an arbitrary "filter function" for use while parsing the Blast hits.
1014 With this feature, you can filter your results to just save hits with
1015 specific pattern in their id fields (eg "homo sapiens") or specific
1016 sequence patterns in a returned high-scoring-pair or just about
1017 anything else that can be found in the blast report record.
1018
1019 While the Blast object is parsing the report, each hit is checked by
1020 calling &filter($hit). All hits that generate false return values from
1021 &filter are screened out of the Blast object.. Note that the Blast
1022 object will normally stop parsing after the first non-significant hit
1023 or the first hit that does not pass the filter function. To force the
1024 Blast object to check all hits, include a " -check_all_hits =E<gt> 1"
1025 parameter. For example, to eliminate all hits with gaps or with less
1026 than 50% conserved residues one could use the following filter
1027 function:
1028
1029   sub filter { $hit=shift;
1030   return ($hit->gaps == 0 and $hit->frac_conserved > 0.5); }
1031
1032 and use it like this:
1033
1034   $blastObj = Bio::Tools::Blast->new( '-file'      => '/tmp/blast.out',
1035                                       '-parse'     => 1,
1036                                       '-check_all_hits' => 1,
1037                                       '-filt_func' => \&filter );
1038
1039 Unfortunately the flexibility of the Blast.pm parser comes at a cost
1040 of complexity.  As a result of this complexity and the fact that
1041 Blast.pm's original developer is no longer actively supporting the
1042 module, the Blast.pm parser has been difficult to maintain and has not
1043 been upgraded to handle the output of the newer blast options such as
1044 PSIBLAST and BL2SEQ.  Consequently, the BPlite parser (described in
1045 the following section) is recommended for most blast parsing within
1046 bioperl.
1047
1048 =head2 III.4.4 Parsing BLAST reports with BPlite, BPpsilite and BPbl2seq
1049
1050 Because of the issues with Blast.pm discussed above, Ian Korf's BPlite
1051 parser has been recently ported to Bioperl.  BPlite is less complex
1052 and easier to maintain than Blast.pm.  Although it has fewer options
1053 and display modes than Blast.pm, you will probably find that it
1054 contains the functionality that you need.  ( One exception might be if
1055 you want to set up an arbitrary filter function - as described above -
1056 in which case you may want to use the Blast.pm parser.)
1057
1058 BPlite
1059
1060 The syntax for using BPlite is as follows where the method for
1061 retrieving hits is now called "nextSbjct" (for "subject"), while the
1062 method for retrieving high-scoring-pairs is called "nextHSP":
1063
1064   use Bio::Tools::BPlite;
1065   $report = new BPlite(-fh=>\*STDIN);
1066   $report->query;
1067   while(my $sbjct = $report->nextSbjct) {
1068        $sbjct->name;
1069        while (my $hsp = $sbjct->nextHSP) { $hsp->score; }
1070   }
1071
1072 BPpsilite
1073
1074 BPpsilite and BPbl2seq are objects for parsing (multiple iteration)
1075 PSIBLAST reports and Blast bl2seq reports, respectively.  They are
1076 both minor variations on the BPlite object.
1077
1078 The syntax for parsing a multiple iteration PSIBLAST report is as
1079 shown below.  The only significant additions to BPlite are methods to
1080 determine the number of iterated blasts and to access the results from
1081 each iteration.  The results from each iteration are parsed in the
1082 same manner as a (complete) BPlite object.
1083
1084   use Bio::Tools::BPpsilite;
1085   $report = new BPpsilite(-fh=>\*STDIN);
1086   $total_iterations = $report->number_of_iterations;
1087   $last_iteration = $report->round($total_iterations)
1088   while(my $sbjct =  $last_iteration ->nextSbjct) {
1089        $sbjct->name;
1090        while (my $hsp = $sbjct->nextHSP) {$hsp->score; }
1091   }
1092
1093 BPbl2seq
1094
1095 BLAST bl2seq is a program for comparing and aligning two sequences
1096 using BLAST.  Although the report format is similar to that of a
1097 conventional BLAST, there are a few differences.  Consequently, the
1098 standard bioperl parsers Blast.pm and BPlite are unable to read bl2seq
1099 reports directly.  From the user's perspective, the main difference
1100 between bl2seq and other blast reports is that the bl2seq report does
1101 not print out the name of the first of the two aligned sequences.
1102 Consequently, BPbl2seq has no way of identifying the name of one of
1103 the initial sequence unless it is explicitly passed to constructor as
1104 a second argument as in:
1105
1106   use Bio::Tools::BPbl2seq;
1107   $report = Bio::Tools::BPbl2seq->new(-file => "t/bl2seq.out", -queryname => "ALEU_HORVU");
1108   $matches = $report->match;
1109
1110 =head2 III.4.5 Parsing HMM reports (HMMER::Results)
1111
1112 Blast is not the only sequence-similarity-searching program supported
1113 by bioperl.  HMMER is a Hidden Markov-chain Model (HMM) program that
1114 (among other capabilities) enables sequence similarity
1115 searching. Bioperl does not currently provide a perl interface for
1116 running HMMER.  However, bioperl does provide a HMMER report parser
1117 with the (perhaps not too descriptive) name of Results.
1118
1119 Results can parse reports generated both by the HMMER program
1120 hmmsearch - which searches a sequence database for sequences similar
1121 to those generated by a given HMM - and the program hmmpfam - which
1122 searches a HMM database for HMMs which match domains of a given
1123 sequence. For hmmsearch, a series of HMMER::Set objects are made, one
1124 for each sequence. For hmmpfam searches, only one Set object is made.
1125 Sample usage for parsing a hmmsearch report might be:
1126
1127   use Bio::Tools::HMMER::Results;
1128   $res = new Bio::Tools::HMMER::Results('-file' => 'output.hmm' ,
1129                                         '-type' => 'hmmsearch');
1130   foreach $seq ( $res->each_Set ) {
1131       print "Sequence bit score is ", $seq->bits, "\n";
1132       foreach $domain ( $seq->each_Domain ) {
1133           print " Domain start ", $domain->start, " end ",
1134               $domain->end," score ",$domain->bits,"\n";
1135       }
1136   }
1137
1138 =head2 III.5 Creating and manipulating sequence alignments
1139
1140 Once one has identified a set of similar sequences, one often needs to
1141 create an alignment of those sequences.  Bioperl offers several perl
1142 objects to facilitate sequence alignment: pSW, Clustalw.pm, TCoffee.pm
1143 and the bl2seq option of StandAloneBlast.  All of these objects take
1144 as arguments a reference to an array of (unaligned) Seq objects.  All
1145 (except bl2seq) return a reference to a SimpleAlign object.  bl2seq
1146 can also produce a SimpleAlign object when it is combined with AlignIO
1147 (see below section III.5.2).
1148
1149 =head2    III.5.1 Aligning 2 sequences with Smith-Waterman (pSW)
1150
1151 The Smith-Waterman (SW) algorithm is the standard method for producing
1152 an optimal alignment of two sequences.  Bioperl supports the
1153 computation of SW alignments via the pSW object.  The SW algorithm
1154 itself is implemented in C and incorporated into bioperl using an XS
1155 extension.  This has significant efficiency advantages but means that
1156 pSW will **not** work unless you have compiled the bioperl-ext
1157 package.  If you have compiled the bioperl-ext package, usage is
1158 simple, where the method align_and_show displays the alignment while
1159 pairwise_alignment produces a (reference to) a SimpleAlign object.
1160
1161   use Bio::Tools::pSW;
1162   $factory = new Bio::Tools::pSW( '-matrix' => 'blosum62.bla',
1163                                   '-gap' => 12,
1164                                   '-ext' => 2, );
1165   $factory->align_and_show($seq1, $seq2, STDOUT);
1166   $aln = $factory->pairwise_alignment($seq1, $seq2);
1167
1168 SW matrix, gap and extension parameters can be adjusted as shown.
1169 Bioperl comes standard with blosum62 and gonnet250 matrices.  Others
1170 can be added by the user.  For additional information on accessing the
1171 SW algorithm via pSW see the example script pSW.pl and the
1172 documentation in pSW.pm.
1173
1174 =head2   III.5.2 Aligning 2 sequences with Blast using  bl2seq and AlignIO
1175
1176 As an alternative to Smith-Waterman, two sequences can also be aligned
1177 in Bioperl using the bl2seq option of Blast within the StandAloneBlast
1178 object.  To get an alignment - in the form of a SimpleAlign object -
1179 using bl2seq, you need to parse the bl2seq report with the AlignIO
1180 file format reader as follows:
1181
1182   $factory = Bio::Tools::StandAloneBlast->new('outfile' => 'bl2seq.out');
1183   $bl2seq_report = $factory->bl2seq($seq1, $seq2);
1184   # Use AlignIO.pm to create a SimpleAlign object from the bl2seq report
1185   $str = Bio::AlignIO->new('-file '=>' bl2seq.out',
1186                            '-format' => 'bl2seq');
1187   $aln = $str->next_aln();
1188
1189 =head2   III.5.3 Aligning multiple sequences (Clustalw.pm, TCoffee.pm)
1190
1191 For aligning multiple sequences (ie two or more), bioperl offers a
1192 perl interface to the bioinformatics-standard clustalw and tcoffee
1193 programs.  Clustalw has been a leading program in global multiple
1194 sequence alignment (MSA) for several years.  TCoffee is a relatively
1195 recent program - derived from clustalw - which has been shown to
1196 produce better results for local MSA.
1197
1198 To use these capabilities, the clustalw and/or tcoffee programs
1199 themselves need to be installed on the host system.  In addition, the
1200 environmental variables CLUSTALDIR and TCOFFEEDIR need to be set to
1201 the directories containg the executables.  See section I.3 and the
1202 Clustalw.pm and TCoffee.pm module documentation for information on
1203 downloading and installing these programs.
1204
1205 From the user's perspective, the bioperl syntax for calling
1206 Clustalw.pm or TCoffee.pm is almost identical.  The only differences
1207 are the names of the modules themselves appearing in the initial "use"
1208 and constructor statements and the names of the some of the individual
1209 program options and parameters.
1210
1211 In either case, initially, a "factory object" must be created. The
1212 factory may be passed most of the parameters or switches of the
1213 relevant program.  In addition, alignment parameters can be changed
1214 and/or examined after the factory has been created.  Any parameters
1215 not explicitly set will remain as the underlying program's
1216 defaults. Clustalw.pm/TCoffee.pm output is returned in the form of a
1217 SimpleAlign object.  It should be noted that some Clustalw and TCoffee
1218 parameters and features (such as those corresponding to tree
1219 production) have not been implemented yet in the Perl interface.
1220
1221 Once the factory has been created and the appropriate parameters set,
1222 one can call the method align() to align a set of unaligned sequences,
1223 or profile_align() to add one or more sequences or a second alignment
1224 to an initial alignment.  Input to align() consists of a set of
1225 unaligned sequences in the form of the name of file containing the
1226 sequences or a reference to an array of Bio::Seq objects. Typical
1227 syntax is shown below. (We illustrate with Clustalw.pm, but the same
1228 syntax - except for the module name - would work for TCoffee.pm)
1229
1230   use Bio::Tools::Run::Alignment::Clustalw;
1231   @params = ('ktuple' => 2, 'matrix' => 'BLOSUM');
1232   $factory = Bio::Tools::Run::Alignment::Clustalw->new(@params);
1233   $ktuple = 3;
1234   $factory->ktuple($ktuple);  # change the parameter before executing
1235   $seq_array_ref = \@seq_array;
1236       # where @seq_array is an array of Bio::Seq objects
1237   $aln = $factory->align($seq_array_ref);
1238
1239 Clustalw.pm/TCoffee.pm can also align two (sub)alignments to each
1240 other or add a sequence to a previously created alignment by using the
1241 profile_align method. For further details on the required syntax and
1242 options for the profile_align method, the user is referred to the
1243 Clustalw.pm/TCoffee.pm documentation.  The user is also encouraged to
1244 run the script clustalw.pl in the examples directory.
1245
1246 =head2 III.5.4 Manipulating / displaying alignments (SimpleAlign, UnivAln)
1247
1248 As described in section II.2, bioperl currently includes two alignment
1249 objects, SimpleAlign and UnivAln.  SimpleAlign objects are usually
1250 more useful, since they are directly produced by bioperl alignment
1251 creation objects ( eg Clustalw.pm and pSW) and can be used to read and
1252 write multiple alignment formats via AlignIO.
1253
1254 However, SimpleAlign currently only offers limited functionality for
1255 alignment manipulation.  One useful method offered by SimpleAlign is
1256 consensus_string().  This method returns a string with the most common
1257 residue in the alignment at each string position. An optional
1258 threshold ranging from 0 to 100 can be passed to consensus_string.  If
1259 the consensus residue appears in fewer than the threshold % of the
1260 sequences, consensus_string will return a "?" at that
1261 location. Typical usage is:
1262
1263   use Bio::SimpleAlign;
1264   $aln = Bio::SimpleAlign->new('t/alnfile.fasta');
1265   $threshold_percent = 60;
1266   $str = $aln->consensus_string($threshold_percent)
1267
1268 UnivAln, on the other hand, offers a variety of methods for "slicing
1269 and dicing" an alignment including methods for removing gaps, reverse
1270 complementing specified rows and/or columns of an alignment, and
1271 extracting consensus sequences with specified thresholds for the
1272 entire alignment or a sub-alignment.  Typical usage is:
1273
1274   use Bio::UnivAln;
1275   $aln = Bio::UnivAln->new('t/alnfile.fasta');
1276   $resSlice1 = $aln->remove_gaps(); # original sequences without gaps
1277   $resSlice2 = $aln->revcom([1,3]); # reverse complement, rows 1+3 only
1278   $resSlice3 = $aln->consensus(0.6, [1,3]);
1279        # 60% majority, columns 1+3 only
1280
1281 Many additional - and more intricate - methods exist.  See the UnivAln
1282 documentation.  Note that if you do want to use UnivAln's methods on
1283 an alignment, you will first need to convert the alignment into fasta
1284 format (which can be done via the SimpleAlign and AlignIO objects
1285 discussed above.)
1286
1287 =head2 III.6 Searching for genes and other structures on genomic DNA
1288 (Genscan, Sim4, ESTScan, MZEF)
1289
1290 Automated searching for putative genes, coding sequences and other
1291 functional units in genomic and expressed sequence tag (EST) data has
1292 become very important as the available quantity of sequence data has
1293 rapidly increased.  Many gene searching programs currently exist.
1294 Each produces reports containing predictions that must be read
1295 manually or parsed by automated report readers.
1296
1297 Parsers for four widely used gene prediction programs- Genscan, Sim4,
1298 ESTScan and MZEF - are currently available or under active development
1299 in bioperl.  The interfaces for the four parsers are similar. We illustrate
1300 the usage for Genscan and Sim4 here.  The syntax is relatively
1301 self-explanatory; further details are available in the module
1302 documentation in the Bio::Tools directory.
1303
1304   use Bio::Tools::Genscan;
1305   $genscan = Bio::Tools::Genscan->new(-file => 'result.genscan');
1306   # $gene is an instance of Bio::Tools::Prediction::Gene
1307   # $gene->exons() returns an array of Bio::Tools::Prediction::Exon objects
1308   while($gene = $genscan->next_prediction())
1309       { @exon_arr = $gene->exons(); }
1310   $genscan->close();
1311
1312   use Bio::Tools::Sim4::Results;
1313   $sim4 = new Bio::Tools::Sim4::Results(-file=> 't/sim4.rev', -estisfirst=>0);
1314   # $exonset is-a Bio::SeqFeature::Generic with Bio::Tools::Sim4::Exons
1315   # as sub features
1316   $exonset = $sim4->next_exonset;
1317   @exons = $exonset->sub_SeqFeature();
1318   # $exon is-a Bio::SeqFeature::FeaturePair
1319   $exon = 1;
1320   $exonstart = $exons[$exon]->start();
1321   $estname = $exons[$exon]->est_hit()->seqname();
1322   $sim4->close();
1323
1324 =head2 III.7 Developing machine readable sequence annotations
1325
1326 Historically, annotations for sequence data have been entered and read
1327 manually in flat-file or relational databases with relatively little
1328 concern for machine readability.  More recent projects - such as EBI's
1329 Ensembl project and the efforts to develop an XML molecular biology
1330 data specification - have begun to address this limitation.  Because
1331 of its strengths in text processing and regular-expression handling,
1332 perl is a natural choice for the computer language to be used for this
1333 task.  And bioperl offers numerous tools to facilitate this process -
1334 several of which are described in the following sub-sections.
1335
1336 =head2 III.7.1 Representing sequence annotations (Annotation,
1337 SeqFeature)
1338
1339 As of the 0.7 release of bioperl, the fundamental sequence object,
1340 Seq, can have multiple sequence feature (SeqFeature) objects - eg
1341 Gene, Exon, Promoter objects - associated with it.  A Seq object can
1342 also have an Annotation object (used to store database links,
1343 literature references and comments) associated with it.  Creating a
1344 new SeqFeature and Annotation and associating it with a Seq is
1345 accomplished with syntax like:
1346
1347   $feat = new Bio::SeqFeature::Generic('-start' => 40,
1348                                         '-end' => 80,
1349                                         '-strand' => 1,
1350                                         '-primary' => 'exon',
1351                                         '-source' => 'internal' );
1352   $seqobj->add_SeqFeature($feat); # Add the SeqFeature to the parent
1353   $seqobj->annotation(new Bio::Annotation
1354       ('-description' => 'desc-here'));
1355
1356 Once the features and annotations have been associated with the Seq,
1357 they can be with retrieved, eg:
1358
1359   @topfeatures = $seqobj->top_SeqFeatures(); # just top level, or
1360   @allfeatures = $seqobj->all_SeqFeatures(); # descend into sub features
1361   $ann = $seqobj->annotation(); # annotation object
1362
1363 The individual components of a SeqFeature can also be set or retrieved
1364 with methods including:
1365
1366   # attributes which return numbers
1367   $feat->start          # start position
1368   $feat->end            # end position
1369
1370   $feat->strand         # 1 means forward, -1 reverse, 0 not relevant
1371
1372   # attributes which return strings
1373   $feat->primary_tag    # the main 'name' of the sequence feature,
1374                         # eg, 'exon'
1375   $feat->source_tag     # where the feature comes from, eg'BLAST'
1376
1377   # attributes which return Bio::PrimarySeq objects
1378   $feat->seq            # the sequence between start,end
1379   $feat->entire_seq     # the entire sequence
1380
1381   # other useful methods include
1382   $feat->overlap($other)  # do SeqFeature $feat and SeqFeature $other overlap?
1383   $feat->contains($other) # is $other completely within $feat?
1384   $feat->equals($other)   # do $feat and $other completely $agree?
1385   $feat->sub_SeqFeatures  # create/access an array of subsequence features
1386
1387 =head2 III.7.2 Representing and large and/or changing sequences
1388 (LiveSeq,LargeSeq)
1389
1390 Very large sequences and/or data files with sequences that are
1391 frequently being updated present special problems to automated
1392 sequence-annotation storage and retrieval projects.  Bioperl's
1393 LargeSeq and LiveSeq objects are designed to address these two
1394 situations.
1395
1396 LargeSeq
1397
1398 A LargeSeq object is a SeqI compliant object that stores a sequence as
1399 a series of files in a temporary directory (see sect II.1 for a
1400 definition of SeqI objects). The aim is to enable storing very large
1401 sequences (eg, E<gt> 100MBases) without running out of memory and, at the
1402 same time, preserving the familiar bioperl Seq object interface.  As a
1403 result, from the users perspective, using a LargeSeq object is almost
1404 identical to using a Seq object.  The principal difference is in the
1405 format used in the SeqIO calls.  Another difference is that the user
1406 must remember to only read in small chunks of the sequence at one
1407 time.  These differences are illustrated in the following code:
1408
1409   $seqio = new Bio::SeqIO('-format'=>'largefasta',
1410                           '-file'  =>'t/genomic-seq.fasta');
1411   $pseq = $seqio->next_seq();
1412   $plength = $pseq->length();
1413   $last_4 = $pseq->subseq($plength-3,$plength);  # this is OK
1414
1415   #On the other hand, the next statement would
1416   #probably cause the machine to run out of memory
1417   #$lots_of_data = $pseq->seq();  #NOT OK for a large LargeSeq object
1418
1419 LiveSeq
1420
1421 The LiveSeq object addresses the need for a sequence object capable of
1422 handling sequence data that may be changing over time.  In such a
1423 sequence, the precise locations of features along the sequence may
1424 change.  LiveSeq deals with this issue by re-implementing the sequence
1425 object internally as a "double linked chain." Each element of the
1426 chain is connected to other two elements (the PREVious and the NEXT
1427 one). There is no absolute position (like in an array), hence if
1428 positions are important, they need to be computed (methods are
1429 provided). Otherwise it's easy to keep track of the elements with
1430 their "LABELs". There is one LABEL (think of it as a pointer) to each
1431 ELEMENT. The labels won't change after insertions or deletions of the
1432 chain. So it's always possible to retrieve an element even if the
1433 chain has been modified by successive insertions or deletions.
1434
1435 Although the implementation of the LiveSeq object is novel, its
1436 bioperl user interface is unchanged since LiveSeq implements a
1437 PrimarySeqI interface (recall PrimarySeq is the subset of Seq without
1438 annotations or SeqFeatures - see sec II.1).  Consequently syntax for
1439 using LiveSeq objects is familiar although a modified version of SeqIO
1440 called LiveSeq::IO::Bioperl needs to be used to actually load the
1441 data, eg:
1442
1443   $loader=Bio::LiveSeq::IO::BioPerl->load('-db'=>"EMBL",
1444                                           '-file'=>"t/factor7.embl");
1445   $gene=$loader->gene2liveseq('-gene_name' => "factor7");
1446   $id = $gene->get_DNA->display_id ;
1447   $maxstart = $gene->maxtranscript->start;
1448
1449 Creating, maintaining and querying of LiveSeq genes is quite memory
1450 and processor intensive.  Consequently, any additional information
1451 relating to mutational changes in a gene need to be stored separately
1452 from the sequence data itself. The next section describes the mutation
1453 and polymorphism objects used to accomplish this.
1454
1455
1456 =head2 III.7.3 Representing related sequences - mutations,
1457 polymorphisms etc (Allele, SeqDiff,)
1458
1459 Bio::LiveSeq::Mutation object allows for a basic description of a
1460 sequence change in DNA or cDNA sequence of a gene.
1461 Bio::LiveSeq::Mutator takes in mutations, applies them to a LiveSeq
1462 gene and returns a set of Bio::Variation objects describing the net
1463 effect of the mutation on the gene at the DNA, RNA and protein stages.
1464
1465 The objects in Bio::Variation and Bio::LiveSeq directory were
1466 originally designed for the "Computational Mutation Expression
1467 Toolkit" project at European Bioinformatics Institute (EBI). The
1468 result of using them to mutate a gene is a holder object, 'SeqDiff',
1469 that can be printed out or queried for specific information. For
1470 example, to find out if restriction enzyme changes caused by a
1471 mutation are exactly the same in DNA and RNA sequences, we can write:
1472
1473   use Bio::LiveSeq::IO::BioPerl;
1474   use Bio::LiveSeq::Mutator;
1475   use Bio::LiveSeq::Mutation;
1476
1477   $loader=Bio::LiveSeq::IO::BioPerl->load('-file' => "$filename");
1478   $gene=$loader->gene2liveseq('-gene_name' => $gene_name);
1479   $mutation = new Bio::LiveSeq::Mutation ('-seq' =>'G',
1480                                           '-pos' => 100,
1481                                          );
1482   $mutate = Bio::LiveSeq::Mutator->new('-gene' => $gene,
1483                                        '-numbering' => "coding"
1484                                       );
1485   $mutate->add_Mutation($mutation);
1486   $seqdiff = $mutate->change_gene();
1487   $DNA_re_changes = $seqdiff->DNAMutation->restriction_changes;
1488   $RNA_re_changes = $seqdiff->RNAChange->restriction_changes;
1489   $DNA_re_changes eq $RNA_re_changes or print "Different!\n";
1490
1491 For a complete working script, see the change_gene.pl script
1492 in the examples directory. For more details on the use of these objects
1493 see the documentation in the modules as well as the original
1494 documentation for the "Computational Mutation Expression Toolkit"
1495 project at http://www.ebi.ac.uk/mutations/toolkit/.
1496
1497 =head2 III.7.4 Sequence XML representations - generation and parsing
1498 (SeqIO::game)
1499
1500 The previous subsections have described tools for automated sequence
1501 annotation by the creation of an "object layer" on top of a
1502 traditional database structure.  XML takes a somewhat different
1503 approach.  In XML, the data structure is unmodified, but machine
1504 readability is facilitated by using a data-record syntax with special
1505 flags and controlled vocabulary.
1506
1507 Bioperl supports a set of XML flags and vocabulary words for molecular
1508 biology - called bioxml - detailed at
1509 http://www.bioxml.org/dtds/current/ The idea is that any bioxml
1510 features can be turned into bioperl Bio::Seq annotations.  Conversely
1511 Seq object features and annotations can be converted to XML so that
1512 they become available to any other systems that are XML (and bioxml)
1513 compliant.  Typical usage is shown below. No special syntax is
1514 required by the user. Note that some Seq annotation will be lost when
1515 using bioxml in this manner - since in its current implementation,
1516 bioxml does not support all the annotation information available in
1517 Seq objects.
1518
1519   $str = Bio::SeqIO->new('-file'=> 't/test.game',
1520                          '-format' => 'game');
1521   $seq = $str->next_primary_seq();
1522   $id = $seq->id;
1523   @feats = $seq->all_SeqFeatures();
1524   $first_primary_tag = $feats[0]->primary_tag;
1525
1526
1527 =head1 IV.  Related projects - biocorba, biopython, biojava, Ensembl,
1528  AnnotationWorkbench / bioperl-gui
1529
1530 There are several "sister projects" to bioperl currently under
1531 development. These include biocorba, biopython, biojava, Ensembl, and
1532 the Annotation Workbench (which includes Bioperl-gui).  These are all
1533 large complex projects and describing them in detail here will not be
1534 attempted.  However a brief introduction seems appropriate since, in
1535 the future, they may each provide significant added utility to the
1536 bioperl user.
1537
1538 =head2 IV.1 Biocorba
1539
1540 Interface objects have facilitated interoperability between bioperl
1541 and other perl packages such as Ensembl and the Annotation Workbench.
1542 However, interoperability between bioperl and packages written in
1543 other languages requires additional support software.  CORBA is one
1544 such framework for interlanguage support, and the biocorba project is
1545 currently implementing a CORBA interface for bioperl.  With biocorba,
1546 objects written within bioperl will be able to communicate with
1547 objects written in biopython and biojava (see the next subsection).
1548 For more information, se the biocorba project website at
1549 http://biocorba.org/.
1550
1551 =head2 IV.2 Biopython and biojava
1552
1553 Biopython and biojava are open source projects with very similar goals
1554 to bioperl.  However their code is implemented in python and java,
1555 respectively.  With the development of interface objects and biocorba,
1556 it is possible to write java or python objects which can be accessed
1557 by a bioperl script.  Or to call bioperl objects from java or python
1558 code.  Since biopython and biojava are more recent projects than
1559 bioperl, most effort to date has been to port bioperl functionality to
1560 biopython and biojava rather than the other way around.  However, in
1561 the future, some bioinformatics tasks may prove to be more effectively
1562 implemented in java or python in which case being able to call them
1563 from within bioperl will become more important.  For more information,
1564 go to the biojava http://biojava.org/ and biopython
1565 http://biopython.org/ websites.
1566
1567 =head2 IV.3  Ensembl
1568
1569 Ensembl is an ambitious automated-genome-annotation project at EBI.
1570 Much of Ensembl's code is based on bioperl and Ensembl developers, in
1571 turn, have contributed significant pieces of code to bioperl.  In
1572 particular, the bioperl code for automated sequence annotation has
1573 been largely contributed by Ensembl developers. (The close association
1574 between bioperl and Ensembl is not surprising, since the same
1575 individual - Ewan Birney - has been coordinating both projects).
1576 Describing Ensembl and its capabilities is far beyond the scope of
1577 this tutorial The interested reader is referred to the Ensembl website
1578 at http://www.ensembl.org/ .
1579
1580 =head2 IV.4 The Annotation Workbench and bioperl-gui
1581
1582 The Annotation Workbench (AW) being developed at the Plant
1583 Biotechnology Institute of the National Research Council of Canada is
1584 designed to be an integrated suite of tools for examining a sequence,
1585 predicting gene structure, and creating annotations. The workbench
1586 features a graphical user interface and is implemented completely in
1587 perl.  Information about the AW is available at
1588 http://bioinfo.pbi.nrc.ca/dblock/wiki/html/Bioinfo/AnnotationWorkbench.htm . A
1589 goal of the AW team is to port much of the functionality of the AW to
1590 bioperl.  The porting process has begun and displaying a Seq object
1591 graphically is now possible.  You can download the current version of
1592 the gui software from the bioperl bioperl-gui CVS directory at
1593 http://cvs.bioperl.org/cgi-bin/viewcvs/viewcvs.cgi/bioperl-gui/?cvsroot=bioperl
1594  .
1595
1596
1597 =head2 V.1 Appendix: Public Methods of Bioperl Objects
1598
1599 Appendix V.1 lists the public methods for the principal bioperl
1600 objects.  The methods are grouped together under the name of the
1601 "ancestor" object from which the object inherits the method. Knowing
1602 the name of the ancestor object is useful since the ancestor module
1603 will contain the documentation describing the method.
1604
1605 Since nearly all bioperl objects inherit from Bio::Root::RootI, the
1606 methods inherited from Bio::Root::RootI are listed separately - and
1607 only once.
1608
1609 If a listing of public methods for a bioperl object not listed in this
1610 appendix is required, the listing can be obtained by running the
1611 bptutorial.pl script using with command number 100 as in:
1612
1613   perl -w  bptutorial.pl 100 Bio::Tools::SeqStats
1614
1615 (where Bio::Tools::SeqStats would be replaced with the name of the
1616 bioperl object for which the methods list is needed - see Appendix V.2
1617 for details)
1618
1619 B<Methods for Object Bio::Root::RootI>
1620
1621 I<Methods taken from package Bio::Root::RootI>
1622
1623  DESTROY gensym new qualify qualify_to_ref stack_trace
1624  stack_trace_dump tempdir tempfile throw ungensym verbose warn
1625
1626 B<Methods for Object Bio::AlignIO>
1627
1628 I<Methods taken from package Bio::AlignIO>
1629
1630  PRINT READLINE TIEHANDLE close fh newFh next_aln write_aln
1631
1632 B<Methods for Object Bio::DB::GenBank>
1633
1634 I<Methods taken from package Bio::DB::NCBIHelper>
1635
1636  get_Stream_by_batch get_params
1637
1638 I<Methods taken from package Bio::DB::RandomAccessI>
1639
1640  get_Seq_by_acc get_Seq_by_id
1641
1642 I<Methods taken from package Bio::DB::WebDBSeqI>
1643
1644  GET HEAD POST PUT default_format get_Stream_by_acc get_Stream_by_id
1645  get_request get_seq_stream postprocess_data proxy request_format
1646  retrieval_type ua url_base_address url_params
1647
1648 B<Methods for Object Bio::Index::Fasta>
1649
1650 I<Methods taken from package Bio::DB::RandomAccessI>
1651
1652  get_Seq_by_acc get_Seq_by_id
1653
1654 I<Methods taken from package Bio::DB::SeqI>
1655
1656  get_PrimarySeq_stream get_Seq_by_primary_id get_all_primary_ids
1657
1658 I<Methods taken from package Bio::Index::Abstract>
1659
1660  O_CREAT O_RDWR add_record db dbm_package filename get_stream
1661  make_index open_dbm pack_record unpack_record write_flag
1662
1663 I<Methods taken from package Bio::Index::AbstractSeq>
1664
1665  fetch
1666
1667 I<Methods taken from package Bio::Index::Fasta>
1668
1669  default_id_parser id_parser
1670
1671 B<Methods for Object Bio::LocatableSeq>
1672
1673 I<Methods taken from package Bio::LocatableSeq>
1674
1675  get_nse
1676
1677 I<Methods taken from package Bio::PrimarySeqI>
1678
1679  GCG_checksum accession_number ary can_call_new desc display_id getseq
1680  id moltype out_fasta primary_id revcom seq seq_len setseq str subseq
1681  translate translate_old trunc type
1682
1683 I<Methods taken from package Bio::RangeI>
1684
1685  carp confess contains croak end equals intersection length new
1686  overlap_extent overlaps start strand union
1687
1688 I<Methods taken from package Bio::Seq>
1689
1690  accession add_SeqFeature add_date add_secondary_accession annotation
1691  division each_date each_secondary_accession feature_count keywords
1692  molecule primary_seq species sv
1693
1694 I<Methods taken from package Bio::SeqI>
1695
1696  all_SeqFeatures top_SeqFeatures write_GFF
1697
1698 B<Methods for Object Bio::Seq>
1699
1700 I<Methods taken from package Bio::PrimarySeqI>
1701
1702  GCG_checksum accession_number ary can_call_new carp confess croak
1703  desc display_id getseq id length moltype out_fasta primary_id revcom
1704  seq seq_len setseq str subseq translate translate_old trunc type
1705
1706 I<Methods taken from package Bio::Seq>
1707
1708  accession add_SeqFeature add_date add_secondary_accession annotation
1709  division each_date each_secondary_accession feature_count keywords
1710  molecule primary_seq species sv
1711
1712 I<Methods taken from package Bio::SeqI>
1713
1714  all_SeqFeatures top_SeqFeatures write_GFF
1715
1716 B<Methods for Object Bio::SeqIO>
1717
1718 I<Methods taken from package Bio::SeqIO>
1719
1720  PRINT READLINE TIEHANDLE close fh moltype newFh next_primary_seq
1721  next_seq write_seq
1722
1723 B<Methods for Object Bio::SimpleAlign>
1724
1725 I<Methods taken from package Bio::SimpleAlign>
1726
1727  addSeq alpha_startend column_from_residue_number consensus_aa
1728  consensus_string eachSeq eachSeqWithId each_alphabetically
1729  get_displayname id is_flush length_aln map_chars
1730  maxdisplayname_length maxname_length maxnse_length no_residues
1731  no_sequences percentage_identity purge read_MSF read_Pfam
1732  read_Pfam_file read_Prodom read_fasta read_mase read_selex
1733  read_stockholm removeSeq set_displayname set_displayname_count
1734  set_displayname_flat set_displayname_normal sort_alphabetically
1735  uppercase write_MSF write_Pfam write_clustalw write_fasta write_selex
1736
1737 B<Methods for Object Bio::Tools::BPbl2seq>
1738
1739 I<Methods taken from package Bio::RangeI>
1740
1741  carp confess contains croak end equals intersection length new
1742  overlap_extent overlaps start strand union
1743
1744 I<Methods taken from package Bio::SeqFeature::FeaturePair>
1745
1746  feature1 feature2 hend hframe hprimary_tag hscore hseqname
1747  hsource_tag hstart hstrand invert
1748
1749 I<Methods taken from package Bio::SeqFeature::Generic>
1750
1751  add_sub_SeqFeature add_tag_value annotation attach_seq entire_seq
1752  flush_sub_SeqFeature frame remove_tag score seq seqname
1753  slurp_gff_file
1754
1755 I<Methods taken from package Bio::SeqFeature::SimilarityPair>
1756
1757  bits from_searchResult query significance subject
1758
1759 I<Methods taken from package Bio::SeqFeatureI>
1760
1761  all_tags each_tag_value gff2_string gff_string has_tag primary_tag
1762  source_tag sub_SeqFeature
1763
1764 I<Methods taken from package Bio::Tools::BPbl2seq>
1765
1766  P homologySeq hs match percent positive qs querySeq sbjctSeq ss
1767
1768 B<Methods for Object Bio::Tools::BPlite>
1769
1770 I<Methods taken from package Bio::Tools::BPlite>
1771
1772  database fh nextSbjct pattern qlength query query_pattern_location
1773
1774 B<Methods for Object Bio::Tools::BPpsilite>
1775
1776 I<Methods taken from package Bio::Tools::BPpsilite>
1777
1778  database fh number_of_iterations pattern qlength query
1779  query_pattern_location round
1780
1781 B<Methods for Object Bio::Tools::Blast>
1782
1783 I<Methods taken from package Bio::Root::Object>
1784
1785  clear_err clone compress_file containment debug delete_file destroy
1786  display dont_warn err err_state err_string fatal_on_warn fh file
1787  file_date find_object has_name has_warning index make monitor name
1788  parent print_err read record_err set_display set_err_data set_log_err
1789  set_read set_stats show src_obj strict strictness terse testing
1790  to_string uncompress_file verbosity warn_on_fatal xref
1791
1792 I<Methods taken from package Bio::Tools::Blast>
1793
1794  ambiguous_aln carp confess croak db_local db_remote expect filter
1795  gap_creation gap_extension gapped highest_expect highest_p
1796  highest_signif hit hits homol_data is_signif karlin_altschul
1797  lowest_expect lowest_p lowest_signif matrix min_length num_hits
1798  overlap s signif signif_fmt table table_labels table_labels_tiled
1799  table_tiled to_html word_size
1800
1801 I<Methods taken from package Bio::Tools::SeqAnal>
1802
1803  best database database_letters database_release database_seqs date
1804  length parse program program_version query query_desc roman2int run
1805  set_date
1806
1807 I<Methods taken from package Exporter>
1808
1809  export export_fail export_ok_tags export_tags export_to_level import
1810  require_version
1811
1812 B<Methods for Object Bio::Tools::CodonTable>
1813
1814 I<Methods taken from package Bio::Root::RootI>
1815
1816  DESTROY gensym new qualify qualify_to_ref stack_trace
1817  stack_trace_dump tempdir tempfile throw ungensym verbose warn
1818
1819 I<Methods taken from package Bio::Tools::CodonTable>
1820
1821  id is_start_codon is_ter_codon is_unknown_codon name revtranslate
1822  translate translate_strict
1823
1824 B<Methods for Object Bio::Tools::Genscan>
1825
1826 I<Methods taken from package Bio::SeqAnalysisParserI>
1827
1828  carp confess croak next_feature parse
1829
1830 I<Methods taken from package Bio::Tools::AnalysisResult>
1831
1832  analysis_date analysis_method analysis_method_version analysis_query
1833  analysis_subject close
1834
1835 I<Methods taken from package Bio::Tools::Genscan>
1836
1837  next_prediction
1838
1839 B<Methods for Object Bio::Tools::HMMER::Results>
1840
1841 I<Methods taken from package Bio::Tools::HMMER::Results>
1842
1843  add_Domain add_Set carp confess croak dictate_hmm_acc
1844  domain_bits_cutoff_from_evalue each_Domain each_Set filter_on_cutoff
1845  get_Set get_unit_nse highest_noise lowest_true number write_FT_output
1846  write_GDF write_GDF_bits write_ascii_out write_scores_bits
1847
1848 B<Methods for Object Bio::Tools::OddCodes>
1849
1850 I<Methods taken from package Bio::Tools::OddCodes>
1851
1852  Dayhoff Sneath Stanfel charge chemical functional hydrophobic
1853  structural
1854
1855 B<Methods for Object Bio::Tools::RestrictionEnzyme>
1856
1857 I<Methods taken from package Bio::Tools::RestrictionEnzyme>
1858
1859  annotate_seq available available_list cut_locations cut_seq
1860  cuts_after is_available name palindromic revcom seq site string
1861
1862 I<Methods taken from package Exporter>
1863
1864  export   export_fail   export_ok_tags   export_tags   export_to_level   import
1865  require_version
1866
1867 B<Methods for Object Bio::Tools::Run::Alignment::Clustalw>
1868
1869 I<Methods taken from package Bio::Tools::Run::Alignment::Clustalw>
1870
1871  AUTOLOAD align exists_clustal profile_align
1872
1873 B<Methods for Object Bio::Tools::Run::Alignment::TCoffee>
1874
1875 I<Methods taken from package Bio::Tools::Run::Alignment::TCoffee>
1876
1877  AUTOLOAD align exists_tcoffee profile_align
1878
1879 B<Methods for Object Bio::Tools::Run::StandAloneBlast>
1880
1881 I<Methods taken from package Bio::Tools::Run::StandAloneBlast>
1882
1883  AUTOLOAD bl2seq blastall blastpgp exists_blast
1884
1885 B<Methods for Object Bio::Tools::SeqPattern>
1886
1887 I<Methods taken from package Bio::Root::Object>
1888
1889  clear_err clone compress_file containment debug delete_file destroy
1890  display dont_warn err err_state err_string fatal_on_warn fh file
1891  file_date find_object has_name has_warning index make monitor name
1892  parent print_err read record_err set_display set_err_data set_log_err
1893  set_read set_stats show src_obj strict strictness terse testing
1894  to_string uncompress_file verbosity warn_on_fatal xref
1895
1896 I<Methods taken from package Bio::Tools::SeqPattern>
1897
1898  alphabet_ok expand revcom str type
1899
1900 B<Methods for Object Bio::Tools::SeqStats>
1901
1902 I<Methods taken from package Bio::Tools::SeqStats>
1903
1904  count_codons count_monomers get_mol_wt
1905
1906 B<Methods for Object Bio::Tools::SeqWords>
1907
1908 I<Methods taken from package Bio::Root::Object>
1909
1910  clear_err clone compress_file containment debug delete_file destroy
1911  display dont_warn err err_state err_string fatal_on_warn fh file
1912  file_date find_object has_name has_warning index make monitor name
1913  parent print_err read record_err set_display set_err_data set_log_err
1914  set_read set_stats show src_obj strict strictness terse testing
1915  to_string uncompress_file verbosity warn_on_fatal xref
1916
1917 I<Methods taken from package Bio::Tools::SeqWords>
1918
1919  count_words
1920
1921 B<Methods for Object Bio::Tools::Sigcleave>
1922
1923 I<Methods taken from package Bio::PrimarySeqI>
1924
1925  GCG_checksum accession_number ary can_call_new carp confess croak
1926  desc display_id getseq id length moltype out_fasta primary_id revcom
1927  seq seq_len setseq str subseq translate translate_old trunc type
1928
1929 I<Methods taken from package Bio::Seq
1930
1931  accession add_SeqFeature add_date add_secondary_accession annotation
1932  division each_date each_secondary_accession feature_count keywords
1933  molecule primary_seq species sv
1934
1935 I<Methods taken from package Bio::SeqI>
1936
1937  all_SeqFeatures   top_SeqFeatures   write_GFF
1938
1939 I<Methods taken from package Bio::Tools::Sigcleave>
1940
1941  debug dont_warn fatal_on_warn monitor pretty_print signals strictness
1942  testing threshold verbosity warn_on_fatal
1943
1944 B<Methods for Object Bio::Tools::Sim4::Results>
1945
1946 I<Methods taken from package Bio::SeqAnalysisParserI>
1947
1948  carp confess croak next_feature parse
1949
1950 I<Methods taken from package Bio::Tools::AnalysisResult>
1951
1952  analysis_date analysis_method analysis_method_version analysis_query
1953  analysis_subject close
1954
1955 I<Methods taken from package Bio::Tools::Sim4::Results>
1956
1957  basename dirname fileparse fileparse_set_fstype next_exonset
1958  parse_next_alignment
1959
1960 B<Methods for Object Bio::Tools::pSW>
1961
1962 I<Methods taken from package Bio::Tools::AlignFactory>
1963
1964  kbyte report set_memory_and_report
1965
1966 I<Methods taken from package Bio::Tools::pSW>
1967
1968  align_and_show ext gap matrix pairwise_alignment
1969
1970 B<Methods for Object Bio::UnivAln>
1971
1972 I<Methods taken from package Bio::UnivAln>
1973
1974  abort access acos aln alphabet_check asctime asin atan basename carp
1975  catch ceil clock col_descs col_ids complement confess consensus copy
1976  cosh croak ctermid ctime cuserid desc descffmt difftime dirname dup
1977  dup2 equal_nogaps equalize_lengths ffmt fileparse
1978  fileparse_set_fstype floor fmod fpathconf frexp gap_free_cols
1979  gap_free_inds gap_free_sites height id inplace invar_inds invar_sites
1980  isalnum isalpha iscntrl isdigit isgraph islower isprint ispunct
1981  isspace isupper isxdigit layout ldexp localeconv log10 lseek map_c
1982  map_r mblen mbstowcs mbtowc mkfifo mktime modf names new
1983  no_allgap_inds no_allgap_sites numbering out_bad out_fasta out_fasta2
1984  out_raw out_raw2 out_readseq parse parse_bad parse_fasta parse_raw
1985  parse_unknown pathconf pause remove_gaps revcom reverse row_descs
1986  row_ids samelength seqs setlocale setpgid setsid sigaction sigpending
1987  sigprocmask sigsuspend sinh special_free_inds special_free_sites
1988  strcoll strftime strtod strtol strtoul strxfrm sysconf tan tanh
1989  tcdrain tcflow tcflush tcgetpgrp tcsendbreak tcsetpgrp tmpnam ttyname
1990  type tzname tzset uname unknown_free_inds unknown_free_sites var_inds
1991  var_sites wcstombs wctomb width
1992
1993 I<Methods taken from package Exporter>
1994
1995  export export_fail export_ok_tags export_tags export_to_level import
1996  require_version
1997
1998 B<Methods for Object Bio::Variation::Allele>
1999
2000 I<Methods taken from package Bio::DBLinkContainerI>
2001
2002  carp confess croak each_DBLink
2003
2004 I<Methods taken from package Bio::PrimarySeqI>
2005
2006  GCG_checksum accession_number ary can_call_new desc display_id getseq
2007  id length moltype out_fasta primary_id revcom seq seq_len setseq str
2008  subseq translate translate_old trunc type
2009
2010 I<Methods taken from package Bio::Variation::Allele>
2011
2012  add_DBLink is_reference repeat_count repeat_unit
2013
2014 B<Methods for Object Bio::Variation::DNAMutation>
2015
2016 I<Methods taken from package Bio::DBLinkContainerI>
2017
2018  carp confess croak each_DBLink
2019
2020 I<Methods taken from package Bio::RangeI>
2021
2022  contains end equals intersection length new overlap_extent overlaps
2023  start strand union
2024
2025 I<Methods taken from package Bio::SeqFeature::Generic>
2026
2027  add_sub_SeqFeature add_tag_value annotation attach_seq entire_seq
2028  flush_sub_SeqFeature frame remove_tag score seq seqname
2029  slurp_gff_file
2030
2031 I<Methods taken from package Bio::SeqFeatureI>
2032
2033  all_tags each_tag_value gff2_string gff_string has_tag primary_tag
2034  source_tag sub_SeqFeature
2035
2036 I<Methods taken from package Bio::Variation::DNAMutation>
2037
2038  CpG RNAChange sysname
2039
2040 I<Methods taken from package Bio::Variation::VariantI>
2041
2042  SeqDiff add_Allele add_DBLink allele_mut allele_ori dnStreamSeq
2043  each_Allele id isMutation label mut_number numbering proof region
2044  region_value restriction_changes status upStreamSeq
2045
2046 B<Methods for Object Bio::Variation::SeqDiff>
2047
2048 I<Methods taken from package Bio::Variation::SeqDiff>
2049
2050  aa_mut aa_ori add_Gene add_Variant alignment cds_end cds_start
2051  chromosome description dna_mut dna_ori each_Gene each_Variant
2052  gene_symbol id moltype numbering offset rna_id rna_mut rna_offset
2053  rna_ori seqobj sysname trivname
2054
2055 =head2 V.2 Appendix: Tutorial demo scripts
2056
2057 The following scripts demonstrate many of the features of bioperl. To
2058 run all the demos run:
2059
2060  > perl -w  bptutorial.pl 0
2061
2062 To run a subset of the scripts do
2063
2064  > perl -w  bptutorial.pl
2065
2066 and use the displayed help screen.
2067
2068 =cut
2069
2070 #!/usr/bin/perl
2071
2072 # PROGRAM  : bptutorial.pl
2073 # PURPOSE  : Demonstrate various uses of the bioperl package
2074 # AUTHOR   : Peter Schattner schattner@alum.mit.edu
2075 # CREATED  : Dec 15 2000
2076 # REVISION : $Id$
2077
2078 use strict;
2079 use Bio::SimpleAlign;
2080 use Bio::AlignIO;
2081 use Bio::SeqIO;
2082 use Bio::Seq;
2083
2084 # subroutine references
2085
2086 my ($access_remote_db, $index_local_db, $fetch_local_db,
2087     $sequence_manipulations, $seqstats_and_seqwords,
2088     $restriction_and_sigcleave, $other_seq_utilities,
2089     $run_standaloneblast,  $blast_parser, $bplite_parsing, $hmmer_parsing,
2090     $run_clustalw_tcoffee, $run_psw_bl2seq, $simplealign_univaln,
2091     $gene_prediction_parsing, $sequence_annotation, $largeseqs,
2092     $liveseqs, $demo_variations, $demo_xml, $display_help, $bpinspect1 );
2093
2094 # global variable file names.  Edit these if you want to try
2095 #out a tutorial script on a different file
2096 my $dna_seq_file = 't/dna1.fa';      # used in $sequence_manipulations
2097 my $amino_seq_file = 't/cysprot1.fa';  # used in $other_seq_utilities
2098                                        #and $sequence_annotation
2099 my $blast_report_file = 't/blast.report';    # used in $blast_parser
2100 my $bp_parse_file1 = 't/blast.report';       # used in $bplite_parsing
2101 my $bp_parse_file2 = 't/psiblastreport.out'; # used in $bplite_parsing
2102 my $bp_parse_file3 = 't/bl2seq.out';         # used in $bplite_parsing
2103 my $unaligned_amino_file = 't/cysprot1a.fa'; # used in $run_clustalw_tcoffee
2104 my $aligned_amino_file = 't/testaln.msf';    # used in $simplealign_univaln
2105
2106 # other global variables
2107 my (@runlist, $n );
2108
2109
2110
2111 ##############################
2112 #  display_help
2113 $display_help = sub {
2114
2115
2116 # Prints usage information for tutorial script.
2117
2118     print STDERR <<"QQ_PARAMS_QQ";
2119
2120 The following numeric arguments can be passed to run the corresponding demo-script.
2121 1 => access_remote_db ,
2122 2 => index_local_db ,
2123 3 => fetch_local_db ,               (# NOTE: needs to be run with demo 2)
2124 4 => sequence_manipulations ,
2125 5 => seqstats_and_seqwords ,
2126 6 => restriction_and_sigcleave ,
2127 7 => other_seq_utilities ,
2128 8 => run_standaloneblast ,
2129 9 => blast_parser ,
2130 10 => bplite_parsing ,
2131 11 => hmmer_parsing ,
2132 12 => run_clustalw_tcoffee ,
2133 13 => run_psw_bl2seq ,
2134 14 => simplealign_univaln ,
2135 15 => gene_prediction_parsing ,
2136 16 => sequence_annotation ,
2137 17 => largeseqs ,
2138 18 => liveseqs ,
2139 19 => demo_variations ,
2140 20 => demo_xml ,
2141
2142 In addition the argument "100" followed by the name of a single
2143 bioperl object will display a list of all the public methods
2144 available from that object and from what object they are inherited.
2145
2146 Using the parameter "0" will run all tests.
2147 Using any other argument (or no argument) will run this display.
2148
2149 So typical command lines might be:
2150 To run all demo scripts:
2151  > perl -w  bptutorial.pl 0
2152 or to just run the local indexing demos:
2153  > perl -w  bptutorial.pl 2 3
2154 or to list all the methods available for object Bio::Tools::SeqStats -
2155  > perl -w  bptutorial.pl 100 Bio::Tools::SeqStats
2156
2157 QQ_PARAMS_QQ
2158
2159 exit(0);
2160 };
2161
2162
2163 ## Note: "main" routine is at very end of script
2164
2165 #################################################
2166 #   access_remote_db():
2167 #
2168
2169 $access_remote_db = sub {
2170
2171     use Bio::DB::GenBank;
2172
2173     print "Beginning remote database access example... \n";
2174
2175     # III.2.1 Accessing remote databases (Bio::DB::GenBank, etc)
2176
2177     my ($gb, $seq1, $seq2, $seq1_id, $seqobj, $seq2_id,$seqio);
2178
2179     eval {
2180         $gb = new Bio::DB::GenBank();
2181         $seq1 = $gb->get_Seq_by_id('MUSIGHBA1');
2182     };
2183
2184 if ($@ || !$seq1) {
2185     warn "Warning: Couldn't connect to Genbank with Bio::DB::GenBank.pm!\nProbably no network access.\n Skipping Test\n";
2186     return 0;
2187 }
2188     $seq1_id =  $seq1->display_id();
2189     print "seq1 display id is $seq1_id \n";
2190     $seq2 = $gb->get_Seq_by_acc('AF303112');
2191     $seq2_id =  $seq2->display_id();
2192     print "seq2 display id is $seq2_id \n";
2193     $seqio = $gb->get_Stream_by_batch([ qw(2981014 J00522 AF303112)]);
2194     $seqobj = $seqio->next_seq();
2195     print "Display id of first sequence in stream is ", $seqobj->display_id()  ,"\n";
2196     return 1;
2197 } ;
2198
2199
2200 #################################################
2201 #  index_local_db ():
2202 #
2203
2204 $index_local_db = sub {
2205
2206     use Bio::Index::Fasta; # using fasta file format
2207     my ( $Index_File_Name, $inx1, $inx2, $id, $dir, $key,
2208          $keyfound, $seq, $indexhash);
2209     print "\nBeginning indexing local_db example... \n";
2210
2211     # This subroutine unlikely to run if unless OS  = unix
2212
2213     # III.2.2 Indexing and accessing local databases
2214     # (Bio::Index::*,  bpindex.pl,  bpfetch.pl)
2215
2216     # first create the index
2217     $Index_File_Name = 'bptutorial.indx';
2218
2219     eval { use Cwd; $dir = cwd; };
2220     # CWD not installed, revert to unix behavior, best we can do
2221     if( $@) { $dir = `pwd`;}
2222
2223
2224     $inx1 = Bio::Index::Fasta->new
2225         ('-FILENAME' => "$dir/$Index_File_Name",
2226          '-write_flag' => 1);
2227     $inx1->make_index("$dir/t/multifa.seq");
2228     $inx1->make_index("$dir/t/seqs.fas");
2229     print "Finished indexing local_db example... \n";
2230
2231     return 1;
2232 } ;
2233
2234
2235 #################################################
2236 #  fetch_local_db ():
2237 #
2238
2239 $fetch_local_db = sub {
2240
2241     use Bio::Index::Fasta; # using fasta file format
2242     my ( $Index_File_Name, $inx2, $id, $dir, $key,
2243          $keyfound, $seq, $indexhash);
2244     print "\nBeginning retrieving local_db example... \n";
2245
2246     # then retrieve some files
2247     #$Index_File_Name = shift;
2248     $Index_File_Name = 'bptutorial.indx';
2249
2250     eval { use Cwd; $dir = cwd; };
2251     # CWD not installed, revert to unix behavior, best we can do
2252     if( $@) { $dir = `pwd`;}
2253
2254     $inx2 = Bio::Index::Abstract->new
2255         ('-FILENAME'  => "$dir/$Index_File_Name");
2256
2257     $indexhash = $inx2->db();
2258     $keyfound = "";
2259     foreach $key (sort keys %$indexhash) {
2260         if($key =~ /97/) {$keyfound = 'true' ; $id = $key; last};
2261     }
2262     if ($keyfound) {
2263         $seq = $inx2->fetch($id);
2264         print "Sequence ", $seq->display_id(),
2265         " has length  ", $seq->length()," \n";
2266     }
2267
2268     unlink "$dir/$Index_File_Name" ;
2269
2270     return 1;
2271 } ;
2272
2273
2274
2275 #################################################
2276 # sequence_manipulations  ():
2277 #
2278
2279 $sequence_manipulations = sub {
2280
2281     my ($infile, $in, $out, $seqobj);
2282     $infile = $dna_seq_file;
2283
2284     print "\nBeginning sequence_manipulations and SeqIO example... \n";
2285
2286
2287     # III.3.1 Transforming sequence files (SeqIO)
2288
2289     $in  = Bio::SeqIO->new('-file' => $infile ,
2290                            '-format' => 'Fasta');
2291     $seqobj = $in->next_seq();
2292
2293     # perl "tied filehandle" syntax is available to SeqIO,
2294     # allowing you to use the standard <> and print operations
2295     # to read and write sequence objects, eg:
2296     #$out = Bio::SeqIO->newFh('-format' => 'EMBL');
2297
2298     $out = Bio::SeqIO->newFh('-format' => 'fasta');
2299
2300     print "First sequence in fasta format... \n";
2301     print $out $seqobj;
2302
2303     # III.4 Manipulating individual sequences
2304
2305     # The following methods return strings
2306
2307
2308     print "Seq object display id is ",
2309     $seqobj->display_id(), "\n"; # the human read-able id of the sequence
2310     print "Sequence is ",
2311     $seqobj->seq()," \n";        # string of sequence
2312     print "Sequence from 5 to 10 is ",
2313     $seqobj->subseq(5,10)," \n"; # part of the sequence as a string
2314     print "Acc num is ",
2315     $seqobj->accession_number(), " \n"; # when there, the accession number
2316     print "Moltype is ",
2317     $seqobj->moltype(), " \n";    # one of 'dna','rna','protein'
2318     print "Primary id is ", $seqobj->primary_seq->primary_id(),
2319     " \n"; # a unique id for this sequence irregardless
2320     #print "Primary id is ", $seqobj->primary_id(), " \n";
2321     # a unique id for this sequence irregardless
2322     # of its display_id or accession number
2323
2324     # The following methods return an array of  Bio::SeqFeature objects
2325     $seqobj->top_SeqFeatures; # The 'top level' sequence features
2326     $seqobj->all_SeqFeatures; # All sequence features, including sub
2327     # seq features
2328
2329     # The following methods returns new sequence objects,
2330     # but do not transfer features across
2331     # truncation from 5 to 10 as new object
2332     print "Truncated Seq object sequence is ",
2333     $seqobj->trunc(5,10)->seq(), " \n";
2334     # reverse complements sequence
2335     print "Reverse complemented sequence 5 to 10  is ",
2336     $seqobj->revcom->subseq(5,10), "  \n";
2337     # translation of the sequence
2338     print "Translated sequence 6 to 15 is ",
2339     $seqobj->translate->subseq(6,15), " \n";
2340
2341
2342     my $c = shift;
2343     $c ||= 'ctgagaaaataa';
2344
2345     print "\nBeginning 3-frame and alternate codon translation example... \n";
2346
2347     my $seq = new Bio::PrimarySeq('-SEQ' => $c,
2348                                   '-ID' => 'no.One');
2349     print "$c translated using method defaults   : ",
2350     $seq->translate->seq, "\n";
2351
2352     # Bio::Seq uses same sequence methods as PrimarySeq
2353     my $seq2 = new Bio::Seq('-SEQ' => $c, '-ID' => 'no.Two');
2354     print "$c translated as a coding region (CDS): ",
2355     $seq2->translate(undef, undef, undef, undef, 1)->seq, "\n";
2356
2357     print "\nTranslating in all six frames:\n";
2358     my @frames = (0, 1, 2);
2359     foreach my $frame (@frames) {
2360         print  " frame: ", $frame, " forward: ",
2361         $seq->translate(undef, undef, $frame)->seq, "\n";
2362         print  " frame: ", $frame, " reverse-complement: ",
2363         $seq->revcom->translate(undef, undef, $frame)->seq, "\n";
2364     }
2365
2366     print "Translating with all codon tables using method defaults:\n";
2367     my @codontables = qw( 1 2 3 4 5 6 9 10 11 12 13 14 15 16 21 );
2368     foreach my $ct (@codontables) {
2369         print $ct, " : ",
2370         $seq->translate(undef, undef, undef, $ct)->seq, "\n";
2371     }
2372
2373     return 1;
2374 } ;
2375
2376 #################################################                                      ;
2377 #  seqstats_and_seqwords ():
2378 #
2379
2380 $seqstats_and_seqwords = sub {
2381
2382     use Bio::Tools::SeqStats;
2383     use Bio::Tools::SeqWords;
2384
2385     print "\nBeginning seqstats_and_seqwords example... \n";
2386
2387
2388     my ($seqobj, $weight, $monomer_ref, $codon_ref,
2389         $seq_stats, $words, $hash);
2390     $seqobj = Bio::Seq->new
2391         ('-seq'=>'ACTGTGGCGTCAACTGACTGTGGCGTCAACTGACTGTGGGCGTCAACTGACTGTGGCGTCAACTG',
2392          '-moltype'=>'dna',
2393          '-id'=>'test');
2394
2395
2396     # III.4.1 Obtaining basic sequence statistics- MW,
2397     # residue &codon frequencies(SeqStats, SeqWord)
2398
2399
2400     $seq_stats  =  Bio::Tools::SeqStats->new($seqobj);
2401     $weight = $seq_stats->get_mol_wt();
2402     $monomer_ref = $seq_stats->count_monomers();
2403     $codon_ref = $seq_stats->count_codons();  # for nucleic acid sequence
2404     print "Sequence \'test\' is ", $seqobj->seq(), " \n";
2405     print "Sequence ", $seqobj->id()," molecular weight is $$weight[0] \n";
2406     # Note, $weight is an array reference
2407     print "Number of A\'s in sequence is $$monomer_ref{'A'} \n";
2408     print "Number of ACT codon\'s in sequence is $$codon_ref{'ACT'} \n";
2409
2410     $words = Bio::Tools::SeqWords->new($seqobj);
2411     $hash = $words->count_words(6);
2412     print "Number of 6-letter \'words\' of type ACTGTG in",
2413     " sequence is $$hash{'ACTGTG'} \n";
2414
2415     return 1;
2416     } ;
2417
2418
2419 #################################################
2420 # restriction_and_sigcleave  ():
2421 #
2422
2423 $restriction_and_sigcleave = sub {
2424
2425     use Bio::Tools::RestrictionEnzyme;
2426     use Bio::Tools::Sigcleave;
2427
2428     my ($re, $re1, $re2, @sixcutters, @fragments1,
2429         @fragments2, $seqobj, $dna);
2430     print "\nBeginning restriction enzyme example... \n";
2431
2432     # III.4.4 Identifying restriction enzyme sites (RestrictionEnzyme)
2433
2434     $dna = 'CCTCCGGGGACTGCCGTGCCGGGCGGGAATTCGCCATGGCGACC'.
2435         'CTGGAAAAGCTGATATCGAAGGCCTTCGA';
2436
2437     # Build sequence and restriction enzyme objects.
2438     $seqobj = new Bio::Seq('-ID'  => 'test_seq',
2439                            '-SEQ' =>$dna);
2440
2441     #$re  = new Bio::Tools::RestrictionEnzyme();
2442     $re  = new Bio::Tools::RestrictionEnzyme('-name'=>'EcoRI');
2443     @sixcutters = $re->available_list(6);
2444
2445     print "The following 6-cutters are available",
2446     @sixcutters," \n";
2447
2448     $re1  = new Bio::Tools::RestrictionEnzyme('-name'=>'EcoRI');
2449     @fragments1 =  $re1->cut_seq($seqobj);
2450     #$seqobj is the Seq object for the dna to be cut
2451
2452     print "When cutting ", $seqobj->display_id(),
2453     " with ", $re1->seq->id ," \n";
2454     print "the initial fragment is ", $fragments1[0]," \n";
2455
2456     $re2 = new Bio::Tools::RestrictionEnzyme
2457         ('-NAME' =>'EcoRV--GAT^ATC',
2458          '-MAKE' =>'custom');
2459     @fragments2 =  $re2->cut_seq($seqobj);
2460
2461     print "When cutting ", $seqobj->display_id(),
2462     " with ", $re2->seq->id ,", \n";
2463     print "the second fragment is ", $fragments2[1], " \n";
2464
2465     # III.4.7 Identifying amino acid cleavage sites (Sigcleave)
2466
2467     print "\nBeginning  signal cleaving site location example... \n";
2468
2469     my ( $sigcleave_object, %raw_results , $location,
2470          $formatted_output, $protein, $in, $seqobj2);
2471
2472     $in  = Bio::SeqIO->new('-file' => 't/cysprot1.fa' ,
2473                            '-format' => 'Fasta');
2474     $seqobj2 = $in->next_seq();
2475
2476     $protein = $seqobj2->seq;
2477     $formatted_output = "";
2478
2479     # Build object
2480     # Note that Sigcleave is passed a raw sequence
2481     # (or file containing a sequence)  rather than a
2482     # sequence object when it is created.
2483     $sigcleave_object = new Bio::Tools::Sigcleave
2484         ('-id' =>'test_sigcleave_seq',
2485          '-type' =>'amino',
2486          '-threshold' => 3.0,
2487          '-seq' =>$protein);
2488
2489     %raw_results      = $sigcleave_object->signals;
2490     $formatted_output = $sigcleave_object->pretty_print;
2491
2492     print " SigCleave \'raw results\': \n";
2493     foreach $location (sort keys %raw_results) {
2494         print " Cleave site found at location $location ",
2495         "with value of $raw_results{$location} \n";
2496     }
2497     print "\nSigCleave formatted output: \n $formatted_output \n";
2498     return 1;
2499 } ;
2500
2501
2502 #################################################
2503 #  run_standaloneblast ():
2504 #
2505
2506 $run_standaloneblast = sub {
2507     use Bio::Tools::Run::StandAloneBlast;
2508     use Bio::Tools::BPlite;
2509
2510     print "\nBeginning run_standaloneblast example... \n";
2511
2512     my (@params, $factory, $input, $blast_report, $blast_present,
2513         $database, $sbjct, $str, $seq1);
2514
2515     # III.5.1 Running BLAST locally  (StandAloneBlast)
2516
2517     $database = $_[0] || 'ecoli.nt'; # user can select local nt database
2518
2519     $blast_present = Bio::Tools::Run::StandAloneBlast->exists_blast();
2520     unless ($blast_present) {
2521         warn "blast program not found. Skipping StandAloneBlast example\n";
2522         return 0;
2523     }
2524     #@params = ('program' => 'blastn', 'database' => 'ecoli.nt');
2525     @params = ('program' => 'blastn', 'database' => $database);
2526     $factory = Bio::Tools::Run::StandAloneBlast->new(@params);
2527
2528     $str = Bio::SeqIO->new('-file'=>'t/dna2.fa' ,
2529                            '-format' => 'Fasta', );
2530     $seq1 = $str->next_seq();
2531
2532     $blast_report = $factory->blastall($seq1);
2533     $sbjct = $blast_report->nextSbjct;
2534
2535     print " Hit name is ", $sbjct->name, " \n";
2536
2537     return 1;
2538 } ;
2539
2540 #################################################
2541 #  blast_parser ():
2542 #
2543
2544 $blast_parser = sub {
2545
2546     print "\nBeginning blast.pm parser example... \n";
2547     my ($blast, $num_hits, @hits,$frac1, @inds,
2548         $report_file, $identity_string);
2549
2550     $report_file = $blast_report_file;
2551
2552     # III.5.3.1 Parsing BLAST reports with Blast.pm
2553
2554     use Bio::Tools::Blast;
2555     $blast = Bio::Tools::Blast->new('-file'   => $report_file,
2556                                     '-signif' => 1e-5,
2557                                     '-parse'  => 1,
2558                                     '-stats'  => 1,
2559                                     '-check_all_hits' => 1, );
2560     $blast->display();
2561     $num_hits =  $blast->num_hits;
2562     print "Number of hits is $num_hits \n";
2563
2564     @hits  = $blast->hits;
2565     $frac1 = $hits[1]->frac_identical;
2566     print "Fraction identical for hit 1 is $frac1 \n";
2567
2568     @inds = $hits[1]->hsp->seq_inds('query', 'iden', 1);
2569     $identity_string = join " ", @inds;
2570     print "Sequence identities for hsp of hit 1 are ",
2571     $identity_string, " \n";
2572
2573     return 1;
2574 } ;
2575
2576
2577
2578
2579 #################################################
2580 #  bplite_parsing ():
2581 #
2582
2583 $bplite_parsing = sub {
2584
2585     use Bio::Tools::BPlite;
2586     my ($file1, $file2, $file3, $report,$report2 , $report3 ,
2587         $last_iteration, $sbjct,  $total_iterations, $hsp,
2588         $matches, $loop);
2589
2590     print "\nBeginning bplite, bppsilite, bpbl2seq parsing example... \n";
2591
2592     ($file1, $file2, $file3) = ($bp_parse_file1,
2593                                 $bp_parse_file2 ,$bp_parse_file3 );
2594     #open FH, "t/blast.report";
2595     $report = Bio::Tools::BPlite->new('-file'=>$file1);
2596     $sbjct = $report->nextSbjct;
2597
2598     print " Hit name is ", $sbjct->name, " \n";
2599     while ($hsp = $sbjct->nextHSP) {
2600         $hsp->score;
2601     }
2602
2603     use Bio::Tools::BPpsilite;
2604     $report2 = Bio::Tools::BPpsilite->new('-file'=>$file2);
2605     $total_iterations = $report2->number_of_iterations;
2606     $last_iteration = $report2->round($total_iterations);
2607
2608     # print first 10 psiblast hits
2609     for ($loop = 1; $loop <= 10; $loop++) {
2610         $sbjct =  $last_iteration->nextSbjct;
2611         $sbjct && print " PSIBLAST Hit is ", $sbjct->name, " \n";
2612     }
2613
2614     #
2615     use Bio::Tools::BPbl2seq;
2616     $report3 = Bio::Tools::BPbl2seq->new('-file'  => $file3,
2617                                          '-queryname' => "ALEU_HORVU");
2618     $matches = $report3->match;
2619     print " Number of Blast2seq matches is $matches \n";
2620     #
2621
2622     return 1;
2623 } ;
2624
2625
2626 #################################################
2627 #  hmmer_parsing ():
2628 #
2629
2630 $hmmer_parsing = sub {
2631
2632     print "\nBeginning hmmer_parsing example... \n";
2633
2634     # III.5.4 Parsing HMM reports (HMMER::Results)
2635
2636     use Bio::Tools::HMMER::Domain;
2637     use Bio::Tools::HMMER::Set;
2638     use Bio::Tools::HMMER::Results;
2639
2640     my ($res, @seq, @domain);
2641     $res = new Bio::Tools::HMMER::Results('-file' => 't/hmmsearch.out',
2642                                           '-type' => 'hmmsearch');
2643
2644     @seq = $res->each_Set;
2645     print "Initial sequence bit score is ",$seq[0]->bits,"\n";
2646     @domain = $seq[0]->each_Domain;
2647     print " For initial domain, start = ",$domain[0]->start,
2648     " end = ",$domain[0]->end,"and score =",$domain[0]->bits,"\n";
2649
2650     #foreach $seq ( $res->each_Set ) {
2651     #    print "Sequence bit score is",$seq->bits,"\n";
2652     #    foreach $domain ( $seq->each_Domain ) {
2653     #        print " Domain start ",$domain->start," end ",
2654     #        $domain->end," score ",$domain->bits,"\n";
2655     #    }
2656     #}
2657     return 1;
2658 } ;
2659
2660 #################################################
2661 # run_clustalw_tcoffee  ():
2662 #
2663
2664 $run_clustalw_tcoffee = sub {
2665
2666     use Bio::Tools::Run::Alignment::TCoffee;
2667     use Bio::Tools::Run::Alignment::Clustalw;
2668     my (@params, $factory, $ktuple, $aln, $seq_array_ref, $strout);
2669     my (@seq_array, $seq, $str, $infile);
2670
2671     $infile = $unaligned_amino_file;
2672
2673     # III.6.3 Aligning multiple sequences (Clustalw.pm, TCoffee.pm)
2674     print "\nBeginning run_clustalw example... \n";
2675
2676     # put unaligned sequences in a Bio::Seq array
2677     $str = Bio::SeqIO->new('-file'=> $infile,
2678                            '-format' => 'Fasta');
2679     @seq_array =();
2680     while ($seq = $str->next_seq() ) { push (@seq_array, $seq) ;}
2681     $seq_array_ref = \@seq_array;
2682     # where @seq_array is an array of Bio::Seq objects
2683
2684     my $clustal_present =
2685       Bio::Tools::Run::Alignment::Clustalw->exists_clustal();
2686     if ($clustal_present) {
2687         @params = ('ktuple' => 2, 'matrix' => 'BLOSUM', 'quiet' => 1);
2688         $factory = Bio::Tools::Run::Alignment::Clustalw->new(@params);
2689         $ktuple = 3;
2690         $factory->ktuple($ktuple);  # change the parameter before executing
2691         $aln = $factory->align($seq_array_ref);
2692         $strout = Bio::AlignIO->newFh('-format' => 'msf');
2693         print "Output of clustalw alignment... \n";
2694         print $strout $aln;
2695     } else {
2696         warn "Clustalw program not found. Skipping run_clustalw example....\n";
2697     }
2698     print "\nBeginning run_tcoffee example... \n";
2699
2700     my $tcoffee_present =
2701       Bio::Tools::Run::Alignment::TCoffee->exists_tcoffee();
2702     if ($tcoffee_present) {
2703         @params = ('ktuple' => 2, 'matrix' => 'BLOSUM', 'quiet' => 1);
2704         $factory = Bio::Tools::Run::Alignment::TCoffee->new(@params);
2705         $ktuple = 3;
2706         $factory->ktuple($ktuple);  # change the parameter before executing
2707         $aln = $factory->align($seq_array_ref);
2708         $strout = Bio::AlignIO->newFh('-format' => 'msf');
2709         print "Output of TCoffee alignment... \n";
2710         print $strout $aln;
2711     } else {
2712         warn "TCoffee program not found. Skipping run_TCoffee example....\n";
2713     }
2714     return 1;
2715 } ;
2716
2717 #################################################
2718 # simplealign_univaln  ():
2719 #
2720
2721 $simplealign_univaln = sub {
2722
2723     my ($in, $out1,$out2, $aln, $threshold_percent, $infile);
2724     my ( $str, $resSlice1, $resSlice2, $resSlice3, $tmpfile);
2725
2726     print "\nBeginning simplealign and alignio example... \n";
2727
2728     # III.3.2 Transforming alignment files (AlignIO)
2729     $infile = $aligned_amino_file;
2730     #$tmpfile = "test.tmp";
2731     $in  = Bio::AlignIO->new('-file' => $infile ,
2732                              '-format' => 'msf');
2733     $out1 = Bio::AlignIO->newFh('-format' => 'pfam');
2734     $out2 = Bio::AlignIO->newFh('-format' => 'fasta',
2735                                 '-file' => ">test.tmp");
2736
2737     # while ( my $aln = $in->next_aln() ) { $out->write_aln($aln);  }
2738     $aln = $in->next_aln() ;
2739     print "Alignment to display in pfam format... \n";
2740     print $out1 $aln;
2741     print "Alignment to file in fasta format... \n";
2742     print $out2 $aln;
2743
2744     # III.6.4 Manipulating / displaying alignments (SimpleAlign, UnivAln)
2745
2746     $threshold_percent = 60;
2747     $str = $aln->consensus_string($threshold_percent) ;
2748     print "Consensus string with threshold = $threshold_percent is $str\n";
2749
2750     use Bio::UnivAln;
2751
2752     print "\nBeginning univaln example... \n";
2753     #$aln = Bio::UnivAln->new('-file'=>'test.tmp',
2754     $aln = Bio::UnivAln->new('-file'=>'t/alnfile.fasta',
2755                              '-desc'=>'Sample alignment',
2756                              '-type'=>'amino',
2757                              '-ffmt'=>'fasta'
2758                              );
2759
2760     print "\nCurrent alignment:\n",$aln->layout;
2761
2762     $resSlice1 = $aln->remove_gaps(); # original sequences without gaps
2763     print "Alignment without gaps  is \n $resSlice1\n";
2764     $resSlice3 = $aln->consensus(0.2, [1,3]); # 60% majority, columns 1+3 only
2765     print "Consensus with 20% threshold for columns".
2766         " 1 and 3 is \n $resSlice3\n";
2767     #unlink $tmpfile;
2768     return 1;
2769 } ;
2770
2771 #################################################
2772 # run_psw_bl2seq  ():
2773 #
2774
2775 $run_psw_bl2seq = sub {
2776
2777     print "\nBeginning run_psw_bl2seq example... \n";
2778     eval { require Bio::Tools::pSW; };
2779     if( $@ ) {
2780         print STDERR ("\nThe C-compiled engine for Smith Waterman alignments (Bio::Ext::Align) has not been installed.\n Please read the install the bioperl-ext package\n\n");
2781         return 0;
2782     }
2783
2784     #use Bio::Tools::pSW;
2785     use Bio::Tools::Run::StandAloneBlast;
2786
2787     # III.6.1 Aligning 2 sequences with Smith-Waterman (pSW)
2788
2789     my ($factory, $aln, $out1, $out2, $str, $seq1, $seq2, @params);
2790
2791     # Get protein sequences from file
2792     $str = Bio::SeqIO->new('-file'=>'t/amino.fa' ,
2793                            '-format' => 'Fasta', );
2794     $seq1 = $str->next_seq();
2795     $seq2 = $str->next_seq();
2796
2797     $factory = new Bio::Tools::pSW( '-matrix' => 'examples/blosum62.bla',
2798                                     '-gap' => 12,
2799                                     '-ext' => 2, );
2800
2801     $factory->align_and_show($seq1,$seq2,'STDOUT');
2802     $aln = $factory->pairwise_alignment($seq1,$seq2);
2803
2804     $out1 = Bio::AlignIO->newFh('-format' => 'fasta');
2805     print "The Smith-Waterman alignment in fasta format... \n";
2806     print $out1 $aln;
2807
2808
2809     # III.6.2 Aligning 2 sequences with Blast using  bl2seq and AlignIO
2810
2811     # Bl2seq testing
2812     # first create factory for bl2seq
2813     @params = ('program' => 'blastp', 'outfile' => 'bl2seq.out');
2814     $factory = Bio::Tools::Run::StandAloneBlast->new(@params);
2815     $factory->bl2seq($seq1, $seq2);
2816
2817     # Use AlignIO.pm to create a SimpleAlign object from the bl2seq report
2818     $str = Bio::AlignIO->new('-file'=> 'bl2seq.out',
2819                              '-format' => 'bl2seq');
2820     $aln = $str->next_aln();
2821
2822     $out2 = Bio::AlignIO->newFh('-format' => 'fasta');
2823     print "The Blast 2 sequence alignment in fasta format... \n";
2824     print $out2 $aln;
2825
2826     return 1;
2827 } ;
2828
2829
2830 #################################################
2831 #  gene_prediction_parsing ():
2832 #
2833
2834 $gene_prediction_parsing = sub {
2835
2836     use Bio::Tools::Genscan;
2837
2838     print "\nBeginning genscan_result_parsing example... \n";
2839
2840     my ($genscan, $gene, @exon_arr, $first_exon);
2841
2842     # III.7 Searching for genes and other structures
2843     # on genomic DNA (Genscan, ESTScan, MZEF)
2844
2845     use Bio::Tools::Genscan;
2846     $genscan = Bio::Tools::Genscan->new('-file' => 't/genomic-seq.genscan');
2847     # $gene is an instance of Bio::Tools::Prediction::Gene
2848     # $gene->exons() returns an array of Bio::Tools::Prediction::Exon objects
2849     while($gene = $genscan->next_prediction()) {
2850         print "Protein corresponding to current gene prediction is ",
2851         $gene->predicted_protein->seq()," \n";
2852         @exon_arr = $gene->exons();
2853         $first_exon = $exon_arr[0];
2854         print "Start location of first exon of current predicted gene is ",
2855         $first_exon->start," \n";
2856     }
2857     $genscan->close();
2858
2859     use Bio::Tools::Sim4::Results;
2860     print "\nBeginning Sim4_result_parsing example... \n";
2861     my ($sim4,$exonset, @exons, $exon, $homol);
2862
2863    $sim4 = new Bio::Tools::Sim4::Results(-file=> 't/sim4.rev', -estisfirst=>0);
2864    $exonset = $sim4->next_exonset;
2865    @exons = $exonset->sub_SeqFeature();
2866    $exon = $exons[1];
2867     print "Start location of first exon of first exon set is ",$exon->start(), " \n";
2868    print "Name of est sequence is ",$exon->est_hit()->seqname(), " \n";
2869    print "The length of the est hit is ",$exon->est_hit()->seqlength(), " \n";
2870    $sim4->close();
2871
2872     return 1;
2873 } ;
2874
2875
2876 #################################################
2877 #  sequence_annotation ():
2878 #
2879
2880 $sequence_annotation = sub {
2881
2882     print "\nBeginning sequence_annotation example... \n";
2883     my ($feat, $feature1, $feature2,$seqobj, @topfeatures, @allfeatures,
2884         $ann, $in, $infile, $other, $answer);
2885     # III.8.1 Representing sequence annotations (Annotation, SeqFeature)
2886
2887     $infile = $amino_seq_file;
2888
2889     $in  = Bio::SeqIO->new('-file' => $infile ,
2890                            '-format' => 'Fasta');
2891     $seqobj = $in->next_seq();
2892
2893     $feature1 = new Bio::SeqFeature::Generic
2894         ('-start' => 10,
2895          '-end' => 40,
2896          '-strand' => 1,
2897          '-primary' => 'exon',
2898          '-source'  => 'internal', );
2899
2900     $feature2 = new Bio::SeqFeature::Generic
2901         ( '-start' => 20,
2902           '-end' => 30,
2903           '-strand' => 1,
2904           '-primary' => 'exon',
2905           '-source'  => 'internal', );
2906
2907     $seqobj->add_SeqFeature($feature1);   # Add the SeqFeature to the parent
2908     $seqobj->add_SeqFeature($feature2);   # Add the SeqFeature to the parent
2909     $seqobj->annotation(new Bio::Annotation('-description' => 'This is the annotation'));
2910
2911     # Once the features and annotations have been associated
2912     # with the Seq, they can be with retrieved, eg:
2913     @topfeatures = $seqobj->top_SeqFeatures(); # just top level, or
2914     @allfeatures = $seqobj->all_SeqFeatures(); # descend into sub features
2915     $ann = $seqobj->annotation(); # annotation object
2916     $feat = $allfeatures[0];
2917     $other = $allfeatures[1];
2918
2919     print " Total number of sequence features is: ", scalar @allfeatures, " \n";
2920     print " The sequence annotation is: ", $ann->description," \n";
2921
2922     # The individual components of a SeqFeature can also be set
2923     # or retrieved with methods including:
2924     # attributes which return numbers
2925     $feat->start;          # start position
2926     $feat->end;            # end position
2927     $feat->strand;         # 1 means forward, -1 reverse, 0 not relevant
2928     # attributes which return strings
2929     $feat->primary_tag;    # the main 'name' of the sequence feature,  eg, 'exon'
2930     $feat->source_tag;     # where the feature comes from, eg'BLAST'
2931     # attributes which return Bio::PrimarySeq objects
2932     $feat->seq;            # the sequence between start,end
2933     $feat->entire_seq;     # the entire sequence
2934
2935     print " The primary tag of the feature is: ", $feat->primary_tag, "   \n";
2936     print " The first feature begins at location ", $feat->start, " \n";
2937     print "  ends at location ", $feat->end, " \n";
2938     print "  and is located on strand ", $feat->strand, " \n";
2939
2940     # other useful methods include
2941     $answer = $feat->overlaps($other);  # do SeqFeature $feat and SeqFeature $other overlap?
2942     print " Feature 1 does ", ($answer)? "":"not ", " overlap Feature2. \n";
2943
2944     $answer = $feat->contains($other); # is $other completely within $feat?
2945     print " Feature 1 does ", ($answer)? "":"not ", " contain Feature2. \n";
2946
2947     $answer = $feat->equals($other);   # do $feat and $other completely $agree?
2948     print " Feature 1 does ", ($answer)? "":"not ", " equal Feature2. \n";
2949
2950     $feat->sub_SeqFeature();  # create/access an array of subsequence features
2951
2952     return 1;
2953 } ;
2954
2955
2956 #################################################
2957 #  largeseqs ():
2958 #
2959
2960 $largeseqs = sub {
2961
2962     print "\nBeginning largeseqs example... \n";
2963
2964     # III.8.2 Representing and large sequences
2965     my ( $tmpfile, $seqio, $pseq, $plength, $last_4);
2966
2967     $tmpfile = 't/largefastatest.out';
2968     $seqio = new Bio::SeqIO('-format'=>'largefasta',
2969                             '-file'  =>'t/genomic-seq.fasta');
2970     $pseq = $seqio->next_seq();
2971     $plength = $pseq->length();
2972
2973     print " The length of the sequence ",
2974     $pseq->display_id()," is $plength  \n";
2975     $last_4 = $pseq->subseq($plength-3,$plength);
2976     print " The final four residues are $last_4 \n";
2977
2978     #END { unlink $tmpfile; }
2979     unlink $tmpfile;
2980
2981     return 1;
2982 } ;
2983
2984
2985 #################################################
2986 # liveseqs  ():
2987 #
2988
2989 $liveseqs = sub {
2990
2991     use Bio::LiveSeq::IO::BioPerl;
2992     my ($loader, $gene, $id, $maxstart);
2993
2994     print "\nBeginning liveseqs example... \n";
2995
2996     # III.8.2 Representing changing sequences (LiveSeq)
2997
2998     $loader=Bio::LiveSeq::IO::BioPerl->load('-db'=>"EMBL",
2999                                             '-file'=>"t/factor7.embl");
3000     $gene=$loader->gene2liveseq('-gene_name' => "factor7");
3001     $id = $gene->get_DNA->display_id ;
3002     print " The id of the gene is $id , \n";
3003     $maxstart = $gene->maxtranscript->start;
3004     print " The start position of the max transcript is $maxstart , \n";
3005
3006     return 1;
3007 } ;
3008
3009
3010 #################################################
3011 # demo_variations  ():
3012 #
3013
3014 $demo_variations = sub {
3015
3016     print "\nBeginning demo_variations example... \n";
3017
3018     # III.8.3 Representing related sequences
3019     # - mutations, polymorphisms etc (Allele, SeqDiff,)
3020
3021     use Bio::Variation::SeqDiff;
3022     use Bio::Variation::Allele;
3023     use Bio::Variation::DNAMutation;
3024     my ($a1, $a2, $dnamut, $seqDiff, @current_variants);
3025
3026     $a1 = Bio::Variation::Allele->new;
3027     $a1->seq('aaaa');
3028     $a2 = Bio::Variation::Allele->new;
3029     $a2->seq('ttat');
3030
3031     $dnamut = Bio::Variation::DNAMutation->new
3032         ('-start'  => 1000,
3033          '-length'  => 4,
3034          '-upStreamSeq'  => 'actggg',
3035          '-proof'  => 'experimental',
3036          '-isMutation' => 1, );
3037     $dnamut->allele_ori($a1);
3038     $dnamut->add_Allele($a2);
3039     $seqDiff = Bio::Variation::SeqDiff->new
3040         ('-id' => 'M20132',
3041          '-moltype' => 'rna',
3042          #$seqDiff = Bio::Variation::SeqDiff->new
3043          #( '-id' => 'M20132', '-moltype' => 'RNA',
3044          '-gene_symbol' => 'AR',
3045          '-chromosome' => 'X',
3046          '-numbering' => 'coding');
3047     # add it to a SeqDiff container object
3048     $seqDiff->add_Variant($dnamut);
3049     @current_variants = $seqDiff->each_Variant();
3050     print " The original sequence at current variation is ",
3051     $current_variants[0]->allele_ori->seq ," \n";
3052     print " The mutated sequence is ",
3053     $current_variants[0]->allele_mut->seq ," \n";
3054     print " The upstream sequence is ",
3055     $current_variants[0]->upStreamSeq ," \n";
3056
3057     return 1;
3058 } ;
3059
3060
3061 #################################################
3062 #  demo_xml ():
3063 #
3064
3065 $demo_xml = sub {
3066
3067     print "\nBeginning demo_xml example... \n";
3068     my ($str, $seqobj, $id, @feats, $first_primary_tag);
3069
3070     # III.8.4 Sequence XML representations
3071     # - generation and parsing (SeqIO::game)
3072
3073     $str = Bio::SeqIO->new('-file'=> 't/test.game',
3074                            '-format' => 'game');
3075     $seqobj = $str->next_seq();
3076     # $seq = $str->next_primary_seq();
3077     # $id = $seq->id;
3078
3079     print "Seq object display id is ", $seqobj->display_id(),
3080     "\n"; # the human read-able id of the sequence
3081     print "The sequence description is: \n";
3082     print "   ", $seqobj->desc(), " \n";
3083     print "Acc num is ", $seqobj->accession_number(),
3084     " \n"; # when there, the accession number
3085     print "Moltype is ", $seqobj->moltype(),
3086     " \n";    # one of 'dna','rna','protein'
3087
3088     @feats = $seqobj->all_SeqFeatures();
3089     $first_primary_tag = $feats[0]->primary_tag;
3090     print " Total number of sequence features is: ", scalar @feats, " \n";
3091     print " The primary tag of the first feature is: $first_primary_tag  \n";
3092     print " The first feature begins at location ", $feats[0]->start, " \n";
3093     print "  and ends at location ", $feats[0]->end, " \n";
3094
3095     return 1;
3096 } ;
3097
3098
3099 ################################################
3100 #  other_seq_utilities ():
3101 #
3102
3103 $other_seq_utilities = sub {
3104
3105     use Bio::Tools::OddCodes;
3106     use Bio::Tools::SeqPattern;
3107     use Bio::Tools::CodonTable;
3108
3109     my ( $oddcode_obj, $amino_obj, $in, $infile, $chargeseq, $chemseq);
3110     my (@ii, $i, @res, $myCodonTable);
3111     my ( $pattern,$pattern_obj, $pattern_obj2, $pattern_obj3);
3112
3113     print "\nBeginning sequence utilities example... \n";
3114
3115     # III.4.6 Identifying amino acid characteristics
3116     # - eg charge, hydrophobicity (OddCodes)
3117     $infile = $amino_seq_file;
3118     $in  = Bio::SeqIO->new('-file' => $infile ,
3119                            '-format' => 'Fasta');
3120     $amino_obj = $in->next_seq->trunc(1,20);
3121     $oddcode_obj  =  Bio::Tools::OddCodes->new($amino_obj);
3122     print "Initial sequence is: \n ", $amino_obj->seq(), " \n";
3123     $chargeseq = $oddcode_obj->charge;
3124     # Note OddCodes returns a reference to the desired
3125     # sequence, not the sequence itself.
3126     print "Sequence in \'charge\' alphabet is : \n ",$$chargeseq, " \n";
3127     $chemseq = $oddcode_obj->chemical;
3128     print "Sequence in \'chemical\' alphabet is : \n ",$$chemseq, " \n";
3129
3130     # III.4.2.1 non-standard nucleotide translation tables (CodonTable)
3131
3132     # create a table object by giving an ID
3133     $myCodonTable = Bio::Tools::CodonTable -> new ( '-id' => 16);
3134     print "\nThe name of the current codon table is ",
3135     $myCodonTable->name() , " \n";
3136
3137     # translate some codons
3138     @ii  = qw(act acc att ggg ttt aaa ytr yyy);
3139     @res =();
3140     for $i (0..$#ii) { $res[$i] = $myCodonTable->translate($ii[$i]); }
3141     print "\nWith the current codon table, the sequence \n ",
3142     @ii, "\n is translated as \n" , @res, " \n";
3143
3144     # III.4.2.2 utilities for ambiguous residues (SeqPattern, IUPAC)
3145     # III.4.2.3 regex-like nucleotide pattern manipulation
3146
3147     print "\nBeginning demo of SeqPattern object \n ";
3148
3149     $pattern     = '(CCCCT)N{1,200}(agyyg)N{1,80}(agggg)';
3150     print "\nInput regular expression = $pattern \n ";
3151     $pattern_obj = new Bio::Tools::SeqPattern('-SEQ' =>$pattern,
3152                                               '-TYPE' =>'dna');
3153     $pattern_obj2  = $pattern_obj->revcom();
3154     print "\nReverse-complemented expression = ",$pattern_obj2->str, "\n ";
3155     $pattern_obj3 = $pattern_obj->revcom(1); ## returns expanded rev complement pattern.
3156     print "\nExpanded reverse-complemented expression = ",$pattern_obj3->str, "\n ";
3157     #$pattern_obj->revcom(1); ## returns expanded rev complement pattern.
3158
3159     return 1;
3160 } ;
3161
3162
3163
3164 $bpinspect1 = sub {
3165     my ($object, $object_class,$package );
3166     my @package_list = ();
3167     $object = shift;
3168     ($object_class = $object) =~ s/::/\//g;  # replace "::" with "/"
3169     $object_class = $object_class . ".pm";
3170     require $object_class;
3171
3172     my $method_hash = $object->methods('all');
3173
3174     print " \n ***Methods for Object $object ********  \n";
3175 # get all the *unique* package names providing methods
3176     my %seen=();
3177     foreach $package  (values %$method_hash) {
3178         push(@package_list, $package) unless $seen{$package}++;
3179     }
3180     for $package (sort @package_list) {
3181         print " \n\n Methods taken from package $package  \n";
3182         my $out_count = 1;
3183         for my $method (sort keys %$method_hash) {
3184             if ($$method_hash{$method} eq $package ) {
3185             print " $method  ";
3186             $out_count++;
3187                 if ($out_count == 7) {
3188                         print " \n";    # keeps output line from getting too long
3189                         $out_count = 1;
3190                 }
3191             }
3192          }
3193      }
3194     print "\n";
3195
3196 # The following subroutine is based closely on Mark Summerfield's and Piers Cawley's
3197 # clever Class_Methods_Introspection program
3198 #
3199     package UNIVERSAL;
3200     sub methods {
3201         my ($class, $types) = @_;
3202         $class = ref $class || $class;
3203         $types ||= '';
3204         my %classes_seen;
3205         my %methods;
3206         my @class = ($class);
3207
3208         no strict 'refs';
3209         while ($class = shift @class) {
3210             next if $classes_seen{$class}++;
3211             unshift @class, @{"${class}::ISA"} if $types eq 'all';
3212             # Based on methods_via() in perl5db.pl
3213             for my $method (grep {not /^[(_]/ and
3214                                   defined &{${"${class}::"}{$_}}}
3215                             keys %{"${class}::"}) {
3216                 $methods{$method} = $class;        #ps
3217 #                $methods{$method} = wantarray ? undef : $class->can($method);
3218             }
3219         }
3220
3221         wantarray ? keys %methods : \%methods;
3222     }
3223
3224     1;
3225
3226
3227
3228
3229
3230
3231     return 1;
3232 } ;
3233
3234
3235
3236
3237
3238
3239
3240 ## "main" program follows
3241 #no strict 'refs';
3242
3243     @runlist = @ARGV;
3244     if (scalar(@runlist)==0) {&$display_help;}; # display help if no option
3245     if ($runlist[0] == 0) {@runlist = (1..20); }; # argument = 0 means run all tests
3246     foreach $n  (@runlist) {
3247         if ($n ==100) {my $object = $runlist[1]; &$bpinspect1($object); last;}
3248         if ($n ==1) {&$access_remote_db; next;}
3249         if ($n ==2) {&$index_local_db; next;}
3250         if ($n ==3) {&$fetch_local_db; next;}
3251         if ($n ==4) {&$sequence_manipulations; next;}
3252         if ($n ==5) {&$seqstats_and_seqwords; next;}
3253         if ($n ==6) {&$restriction_and_sigcleave; next;}
3254         if ($n ==7) {&$other_seq_utilities; next;}
3255         if ($n ==8) {&$run_standaloneblast; next;}
3256         if ($n ==9) {&$blast_parser; next;}
3257         if ($n ==10) {&$bplite_parsing; next;}
3258         if ($n ==11) {&$hmmer_parsing; next;}
3259         if ($n ==12) {&$run_clustalw_tcoffee; next;}
3260         if ($n ==13) {&$run_psw_bl2seq; next;}
3261         if ($n ==14) {&$simplealign_univaln ; next;}
3262         if ($n ==15) {&$gene_prediction_parsing; next;}
3263         if ($n ==16) {&$sequence_annotation; next;}
3264         if ($n ==17) {&$largeseqs; next;}
3265         if ($n ==18) {&$liveseqs; next;}
3266         if ($n ==19) {&$demo_variations; next;}
3267         if ($n ==20) {&$demo_xml; next;}
3268         &$display_help;
3269     }
3270
3271 ## End of "main"
3272
3273