Merged Steve L.'s fixes from branch (bug #935).
[bioperl-live.git] / bptutorial.pl
blob476eba70b75bfe2a240dfff54b92f228abddf4dd
1 # $Id$
3 =pod
5 =head1 BioPerl Tutorial
7 Cared for by Peter Schattner <schattner@alum.mit.edu>
9 Copyright Peter Schattner
11 This tutorial includes "snippets" of code and text from various
12 Bioperl documents including module documentation, example scripts
13 and "t" test scripts. You may distribute this tutorial under the
14 same terms as perl itself.
16 This document is written in Perl POD (plain old documentation)
17 format. You can run this file through your favorite pod translator
18 (pod2html, pod2man, pod2text, etc.) if you would like a more
19 convenient formatting.
21 Table of Contents
23 I. Introduction
24 I.1 Overview
25 I.2 Software requirements
26 I.2.1 For minimal bioperl installation
27 I.2.2 For complete installation
28 I.3 Installation procedures
29 I.4 Additional comments for non-unix users
31 II. Brief overview to bioperl's objects
32 II.1 Sequence objects: (Seq, PrimarySeq, LocatableSeq, LiveSeq, LargeSeq, SeqI)
33 II.2 Alignment objects (SimpleAlign, UnivAln)
34 II.3 Interface objects and implementation objects
36 III. Using bioperl
37 III.1 Accessing sequence data from local and remote databases
38 III.1.1 Accessing remote databases (Bio::DB::GenBank, etc)
39 III.1.2 Indexing and accessing local databases (Bio::Index::*, bpindex.pl, bpfetch.pl)
40 III.2 Transforming formats of database/ file records
41 III.2.1 Transforming sequence files (SeqIO)
42 III.2.2 Transforming alignment files (AlignIO)
43 III.3 Manipulating sequences
44 III.3.1 Manipulating sequence data with Seq methods (Seq)
45 III.3.2 Obtaining basic sequence statistics- MW, residue &codon frequencies (SeqStats)
46 III.3.3 Identifying restriction enzyme sites (RestrictionEnzyme)
47 III.3.4 Identifying amino acid cleavage sites (Sigcleave)
48 III.3.5 Miscellaneous sequence utilities: OddCodes, SeqPattern
49 III.4 Searching for "similar" sequences
50 III.4.1 Running BLAST locally (StandAloneBlast)
51 III.4.2 Running BLAST remotely (using Blast.pm)
52 III.4.3 Parsing BLAST reports with Blast.pm
53 III.4.4 Parsing BLAST reports with BPlite, BPpsilite and BPbl2seq
54 III.4.5 Parsing HMM reports (HMMER::Results)
55 III.5 Creating and manipulating sequence alignments
56 III.5.1 Aligning 2 sequences with Smith-Waterman (pSW)
57 III.5.2 Aligning 2 sequences with Blast using bl2seq and AlignIO
58 III.5.3 Aligning multiple sequences (Clustalw.pm, TCoffee.pm)
59 III.5.4 Manipulating / displaying alignments (SimpleAlign, UnivAln)
60 III.6 Searching for genes and other structures on genomic DNA (Genscan, Sim4, ESTScan, MZEF)
61 III.7 Developing machine readable sequence annotations
62 III.7.1 Representing sequence annotations (Annotation, SeqFeature)
63 III.7.2 Representing and large and/or changing sequences (LiveSeq,LargeSeq)
64 III.7.3 Representing related sequences - mutations, polymorphisms etc (Allele, SeqDiff,)
65 III.7.4 Sequence XML representations - generation and parsing (SeqIO::game)
67 IV. Related projects - biocorba, biopython, biojava, Ensembl, AnnotationWorkbench
68 IV.1 Biocorba
69 IV.2 Biopython and biojava
70 IV.3 Ensembl
71 IV.4 The Annotation Workbench and bioperl-gui
73 V. Appendices
74 V.1 Public Methods of Principal Bioperl Objects
75 V.2 Tutorial Demo Scripts
77 =head1 I. Introduction
79 =head2 I.1 Overview
81 Bioperl is a collection of perl modules that facilitate the
82 development of perl scripts for bio-informatics applications. As
83 such, it does not include ready to use programs in the sense that may
84 commercial packages and free web-based interfaces (eg Entrez, SRS) do.
85 On the other hand, bioperl does provide reusable perl modules that
86 facilitate writing perl scripts for sequence manipulation, accessing
87 of databases using a range of data formats and execution and parsing
88 of the results of various molecular biology programs including Blast,
89 clustalw, TCoffee, genscan, ESTscan and HMMER. Consequently, bioperl
90 enables developing scripts that can analyze large quantities of
91 sequence data in ways that are typically difficult or impossible with
92 web based systems.
94 In order to take advantage of bioperl, the user needs a basic
95 understanding of the perl programming language including an
96 understanding of how to use perl references, modules, objects and
97 methods. If these concepts are unfamiliar the user is referred to any
98 of the various introductory / intermediate books on perl. (I've liked
99 S. Holzmer's Perl Core Language, Coriolis Technology Press, for
100 example). This tutorial is not intended to teach the fundamentals of
101 perl to those with little or no experience in the perl language. On
102 the other hand, advanced knowledge of perl - such as how to write a
103 perl object - is not required for successfully using bioperl.
105 Bioperl is open source software that is still under active
106 development. The advantages of open source software are well known.
107 They include the ability to freely examine and modify source code and
108 exemption from software licensing fees. However, since open source
109 software is typically developed by a large number of volunteer
110 programmers, the resulting code is often not as clearly organized and
111 its user interface not as standardized as in a mature commercial
112 product. In addition, in any project under active development,
113 documentation may not keep up with the development of new features.
114 Consequently the learning curve for actively developed, open source
115 source software is sometimes steep.
117 This tutorial is intended to ease the learning curve for new users of
118 bioperl. To that end the tutorial includes:
120 =over 4
122 =item *
124 Descriptions of what bio-informatics tasks can be handled with bioperl
126 =item *
128 Directions on where to find the methods to accomplish these tasks
129 within the bioperl package
131 =item *
133 Recommendations on where to go for additional information.
135 =item *
137 A separate tutorial script (tutorial.pl - located in the top bioperl
138 directory) with examples of many of methods described in the
139 tutorial.
141 =back
143 Running the tutorial.pl script while going through this tutorial - or
144 better yet, stepping through it with an interactive debugger - is a
145 good way of learning bioperl. The tutorial script is also a good
146 place from which to cut-and-paste code for your scripts(rather than
147 using the code snippets in this tutorial). The tutorial script should
148 work on your machine - and if it doesn't it would probably be a good
149 idea to find out why, before getting too involved with bioperl!
151 This tutorial does not intend to be a comprehensive description of all
152 the objects and methods available in bioperl. For that the reader is
153 directed to the documentation included with each of the modules as
154 well as the additional documentation referred to below.
156 =head2 I.2 Software requirements
158 =head2 I.2.1 Minimal bioperl installation
160 For a "minimal" installation of bioperl, you will need to have perl
161 itself installed as well as the bioperl "core modules". Bioperl has
162 been tested primarily using perl 5.005 and more recently perl 5.6.
163 The minimal bioperl installation should still work under perl 5.004.
164 However, as increasing numbers of bioperl objects are using modules
165 from CPAN (see below), problems have been observed for bioperl running
166 under perl 5.004. So if you are having trouble running bioperl under
167 perl 5.004, you should probably upgrade your version of perl.
169 In addition to a current version of perl, the new user of bioperl is
170 encouraged to have access to, and familiarity with, an interactive
171 perl debugger. Bioperl is a large collection of complex interacting
172 software objects. Stepping through a script with an interactive
173 debugger is a very helpful way of seeing what is happening in such a
174 complex software system - especially when the software is not behaving
175 in the way that you expect. The free graphical debugger ptkdb
176 (available as Devel::ptkdb from CPAN) is highly recommended. Active
177 State offers a commercial graphical debugger for windows systems. The
178 standard perl distribution also contains a powerful interactive
179 debugger - though with a more cumbersome (command line) interface.
181 =head2 I.2.2 Complete installation
183 Taking full advantage of bioperl requires software beyond that for the
184 minimal installation. This additional software includes perl modules
185 from CPAN, bioperl perl extensions, a bioperl xs-extension, and
186 several standard compiled bioinformatics programs.
188 B<Perl - extensions>
190 The following perl modules are available from bioperl
191 (http://bioperl.org/Core/external.shtml)or from CPAN
192 (http://www.perl.com/CPAN/) are used by bioperl. The listing also
193 indicates what bioperl features will not be available if the
194 corresponding CPAN module is not downloaded. If these modules are not
195 available (eg non-unix operating systems), the remainder of bioperl
196 should still function correctly.
198 For accessing remote databases you will need:
200 =over 2
202 =item *
204 File-Temp-0.09
206 =item *
208 IO-String-1.01
210 =back
212 For accessing Ace databases you will need:
214 =over 1
216 =item *
218 AcePerl-1.68.
220 =back
222 For remote blast searches you will need:
224 =over 7
226 =item *
228 libwww-perl-5.48
230 =item *
232 Digest-MD5-2.12.
234 =item *
236 HTML-Parser-3.13
238 =item *
240 libnet-1.0703
242 =item *
244 MIME-Base64-2.11
246 =item *
248 URI-1.09
250 =item *
252 IO-stringy-1.216
254 =back
256 For xml parsing you will need:
258 =over 5
260 =item *
262 libxml-perl-0.07
264 =item *
266 XML-Node-0.10
268 =item *
270 XML-Parser.2.30
272 =item *
274 XML-Writer-0.4
276 =item *
278 expat-1.95.1 from http://sourceforge.net/projects/expat/
280 =back
282 For more current and additional information on external modules
283 required by bioperl, check http://bioperl.org/Core/external.shtml
285 B<Bioperl c extensions & external bio-informatics programs>
287 Bioperl also uses several c-programs for sequence alignment and local
288 blast searching. To use these features of bioperl you will need an
289 ANSI C or Gnu C compiler as well as the actual program available from
290 sources such as:
292 for smith-waterman alignments- bioperl-ext-0.6 from
293 http://bioperl.org/Core/external.shtml
295 for clustalw alignments-
296 http://corba.ebi.ac.uk/Biocatalog/Alignment_Search_software.html/
298 for tcoffee alignments-
299 http://igs-server.cnrs-mrs.fr/~cnotred/Projects_home_page/t_coffee_home_page.html
301 for local blast searching- ftp://ncbi.nlm.nih.gov/blast
303 =head2 I.3 Installation
305 The actual installation of the various system components is
306 accomplished in the standard manner:
308 =over 6
310 =item *
312 Locate the package on the network
314 =item *
316 Download
318 =item *
320 Decompress (with gunzip or a similiar utility)
322 =item *
324 Remove the file archive (eg with tar -xvf)
326 =item *
328 Create a "makefile" (with "perl Makefile.PL" for perl modules or a
329 supplied "install" or "configure" program for non-perl program
331 =item *
333 Run "make", "make test" and "make install" This procedure must be
334 repeated for every CPAN module, bioperl-extension and external
335 module to be installed. A helper module CPAN.pm is available from
336 CPAN which automates the process for installing the perl modules.
338 =back
340 For the external programs (clustal, Tcoffee, ncbi-blast), there is an
341 extra step:
343 =over 1
345 =item *
347 Set the relevant environmental variable (CLUSTALDIR, TCOFFEEDIR or
348 BLASTDIR) to the directory holding the executable in your startup
349 file - eg in .bashrc. (For running local blasts, it is also
350 necessary that the name of local-blast database directory is known
351 to bioperl. This will typically happen automatically, but in case
352 of difficulty, refer to the documentation for StandAloneBlast.pm)
354 =back
356 The only likely complication (at least on unix systems) that may occur
357 is if you are unable to obtain system level writing privileges. For
358 instructions on modifying the installation in this case and for more
359 details on the overall installation procedure, see the README file in
360 the bioperl distribution as well as the README files in the external
361 programs you want to use (eg bioperl-ext, clustalw, TCoffee,
362 NCBI-blast).
364 =head2 I.4 Additional comments for non-unix users
366 Bioperl has mainly been developed and tested under various unix
367 environments (including Linux) and this tutorial is intended primarily
368 for unix users. The minimal installation of bioperl *should* work
369 under other OS's (NT, windows, perl). However, bioperl has not been
370 widely tested under these OS's and problems have been noted in the
371 bioperl mailing lists. In addition, many bioperl features require the
372 use of CPAN modules, compiled extensions or external programs. These
373 features will probably will not work under some or all of these other
374 operating systems. If a script attempts to access these features from
375 a non-unix OS, bioperl is designed to simply report that the desired
376 capability is not available. However, since the testing of bioperl in
377 these environments has been limited, the script may well crash in a
378 less "graceful" manner.
380 Todd Richmond has written of his experiences with BioPerl on MacOs
381 at http://bioperl.org/Core/mac-bioperl.html
383 =head1 II. Brief introduction to bioperl's objects
385 The purpose of this tutorial is to get you using bioperl to solve
386 real-life bioinformatics problems as quickly as possible. The aim is
387 not to explain the structure of bioperl objects or perl
388 object-oriented programming in general. Indeed, the relationships
389 among the bioperl objects is not simple; however, understanding them
390 in detail is fortunately not necessary for successfully using the
391 package.
393 Nevertheless, a little familiarity with the bioperl object "bestiary"
394 can be very helpful even to the casual user of bioperl. For example
395 there are (at least) six different "sequence objects" - Seq,
396 PrimarySeq, LocatableSeq, LiveSeq, LargeSeq, SeqI. Understanding the
397 relationships among these objects - and why there are so many of them
398 - will help you select the appropriate one to use in your script.
400 =head2 II.2 Sequence objects: (Seq, PrimarySeq, LocatableSeq, LiveSeq, LargeSeq, SeqI)
402 Seq is the central sequence object in bioperl. When in doubt this is
403 probably the object that you want to use to describe a dna, rna or
404 protein sequence in bioperl. Most common sequence manipulations can
405 be performed with Seq. These capabilities are described in sections
406 III.3.1 and III.7.1.
408 Seq objects can be created explicitly (see section III.2.1 for an
409 example). However usually Seq objects will be created for you
410 automatically when you read in a file containing sequence data using
411 the SeqIO object. This procedure is described in section III.2.1. In
412 addition to storing its identification labels and the sequence itself,
413 a Seq object can store multiple annotations and associated "sequence
414 features". This capability can be very useful - especially in
415 development of automated genome annotation systems, see section
416 III.7.1.
418 On the other hand, if you need a script capable of simultaneously
419 handling many (hundreds or thousands) sequences at a time, then the
420 overhead of adding annotations to each sequence can be significant.
421 For such applications, you will want to use the PrimarySeq object.
422 PrimarySeq is basically a "stripped down" version of Seq. It contains
423 just the sequence data itself and a few identifying labels (id,
424 accession number, molecule type = dna, rna, or protein). For
425 applications with hundreds or thousands or sequences, using PrimarySeq
426 objects can significantly speed up program execution and decrease the
427 amount of RAM the program requires.
429 The LocatableSeq object is just a Seq object which has "start" and
430 "end" positions associated with it. It is used by the alignment
431 object SimpleAlign and other modules that use SimpleAlign objects (eg
432 AlignIO, pSW). In general you don't have to worry about creating
433 LocatableSeq objects because they will be made for you automatically
434 when you create an alignment (using pSW, Clustalw, Tcoffee or bl2seq)
435 or when input an alignment data file using AlignIO. However if you
436 need to input a sequence alignment by hand (ieg to build a SimpleAlign
437 object), you will need to input the sequences as LocatableSeqs.
439 A LargeSeq object is a special type of Seq object used for handling
440 very long ( eg L<gt> 100 MB) sequences. If you need to manipulate such
441 long sequences see section III.7.2 which describes LargeSeq objects.
443 A LiveSeq object is another specialized object for storing sequence
444 data. LiveSeq addresses the problem of features whose location on a
445 sequence changes over time. This can happen, for example, when
446 sequence feature objects are used to store gene locations on newly
447 sequenced genomes - locations which can change as higher quality
448 sequencing data becomes available. Although a LiveSeq object is not
449 implemented in the same way as a Seq object, LargeSeq does implement
450 the SeqI interface (see below). Consequently, most methods available
451 for Seq objects will work fine with LiveSeq objects. Section III.7.2
452 contains further discussion of LiveSeq objects.
454 SeqI objects are Seq "interface objects" (see section II.4) They are
455 used to ensure bioperl's compatibility with other software packages.
456 SeqI and other interface objects are not likely to be relevant to the
457 casual bioperl user.
459 *** Having described these other types of sequence objects, the
460 "bottom line" still is that if you store your sequence data in Seq
461 objects (which is where they'll be if you read them in with
462 SeqIO), you will usually do just fine. ***
464 =head2 II.3 Alignment objects (SimpleAlign, UnivAln)
466 There are two "alignment objects" in bioperl: SimpleAlign and UnivAln.
467 Both store an array of sequences as an alignment. However their
468 internal data structures are quite different and converting between
469 them - though certainly possible - is rather awkward. In contrast to
470 the sequence objects - where there are good reasons for having 6
471 different classes of objects, the presence of two alignment objects is
472 just an unfortunate relic of the two systems having been designed
473 independently at different times.
475 Since each object has some capabilities that the other lacks it has
476 not yet been feasible to unify bioperl's sequence alignment methods
477 into a single object (see section III.5.4 for a description of
478 SimpleAlign's and UnivAln's features) . However, recent development
479 in bioperl involving alignments has been focused on using SimpleAlign
480 and the new user should generally use SimpleAlign where possible.
482 =head2 II.4 Interface objects and implementation objects
484 Since release 0.6, bioperl has been moving to separate interface and
485 implementation objects. An interface is solely the definition of what
486 methods one can call on an object, without any knowledge of how it is
487 implemented. An implementation is an actual, working implementation of
488 an object. In languages like Java, interface definition is part of the
489 language. In Perl, you have to roll your own.
491 In bioperl, the interface objects usually have names like
492 Bio::MyObjectI, with the trailing I indicating it is an interface
493 object. The interface objects mainly provide documentation on what the
494 interface is, and how to use it, without any implementations (though
495 there are some exceptions). Although interface objects are not of
496 much direct utility to the casual bioperl user, being aware of their
497 existence is useful since they are the basis to understanding how
498 bioperl programs can communicate with other bioinformatics projects
499 such as Ensembl and the Annotation Workbench (see section IV)
502 =head1 III. Using bioperl
504 Bioperl provides software modules for many of the typical tasks of
505 bioinformatics programming. These include:
507 =over 7
509 =item * Accessing sequence data from local and remote databases
511 =item * Transforming formats of database/ file records
513 =item * Manipulating individual sequences
515 =item * Searching for "similar" sequences
517 =item * Creating and manipulating sequence alignments
519 =item * Searching for genes and other structures on genomic DNA
521 =item * Developing machine readable sequence annotations
523 =back
525 The following sections describe how bioperl can help perform all of
526 these tasks.
528 =head2 III.1 Accessing sequence data from local and remote databases
530 Much of bioperl is focused on sequence manipulation. However, before
531 bioperl can manipulate sequences, it needs to have access to sequence
532 data. Now one can directly enter data sequence data into a bioperl
533 Seq object, eg:
535 $seq = Bio::Seq->new('-seq'=>'actgtggcgtcaact',
536 '-desc'=>'Sample Bio::Seq object',
537 '-display_id' => 'something',
538 '-accession_number' => 'accnum',
539 '-moltype' => 'dna' );
541 However, in most cases, it is preferable to access sequence data from
542 some online data file or database (Note that in common with
543 conventional bioinformatics usage we will call a "database" what might
544 be more appropriately referred to as an "indexed flat file".) Bioperl
545 supports accessing remote databases as well as developing indices for
546 setting up local databases.
548 =head2 III.1.1 Accessing remote databases (Bio::DB::GenBank, etc)
550 Accessing sequence data from the principal molecular biology databases
551 is straightforward in bioperl. Data can be accessed by means of the
552 sequence's accession number or id. Batch mode access is also
553 supported to facilitate the efficient retrieval of multiple sequences.
554 For retrieving data from genbank, for example, the code could be as
555 follows:
557 $gb = new Bio::DB::GenBank();
558 $seq1 = $gb->get_Seq_by_id('MUSIGHBA1');
559 $seq2 = $gb->get_Seq_by_acc('AF303112'))
560 $seqio = $gb->get_Stream_by_batch([ qw(J00522 AF303112 2981014)]));
562 Bioperl currently supports sequence data retrieval from the genbank,
563 genpept, swissprot and gdb databases. Bioperl also supports retrieval
564 from a remote Ace database. This capability requires the presence of
565 the external AcePerl module. You need to download and install the aceperl
566 module from http://stein.cshl.org/AcePerl/.
568 =head2 III.1.2 Indexing and accessing local databases (Bio::Index::*,
569 bpindex.pl, bpfetch.pl)
571 Alternately, bioperl permits indexing local sequence data files by
572 means of the Bio::Index objects. The following sequence data formats
573 are supported: genbank, swissprot, pfam, embl and fasta. Once the set
574 of sequences have been indexed using Bio::Index, individual sequences
575 can be accessed using syntax very similar to that described above for
576 accessing remote databases. For example, if one wants to set up an
577 indexed (flat-file) database of fasta files, and later wants then to
578 retrieve one file, one could write a scripts like:
580 # script 1: create the index
581 use Bio::Index::Fasta; # using fasta file format
582 $Index_File_Name = shift;
583 $inx = Bio::Index::Fasta->new(
584 -filename => $Index_File_Name,
585 -write_flag => 1);
586 $inx->make_index(@ARGV);
588 # script 2: retrieve some files
589 use Bio::Index::Fasta;
590 $Index_File_Name = shift;
591 $inx = Bio::Index::Fasta->new($Index_File_Name);
592 foreach $id (@ARGV) {
593 $seq = $inx->fetch($id); # Returns Bio::Seq object
594 # do something with the sequence
597 To facilitate the creation and use of more complex or flexible
598 indexing systems, the bioperl distribution includes two sample scripts
599 bpindex.pl and bpfetch.pl. These scripts can be used as templates to
600 develop customized local data-file indexing systems.
602 =head2 III.2 Transforming formats of database/ file records
604 =head2 III.2.1 Transforming sequence files (SeqIO)
606 A common - and tedious - bioinformatics task is that of converting
607 sequence data among the many widely used data formats. Bioperl's
608 SeqIO object, however, makes this chore a breeze. SeqIO can read a
609 stream of sequences - located in a single or in multiple files - in
610 any of six formats: Fasta, EMBL. GenBank, Swissprot, PIR and GCG.
611 Once the sequence data has been read in with SeqIO, it is available to
612 bioperl in the form of Seq objects. Moreover, the Seq objects can
613 then be written to another file (again using SeqIO) in any of the
614 supported data formats making data converters simple to implement, for
615 example:
617 use Bio::SeqIO;
618 $in = Bio::SeqIO->new('-file' => "inputfilename",
619 '-format' => 'Fasta');
620 $out = Bio::SeqIO->new('-file' => ">outputfilename",
621 '-format' => 'EMBL');
622 while ( my $seq = $in->next_seq() ) {$out->write_seq($seq); }
624 In addition, perl "tied filehandle" syntax is available to SeqIO,
625 allowing you to use the standard E<lt>E<gt> and print operations to read and
626 write sequence objects, eg:
628 $in = Bio::SeqIO->newFh('-file' => "inputfilename" ,
629 '-format' => 'Fasta');
630 $out = Bio::SeqIO->newFh('-format' => 'EMBL');
631 print $out $_ while <$in>;
633 =head2 III.2.2 Transforming alignment files (AlignIO)
635 Data files storing multiple sequence alignments also appear in varied
636 formats. AlignIO is the bioperl object for data conversion of
637 alignment files. AlignIO is patterned on the SeqIO object and shares
638 most of SeqIO's features. AlignIO currently supports input in the
639 following formats: fasta, mase, stockholm, prodom, selex, bl2seq,
640 msf/gcg and output in these formats: : fasta, mase, selex, clustalw,
641 msf/gcg. One significant difference between AlignIO and SeqIO is that
642 AlignIO handles IO for only a single alignment at a time (SeqIO.pm
643 handles IO for multiple sequences in a single stream.) Syntax for
644 AlignIO is almost identical to that of SeqIO: use Bio::AlignIO;
646 $in = Bio::AlignIO->new('-file' => "inputfilename" ,
647 '-format' => 'fasta');
648 $out = Bio::AlignIO->new('-file' => ">outputfilename",
649 '-format' => 'pfam');
650 while ( my $aln = $in->next_aln() ) { $out->write_aln($aln); }
652 The only difference is that here, the returned object reference, $aln,
653 is to a SimpleAlign object rather than a Seq object.
655 AlignIO also supports the tied filehandle syntax described above for
656 SeqIO. (Note that currently AlignIO is usable only with SimpleAlign
657 alignment objects. IO for UnivAln objects can only be done for files
658 in fasta data format.)
660 =head2 III.3 Manipulating sequences
662 =head2 III.3.1 Manipulating sequence data with Seq methods
664 OK, so we know how to retrieve sequences and access them as Seq
665 objects. Let's see how we can use the Seq objects to manipulate our
666 sequence data and retrieve information. Seq provides multiple methods
667 for performing many common (and some not-so-common) tasks of sequence
668 manipulation and data retrieval. Here are some of the most useful:
670 The following methods return strings
672 $seqobj->display_id(); # the human read-able id of the sequence
673 $seqobj->seq(); # string of sequence
674 $seqobj->subseq(5,10); # part of the sequence as a string
675 $seqobj->accession_number(); # when there, the accession number
676 $seqobj->moltype(); # one of 'dna','rna','protein'
677 $seqobj->primary_id(); # a unique id for this sequence irregardless
678 # of its display_id or accession number
680 The following methods return an array of Bio::SeqFeature objects
682 $seqobj->top_SeqFeatures # The 'top level' sequence features
683 $seqobj->all_SeqFeatures # All sequence features, including sub
684 # seq features
686 Sequence features will be discussed further in section III.7 on
687 machine-readable sequence annotation.
689 The following methods returns new sequence objects, but do not transfer features across
691 $seqobj->trunc(5,10) # truncation from 5 to 10 as new object
692 $seqobj->revcom # reverse complements sequence
693 $seqobj->translate # translation of the sequence
695 Note that some methods return strings, some return arrays and some return
696 references to objects. Here (as elsewhere in perl and bioperl) it is the user's
697 responsibility to check the relevant documentation so they know the
698 format of the data being returned.
700 Many of these methods are self-explanatory. However, bioperl's flexible
701 translation methods warrant further comment. Translation in bioinformatics
702 can mean two slightly different things:
704 =over 2
706 =item 1 Translating a nucleotide sequence from start to end.
708 =item 2 Taking into account the constraints of real coding regions in mRNAs.
710 =back
712 For historical reasons the bioperl implementation of translation does
713 the first of these tasks easily. Any sequence object which is not of moltype
714 'protein' can be translated by simply calling the method which returns
715 a protein sequence object:
717 $translation1 = $my_seq_object->translate;
719 However, the translate method can also be passed several optional parameters
720 to modify its behavior. For example, the first two arguments to "translate"
721 can be used to modify the characters used to represent stop (default '*')
722 and unknown amino acid ('X'). (These are normally best left untouched.)
723 The third argument determines the frame of the translation. The default
724 frame is "0". To get translations in the other two forward frames,
725 we would write:
727 $translation2 = $my_seq_object->translate(undef,undef,1);
728 $translation3 = $my_seq_object->translate(undef,undef,2);
730 The fourth argument to "translate" makes it possible to use alternative
731 genetic codes. There are currently 16 codon tables defined, including tables for
732 'Verterbate Mitochondrial', 'Bacterial', 'Alternative Yeast Nuclear'
733 and 'Ciliate, Dasycladacean and Hexamita Nuclear' translation. These
734 tables are located in the object Bio::Tools::CodonTable which is used
735 by the translate method. For example, for mitochondrial translation:
737 $human_mitochondrial_translation =
738 $my_seq_object->translate(undef,undef,undef, 2);
740 If we want to translate full coding regions (CDS) the way major
741 nucleotide databanks EMBL, GenBank and DDBJ do it, the translate
742 method has to perform more tricks. Specifically, 'translate' needs
743 to confirm that the sequence has appropriate start and terminator codons
744 at the beginning and the end of the sequence and that there are no terminator
745 codons present within the sequence. In addition, if the genetic code being used has
746 an atypical (non-ATG) start codon, the translate method needs to convert
747 the initial amino acid to methionine. These checks and conversions are triggered by
748 setting the fifth argument of the translate method to evaluate to "true".
750 If argument 5 is set to true and the criteria for a proper CDS are
751 not met, the method, by default, issues a warning. By setting the
752 sixth argument to evaluate to "true", one can instead instruct
753 the program to die if an improper CDS is found, e.g.
755 $protein_object =
756 $cds->translate(undef,undef,undef,undef,1,'die_if_errors');
758 =head2 III.3.2 Obtaining basic sequence statistics- MW, residue &codon
759 frequencies(SeqStats, SeqWord)
761 In addition to the methods directly available in the Seq object,
762 bioperl provides various "helper" objects to determine additional
763 information about a sequence. For example, the SeqStats object
764 provides methods for obtaining the molecular weight of the sequence as
765 well the number of occurrences of each of the component residues
766 (bases for a nucleic acid or amino acids for a protein.) For nucleic
767 acids, SeqStats also returns counts of the number of codons used. For
768 example:
770 use SeqStats
771 $seq_stats = Bio::Tools::SeqStats->new($seqobj);
772 $weight = $seq_stats->get_mol_wt();
773 $monomer_ref = $seq_stats->count_monomers();
774 $codon_ref = $seq_stats->count_codons(); # for nucleic acid sequence
776 Note: sometimes sequences will contain "ambiguous" codes. For this
777 reason, get_mol_wt() returns (a reference to) a two element array
778 containing a greatest lower bound and a least upper bound of the
779 molecular weight.
781 The SeqWords object is similar to SeqStats and provides methods for
782 calculating frequencies of "words" (eg tetramers or hexamers) within
783 the sequence.
785 =head2 III.3.3 Identifying restriction enzyme sites (RestrictionEnzyme)
787 Another common sequence manipulation task for nucleic acid sequences
788 is locating restriction enzyme cutting sites. Bioperl provides the
789 RestrictionEnzyme object for this purpose. Bioperl's standard
790 RestrictionEnzyme object comes with data for XXX different restriction
791 enzymes. A list of the available enzymes can be accessed using the
792 available_list() method. For example to select all available enzymes
793 that with cutting patterns that are six bases long one would write:
795 $re = new Bio::Tools::RestrictionEnzyme('-name'=>'EcoRI');
796 @sixcutters = $re->available_list(6);
798 Once an appropriate enzyme has been selected, the sites for that
799 enzyme on a given nucleic acid sequence can be obtained using the
800 cut_seq() method. The syntax for performing this task is:
802 $re1 = new Bio::Tools::RestrictionEnzyme(-name=>'EcoRI');
803 # $seqobj is the Seq object for the dna to be cut
804 @fragments = $re1->cut_seq($seqobj);
806 Adding an enzyme not in the default list is easily accomplished:
808 $re2 = new Bio::Tools::RestrictionEnzyme('-NAME' =>'EcoRV--GAT^ATC',
809 '-MAKE' =>'custom');
811 Once the custom enzyme object has been created, cut_seq() can be
812 called in the usual manner.
814 =head2 III.3.4 Identifying amino acid cleavage sites (Sigcleave)
816 For amino acid sequences we may be interested to know whether the
817 amino acid sequence contains a cleavable "signal sequence" for
818 directing the transport of the protein within the cell. SigCleave is
819 a program (originally part of the EGCG molecular biology package) to
820 predict signal sequences, and to identify the cleavage site.
822 The "threshold" setting controls the score reporting. If no value for
823 threshold is passed in by the user, the code defaults to a reporting
824 value of 3.5. SigCleave will only return score/position pairs which
825 meet the threshold limit.
827 There are 2 accessor methods for this object. "signals" will return a
828 perl hash containing the sigcleave scores keyed by amino acid
829 position. "pretty_print" returns a formatted string similar to the
830 output of the original sigcleave utility.
832 Syntax for using the modules is as follows:
834 use Bio::Tools::Sigcleave;
835 $sigcleave_object = new Bio::Tools::Sigcleave
836 ('-file'=>'sigtest.aa',
837 '-threshold'=>'3.5'
838 '-desc'=>'test sigcleave protein seq',
839 '-type'=>'AMINO
841 %raw_results = $sigcleave_object->signals;
842 $formatted_output = $sigcleave_object->pretty_print;
844 Note that Sigcleave is passed a raw sequence (or file containing a
845 sequence) rather than a sequence object when it is created. Also note
846 that the "type" in the Sigcleave object is "amino" whereas in a Seq
847 object it is "protein".
849 =head2 III.3.5 Miscellaneous sequence utilities: OddCodes, SeqPattern
851 OddCodes:
853 For some purposes it's useful to have a listing of an amino acid
854 sequence showing where the hydrophobic amino acids are located or
855 where the positively charged ones are. Bioperl provides this
856 capability via the module OddCodes.pm.
858 For example, to quickly see where the charged amino acids are located
859 along the sequence we perform:
861 use Bio::Tools::OddCodes;
862 $oddcode_obj = Bio::Tools::OddCodes->new($amino_obj);
863 $output = $oddcode_obj->charge();
865 The sequence will be transformed into a three-letter sequence (A,C,N)
866 for negative (acidic), positive (basic), and neutral amino acids. For
867 example the ACDEFGH would become NNAANNC.
869 For a more complete chemical description of the sequence one can call
870 the chemical() method which turns sequence into one with an 8-letter
871 chemical alphabet { A (acidic), L (aliphatic), M (amide), R
872 (aromatic), C (basic), H (hydroxyl), I (imino), S (sulfur) }:
874 $output = $oddcode_obj->chemical();
876 In this case the sample sequence ACDEFGH would become LSAARAC.
878 OddCodes also offers translation into alphabets showing alternate
879 characteristics of the amino acid sequence such as hydrophobicity,
880 "functionality" or grouping using Dayhoff's definitions. See the
881 documentation for OddCodes.pm for further details.
883 SeqPattern:
885 The SeqPattern object is used to manipulate sequences that include
886 perl "regular expressions". A key motivation for SeqPattern is to
887 have a way of generating a reverse complement of a nucleic acid
888 sequence pattern that includes ambiguous bases and/or regular
889 expressions. This capability leads to significant performance gains
890 when pattern matching on both the sense and anti-sense strands of a
891 query sequence are required. Typical syntax for using SeqPattern is
892 shown below. For more information, there are several interesting
893 examples in the script SeqPattern.pl in the examples directory.
895 Use Bio::Tools::SeqPattern;
896 $pattern = '(CCCCT)N{1,200}(agggg)N{1,200}(agggg)';
897 $pattern_obj = new Bio::Tools::SeqPattern('-SEQ' =>$pattern,
898 '-TYPE' =>'dna');
899 $pattern_obj2 = $pattern_obj->revcom();
900 $pattern_obj->revcom(1); ## returns expanded rev complement pattern.
902 =head2 III.4 Searching for "similar" sequences
904 One of the basic tasks in molecular biology is identifying sequences
905 that are, in some way, similar to a sequence of interest. The Blast
906 programs, originally developed at the NCBI, are widely used for
907 identifying such sequences. Bioperl offers a number of modules to
908 facilitate running Blast as well as to parse the often voluminous
909 reports produced by Blast.
911 =head2 III.4.1 Running BLAST locally (StandAloneBlast)
913 There are several reasons why one might want to run the Blast programs
914 locally - speed, data security, immunity to network problems, being
915 able to run large batch runs etc. The NCBI provides a downloadable
916 version of blast in a stand-alone format, and running blast locally
917 without any use of perl or bioperl - is completely
918 straightforward. However, there are situations where having a perl
919 interface for running the blast programs locally is convenient.
921 The module StandAloneBlast.pm offers the ability to wrap local calls
922 to blast from within perl. All of the currently available options of
923 NCBI Blast (eg PSIBLAST, PHIBLAST, bl2seq) are available from within
924 the bioperl StandAloneBlast interface. Of course, to use
925 StandAloneBlast, one needs to have installed locally ncbi-blast as
926 well as one or more blast-readable databases.
928 Basic usage of the StandAloneBlast.pm module is simple. Initially, a
929 local blast "factory object" is created.
931 @params = ('program' => 'blastn',
932 'database' => 'ecoli.nt');
933 $factory = Bio::Tools::StandAloneBlast->new(@params);
935 Any parameters not explicitly set will remain as the BLAST defaults.
936 Once the factory has been created and the appropriate parameters set,
937 one can call one of the supported blast executables. The input
938 sequence(s) to these executables may be fasta file(s), a Bio::Seq
939 object or an array of Bio::Seq objects, eg
941 $input = Bio::Seq->new('-id'=>"test query",
942 '-seq'=>"ACTAAGTGGGGG");
943 $blast_report = $factory->blastall($input);
945 The returned blast report will be in the form of a bioperl
946 parsed-blast object. The report object may be either a BPlite,
947 BPpsilite, BPbl2seq or Blast object depending on the type of blast
948 search. The "raw" blast report is also available.
950 The syntax for running PHIBLAST, PSIBLAST and bl2seq searches via
951 StandAloneBlast is also straightforward. See the StandAloneBlast.pm
952 documentation for details. In addition, the script standaloneblast.pl
953 in the examples directory contains descriptions of various possible
954 applications of the StandAloneBlast object.
956 =head2 III.4.2 Running BLAST remotely (using Blast.pm)
958 Bioperl supports remote execution of blasts at NCBI by means of the
959 Blast.pm object. (Note: the bioperl Blast object is referred to here
960 as Blast.pm to distinguish it from the actual Blast program).
961 Blast.pm is capable of both running Blasts and parsing the report
962 results. Blast.pm supports a wide array of modes, options and
963 parameters. As a result, using Blast.pm directly can be somewhat
964 complicated. Consequently, it is recommended to use, and if necessary
965 modify, the supplied scripts - run_blast_remote.pl and
966 retrieve_blast.pl in the examples/blast/ subdirectory - rather than to
967 use Blast.pm directly. Sample syntax looks like this:
969 run_blast_remote.pl seq/yel009c.fasta -prog blastp -db swissprot
970 retrieve_blast.pl < YEL009C.blastp2.swissprot.temp.html
972 The NCBI blast server will respond with an ID number indicating the
973 file in which the blast results are stored (with a line like "Obtained
974 request ID: 940912366-18156-27559"). That result file will then be
975 stored locally with a name like 940905064-15626-17267.txt, and can
976 subsequently be read with Blast.pm or BPlite as described below.
978 Run the scripts run_blast_remote.pl, retrieve_blast.pl and
979 blast_config.pl with the options "-h" or "-eg" for more examples on
980 how to use Blast.pm to perform remote blasts.
982 =head2 III.4.3 Parsing BLAST reports with Blast.pm
984 No matter how Blast searches are run (locally or remotely, with or
985 without a perl interface), they return large quantities of data that
986 are tedious to sift through. Bioperl offers two different objects -
987 Blast.pm and BPlite.pm (along with its minor modifications, BPpsilite
988 and BPbl2seq) for parsing Blast reports.
990 The parser contained within the Blast.pm module is the original Blast
991 parser developed for Bioperl. It is very full featured and has a
992 large array of options and output formats. Typical syntax for parsing
993 a blast report with Blast.pm is:
995 use Bio::Tools::Blast;
996 $blast = Bio::Tools::Blast->new(-file =>'t/blast.report',
997 -signif => 1e-5,
998 -parse => 1,
999 -stats => 1,
1000 -check_all_hits => 1, );
1001 $blast->display();
1002 $num_hits = $blast->num_hits;
1003 @hits = $blast->hits;
1004 $frac1 = $hits[1]->frac_identical;
1005 @inds = $hits[1]->hsp->seq_inds('query', 'iden', 1);
1007 Here the method "hits" returns an object containing the names of the
1008 sequences which produced a match and the "hsp" method returns a "high
1009 scoring pair" object containing the actual sequence alignments that
1010 each of the hits produced.
1012 One very nice feature of the Blast.pm parser is being able to define
1013 an arbitrary "filter function" for use while parsing the Blast hits.
1014 With this feature, you can filter your results to just save hits with
1015 specific pattern in their id fields (eg "homo sapiens") or specific
1016 sequence patterns in a returned high-scoring-pair or just about
1017 anything else that can be found in the blast report record.
1019 While the Blast object is parsing the report, each hit is checked by
1020 calling &filter($hit). All hits that generate false return values from
1021 &filter are screened out of the Blast object.. Note that the Blast
1022 object will normally stop parsing after the first non-significant hit
1023 or the first hit that does not pass the filter function. To force the
1024 Blast object to check all hits, include a " -check_all_hits =E<gt> 1"
1025 parameter. For example, to eliminate all hits with gaps or with less
1026 than 50% conserved residues one could use the following filter
1027 function:
1029 sub filter { $hit=shift;
1030 return ($hit->gaps == 0 and $hit->frac_conserved > 0.5); }
1032 and use it like this:
1034 $blastObj = Bio::Tools::Blast->new( '-file' => '/tmp/blast.out',
1035 '-parse' => 1,
1036 '-check_all_hits' => 1,
1037 '-filt_func' => \&filter );
1039 Unfortunately the flexibility of the Blast.pm parser comes at a cost
1040 of complexity. As a result of this complexity and the fact that
1041 Blast.pm's original developer is no longer actively supporting the
1042 module, the Blast.pm parser has been difficult to maintain and has not
1043 been upgraded to handle the output of the newer blast options such as
1044 PSIBLAST and BL2SEQ. Consequently, the BPlite parser (described in
1045 the following section) is recommended for most blast parsing within
1046 bioperl.
1048 =head2 III.4.4 Parsing BLAST reports with BPlite, BPpsilite and BPbl2seq
1050 Because of the issues with Blast.pm discussed above, Ian Korf's BPlite
1051 parser has been recently ported to Bioperl. BPlite is less complex
1052 and easier to maintain than Blast.pm. Although it has fewer options
1053 and display modes than Blast.pm, you will probably find that it
1054 contains the functionality that you need. ( One exception might be if
1055 you want to set up an arbitrary filter function - as described above -
1056 in which case you may want to use the Blast.pm parser.)
1058 BPlite
1060 The syntax for using BPlite is as follows where the method for
1061 retrieving hits is now called "nextSbjct" (for "subject"), while the
1062 method for retrieving high-scoring-pairs is called "nextHSP":
1064 use Bio::Tools::BPlite;
1065 $report = new BPlite(-fh=>\*STDIN);
1066 $report->query;
1067 while(my $sbjct = $report->nextSbjct) {
1068 $sbjct->name;
1069 while (my $hsp = $sbjct->nextHSP) { $hsp->score; }
1072 BPpsilite
1074 BPpsilite and BPbl2seq are objects for parsing (multiple iteration)
1075 PSIBLAST reports and Blast bl2seq reports, respectively. They are
1076 both minor variations on the BPlite object.
1078 The syntax for parsing a multiple iteration PSIBLAST report is as
1079 shown below. The only significant additions to BPlite are methods to
1080 determine the number of iterated blasts and to access the results from
1081 each iteration. The results from each iteration are parsed in the
1082 same manner as a (complete) BPlite object.
1084 use Bio::Tools::BPpsilite;
1085 $report = new BPpsilite(-fh=>\*STDIN);
1086 $total_iterations = $report->number_of_iterations;
1087 $last_iteration = $report->round($total_iterations)
1088 while(my $sbjct = $last_iteration ->nextSbjct) {
1089 $sbjct->name;
1090 while (my $hsp = $sbjct->nextHSP) {$hsp->score; }
1093 BPbl2seq
1095 BLAST bl2seq is a program for comparing and aligning two sequences
1096 using BLAST. Although the report format is similar to that of a
1097 conventional BLAST, there are a few differences. Consequently, the
1098 standard bioperl parsers Blast.pm and BPlite are unable to read bl2seq
1099 reports directly. From the user's perspective, the main difference
1100 between bl2seq and other blast reports is that the bl2seq report does
1101 not print out the name of the first of the two aligned sequences.
1102 Consequently, BPbl2seq has no way of identifying the name of one of
1103 the initial sequence unless it is explicitly passed to constructor as
1104 a second argument as in:
1106 use Bio::Tools::BPbl2seq;
1107 $report = Bio::Tools::BPbl2seq->new(-file => "t/bl2seq.out", -queryname => "ALEU_HORVU");
1108 $matches = $report->match;
1110 =head2 III.4.5 Parsing HMM reports (HMMER::Results)
1112 Blast is not the only sequence-similarity-searching program supported
1113 by bioperl. HMMER is a Hidden Markov-chain Model (HMM) program that
1114 (among other capabilities) enables sequence similarity
1115 searching. Bioperl does not currently provide a perl interface for
1116 running HMMER. However, bioperl does provide a HMMER report parser
1117 with the (perhaps not too descriptive) name of Results.
1119 Results can parse reports generated both by the HMMER program
1120 hmmsearch - which searches a sequence database for sequences similar
1121 to those generated by a given HMM - and the program hmmpfam - which
1122 searches a HMM database for HMMs which match domains of a given
1123 sequence. For hmmsearch, a series of HMMER::Set objects are made, one
1124 for each sequence. For hmmpfam searches, only one Set object is made.
1125 Sample usage for parsing a hmmsearch report might be:
1127 use Bio::Tools::HMMER::Results;
1128 $res = new Bio::Tools::HMMER::Results('-file' => 'output.hmm' ,
1129 '-type' => 'hmmsearch');
1130 foreach $seq ( $res->each_Set ) {
1131 print "Sequence bit score is ", $seq->bits, "\n";
1132 foreach $domain ( $seq->each_Domain ) {
1133 print " Domain start ", $domain->start, " end ",
1134 $domain->end," score ",$domain->bits,"\n";
1138 =head2 III.5 Creating and manipulating sequence alignments
1140 Once one has identified a set of similar sequences, one often needs to
1141 create an alignment of those sequences. Bioperl offers several perl
1142 objects to facilitate sequence alignment: pSW, Clustalw.pm, TCoffee.pm
1143 and the bl2seq option of StandAloneBlast. All of these objects take
1144 as arguments a reference to an array of (unaligned) Seq objects. All
1145 (except bl2seq) return a reference to a SimpleAlign object. bl2seq
1146 can also produce a SimpleAlign object when it is combined with AlignIO
1147 (see below section III.5.2).
1149 =head2 III.5.1 Aligning 2 sequences with Smith-Waterman (pSW)
1151 The Smith-Waterman (SW) algorithm is the standard method for producing
1152 an optimal alignment of two sequences. Bioperl supports the
1153 computation of SW alignments via the pSW object. The SW algorithm
1154 itself is implemented in C and incorporated into bioperl using an XS
1155 extension. This has significant efficiency advantages but means that
1156 pSW will **not** work unless you have compiled the bioperl-ext
1157 package. If you have compiled the bioperl-ext package, usage is
1158 simple, where the method align_and_show displays the alignment while
1159 pairwise_alignment produces a (reference to) a SimpleAlign object.
1161 use Bio::Tools::pSW;
1162 $factory = new Bio::Tools::pSW( '-matrix' => 'blosum62.bla',
1163 '-gap' => 12,
1164 '-ext' => 2, );
1165 $factory->align_and_show($seq1, $seq2, STDOUT);
1166 $aln = $factory->pairwise_alignment($seq1, $seq2);
1168 SW matrix, gap and extension parameters can be adjusted as shown.
1169 Bioperl comes standard with blosum62 and gonnet250 matrices. Others
1170 can be added by the user. For additional information on accessing the
1171 SW algorithm via pSW see the example script pSW.pl and the
1172 documentation in pSW.pm.
1174 =head2 III.5.2 Aligning 2 sequences with Blast using bl2seq and AlignIO
1176 As an alternative to Smith-Waterman, two sequences can also be aligned
1177 in Bioperl using the bl2seq option of Blast within the StandAloneBlast
1178 object. To get an alignment - in the form of a SimpleAlign object -
1179 using bl2seq, you need to parse the bl2seq report with the AlignIO
1180 file format reader as follows:
1182 $factory = Bio::Tools::StandAloneBlast->new('outfile' => 'bl2seq.out');
1183 $bl2seq_report = $factory->bl2seq($seq1, $seq2);
1184 # Use AlignIO.pm to create a SimpleAlign object from the bl2seq report
1185 $str = Bio::AlignIO->new('-file '=>' bl2seq.out',
1186 '-format' => 'bl2seq');
1187 $aln = $str->next_aln();
1189 =head2 III.5.3 Aligning multiple sequences (Clustalw.pm, TCoffee.pm)
1191 For aligning multiple sequences (ie two or more), bioperl offers a
1192 perl interface to the bioinformatics-standard clustalw and tcoffee
1193 programs. Clustalw has been a leading program in global multiple
1194 sequence alignment (MSA) for several years. TCoffee is a relatively
1195 recent program - derived from clustalw - which has been shown to
1196 produce better results for local MSA.
1198 To use these capabilities, the clustalw and/or tcoffee programs
1199 themselves need to be installed on the host system. In addition, the
1200 environmental variables CLUSTALDIR and TCOFFEEDIR need to be set to
1201 the directories containg the executables. See section I.3 and the
1202 Clustalw.pm and TCoffee.pm module documentation for information on
1203 downloading and installing these programs.
1205 From the user's perspective, the bioperl syntax for calling
1206 Clustalw.pm or TCoffee.pm is almost identical. The only differences
1207 are the names of the modules themselves appearing in the initial "use"
1208 and constructor statements and the names of the some of the individual
1209 program options and parameters.
1211 In either case, initially, a "factory object" must be created. The
1212 factory may be passed most of the parameters or switches of the
1213 relevant program. In addition, alignment parameters can be changed
1214 and/or examined after the factory has been created. Any parameters
1215 not explicitly set will remain as the underlying program's
1216 defaults. Clustalw.pm/TCoffee.pm output is returned in the form of a
1217 SimpleAlign object. It should be noted that some Clustalw and TCoffee
1218 parameters and features (such as those corresponding to tree
1219 production) have not been implemented yet in the Perl interface.
1221 Once the factory has been created and the appropriate parameters set,
1222 one can call the method align() to align a set of unaligned sequences,
1223 or profile_align() to add one or more sequences or a second alignment
1224 to an initial alignment. Input to align() consists of a set of
1225 unaligned sequences in the form of the name of file containing the
1226 sequences or a reference to an array of Bio::Seq objects. Typical
1227 syntax is shown below. (We illustrate with Clustalw.pm, but the same
1228 syntax - except for the module name - would work for TCoffee.pm)
1230 use Bio::Tools::Run::Alignment::Clustalw;
1231 @params = ('ktuple' => 2, 'matrix' => 'BLOSUM');
1232 $factory = Bio::Tools::Run::Alignment::Clustalw->new(@params);
1233 $ktuple = 3;
1234 $factory->ktuple($ktuple); # change the parameter before executing
1235 $seq_array_ref = \@seq_array;
1236 # where @seq_array is an array of Bio::Seq objects
1237 $aln = $factory->align($seq_array_ref);
1239 Clustalw.pm/TCoffee.pm can also align two (sub)alignments to each
1240 other or add a sequence to a previously created alignment by using the
1241 profile_align method. For further details on the required syntax and
1242 options for the profile_align method, the user is referred to the
1243 Clustalw.pm/TCoffee.pm documentation. The user is also encouraged to
1244 run the script clustalw.pl in the examples directory.
1246 =head2 III.5.4 Manipulating / displaying alignments (SimpleAlign, UnivAln)
1248 As described in section II.2, bioperl currently includes two alignment
1249 objects, SimpleAlign and UnivAln. SimpleAlign objects are usually
1250 more useful, since they are directly produced by bioperl alignment
1251 creation objects ( eg Clustalw.pm and pSW) and can be used to read and
1252 write multiple alignment formats via AlignIO.
1254 However, SimpleAlign currently only offers limited functionality for
1255 alignment manipulation. One useful method offered by SimpleAlign is
1256 consensus_string(). This method returns a string with the most common
1257 residue in the alignment at each string position. An optional
1258 threshold ranging from 0 to 100 can be passed to consensus_string. If
1259 the consensus residue appears in fewer than the threshold % of the
1260 sequences, consensus_string will return a "?" at that
1261 location. Typical usage is:
1263 use Bio::SimpleAlign;
1264 $aln = Bio::SimpleAlign->new('t/alnfile.fasta');
1265 $threshold_percent = 60;
1266 $str = $aln->consensus_string($threshold_percent)
1268 UnivAln, on the other hand, offers a variety of methods for "slicing
1269 and dicing" an alignment including methods for removing gaps, reverse
1270 complementing specified rows and/or columns of an alignment, and
1271 extracting consensus sequences with specified thresholds for the
1272 entire alignment or a sub-alignment. Typical usage is:
1274 use Bio::UnivAln;
1275 $aln = Bio::UnivAln->new('t/alnfile.fasta');
1276 $resSlice1 = $aln->remove_gaps(); # original sequences without gaps
1277 $resSlice2 = $aln->revcom([1,3]); # reverse complement, rows 1+3 only
1278 $resSlice3 = $aln->consensus(0.6, [1,3]);
1279 # 60% majority, columns 1+3 only
1281 Many additional - and more intricate - methods exist. See the UnivAln
1282 documentation. Note that if you do want to use UnivAln's methods on
1283 an alignment, you will first need to convert the alignment into fasta
1284 format (which can be done via the SimpleAlign and AlignIO objects
1285 discussed above.)
1287 =head2 III.6 Searching for genes and other structures on genomic DNA
1288 (Genscan, Sim4, ESTScan, MZEF)
1290 Automated searching for putative genes, coding sequences and other
1291 functional units in genomic and expressed sequence tag (EST) data has
1292 become very important as the available quantity of sequence data has
1293 rapidly increased. Many gene searching programs currently exist.
1294 Each produces reports containing predictions that must be read
1295 manually or parsed by automated report readers.
1297 Parsers for four widely used gene prediction programs- Genscan, Sim4,
1298 ESTScan and MZEF - are currently available or under active development
1299 in bioperl. The interfaces for the four parsers are similar. We illustrate
1300 the usage for Genscan and Sim4 here. The syntax is relatively
1301 self-explanatory; further details are available in the module
1302 documentation in the Bio::Tools directory.
1304 use Bio::Tools::Genscan;
1305 $genscan = Bio::Tools::Genscan->new(-file => 'result.genscan');
1306 # $gene is an instance of Bio::Tools::Prediction::Gene
1307 # $gene->exons() returns an array of Bio::Tools::Prediction::Exon objects
1308 while($gene = $genscan->next_prediction())
1309 { @exon_arr = $gene->exons(); }
1310 $genscan->close();
1312 use Bio::Tools::Sim4::Results;
1313 $sim4 = new Bio::Tools::Sim4::Results(-file=> 't/sim4.rev', -estisfirst=>0);
1314 # $exonset is-a Bio::SeqFeature::Generic with Bio::Tools::Sim4::Exons
1315 # as sub features
1316 $exonset = $sim4->next_exonset;
1317 @exons = $exonset->sub_SeqFeature();
1318 # $exon is-a Bio::SeqFeature::FeaturePair
1319 $exon = 1;
1320 $exonstart = $exons[$exon]->start();
1321 $estname = $exons[$exon]->est_hit()->seqname();
1322 $sim4->close();
1324 =head2 III.7 Developing machine readable sequence annotations
1326 Historically, annotations for sequence data have been entered and read
1327 manually in flat-file or relational databases with relatively little
1328 concern for machine readability. More recent projects - such as EBI's
1329 Ensembl project and the efforts to develop an XML molecular biology
1330 data specification - have begun to address this limitation. Because
1331 of its strengths in text processing and regular-expression handling,
1332 perl is a natural choice for the computer language to be used for this
1333 task. And bioperl offers numerous tools to facilitate this process -
1334 several of which are described in the following sub-sections.
1336 =head2 III.7.1 Representing sequence annotations (Annotation,
1337 SeqFeature)
1339 As of the 0.7 release of bioperl, the fundamental sequence object,
1340 Seq, can have multiple sequence feature (SeqFeature) objects - eg
1341 Gene, Exon, Promoter objects - associated with it. A Seq object can
1342 also have an Annotation object (used to store database links,
1343 literature references and comments) associated with it. Creating a
1344 new SeqFeature and Annotation and associating it with a Seq is
1345 accomplished with syntax like:
1347 $feat = new Bio::SeqFeature::Generic('-start' => 40,
1348 '-end' => 80,
1349 '-strand' => 1,
1350 '-primary' => 'exon',
1351 '-source' => 'internal' );
1352 $seqobj->add_SeqFeature($feat); # Add the SeqFeature to the parent
1353 $seqobj->annotation(new Bio::Annotation
1354 ('-description' => 'desc-here'));
1356 Once the features and annotations have been associated with the Seq,
1357 they can be with retrieved, eg:
1359 @topfeatures = $seqobj->top_SeqFeatures(); # just top level, or
1360 @allfeatures = $seqobj->all_SeqFeatures(); # descend into sub features
1361 $ann = $seqobj->annotation(); # annotation object
1363 The individual components of a SeqFeature can also be set or retrieved
1364 with methods including:
1366 # attributes which return numbers
1367 $feat->start # start position
1368 $feat->end # end position
1370 $feat->strand # 1 means forward, -1 reverse, 0 not relevant
1372 # attributes which return strings
1373 $feat->primary_tag # the main 'name' of the sequence feature,
1374 # eg, 'exon'
1375 $feat->source_tag # where the feature comes from, eg'BLAST'
1377 # attributes which return Bio::PrimarySeq objects
1378 $feat->seq # the sequence between start,end
1379 $feat->entire_seq # the entire sequence
1381 # other useful methods include
1382 $feat->overlap($other) # do SeqFeature $feat and SeqFeature $other overlap?
1383 $feat->contains($other) # is $other completely within $feat?
1384 $feat->equals($other) # do $feat and $other completely $agree?
1385 $feat->sub_SeqFeatures # create/access an array of subsequence features
1387 =head2 III.7.2 Representing and large and/or changing sequences
1388 (LiveSeq,LargeSeq)
1390 Very large sequences and/or data files with sequences that are
1391 frequently being updated present special problems to automated
1392 sequence-annotation storage and retrieval projects. Bioperl's
1393 LargeSeq and LiveSeq objects are designed to address these two
1394 situations.
1396 LargeSeq
1398 A LargeSeq object is a SeqI compliant object that stores a sequence as
1399 a series of files in a temporary directory (see sect II.1 for a
1400 definition of SeqI objects). The aim is to enable storing very large
1401 sequences (eg, E<gt> 100MBases) without running out of memory and, at the
1402 same time, preserving the familiar bioperl Seq object interface. As a
1403 result, from the users perspective, using a LargeSeq object is almost
1404 identical to using a Seq object. The principal difference is in the
1405 format used in the SeqIO calls. Another difference is that the user
1406 must remember to only read in small chunks of the sequence at one
1407 time. These differences are illustrated in the following code:
1409 $seqio = new Bio::SeqIO('-format'=>'largefasta',
1410 '-file' =>'t/genomic-seq.fasta');
1411 $pseq = $seqio->next_seq();
1412 $plength = $pseq->length();
1413 $last_4 = $pseq->subseq($plength-3,$plength); # this is OK
1415 #On the other hand, the next statement would
1416 #probably cause the machine to run out of memory
1417 #$lots_of_data = $pseq->seq(); #NOT OK for a large LargeSeq object
1419 LiveSeq
1421 The LiveSeq object addresses the need for a sequence object capable of
1422 handling sequence data that may be changing over time. In such a
1423 sequence, the precise locations of features along the sequence may
1424 change. LiveSeq deals with this issue by re-implementing the sequence
1425 object internally as a "double linked chain." Each element of the
1426 chain is connected to other two elements (the PREVious and the NEXT
1427 one). There is no absolute position (like in an array), hence if
1428 positions are important, they need to be computed (methods are
1429 provided). Otherwise it's easy to keep track of the elements with
1430 their "LABELs". There is one LABEL (think of it as a pointer) to each
1431 ELEMENT. The labels won't change after insertions or deletions of the
1432 chain. So it's always possible to retrieve an element even if the
1433 chain has been modified by successive insertions or deletions.
1435 Although the implementation of the LiveSeq object is novel, its
1436 bioperl user interface is unchanged since LiveSeq implements a
1437 PrimarySeqI interface (recall PrimarySeq is the subset of Seq without
1438 annotations or SeqFeatures - see sec II.1). Consequently syntax for
1439 using LiveSeq objects is familiar although a modified version of SeqIO
1440 called LiveSeq::IO::Bioperl needs to be used to actually load the
1441 data, eg:
1443 $loader=Bio::LiveSeq::IO::BioPerl->load('-db'=>"EMBL",
1444 '-file'=>"t/factor7.embl");
1445 $gene=$loader->gene2liveseq('-gene_name' => "factor7");
1446 $id = $gene->get_DNA->display_id ;
1447 $maxstart = $gene->maxtranscript->start;
1449 Creating, maintaining and querying of LiveSeq genes is quite memory
1450 and processor intensive. Consequently, any additional information
1451 relating to mutational changes in a gene need to be stored separately
1452 from the sequence data itself. The next section describes the mutation
1453 and polymorphism objects used to accomplish this.
1456 =head2 III.7.3 Representing related sequences - mutations,
1457 polymorphisms etc (Allele, SeqDiff,)
1459 Bio::LiveSeq::Mutation object allows for a basic description of a
1460 sequence change in DNA or cDNA sequence of a gene.
1461 Bio::LiveSeq::Mutator takes in mutations, applies them to a LiveSeq
1462 gene and returns a set of Bio::Variation objects describing the net
1463 effect of the mutation on the gene at the DNA, RNA and protein stages.
1465 The objects in Bio::Variation and Bio::LiveSeq directory were
1466 originally designed for the "Computational Mutation Expression
1467 Toolkit" project at European Bioinformatics Institute (EBI). The
1468 result of using them to mutate a gene is a holder object, 'SeqDiff',
1469 that can be printed out or queried for specific information. For
1470 example, to find out if restriction enzyme changes caused by a
1471 mutation are exactly the same in DNA and RNA sequences, we can write:
1473 use Bio::LiveSeq::IO::BioPerl;
1474 use Bio::LiveSeq::Mutator;
1475 use Bio::LiveSeq::Mutation;
1477 $loader=Bio::LiveSeq::IO::BioPerl->load('-file' => "$filename");
1478 $gene=$loader->gene2liveseq('-gene_name' => $gene_name);
1479 $mutation = new Bio::LiveSeq::Mutation ('-seq' =>'G',
1480 '-pos' => 100,
1482 $mutate = Bio::LiveSeq::Mutator->new('-gene' => $gene,
1483 '-numbering' => "coding"
1485 $mutate->add_Mutation($mutation);
1486 $seqdiff = $mutate->change_gene();
1487 $DNA_re_changes = $seqdiff->DNAMutation->restriction_changes;
1488 $RNA_re_changes = $seqdiff->RNAChange->restriction_changes;
1489 $DNA_re_changes eq $RNA_re_changes or print "Different!\n";
1491 For a complete working script, see the change_gene.pl script
1492 in the examples directory. For more details on the use of these objects
1493 see the documentation in the modules as well as the original
1494 documentation for the "Computational Mutation Expression Toolkit"
1495 project at http://www.ebi.ac.uk/mutations/toolkit/.
1497 =head2 III.7.4 Sequence XML representations - generation and parsing
1498 (SeqIO::game)
1500 The previous subsections have described tools for automated sequence
1501 annotation by the creation of an "object layer" on top of a
1502 traditional database structure. XML takes a somewhat different
1503 approach. In XML, the data structure is unmodified, but machine
1504 readability is facilitated by using a data-record syntax with special
1505 flags and controlled vocabulary.
1507 Bioperl supports a set of XML flags and vocabulary words for molecular
1508 biology - called bioxml - detailed at
1509 http://www.bioxml.org/dtds/current/ The idea is that any bioxml
1510 features can be turned into bioperl Bio::Seq annotations. Conversely
1511 Seq object features and annotations can be converted to XML so that
1512 they become available to any other systems that are XML (and bioxml)
1513 compliant. Typical usage is shown below. No special syntax is
1514 required by the user. Note that some Seq annotation will be lost when
1515 using bioxml in this manner - since in its current implementation,
1516 bioxml does not support all the annotation information available in
1517 Seq objects.
1519 $str = Bio::SeqIO->new('-file'=> 't/test.game',
1520 '-format' => 'game');
1521 $seq = $str->next_primary_seq();
1522 $id = $seq->id;
1523 @feats = $seq->all_SeqFeatures();
1524 $first_primary_tag = $feats[0]->primary_tag;
1527 =head1 IV. Related projects - biocorba, biopython, biojava, Ensembl,
1528 AnnotationWorkbench / bioperl-gui
1530 There are several "sister projects" to bioperl currently under
1531 development. These include biocorba, biopython, biojava, Ensembl, and
1532 the Annotation Workbench (which includes Bioperl-gui). These are all
1533 large complex projects and describing them in detail here will not be
1534 attempted. However a brief introduction seems appropriate since, in
1535 the future, they may each provide significant added utility to the
1536 bioperl user.
1538 =head2 IV.1 Biocorba
1540 Interface objects have facilitated interoperability between bioperl
1541 and other perl packages such as Ensembl and the Annotation Workbench.
1542 However, interoperability between bioperl and packages written in
1543 other languages requires additional support software. CORBA is one
1544 such framework for interlanguage support, and the biocorba project is
1545 currently implementing a CORBA interface for bioperl. With biocorba,
1546 objects written within bioperl will be able to communicate with
1547 objects written in biopython and biojava (see the next subsection).
1548 For more information, se the biocorba project website at
1549 http://biocorba.org/.
1551 =head2 IV.2 Biopython and biojava
1553 Biopython and biojava are open source projects with very similar goals
1554 to bioperl. However their code is implemented in python and java,
1555 respectively. With the development of interface objects and biocorba,
1556 it is possible to write java or python objects which can be accessed
1557 by a bioperl script. Or to call bioperl objects from java or python
1558 code. Since biopython and biojava are more recent projects than
1559 bioperl, most effort to date has been to port bioperl functionality to
1560 biopython and biojava rather than the other way around. However, in
1561 the future, some bioinformatics tasks may prove to be more effectively
1562 implemented in java or python in which case being able to call them
1563 from within bioperl will become more important. For more information,
1564 go to the biojava http://biojava.org/ and biopython
1565 http://biopython.org/ websites.
1567 =head2 IV.3 Ensembl
1569 Ensembl is an ambitious automated-genome-annotation project at EBI.
1570 Much of Ensembl's code is based on bioperl and Ensembl developers, in
1571 turn, have contributed significant pieces of code to bioperl. In
1572 particular, the bioperl code for automated sequence annotation has
1573 been largely contributed by Ensembl developers. (The close association
1574 between bioperl and Ensembl is not surprising, since the same
1575 individual - Ewan Birney - has been coordinating both projects).
1576 Describing Ensembl and its capabilities is far beyond the scope of
1577 this tutorial The interested reader is referred to the Ensembl website
1578 at http://www.ensembl.org/ .
1580 =head2 IV.4 The Annotation Workbench and bioperl-gui
1582 The Annotation Workbench (AW) being developed at the Plant
1583 Biotechnology Institute of the National Research Council of Canada is
1584 designed to be an integrated suite of tools for examining a sequence,
1585 predicting gene structure, and creating annotations. The workbench
1586 features a graphical user interface and is implemented completely in
1587 perl. Information about the AW is available at
1588 http://bioinfo.pbi.nrc.ca/dblock/wiki/html/Bioinfo/AnnotationWorkbench.htm . A
1589 goal of the AW team is to port much of the functionality of the AW to
1590 bioperl. The porting process has begun and displaying a Seq object
1591 graphically is now possible. You can download the current version of
1592 the gui software from the bioperl bioperl-gui CVS directory at
1593 http://cvs.bioperl.org/cgi-bin/viewcvs/viewcvs.cgi/bioperl-gui/?cvsroot=bioperl
1597 =head2 V.1 Appendix: Public Methods of Bioperl Objects
1599 Appendix V.1 lists the public methods for the principal bioperl
1600 objects. The methods are grouped together under the name of the
1601 "ancestor" object from which the object inherits the method. Knowing
1602 the name of the ancestor object is useful since the ancestor module
1603 will contain the documentation describing the method.
1605 Since nearly all bioperl objects inherit from Bio::Root::RootI, the
1606 methods inherited from Bio::Root::RootI are listed separately - and
1607 only once.
1609 If a listing of public methods for a bioperl object not listed in this
1610 appendix is required, the listing can be obtained by running the
1611 bptutorial.pl script using with command number 100 as in:
1613 perl -w bptutorial.pl 100 Bio::Tools::SeqStats
1615 (where Bio::Tools::SeqStats would be replaced with the name of the
1616 bioperl object for which the methods list is needed - see Appendix V.2
1617 for details)
1619 B<Methods for Object Bio::Root::RootI>
1621 I<Methods taken from package Bio::Root::RootI>
1623 DESTROY gensym new qualify qualify_to_ref stack_trace
1624 stack_trace_dump tempdir tempfile throw ungensym verbose warn
1626 B<Methods for Object Bio::AlignIO>
1628 I<Methods taken from package Bio::AlignIO>
1630 PRINT READLINE TIEHANDLE close fh newFh next_aln write_aln
1632 B<Methods for Object Bio::DB::GenBank>
1634 I<Methods taken from package Bio::DB::NCBIHelper>
1636 get_Stream_by_batch get_params
1638 I<Methods taken from package Bio::DB::RandomAccessI>
1640 get_Seq_by_acc get_Seq_by_id
1642 I<Methods taken from package Bio::DB::WebDBSeqI>
1644 GET HEAD POST PUT default_format get_Stream_by_acc get_Stream_by_id
1645 get_request get_seq_stream postprocess_data proxy request_format
1646 retrieval_type ua url_base_address url_params
1648 B<Methods for Object Bio::Index::Fasta>
1650 I<Methods taken from package Bio::DB::RandomAccessI>
1652 get_Seq_by_acc get_Seq_by_id
1654 I<Methods taken from package Bio::DB::SeqI>
1656 get_PrimarySeq_stream get_Seq_by_primary_id get_all_primary_ids
1658 I<Methods taken from package Bio::Index::Abstract>
1660 O_CREAT O_RDWR add_record db dbm_package filename get_stream
1661 make_index open_dbm pack_record unpack_record write_flag
1663 I<Methods taken from package Bio::Index::AbstractSeq>
1665 fetch
1667 I<Methods taken from package Bio::Index::Fasta>
1669 default_id_parser id_parser
1671 B<Methods for Object Bio::LocatableSeq>
1673 I<Methods taken from package Bio::LocatableSeq>
1675 get_nse
1677 I<Methods taken from package Bio::PrimarySeqI>
1679 GCG_checksum accession_number ary can_call_new desc display_id getseq
1680 id moltype out_fasta primary_id revcom seq seq_len setseq str subseq
1681 translate translate_old trunc type
1683 I<Methods taken from package Bio::RangeI>
1685 carp confess contains croak end equals intersection length new
1686 overlap_extent overlaps start strand union
1688 I<Methods taken from package Bio::Seq>
1690 accession add_SeqFeature add_date add_secondary_accession annotation
1691 division each_date each_secondary_accession feature_count keywords
1692 molecule primary_seq species sv
1694 I<Methods taken from package Bio::SeqI>
1696 all_SeqFeatures top_SeqFeatures write_GFF
1698 B<Methods for Object Bio::Seq>
1700 I<Methods taken from package Bio::PrimarySeqI>
1702 GCG_checksum accession_number ary can_call_new carp confess croak
1703 desc display_id getseq id length moltype out_fasta primary_id revcom
1704 seq seq_len setseq str subseq translate translate_old trunc type
1706 I<Methods taken from package Bio::Seq>
1708 accession add_SeqFeature add_date add_secondary_accession annotation
1709 division each_date each_secondary_accession feature_count keywords
1710 molecule primary_seq species sv
1712 I<Methods taken from package Bio::SeqI>
1714 all_SeqFeatures top_SeqFeatures write_GFF
1716 B<Methods for Object Bio::SeqIO>
1718 I<Methods taken from package Bio::SeqIO>
1720 PRINT READLINE TIEHANDLE close fh moltype newFh next_primary_seq
1721 next_seq write_seq
1723 B<Methods for Object Bio::SimpleAlign>
1725 I<Methods taken from package Bio::SimpleAlign>
1727 addSeq alpha_startend column_from_residue_number consensus_aa
1728 consensus_string eachSeq eachSeqWithId each_alphabetically
1729 get_displayname id is_flush length_aln map_chars
1730 maxdisplayname_length maxname_length maxnse_length no_residues
1731 no_sequences percentage_identity purge read_MSF read_Pfam
1732 read_Pfam_file read_Prodom read_fasta read_mase read_selex
1733 read_stockholm removeSeq set_displayname set_displayname_count
1734 set_displayname_flat set_displayname_normal sort_alphabetically
1735 uppercase write_MSF write_Pfam write_clustalw write_fasta write_selex
1737 B<Methods for Object Bio::Tools::BPbl2seq>
1739 I<Methods taken from package Bio::RangeI>
1741 carp confess contains croak end equals intersection length new
1742 overlap_extent overlaps start strand union
1744 I<Methods taken from package Bio::SeqFeature::FeaturePair>
1746 feature1 feature2 hend hframe hprimary_tag hscore hseqname
1747 hsource_tag hstart hstrand invert
1749 I<Methods taken from package Bio::SeqFeature::Generic>
1751 add_sub_SeqFeature add_tag_value annotation attach_seq entire_seq
1752 flush_sub_SeqFeature frame remove_tag score seq seqname
1753 slurp_gff_file
1755 I<Methods taken from package Bio::SeqFeature::SimilarityPair>
1757 bits from_searchResult query significance subject
1759 I<Methods taken from package Bio::SeqFeatureI>
1761 all_tags each_tag_value gff2_string gff_string has_tag primary_tag
1762 source_tag sub_SeqFeature
1764 I<Methods taken from package Bio::Tools::BPbl2seq>
1766 P homologySeq hs match percent positive qs querySeq sbjctSeq ss
1768 B<Methods for Object Bio::Tools::BPlite>
1770 I<Methods taken from package Bio::Tools::BPlite>
1772 database fh nextSbjct pattern qlength query query_pattern_location
1774 B<Methods for Object Bio::Tools::BPpsilite>
1776 I<Methods taken from package Bio::Tools::BPpsilite>
1778 database fh number_of_iterations pattern qlength query
1779 query_pattern_location round
1781 B<Methods for Object Bio::Tools::Blast>
1783 I<Methods taken from package Bio::Root::Object>
1785 clear_err clone compress_file containment debug delete_file destroy
1786 display dont_warn err err_state err_string fatal_on_warn fh file
1787 file_date find_object has_name has_warning index make monitor name
1788 parent print_err read record_err set_display set_err_data set_log_err
1789 set_read set_stats show src_obj strict strictness terse testing
1790 to_string uncompress_file verbosity warn_on_fatal xref
1792 I<Methods taken from package Bio::Tools::Blast>
1794 ambiguous_aln carp confess croak db_local db_remote expect filter
1795 gap_creation gap_extension gapped highest_expect highest_p
1796 highest_signif hit hits homol_data is_signif karlin_altschul
1797 lowest_expect lowest_p lowest_signif matrix min_length num_hits
1798 overlap s signif signif_fmt table table_labels table_labels_tiled
1799 table_tiled to_html word_size
1801 I<Methods taken from package Bio::Tools::SeqAnal>
1803 best database database_letters database_release database_seqs date
1804 length parse program program_version query query_desc roman2int run
1805 set_date
1807 I<Methods taken from package Exporter>
1809 export export_fail export_ok_tags export_tags export_to_level import
1810 require_version
1812 B<Methods for Object Bio::Tools::CodonTable>
1814 I<Methods taken from package Bio::Root::RootI>
1816 DESTROY gensym new qualify qualify_to_ref stack_trace
1817 stack_trace_dump tempdir tempfile throw ungensym verbose warn
1819 I<Methods taken from package Bio::Tools::CodonTable>
1821 id is_start_codon is_ter_codon is_unknown_codon name revtranslate
1822 translate translate_strict
1824 B<Methods for Object Bio::Tools::Genscan>
1826 I<Methods taken from package Bio::SeqAnalysisParserI>
1828 carp confess croak next_feature parse
1830 I<Methods taken from package Bio::Tools::AnalysisResult>
1832 analysis_date analysis_method analysis_method_version analysis_query
1833 analysis_subject close
1835 I<Methods taken from package Bio::Tools::Genscan>
1837 next_prediction
1839 B<Methods for Object Bio::Tools::HMMER::Results>
1841 I<Methods taken from package Bio::Tools::HMMER::Results>
1843 add_Domain add_Set carp confess croak dictate_hmm_acc
1844 domain_bits_cutoff_from_evalue each_Domain each_Set filter_on_cutoff
1845 get_Set get_unit_nse highest_noise lowest_true number write_FT_output
1846 write_GDF write_GDF_bits write_ascii_out write_scores_bits
1848 B<Methods for Object Bio::Tools::OddCodes>
1850 I<Methods taken from package Bio::Tools::OddCodes>
1852 Dayhoff Sneath Stanfel charge chemical functional hydrophobic
1853 structural
1855 B<Methods for Object Bio::Tools::RestrictionEnzyme>
1857 I<Methods taken from package Bio::Tools::RestrictionEnzyme>
1859 annotate_seq available available_list cut_locations cut_seq
1860 cuts_after is_available name palindromic revcom seq site string
1862 I<Methods taken from package Exporter>
1864 export export_fail export_ok_tags export_tags export_to_level import
1865 require_version
1867 B<Methods for Object Bio::Tools::Run::Alignment::Clustalw>
1869 I<Methods taken from package Bio::Tools::Run::Alignment::Clustalw>
1871 AUTOLOAD align exists_clustal profile_align
1873 B<Methods for Object Bio::Tools::Run::Alignment::TCoffee>
1875 I<Methods taken from package Bio::Tools::Run::Alignment::TCoffee>
1877 AUTOLOAD align exists_tcoffee profile_align
1879 B<Methods for Object Bio::Tools::Run::StandAloneBlast>
1881 I<Methods taken from package Bio::Tools::Run::StandAloneBlast>
1883 AUTOLOAD bl2seq blastall blastpgp exists_blast
1885 B<Methods for Object Bio::Tools::SeqPattern>
1887 I<Methods taken from package Bio::Root::Object>
1889 clear_err clone compress_file containment debug delete_file destroy
1890 display dont_warn err err_state err_string fatal_on_warn fh file
1891 file_date find_object has_name has_warning index make monitor name
1892 parent print_err read record_err set_display set_err_data set_log_err
1893 set_read set_stats show src_obj strict strictness terse testing
1894 to_string uncompress_file verbosity warn_on_fatal xref
1896 I<Methods taken from package Bio::Tools::SeqPattern>
1898 alphabet_ok expand revcom str type
1900 B<Methods for Object Bio::Tools::SeqStats>
1902 I<Methods taken from package Bio::Tools::SeqStats>
1904 count_codons count_monomers get_mol_wt
1906 B<Methods for Object Bio::Tools::SeqWords>
1908 I<Methods taken from package Bio::Root::Object>
1910 clear_err clone compress_file containment debug delete_file destroy
1911 display dont_warn err err_state err_string fatal_on_warn fh file
1912 file_date find_object has_name has_warning index make monitor name
1913 parent print_err read record_err set_display set_err_data set_log_err
1914 set_read set_stats show src_obj strict strictness terse testing
1915 to_string uncompress_file verbosity warn_on_fatal xref
1917 I<Methods taken from package Bio::Tools::SeqWords>
1919 count_words
1921 B<Methods for Object Bio::Tools::Sigcleave>
1923 I<Methods taken from package Bio::PrimarySeqI>
1925 GCG_checksum accession_number ary can_call_new carp confess croak
1926 desc display_id getseq id length moltype out_fasta primary_id revcom
1927 seq seq_len setseq str subseq translate translate_old trunc type
1929 I<Methods taken from package Bio::Seq
1931 accession add_SeqFeature add_date add_secondary_accession annotation
1932 division each_date each_secondary_accession feature_count keywords
1933 molecule primary_seq species sv
1935 I<Methods taken from package Bio::SeqI>
1937 all_SeqFeatures top_SeqFeatures write_GFF
1939 I<Methods taken from package Bio::Tools::Sigcleave>
1941 debug dont_warn fatal_on_warn monitor pretty_print signals strictness
1942 testing threshold verbosity warn_on_fatal
1944 B<Methods for Object Bio::Tools::Sim4::Results>
1946 I<Methods taken from package Bio::SeqAnalysisParserI>
1948 carp confess croak next_feature parse
1950 I<Methods taken from package Bio::Tools::AnalysisResult>
1952 analysis_date analysis_method analysis_method_version analysis_query
1953 analysis_subject close
1955 I<Methods taken from package Bio::Tools::Sim4::Results>
1957 basename dirname fileparse fileparse_set_fstype next_exonset
1958 parse_next_alignment
1960 B<Methods for Object Bio::Tools::pSW>
1962 I<Methods taken from package Bio::Tools::AlignFactory>
1964 kbyte report set_memory_and_report
1966 I<Methods taken from package Bio::Tools::pSW>
1968 align_and_show ext gap matrix pairwise_alignment
1970 B<Methods for Object Bio::UnivAln>
1972 I<Methods taken from package Bio::UnivAln>
1974 abort access acos aln alphabet_check asctime asin atan basename carp
1975 catch ceil clock col_descs col_ids complement confess consensus copy
1976 cosh croak ctermid ctime cuserid desc descffmt difftime dirname dup
1977 dup2 equal_nogaps equalize_lengths ffmt fileparse
1978 fileparse_set_fstype floor fmod fpathconf frexp gap_free_cols
1979 gap_free_inds gap_free_sites height id inplace invar_inds invar_sites
1980 isalnum isalpha iscntrl isdigit isgraph islower isprint ispunct
1981 isspace isupper isxdigit layout ldexp localeconv log10 lseek map_c
1982 map_r mblen mbstowcs mbtowc mkfifo mktime modf names new
1983 no_allgap_inds no_allgap_sites numbering out_bad out_fasta out_fasta2
1984 out_raw out_raw2 out_readseq parse parse_bad parse_fasta parse_raw
1985 parse_unknown pathconf pause remove_gaps revcom reverse row_descs
1986 row_ids samelength seqs setlocale setpgid setsid sigaction sigpending
1987 sigprocmask sigsuspend sinh special_free_inds special_free_sites
1988 strcoll strftime strtod strtol strtoul strxfrm sysconf tan tanh
1989 tcdrain tcflow tcflush tcgetpgrp tcsendbreak tcsetpgrp tmpnam ttyname
1990 type tzname tzset uname unknown_free_inds unknown_free_sites var_inds
1991 var_sites wcstombs wctomb width
1993 I<Methods taken from package Exporter>
1995 export export_fail export_ok_tags export_tags export_to_level import
1996 require_version
1998 B<Methods for Object Bio::Variation::Allele>
2000 I<Methods taken from package Bio::DBLinkContainerI>
2002 carp confess croak each_DBLink
2004 I<Methods taken from package Bio::PrimarySeqI>
2006 GCG_checksum accession_number ary can_call_new desc display_id getseq
2007 id length moltype out_fasta primary_id revcom seq seq_len setseq str
2008 subseq translate translate_old trunc type
2010 I<Methods taken from package Bio::Variation::Allele>
2012 add_DBLink is_reference repeat_count repeat_unit
2014 B<Methods for Object Bio::Variation::DNAMutation>
2016 I<Methods taken from package Bio::DBLinkContainerI>
2018 carp confess croak each_DBLink
2020 I<Methods taken from package Bio::RangeI>
2022 contains end equals intersection length new overlap_extent overlaps
2023 start strand union
2025 I<Methods taken from package Bio::SeqFeature::Generic>
2027 add_sub_SeqFeature add_tag_value annotation attach_seq entire_seq
2028 flush_sub_SeqFeature frame remove_tag score seq seqname
2029 slurp_gff_file
2031 I<Methods taken from package Bio::SeqFeatureI>
2033 all_tags each_tag_value gff2_string gff_string has_tag primary_tag
2034 source_tag sub_SeqFeature
2036 I<Methods taken from package Bio::Variation::DNAMutation>
2038 CpG RNAChange sysname
2040 I<Methods taken from package Bio::Variation::VariantI>
2042 SeqDiff add_Allele add_DBLink allele_mut allele_ori dnStreamSeq
2043 each_Allele id isMutation label mut_number numbering proof region
2044 region_value restriction_changes status upStreamSeq
2046 B<Methods for Object Bio::Variation::SeqDiff>
2048 I<Methods taken from package Bio::Variation::SeqDiff>
2050 aa_mut aa_ori add_Gene add_Variant alignment cds_end cds_start
2051 chromosome description dna_mut dna_ori each_Gene each_Variant
2052 gene_symbol id moltype numbering offset rna_id rna_mut rna_offset
2053 rna_ori seqobj sysname trivname
2055 =head2 V.2 Appendix: Tutorial demo scripts
2057 The following scripts demonstrate many of the features of bioperl. To
2058 run all the demos run:
2060 > perl -w bptutorial.pl 0
2062 To run a subset of the scripts do
2064 > perl -w bptutorial.pl
2066 and use the displayed help screen.
2068 =cut
2070 #!/usr/bin/perl
2072 # PROGRAM : bptutorial.pl
2073 # PURPOSE : Demonstrate various uses of the bioperl package
2074 # AUTHOR : Peter Schattner schattner@alum.mit.edu
2075 # CREATED : Dec 15 2000
2076 # REVISION : $Id$
2078 use strict;
2079 use Bio::SimpleAlign;
2080 use Bio::AlignIO;
2081 use Bio::SeqIO;
2082 use Bio::Seq;
2084 # subroutine references
2086 my ($access_remote_db, $index_local_db, $fetch_local_db,
2087 $sequence_manipulations, $seqstats_and_seqwords,
2088 $restriction_and_sigcleave, $other_seq_utilities,
2089 $run_standaloneblast, $blast_parser, $bplite_parsing, $hmmer_parsing,
2090 $run_clustalw_tcoffee, $run_psw_bl2seq, $simplealign_univaln,
2091 $gene_prediction_parsing, $sequence_annotation, $largeseqs,
2092 $liveseqs, $demo_variations, $demo_xml, $display_help, $bpinspect1 );
2094 # global variable file names. Edit these if you want to try
2095 #out a tutorial script on a different file
2096 my $dna_seq_file = 't/dna1.fa'; # used in $sequence_manipulations
2097 my $amino_seq_file = 't/cysprot1.fa'; # used in $other_seq_utilities
2098 #and $sequence_annotation
2099 my $blast_report_file = 't/blast.report'; # used in $blast_parser
2100 my $bp_parse_file1 = 't/blast.report'; # used in $bplite_parsing
2101 my $bp_parse_file2 = 't/psiblastreport.out'; # used in $bplite_parsing
2102 my $bp_parse_file3 = 't/bl2seq.out'; # used in $bplite_parsing
2103 my $unaligned_amino_file = 't/cysprot1a.fa'; # used in $run_clustalw_tcoffee
2104 my $aligned_amino_file = 't/testaln.msf'; # used in $simplealign_univaln
2106 # other global variables
2107 my (@runlist, $n );
2111 ##############################
2112 # display_help
2113 $display_help = sub {
2116 # Prints usage information for tutorial script.
2118 print STDERR <<"QQ_PARAMS_QQ";
2120 The following numeric arguments can be passed to run the corresponding demo-script.
2121 1 => access_remote_db ,
2122 2 => index_local_db ,
2123 3 => fetch_local_db , (# NOTE: needs to be run with demo 2)
2124 4 => sequence_manipulations ,
2125 5 => seqstats_and_seqwords ,
2126 6 => restriction_and_sigcleave ,
2127 7 => other_seq_utilities ,
2128 8 => run_standaloneblast ,
2129 9 => blast_parser ,
2130 10 => bplite_parsing ,
2131 11 => hmmer_parsing ,
2132 12 => run_clustalw_tcoffee ,
2133 13 => run_psw_bl2seq ,
2134 14 => simplealign_univaln ,
2135 15 => gene_prediction_parsing ,
2136 16 => sequence_annotation ,
2137 17 => largeseqs ,
2138 18 => liveseqs ,
2139 19 => demo_variations ,
2140 20 => demo_xml ,
2142 In addition the argument "100" followed by the name of a single
2143 bioperl object will display a list of all the public methods
2144 available from that object and from what object they are inherited.
2146 Using the parameter "0" will run all tests.
2147 Using any other argument (or no argument) will run this display.
2149 So typical command lines might be:
2150 To run all demo scripts:
2151 > perl -w bptutorial.pl 0
2152 or to just run the local indexing demos:
2153 > perl -w bptutorial.pl 2 3
2154 or to list all the methods available for object Bio::Tools::SeqStats -
2155 > perl -w bptutorial.pl 100 Bio::Tools::SeqStats
2157 QQ_PARAMS_QQ
2159 exit(0);
2163 ## Note: "main" routine is at very end of script
2165 #################################################
2166 # access_remote_db():
2169 $access_remote_db = sub {
2171 use Bio::DB::GenBank;
2173 print "Beginning remote database access example... \n";
2175 # III.2.1 Accessing remote databases (Bio::DB::GenBank, etc)
2177 my ($gb, $seq1, $seq2, $seq1_id, $seqobj, $seq2_id,$seqio);
2179 eval {
2180 $gb = new Bio::DB::GenBank();
2181 $seq1 = $gb->get_Seq_by_id('MUSIGHBA1');
2184 if ($@ || !$seq1) {
2185 warn "Warning: Couldn't connect to Genbank with Bio::DB::GenBank.pm!\nProbably no network access.\n Skipping Test\n";
2186 return 0;
2188 $seq1_id = $seq1->display_id();
2189 print "seq1 display id is $seq1_id \n";
2190 $seq2 = $gb->get_Seq_by_acc('AF303112');
2191 $seq2_id = $seq2->display_id();
2192 print "seq2 display id is $seq2_id \n";
2193 $seqio = $gb->get_Stream_by_batch([ qw(2981014 J00522 AF303112)]);
2194 $seqobj = $seqio->next_seq();
2195 print "Display id of first sequence in stream is ", $seqobj->display_id() ,"\n";
2196 return 1;
2200 #################################################
2201 # index_local_db ():
2204 $index_local_db = sub {
2206 use Bio::Index::Fasta; # using fasta file format
2207 my ( $Index_File_Name, $inx1, $inx2, $id, $dir, $key,
2208 $keyfound, $seq, $indexhash);
2209 print "\nBeginning indexing local_db example... \n";
2211 # This subroutine unlikely to run if unless OS = unix
2213 # III.2.2 Indexing and accessing local databases
2214 # (Bio::Index::*, bpindex.pl, bpfetch.pl)
2216 # first create the index
2217 $Index_File_Name = 'bptutorial.indx';
2219 eval { use Cwd; $dir = cwd; };
2220 # CWD not installed, revert to unix behavior, best we can do
2221 if( $@) { $dir = `pwd`;}
2224 $inx1 = Bio::Index::Fasta->new
2225 ('-FILENAME' => "$dir/$Index_File_Name",
2226 '-write_flag' => 1);
2227 $inx1->make_index("$dir/t/multifa.seq");
2228 $inx1->make_index("$dir/t/seqs.fas");
2229 print "Finished indexing local_db example... \n";
2231 return 1;
2235 #################################################
2236 # fetch_local_db ():
2239 $fetch_local_db = sub {
2241 use Bio::Index::Fasta; # using fasta file format
2242 my ( $Index_File_Name, $inx2, $id, $dir, $key,
2243 $keyfound, $seq, $indexhash);
2244 print "\nBeginning retrieving local_db example... \n";
2246 # then retrieve some files
2247 #$Index_File_Name = shift;
2248 $Index_File_Name = 'bptutorial.indx';
2250 eval { use Cwd; $dir = cwd; };
2251 # CWD not installed, revert to unix behavior, best we can do
2252 if( $@) { $dir = `pwd`;}
2254 $inx2 = Bio::Index::Abstract->new
2255 ('-FILENAME' => "$dir/$Index_File_Name");
2257 $indexhash = $inx2->db();
2258 $keyfound = "";
2259 foreach $key (sort keys %$indexhash) {
2260 if($key =~ /97/) {$keyfound = 'true' ; $id = $key; last};
2262 if ($keyfound) {
2263 $seq = $inx2->fetch($id);
2264 print "Sequence ", $seq->display_id(),
2265 " has length ", $seq->length()," \n";
2268 unlink "$dir/$Index_File_Name" ;
2270 return 1;
2275 #################################################
2276 # sequence_manipulations ():
2279 $sequence_manipulations = sub {
2281 my ($infile, $in, $out, $seqobj);
2282 $infile = $dna_seq_file;
2284 print "\nBeginning sequence_manipulations and SeqIO example... \n";
2287 # III.3.1 Transforming sequence files (SeqIO)
2289 $in = Bio::SeqIO->new('-file' => $infile ,
2290 '-format' => 'Fasta');
2291 $seqobj = $in->next_seq();
2293 # perl "tied filehandle" syntax is available to SeqIO,
2294 # allowing you to use the standard <> and print operations
2295 # to read and write sequence objects, eg:
2296 #$out = Bio::SeqIO->newFh('-format' => 'EMBL');
2298 $out = Bio::SeqIO->newFh('-format' => 'fasta');
2300 print "First sequence in fasta format... \n";
2301 print $out $seqobj;
2303 # III.4 Manipulating individual sequences
2305 # The following methods return strings
2308 print "Seq object display id is ",
2309 $seqobj->display_id(), "\n"; # the human read-able id of the sequence
2310 print "Sequence is ",
2311 $seqobj->seq()," \n"; # string of sequence
2312 print "Sequence from 5 to 10 is ",
2313 $seqobj->subseq(5,10)," \n"; # part of the sequence as a string
2314 print "Acc num is ",
2315 $seqobj->accession_number(), " \n"; # when there, the accession number
2316 print "Moltype is ",
2317 $seqobj->moltype(), " \n"; # one of 'dna','rna','protein'
2318 print "Primary id is ", $seqobj->primary_seq->primary_id(),
2319 " \n"; # a unique id for this sequence irregardless
2320 #print "Primary id is ", $seqobj->primary_id(), " \n";
2321 # a unique id for this sequence irregardless
2322 # of its display_id or accession number
2324 # The following methods return an array of Bio::SeqFeature objects
2325 $seqobj->top_SeqFeatures; # The 'top level' sequence features
2326 $seqobj->all_SeqFeatures; # All sequence features, including sub
2327 # seq features
2329 # The following methods returns new sequence objects,
2330 # but do not transfer features across
2331 # truncation from 5 to 10 as new object
2332 print "Truncated Seq object sequence is ",
2333 $seqobj->trunc(5,10)->seq(), " \n";
2334 # reverse complements sequence
2335 print "Reverse complemented sequence 5 to 10 is ",
2336 $seqobj->revcom->subseq(5,10), " \n";
2337 # translation of the sequence
2338 print "Translated sequence 6 to 15 is ",
2339 $seqobj->translate->subseq(6,15), " \n";
2342 my $c = shift;
2343 $c ||= 'ctgagaaaataa';
2345 print "\nBeginning 3-frame and alternate codon translation example... \n";
2347 my $seq = new Bio::PrimarySeq('-SEQ' => $c,
2348 '-ID' => 'no.One');
2349 print "$c translated using method defaults : ",
2350 $seq->translate->seq, "\n";
2352 # Bio::Seq uses same sequence methods as PrimarySeq
2353 my $seq2 = new Bio::Seq('-SEQ' => $c, '-ID' => 'no.Two');
2354 print "$c translated as a coding region (CDS): ",
2355 $seq2->translate(undef, undef, undef, undef, 1)->seq, "\n";
2357 print "\nTranslating in all six frames:\n";
2358 my @frames = (0, 1, 2);
2359 foreach my $frame (@frames) {
2360 print " frame: ", $frame, " forward: ",
2361 $seq->translate(undef, undef, $frame)->seq, "\n";
2362 print " frame: ", $frame, " reverse-complement: ",
2363 $seq->revcom->translate(undef, undef, $frame)->seq, "\n";
2366 print "Translating with all codon tables using method defaults:\n";
2367 my @codontables = qw( 1 2 3 4 5 6 9 10 11 12 13 14 15 16 21 );
2368 foreach my $ct (@codontables) {
2369 print $ct, " : ",
2370 $seq->translate(undef, undef, undef, $ct)->seq, "\n";
2373 return 1;
2376 ################################################# ;
2377 # seqstats_and_seqwords ():
2380 $seqstats_and_seqwords = sub {
2382 use Bio::Tools::SeqStats;
2383 use Bio::Tools::SeqWords;
2385 print "\nBeginning seqstats_and_seqwords example... \n";
2388 my ($seqobj, $weight, $monomer_ref, $codon_ref,
2389 $seq_stats, $words, $hash);
2390 $seqobj = Bio::Seq->new
2391 ('-seq'=>'ACTGTGGCGTCAACTGACTGTGGCGTCAACTGACTGTGGGCGTCAACTGACTGTGGCGTCAACTG',
2392 '-moltype'=>'dna',
2393 '-id'=>'test');
2396 # III.4.1 Obtaining basic sequence statistics- MW,
2397 # residue &codon frequencies(SeqStats, SeqWord)
2400 $seq_stats = Bio::Tools::SeqStats->new($seqobj);
2401 $weight = $seq_stats->get_mol_wt();
2402 $monomer_ref = $seq_stats->count_monomers();
2403 $codon_ref = $seq_stats->count_codons(); # for nucleic acid sequence
2404 print "Sequence \'test\' is ", $seqobj->seq(), " \n";
2405 print "Sequence ", $seqobj->id()," molecular weight is $$weight[0] \n";
2406 # Note, $weight is an array reference
2407 print "Number of A\'s in sequence is $$monomer_ref{'A'} \n";
2408 print "Number of ACT codon\'s in sequence is $$codon_ref{'ACT'} \n";
2410 $words = Bio::Tools::SeqWords->new($seqobj);
2411 $hash = $words->count_words(6);
2412 print "Number of 6-letter \'words\' of type ACTGTG in",
2413 " sequence is $$hash{'ACTGTG'} \n";
2415 return 1;
2419 #################################################
2420 # restriction_and_sigcleave ():
2423 $restriction_and_sigcleave = sub {
2425 use Bio::Tools::RestrictionEnzyme;
2426 use Bio::Tools::Sigcleave;
2428 my ($re, $re1, $re2, @sixcutters, @fragments1,
2429 @fragments2, $seqobj, $dna);
2430 print "\nBeginning restriction enzyme example... \n";
2432 # III.4.4 Identifying restriction enzyme sites (RestrictionEnzyme)
2434 $dna = 'CCTCCGGGGACTGCCGTGCCGGGCGGGAATTCGCCATGGCGACC'.
2435 'CTGGAAAAGCTGATATCGAAGGCCTTCGA';
2437 # Build sequence and restriction enzyme objects.
2438 $seqobj = new Bio::Seq('-ID' => 'test_seq',
2439 '-SEQ' =>$dna);
2441 #$re = new Bio::Tools::RestrictionEnzyme();
2442 $re = new Bio::Tools::RestrictionEnzyme('-name'=>'EcoRI');
2443 @sixcutters = $re->available_list(6);
2445 print "The following 6-cutters are available",
2446 @sixcutters," \n";
2448 $re1 = new Bio::Tools::RestrictionEnzyme('-name'=>'EcoRI');
2449 @fragments1 = $re1->cut_seq($seqobj);
2450 #$seqobj is the Seq object for the dna to be cut
2452 print "When cutting ", $seqobj->display_id(),
2453 " with ", $re1->seq->id ," \n";
2454 print "the initial fragment is ", $fragments1[0]," \n";
2456 $re2 = new Bio::Tools::RestrictionEnzyme
2457 ('-NAME' =>'EcoRV--GAT^ATC',
2458 '-MAKE' =>'custom');
2459 @fragments2 = $re2->cut_seq($seqobj);
2461 print "When cutting ", $seqobj->display_id(),
2462 " with ", $re2->seq->id ,", \n";
2463 print "the second fragment is ", $fragments2[1], " \n";
2465 # III.4.7 Identifying amino acid cleavage sites (Sigcleave)
2467 print "\nBeginning signal cleaving site location example... \n";
2469 my ( $sigcleave_object, %raw_results , $location,
2470 $formatted_output, $protein, $in, $seqobj2);
2472 $in = Bio::SeqIO->new('-file' => 't/cysprot1.fa' ,
2473 '-format' => 'Fasta');
2474 $seqobj2 = $in->next_seq();
2476 $protein = $seqobj2->seq;
2477 $formatted_output = "";
2479 # Build object
2480 # Note that Sigcleave is passed a raw sequence
2481 # (or file containing a sequence) rather than a
2482 # sequence object when it is created.
2483 $sigcleave_object = new Bio::Tools::Sigcleave
2484 ('-id' =>'test_sigcleave_seq',
2485 '-type' =>'amino',
2486 '-threshold' => 3.0,
2487 '-seq' =>$protein);
2489 %raw_results = $sigcleave_object->signals;
2490 $formatted_output = $sigcleave_object->pretty_print;
2492 print " SigCleave \'raw results\': \n";
2493 foreach $location (sort keys %raw_results) {
2494 print " Cleave site found at location $location ",
2495 "with value of $raw_results{$location} \n";
2497 print "\nSigCleave formatted output: \n $formatted_output \n";
2498 return 1;
2502 #################################################
2503 # run_standaloneblast ():
2506 $run_standaloneblast = sub {
2507 use Bio::Tools::Run::StandAloneBlast;
2508 use Bio::Tools::BPlite;
2510 print "\nBeginning run_standaloneblast example... \n";
2512 my (@params, $factory, $input, $blast_report, $blast_present,
2513 $database, $sbjct, $str, $seq1);
2515 # III.5.1 Running BLAST locally (StandAloneBlast)
2517 $database = $_[0] || 'ecoli.nt'; # user can select local nt database
2519 $blast_present = Bio::Tools::Run::StandAloneBlast->exists_blast();
2520 unless ($blast_present) {
2521 warn "blast program not found. Skipping StandAloneBlast example\n";
2522 return 0;
2524 #@params = ('program' => 'blastn', 'database' => 'ecoli.nt');
2525 @params = ('program' => 'blastn', 'database' => $database);
2526 $factory = Bio::Tools::Run::StandAloneBlast->new(@params);
2528 $str = Bio::SeqIO->new('-file'=>'t/dna2.fa' ,
2529 '-format' => 'Fasta', );
2530 $seq1 = $str->next_seq();
2532 $blast_report = $factory->blastall($seq1);
2533 $sbjct = $blast_report->nextSbjct;
2535 print " Hit name is ", $sbjct->name, " \n";
2537 return 1;
2540 #################################################
2541 # blast_parser ():
2544 $blast_parser = sub {
2546 print "\nBeginning blast.pm parser example... \n";
2547 my ($blast, $num_hits, @hits,$frac1, @inds,
2548 $report_file, $identity_string);
2550 $report_file = $blast_report_file;
2552 # III.5.3.1 Parsing BLAST reports with Blast.pm
2554 use Bio::Tools::Blast;
2555 $blast = Bio::Tools::Blast->new('-file' => $report_file,
2556 '-signif' => 1e-5,
2557 '-parse' => 1,
2558 '-stats' => 1,
2559 '-check_all_hits' => 1, );
2560 $blast->display();
2561 $num_hits = $blast->num_hits;
2562 print "Number of hits is $num_hits \n";
2564 @hits = $blast->hits;
2565 $frac1 = $hits[1]->frac_identical;
2566 print "Fraction identical for hit 1 is $frac1 \n";
2568 @inds = $hits[1]->hsp->seq_inds('query', 'iden', 1);
2569 $identity_string = join " ", @inds;
2570 print "Sequence identities for hsp of hit 1 are ",
2571 $identity_string, " \n";
2573 return 1;
2579 #################################################
2580 # bplite_parsing ():
2583 $bplite_parsing = sub {
2585 use Bio::Tools::BPlite;
2586 my ($file1, $file2, $file3, $report,$report2 , $report3 ,
2587 $last_iteration, $sbjct, $total_iterations, $hsp,
2588 $matches, $loop);
2590 print "\nBeginning bplite, bppsilite, bpbl2seq parsing example... \n";
2592 ($file1, $file2, $file3) = ($bp_parse_file1,
2593 $bp_parse_file2 ,$bp_parse_file3 );
2594 #open FH, "t/blast.report";
2595 $report = Bio::Tools::BPlite->new('-file'=>$file1);
2596 $sbjct = $report->nextSbjct;
2598 print " Hit name is ", $sbjct->name, " \n";
2599 while ($hsp = $sbjct->nextHSP) {
2600 $hsp->score;
2603 use Bio::Tools::BPpsilite;
2604 $report2 = Bio::Tools::BPpsilite->new('-file'=>$file2);
2605 $total_iterations = $report2->number_of_iterations;
2606 $last_iteration = $report2->round($total_iterations);
2608 # print first 10 psiblast hits
2609 for ($loop = 1; $loop <= 10; $loop++) {
2610 $sbjct = $last_iteration->nextSbjct;
2611 $sbjct && print " PSIBLAST Hit is ", $sbjct->name, " \n";
2615 use Bio::Tools::BPbl2seq;
2616 $report3 = Bio::Tools::BPbl2seq->new('-file' => $file3,
2617 '-queryname' => "ALEU_HORVU");
2618 $matches = $report3->match;
2619 print " Number of Blast2seq matches is $matches \n";
2622 return 1;
2626 #################################################
2627 # hmmer_parsing ():
2630 $hmmer_parsing = sub {
2632 print "\nBeginning hmmer_parsing example... \n";
2634 # III.5.4 Parsing HMM reports (HMMER::Results)
2636 use Bio::Tools::HMMER::Domain;
2637 use Bio::Tools::HMMER::Set;
2638 use Bio::Tools::HMMER::Results;
2640 my ($res, @seq, @domain);
2641 $res = new Bio::Tools::HMMER::Results('-file' => 't/hmmsearch.out',
2642 '-type' => 'hmmsearch');
2644 @seq = $res->each_Set;
2645 print "Initial sequence bit score is ",$seq[0]->bits,"\n";
2646 @domain = $seq[0]->each_Domain;
2647 print " For initial domain, start = ",$domain[0]->start,
2648 " end = ",$domain[0]->end,"and score =",$domain[0]->bits,"\n";
2650 #foreach $seq ( $res->each_Set ) {
2651 # print "Sequence bit score is",$seq->bits,"\n";
2652 # foreach $domain ( $seq->each_Domain ) {
2653 # print " Domain start ",$domain->start," end ",
2654 # $domain->end," score ",$domain->bits,"\n";
2657 return 1;
2660 #################################################
2661 # run_clustalw_tcoffee ():
2664 $run_clustalw_tcoffee = sub {
2666 use Bio::Tools::Run::Alignment::TCoffee;
2667 use Bio::Tools::Run::Alignment::Clustalw;
2668 my (@params, $factory, $ktuple, $aln, $seq_array_ref, $strout);
2669 my (@seq_array, $seq, $str, $infile);
2671 $infile = $unaligned_amino_file;
2673 # III.6.3 Aligning multiple sequences (Clustalw.pm, TCoffee.pm)
2674 print "\nBeginning run_clustalw example... \n";
2676 # put unaligned sequences in a Bio::Seq array
2677 $str = Bio::SeqIO->new('-file'=> $infile,
2678 '-format' => 'Fasta');
2679 @seq_array =();
2680 while ($seq = $str->next_seq() ) { push (@seq_array, $seq) ;}
2681 $seq_array_ref = \@seq_array;
2682 # where @seq_array is an array of Bio::Seq objects
2684 my $clustal_present =
2685 Bio::Tools::Run::Alignment::Clustalw->exists_clustal();
2686 if ($clustal_present) {
2687 @params = ('ktuple' => 2, 'matrix' => 'BLOSUM', 'quiet' => 1);
2688 $factory = Bio::Tools::Run::Alignment::Clustalw->new(@params);
2689 $ktuple = 3;
2690 $factory->ktuple($ktuple); # change the parameter before executing
2691 $aln = $factory->align($seq_array_ref);
2692 $strout = Bio::AlignIO->newFh('-format' => 'msf');
2693 print "Output of clustalw alignment... \n";
2694 print $strout $aln;
2695 } else {
2696 warn "Clustalw program not found. Skipping run_clustalw example....\n";
2698 print "\nBeginning run_tcoffee example... \n";
2700 my $tcoffee_present =
2701 Bio::Tools::Run::Alignment::TCoffee->exists_tcoffee();
2702 if ($tcoffee_present) {
2703 @params = ('ktuple' => 2, 'matrix' => 'BLOSUM', 'quiet' => 1);
2704 $factory = Bio::Tools::Run::Alignment::TCoffee->new(@params);
2705 $ktuple = 3;
2706 $factory->ktuple($ktuple); # change the parameter before executing
2707 $aln = $factory->align($seq_array_ref);
2708 $strout = Bio::AlignIO->newFh('-format' => 'msf');
2709 print "Output of TCoffee alignment... \n";
2710 print $strout $aln;
2711 } else {
2712 warn "TCoffee program not found. Skipping run_TCoffee example....\n";
2714 return 1;
2717 #################################################
2718 # simplealign_univaln ():
2721 $simplealign_univaln = sub {
2723 my ($in, $out1,$out2, $aln, $threshold_percent, $infile);
2724 my ( $str, $resSlice1, $resSlice2, $resSlice3, $tmpfile);
2726 print "\nBeginning simplealign and alignio example... \n";
2728 # III.3.2 Transforming alignment files (AlignIO)
2729 $infile = $aligned_amino_file;
2730 #$tmpfile = "test.tmp";
2731 $in = Bio::AlignIO->new('-file' => $infile ,
2732 '-format' => 'msf');
2733 $out1 = Bio::AlignIO->newFh('-format' => 'pfam');
2734 $out2 = Bio::AlignIO->newFh('-format' => 'fasta',
2735 '-file' => ">test.tmp");
2737 # while ( my $aln = $in->next_aln() ) { $out->write_aln($aln); }
2738 $aln = $in->next_aln() ;
2739 print "Alignment to display in pfam format... \n";
2740 print $out1 $aln;
2741 print "Alignment to file in fasta format... \n";
2742 print $out2 $aln;
2744 # III.6.4 Manipulating / displaying alignments (SimpleAlign, UnivAln)
2746 $threshold_percent = 60;
2747 $str = $aln->consensus_string($threshold_percent) ;
2748 print "Consensus string with threshold = $threshold_percent is $str\n";
2750 use Bio::UnivAln;
2752 print "\nBeginning univaln example... \n";
2753 #$aln = Bio::UnivAln->new('-file'=>'test.tmp',
2754 $aln = Bio::UnivAln->new('-file'=>'t/alnfile.fasta',
2755 '-desc'=>'Sample alignment',
2756 '-type'=>'amino',
2757 '-ffmt'=>'fasta'
2760 print "\nCurrent alignment:\n",$aln->layout;
2762 $resSlice1 = $aln->remove_gaps(); # original sequences without gaps
2763 print "Alignment without gaps is \n $resSlice1\n";
2764 $resSlice3 = $aln->consensus(0.2, [1,3]); # 60% majority, columns 1+3 only
2765 print "Consensus with 20% threshold for columns".
2766 " 1 and 3 is \n $resSlice3\n";
2767 #unlink $tmpfile;
2768 return 1;
2771 #################################################
2772 # run_psw_bl2seq ():
2775 $run_psw_bl2seq = sub {
2777 print "\nBeginning run_psw_bl2seq example... \n";
2778 eval { require Bio::Tools::pSW; };
2779 if( $@ ) {
2780 print STDERR ("\nThe C-compiled engine for Smith Waterman alignments (Bio::Ext::Align) has not been installed.\n Please read the install the bioperl-ext package\n\n");
2781 return 0;
2784 #use Bio::Tools::pSW;
2785 use Bio::Tools::Run::StandAloneBlast;
2787 # III.6.1 Aligning 2 sequences with Smith-Waterman (pSW)
2789 my ($factory, $aln, $out1, $out2, $str, $seq1, $seq2, @params);
2791 # Get protein sequences from file
2792 $str = Bio::SeqIO->new('-file'=>'t/amino.fa' ,
2793 '-format' => 'Fasta', );
2794 $seq1 = $str->next_seq();
2795 $seq2 = $str->next_seq();
2797 $factory = new Bio::Tools::pSW( '-matrix' => 'examples/blosum62.bla',
2798 '-gap' => 12,
2799 '-ext' => 2, );
2801 $factory->align_and_show($seq1,$seq2,'STDOUT');
2802 $aln = $factory->pairwise_alignment($seq1,$seq2);
2804 $out1 = Bio::AlignIO->newFh('-format' => 'fasta');
2805 print "The Smith-Waterman alignment in fasta format... \n";
2806 print $out1 $aln;
2809 # III.6.2 Aligning 2 sequences with Blast using bl2seq and AlignIO
2811 # Bl2seq testing
2812 # first create factory for bl2seq
2813 @params = ('program' => 'blastp', 'outfile' => 'bl2seq.out');
2814 $factory = Bio::Tools::Run::StandAloneBlast->new(@params);
2815 $factory->bl2seq($seq1, $seq2);
2817 # Use AlignIO.pm to create a SimpleAlign object from the bl2seq report
2818 $str = Bio::AlignIO->new('-file'=> 'bl2seq.out',
2819 '-format' => 'bl2seq');
2820 $aln = $str->next_aln();
2822 $out2 = Bio::AlignIO->newFh('-format' => 'fasta');
2823 print "The Blast 2 sequence alignment in fasta format... \n";
2824 print $out2 $aln;
2826 return 1;
2830 #################################################
2831 # gene_prediction_parsing ():
2834 $gene_prediction_parsing = sub {
2836 use Bio::Tools::Genscan;
2838 print "\nBeginning genscan_result_parsing example... \n";
2840 my ($genscan, $gene, @exon_arr, $first_exon);
2842 # III.7 Searching for genes and other structures
2843 # on genomic DNA (Genscan, ESTScan, MZEF)
2845 use Bio::Tools::Genscan;
2846 $genscan = Bio::Tools::Genscan->new('-file' => 't/genomic-seq.genscan');
2847 # $gene is an instance of Bio::Tools::Prediction::Gene
2848 # $gene->exons() returns an array of Bio::Tools::Prediction::Exon objects
2849 while($gene = $genscan->next_prediction()) {
2850 print "Protein corresponding to current gene prediction is ",
2851 $gene->predicted_protein->seq()," \n";
2852 @exon_arr = $gene->exons();
2853 $first_exon = $exon_arr[0];
2854 print "Start location of first exon of current predicted gene is ",
2855 $first_exon->start," \n";
2857 $genscan->close();
2859 use Bio::Tools::Sim4::Results;
2860 print "\nBeginning Sim4_result_parsing example... \n";
2861 my ($sim4,$exonset, @exons, $exon, $homol);
2863 $sim4 = new Bio::Tools::Sim4::Results(-file=> 't/sim4.rev', -estisfirst=>0);
2864 $exonset = $sim4->next_exonset;
2865 @exons = $exonset->sub_SeqFeature();
2866 $exon = $exons[1];
2867 print "Start location of first exon of first exon set is ",$exon->start(), " \n";
2868 print "Name of est sequence is ",$exon->est_hit()->seqname(), " \n";
2869 print "The length of the est hit is ",$exon->est_hit()->seqlength(), " \n";
2870 $sim4->close();
2872 return 1;
2876 #################################################
2877 # sequence_annotation ():
2880 $sequence_annotation = sub {
2882 print "\nBeginning sequence_annotation example... \n";
2883 my ($feat, $feature1, $feature2,$seqobj, @topfeatures, @allfeatures,
2884 $ann, $in, $infile, $other, $answer);
2885 # III.8.1 Representing sequence annotations (Annotation, SeqFeature)
2887 $infile = $amino_seq_file;
2889 $in = Bio::SeqIO->new('-file' => $infile ,
2890 '-format' => 'Fasta');
2891 $seqobj = $in->next_seq();
2893 $feature1 = new Bio::SeqFeature::Generic
2894 ('-start' => 10,
2895 '-end' => 40,
2896 '-strand' => 1,
2897 '-primary' => 'exon',
2898 '-source' => 'internal', );
2900 $feature2 = new Bio::SeqFeature::Generic
2901 ( '-start' => 20,
2902 '-end' => 30,
2903 '-strand' => 1,
2904 '-primary' => 'exon',
2905 '-source' => 'internal', );
2907 $seqobj->add_SeqFeature($feature1); # Add the SeqFeature to the parent
2908 $seqobj->add_SeqFeature($feature2); # Add the SeqFeature to the parent
2909 $seqobj->annotation(new Bio::Annotation('-description' => 'This is the annotation'));
2911 # Once the features and annotations have been associated
2912 # with the Seq, they can be with retrieved, eg:
2913 @topfeatures = $seqobj->top_SeqFeatures(); # just top level, or
2914 @allfeatures = $seqobj->all_SeqFeatures(); # descend into sub features
2915 $ann = $seqobj->annotation(); # annotation object
2916 $feat = $allfeatures[0];
2917 $other = $allfeatures[1];
2919 print " Total number of sequence features is: ", scalar @allfeatures, " \n";
2920 print " The sequence annotation is: ", $ann->description," \n";
2922 # The individual components of a SeqFeature can also be set
2923 # or retrieved with methods including:
2924 # attributes which return numbers
2925 $feat->start; # start position
2926 $feat->end; # end position
2927 $feat->strand; # 1 means forward, -1 reverse, 0 not relevant
2928 # attributes which return strings
2929 $feat->primary_tag; # the main 'name' of the sequence feature, eg, 'exon'
2930 $feat->source_tag; # where the feature comes from, eg'BLAST'
2931 # attributes which return Bio::PrimarySeq objects
2932 $feat->seq; # the sequence between start,end
2933 $feat->entire_seq; # the entire sequence
2935 print " The primary tag of the feature is: ", $feat->primary_tag, " \n";
2936 print " The first feature begins at location ", $feat->start, " \n";
2937 print " ends at location ", $feat->end, " \n";
2938 print " and is located on strand ", $feat->strand, " \n";
2940 # other useful methods include
2941 $answer = $feat->overlaps($other); # do SeqFeature $feat and SeqFeature $other overlap?
2942 print " Feature 1 does ", ($answer)? "":"not ", " overlap Feature2. \n";
2944 $answer = $feat->contains($other); # is $other completely within $feat?
2945 print " Feature 1 does ", ($answer)? "":"not ", " contain Feature2. \n";
2947 $answer = $feat->equals($other); # do $feat and $other completely $agree?
2948 print " Feature 1 does ", ($answer)? "":"not ", " equal Feature2. \n";
2950 $feat->sub_SeqFeature(); # create/access an array of subsequence features
2952 return 1;
2956 #################################################
2957 # largeseqs ():
2960 $largeseqs = sub {
2962 print "\nBeginning largeseqs example... \n";
2964 # III.8.2 Representing and large sequences
2965 my ( $tmpfile, $seqio, $pseq, $plength, $last_4);
2967 $tmpfile = 't/largefastatest.out';
2968 $seqio = new Bio::SeqIO('-format'=>'largefasta',
2969 '-file' =>'t/genomic-seq.fasta');
2970 $pseq = $seqio->next_seq();
2971 $plength = $pseq->length();
2973 print " The length of the sequence ",
2974 $pseq->display_id()," is $plength \n";
2975 $last_4 = $pseq->subseq($plength-3,$plength);
2976 print " The final four residues are $last_4 \n";
2978 #END { unlink $tmpfile; }
2979 unlink $tmpfile;
2981 return 1;
2985 #################################################
2986 # liveseqs ():
2989 $liveseqs = sub {
2991 use Bio::LiveSeq::IO::BioPerl;
2992 my ($loader, $gene, $id, $maxstart);
2994 print "\nBeginning liveseqs example... \n";
2996 # III.8.2 Representing changing sequences (LiveSeq)
2998 $loader=Bio::LiveSeq::IO::BioPerl->load('-db'=>"EMBL",
2999 '-file'=>"t/factor7.embl");
3000 $gene=$loader->gene2liveseq('-gene_name' => "factor7");
3001 $id = $gene->get_DNA->display_id ;
3002 print " The id of the gene is $id , \n";
3003 $maxstart = $gene->maxtranscript->start;
3004 print " The start position of the max transcript is $maxstart , \n";
3006 return 1;
3010 #################################################
3011 # demo_variations ():
3014 $demo_variations = sub {
3016 print "\nBeginning demo_variations example... \n";
3018 # III.8.3 Representing related sequences
3019 # - mutations, polymorphisms etc (Allele, SeqDiff,)
3021 use Bio::Variation::SeqDiff;
3022 use Bio::Variation::Allele;
3023 use Bio::Variation::DNAMutation;
3024 my ($a1, $a2, $dnamut, $seqDiff, @current_variants);
3026 $a1 = Bio::Variation::Allele->new;
3027 $a1->seq('aaaa');
3028 $a2 = Bio::Variation::Allele->new;
3029 $a2->seq('ttat');
3031 $dnamut = Bio::Variation::DNAMutation->new
3032 ('-start' => 1000,
3033 '-length' => 4,
3034 '-upStreamSeq' => 'actggg',
3035 '-proof' => 'experimental',
3036 '-isMutation' => 1, );
3037 $dnamut->allele_ori($a1);
3038 $dnamut->add_Allele($a2);
3039 $seqDiff = Bio::Variation::SeqDiff->new
3040 ('-id' => 'M20132',
3041 '-moltype' => 'rna',
3042 #$seqDiff = Bio::Variation::SeqDiff->new
3043 #( '-id' => 'M20132', '-moltype' => 'RNA',
3044 '-gene_symbol' => 'AR',
3045 '-chromosome' => 'X',
3046 '-numbering' => 'coding');
3047 # add it to a SeqDiff container object
3048 $seqDiff->add_Variant($dnamut);
3049 @current_variants = $seqDiff->each_Variant();
3050 print " The original sequence at current variation is ",
3051 $current_variants[0]->allele_ori->seq ," \n";
3052 print " The mutated sequence is ",
3053 $current_variants[0]->allele_mut->seq ," \n";
3054 print " The upstream sequence is ",
3055 $current_variants[0]->upStreamSeq ," \n";
3057 return 1;
3061 #################################################
3062 # demo_xml ():
3065 $demo_xml = sub {
3067 print "\nBeginning demo_xml example... \n";
3068 my ($str, $seqobj, $id, @feats, $first_primary_tag);
3070 # III.8.4 Sequence XML representations
3071 # - generation and parsing (SeqIO::game)
3073 $str = Bio::SeqIO->new('-file'=> 't/test.game',
3074 '-format' => 'game');
3075 $seqobj = $str->next_seq();
3076 # $seq = $str->next_primary_seq();
3077 # $id = $seq->id;
3079 print "Seq object display id is ", $seqobj->display_id(),
3080 "\n"; # the human read-able id of the sequence
3081 print "The sequence description is: \n";
3082 print " ", $seqobj->desc(), " \n";
3083 print "Acc num is ", $seqobj->accession_number(),
3084 " \n"; # when there, the accession number
3085 print "Moltype is ", $seqobj->moltype(),
3086 " \n"; # one of 'dna','rna','protein'
3088 @feats = $seqobj->all_SeqFeatures();
3089 $first_primary_tag = $feats[0]->primary_tag;
3090 print " Total number of sequence features is: ", scalar @feats, " \n";
3091 print " The primary tag of the first feature is: $first_primary_tag \n";
3092 print " The first feature begins at location ", $feats[0]->start, " \n";
3093 print " and ends at location ", $feats[0]->end, " \n";
3095 return 1;
3099 ################################################
3100 # other_seq_utilities ():
3103 $other_seq_utilities = sub {
3105 use Bio::Tools::OddCodes;
3106 use Bio::Tools::SeqPattern;
3107 use Bio::Tools::CodonTable;
3109 my ( $oddcode_obj, $amino_obj, $in, $infile, $chargeseq, $chemseq);
3110 my (@ii, $i, @res, $myCodonTable);
3111 my ( $pattern,$pattern_obj, $pattern_obj2, $pattern_obj3);
3113 print "\nBeginning sequence utilities example... \n";
3115 # III.4.6 Identifying amino acid characteristics
3116 # - eg charge, hydrophobicity (OddCodes)
3117 $infile = $amino_seq_file;
3118 $in = Bio::SeqIO->new('-file' => $infile ,
3119 '-format' => 'Fasta');
3120 $amino_obj = $in->next_seq->trunc(1,20);
3121 $oddcode_obj = Bio::Tools::OddCodes->new($amino_obj);
3122 print "Initial sequence is: \n ", $amino_obj->seq(), " \n";
3123 $chargeseq = $oddcode_obj->charge;
3124 # Note OddCodes returns a reference to the desired
3125 # sequence, not the sequence itself.
3126 print "Sequence in \'charge\' alphabet is : \n ",$$chargeseq, " \n";
3127 $chemseq = $oddcode_obj->chemical;
3128 print "Sequence in \'chemical\' alphabet is : \n ",$$chemseq, " \n";
3130 # III.4.2.1 non-standard nucleotide translation tables (CodonTable)
3132 # create a table object by giving an ID
3133 $myCodonTable = Bio::Tools::CodonTable -> new ( '-id' => 16);
3134 print "\nThe name of the current codon table is ",
3135 $myCodonTable->name() , " \n";
3137 # translate some codons
3138 @ii = qw(act acc att ggg ttt aaa ytr yyy);
3139 @res =();
3140 for $i (0..$#ii) { $res[$i] = $myCodonTable->translate($ii[$i]); }
3141 print "\nWith the current codon table, the sequence \n ",
3142 @ii, "\n is translated as \n" , @res, " \n";
3144 # III.4.2.2 utilities for ambiguous residues (SeqPattern, IUPAC)
3145 # III.4.2.3 regex-like nucleotide pattern manipulation
3147 print "\nBeginning demo of SeqPattern object \n ";
3149 $pattern = '(CCCCT)N{1,200}(agyyg)N{1,80}(agggg)';
3150 print "\nInput regular expression = $pattern \n ";
3151 $pattern_obj = new Bio::Tools::SeqPattern('-SEQ' =>$pattern,
3152 '-TYPE' =>'dna');
3153 $pattern_obj2 = $pattern_obj->revcom();
3154 print "\nReverse-complemented expression = ",$pattern_obj2->str, "\n ";
3155 $pattern_obj3 = $pattern_obj->revcom(1); ## returns expanded rev complement pattern.
3156 print "\nExpanded reverse-complemented expression = ",$pattern_obj3->str, "\n ";
3157 #$pattern_obj->revcom(1); ## returns expanded rev complement pattern.
3159 return 1;
3164 $bpinspect1 = sub {
3165 my ($object, $object_class,$package );
3166 my @package_list = ();
3167 $object = shift;
3168 ($object_class = $object) =~ s/::/\//g; # replace "::" with "/"
3169 $object_class = $object_class . ".pm";
3170 require $object_class;
3172 my $method_hash = $object->methods('all');
3174 print " \n ***Methods for Object $object ******** \n";
3175 # get all the *unique* package names providing methods
3176 my %seen=();
3177 foreach $package (values %$method_hash) {
3178 push(@package_list, $package) unless $seen{$package}++;
3180 for $package (sort @package_list) {
3181 print " \n\n Methods taken from package $package \n";
3182 my $out_count = 1;
3183 for my $method (sort keys %$method_hash) {
3184 if ($$method_hash{$method} eq $package ) {
3185 print " $method ";
3186 $out_count++;
3187 if ($out_count == 7) {
3188 print " \n"; # keeps output line from getting too long
3189 $out_count = 1;
3194 print "\n";
3196 # The following subroutine is based closely on Mark Summerfield's and Piers Cawley's
3197 # clever Class_Methods_Introspection program
3199 package UNIVERSAL;
3200 sub methods {
3201 my ($class, $types) = @_;
3202 $class = ref $class || $class;
3203 $types ||= '';
3204 my %classes_seen;
3205 my %methods;
3206 my @class = ($class);
3208 no strict 'refs';
3209 while ($class = shift @class) {
3210 next if $classes_seen{$class}++;
3211 unshift @class, @{"${class}::ISA"} if $types eq 'all';
3212 # Based on methods_via() in perl5db.pl
3213 for my $method (grep {not /^[(_]/ and
3214 defined &{${"${class}::"}{$_}}}
3215 keys %{"${class}::"}) {
3216 $methods{$method} = $class; #ps
3217 # $methods{$method} = wantarray ? undef : $class->can($method);
3221 wantarray ? keys %methods : \%methods;
3231 return 1;
3240 ## "main" program follows
3241 #no strict 'refs';
3243 @runlist = @ARGV;
3244 if (scalar(@runlist)==0) {&$display_help;}; # display help if no option
3245 if ($runlist[0] == 0) {@runlist = (1..20); }; # argument = 0 means run all tests
3246 foreach $n (@runlist) {
3247 if ($n ==100) {my $object = $runlist[1]; &$bpinspect1($object); last;}
3248 if ($n ==1) {&$access_remote_db; next;}
3249 if ($n ==2) {&$index_local_db; next;}
3250 if ($n ==3) {&$fetch_local_db; next;}
3251 if ($n ==4) {&$sequence_manipulations; next;}
3252 if ($n ==5) {&$seqstats_and_seqwords; next;}
3253 if ($n ==6) {&$restriction_and_sigcleave; next;}
3254 if ($n ==7) {&$other_seq_utilities; next;}
3255 if ($n ==8) {&$run_standaloneblast; next;}
3256 if ($n ==9) {&$blast_parser; next;}
3257 if ($n ==10) {&$bplite_parsing; next;}
3258 if ($n ==11) {&$hmmer_parsing; next;}
3259 if ($n ==12) {&$run_clustalw_tcoffee; next;}
3260 if ($n ==13) {&$run_psw_bl2seq; next;}
3261 if ($n ==14) {&$simplealign_univaln ; next;}
3262 if ($n ==15) {&$gene_prediction_parsing; next;}
3263 if ($n ==16) {&$sequence_annotation; next;}
3264 if ($n ==17) {&$largeseqs; next;}
3265 if ($n ==18) {&$liveseqs; next;}
3266 if ($n ==19) {&$demo_variations; next;}
3267 if ($n ==20) {&$demo_xml; next;}
3268 &$display_help;
3271 ## End of "main"