Bioperl FAQ
-----------
v. 1.2

This FAQ maintained by:
* Jason Stajich <jason@bioperl.org>
* Brian Osborne <brian_osborne@cognia.com>
* Heikki Lehvaslaiho <heikki@ebi.ac.uk>


---------------------------------------------------------------------------

Contents

---------------------------------------------------------------------------

0. About this FAQ

   Q0.1: What is this FAQ?
   Q0.2: How is it maintained?

1. Bioperl in general

   Q1.1: What is Bioperl?
   Q1.2: Where do I go to get the latest release?
   Q1.3: What is the difference between 0.9.x and 0.7.x? What do you mean
         developer release?
   Q1.4: Is it BioPerl, bioperl, bio.perl.org, Bioperl?  What's the deal?
   Q1.5: How do I figure out how to use a module?
   Q1.6: I'm interested in the bleeding edge version of the code, where can
         I get it?
   Q1.7: Who uses this toolkit?
   Q1.8: How should I cite Bioperl?
   Q1.9: What are the license terms for Bioperl?
  Q1.10: I want to help, where do I start?
  Q1.11: I've got an idea for a module how do I contribute it?

2. Sequences

   Q2.1: How do I parse a sequence file?
   Q2.2: I can't get sequences with Bio::DB::GenBank any more, why not?
   Q2.3: How can I get NT_ or NM_ or NP_ accessions from NCBI
         (Reference Sequences)?
   Q2.4: How can I use SeqIO to parse sequence data from a string?

3. Report parsing

   Q3.1: I want to parse BLAST, how do I do this?
   Q3.2: What's wrong with Bio::Tools::Blast?
   Q3.3: I want to parse FastA or NCBI -m7 (XML) format, how do I do this?
   Q3.4: Let's say I want to do pairwise alignments of 2 sequences how can
         I do this?
   Q3.5: I'm using BPLite.pm and its frame() to parse Blast but I'm seeing
         0, 1, or 2 instead of the expected -3, -2, -1, +1, +2, +3. Why am
         I seeing these different numbers and how do I get the frame
         according to Blast?

4. Utilities

   Q4.1: How do I find all the ORFs in a nucleotide sequence? Antigenic
         sites in a protein? Calculate nucleotide melting temperature? Find
         repeats?
   Q4.2: How do I do motif searches with Bioperl? Can I do "find all
         sequences that are 75% identical" to a given motif?
   Q4.3: Can I query MEDLINE or other bibliographic repositories using
         Bioperl?

5. Annotations and Features

   Q5.1: I get the warning "(old style Annotation) on new style
         Annotation::Collection".  What is wrong?
   Q5.2: How do I retrieve all the features from a Sequence?  How about all
         the features which are exons or have a /note field that contains a
         certain gene name?
   Q5.3: How do I parse the CDS join() statements in Genbank or EMBL files
         so I can reconstruct the CDS sequence?

6. Running external programs

   Q6.1: How do I run Blast from within Bioperl?
   Q6.2: Hey, I want to run clustalw within Bioperl, I used
         Bio::Tools::Run::Alignment::Clustalw before - where did it go?
   Q6.3: What does the future hold for running applications within Bioperl?

---------------------------------------------------------------------------

0. About this FAQ

---------------------------------------------------------------------------



   Q0.1: What is this FAQ?

      A: It is the list of Frequently Asked Questions about Bioperl.


   Q0.2: How is it maintained?

      A: This FAQ is generated using a Perl script and an XML file. All
         the files are in the Bioperl distribution directory doc/faq. So do
         not edit this file! Edit the file faq.xml and run:

           % faq.pl -text faq.xml

         The XML structure was originally used by the Perl XML project.
         Their website seems to have vanished, though. The XML and
         modifying scripts were copied from Michel Rodriguez's web site
         http://www.xmltwig.com/xmltwig/XML-Twig-FAQ.html and modified to
         our needs.


---------------------------------------------------------------------------

1. Bioperl in general

---------------------------------------------------------------------------



   Q1.1: What is Bioperl?

      A: Bioperl is a toolkit of perl modules useful in building
         bioinformatics solutions in perl.  It is built in an
         object-oriented manner so that many modules depend on each other
         to achieve a task. The collection of modules in the bioperl-live
         repository constitutes the core functionality of bioperl.
         Additionally, auxiliary modules for creating graphical interfaces
         (bioperl-gui), persistent storage in an RDBMS (bioperl-db), and
         CORBA bridges to the BioCORBA (http://www.biocorba.org)
         specification (bioperl-corba-server and bioperl-corba-client) are
         all available as CVS modules in our repository.


   Q1.2: Where do I go to get the latest release?

      A: You can always get our releases from ftp://bioperl.org/pub/DIST.
         Official releases will be noted on the website http://bioperl.org.


   Q1.3: What is the difference between 0.9.x and 0.7.x? What do you mean
         developer release?

      A: The 0.7.x series (0.7.0, 0.7.2) were all released in 2001 and were
         stable releases on the 0.7 branch.  This means they had a set of
         functionality that was maintained throughout (no experimental
         modules) and were guaranteed to pass all tests, and subsequent bug
         fix releases with the 0.7 designation would not have any API
         changes.

         The 0.9.x series was our first attempt at releasing so-called
         developer releases.  These are snapshots of the actively developed
         code that at a minimum pass all our tests.

         But really, you should be using version 1.*!


   Q1.4: Is it BioPerl, bioperl, bio.perl.org, Bioperl?  What's the deal?

      A: Well, the perl.org guys granted us use of bio.perl.org. We prefer
         to be called Bioperl or BioPerl (unlike our Biopython friends).
         We're part of the Open Bioinformatics Foundation (OBF) and so as
         part of the Bio{*} toolkits we prefer the Bioperl spelling.  But
         we're not really all that picky so no worries.


   Q1.5: How do I figure out how to use a module?

      A: Read the embedded perl documentation (Plain Old Documentation -
         POD) that is part of every module.  Do:

           % perldoc MODULE

         (careful - spelling and case count!).

         The bioperl tutorial - bptutorial.pl - provided in the root
         directory of the bioperl release also provides a good
         introduction.  There are links to tutorials on the bioperl
         website that may provide some additional help.

         There are also many scripts in the examples/ and scripts/
         directories that could be useful - see bioperl.pod for a brief
         description of all of them.

         Additionally, we have written many tests for our modules. You can
         see test data and example usage of the modules in these tests -
         look in the test directory (called 't').


   Q1.6: I'm interested in the bleeding edge version of the code, where can
         I get it?

      A: Go to http://cvs.bioperl.org and you'll see instructions on how to
         get the CVS code.

         Basically:

           % cvs -d :pserver:cvs@cvs.bioperl.org:/home/repository/bioperl \
               login

         Enter 'cvs' for the password, then:

           % cvs -d :pserver:cvs@cvs.bioperl.org:/home/repository/bioperl \
               co bioperl_all


   Q1.7: Who uses this toolkit?

      A: Lots of people: the Sanger Centre, EBI, many large and small
         academic laboratories, and large and small pharmaceutical
         companies. All the developers on the bioperl list use the toolkit
         in some capacity on a regular basis.

         The Genquire annotation system
         (http://www.bioinformatics.org/Genquire/) and Ensembl
         (http://www.ensembl.org/) use bioperl as the basis for their
         implementation.


   Q1.8: How should I cite Bioperl?

      A: Please cite it as:

         Stajich JE, Block D, Boulez K, Brenner SE, Chervitz SA,
         Dagdigian C, Fuellen G, Gilbert JGR, Korf I, Lapp H,
         Lehvaslaiho H, Matsalla C, Mungall CJ, Osborne BI,
         Pocock MR, Schattner P, Senger M, Stein LD, Stupka ED,
         Wilkinson M, Birney E.
         The Bioperl Toolkit: Perl modules for the life sciences.
         Genome Research. 2002 Oct;12(10):1161-8.


   Q1.9: What are the license terms for Bioperl?

      A: Bioperl is licensed under the same terms as Perl itself, which is
         the Perl Artistic License. You can see more information on that
         license at http://www.perl.com/pub/a/language/misc/Artistic.html
         and http://www.opensource.org/licenses/artistic-license.html.


  Q1.10: I want to help, where do I start?

      A: Bioperl is a pretty diverse collection of modules which has grown
         from the direct needs of the developers participating in the
         project.  So if you don't have a need for a specific module in the
         toolkit, it is hard to describe ways it needs to be expanded or
         adapted.  One area where you can help, however, is the development
         of stand-alone scripts which use bioperl components for common
         tasks.  Some starting points for scripts: find out what people in
         your institution do routinely that a shortcut can be developed
         for.  Identify modules in bioperl that need easy interfaces and
         write that wrapper - you'll learn how to use the module inside and
         out.  We always need people to help fix bugs - read the jitterbug
         bug tracking system (webpage linked from the bioperl website
         sidebar under "Bugs").


  Q1.11: I've got an idea for a module how do I contribute it?

      A: We suggest the following.  Post your idea to the bioperl list,
         bioperl-l@bioperl.org. If it is a really new idea, consider taking
         us through your thought process.  We'll help you tease out the
         necessary information such as what methods you'll want and how it
         can interact with other bioperl modules.  If it is a port of
         something you've already worked on, give us a summary of the
         current methods.  Make sure there is an interface to the module,
         not just an implementation (see biodesign.pod for more info), and
         make sure there will be a set of tests in the t/ directory to
         ensure that your module is tested.


---------------------------------------------------------------------------

2. Sequences

---------------------------------------------------------------------------



   Q2.1: How do I parse a sequence file?

      A: Use the Bio::SeqIO system.  This will create Bio::Seq objects for
         you.  See the tutorial bptutorial.pl for more information or the
         documentation for Bio::SeqIO (e.g. 'perldoc Bio::SeqIO').
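
         As a minimal sketch of that pattern: the script below writes a
         small FASTA file (the two records are made-up test data) and reads
         it back through Bio::SeqIO. The temporary file and the %length_of
         hash are only scaffolding for the example.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use Bio::SeqIO;

# Write a small FASTA file so the example is self-contained
# (both records are invented test data).
my ($out, $fasta_file) = tempfile(SUFFIX => '.fa', UNLINK => 1);
print {$out} ">seq1 first test sequence\nACGTACGTAC\n",
             ">seq2 second test sequence\nGGGCCCAAAT\nTTT\n";
close $out;

# Open the file for reading; -format names the parser to load.
my $in = Bio::SeqIO->new(-file => $fasta_file, -format => 'fasta');

my %length_of;
while (my $seq = $in->next_seq) {
    # Each record comes back as a Bio::Seq object.
    $length_of{ $seq->display_id } = $seq->length;
    printf "%s\t%d\n", $seq->display_id, $seq->length;
}
```

         Changing -format (for example to 'genbank' or 'embl') is all that
         is needed to read another supported format with the same loop.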


   Q2.2: I can't get sequences with Bio::DB::GenBank any more, why not?

      A: NCBI changed the web CGI script that provided this access.  If
         you see this problem you must be using bioperl <= 0.7.2.  The
         developer release 0.9.3 contains the fix, as does the 1.0 release.


   Q2.3: How can I get NT_ or NM_ or NP_ accessions from NCBI
         (Reference Sequences)?

      A: Use Bio::DB::RefSeq, not Bio::DB::GenBank or Bio::DB::GenPept,
         when you are retrieving these accessions. This is still an area of
         active development because the data providers have not provided
         the best interface for us to query.  EBI has provided a mirror
         with their dbfetch system, which is accessible through the
         Bio::DB::RefSeq object; however, there are cases where NT_
         accessions will not be retrievable.
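
         The choice of database class can be made from the accession
         prefix. This is only a sketch: db_class_for and fetch_by_accession
         are hypothetical helper names, and falling back to
         Bio::DB::GenBank for every other accession is a simplification
         (protein accessions would need Bio::DB::GenPept). The fetch itself
         needs network access, so it is defined here but not called.

```perl
use strict;
use warnings;

# Pick the Bioperl database class for an accession: RefSeq-style
# prefixes (NT_, NM_, NP_) go to Bio::DB::RefSeq, everything else to
# Bio::DB::GenBank (a simplification made for this sketch).
sub db_class_for {
    my ($accession) = @_;
    return $accession =~ /^(?:NT|NM|NP)_/ ? 'Bio::DB::RefSeq'
                                          : 'Bio::DB::GenBank';
}

# Load the chosen class on demand and fetch the record as a Bio::Seq.
sub fetch_by_accession {
    my ($accession) = @_;
    my $class = db_class_for($accession);
    eval "require $class; 1" or die "Cannot load $class: $@";
    return $class->new->get_Seq_by_acc($accession);
}

print db_class_for('NM_006732'), "\n";   # a RefSeq-style accession
print db_class_for('J00522'), "\n";      # an ordinary GenBank accession
```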


   Q2.4: How can I use SeqIO to parse sequence data from a string?

      A: Wrap the string in an IO::String filehandle and pass it to
         Bio::SeqIO with -fh:

           use IO::String;
           use Bio::SeqIO;

           # $string holds the sequence data, e.g. ">id1\nACGT\n"
           my $stringfh = IO::String->new($string);

           my $seqio = Bio::SeqIO->new(-fh     => $stringfh,
                                       -format => 'fasta');
           while( my $seq = $seqio->next_seq ) {
               # process each seq
           }


---------------------------------------------------------------------------

3. Report parsing

---------------------------------------------------------------------------



   Q3.1: I want to parse BLAST, how do I do this?

      A: Well, you might notice that there are a lot of choices.  Sorry
         about that.  We've been evolving towards a single solution.

         Currently the best way to parse a report is to use the SearchIO
         system.  This supports BLAST and FASTA report parsing.  The
         bptutorial provides an example of how to use this system, as does
         the documentation for Bio::SearchIO.
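
         The usual SearchIO loop nests three levels: results (one per
         query), hits (one per database match) and HSPs (one per aligned
         segment). The following is a sketch of that loop, not the only way
         to use the system; summarize_report is a hypothetical helper name
         and the report file is whatever you pass in.

```perl
use strict;
use warnings;
use Bio::SearchIO;

# Print one tab-separated line per HSP in a search report and return
# the number of HSPs seen. $format defaults to 'blast'.
sub summarize_report {
    my ($file, $format) = @_;
    $format ||= 'blast';

    my $searchio = Bio::SearchIO->new(-file => $file, -format => $format);
    my $hsp_count = 0;

    while (my $result = $searchio->next_result) {    # one per query
        while (my $hit = $result->next_hit) {        # one per match
            while (my $hsp = $hit->next_hsp) {       # one per segment
                printf "%s\t%s\t%s\t%.1f\n",
                    $result->query_name, $hit->name,
                    $hsp->evalue, $hsp->percent_identity;
                $hsp_count++;
            }
        }
    }
    return $hsp_count;
}

# Example call, assuming a text Blast report saved as 'report.bls':
# my $n = summarize_report('report.bls', 'blast');
```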


   Q3.2: What's wrong with Bio::Tools::Blast?

      A: Nothing is really wrong with it; it has just been outgrown by a
         more generic approach to reports.  This generic approach allows us
         to just write pluggable modules for fasta and Blast parsing while
         using the same framework.  This is completely analogous to the
         Bio::SeqIO system of parsing sequence files.  However, the objects
         produced are of the Bio::Search rather than Bio::Seq variety.


   Q3.3: I want to parse FastA or NCBI -m7 (XML) format, how do I do this?

      A: It is as simple as parsing text BLAST results - you simply need to
         specify the format as "fasta" or "blastxml" and the parser will
         load the appropriate module for you.  You can use exactly the same
         logic and code for all of these formats, as we have generalized
         the modules for sequence database searching.


   Q3.4: Let's say I want to do pairwise alignments of 2 sequences how can
         I do this?

      A: Look at Bio::Factory::EMBOSS to see how to use the 'water' and
         'needle' alignment programs that are part of the EMBOSS suite.

         Additionally you can use the pSW module that is part of the
         bioperl-ext package (distributed separately at
         ftp://bioperl.org/pub/DIST). However, note that this only does
         protein alignments and is no longer a supported module.  Instead
         the EMBOSS implementation is the best path ahead unless someone
         else wants to provide an Inline::C implementation.
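
         A rough sketch of the EMBOSS route follows. It assumes EMBOSS is
         installed and on your path, and that the option names passed to
         run() are the qualifiers of your EMBOSS version's water and needle
         programs (-asequence, -bsequence, -gapopen, -gapextend, -outfile);
         check the Bio::Factory::EMBOSS documentation for your release.
         emboss_pairwise is a hypothetical helper name and the gap
         penalties are illustrative values.

```perl
use strict;
use warnings;

# Align two Bio::Seq objects with EMBOSS 'water' (local alignment);
# pass 'needle' as the fourth argument for a global alignment.
# Returns the alignment read back from $outfile.
sub emboss_pairwise {
    my ($seq1, $seq2, $outfile, $program) = @_;
    $program ||= 'water';

    # Loaded on demand, since these need Bioperl plus an EMBOSS install.
    require Bio::Factory::EMBOSS;
    require Bio::AlignIO;

    my $factory = Bio::Factory::EMBOSS->new();
    my $app = $factory->program($program)
        or die "EMBOSS program '$program' not found\n";

    # The keys are EMBOSS command-line qualifiers (assumed names).
    $app->run({
        -asequence => $seq1,
        -bsequence => $seq2,
        -gapopen   => '10.0',
        -gapextend => '0.5',
        -outfile   => $outfile,
    });

    # Parse the EMBOSS output into an alignment object.
    my $alnio = Bio::AlignIO->new(-format => 'emboss', -file => $outfile);
    return $alnio->next_aln;
}

# Example call, with $seq1 and $seq2 read via Bio::SeqIO:
# my $aln = emboss_pairwise($seq1, $seq2, 'pair.water');
```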


   Q3.5: I'm using BPLite.pm and its frame() to parse Blast but I'm seeing
         0, 1, or 2 instead of the expected -3, -2, -1, +1, +2, +3. Why am
         I seeing these different numbers and how do I get the frame
         according to Blast?

      A: These are GFF frames - so +1 is 0 in GFF, and -3 is encoded as
         a frame of 2 with the strand set to -1 (for more on GFF see
         http://www.sanger.ac.uk/Software/formats/GFF/GFF_Spec.shtml).

         Frames are relative to the hit or query sequence, so you need to
         ask for the frame of the sequence you are interested in:

           $hsp->hit->strand();
           $hsp->hit->frame();

         or

           $hsp->query->strand();
           $hsp->query->frame();

         So the value that a Blast report shows as -3 can be reconstructed
         as:

           my $blastvalue = ($hsp->query->frame + 1) * $hsp->query->strand;
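
         The same arithmetic can be wrapped in a pair of small helpers.
         This is a sketch in plain Perl; blast_frame and gff_frame are
         names made up for the example and are not Bioperl methods.

```perl
use strict;
use warnings;

# Convert a GFF-style frame (0, 1, 2) plus a strand (1 or -1) into the
# signed frame printed in a Blast report (+1..+3 or -1..-3).
sub blast_frame {
    my ($gff_frame, $strand) = @_;
    return ($gff_frame + 1) * $strand;
}

# The inverse: split a Blast frame into (GFF frame, strand).
sub gff_frame {
    my ($blast_frame) = @_;
    return (abs($blast_frame) - 1, $blast_frame <=> 0);
}

print blast_frame(0, 1), "\n";    # prints 1
print blast_frame(2, -1), "\n";   # prints -3
```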


---------------------------------------------------------------------------

4. Utilities

---------------------------------------------------------------------------



   Q4.1: How do I find all the ORFs in a nucleotide sequence? Antigenic
         sites in a protein? Calculate nucleotide melting temperature? Find
         repeats?

      A: In fact, none of these functions are built into Bioperl but they
         are all available in the EMBOSS package (http://www.emboss.org/),
         as well as many others. The Bioperl developers created a simple
         interface to EMBOSS such that any and all EMBOSS programs can be
         run from within Bioperl. See Bio::Factory::EMBOSS for more
         information.

         If you can't find the functionality you want in Bioperl then make
         sure to look for it in EMBOSS; its programs integrate quite
         gracefully with Bioperl. Of course, you will have to install
         EMBOSS to get this access.

         In addition, Bioperl after version 1.0.1 contains the Pise/Bioperl
         modules. The Pise package
         (http://www-alt.pasteur.fr/~letondal/Pise) was designed to provide
         a uniform interface to bioinformatics applications, and currently
         provides wrappers to more than 250 such applications! Included
         amongst these wrapped apps are HMMER, Phylip, BLAST, GENSCAN, even
         the EMBOSS suite. Use of the Pise/Bioperl modules does not require
         installation of the Pise package.


   Q4.2: How do I do motif searches with Bioperl? Can I do "find all
         sequences that are 75% identical" to a given motif?

      A: There are a number of approaches. Within Bioperl take a look at
         Bio::Tools::SeqPattern. Or, take a look at the TFBS
         (Transcription Factor Binding Site) package, at
         http://forkhead.cgb.ki.se/TFBS. This Bioperl-compliant package
         specializes in pattern searching of nucleotide sequence using
         matrices.

         It's also conceivable that the combination of Bioperl and Perl's
         regular expressions could do the trick. You might also consider
         the CPAN module String::Approx (this module addresses the percent
         match query), but experienced users question whether its distance
         estimates are correct; the Unix agrep command is thought to be
         faster and more accurate.  Finally, you could use EMBOSS, as
         discussed in the previous question (or you could use Pise to run
         EMBOSS applications). The relevant programs would be fuzzpro or
         fuzznuc.
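
         For short motifs, the percent-identity question can also be
         answered with a plain sliding window. The sketch below is not a
         Bioperl module and is not how String::Approx, agrep or the EMBOSS
         programs work: it only compares ungapped windows, and find_similar
         is a name made up for the example.

```perl
use strict;
use warnings;

# Slide a window the length of $motif along $seq and report every
# window whose percent identity to the motif is at least $min_pct.
# Returns a list of [1-based start, matched substring, percent identity].
sub find_similar {
    my ($seq, $motif, $min_pct) = @_;
    ($seq, $motif) = (uc $seq, uc $motif);
    my $len = length $motif;
    my @hits;
    for my $i (0 .. length($seq) - $len) {
        my $window = substr($seq, $i, $len);
        # XOR of two equal-length strings gives "\0" wherever they agree.
        my $same = (($window ^ $motif) =~ tr/\0//);
        push @hits, [$i + 1, $window, 100 * $same / $len]
            if $same * 100 >= $min_pct * $len;
    }
    return @hits;
}

# Prints "3	ACGA	75%": only the window starting at position 3 matches
# the motif at 3 of 4 positions.
for my $hit (find_similar('TTACGATT', 'ACGT', 75)) {
    printf "%d\t%s\t%.0f%%\n", @$hit;
}
```

         To scan a whole file, call find_similar on $seq->seq for each
         sequence returned by Bio::SeqIO.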


   Q4.3: Can I query MEDLINE or other bibliographic repositories using
         Bioperl?

      A: Yes! The solution lies in Bio::Biblio*, a set of modules that
         provide access to MEDLINE and OpenBQS-compliant servers using
         SOAP. See Bio/Biblio.pm or examples/biblio.pl for details and
         example code.


---------------------------------------------------------------------------

5. Annotations and Features

---------------------------------------------------------------------------



   Q5.1: I get the warning "(old style Annotation) on new style
         Annotation::Collection".  What is wrong?

      A: This is because we have transitioned from the
         add_Comment/each_Comment, add_Reference/each_Reference style to
         add_Annotation('comment', $ann)/get_Annotations('comment'). Please
         update your code in order to avoid seeing these warning messages.

         The change came with 1.0, when we replaced the annotation
         collection implementation, moving from the Bio::Annotation object
         to the Bio::Annotation::Collection object, which is a more general
         and extensible system.  In the future the Reference objects will
         likely be implemented by the Bio::Biblio system but we hope to
         maintain a compatible API for these.
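
         A short sketch of the new-style calls, assuming the
         Bio::Annotation::Comment class and its text() accessor. The
         collection is built by hand here to keep the example small; for a
         sequence read with Bio::SeqIO you would normally start from
         $seq->annotation instead.

```perl
use strict;
use warnings;
use Bio::Annotation::Collection;
use Bio::Annotation::Comment;

# Build an empty collection (normally obtained from $seq->annotation).
my $collection = Bio::Annotation::Collection->new();

# New style: add an annotation object under a key ...
my $comment = Bio::Annotation::Comment->new(-text => 'an example comment');
$collection->add_Annotation('comment', $comment);

# ... and get back every annotation stored under that key.
my @comment_texts = map { $_->text } $collection->get_Annotations('comment');
print "$_\n" for @comment_texts;
```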


   Q5.2: How do I retrieve all the features from a Sequence?  How about all
         the features which are exons or have a /note field that contains a
         certain gene name?

      A: To get all the features:

           my @features = $seq->all_SeqFeatures();

         To get only the features whose primary tag is 'exon':

           my @exons = grep { $_->primary_tag eq 'exon' }
                       $seq->all_SeqFeatures();

         To get only the features that have the tag 'note' and whose note
         field contains the requested string $noteval:

           my @f_with_note = grep {
                                 my @a = $_->has_tag('note')
                                       ? $_->each_tag_value('note') : ();
                                 grep { /\Q$noteval\E/ } @a;
                             } $seq->all_SeqFeatures();


   Q5.3: How do I parse the CDS join() statements in Genbank or EMBL files
         so I can reconstruct the CDS sequence?

      A: A join() is represented by a Bio::Location::SplitLocationI
         object.  Check the feature's location for that interface to get
         the coordinates, and use primary_tag() to find the CDS features:

           if ( $feature->location->isa('Bio::Location::SplitLocationI')
                            && $feature->primary_tag eq 'CDS' )  {
             foreach my $location ( $feature->location->sub_Location ) {
               print $location->start . ".." . $location->end . "\n";
             }
           }
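
         Once you have the sub-locations, the CDS sequence is the
         concatenation of the corresponding subsequences. The sketch below
         builds a toy sequence and a join(1..6,12..17) feature by hand (all
         values are invented) instead of reading a GenBank file, and it
         only handles the forward strand.

```perl
use strict;
use warnings;
use Bio::Seq;
use Bio::SeqFeature::Generic;
use Bio::Location::Simple;
use Bio::Location::Split;

# Toy sequence: exon 1 (1..6), a five-base "intron", exon 2 (12..17).
my $seq = Bio::Seq->new(-id => 'toy', -seq => 'ATGAAACCCCCGGGTAA');

# Equivalent of a feature table entry "CDS join(1..6,12..17)".
my $split = Bio::Location::Split->new();
$split->add_sub_Location(
    Bio::Location::Simple->new(-start => 1,  -end => 6,  -strand => 1));
$split->add_sub_Location(
    Bio::Location::Simple->new(-start => 12, -end => 17, -strand => 1));

my $cds = Bio::SeqFeature::Generic->new(-primary => 'CDS');
$cds->location($split);
$seq->add_SeqFeature($cds);

# Concatenate the exon subsequences in the order they were joined.
my $cds_seq = '';
if ($cds->location->isa('Bio::Location::SplitLocationI')) {
    for my $location ($cds->location->sub_Location) {
        $cds_seq .= $seq->subseq($location->start, $location->end);
    }
}
print "$cds_seq\n";
```

         For reverse-strand features each piece must be reverse
         complemented and the order reversed. Later Bioperl releases offer
         a spliced_seq() method on features that is meant to handle this
         for you; check the documentation of your version.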


---------------------------------------------------------------------------

6. Running external programs

---------------------------------------------------------------------------



   Q6.1: How do I run Blast from within Bioperl?

      A: Use the module Bio::Tools::Run::StandAloneBlast.  It will give
         you access to many of the search tools in the NCBI blast suite
         including blastall, bl2seq, blastpgp.  The basic structure is like
         this:

           use Bio::PrimarySeq;
           use Bio::Tools::Run::StandAloneBlast;

           # p = program, d = database, e = expectation value cutoff
           my $factory = Bio::Tools::Run::StandAloneBlast->new(p => 'blastn',
                                                               d => 'nt',
                                                               e => '1e-5');
           my $seq = Bio::PrimarySeq->new(-id  => 'test1',
                                          -seq => 'AGATCAGTAGATGATAGGGGTAGA');
           my $report = $factory->blastall($seq);


   Q6.2: Hey, I want to run clustalw within Bioperl, I used
         Bio::Tools::Run::Alignment::Clustalw before - where did it go?

      A: The Bio::Tools::Run directory was moved to a new package to help
         make the size of the core code smaller and separate out the more
         specialized nature of application running from the rest of
         Bioperl.  You can get these modules by installing the bioperl-run
         package.  This is available from CVS under the same name, in the
         http://bioperl.org/DIST directory, and on CPAN.  This changeover
         began in the bioperl 1.1 developer release.


   Q6.3: What does the future hold for running applications within Bioperl?

      A: We are trying to build a standard starting point for analysis
         applications, which will probably look like
         Bio::Tools::Run::AnalysisFactory and will allow the user to
         request which type of remote or local server they want to use to
         run their analyses.  This will connect to Pasteur's Pise server
         and the EBI's Novella server, as well as be aware of wrappers to
         run applications locally.

         Additionally we suggest investigating the BioPipe project
         (http://www.biopipe.org), which is making it easier to chain
         together various sets of analyses and build rules for performing
         these computes.

---------------------------------------------------------------------------
Copyright (c) 2002-2003 Open Bioinformatics Foundation. You may distribute
this FAQ under the same terms as perl itself.