FAQ

   1
   2 Bioperl FAQ
   3 -----------
   4 v. 1.0.2
   5
   6 This FAQ maintained by:
   7 * Jason Stajich <jason@bioperl.org>
   8 * Brian Osborne <brian_osborne@cognia.com>
   9 * Heikki Lehvaslaiho <heikki@ebi.ac.uk>
  10
  11
  12 ---------------------------------------------------------------------------
  13
  14 Contents
  15
  16 ---------------------------------------------------------------------------
  17
  18 0. About this FAQ
  19
  20    Q0.1: What is this FAQ?
  21    Q0.2: How is it maintained?
  22
  23 1. Bioperl in general
  24
  25    Q1.1: What is Bioperl?
  26    Q1.2: Where do I go to get the latest release?
  27    Q1.3: What is the difference between 0.9.x and 0.7.x? What do you mean
  28          developer release?
  29    Q1.4: Is it BioPerl, bioperl, bio.perl.org, Bioperl?  What's the deal?
  30    Q1.5: How do I figure out how to use a module?
  31    Q1.6: I'm interested in the bleeding edge version of the code, where can
  32          I get it?
  33    Q1.7: Who uses this toolkit?
  34    Q1.8: How should I cite Bioperl?
  35    Q1.9: What are the license terms for Bioperl?
  36   Q1.10: I want to help, where do I start?
  37   Q1.11: I've got an idea for a module how do I contribute it?
  38
  39 2. Sequences
  40
  41    Q2.1: How do I parse a sequence file?
  42    Q2.2: I can't get sequences with Bio::DB::GenBank any more, why not?
  43    Q2.3: How can I get NT_ or NM_ or NP_ accessions from NCBI
  44          (Reference Sequences)?
  45    Q2.4: How can I use SeqIO to parse sequence data from a string?
  46
  47 3. Report parsing
  48
  49    Q3.1: I want to parse BLAST, how do I do this?
  50    Q3.2: What's wrong with Bio::Tools::Blast?
  51    Q3.3: I want to parse FastA or NCBI -m7 (XML) format, how do I do this?
  52    Q3.4: Let's say I want to do pairwise alignments of 2 sequences how can
  53          I do this?
  54    Q3.5: I'm using BPLite.pm and its frame() to parse Blast but I'm seeing
  55          0, 1, or 2 instead of the expected -3, -2, -1, +1, +2, +3. Why am
  56          I seeing these different numbers and how do I get the frame
  57          according to Blast?
  58
  59 4. Utilities
  60
  61    Q4.1: How do I find all the ORFs in a nucleotide sequence? Antigenic
  62          sites in a protein? Calculate nucleotide melting temperature? Find
  63          repeats?
  64    Q4.2: How do I do motif searches with Bioperl? Can I do "find all
  65          sequences that are 75% identical" to a given motif?
  66    Q4.3: Can I query MEDLINE or other bibliographic repositories using
  67          Bioperl?
  68
  69 5. Annotations and Features
  70
  71    Q5.1: I get the warning "(old style Annotation) on new style
  72          Annotation::Collection".  What is wrong?
  73    Q5.2: How do I retrieve all the features from a Sequence?  How about all
  74          the features which are exons or have a /note field that contains a
  75          certain gene name?
  76    Q5.3: How do I parse the CDS join() statements in Genbank or EMBL files
  77          so I can reconstruct the CDS sequence?
  78
  79 6. Running external programs
  80
  81    Q6.1: I want to run StandAloneBlast within bioperl - where did it go?
  82    Q6.2: What does the future hold for running applications within Bioperl?
  83
  84 ---------------------------------------------------------------------------
  85
  86 0. About this FAQ
  87
  88 ---------------------------------------------------------------------------
  89
  90
  91
  92    Q0.1: What is this FAQ?
  93
  94       A: It is the list of Frequently Asked Questions about Bioperl.
  95
  96
  97    Q0.2: How is it maintained?
  98
  99       A: This FAQ was generated using a Perl script and an XML file. All
 100          the files are in the Bioperl distribution directory doc/faq. So do
 101          not edit this file! Edit file faq.xml and run:
 102
 103            % faq.pl -text faq.xml
 104
 105          The XML structure was originally used by the Perl XML project.
 106          Their website seems to have vanished, though. The XML and
  107          modifying scripts were copied from Michel Rodriguez's web site
 108          http://www.xmltwig.com/xmltwig/XML-Twig-FAQ.html and modified to
 109          our needs.
 110
 111
 112 ---------------------------------------------------------------------------
 113
 114 1. Bioperl in general
 115
 116 ---------------------------------------------------------------------------
 117
 118
 119
 120    Q1.1: What is Bioperl?
 121
  122       A: Bioperl is a toolkit of Perl modules useful in building
  123          bioinformatics solutions in Perl.  It is built in an
  124          object-oriented manner so that many modules depend on each other
  125          to achieve a task. The collection of modules in the bioperl-live
  126          repository constitutes the core functionality of bioperl.
  127          Additionally, auxiliary modules for creating graphical interfaces
  128          (bioperl-gui), persistent storage in an RDBMS (bioperl-db), and
  129          CORBA bridges to the BioCORBA (http://www.biocorba.org)
  130          specification (bioperl-corba-server and bioperl-corba-client) are
  131          all available as CVS modules in our repository.
 132
 133
 134    Q1.2: Where do I go to get the latest release?
 135
 136       A: You can always get our releases from ftp://bioperl.org/pub/DIST.
 137          Official releases will be noted on the website http://bioperl.org.
 138
 139
 140    Q1.3: What is the difference between 0.9.x and 0.7.x? What do you mean
 141          developer release?
 142
  143       A: The 0.7.X series (0.7.0, 0.7.2) were all released in 2001 and
  144          were stable releases on the 0.7 branch.  This means they had a
  145          set of functionality that was maintained throughout (no
  146          experimental modules) and were guaranteed to pass all tests.
  147          Subsequent bug fix releases with the 0.7 designation would not
  148          have any API changes.
  149
  150          The 0.9.X series was our first attempt at releasing so-called
  151          developer releases.  These are snapshots of the actively developed
  152          code that at a minimum pass all our tests.
 153
 154          But really, you should be using version 1.*!
 155
 156
 157    Q1.4: Is it BioPerl, bioperl, bio.perl.org, Bioperl?  What's the deal?
 158
 159       A: Well, the perl.org guys granted us use of bio.perl.org. We prefer
 160          to be called Bioperl or BioPerl (unlike our Biopython friends).
 161          We're part of the Open Bioinformatics Foundation (OBF) and so as
 162          part of the Bio{*} toolkits we prefer the Bioperl spelling.  But
 163          we're not really all that picky so no worries.
 164
 165
 166    Q1.5: How do I figure out how to use a module?
 167
  168       A: Read the embedded Perl documentation (Plain Old Documentation -
  169          POD) that is part of every module.  Do:
  170
  171            % perldoc MODULE
  172
  173          e.g. 'perldoc Bio::SeqIO' (careful - spelling and case count!).
 174
 175          The bioperl tutorial - bptutorial.pl - provided in the root
 176          directory of the bioperl release will also provide a good
 177          introduction.  There are links to tutorials off the bioperl
 178          website that may provide some additional help.
 179
 180          There are also many scripts in the examples/ and scripts/
 181          directories that could be useful - see bioperl.pod for a brief
 182          description of all of them.
 183
  184          Additionally, we have written many tests for our modules. You can
  185          see test data and example usage of the modules in these tests -
  186          look in the test dir (called 't').
 187
 188
 189    Q1.6: I'm interested in the bleeding edge version of the code, where can
 190          I get it?
 191
 192       A: Go to http://cvs.bioperl.org and you'll see instructions on how to
 193          get the CVS code.
 194
 195          Basically:
 196
  197            % cvs -d :pserver:cvs@cvs.bioperl.org:/home/repository/bioperl \
  198                  login
  199
  200          Enter 'cvs' for the password, then:
  201
  202
  203            % cvs -d :pserver:cvs@cvs.bioperl.org:/home/repository/bioperl \
  204                  co bioperl_all
 205
 206
 207    Q1.7: Who uses this toolkit?
 208
 209       A: Lots of people.  Sanger Centre, EBI, many large and small academic
 210          laboratories, large and small pharmaceutical companies. All the
 211          developers on the bioperl list use the toolkit in some capacity on
 212          a regular basis.
 213
 214          The Genquire annotation system
 215          (http://www.bioinformatics.org/Genquire/) and Ensembl
 216          (http://www.ensembl.org/) use bioperl as the basis for their
 217          implementation.
 218
 219
 220    Q1.8: How should I cite Bioperl?
 221
  222       A: Please cite it as:
 223          Stajich JE, Block D, Boulez K, Brenner SE, Chervitz SA,
 224          Dagdigian C, Fuellen G, Gilbert JGR, Korf I, Lapp H,
 225          Lehvaslaiho H, Matsalla C, Mungall CJ, Osborne BI,
 226          Pocock MR, Schattner P, Senger M, Stein LD, Stupka ED,
 227          Wilkinson M, Birney E.
 228          The Bioperl Toolkit: Perl modules for the life sciences.
 229          Genome Research. 2002 Oct;12(10):1161-8.
 230
 231
 232    Q1.9: What are the license terms for Bioperl?
 233
 234       A: Bioperl is licensed under the same terms as Perl itself which is
 235          the Perl Artistic License. You can see more information on that
 236          license at http://www.perl.com/pub/a/language/misc/Artistic.html
 237          and http://www.opensource.org/licenses/artistic-license.html.
 238
 239
 240   Q1.10: I want to help, where do I start?
 241
 242       A: Bioperl is a pretty diverse collection of modules which has grown
 243          from the direct needs of the developers participating in the
 244          project.  So if you don't have a need for a specific module in the
 245          toolkit it becomes hard to just describe ways it needs to be
  246          expanded or adapted.  One area, however, is the development of
  247          stand-alone scripts which use bioperl components for common tasks.
  248          Some starting points for scripts: find out what people in your
  249          institution do routinely that a shortcut can be developed for.
  250          Identify modules in bioperl that need easy interfaces and write
 251          that wrapper - you'll learn how to use the module inside and out.
 252          We always need people to help fix bugs - read the jitterbug bug
 253          tracking system (webpage linked from bioperl website sidebar
 254          under "Bugs").
 255
 256
 257   Q1.11: I've got an idea for a module how do I contribute it?
 258
 259       A: We suggest the following.  Post your idea to the bioperl list,
 260          bioperl-l@bioperl.org. If it is a really new idea consider taking
 261          us through your thought process.  We'll help you tease out the
 262          necessary information such as what methods you'll want and how it
 263          can interact with other bioperl modules.  If it is a port of
 264          something you've already worked on, give us a summary of the
  265          current methods.  Make sure there is an interface to the module,
  266          not just an implementation (see biodesign.pod for more info),
  267          and make sure there will be a set of tests in the t/ directory
  268          to ensure that your module is tested.
 269
 270
 271 ---------------------------------------------------------------------------
 272
 273 2. Sequences
 274
 275 ---------------------------------------------------------------------------
 276
 277
 278
 279    Q2.1: How do I parse a sequence file?
 280
 281       A: Use the Bio::SeqIO system.  This will create Bio::Seq objects for
 282          you.  See the tutorial bptutorial.pl for more information or the
  283          documentation for Bio::SeqIO (e.g. 'perldoc Bio::SeqIO').
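
               A minimal sketch of the Bio::SeqIO loop described above. It
               writes a small made-up FASTA file first so that it can run on
               its own. The helper name summarize_file and the example records
               are illustrative assumptions, not part of Bioperl.

```perl
use strict;
use warnings;
use Bio::SeqIO;
use File::Temp qw(tempfile);

# Write a two-record FASTA file so the example is self-contained.
# The record names and sequences are made up for illustration.
my ($fh, $path) = tempfile(SUFFIX => '.fa', UNLINK => 1);
print {$fh} ">seq1 first example\nATGCATGC\n",
            ">seq2 second example\nGGGTTTAAACCC\n";
close $fh;

# Return [display_id, length] for every sequence in a file.
sub summarize_file {
    my ($file, $format) = @_;
    my $in = Bio::SeqIO->new(-file => $file, -format => $format);
    my @summary;
    while (my $seq = $in->next_seq) {    # each $seq is a Bio::Seq object
        push @summary, [ $seq->display_id, $seq->length ];
    }
    return @summary;
}

my @summary = summarize_file($path, 'fasta');
printf "%s\t%d\n", @$_ for @summary;
```

               Changing the -format argument (for example to 'genbank' or
               'embl') should be the only edit needed to read other formats.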
 284
 285
 286    Q2.2: I can't get sequences with Bio::DB::GenBank any more, why not?
 287
  288       A: NCBI changed the web CGI script that provided this access.  If
  289          you see this problem you must be using bioperl <= 0.7.2.  The
  290          developer release 0.9.3 contains the fix, as does release 1.0.
 291
 292
 293    Q2.3: How can I get NT_ or NM_ or NP_ accessions from NCBI
 294          (Reference Sequences)?
 295
  296       A: Use Bio::DB::RefSeq, not Bio::DB::GenBank or Bio::DB::GenPept,
  297          when you are retrieving these accessions. This is still an area
  298          of active development because the data providers have not
  299          provided the best interface for us to query.  EBI has provided a
  300          mirror with their dbfetch system which is accessible through the
  301          Bio::DB::RefSeq object.  However, there are cases where NT_
  302          accessions will not be retrievable.
 303
 304
 305    Q2.4: How can I use SeqIO to parse sequence data from a string?
 306
  307       A: Wrap the string in an IO::String filehandle and pass it as -fh:
  308            use IO::String;
  309            use Bio::SeqIO;
  310            my $stringfh = IO::String->new($string);
  311
  312            my $seqio = Bio::SeqIO->new(-fh     => $stringfh,
  313                                        -format => 'fasta');
  314            while( my $seq = $seqio->next_seq ) { # process each seq
  315            }
 316
 317
 318 ---------------------------------------------------------------------------
 319
 320 3. Report parsing
 321
 322 ---------------------------------------------------------------------------
 323
 324
 325
 326    Q3.1: I want to parse BLAST, how do I do this?
 327
 328       A: Well you might notice that there are a lot of choices.  Sorry
 329          about that.  We've been evolving towards a single solution.
 330
  331          Currently the best way to parse a report is to use the SearchIO
  332          system.  This supports blast and fasta report parsing.  The
  333          bptutorial provides an example of how to use this system, as does
  334          the documentation for the Bio::SearchIO module.
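
               A rough sketch of a SearchIO loop, assuming a report saved on
               disk. The file name 'report.bls' and the helper name print_hsps
               are placeholders. The -format value selects the parser: 'blast'
               for text BLAST reports, or 'fasta' or 'blastxml' for the other
               report types, with the same loop.

```perl
use strict;
use warnings;
use Bio::SearchIO;

# Print one line per HSP: query name, hit name, e-value, percent identity.
# Returns the number of HSPs seen.
sub print_hsps {
    my ($file, $format) = @_;
    my $in = Bio::SearchIO->new(-file => $file, -format => $format);
    my $count = 0;
    while (my $result = $in->next_result) {      # one result per query
        while (my $hit = $result->next_hit) {    # one hit per database entry
            while (my $hsp = $hit->next_hsp) {   # one HSP per aligned region
                printf "%s\t%s\t%s\t%.1f\n",
                    $result->query_name, $hit->name,
                    $hsp->evalue, $hsp->percent_identity;
                $count++;
            }
        }
    }
    return $count;
}

# 'report.bls' is a placeholder; pass a real report and format as arguments.
my $file   = shift(@ARGV) || 'report.bls';
my $format = shift(@ARGV) || 'blast';
print_hsps($file, $format) if -e $file;
```

               The nesting mirrors the report itself: results contain hits,
               and hits contain HSPs.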
 335
 336
 337    Q3.2: What's wrong with Bio::Tools::Blast?
 338
  339       A: Nothing is really wrong with it; it has just been outgrown by a
 340          more generic approach to reports.  This generic approach allows us
 341          to just write pluggable modules for fasta and Blast parsing while
 342          using the same framework.  This is completely analogous to the
 343          Bio::SeqIO system of parsing sequence files.  However, the objects
 344          produced are of the Bio::Search rather than Bio::Seq variety.
 345
 346
 347    Q3.3: I want to parse FastA or NCBI -m7 (XML) format, how do I do this?
 348
 349       A: It is as simple as parsing text BLAST results - you simply need to
 350          specify the format as "fasta" or "blastxml" and the parser will
 351          load the appropriate module for you.  You can use the exact logic
 352          and code for all of these formats as we have generalized the
 353          modules for sequence database searching.
 354
 355
 356    Q3.4: Let's say I want to do pairwise alignments of 2 sequences how can
 357          I do this?
 358
 359       A: Look at Bio::Factory::EMBOSS to see how to use the 'water' and
 360          'needle' alignment programs that are part of the EMBOSS suite.
 361
  362          Additionally you can use the pSW module that is part of the
  363          bioperl-ext package (distributed separately at
  364          ftp://bioperl.org/pub/DIST). However, note that this only does
  365          protein alignments and is no longer a supported module.  Instead
  366          the EMBOSS implementation is the best path ahead unless someone
  367          else wants to provide an Inline::C implementation.
 368
 369
 370    Q3.5: I'm using BPLite.pm and its frame() to parse Blast but I'm seeing
 371          0, 1, or 2 instead of the expected -3, -2, -1, +1, +2, +3. Why am
 372          I seeing these different numbers and how do I get the frame
 373          according to Blast?
 374
 375       A: These are GFF frames - so +1 is 0 in GFF, -3 will be encoded with
 376          a frame of 2 with the strand being set to -1 (for more on GFF see
 377          http://www.sanger.ac.uk/Software/formats/GFF/GFF_Spec.shtml).
 378
  379          Frames are relative to the hit or query sequence so you need to
  380          query it based on the sequence you are interested in:
 381
 382          $hsp->hit->strand();
 383          $hsp->hit->frame();
 384
 385          or
 386
 387          $hsp->query->strand();
 388          $hsp->query->frame();
 389
 390          So the value according to a blast report of -3 can be constructed
 391          as:
 392
 393          my $blastvalue = ($hsp->query->frame + 1) * $hsp->query->strand;
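
               The arithmetic can be checked without a report. The sketch
               below uses plain Perl and assumes the strand is always 1 or -1;
               the sub names blast_frame and gff_frame are made up for this
               example.

```perl
use strict;
use warnings;

# Convert a GFF-style frame (0, 1, 2) plus a strand (1 or -1) into the
# signed frame printed in a BLAST report (-3..-1, +1..+3).
sub blast_frame {
    my ($gff_frame, $strand) = @_;
    return ($gff_frame + 1) * $strand;
}

# The inverse: split a BLAST frame back into (GFF frame, strand).
sub gff_frame {
    my ($blast_frame) = @_;
    my $strand = $blast_frame < 0 ? -1 : 1;
    return (abs($blast_frame) - 1, $strand);
}

# Print the full mapping for both strands.
for my $strand (1, -1) {
    for my $frame (0 .. 2) {
        printf "GFF frame %d, strand %2d => BLAST frame %+d\n",
            $frame, $strand, blast_frame($frame, $strand);
    }
}
```

               For instance, GFF frame 2 on strand -1 gives (2 + 1) * -1 = -3.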
 394
 395
 396 ---------------------------------------------------------------------------
 397
 398 4. Utilities
 399
 400 ---------------------------------------------------------------------------
 401
 402
 403
 404    Q4.1: How do I find all the ORFs in a nucleotide sequence? Antigenic
 405          sites in a protein? Calculate nucleotide melting temperature? Find
 406          repeats?
 407
 408       A: In fact, none of these functions are built into Bioperl but they
 409          are all available in the EMBOSS package (http://www.emboss.org/),
 410          as well as many others. The Bioperl developers created a simple
 411          interface to EMBOSS such that any and all EMBOSS programs can be
 412          run from within Bioperl. See Bio::Factory::EMBOSS for more
 413          information.
 414
 415          If you can't find the functionality you want in Bioperl then make
  416          sure to look for it in EMBOSS; these packages integrate quite
 417          gracefully with Bioperl. Of course, you will have to install
 418          EMBOSS to get this access.
 419
 420          In addition, Bioperl after version 1.0.1 contains the Pise/Bioperl
 421          modules. The Pise package
 422          (http://www-alt.pasteur.fr/~letondal/Pise) was designed to provide
 423          a uniform interface to bioinformatics applications, and currently
  424          provides wrappers to more than 250 such applications! Included
 425          amongst these wrapped apps are HMMER, Phylip, BLAST, GENSCAN, even
 426          the EMBOSS suite. Use of the Pise/Bioperl modules does not require
 427          installation of the Pise package.
 428
 429
 430    Q4.2: How do I do motif searches with Bioperl? Can I do "find all
 431          sequences that are 75% identical" to a given motif?
 432
 433       A: There are a number of approaches. Within Bioperl take a look at
  434          Bio::Tools::SeqPattern. Or, take a look at the TFBS (Transcription
  435          Factor Binding Site) package at http://forkhead.cgb.ki.se/TFBS.
  436          This Bioperl-compliant package specializes in pattern searching of
  437          nucleotide sequence using matrices.
 438
 439          It's also conceivable that the combination of Bioperl and Perl's
 440          regular expressions could do the trick. You might also consider
 441          the CPAN module String::Approx (this module addresses the percent
 442          match query), but experienced users question whether its distance
  443          estimates are correct; the Unix agrep command is thought to be
 444          faster and more accurate.  Finally, you could use EMBOSS, as
 445          discussed in the previous question (or you could use Pise to run
 446          EMBOSS applications). The relevant programs would be fuzzpro or
 447          fuzznuc.
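
               For the simplest form of the "75% identical" question, a plain
               Perl sliding window may be enough. The sketch below is not a
               Bioperl API: fuzzy_matches is a hypothetical helper, the
               sequence is made up, and only ungapped matches are considered
               (no insertions or deletions, unlike String::Approx or agrep).

```perl
use strict;
use warnings;

# Slide $motif along $seq and return [start, identity, window] for every
# window whose fraction of identical positions is at least $min_identity.
# Start positions are 1-based, as in Bioperl coordinates.
sub fuzzy_matches {
    my ($seq, $motif, $min_identity) = @_;
    my $len = length $motif;
    my @hits;
    for my $offset (0 .. length($seq) - $len) {
        my $window = substr($seq, $offset, $len);
        # XOR of two equal-length strings gives "\0" wherever they agree,
        # so counting "\0" bytes counts the identical positions.
        my $identical = (uc($window) ^ uc($motif)) =~ tr/\0//;
        my $identity  = $identical / $len;
        push @hits, [ $offset + 1, $identity, $window ]
            if $identity >= $min_identity;
    }
    return @hits;
}

my @hits = fuzzy_matches('GGTATAAAGGCCTATTAAGG', 'TATAAA', 0.75);
printf "%d\t%.2f\t%s\n", @$_ for @hits;
```

               With a Bio::Seq object, the string to scan would come from
               $seq->seq.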
 448
 449
 450    Q4.3: Can I query MEDLINE or other bibliographic repositories using
 451          Bioperl?
 452
 453       A: Yes! The solution lies in Bio::Biblio*, a set of modules that
 454          provide access to MEDLINE and OpenBQS-compliant servers using
 455          SOAP. See Bio/Biblio.pm or examples/biblio.pl for details and
 456          example code.
 457
 458
 459 ---------------------------------------------------------------------------
 460
 461 5. Annotations and Features
 462
 463 ---------------------------------------------------------------------------
 464
 465
 466
 467    Q5.1: I get the warning "(old style Annotation) on new style
 468          Annotation::Collection".  What is wrong?
 469
  470       A: This is because we have transitioned from the
  471          add_Comment/each_Comment, add_Reference/each_Reference style to
  472          add_Annotation('comment', $ann)/get_Annotations('comment'). Please
  473          update your code in order to avoid seeing these warning messages.
  474
  475          The underlying change (starting with 1.0) is that the annotation
  476          collection implementation moved from the Bio::Annotation object to
  477          the Bio::Annotation::Collection object, which is a more general
  478          and extensible system.  In the future the Reference objects will
  479          likely be implemented by the Bio::Biblio system but we hope to
  480          maintain a compatible API for these.
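
               A small sketch of the new-style calls, using a stand-alone
               Bio::Annotation::Collection and two made-up comments. In real
               code the collection would normally come from a sequence object
               via $seq->annotation.

```perl
use strict;
use warnings;
use Bio::Annotation::Collection;
use Bio::Annotation::Comment;

my $collection = Bio::Annotation::Collection->new;

# New style: every annotation is stored under a string key.
$collection->add_Annotation('comment',
    Bio::Annotation::Comment->new(-text => 'first example comment'));
$collection->add_Annotation('comment',
    Bio::Annotation::Comment->new(-text => 'second example comment'));

# Retrieve by key; this takes the place of the old each_Comment call.
my @texts = map { $_->text } $collection->get_Annotations('comment');
print "$_\n" for @texts;
```

               References work the same way under the 'reference' key.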
 481
 482
 483    Q5.2: How do I retrieve all the features from a Sequence?  How about all
 484          the features which are exons or have a /note field that contains a
 485          certain gene name?
 486
 487       A: To get all the features:
 488
 489          my @features = $seq->all_SeqFeatures();
 490
  491          To get all the features, filtering on only those which have the
  492          primary tag 'exon':
  493
  494          my @exons = grep { $_->primary_tag eq 'exon' }
  495                        $seq->all_SeqFeatures();
  496
  497          To get all the features, filtering on those which have the tag
  498          'note' and within the note field contain the requested string
  499          $noteval:
  500
  501          my @f_with_note = grep {
  502                                     my @a = $_->has_tag('note') ?
  503                                     $_->each_tag_value('note') : ();
  504                                     grep { /\Q$noteval\E/ } @a;
  505                                   }  $seq->all_SeqFeatures();
 506
 507
 508    Q5.3: How do I parse the CDS join() statements in Genbank or EMBL files
 509          so I can reconstruct the CDS sequence?
 510
 511       A: You need to use a Location::SplitLocationI object to get the
 512          coordinates and primary_tag() to find the CDS features:
 513
 514          if ( $feature->location->isa('Bio::Location::SplitLocationI')
 515                           && $feature->primary_tag eq 'CDS' )  {
  516            foreach my $location ( $feature->location->sub_Location ) {
 517              print $location->start . ".." . $location->end . "\n";
 518            }
 519          }
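
               To go from the coordinates to the CDS sequence itself, the
               sub-locations can be concatenated. The sketch below builds a
               made-up 18 bp sequence with a three-part forward-strand CDS so
               that it runs on its own. Option 2 assumes a Bioperl version
               that provides spliced_seq() on features; that method may not
               exist in older releases.

```perl
use strict;
use warnings;
use Bio::Seq;
use Bio::SeqFeature::Generic;
use Bio::Location::Simple;
use Bio::Location::Split;

# A made-up sequence with a CDS at join(1..3,10..12,16..18).
my $seq = Bio::Seq->new(-id => 'example', -seq => 'ATGAAACCCGGGTTTTAA');

my $location = Bio::Location::Split->new;
$location->add_sub_Location(
    Bio::Location::Simple->new(-start  => $_->[0],
                               -end    => $_->[1],
                               -strand => 1)
) for [1, 3], [10, 12], [16, 18];

my $cds = Bio::SeqFeature::Generic->new(-primary_tag => 'CDS',
                                        -location    => $location);
$seq->add_SeqFeature($cds);    # also attaches $seq to the feature

# Option 1: concatenate the sub-locations by hand (forward strand only).
my $by_hand = join '',
    map { $seq->subseq($_->start, $_->end) } $cds->location->sub_Location;

# Option 2: let the feature do the work; spliced_seq returns a sequence
# object, so call ->seq to get the string.
my $spliced = $cds->spliced_seq->seq;

print "$by_hand\n$spliced\n";
```

               Option 1 ignores strand, so a reverse-strand CDS would need
               its pieces reverse-complemented and joined in reverse order.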
 520
 521
 522 ---------------------------------------------------------------------------
 523
 524 6. Running external programs
 525
 526 ---------------------------------------------------------------------------
 527
 528
 529
 530    Q6.1: I want to run StandAloneBlast within bioperl - where did it go?
 531
 532       A: The Bio::Tools::Run directory was moved to a new package to help
 533          make the size of the core code smaller and separate out the more
 534          specialized nature of application running from the rest of
 535          Bioperl.  You can get these modules by installing the bioperl-run
 536          package.  This is either available from CVS under the same name or
 537          available in the http://bioperl.org/DIST directory and on CPAN.
 538          This changeover began in the bioperl 1.1 developer release.
 539
 540
  541    Q6.2: What does the future hold for running applications within Bioperl?

  542       A: We are trying to build a standard starting point for analysis
  543          applications which will probably look like
  544          Bio::Tools::Run::AnalysisFactory, which will allow the user to
  545          request which type of remote or local server they want to use to
  546          run their analyses.  This will connect to the Pasteur Institute's
  547          Pise server, the EBI's Novella server, as well as be aware of
  548          wrappers to run applications locally.
  549
  550          Additionally we suggest investigating the BioPipe project
  551          (http://www.biopipe.org) which is making it easier to chain
  552          together various sets of analyses and build rules for performing
  553          these computes.
 554
 555 ---------------------------------------------------------------------------
 556 Copyright (c)2002 Open Bioinformatics Foundation. You may distribute this
 557 FAQ under the same terms as perl itself.
 558