This FAQ is maintained by:
7 * Jason Stajich <jason@bioperl.org>
8 * Brian Osborne <brian_osborne@cognia.com>
9 * Heikki Lehvaslaiho <heikki@ebi.ac.uk>
12 ---------------------------------------------------------------------------
16 ---------------------------------------------------------------------------
20 Q0.1: What is this FAQ?
21 Q0.2: How is it maintained?
25 Q1.1: What is Bioperl?
26 Q1.2: Where do I go to get the latest release?
27 Q1.3: What is the difference between 0.9.x and 0.7.x? What do you mean
29 Q1.4: Is it BioPerl, bioperl, bio.perl.org, Bioperl? What's the deal?
30 Q1.5: How do I figure out how to use a module?
31 Q1.6: I'm interested in the bleeding edge version of the code, where can
33 Q1.7: Who uses this toolkit?
34 Q1.8: How should I cite Bioperl?
35 Q1.9: What are the license terms for Bioperl?
36 Q1.10: I want to help, where do I start?
37 Q1.11: I've got an idea for a module how do I contribute it?
41 Q2.1: How do I parse a sequence file?
42 Q2.2: I can't get sequences with Bio::DB::GenBank any more, why not?
43 Q2.3: How can I get NT_ or NM_ or NP_ accessions from NCBI
44 (Reference Sequences)?
45 Q2.4: How can I use SeqIO to parse sequence data from a string?
49 Q3.1: I want to parse BLAST, how do I do this?
50 Q3.2: What's wrong with Bio::Tools::Blast?
51 Q3.3: I want to parse FastA or NCBI -m7 (XML) format, how do I do this?
52 Q3.4: Let's say I want to do pairwise alignments of 2 sequences how can
54 Q3.5: I'm using BPLite.pm and its frame() to parse Blast but I'm seeing
55 0, 1, or 2 instead of the expected -3, -2, -1, +1, +2, +3. Why am
56 I seeing these different numbers and how do I get the frame
61 Q4.1: How do I find all the ORFs in a nucleotide sequence? Antigenic
62 sites in a protein? Calculate nucleotide melting temperature? Find
64 Q4.2: How do I do motif searches with Bioperl? Can I do "find all
65 sequences that are 75% identical" to a given motif?
66 Q4.3: Can I query MEDLINE or other bibliographic repositories using
69 5. Annotations and Features
71 Q5.1: I get the warning "(old style Annotation) on new style
72 Annotation::Collection". What is wrong?
73 Q5.2: How do I retrieve all the features from a Sequence? How about all
74 the features which are exons or have a /note field that contains a
76 Q5.3: How do I parse the CDS join() statements in Genbank or EMBL files
77 so I can reconstruct the CDS sequence?
79 6. Running external programs
81 Q6.1: How do I run Blast from within Bioperl?
82 Q6.2: Hey, I want to run clustalw within Bioperl, I used
83 Bio::Tools::Run::Alignment::Clustalw before - where did it go?
84 Q6.3: What does the future hold for running applications within Bioperl?
86 ---------------------------------------------------------------------------
90 ---------------------------------------------------------------------------
94 Q0.1: What is this FAQ?
96 A: It is the list of Frequently Asked Questions about Bioperl.
99 Q0.2: How is it maintained?
101 A: This FAQ was generated using a Perl script and an XML file. All
102 the files are in the Bioperl distribution directory doc/faq. So do
103 not edit this file! Edit file faq.xml and run:
105 % faq.pl -text faq.xml
107 The XML structure was originally used by the Perl XML project.
108 Their website seems to have vanished, though. The XML and
109 modifying scripts were copied from Michael Rodriguez's web site
110 http://www.xmltwig.com/xmltwig/XML-Twig-FAQ.html and modified to
114 ---------------------------------------------------------------------------
116 1. Bioperl in general
118 ---------------------------------------------------------------------------
122 Q1.1: What is Bioperl?
A: Bioperl is a toolkit of Perl modules useful in building
bioinformatics solutions in Perl. It is built in an
object-oriented manner, so many modules depend on each other
to achieve a task. The collection of modules in the bioperl-live
repository constitutes the core functionality of Bioperl.
Additionally, auxiliary modules for creating graphical interfaces
(bioperl-gui), persistent storage in an RDBMS (bioperl-db), and CORBA
bridges to the BioCORBA (http://www.biocorba.org) specification
(bioperl-corba-server and bioperl-corba-client) are all available
as CVS modules in our repository.
136 Q1.2: Where do I go to get the latest release?
138 A: You can always get our releases from ftp://bioperl.org/pub/DIST.
139 Official releases will be noted on the website http://bioperl.org.
142 Q1.3: What is the difference between 0.9.x and 0.7.x? What do you mean
A: The 0.7.x series (0.7.0, 0.7.2) were all released in 2001 and were
stable releases on the 0.7 branch. This means they had a set of
functionality that was maintained throughout (no experimental
modules) and were guaranteed to pass all tests; subsequent bug
fix releases with the 0.7 designation would not have any API
changes.

The 0.9.x series was our first attempt at releasing so-called
developer releases. These are snapshots of the actively developed
code that at a minimum pass all our tests.
156 But really, you should be using version 1.*!
159 Q1.4: Is it BioPerl, bioperl, bio.perl.org, Bioperl? What's the deal?
161 A: Well, the perl.org guys granted us use of bio.perl.org. We prefer
162 to be called Bioperl or BioPerl (unlike our Biopython friends).
163 We're part of the Open Bioinformatics Foundation (OBF) and so as
164 part of the Bio{*} toolkits we prefer the Bioperl spelling. But
165 we're not really all that picky so no worries.
168 Q1.5: How do I figure out how to use a module?
A: Read the embedded Perl documentation (Plain Old Documentation,
POD) that is part of every module. For example:

    % perldoc Bio::SeqIO

(careful - spelling and case count!).
177 The bioperl tutorial - bptutorial.pl - provided in the root
178 directory of the bioperl release will also provide a good
179 introduction. There are links to tutorials off the bioperl
180 website that may provide some additional help.
182 There are also many scripts in the examples/ and scripts/
183 directories that could be useful - see bioperl.pod for a brief
184 description of all of them.
186 Additionally we have written many tests for our modules, you can
187 see test data and example usage of the modules in these tests -
188 look in the test dir (called 't').
191 Q1.6: I'm interested in the bleeding edge version of the code, where can
194 A: Go to http://cvs.bioperl.org and you'll see instructions on how to
199 % cvs -d :pserver:cvs@cvs.bioperl.org:/home/repository/bioperl
202 enter 'cvs' for the password
205 % cvs -d :pserver:cvs@cvs.bioperl.org:/home/repository/bioperl
209 Q1.7: Who uses this toolkit?
211 A: Lots of people. Sanger Centre, EBI, many large and small academic
212 laboratories, large and small pharmaceutical companies. All the
213 developers on the bioperl list use the toolkit in some capacity on
216 The Genquire annotation system
217 (http://www.bioinformatics.org/Genquire/) and Ensembl
218 (http://www.ensembl.org/) use bioperl as the basis for their
222 Q1.8: How should I cite Bioperl?
A: Please cite it as:
225 Stajich JE, Block D, Boulez K, Brenner SE, Chervitz SA,
226 Dagdigian C, Fuellen G, Gilbert JGR, Korf I, Lapp H,
227 Lehvaslaiho H, Matsalla C, Mungall CJ, Osborne BI,
228 Pocock MR, Schattner P, Senger M, Stein LD, Stupka ED,
229 Wilkinson M, Birney E.
230 The Bioperl Toolkit: Perl modules for the life sciences.
231 Genome Research. 2002 Oct;12(10):1161-8.
234 Q1.9: What are the license terms for Bioperl?
236 A: Bioperl is licensed under the same terms as Perl itself which is
237 the Perl Artistic License. You can see more information on that
238 license at http://www.perl.com/pub/a/language/misc/Artistic.html
239 and http://www.opensource.org/licenses/artistic-license.html.
242 Q1.10: I want to help, where do I start?
244 A: Bioperl is a pretty diverse collection of modules which has grown
245 from the direct needs of the developers participating in the
project. So if you don't have a need for a specific module in the
toolkit, it becomes hard to describe ways it needs to be
expanded or adapted. One area, however, is the development of
stand-alone scripts which use Bioperl components for common tasks.
Some starting points for scripts: find out what people in your
institution do routinely that a shortcut could be developed for.
Identify modules in Bioperl that need easy interfaces and write
that wrapper - you'll learn how to use the module inside and out.
254 We always need people to help fix bugs - read the jitterbug bug
255 tracking system (webpage linked from bioperl website sidebar
259 Q1.11: I've got an idea for a module how do I contribute it?
261 A: We suggest the following. Post your idea to the bioperl list,
262 bioperl-l@bioperl.org. If it is a really new idea consider taking
263 us through your thought process. We'll help you tease out the
264 necessary information such as what methods you'll want and how it
265 can interact with other bioperl modules. If it is a port of
266 something you've already worked on, give us a summary of the
267 current methods. Make sure there is an interface to the module,
268 not just an implementation (see the biodesign.pod for more info)
and make sure there will be a set of tests in the t/
directory to ensure that your module is tested.
273 ---------------------------------------------------------------------------
277 ---------------------------------------------------------------------------
281 Q2.1: How do I parse a sequence file?
A: Use the Bio::SeqIO system. This will create Bio::Seq objects for
you. See the tutorial bptutorial.pl for more information or the
documentation for Bio::SeqIO (e.g. 'perldoc Bio::SeqIO').
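As a minimal sketch of the Bio::SeqIO approach, assuming Bioperl is installed: the FASTA records below are made up for illustration and are read through an in-memory filehandle so that no input file is needed. With a real file you would pass -file instead of -fh.

```perl
use strict;
use warnings;
use Bio::SeqIO;

# Hypothetical FASTA data, made up for this example.
my $fasta = ">seq1 first example\nATGCATGC\n>seq2 second example\nGGGTTTAAACCC\n";

# Open the string as an in-memory filehandle; with a real file you
# would pass -file => 'yourfile.fasta' instead of -fh.
open my $fh, '<', \$fasta or die "Cannot open in-memory handle: $!";

my $in = Bio::SeqIO->new(-fh => $fh, -format => 'fasta');

my @summary;
while (my $seq = $in->next_seq) {
    # Each record comes back as a sequence object.
    push @summary, join("\t", $seq->display_id, $seq->length);
}
print "$_\n" for @summary;
```

Changing the -format value (for example to 'genbank' or 'embl') is all that should be needed to read other supported formats with the same loop.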
288 Q2.2: I can't get sequences with Bio::DB::GenBank any more, why not?
A: NCBI changed the web CGI script that provided this access. If you
see this problem you are probably using Bioperl 0.7.2 or earlier.
The fix is in the 0.9.3 developer release and in the 1.0 release.
295 Q2.3: How can I get NT_ or NM_ or NP_ accessions from NCBI
296 (Reference Sequences)?
A: Use Bio::DB::RefSeq, not Bio::DB::GenBank or Bio::DB::GenPept, when
you are retrieving these accessions. This is still an area of
active development because the data providers have not provided
the best interface for us to query. EBI has provided a mirror
with their dbfetch system, which is accessible through the
Bio::DB::RefSeq object; however, there are cases where NT_
accessions will not be retrievable.
307 Q2.4: How can I use SeqIO to parse sequence data from a string?
    use IO::String;
    use Bio::SeqIO;
    my $stringfh = IO::String->new($string);
    my $seqio = Bio::SeqIO->new(-fh     => $stringfh,
                                -format => 'fasta'); # assumed; match your data
    while( my $seq = $seqio->next_seq ) { } # process each $seq in the braces
320 ---------------------------------------------------------------------------
324 ---------------------------------------------------------------------------
328 Q3.1: I want to parse BLAST, how do I do this?
A: You might notice that there are a lot of choices. Sorry
about that. We've been evolving towards a single solution.

Currently the best way to parse a report is to use the SearchIO
system, which supports BLAST and FASTA report parsing. The
bptutorial provides an example of how to use this system, as does
the documentation for Bio::SearchIO.
339 Q3.2: What's wrong with Bio::Tools::Blast?
341 A: Nothing is really wrong with it, it has just been outgrown by a
342 more generic approach to reports. This generic approach allows us
343 to just write pluggable modules for fasta and Blast parsing while
344 using the same framework. This is completely analogous to the
345 Bio::SeqIO system of parsing sequence files. However, the objects
346 produced are of the Bio::Search rather than Bio::Seq variety.
349 Q3.3: I want to parse FastA or NCBI -m7 (XML) format, how do I do this?
A: It is as simple as parsing text BLAST results - you simply need to
specify the format as "fasta" or "blastxml" and the parser will
load the appropriate module for you. You can use exactly the same
logic and code for all of these formats, as we have generalized the
modules for sequence database searching.
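The following sketch illustrates how the same SearchIO loop can serve every report format, assuming Bioperl is installed. The helper name summarize_report and the script name in the usage comment are hypothetical; only the format string changes between BLAST text, FASTA and BLAST XML reports.

```perl
use strict;
use warnings;
use Bio::SearchIO;

# summarize_report is a hypothetical helper written for this example.
# $format is a SearchIO format name such as 'blast', 'fasta' or
# 'blastxml'; the parsing loop is the same for all of them.
sub summarize_report {
    my ($file, $format) = @_;
    my $in = Bio::SearchIO->new(-file => $file, -format => $format);
    my @rows;
    while (my $result = $in->next_result) {       # one per query
        while (my $hit = $result->next_hit) {     # one per database hit
            while (my $hsp = $hit->next_hsp) {    # one per alignment
                push @rows, join("\t",
                    $result->query_name, $hit->name,
                    $hsp->evalue, $hsp->percent_identity);
            }
        }
    }
    return @rows;
}

# Usage (hypothetical script name): perl summarize.pl report.txt blast
if (@ARGV == 2) {
    print "$_\n" for summarize_report(@ARGV);
}
```

Keeping the format as a parameter means a script written against text BLAST output should need no other change to read the XML or FASTA variants.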
358 Q3.4: Let's say I want to do pairwise alignments of 2 sequences how can
361 A: Look at Bio::Factory::EMBOSS to see how to use the 'water' and
362 'needle' alignment programs that are part of the EMBOSS suite.
Additionally you can use the pSW module that is part of the
bioperl-ext package (distributed separately at
ftp://bioperl.org/pub/DIST). Note, however, that this only does
protein alignments and is no longer a supported module. The
EMBOSS implementation is the best path ahead unless someone
wants to provide an Inline::C implementation.
372 Q3.5: I'm using BPLite.pm and its frame() to parse Blast but I'm seeing
373 0, 1, or 2 instead of the expected -3, -2, -1, +1, +2, +3. Why am
374 I seeing these different numbers and how do I get the frame
A: These are GFF frames: +1 is 0 in GFF, and -3 is encoded as
a frame of 2 with the strand set to -1 (for more on GFF see
http://www.sanger.ac.uk/Software/formats/GFF/GFF_Spec.shtml).

Frames are relative to the hit or query sequence, so you need to
query it based on the sequence you are interested in:
389 $hsp->query->strand();
390 $hsp->query->frame();
392 So the value according to a blast report of -3 can be constructed
395 my $blastvalue = ($hsp->query->frame + 1) * $hsp->query->strand;
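The arithmetic can be checked without any Bioperl objects. This is a small sketch in plain Perl; the two helper names are hypothetical and simply wrap the formula given in the answer and its inverse.

```perl
use strict;
use warnings;

# Convert a GFF-style frame (0, 1, 2) plus strand (1 or -1) into the
# signed frame printed in a BLAST report (+1..+3, -1..-3).
sub blast_frame {
    my ($gff_frame, $strand) = @_;
    return ($gff_frame + 1) * $strand;
}

# Inverse: split a BLAST frame back into (GFF frame, strand).
sub gff_frame_and_strand {
    my ($blast_frame) = @_;
    my $strand = $blast_frame < 0 ? -1 : 1;
    return (abs($blast_frame) - 1, $strand);
}

printf "GFF frame %d, strand %2d -> BLAST frame %+d\n", $_->[0], $_->[1], blast_frame(@$_)
    for ([0, 1], [2, -1], [1, -1]);
```

With an HSP object in hand, the inputs would come from $hsp->query->frame and $hsp->query->strand (or the corresponding hit methods).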
398 ---------------------------------------------------------------------------
402 ---------------------------------------------------------------------------
406 Q4.1: How do I find all the ORFs in a nucleotide sequence? Antigenic
407 sites in a protein? Calculate nucleotide melting temperature? Find
410 A: In fact, none of these functions are built into Bioperl but they
411 are all available in the EMBOSS package (http://www.emboss.org/),
412 as well as many others. The Bioperl developers created a simple
413 interface to EMBOSS such that any and all EMBOSS programs can be
414 run from within Bioperl. See Bio::Factory::EMBOSS for more
If you can't find the functionality you want in Bioperl then make
sure to look for it in EMBOSS; the two packages integrate quite
gracefully. Of course, you will have to install
EMBOSS to get this access.
In addition, Bioperl after version 1.0.1 contains the Pise/Bioperl
modules. The Pise package
(http://www-alt.pasteur.fr/~letondal/Pise) was designed to provide
a uniform interface to bioinformatics applications, and currently
provides wrappers to more than 250 such applications. Included
among these wrapped applications are HMMER, Phylip, BLAST, GENSCAN,
and even the EMBOSS suite. Use of the Pise/Bioperl modules does not
require installation of the Pise package.
432 Q4.2: How do I do motif searches with Bioperl? Can I do "find all
433 sequences that are 75% identical" to a given motif?
A: There are a number of approaches. Within Bioperl take a look at
Bio::Tools::SeqPattern. Or take a look at the TFBS (Transcription
Factor Binding Site) package, at http://forkhead.cgb.ki.se/TFBS.
This Bioperl-compliant package specializes in pattern searching of
nucleotide sequences using matrices.
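For the "75% identical" style of question, one plain-Perl possibility is a sliding-window comparison. This is only a sketch, not a Bioperl API: the helper find_approx_matches and the example sequence are made up, only substitutions are scored (no gaps), and the sequence string is assumed to come from something like $seq->seq.

```perl
use strict;
use warnings;

# find_approx_matches is a hypothetical helper written for this example.
# It slides the motif along the sequence and reports every window whose
# fraction of identical positions is at least $min_identity (0..1).
# Only substitutions are considered; insertions and deletions are not.
sub find_approx_matches {
    my ($seq, $motif, $min_identity) = @_;
    ($seq, $motif) = (uc $seq, uc $motif);
    my $len = length $motif;
    my @hits;
    for my $start (0 .. length($seq) - $len) {
        my $window = substr($seq, $start, $len);
        # XOR of two equal-length strings gives "\0" wherever they agree.
        my $identical = (($window ^ $motif) =~ tr/\0//);
        my $identity  = $identical / $len;
        push @hits, { start => $start + 1, match => $window, identity => $identity }
            if $identity >= $min_identity;
    }
    return @hits;
}

# Made-up sequence containing one exact and one near match to TATAAA.
my @hits = find_approx_matches('GGTATAAAGGCCTATCAAGG', 'TATAAA', 0.75);
printf "%d\t%s\t%.0f%%\n", $_->{start}, $_->{match}, 100 * $_->{identity} for @hits;
```

Reported start positions are 1-based, to match Bioperl's sequence coordinates. For motifs with ambiguity codes or gaps, the dedicated tools mentioned in this answer are likely a better fit.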
It's also conceivable that the combination of Bioperl and Perl's
regular expressions could do the trick. You might also consider
the CPAN module String::Approx (this module addresses the percent
match query), but experienced users question whether its distance
estimates are correct; the Unix agrep command is thought to be
faster and more accurate. Finally, you could use EMBOSS, as
discussed in the previous question (or you could use Pise to run
EMBOSS applications). The relevant programs would be fuzzpro or
452 Q4.3: Can I query MEDLINE or other bibliographic repositories using
455 A: Yes! The solution lies in Bio::Biblio*, a set of modules that
456 provide access to MEDLINE and OpenBQS-compliant servers using
457 SOAP. See Bio/Biblio.pm or examples/biblio.pl for details and
461 ---------------------------------------------------------------------------
463 5. Annotations and Features
465 ---------------------------------------------------------------------------
469 Q5.1: I get the warning "(old style Annotation) on new style
470 Annotation::Collection". What is wrong?
A: This is because we have transitioned from the
add_Comment/each_Comment, add_Reference/each_Reference style to
add_Annotation('comment', $ann)/get_Annotations('comment'). Please
update your code in order to avoid seeing these warning messages.

Starting with 1.0, we changed the annotation
collection implementation from the Bio::Annotation object to the
Bio::Annotation::Collection object, which is a more general and
extensible system. In the future the Reference objects will
likely be implemented by the Bio::Biblio system, but we hope to
maintain a compatible API for these.
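A short sketch of the new-style calls, assuming Bioperl 1.0 or later is installed. The comment text is made up, and the collection is built directly here; on a sequence object the same collection is reached through $seq->annotation.

```perl
use strict;
use warnings;
use Bio::Annotation::Collection;
use Bio::Annotation::Comment;

# Build a collection directly; on a sequence object the same
# collection is reached through $seq->annotation.
my $collection = Bio::Annotation::Collection->new;

# New style: add_Annotation('comment', $obj) replaces add_Comment($obj).
$collection->add_Annotation('comment',
    Bio::Annotation::Comment->new(-text => 'made-up comment for this example'));

# New style: get_Annotations('comment') replaces each_Comment.
my @comments = $collection->get_Annotations('comment');
print $_->text, "\n" for @comments;
```

References follow the same pattern with the 'reference' key and Bio::Annotation::Reference objects.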
485 Q5.2: How do I retrieve all the features from a Sequence? How about all
486 the features which are exons or have a /note field that contains a
489 A: To get all the features:
491 my @features = $seq->all_SeqFeatures();
To get only the features whose primary tag is 'exon':

    my @exons = grep { $_->primary_tag eq 'exon' }
                $seq->all_SeqFeatures();

To get only the features that have the tag 'note' and whose
note field contains the requested string ($noteval):

    my @f_with_note = grep {
        my @a = $_->has_tag('note') ?
                $_->each_tag_value('note') : ();
        grep { /$noteval/ } @a;
    } $seq->all_SeqFeatures();
510 Q5.3: How do I parse the CDS join() statements in Genbank or EMBL files
511 so I can reconstruct the CDS sequence?
A: You need to use a Bio::Location::SplitLocationI object to get the
coordinates and primary_tag() to find the CDS features:

    if ( $feature->location->isa('Bio::Location::SplitLocationI')
         && $feature->primary_tag eq 'CDS' ) {
        # print the coordinates of each segment of the join()
        foreach my $location ( $feature->location->sub_Location ) {
            print $location->start . ".." . $location->end . "\n";
        }
    }
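To go from the coordinates to the joined CDS sequence itself, one option is the feature's spliced_seq method. The sketch below assumes Bioperl is installed and builds a made-up 18 bp sequence with a CDS equivalent to join(1..6,13..18) in memory; with a parsed GenBank or EMBL entry the features would come from the file instead.

```perl
use strict;
use warnings;
use Bio::Seq;
use Bio::SeqFeature::Generic;
use Bio::Location::Split;
use Bio::Location::Simple;

# A made-up 18 bp sequence carrying a CDS equivalent to join(1..6,13..18).
my $seq = Bio::Seq->new(-id => 'example', -seq => 'ATGAAACCCGGGTTTTAG',
                        -alphabet => 'dna');

my $split = Bio::Location::Split->new;
$split->add_sub_Location(Bio::Location::Simple->new(-start => 1,  -end => 6,  -strand => 1));
$split->add_sub_Location(Bio::Location::Simple->new(-start => 13, -end => 18, -strand => 1));

my $cds = Bio::SeqFeature::Generic->new(-primary => 'CDS', -location => $split);
$seq->add_SeqFeature($cds);   # attaches the sequence to the feature

my @cds_strings;
for my $feature (grep { $_->primary_tag eq 'CDS' } $seq->all_SeqFeatures) {
    # spliced_seq joins the sub-locations and returns a sequence object.
    push @cds_strings, $feature->spliced_seq->seq;
}
print "$_\n" for @cds_strings;
```

The returned object is an ordinary sequence, so it can be passed on to translate or written out with Bio::SeqIO.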
524 ---------------------------------------------------------------------------
526 6. Running external programs
528 ---------------------------------------------------------------------------
532 Q6.1: How do I run Blast from within Bioperl?
A: Use the module Bio::Tools::Run::StandAloneBlast. It will give
you access to many of the search tools in the NCBI BLAST suite,
including blastall, bl2seq, and blastpgp. The basic structure is like
this:

    use Bio::PrimarySeq;
    use Bio::Tools::Run::StandAloneBlast;
    # 'nt' is an assumed database name; substitute your own
    my $factory = Bio::Tools::Run::StandAloneBlast->new(p => 'blastn',
                                                        d => 'nt');
    my $seq = Bio::PrimarySeq->new(-id  => 'test1',
                                   -seq => 'AGATCAGTAGATGATAGGGGTAGA');
    my $report = $factory->blastall($seq);
548 Q6.2: Hey, I want to run clustalw within Bioperl, I used
549 Bio::Tools::Run::Alignment::Clustalw before - where did it go?
551 A: The Bio::Tools::Run directory was moved to a new package to help
552 make the size of the core code smaller and separate out the more
553 specialized nature of application running from the rest of
554 Bioperl. You can get these modules by installing the bioperl-run
555 package. This is either available from CVS under the same name or
556 available in the http://bioperl.org/DIST directory and on CPAN.
557 This changeover began in the bioperl 1.1 developer release.
560 Q6.3: What does the future hold for running applications within Bioperl?
A: We are trying to build a standard starting point for analysis
applications, which will probably look like
Bio::Tools::Run::AnalysisFactory and will allow the user to
request which type of remote or local server they want to use to
run their analyses. This will connect to Pasteur's PISE
server, the EBI's Novella server, as well as be aware of wrappers
to run applications locally.

Additionally we suggest investigating the BioPipe
project (http://www.biopipe.org), which is making it easier to chain
together various sets of analyses and build rules for performing
574 ---------------------------------------------------------------------------
575 Copyright (c)2002-2003 Open Bioinformatics Foundation. You may distribute
576 this FAQ under the same terms as perl itself.