README

This is the README file for the BioPerl central distribution.

o Getting Started

 Please see the INSTALL or INSTALL.WIN documents for installation
 instructions.

o About BioPerl

 BioPerl is a package of public domain Perl tools for computational
 molecular biology.

 Our website, http://bioperl.org/, provides an online resource of
 modules, scripts, and web links for developers of Perl-based software
 for life science research.

o Contact info

 BioPerl mailing list: bioperl-l@bioperl.org

 There's quite a variety of tools available in BioPerl, and more are
 added all the time. If the tool you're looking for isn't described in
 the documentation, please write to us; it may be undocumented or
 still in progress.

 Project website : http://bioperl.org/

 Bug reports : https://redmine.open-bio.org/projects/bioperl/

     Please send us bug reports, in particular about documentation
     that you think is unclear or about problems with installation.
     We are also very interested in functions which don't work the
     way you expect them to!

o The directory structure

 The BioPerl directory structure is organized as follows:

   - Bio/ - BioPerl modules

   - doc/ - Documentation utilities

   - examples/ - Scripts demonstrating the many uses of BioPerl

   - ide/ - Files for developing BioPerl using an IDE

   - maintenance/ - BioPerl housekeeping scripts

   - models/ - OO UML diagrams of BioPerl classes, generated with the
               DIA drawing program (these are quite out-of-date)

   - scripts/ - Useful production-quality scripts with POD documentation

   - t/ - Perl built-in tests, divided into subdirectories based on
          the specific classes being tested

   - t/data/ - Data files used for the tests; these also provide good
               example data

o Documentation

 For documentation on BioPerl see the HOWTO documents and tutorials
 online at http://bioperl.org.

 Useful documentation in the form of example code can also be found in
 the examples/ and scripts/ directories. The current collection
 includes scripts that run BLAST, index flat files, parse PDB
 structure files, make primers, retrieve ESTs based on tissue, align
 protein to nucleotide sequence, run GENSCAN on multiple sequences,
 and much more! See bioscripts.pod for a complete listing.

 Individual *.pm modules have their own embedded POD documentation as
 well. A complete set of hyperlinked POD (module) documentation is
 available at http://www.bioperl.org/.

 Remember that 'perldoc' is your friend. You can use it to read any
 file containing POD-formatted documentation without needing any type
 of translator (e.g. 'perldoc Bio::SeqIO').

 If you used the Build.PL installation, and depending on your
 platform, you may have documentation installed as man pages, which
 can be accessed in the usual way.

o Releases

 BioPerl releases are always available from the website at
 http://www.bioperl.org/DIST or in CPAN. The latest code can be found
 at https://github.com/bioperl.

 BioPerl formerly used a numbering scheme to indicate stable release
 series vs. development release series. A release number has three
 parts, like 1.2.0. The first number indicates the major release,
 the idea being that all the API calls in a major release are
 reasonably consistent. The second number is the release series. This
 is probably the most important number.

 From the 1.0 release until the 1.6 release, even numbers (1.0, 1.2,
 etc.) indicated stable releases. Stable releases were well tested and
 recommended for most uses. Odd numbers (1.1, 1.3, etc.) were
 development releases, which one would only use if one were interested
 in the latest and greatest features. The final number (e.g. 1.2.0,
 1.2.1) is the bug fix release. The higher the number, the more bug
 fixes have been incorporated. In theory you can upgrade from one bug
 fix release to the next with no changes to your own code (for
 production cases, obviously check things out carefully before you
 switch over).
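
 As an illustration only, the even/odd convention described above can
 be expressed in a few lines of Perl. The classify_release helper
 below is hypothetical and is not part of BioPerl; it assumes the
 convention applies only to the 1.0 through 1.6 series and reports
 anything else as 'unknown'.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of BioPerl): split a release number such
# as "1.2.1" into its major, series, and bug-fix parts and classify the
# series using the even/odd convention.
sub classify_release {
    my ($version) = @_;

    # Accept "1.2" as well as "1.2.1"; a missing bug-fix part counts as 0.
    my ($major, $series, $bugfix) = $version =~ /^(\d+)\.(\d+)(?:\.(\d+))?$/
        or die "Unrecognized release number: $version\n";
    $bugfix = 0 unless defined $bugfix;

    # Assumption: the convention only covers the 1.0 through 1.6 series.
    my $kind = 'unknown';
    if ($major == 1 && $series <= 6) {
        $kind = $series % 2 == 0 ? 'stable' : 'development';
    }

    return {
        major  => $major,
        series => $series,
        bugfix => $bugfix,
        kind   => $kind,
    };
}

# Print a one-line summary for a few sample release numbers.
for my $v (qw(1.2.0 1.5.2 1.6.1)) {
    my $r = classify_release($v);
    printf "%-6s series %d.%d, bug fix %d: %s\n",
        $v, $r->{major}, $r->{series}, $r->{bugfix}, $r->{kind};
}
```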

 The 1.6 release will be the last release series to utilize the
 alternating 'stable'/'developer' convention. Starting immediately
 after the 1.6 branch, we will start splitting BioPerl into several
 smaller, easier-to-manage distributions, including a developer
 distribution for cutting-edge (in development) code, untested
 modules, and alternative implementations.

o Caveats and warnings

 When you run the tests ("./Build test") some tests may issue warning
 messages or even fail. Sometimes this is because we didn't have
 anyone to test the test system on the combination of your operating
 system, version of perl, and associated libraries and other modules.
 Because BioPerl depends on several outside libraries, we may not be
 able to test every single combination, so even if there are warnings
 you may find that the package is still perfectly useful.

 If you install the bioperl-run system and run its tests when you
 don't have the external program installed, you'll get messages like
 'program XXX not found, skipping tests'. That's okay; BioPerl is
 doing what it is supposed to do. If you want to run the program,
 you'll need to install it first.
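
 The skipping behaviour can be pictured with a minimal sketch using
 the core Test::More and File::Spec modules. It is not the bioperl-run
 implementation: find_program is a hypothetical helper, and 'XXX' is
 just the placeholder name from the message above.

```perl
use strict;
use warnings;
use File::Spec;
use Test::More;

# Hypothetical helper (not the bioperl-run implementation): return the
# full path of an executable found on PATH, or undef if it is absent.
sub find_program {
    my ($name) = @_;
    for my $dir (File::Spec->path) {
        my $candidate = File::Spec->catfile($dir, $name);
        return $candidate if -f $candidate && -x _;
    }
    return;
}

my $program = 'XXX';    # placeholder program name, not a real tool

SKIP: {
    my $path = find_program($program);

    # Skip the one test in this block when the program is not installed.
    skip "program $program not found, skipping tests", 1
        unless defined $path;

    # Only reached when the external program is available.
    ok(-x $path, "$program is executable");
}

done_testing();
```

 A skipped test is reported as passing, which is why a missing
 external program does not make the test run fail.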

 Not all scripts in the examples/ directory are correct and up-to-date.
 We need volunteers to help maintain these, so if you find scripts
 that are not, please submit a bug report to
 https://redmine.open-bio.org/projects/bioperl/ and consider helping
 out with their maintenance.

 If you are unsure which modules are appropriate for a particular
 problem in bioinformatics, we urge you to look at the HOWTO
 documents first.

o A simple module summary

 Here is a quick summary of many of the useful modules and how the
 toolkit is laid out.

 All modules are in the Bio/ namespace:

 - Perl is for newbies and gives a functional interface to the main
   parts of the package

 - Seq is for Sequences (protein and DNA).
   o Bio::PrimarySeq is a plain sequence (sequence data + identifiers)
   o Bio::Seq is a PrimarySeq plus it has a Bio::Annotation::Collection
     and Bio::SeqFeatureI objects attached
     (via Bio::FeatureHolderI).
   o Bio::Seq::RichSeq is all of the above plus it has slots for
     extra information specific to GenBank/EMBL/SwissProt files.
   o Bio::Seq::LargeSeq is for sequences which are too big to fit
     into memory.

 - SeqIO is for reading and writing Sequences; it is a front end
   module for separate driver modules supporting the different
   sequence formats

 - SeqFeature - start/stop/strand annotations of sequences
   o Bio::SeqFeature::Generic is the basic catchall
   o Bio::SeqFeature::Similarity is a similarity sequence feature
   o Bio::SeqFeature::FeaturePair is a sequence feature which is
     pairwise, such as query/hit pairs

 - SearchIO is for reading and writing pairwise alignment reports like
   BLAST or FASTA

 - Search is where the alignment objects are defined
   o Bio::Search::Result::GenericResult is the result object (a blast
     query is a Result object)
   o Bio::Search::Hit::GenericHit is the Hit object (a query will have
     0 -> many hits in a database)
   o Bio::Search::HSP::GenericHSP is the High-scoring Segment Pair
     object defining the alignment(s) of the query and hit.

 - SimpleAlign is for multiple sequence alignments

 - AlignIO is for reading and writing multiple sequence alignment
   formats

 - Assembly provides the start of an infrastructure for assemblies and
   Assembly::IO provides IO converters for them

 - DB is the namespace for all the database query objects
   o Bio::DB::GenBank/GenPept are two modules which query NCBI Entrez
     for sequences
   o Bio::DB::SwissProt/EMBL query various EMBL and SwissProt
     repositories for sequences
   o Bio::DB::GFF is Lincoln Stein's fast, lightweight feature and
     sequence database which is the backend to his GBrowse system (see
     www.gmod.org)
   o Bio::DB::Flat is a fast implementation of the OBDA flat-file
     indexing system (cross-language and cross-platform, supported by
     O|B|F projects; see http://obda.open-bio.org).
   o Bio::DB::BioFetch/DBFetch are for OBDA, Web (HTTP) access to
     remote databases.
   o Bio::DB::InMemoryCache/FileCache provide fast local caching of
     sequences from remote dbs to speed up your access.
   o Bio::DB::Registry is an interface to the OBDA specification for
     remote data sources
   o Bio::DB::Biblio is for access to remote bibliographic databases.
   o Bio::DB::EUtilities is the initial set of modules used for
     generic queries using NCBI's eUtils.

 - Annotation is a collection of annotation objects (comments,
   DBlinks, References, and misc key/value pairs)

 - Coordinate is a system for mapping between different coordinate
   systems such as DNA to protein or between assemblies

 - Index is for locally indexed flatfiles with BerkeleyDB

 - Tools contains many miscellaneous parsers and functions for
   different bioinformatics needs
   o Gene prediction parsers (Genscan, MZEF, Grail, Genemark)
   o Annotation format (GFF)
   o Enumeration of codon tables and valid sequence symbols
     (CodonTable, IUPAC)
   o Phylogenetic program parsing (PAML, Molphy, Phylip)

 - Map provides genetic and physical map representations

 - Structure - parse and represent protein structure data

 - TreeIO is for reading and writing Tree formats

 - Tree is the namespace for all the associated Tree objects
   o Bio::Tree::Tree is the basic tree object
   o Bio::Tree::Node are the nodes which make up the tree
   o Bio::Tree::Statistics is for computing statistics for a tree
   o Bio::Tree::TreeFunctionsI is where specific tree functions are
     implemented (like is_monophyletic and lca)

 - Bio::Biblio is where bibliographic data and database access objects
   are kept

 - Variation represents sequences with mutations and variations
   applied, so one can compare and represent wild-type and mutant
   versions of a sequence.

 - Root contains basic objects for the internals of BioPerl
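
 As a rough sketch of how Seq and SeqIO from the summary above fit
 together, the following builds a Bio::Seq object, writes it as FASTA
 through Bio::SeqIO, and reads it back. It assumes BioPerl is
 installed; the sequence id and bases are made-up example values, and
 in-memory filehandles stand in for real files.

```perl
use strict;
use warnings;
use Bio::Seq;
use Bio::SeqIO;

# Build a sequence object in memory (id and bases are example values).
my $seq = Bio::Seq->new(
    -display_id => 'example1',
    -seq        => 'ATGGCCATTGTAATGGGCCGCTGA',
    -alphabet   => 'dna',
);

# Write the sequence as FASTA to an in-memory filehandle.
my $fasta = '';
open my $out_fh, '>', \$fasta
    or die "Cannot open in-memory output handle: $!";
my $out = Bio::SeqIO->new(-fh => $out_fh, -format => 'fasta');
$out->write_seq($seq);
$out->close;

# Read the FASTA text back through the same front-end module.
open my $in_fh, '<', \$fasta
    or die "Cannot open in-memory input handle: $!";
my $in = Bio::SeqIO->new(-fh => $in_fh, -format => 'fasta');
while (my $s = $in->next_seq) {
    printf "%s: %d bases\n", $s->display_id, $s->length;
    print "reverse complement: ", $s->revcom->seq, "\n";
    print "translation: ", $s->translate->seq, "\n";
}
$in->close;
```

 To work with files instead, pass -file => 'name' in place of -fh, and
 change -format (for example to 'genbank') to select a different
 driver module.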

o Upgrading from an older version

 If you have a previously installed version of BioPerl on your system
 some of these notes may help you.

 Some modules have been removed because they have been superseded by
 new development efforts. They are documented in the DEPRECATED file
 that is included in the release. In addition some methods, or the
 Application Programming Interface (API), have changed or been
 removed. You may find that scripts which worked with BioPerl 1.4 may
 give you warnings or may not work at all (although we have tried very
 hard to minimize this!). Send an email to the list and we'll be happy
 to give you pointers.