HACKING.md

# The Directory Structure

The BioPerl directory structure is organized as follows:

* **`Bio/`** - BioPerl modules

* **`examples/`** - Scripts demonstrating the many uses of BioPerl

* **`ide/`** - Files for developing BioPerl using an IDE

* **`maintenance/`** - BioPerl housekeeping scripts

* **`models/`** - Object-oriented UML diagrams of BioPerl classes, generated
  with the Dia drawing program (these are quite out of date)

* **`scripts/`** - Useful production-quality scripts with POD documentation

* **`t/`** - Perl built-in tests, divided into subdirectories based on the
  specific classes being tested

* **`t/data/`** - Data files used by the tests, which also serve as good
  example data

* **`travis_scripts/`** - Scripts to customize the Travis CI build

# Documentation

For documentation on BioPerl, see the **HOWTO** documents online at http://bioperl.org/howtos.

Useful documentation in the form of example code can also be found in the
**`examples/`** and **`scripts/`** directories. The current collection includes
scripts that run BLAST, index flat files, parse PDB structure files, make
primers, retrieve ESTs based on tissue, align protein to nucleotide sequence,
run GENSCAN on multiple sequences, and much more! See `bioscripts.pod` for a
complete listing.

Individual `*.pm` modules have their own embedded POD documentation as well. A
complete set of hyperlinked POD (module) documentation is available at
http://www.bioperl.org/.

Remember that `perldoc` is your friend. You can use it to read any file
containing POD-formatted documentation without needing any type of translator
(e.g. `perldoc Bio::SeqIO`).

If you used the `Build.PL` installation then, depending on your platform, you
may have documentation installed as man pages, which can be accessed in the
usual way.

# Releases

BioPerl releases are always available from the website at http://www.bioperl.org/DIST or on CPAN. The latest code can be found at https://github.com/bioperl.

* BioPerl currently uses a semantic numbering scheme to indicate stable release
  series vs. development release series. A release number is a three-part
  number like `1.2.0`.
  * The *first number indicates the major release*, the idea being that all the
    API calls in a major release are reasonably consistent.
  * The *second number is the release series*. This is probably the most
    important number, and represents added functionality that is
    backwards-compatible.
  * The *third number is the point or patch release* and represents mainly bug
    fixes or additional code that doesn't add significant functionality to the
    code base.

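As a minimal sketch of the three-part scheme, the following shell function splits a release number into its major, series, and patch parts. The name `parse_release` is a hypothetical helper introduced here for illustration; it is not part of BioPerl.

```shell
#!/bin/sh
# parse_release is a hypothetical helper, not part of BioPerl.
# It splits a release number such as 1.2.0 into its three parts.
parse_release() {
    version=$1
    # Reject anything that is not digits separated by single dots.
    case $version in
        *[!0-9.]*|*..*|.*|*.)
            echo "not a release number: $version" >&2
            return 1
            ;;
    esac
    # Turn the dots into spaces and let the shell split the words.
    set -- $(printf '%s\n' "$version" | tr '.' ' ')
    if [ "$#" -ne 3 ]; then
        echo "expected three parts: $version" >&2
        return 1
    fi
    echo "major=$1 series=$2 patch=$3"
}

parse_release 1.2.0   # prints: major=1 series=2 patch=0
```

The function only checks the shape of the number; it does not know which releases actually exist.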
From the **1.0 release until the 1.6 release**, even series numbers (e.g. `1.4`) indicated stable releases. Stable releases were well tested and recommended for most uses. Odd series numbers (e.g. `1.3`) were development releases, which one would only use if one were interested in the latest features. The final number (e.g. the last `1` in `1.2.1`) is the point or patch release. The higher the number, the more bug fixes have been incorporated. In theory you can upgrade from one point or patch release to the next with no changes to your own code (for production cases, obviously check things out carefully before you switch over).

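The even/odd rule for the 1.0 to 1.6 series can be sketched as a small shell function. `series_kind` is a hypothetical name, and the sketch assumes a well-formed release number such as `1.4` or `1.3.1` as input.

```shell
#!/bin/sh
# series_kind is a hypothetical helper, not part of BioPerl. It applies the
# even/odd rule, which only covers the 1.0 through 1.6 release series.
series_kind() {
    major=${1%%.*}       # 1.4.2 -> 1
    rest=${1#*.}         # 1.4.2 -> 4.2
    series=${rest%%.*}   # 4.2   -> 4
    if [ "$major" != 1 ] || [ "$series" -gt 6 ]; then
        echo "outside the 1.0-1.6 convention"
    elif [ $((series % 2)) -eq 0 ]; then
        echo stable
    else
        echo development
    fi
}

series_kind 1.4     # prints: stable
series_kind 1.3.1   # prints: development
```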
The upcoming **1.7 release** will be the last release series to utilise the alternating 'stable'/'developer' convention. Starting immediately after the final 1.6 branch, we will start splitting BioPerl into several smaller, easier-to-manage distributions. These will have independent versions, all likely starting with v1.7.0. **We do not anticipate major API changes in the 1.7.x release series, merely that the code will be restructured in a way that makes maintenance more feasible.** We anticipate retaining semantic versioning until the 2.x release.

# Caveats and Warnings

When you run the tests with `./Build test`, some tests may issue warning
messages or even fail. Sometimes this is because we didn't have anyone to test
the test system on your combination of operating system, version of Perl, and
associated libraries and other modules. Because BioPerl depends on several
outside libraries, we may not be able to test every single combination, so even
if there are warnings you may find that the package is still perfectly useful.

If you install the bioperl-run system and run its tests when you don't have the
wrapped program installed, you'll get messages like `program XXX not found,
skipping tests`. That's okay; BioPerl is doing what it is supposed to do. If you
want to run the program, you'll need to install it first.

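If you would like to check in advance whether an external program is on your `PATH`, a sketch along these lines may help. `require_program` is a hypothetical helper, and its message only mimics the one quoted above; it does not reproduce bioperl-run's actual code. The program name `genscan` is just an example.

```shell
#!/bin/sh
# require_program is a hypothetical helper. It succeeds when the named
# program is on PATH and otherwise prints a skip-style message.
require_program() {
    if command -v "$1" >/dev/null 2>&1; then
        return 0
    fi
    echo "program $1 not found, skipping tests"
    return 1
}

# Example: report whether the tests for an external program could run.
if require_program genscan; then
    echo "genscan found, its tests can run"
fi
```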
Not all scripts in the `examples/` directory are correct and up to date. If you find an issue with a script, please submit a bug report to https://github.com/bioperl/bioperl-live/issues and consider helping out with its maintenance.

If you are unsure which modules are appropriate for solving a particular
problem in bioinformatics, we urge you to look at the HOWTO documents first.

# Module summary

Here is a quick summary of many of the useful modules and how the
toolkit is laid out. Some of these are in their own distributions.

All modules are in the `Bio::` namespace.

* **`Seq`** is for *Sequences* (protein and DNA).
    * `Bio::PrimarySeq` is a plain sequence (sequence data + identifiers)
    * `Bio::Seq` is a fancier `PrimarySeq`, in that it has annotation (via
      `Bio::Annotation::Collection`) and sequence features (via
      `Bio::SeqFeatureI` objects, attached via `Bio::FeatureHolderI`).
    * `Bio::Seq::RichSeq` is all of the above, plus it has slots for extra
      information specific to GenBank/EMBL/SwissProt files.
    * `Bio::Seq::LargeSeq` is for sequences which are too big to fit into
      memory.

* **`SeqIO`** is for *reading and writing Sequences*. It is a front-end module
  for separate driver modules supporting the different sequence formats

* **`SeqFeature`** represents *start/stop/strand-based localised annotations (features) of sequences*
    * **`Bio::SeqFeature::Generic`** is a basic catch-all
    * **`Bio::SeqFeature::Similarity`** is a similarity sequence feature
    * **`Bio::SeqFeature::FeaturePair`** is a sequence feature which is
      pairwise, such as query/hit pairs

* **`SearchIO`** is for *reading and writing pairwise alignment reports*, like
  BLAST or FASTA

* **`Search`** is where the *alignment objects for `SearchIO` are defined*
    * **`Bio::Search::Result::GenericResult`** is the result object (a BLAST
      query is a `Result` object)
    * **`Bio::Search::Hit::GenericHit`** is the `Hit` object (a query will have
      0 to many hits in a database)
    * **`Bio::Search::HSP::GenericHSP`** is the High-scoring Segment Pair
      object defining the alignment(s) of the query and hit.

* **`SimpleAlign`** is for *multiple sequence alignments*

* **`AlignIO`** is for *reading and writing multiple sequence alignment
  formats*

* **`Assembly`** provides the start of an *infrastructure for assemblies* and
  **`Assembly::IO`** provides *IO converters* for them

* **`DB`** is the namespace for *all the database query classes*
    * **`Bio::DB::GenBank/GenPept`** are two modules which query NCBI Entrez
      for sequences
    * **`Bio::DB::SwissProt/EMBL`** query various EMBL and SwissProt
      repositories for sequences
    * **`Bio::DB::GFF`** is Lincoln Stein's fast, lightweight feature and
      sequence database which is the backend to his GBrowse system (see
      www.gmod.org)
    * **`Bio::DB::Flat`** is a fast implementation of the OBDA flat-file
      indexing system (cross-language and cross-platform, supported by O|B|F
      projects; see http://obda.open-bio.org).
    * **`Bio::DB::BioFetch/DBFetch`** provide OBDA, Web (HTTP) access to remote
      databases.
    * **`Bio::DB::InMemoryCache/FileCache`** provide fast local caching of
      sequences from remote databases to speed up your access.
    * **`Bio::DB::Registry`** is an interface to the OBDA specification for
      remote data sources
    * **`Bio::DB::Biblio`** is for access to remote bibliographic databases.
    * **`Bio::DB::EUtilities`** is the initial set of modules used for generic
      queries using NCBI's eUtils.

* **`Annotation`** is a collection of *annotation objects* (comments, DBlinks,
  References, and misc key/value pairs)

* **`Coordinate`** is a system for *mapping between different coordinate systems*
  such as DNA to protein or between assemblies

* **`Index`** is for *locally indexed flatfiles* with BerkeleyDB

* **`Tools`** contains many *miscellaneous parsers and functions* for different
  bioinformatics needs
    * Gene prediction parsers (Genscan, MZEF, Grail, Genemark)
    * Annotation format (GFF)
    * Enumeration of codon tables and valid sequence symbols (CodonTable,
      IUPAC)
    * Phylogenetic program parsing (PAML, Molphy, Phylip)

* **`Map`** represents *genetic and physical maps*

* **`Structure`** parses and represents *protein structure data*

* **`TreeIO`** is for reading and writing *Tree formats*

* **`Tree`** is the namespace for *all associated Tree classes*
    * **`Bio::Tree::Tree`** is the basic tree object
    * **`Bio::Tree::Node`** objects are the nodes which make up the tree
    * **`Bio::Tree::Statistics`** is for computing statistics for a tree
    * **`Bio::Tree::TreeFunctionsI`** is where specific tree functions are
      implemented (like `is_monophyletic` and `lca`)

* **`Bio::Biblio`** is where *bibliographic data and database access objects*
  are kept

* **`Variation`** represents *sequences with mutations and variations* applied, so one can compare and represent wild-type and mutated versions of a sequence.

* **`Root`** contains basic objects for the *internals of BioPerl*

## The Test System

The BioPerl test system is located in the `t/` directory and is
automatically run whenever you execute the `./Build test` command.

The tests have been organised into groups based upon the specific task or
class that the module being tested belongs to. If you want to investigate the
behaviour of a specific test, such as the Seq test, you would type:

```
./Build test --test_files t/Seq/Seq.t --verbose
```

The `--test_files` argument can be used multiple times to run a set of test
scripts in one go. The `--verbose` argument outputs the detailed test results, instead of just the summary you see during `./Build test`.

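To illustrate repeating `--test_files`, here is a small sketch that assembles such a command line from a list of test scripts and prints it rather than running it. `build_test_command` is a hypothetical helper, and `t/SeqIO/SeqIO.t` is an assumed example path; substitute test scripts that exist in your checkout.

```shell
#!/bin/sh
# build_test_command is a hypothetical helper. It prints (but does not run)
# a ./Build test invocation with one --test_files option per test script.
# Paths containing spaces are not handled by this sketch.
build_test_command() {
    cmd="./Build test"
    for test_file in "$@"; do
        cmd="$cmd --test_files $test_file"
    done
    echo "$cmd --verbose"
}

build_test_command t/Seq/Seq.t t/SeqIO/SeqIO.t
# prints: ./Build test --test_files t/Seq/Seq.t --test_files t/SeqIO/SeqIO.t --verbose
```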
The `--test_files` argument can also work as a glob. For instance, to run tests on all SearchIO modules, use the following:

```
./Build test --test_files t/SearchIO* --verbose
```

You can also use the command-line tool `prove` to run tests, which
is quite useful if you are developing code:

```
prove -lrv t/SearchIO*
```

If you are trying to learn how to use a module, the test suite is often
a good place to look. All good extreme programmers try to write a
test BEFORE they write the module, to ensure that their module behaves
the way they expect. You'll notice some `ok` and `skip` commands in a
test. These are part of the Perl test system: `ok` reports a passed test
as 'ok N', where N is the test number, while `skip` tells Perl to skip
tests. Skipping is useful when, for example, your test detects that the
network is not present and thus should skip, not fail, any tests that
require a network connection.

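The 'ok N' output format and the idea of skipping can be sketched in a few lines of shell. This is an illustration only: real BioPerl tests are Perl scripts, and the functions `ok`, `skip`, `network_available`, and `run_tests`, as well as the `NETWORK_TESTS` variable, are hypothetical stand-ins defined here.

```shell
#!/bin/sh
# Sketch of the "ok N" test output format, written in shell for illustration.
# All function names and NETWORK_TESTS are assumptions, not BioPerl code.
test_number=0

ok() {    # ok DESCRIPTION: report a passed test
    test_number=$((test_number + 1))
    echo "ok $test_number - $1"
}

skip() {  # skip REASON: report a test that was deliberately not run
    test_number=$((test_number + 1))
    echo "ok $test_number # skip $1"
}

network_available() {
    # Assumption: treat the network as unavailable unless NETWORK_TESTS=1.
    [ "${NETWORK_TESTS:-0}" = 1 ]
}

run_tests() {
    test_number=0
    echo "1..2"                      # plan: two tests follow
    ok "sequence parsed"
    if network_available; then
        ok "remote database queried"
    else
        skip "no network connection" # skipped, not failed
    fi
}

run_tests
```

With `NETWORK_TESTS` unset, the second test is reported as skipped rather than failed, which is the behaviour described above.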
The core developers have indicated that future releases of BioPerl
will require that new modules come with a test suite with some minimal
tests. Modules that lack adequate tests or could otherwise be
considered 'unstable' will be moved into a separate developer
distribution until adequate tests are added and the API stabilises.

[how to install Docker]: https://docs.docker.com/engine/installation/
[bioperl/bioperl]: https://hub.docker.com/r/bioperl/bioperl/
[bioperl/bioperl-deps]: https://hub.docker.com/r/bioperl/bioperl-deps/

# Using BioPerl via Docker

If you don't have Docker installed already, instructions for [how to install Docker] on Linux, macOS, and Windows are available online.

We officially support several builds (latest, stable, and releases)
hosted in the [bioperl/bioperl] repo on Docker Hub. These images do not
have a pre-defined entrypoint. If you have a BioPerl script in the
current directory, you can run it as simply as this:

```
docker run -t --rm -v `pwd`:/work -w /work bioperl/bioperl perl my-script.pl
```

Or run an interactive shell:

```
docker run -ti --rm -v `pwd`:/work -w /work bioperl/bioperl bash
```

You can also build your own Docker image of BioPerl, using the same
base image and pre-built dependencies that we use. Simply build off of
the [bioperl/bioperl-deps] image.
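As a rough sketch of building on that base image, the script below writes a two-line Dockerfile and prints the matching build command. The helper `write_dockerfile`, the image tag `my-bioperl`, and the `cpanm Bio::Perl` install step are assumptions made for illustration; in particular, check that `cpanm` is available in the base image before relying on it.

```shell
#!/bin/sh
# write_dockerfile is a hypothetical helper that writes a minimal Dockerfile
# into the directory given as its first argument.
write_dockerfile() {
    cat > "$1/Dockerfile" <<'EOF'
# Start from the image that already contains BioPerl's dependencies.
FROM bioperl/bioperl-deps
# Assumption: cpanm is available in the base image; adjust if it is not.
RUN cpanm Bio::Perl
EOF
}

build_dir=$(mktemp -d)
write_dockerfile "$build_dir"

# Printed rather than executed, so the sketch runs without Docker installed.
# The tag my-bioperl is an arbitrary example name.
echo "docker build -t my-bioperl $build_dir"
```

Running the printed command would build the image, which could then be used in place of `bioperl/bioperl` in the `docker run` examples above.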