Bio/DB/GenBank.pm

   1 #
   2 # BioPerl module for Bio::DB::GenBank
   3 #
   4 # Please direct questions and support issues to <bioperl-l@bioperl.org>
   5 #
   6 # Cared for by Aaron Mackey <amackey@virginia.edu>
   7 #
   8 # Copyright Aaron Mackey
   9 #
  10 # You may distribute this module under the same terms as perl itself
  11 #
  12 # POD documentation - main docs before the code
  13 #
  14 # Added LWP support - Jason Stajich 2000-11-6
  15 # completely reworked by Jason Stajich 2000-12-8
  16 # to use WebDBSeqI
  17
  18 # Added batch entrez back when determined that new entrez cgi will
  19 # essentially work (there is a limit to the number of characters in a
  20 # GET request so I am not sure how we can get around this).  The NCBI
  21 # Batch Entrez form has changed some and it does not support retrieval
  22 # of text only data.  Still should investigate POST-ing (tried and
  23 # failed) a message to the entrez cgi to get around the GET
  24 # limitations.
  25
  26 =head1 NAME
  27
  28 Bio::DB::GenBank - Database object interface to GenBank
  29
  30 =head1 SYNOPSIS
  31
  32     use Bio::DB::GenBank;
  33     $gb = Bio::DB::GenBank->new();
  34
  35     $seq = $gb->get_Seq_by_id('J00522'); # Unique ID, *not always the LOCUS ID*
  36
  37     # or ...
  38
  39     $seq = $gb->get_Seq_by_acc('J00522'); # Accession Number
  40     $seq = $gb->get_Seq_by_version('J00522.1'); # Accession.version
  41     $seq = $gb->get_Seq_by_gi('405830'); # GI Number
  42
  43     # get a stream via a query string
  44     my $query = Bio::DB::Query::GenBank->new
  45         (-query   =>'Oryza sativa[Organism] AND EST',
  46          -reldate => '30',
  47          -db      => 'nucleotide');
  48     my $seqio = $gb->get_Stream_by_query($query);
  49
  50     while( my $seq =  $seqio->next_seq ) {
  51       print "seq length is ", $seq->length,"\n";
  52     }
  53
  54     # or ... best when downloading very large files, prevents
  55     # keeping all of the file in memory
  56
  57     # also don't want features, just sequence so let's save bandwidth
  58     # and request Fasta sequence
  59     $gb = Bio::DB::GenBank->new(-retrievaltype => 'tempfile' ,
  60                                 -format        => 'Fasta');
  61     my $seqio = $gb->get_Stream_by_acc(['AC013798', 'AC021953'] );
  62     while( my $clone =  $seqio->next_seq ) {
  63       print "cloneid is ", $clone->display_id, " ",
  64              $clone->accession_number, "\n";
  65     }
  66     # note that get_Stream_by_version is not implemented
  67
  68     # request only part of a sequence, or set further options
  69     my $gb = Bio::DB::GenBank->new(-format     => 'Fasta',
  70                                    -seq_start  => 100,
  71                                    -seq_stop   => 200,
  72                                    -strand     => 1,
  73                                    -complexity => 1);
  74     my $seqi = $gb->get_Stream_by_query($query);
  75
  76
  77 =head1 DESCRIPTION
  78
  79 Allows the dynamic retrieval of L<Bio::Seq> sequence objects from the
  80 GenBank database at NCBI, via an Entrez query.
  81
  82 WARNING: Please do B<NOT> spam the Entrez web server with multiple
  83 requests.  NCBI offers Batch Entrez for this purpose.
  84
  85 Note that when querying for GenBank accessions starting with 'NT_' you
  86 will need to call $gb-E<gt>request_format('fasta') beforehand, because
  87 in GenBank format (the default) the sequence part is left out: NT
  88 contig records are mostly annotation that refers to the underlying
  89 clones rather than carrying the sequence itself.
  90
  91 Some work has been done to automatically detect and retrieve whole NT_
  92 clones when the data is in that format (NCBI RefSeq clones). Prior to
  93 BioPerl 1.6 these were retrieved from EBI, but they are now retrieved
  94 directly from NCBI. The older behavior can be restored by setting the
  95 'redirect_refseq' flag to a value
  96 evaluating to TRUE.
  97
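     A minimal sketch of the NT_ workaround described above. The accession
     shown is a hypothetical placeholder, not one taken from this module;
     substitute a real NT_ accession. Running it requires network access
     to NCBI.

         use Bio::DB::GenBank;

         my $gb = Bio::DB::GenBank->new();

         # Ask for FASTA before fetching, so that the sequence itself is
         # returned rather than the annotation-only GenBank record.
         $gb->request_format('fasta');

         # 'NT_XXXXXX' is a placeholder accession used for illustration.
         my $seq = $gb->get_Seq_by_acc('NT_XXXXXX');
         print "length: ", $seq->length, "\n";
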
  98 =head2 Running
  99
 100 Alternate methods are described at
 101 L<http://www.ncbi.nlm.nih.gov/entrez/query/static/efetchseq_help.html>
 102
 103 NOTE: strand should be 1 for plus or 2 for minus.
 104
 105 Complexity: a gi is often part of a larger biological blob that
 106 contains other gis.
 107
 108 complexity regulates the display:
 109   0 - get the whole blob
 110   1 - get the bioseq for gi of interest (default in Entrez)
 111   2 - get the minimal bioseq-set containing the gi of interest
 112   3 - get the minimal nuc-prot containing the gi of interest
 113   4 - get the minimal pub-set containing the gi of interest
 114
 115 'seq_start' and 'seq_stop' will not work when setting complexity to
 116 any value other than 1.  'strand' works for any setting other than a
 117 complexity of 0 (whole blob); when you try this with a GenBank return
 118 format nothing happens, whereas using FASTA works but causes display
 119 problems with the other sequences in the blob.  As Tao Tao from NCBI
 120 says, "Better left it out or set it to 1."
 121
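     As a hedged illustration of the constraints above, the following
     sketch requests a subrange with complexity left at 1, the only
     setting for which 'seq_start' and 'seq_stop' are said to work. The
     accession is the one used in the SYNOPSIS, and network access to
     NCBI is assumed.

         use Bio::DB::GenBank;

         my $gb = Bio::DB::GenBank->new(-format     => 'Fasta',
                                        -seq_start  => 100,
                                        -seq_stop   => 200,
                                        -strand     => 1,   # 1 = plus, 2 = minus
                                        -complexity => 1);  # needed for a subrange

         my $seq = $gb->get_Seq_by_acc('J00522');
         print $seq->display_id, " ", $seq->length, "\n";
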
 122 =head1 FEEDBACK
 123
 124 =head2 Mailing Lists
 125
 126 User feedback is an integral part of the evolution of this and other
 127 Bioperl modules. Send your comments and suggestions preferably to one
 128 of the Bioperl mailing lists. Your participation is much appreciated.
 129
 130   bioperl-l@bioperl.org                  - General discussion
 131   http://bioperl.org/wiki/Mailing_lists  - About the mailing lists
 132
 133 =head2 Support
 134
 135 Please direct usage questions or support issues to the mailing list:
 136
 137 I<bioperl-l@bioperl.org>
 138
 139 rather than to the module maintainer directly. Many experienced and
 140 responsive experts will be able to look at the problem and quickly
 141 address it. Please include a thorough description of the problem
 142 with code and data examples if at all possible.
 143
 144 =head2 Reporting Bugs
 145
 146 Report bugs to the Bioperl bug tracking system to help us keep track
 147 of the bugs and their resolution.  Bug reports can be submitted via the
 148 web:
 149
 150   https://github.com/bioperl/bioperl-live/issues
 151
 152 =head1 AUTHOR - Aaron Mackey, Jason Stajich
 153
 154 Email amackey@virginia.edu
 155 Email jason@bioperl.org
 156
 157 =head1 APPENDIX
 158
 159 The rest of the documentation details each of the
 160 object methods. Internal methods are usually
 161 preceded by an underscore (_).
 162
 163 =cut
 164
 165 # Let the code begin...
 166
 167 package Bio::DB::GenBank;
 168 use strict;
 169 use vars qw(%PARAMSTRING $DEFAULTFORMAT $DEFAULTMODE);
 170
 171 use base qw(Bio::DB::NCBIHelper);
 172 BEGIN {
 173     $DEFAULTMODE   = 'single';
 174     $DEFAULTFORMAT = 'gbwithparts';
 175     %PARAMSTRING = (
 176                     'batch' => { 'db'     => 'nucleotide',
 177                                   'usehistory' => 'n',
 178                                   'tool'   => 'bioperl'},
 179                      'query' => { 'usehistory' => 'y',
 180                                   'tool'   => 'bioperl',
 181                                   'retmode' => 'text'},
 182                      'gi' => { 'db'     => 'nucleotide',
 183                                'usehistory' => 'n',
 184                                'tool'   => 'bioperl',
 185                                'retmode' => 'text'},
 186                      'version' => { 'db'     => 'nucleotide',
 187                                     'usehistory' => 'n',
 188                                     'tool'   => 'bioperl',
 189                                     'retmode' => 'text'},
 190                      'single' => { 'db'     => 'nucleotide',
 191                                    'usehistory' => 'n',
 192                                    'tool'   => 'bioperl',
 193                                    'retmode' => 'text'},
 194                       'webenv' => {
 195                                   'query_key'  => 'querykey',
 196                                   'WebEnv'  => 'cookie',
 197                                   'db'     => 'nucleotide',
 198                                   'usehistory' => 'n',
 199                                   'tool'   => 'bioperl',
 200                                   'retmode' => 'text'},
 201                      );
 202 }
 203
 204 # new is in NCBIHelper
 205
 206 # helper method to get db specific options
 207
 208 =head2 new
 209
 210  Title   : new
 211  Usage   : $gb = Bio::DB::GenBank->new(@options)
 212  Function: Creates a new GenBank handle
 213  Returns : a new Bio::DB::GenBank object
 214  Args    : -delay   number of seconds to delay between fetches (3s)
 215
 216 NOTE:  There are other options that are used internally.  By NCBI policy, this
 217 module introduces a 3s delay between fetches.  If you are fetching multiple genbank
 218 ids, it is a good idea to use get
 219
 220 =cut
 221
 222 =head2 get_params
 223
 224  Title   : get_params
 225  Usage   : my %params = $self->get_params($mode)
 226  Function: Returns key,value pairs to be passed to NCBI database
 227            for the given retrieval mode (falls back to 'single' if unknown)
 228  Returns : a key,value pair hash
 229  Args    : retrieval mode, e.g. 'single' or 'batch'
 230
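     The mode lookup can be sketched in isolation as below. This is a
     standalone illustration, not the module's code: %params and
     params_for() are hypothetical names, and the table is abbreviated.

         use strict;
         use warnings;

         # Abbreviated, hypothetical copy of the per-mode parameter table.
         my %params = (
             batch  => { db => 'nucleotide', usehistory => 'n',
                         tool => 'bioperl' },
             single => { db => 'nucleotide', usehistory => 'n',
                         tool => 'bioperl', retmode => 'text' },
         );
         my $default_mode = 'single';

         # Return the parameters for $mode, or those of the default mode
         # when $mode is missing or not in the table.
         sub params_for {
             my ($mode) = @_;
             my $key = (defined $mode && exists $params{$mode})
                     ? $mode : $default_mode;
             return %{ $params{$key} };
         }

         my %p = params_for('no_such_mode');
         print join(',', map { "$_=$p{$_}" } sort keys %p), "\n";
         # prints: db=nucleotide,retmode=text,tool=bioperl,usehistory=n
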
 231 =cut
 232
 233 sub get_params {
 234     my ($self, $mode) = @_;
 235     return defined $PARAMSTRING{$mode} ?
 236         %{$PARAMSTRING{$mode}} : %{$PARAMSTRING{$DEFAULTMODE}};
 237 }
 238
 239 # from Bio::DB::WebDBSeqI from Bio::DB::RandomAccessI
 240
 241 =head1 Routines Bio::DB::WebDBSeqI from Bio::DB::RandomAccessI
 242
 243 =head2 get_Seq_by_id
 244
 245  Title   : get_Seq_by_id
 246  Usage   : $seq = $db->get_Seq_by_id('ROA1_HUMAN')
 247  Function: Gets a Bio::Seq object by its name
 248  Returns : a Bio::Seq object
 249  Args    : the id (as a string) of a sequence
 250  Throws  : "id does not exist" exception
 251
 252 =head2 get_Seq_by_acc
 253
 254   Title   : get_Seq_by_acc
 255   Usage   : $seq = $db->get_Seq_by_acc($acc);
 256   Function: Gets a Seq object by accession number
 257   Returns : a Bio::Seq object
 258   Args    : the accession number as a string
 259   Note    : For GenBank, this just calls the same code for get_Seq_by_id().
 260             Caveat: this normally works, but in rare cases simply passing the
 261             accession can lead to odd results, possibly due to unsynchronized
 262             NCBI ID servers. Using get_Seq_by_version() is slightly better, but
 263             using the unique identifier (GI) and get_Seq_by_id is the most
 264             consistent
 265   Throws  : "id does not exist" exception
 266
 267 =head2 get_Seq_by_gi
 268
 269  Title   : get_Seq_by_gi
 270  Usage   : $seq = $db->get_Seq_by_gi('405830');
 271  Function: Gets a Bio::Seq object by gi number
 272  Returns : A Bio::Seq object
 273  Args    : gi number (as a string)
 274  Throws  : "gi does not exist" exception
 275
 276 =head2 get_Seq_by_version
 277
 278  Title   : get_Seq_by_version
 279  Usage   : $seq = $db->get_Seq_by_version('X77802.1');
 280  Function: Gets a Bio::Seq object by sequence version
 281  Returns : A Bio::Seq object
 282  Args    : accession.version (as a string)
 283  Note    : Caveat: this normally works, but using the unique identifier (GI) and
 284            get_Seq_by_id is the most consistent
 285  Throws  : "acc.version does not exist" exception
 286
 287 =head1 Routines implemented by Bio::DB::NCBIHelper
 288
 289 =head2 get_Stream_by_query
 290
 291   Title   : get_Stream_by_query
 292   Usage   : $stream = $db->get_Stream_by_query($query);
 293   Function: Retrieves Seq objects from Entrez 'en masse', rather than one
 294             at a time.  For large numbers of sequences, this is far superior
 295             to get_Stream_by_[id/acc]().
 296   Example :
 297   Returns : a Bio::SeqIO stream object
 298   Args    : $query :   An Entrez query string or a
 299             Bio::DB::Query::GenBank object.  It is suggested that you
 300             create a Bio::DB::Query::GenBank object and get the entry
 301             count before you fetch a potentially large stream.
 302
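     A sketch of the suggestion above: build the query object, check its
     count, and only then open the stream. The query text is the one from
     the SYNOPSIS; the limit of 1000 is an arbitrary value chosen for
     illustration. Network access to NCBI is assumed.

         use Bio::DB::GenBank;
         use Bio::DB::Query::GenBank;

         my $query = Bio::DB::Query::GenBank->new(
             -db      => 'nucleotide',
             -query   => 'Oryza sativa[Organism] AND EST',
             -reldate => '30',
         );

         # count() reports how many entries match without fetching them.
         my $count = $query->count;
         print "query matches $count entries\n";

         if ($count <= 1000) {    # arbitrary illustrative limit
             my $gb     = Bio::DB::GenBank->new();
             my $stream = $gb->get_Stream_by_query($query);
             while ( my $seq = $stream->next_seq ) {
                 print $seq->display_id, "\n";
             }
         }
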
 303 =cut
 304
 305 =head2 get_Stream_by_id
 306
 307   Title   : get_Stream_by_id
 308   Usage   : $stream = $db->get_Stream_by_id( [$uid1, $uid2] );
 309   Function: Gets a series of Seq objects by unique identifiers
 310   Returns : a Bio::SeqIO stream object
 311   Args    : $ref : a reference to an array of unique identifiers for
 312                    the desired sequence entries
 313
 314 =head2 get_Stream_by_acc
 315
 316   Title   : get_Stream_by_acc
 317   Usage   : $stream = $db->get_Stream_by_acc([$acc1, $acc2]);
 318   Function: Gets a series of Seq objects by accession numbers
 319   Returns : a Bio::SeqIO stream object
 320   Args    : $ref : a reference to an array of accession numbers for
 321                    the desired sequence entries
 322   Note    : For GenBank, this just calls the same code for get_Stream_by_id()
 323
 324 =cut
 325
 326 =head2 get_Stream_by_gi
 327
 328   Title   : get_Stream_by_gi
 329   Usage   : $stream = $db->get_Stream_by_gi([$gi1, $gi2]);
 330   Function: Gets a series of Seq objects by gi numbers
 331   Returns : a Bio::SeqIO stream object
 332   Args    : $ref : a reference to an array of gi numbers for
 333                    the desired sequence entries
 334   Note    : For GenBank, this just calls the same code for get_Stream_by_id()
 335
 336 =head2 get_Stream_by_batch
 337
 338   Title   : get_Stream_by_batch
 339   Usage   : $stream = $db->get_Stream_by_batch($ref);
 340   Function: Retrieves Seq objects from Entrez 'en masse', rather than one
 341             at a time.
 342   Example :
 343   Returns : a Bio::SeqIO stream object
 344   Args    : $ref : either an array reference, a filename, or a filehandle
 345             from which to get the list of unique ids/accession numbers.
 346
 347 NOTE: This method is redundant and deprecated.  Use get_Stream_by_id()
 348 instead.
 349
 350 =head2 get_request
 351
 352  Title   : get_request
 353  Usage   : my $request = $self->get_request(%qualifiers)
 354  Function: Builds the request used to fetch the data
 355  Returns : HTTP::Request object
 356  Args    : %qualifiers = a hash of qualifiers (ids, format, etc)
 357
 358 =cut
 359
 360 =head2 default_format
 361
 362  Title   : default_format
 363  Usage   : my $format = $self->default_format
 364  Function: Returns default sequence format for this module
 365  Returns : string
 366  Args    : none
 367
 368 =cut
 369
 370 sub default_format {
 371     return $DEFAULTFORMAT;
 372 }
 373
 374 1;
 375 __END__