Bio/SeqIO/genbank.pm

   1 # $Id$
   2 #
   3 # BioPerl module for Bio::SeqIO::genbank
   4 #
   5 # Please direct questions and support issues to <bioperl-l@bioperl.org>
   6 #
   7 # Cared for by Bioperl project bioperl-l(at)bioperl.org
   8 #
   9 # Copyright Elia Stupka and contributors see AUTHORS section
  10 #
  11 # You may distribute this module under the same terms as perl itself
  12
  13 # POD documentation - main docs before the code
  14
  15 =head1 NAME
  16
  17 Bio::SeqIO::genbank - GenBank sequence input/output stream
  18
  19 =head1 SYNOPSIS
  20
  21 It is probably best not to use this object directly, but
  22 rather go through the SeqIO handler:
  23
  24     $stream = Bio::SeqIO->new(-file => $filename,
  25                               -format => 'GenBank');
  26
  27     while ( my $seq = $stream->next_seq() ) {
  28             # do something with $seq
  29     }
  30
  31
  32 =head1 DESCRIPTION
  33
  34 This object can transform Bio::Seq objects to and from GenBank flat
  35 file databases.
  36
  37 There is some flexibility here about how to write GenBank output
  38 that is not fully documented.
  39
  40 =head2 Optional functions
  41
  42 =over 3
  43
  44 =item _show_dna()
  45
  46 (output only) shows the dna or not
  47
  48 =item _post_sort()
  49
  50 (output only) provides a sorting func which is applied to the FTHelpers
  51 before printing
  52
  53 =item _id_generation_func()
  54
  55 This is function which is called as
  56
  57    print "ID   ", $func($seq), "\n";
  58
  59 To generate the ID line. If it is not there, it generates a sensible ID
  60 line using a number of tools.
  61
  62 If you want to output annotations in Genbank format they need to be
  63 stored in a Bio::Annotation::Collection object which is accessible
  64 through the Bio::SeqI interface method L<annotation()|annotation>.
  65
  66 The following are the names of the keys which are pulled from a
  67 L<Bio::Annotation::Collection> object:
  68
  69  reference       - Should contain Bio::Annotation::Reference objects
  70  comment         - Should contain Bio::Annotation::Comment objects
  71  dblink          - Should contain a Bio::Annotation::DBLink object
  72  segment         - Should contain a Bio::Annotation::SimpleValue object
  73  origin          - Should contain a Bio::Annotation::SimpleValue object
  74  wgs             - Should contain a Bio::Annotation::SimpleValue object
  75
  76 =back
  77
  78 =head1 Where does the data go?
  79
  80 Data parsed in Bio::SeqIO::genbank is stored in a variety of data
  81 fields in the sequence object that is returned. Here is a partial list
  82 of fields.
  83
  84 Items listed as RichSeq or Seq or PrimarySeq and then NAME() tell you
  85 the top level object which defines a function called NAME() which
  86 stores this information.
  87
  88 Items listed as Annotation 'NAME' tell you the data is stored the
  89 associated Bio::AnnotationCollectionI object which is associated with
  90 Bio::Seq objects.  If it is explictly requested that no annotations
  91 should be stored when parsing a record of course they will not be
  92 available when you try and get them.  If you are having this problem
  93 look at the type of SeqBuilder that is being used to contruct your
  94 sequence object.
  95
  96  Comments             Annotation 'comment'
  97  References           Annotation 'reference'
  98  Segment              Annotation 'segment'
  99  Origin               Annotation 'origin'
 100  Dbsource             Annotation 'dblink'
 101
 102  Accessions           PrimarySeq accession_number()
 103  Secondary accessions RichSeq get_secondary_accessions()
 104  GI number            PrimarySeq primary_id()
 105  LOCUS                PrimarySeq display_id()
 106  Keywords             RichSeq get_keywords()
 107  Dates                RichSeq get_dates()
 108  Molecule             RichSeq molecule()
 109  Seq Version          RichSeq seq_version()
 110  PID                  RichSeq pid()
 111  Division             RichSeq division()
 112  Features             Seq get_SeqFeatures()
 113  Alphabet             PrimarySeq alphabet()
 114  Definition           PrimarySeq description() or desc()
 115  Version              PrimarySeq version()
 116
 117  Sequence             PrimarySeq seq()
 118
 119 There is more information in the Feature-Annotation HOWTO about each
 120 field and how it is mapped to the Sequence object
 121 L<http://bioperl.open-bio.org/wiki/HOWTO:Feature-Annotation>.
 122
 123 =head1 FEEDBACK
 124
 125 =head2 Mailing Lists
 126
 127 User feedback is an integral part of the evolution of this and other
 128 Bioperl modules. Send your comments and suggestions preferably to one
 129 of the Bioperl mailing lists.  Your participation is much appreciated.
 130
 131   bioperl-l@bioperl.org                  - General discussion
 132   http://bioperl.org/wiki/Mailing_lists  - About the mailing lists
 133
 134 =head2 Support
 135
 136 Please direct usage questions or support issues to the mailing list:
 137
 138 L<bioperl-l@bioperl.org>
 139
 140 rather than to the module maintainer directly. Many experienced and
 141 reponsive experts will be able look at the problem and quickly
 142 address it. Please include a thorough description of the problem
 143 with code and data examples if at all possible.
 144
 145 =head2 Reporting Bugs
 146
 147 Report bugs to the Bioperl bug tracking system to help us keep track
 148 the bugs and their resolution. Bug reports can be submitted via the web:
 149
 150   http://bugzilla.open-bio.org/
 151
 152 =head1 AUTHOR - Bioperl Project
 153
 154 bioperl-l at bioperl.org
 155
 156 Original author Elia Stupka, elia -at- tigem.it
 157
 158 =head1 CONTRIBUTORS
 159
 160 Ewan Birney birney at ebi.ac.uk
 161 Jason Stajich jason at bioperl.org
 162 Chris Mungall cjm at fruitfly.bdgp.berkeley.edu
 163 Lincoln Stein lstein at cshl.org
 164 Heikki Lehvaslaiho, heikki at ebi.ac.uk
 165 Hilmar Lapp, hlapp at gmx.net
 166 Donald G. Jackson, donald.jackson at bms.com
 167 James Wasmuth, james.wasmuth at ed.ac.uk
 168 Brian Osborne, bosborne at alum.mit.edu
 169 Chris Fields, cjfields at bioperl dot org
 170
 171 =head1 APPENDIX
 172
 173 The rest of the documentation details each of the object
 174 methods. Internal methods are usually preceded with a _
 175
 176 =cut
 177
 178 # Let the code begin...
 179
 180 package Bio::SeqIO::genbank;
 181 use strict;
 182
 183 use Bio::SeqIO::FTHelper;
 184 use Bio::SeqFeature::Generic;
 185 use Bio::Species;
 186 use Bio::Seq::SeqFactory;
 187 use Bio::Annotation::Collection;
 188 use Bio::Annotation::Comment;
 189 use Bio::Annotation::Reference;
 190 use Bio::Annotation::DBLink;
 191
 192 use base qw(Bio::SeqIO);
 193
 194 our %FTQUAL_NO_QUOTE = map {$_ => 1} qw(
 195     anticodon           citation
 196     codon               codon_start
 197     cons_splice         direction
 198     evidence            label
 199     mod_base            number
 200     rpt_type            rpt_unit
 201     transl_except       transl_table
 202     usedin
 203     );
 204
 205 our %DBSOURCE = map {$_ => 1} qw(
 206     EchoBASE     IntAct    SWISS-2DPAGE    ECO2DBASE    ECOGENE    TIGRFAMs
 207     TIGR    GO    InterPro    Pfam    PROSITE    SGD    GermOnline
 208     HSSP    PhosSite    Ensembl    RGD    AGD    ArrayExpress    KEGG
 209     H-InvDB    HGNC    LinkHub    PANTHER    PRINTS    SMART    SMR
 210     MGI    MIM    RZPD-ProtExp    ProDom    MEROPS    TRANSFAC    Reactome
 211     UniGene    GlycoSuiteDB    PIRSF    HSC-2DPAGE    PHCI-2DPAGE
 212     PMMA-2DPAGE    Siena-2DPAGE    Rat-heart-2DPAGE    Aarhus/Ghent-2DPAGE
 213     Biocyc    MetaCyc    Biocyc:Metacyc    GenomeReviews    FlyBase
 214     TMHOBP    COMPLUYEAST-2DPAGE    OGP    DictyBase    HAMAP
 215     PhotoList    Gramene    WormBase    WormPep    Genew    ZFIN
 216     PeroxiBase    MaizeDB    TAIR    DrugBank    REBASE    HPA
 217     swissprot    GenBank    GenPept    REFSEQ    embl    PDB    UniProtKB
 218     DIP    PeptideAtlas    PRIDE    CYGD    HOGENOME    Gene3D);
 219
 220 our %VALID_ALPHABET = (
 221     'bp' => 'dna',
 222     'aa' => 'protein',
 223     'rc' => '' # rc = release candidate; file has no sequences
 224 );
 225
 226 sub _initialize {
 227     my($self,@args) = @_;
 228
 229     $self->SUPER::_initialize(@args);
 230     # hash for functions for decoding keys.
 231     $self->{'_func_ftunit_hash'} = {};
 232     $self->_show_dna(1); # sets this to one by default. People can change it
 233     if( ! defined $self->sequence_factory ) {
 234             $self->sequence_factory(Bio::Seq::SeqFactory->new
 235                             (-verbose => $self->verbose(),
 236                              -type => 'Bio::Seq::RichSeq'));
 237     }
 238 }
 239
 240 =head2 next_seq
 241
 242  Title   : next_seq
 243  Usage   : $seq = $stream->next_seq()
 244  Function: returns the next sequence in the stream
 245  Returns : Bio::Seq object
 246  Args    :
 247
 248 =cut
 249
 250 sub next_seq {
 251     my ($self,@args) = @_;
 252     my $builder = $self->sequence_builder();
 253     my $seq;
 254     my %params;
 255
 256   RECORDSTART:
 257     while (1) {
 258         my $buffer;
 259         my (@acc, @features);
 260         my ($display_id, $annotation);
 261         my $species;
 262
 263         # initialize; we may come here because of starting over
 264         @features = ();
 265         $annotation = undef;
 266         @acc = ();
 267         $species = undef;
 268         %params = (-verbose => $self->verbose); # reset hash
 269         local($/) = "\n";
 270         while(defined($buffer = $self->_readline())) {
 271             last if index($buffer,'LOCUS       ') == 0;
 272         }
 273         return unless defined $buffer; # end of file
 274         $buffer =~ /^LOCUS\s+(\S.*)$/o ||
 275             $self->throw("GenBank stream with bad LOCUS line. Not GenBank in my book. Got '$buffer'");
 276
 277         my @tokens = split(' ', $1);
 278
 279         # this is important to have the id for display in e.g. FTHelper,
 280         # otherwise you won't know which entry caused an error
 281         $display_id = shift(@tokens);
 282         $params{'-display_id'} = $display_id;
 283         # may still be useful if we don't want the seq
 284     my $seqlength = shift(@tokens);
 285     if (exists $VALID_ALPHABET{$seqlength}) {
 286         # moved one token too far.  No locus name?
 287         $self->warn("Bad LOCUS name?  Changing [$params{'-display_id'}] to 'unknown' and length to $display_id");
 288         $params{'-display_id'} = 'unknown';
 289         $params{'-length'} = $display_id;
 290         # add token back...
 291         unshift @tokens, $seqlength;
 292     } else {
 293         $params{'-length'} = $seqlength;
 294     }
 295         # the alphabet of the entry
 296     # shouldn't assign alphabet unless one is specifically designated (such as for rc files)
 297     my $alphabet = lc(shift @tokens);
 298         $params{'-alphabet'} = (exists $VALID_ALPHABET{$alphabet}) ? $VALID_ALPHABET{$alphabet} :
 299                            $self->warn("Unknown alphabet: $alphabet");
 300         # for aa there is usually no 'molecule' (mRNA etc)
 301         if (($params{'-alphabet'} eq 'dna') || (@tokens > 2)) {
 302             $params{'-molecule'} = shift(@tokens);
 303             my $circ = shift(@tokens);
 304             if ($circ eq 'circular') {
 305             $params{'-is_circular'} = 1;
 306             $params{'-division'} = shift(@tokens);
 307             } else {
 308                         # 'linear' or 'circular' may actually be omitted altogether
 309             $params{'-division'} =
 310                 (CORE::length($circ) == 3 ) ? $circ : shift(@tokens);
 311             }
 312         } else {
 313             $params{'-molecule'} = 'PRT' if($params{'-alphabet'} eq 'aa');
 314             $params{'-division'} = shift(@tokens);
 315         }
 316         my $date = join(' ', @tokens); # we lump together the rest
 317
 318         # this is per request bug #1513
 319         # we can handle
 320         # 9-10-2003
 321         # 9-10-03
 322         # 09-10-2003
 323         # 09-10-03
 324         if($date =~ s/\s*((\d{1,2})-(\w{3})-(\d{2,4})).*/$1/) {
 325             if( length($date) < 11 ) {
 326             # improperly formatted date
 327             # But we'll be nice and fix it for them
 328             my ($d,$m,$y) = ($2,$3,$4);
 329             if( length($d) == 1 ) {
 330                 $d = "0$d";
 331             }
 332             # guess the century here
 333             if( length($y) == 2 ) {
 334                 if( $y > 60 ) { # arbitrarily guess that '60' means 1960
 335                     $y = "19$y";
 336                 } else {
 337                     $y = "20$y";
 338                 }
 339                 $self->warn("Date was malformed, guessing the century for $date to be $y\n");
 340             }
 341             $params{'-dates'} = [join('-',$d,$m,$y)];
 342         } else {
 343             $params{'-dates'} = [$date];
 344             }
 345         }
 346         # set them all at once
 347         $builder->add_slot_value(%params);
 348         %params = ();
 349
 350         # parse the rest if desired, otherwise start over
 351         if(! $builder->want_object()) {
 352             $builder->make_object();
 353             next RECORDSTART;
 354         }
 355
 356         # set up annotation depending on what the builder wants
 357         if($builder->want_slot('annotation')) {
 358             $annotation = Bio::Annotation::Collection->new();
 359         }
 360         $buffer = $self->_readline();
 361         until( !defined ($buffer) ) {
 362             $_ = $buffer;
 363             # Description line(s)
 364             if (/^DEFINITION\s+(\S.*\S)/) {
 365                 my @desc = ($1);
 366                 while ( defined($_ = $self->_readline) ) {
 367                     if( /^\s+(.*)/ ) { push (@desc, $1); next };
 368                     last;
 369                 }
 370                 $builder->add_slot_value(-desc => join(' ', @desc));
 371                                 # we'll continue right here because DEFINITION always comes
 372                                 # at the top of the entry
 373                 $buffer= $_;
 374             }
 375             # accession number (there can be multiple accessions)
 376             if( /^ACCESSION\s+(\S.*\S)/ ) {
 377                 push(@acc, split(/\s+/,$1));
 378                 while( defined($_ = $self->_readline) ) {
 379                     /^\s+(.*)/ && do { push (@acc, split(/\s+/,$1)); next };
 380                     last;
 381                 }
 382                 $buffer = $_;
 383                 next;
 384             }
 385             # PID
 386             elsif( /^PID\s+(\S+)/ ) {
 387                 $params{'-pid'} = $1;
 388             }
 389             # Version number
 390             elsif( /^VERSION\s+(\S.+)$/ ) {
 391                 my ($acc,$gi) = split(' ',$1);
 392                 if($acc =~ /^\w+\.(\d+)/) {
 393                     $params{'-version'} = $1;
 394                     $params{'-seq_version'} = $1;
 395                 }
 396                 if($gi && (index($gi,"GI:") == 0)) {
 397                     $params{'-primary_id'} = substr($gi,3);
 398                 }
 399             }
 400             # Keywords
 401             elsif( /^KEYWORDS\s+(\S.*)/ ) {
 402                 my @kw = split(/\s*\;\s*/,$1);
 403                 while( defined($_ = $self->_readline) ) {
 404                     chomp;
 405                     /^\s+(.*)/ && do { push (@kw, split(/\s*\;\s*/,$1)); next };
 406                     last;
 407                 }
 408
 409                 @kw && $kw[-1] =~ s/\.$//;
 410                 $params{'-keywords'} = \@kw;
 411                 $buffer = $_;
 412                 next;
 413             }
 414             # Organism name and phylogenetic information
 415             elsif (/^SOURCE\s+\S/) {
 416                 if($builder->want_slot('species')) {
 417                     $species = $self->_read_GenBank_Species(\$buffer);
 418                     $builder->add_slot_value(-species => $species);
 419                 } else {
 420                     while(defined($buffer = $self->_readline())) {
 421                         last if substr($buffer,0,1) ne ' ';
 422                     }
 423                 }
 424                 next;
 425             }
 426             # References
 427             elsif (/^REFERENCE\s+\S/) {
 428                 if($annotation) {
 429                     my @refs = $self->_read_GenBank_References(\$buffer);
 430                     foreach my $ref ( @refs ) {
 431                         $annotation->add_Annotation('reference',$ref);
 432                     }
 433                 } else {
 434                     while(defined($buffer = $self->_readline())) {
 435                         last if substr($buffer,0,1) ne ' ';
 436                     }
 437                 }
 438                 next;
 439             }
 440                 # Project
 441                 elsif (/^PROJECT\s+(\S.*)/) {
 442                         if ($annotation) {
 443                                 my $project = new Bio::Annotation::SimpleValue->new(-value => $1);
 444                                 $annotation->add_Annotation('project',$project);
 445                         }
 446                 }
 447             # Comments
 448             elsif (/^COMMENT\s+(\S.*)/) {
 449                 if($annotation) {
 450                     my $comment = $1;
 451                     while (defined($_ = $self->_readline)) {
 452                         last if (/^\S/);
 453                         $comment .= $_;
 454                     }
 455                     $comment =~ s/\n/ /g;
 456                     $comment =~ s/  +/ /g;
 457                     $annotation->add_Annotation('comment',
 458                                                 Bio::Annotation::Comment->new(-text => $comment,
 459                                                                               -tagname => 'comment'));
 460                     $buffer = $_;
 461                 } else {
 462                     while(defined($buffer = $self->_readline())) {
 463                         last if substr($buffer,0,1) ne ' ';
 464                     }
 465                 }
 466                 next;
 467             }
 468             # Corresponding Genbank nucleotide id, Genpept only
 469             elsif( /^DBSOURCE\s+(\S.+)/ ) {
 470                 if ($annotation) {
 471                     my $dbsource = $1;
 472                     while (defined($_ = $self->_readline)) {
 473                         last if (/^\S/);
 474                         $dbsource .= $_;
 475                     }
 476                                 # deal with UniProKB dbsources
 477                     if( $dbsource =~ s/(UniProtKB|swissprot):\s+locus\s+(\S+)\,.+\n// ) {
 478                         $annotation->add_Annotation
 479                             ('dblink',
 480                              Bio::Annotation::DBLink->new
 481                              (-primary_id => $2,
 482                               -database => $1,
 483                               -tagname => 'dblink'));
 484                         if( $dbsource =~ s/\s+created:\s+([^\.]+)\.\n// ) {
 485                             $annotation->add_Annotation
 486                                 ('swissprot_dates',
 487                                  Bio::Annotation::SimpleValue->new
 488                                  (-tagname => 'date_created',
 489                                   -value => $1));
 490                         }
 491                         while( $dbsource =~ s/\s+(sequence|annotation)\s+updated:\s+([^\.]+)\.\n//g ) {
 492                             $annotation->add_Annotation
 493                                 ('swissprot_dates',
 494                                  Bio::Annotation::SimpleValue->new
 495                                  (-tagname => 'date_updated',
 496                                   -value => $2));
 497                         }
 498                         $dbsource =~ s/\n/ /g;
 499                         if( $dbsource =~ s/\s+xrefs:\s+((?:\S+,\s+)+\S+)\s+xrefs/xrefs/ ) {
 500                             # will use $i to determine even or odd
 501                             # for swissprot the accessions are paired
 502                             my $i = 0;
 503                             for my $dbsrc ( split(/,\s+/,$1) ) {
 504                                 if( $dbsrc =~ /(\S+)\.(\d+)/ ||
 505                                     $dbsrc =~ /(\S+)/ ) {
 506                                     my ($id,$version) = ($1,$2);
 507                                     $version ='' unless defined $version;
 508                                     my $db;
 509                                     if( $id =~ /^\d\S{3}/) {
 510                                         $db = 'PDB';
 511                                     } else {
 512                                         $db = ($i++ % 2 ) ? 'GenPept' : 'GenBank';
 513                                     }
 514                                     $annotation->add_Annotation
 515                                         ('dblink',
 516                                          Bio::Annotation::DBLink->new
 517                                          (-primary_id => $id,
 518                                           -version => $version,
 519                                           -database => $db,
 520                                           -tagname => 'dblink'));
 521                                 }
 522                             }
 523                         } elsif( $dbsource =~ s/\s+xrefs:\s+(.+)\s+xrefs/xrefs/i ) {
 524                             # download screwed up and ncbi didn't put acc in for gi numbers
 525                             my $i = 0;
 526                             for my $id ( split(/\,\s+/,$1) ) {
 527                                 my ($acc,$db);
 528                                 if( $id =~ /gi:\s+(\d+)/ ) {
 529                                     $acc= $1;
 530                                     $db = ($i++ % 2 ) ? 'GenPept' : 'GenBank';
 531                                 } elsif( $id =~ /pdb\s+accession\s+(\S+)/ ) {
 532                                     $acc= $1;
 533                                     $db = 'PDB';
 534                                 } else {
 535                                     $acc= $id;
 536                                     $db = '';
 537                                 }
 538                                 $annotation->add_Annotation
 539                                     ('dblink',
 540                                      Bio::Annotation::DBLink->new
 541                                      (-primary_id => $acc,
 542                                       -database => $db,
 543                                       -tagname => 'dblink'));
 544                             }
 545                         } else {
 546                             $self->debug("Cannot match $dbsource\n");
 547                         }
 548                         if( $dbsource =~ s/xrefs\s+\(non\-sequence\s+databases\):\s+
 549                             ((?:\S+,\s+)+\S+)//x ) {
 550                             for my $id ( split(/\,\s+/,$1) ) {
 551                                 my $db;
 552                                 # this is because GenBank dropped the spaces!!!
 553                                 # I'm sure we're not going to get this right
 554                                 ##if( $id =~ s/^://i ) {
 555                                 ##    $db = $1;
 556                                 ##}
 557                                 $db = substr($id,0,index($id,':'));
 558                                 if (! exists $DBSOURCE{ $db }) {
 559                                       $db = '';   # do we want 'GenBank' here?
 560                                 }
 561                                 $id = substr($id,index($id,':')+1);
 562                                 $annotation->add_Annotation
 563                                     ('dblink',
 564                                      Bio::Annotation::DBLink->new
 565                                      (-primary_id => $id,
 566                                       -database => $db,
 567                                       -tagname => 'dblink'));
 568                             }
 569                         }
 570
 571                     } else {
 572                 if( $dbsource =~ /^(\S*?):?\s*accession\s+(\S+)\.(\d+)/ ) {
 573                     my ($db,$id,$version) = ($1,$2,$3);
 574                     $annotation->add_Annotation
 575                     ('dblink',
 576                      Bio::Annotation::DBLink->new
 577                      (-primary_id => $id,
 578                       -version => $version,
 579                       -database => $db || 'GenBank',
 580                       -tagname => 'dblink'));
 581                 } elsif ( $dbsource =~ /(\S+)\.(\d+)/ ) {
 582                     my ($id,$version) = ($1,$2);
 583                     $annotation->add_Annotation
 584                     ('dblink',
 585                      Bio::Annotation::DBLink->new
 586                      (-primary_id => $id,
 587                       -version => $version,
 588                       -database => 'GenBank',
 589                       -tagname => 'dblink'));
 590                 } else {
 591                     $self->warn("Unrecognized DBSOURCE data: $dbsource\n");
 592                 }
 593                     }
 594
 595                     $buffer = $_;
 596                 } else {
 597                     while(defined($buffer = $self->_readline())) {
 598                         last if substr($buffer,0,1) ne ' ';
 599                     }
 600                 }
 601                 next;
 602             }
 603             # Exit at start of Feature table, or start of sequence
 604             last if( /^(FEATURES|ORIGIN)/ );
 605             # Get next line and loop again
 606             $buffer = $self->_readline;
 607         }
 608         return unless defined $buffer;
 609
 610         # add them all at once for efficiency
 611         $builder->add_slot_value(-accession_number => shift(@acc),
 612                                  -secondary_accessions => \@acc,
 613                                  %params);
 614         $builder->add_slot_value(-annotation => $annotation) if $annotation;
 615         %params = (); # reset before possible re-use to avoid setting twice
 616
 617         # start over if we don't want to continue with this entry
 618         if(! $builder->want_object()) {
 619             $builder->make_object();
 620             next RECORDSTART;
 621         }
 622         # some "minimal" formats may not necessarily have a feature table
 623         if($builder->want_slot('features') && defined($_) && /^FEATURES/o) {
 624             # need to read the first line of the feature table
 625             $buffer = $self->_readline;
 626             # DO NOT read lines in the while condition -- this is done as a side
 627             # effect in _read_FTHelper_GenBank!
 628             while( defined($buffer) ) {
 629                                 # check immediately -- not at the end of the loop
 630                                 # note: GenPept entries obviously do not have a BASE line
 631                 last if( $buffer =~ /^BASE|ORIGIN|CONTIG|WGS/o);
 632
 633                                 # slurp in one feature at a time -- at return, the start of
 634                                 # the next feature will have been read already, so we need
 635                                 # to pass a reference, and the called method must set this
 636                                 # to the last line read before returning
 637
 638                 my $ftunit = $self->_read_FTHelper_GenBank(\$buffer);
 639
 640                                 # fix suggested by James Diggans
 641
 642                 if( !defined $ftunit ) {
 643                     # GRRRR. We have fallen over. Try to recover
 644                     $self->warn("Unexpected error in feature table for ".$params{'-display_id'}." Skipping feature, attempting to recover");
 645                     unless( ($buffer =~ /^\s{5,5}\S+/o) or
 646                             ($buffer =~ /^\S+/o)) {
 647                         $buffer = $self->_readline();
 648                     }
 649                     next;       # back to reading FTHelpers
 650                 }
 651
 652                                 # process ftunit
 653                 my $feat =
 654                     $ftunit->_generic_seqfeature($self->location_factory(),
 655                                                  $display_id);
 656                                 # add taxon_id from source if available
 657                 if($species && ($feat->primary_tag eq 'source') &&
 658                    $feat->has_tag('db_xref') && (! $species->ncbi_taxid() ||
 659                                                  ($species->ncbi_taxid && $species->ncbi_taxid =~ /^list/))) {
 660                     foreach my $tagval ($feat->get_tag_values('db_xref')) {
 661                         if(index($tagval,"taxon:") == 0) {
 662                             $species->ncbi_taxid(substr($tagval,6));
 663                             last;
 664                         }
 665                     }
 666                 }
 667                                 # add feature to list of features
 668                 push(@features, $feat);
 669             }
 670             $builder->add_slot_value(-features => \@features);
 671             $_ = $buffer;
 672         }
 673         if( defined ($_) ) {
 674             if( /^CONTIG/o ) {
 675                 my @contig;
 676                 while($_ !~ m{^//}) { # end of file
 677                     $_ =~ /^(?:CONTIG)?\s+(.*)/;
 678                     $annotation->add_Annotation(
 679                                                 Bio::Annotation::SimpleValue->new(-value   => $1,
 680                                                                                   -tagname => 'contig'));
 681                     $_ = $self->_readline;
 682                 }
 683                 $self->_pushback($_);
 684             } elsif( /^WGS|WGS_SCAFLD\s+/o ) { # catch WGS/WGS_SCAFLD lines
 685                 while($_ =~ s/(^WGS|WGS_SCAFLD)\s+//){ # gulp lines
 686                     chomp;
 687                     $annotation->add_Annotation(
 688                                                 Bio::Annotation::SimpleValue->new(-value => $_,
 689                                                                                   -tagname => lc($1)));
 690                     $_ = $self->_readline;
 691                 }
 692             } elsif(! m{^(ORIGIN|//)} ) { # advance to the sequence, if any
 693                 while (defined( $_ = $self->_readline) ) {
 694                     last if m{^(ORIGIN|//)};
 695                 }
 696             }
 697         }
 698         if(! $builder->want_object()) {
 699             $builder->make_object(); # implicit end-of-object
 700             next RECORDSTART;
 701         }
 702         if($builder->want_slot('seq')) {
 703             # the fact that we want a sequence does not necessarily mean that
 704             # there also is a sequence ...
 705             if(defined($_) && s/^ORIGIN\s+//) {
 706                 chomp;
 707                 if( $annotation && length($_) > 0 ) {
 708                     $annotation->add_Annotation('origin',
 709                                                 Bio::Annotation::SimpleValue->new(-tagname => 'origin',
 710                                                                                   -value => $_));
 711                 }
 712                 my $seqc = '';
 713                 while( defined($_ = $self->_readline) ) {
 714                     m{^//} && last;
 715                     $_ = uc($_);
 716                     s/[^A-Za-z]//g;
 717                     $seqc .= $_;
 718                 }
 719                 #$self->debug("sequence length is ". length($seqc) ."\n");
 720                 $builder->add_slot_value(-seq => $seqc);
 721             }
 722         } elsif ( defined($_) && (substr($_,0,2) ne '//')) {
 723             # advance to the end of the record
 724             while( defined($_ = $self->_readline) ) {
 725                 last if substr($_,0,2) eq '//';
 726             }
 727         }
 728         # Unlikely, but maybe the sequence is so weird that we don't want it
 729         # anymore. We don't want to return undef if the stream's not exhausted
 730         # yet.
 731         $seq = $builder->make_object();
 732         next RECORDSTART unless $seq;
 733         last RECORDSTART;
 734     }                           # end while RECORDSTART
 735
 736     return $seq;
 737 }
 738
 739 =head2 write_seq
 740
 741  Title   : write_seq
 742  Usage   : $stream->write_seq($seq)
 743  Function: writes the $seq object (must be seq) to the stream
 744  Returns : 1 for success and 0 for error
 745  Args    : array of 1 to n Bio::SeqI objects
 746
 747
 748 =cut
 749
 750 sub write_seq {
 751     my ($self,@seqs) = @_;
 752
 753     foreach my $seq ( @seqs ) {
 754         $self->throw("Attempting to write with no seq!") unless defined $seq;
 755
 756         if( ! ref $seq || ! $seq->isa('Bio::SeqI') ) {
 757             $self->warn(" $seq is not a SeqI compliant module. Attempting to dump, but may fail!");
 758         }
 759
 760         my $str = $seq->seq;
 761
 762         my ($div, $mol);
 763         my $len = $seq->length();
 764
 765         if ( $seq->can('division') ) {
 766             $div=$seq->division;
 767         }
 768         if( !defined $div || ! $div ) { $div = 'UNK'; }
 769         my $alpha = $seq->alphabet;
 770         if( !$seq->can('molecule') || ! defined ($mol = $seq->molecule()) ) {
 771             $mol =  $alpha || 'DNA';
 772         }
 773
 774         my $circular = 'linear  ';
 775         $circular = 'circular' if $seq->is_circular;
 776
 777         local($^W) = 0; # supressing warnings about uninitialized fields.
 778
 779         my $temp_line;
 780         if( $self->_id_generation_func ) {
 781             $temp_line = &{$self->_id_generation_func}($seq);
 782         } else {
 783             my $date = '';
 784             if( $seq->can('get_dates') ) {
 785                 ($date) = $seq->get_dates();
 786             }
 787
 788             $self->warn("No whitespace allowed in GenBank display id [". $seq->display_id. "]")
 789                 if $seq->display_id =~ /\s/;
 790
 791             $temp_line = sprintf ("%-12s%-15s%13s %s%4s%-8s%-8s %3s %-s",
 792                                   'LOCUS', $seq->id(),$len,
 793                                   (lc($alpha) eq 'protein') ? ('aa','', '') :
 794                                   ('bp', '',$mol),$circular,
 795                                   $div,$date);
 796         }
 797
 798         $self->_print("$temp_line\n");
 799         $self->_write_line_GenBank_regex("DEFINITION  ", "            ",
 800                                          $seq->desc(),"\\s\+\|\$",80);
 801
 802         # if there, write the accession line
 803
 804         if( $self->_ac_generation_func ) {
 805             $temp_line = &{$self->_ac_generation_func}($seq);
 806             $self->_print("ACCESSION   $temp_line\n");
 807         } else {
 808             my @acc = ();
 809             push(@acc, $seq->accession_number());
 810             if( $seq->isa('Bio::Seq::RichSeqI') ) {
 811                 push(@acc, $seq->get_secondary_accessions());
 812             }
 813             $self->_print("ACCESSION   ", join(" ", @acc), "\n");
 814             # otherwise - cannot print <sigh>
 815         }
 816
 817         # if PID defined, print it
 818         if($seq->isa('Bio::Seq::RichSeqI') && $seq->pid()) {
 819             $self->_print("PID         ", $seq->pid(), "\n");
 820         }
 821
 822         # if there, write the version line
 823
 824         if( defined $self->_sv_generation_func() ) {
 825             $temp_line = &{$self->_sv_generation_func}($seq);
 826             if( $temp_line ) {
 827                 $self->_print("VERSION     $temp_line\n");
 828             }
 829         } else {
 830             if($seq->isa('Bio::Seq::RichSeqI') && defined($seq->seq_version)) {
 831                 my $id = $seq->primary_id(); # this may be a GI number
 832                 $self->_print("VERSION     ",
 833                               $seq->accession_number(), ".", $seq->seq_version,
 834                               ($id && ($id =~ /^\d+$/) ? "  GI:".$id : ""),
 835                               "\n");
 836             }
 837         }
 838
 839     # if there, write the PROJECT line
 840         for my $proj ( $seq->annotation->get_Annotations('project') ) {
 841                 $self->_print("PROJECT     ".$proj->value."\n");
 842     }
 843
 844         # if there, write the DBSOURCE line
 845         foreach my $ref ( $seq->annotation->get_Annotations('dblink') ) {
 846             # if ($ref->comment eq 'DBSOURCE') {
 847         my $text = $ref->display_text(
 848             sub{($ref->database eq 'GenBank' ? '' : $_[0]->database.' ').
 849                 'accession '.$_[0]->primary_id});
 850             $self->_print("DBSOURCE    $text\n");
 851             # }
 852         }
 853
 854         # if there, write the keywords line
 855         if( defined $self->_kw_generation_func() ) {
 856             $temp_line = &{$self->_kw_generation_func}($seq);
 857             $self->_print("KEYWORDS    $temp_line\n");
 858         } else {
 859             if( $seq->can('keywords') ) {
 860                 my $kw = $seq->keywords;
 861                 $kw .= '.' if( $kw !~ /\.$/ );
 862                 $self->_print("KEYWORDS    $kw\n");
 863             }
 864         }
 865
 866         # SEGMENT if it exists
 867         foreach my $ref ( $seq->annotation->get_Annotations('segment') ) {
 868             $self->_print(sprintf ("%-11s %s\n",'SEGMENT',
 869                                    $ref->value));
 870         }
 871
 872         # Organism lines
 873         if (my $spec = $seq->species) {
 874             my ($on, $sn, $cn) = ($spec->can('organelle') ? $spec->organelle : '',
 875                                   $spec->scientific_name,
 876                                   $spec->common_name);
 877             my @classification;
 878         if ($spec->isa('Bio::Species')) {
 879             @classification = $spec->classification;
 880             shift(@classification);
 881         } else {
 882             # Bio::Taxon should have a DB handle of some type attached, so
 883             # derive the classification from that
 884             my $node = $spec;
 885             while ($node) {
 886                 $node = $node->ancestor || last;
 887                 unshift(@classification, $node->node_name);
 888                 #$node eq $root && last;
 889             }
 890             @classification = reverse @classification;
 891         }
 892             my $abname = $spec->name('abbreviated') ? # from genbank file
 893                 $spec->name('abbreviated')->[0] : $sn;
 894             my $sl = $on ? "$on "            : '';
 895             $sl   .= $cn ? $abname." ($cn)" : "$abname";
 896
 897             $self->_write_line_GenBank_regex("SOURCE      ", ' 'x12, $sl, "\\s\+\|\$",80);
 898             $self->_print("  ORGANISM  ", $spec->scientific_name, "\n");
 899             my $OC = join('; ', (reverse(@classification))) .'.';
 900             $self->_write_line_GenBank_regex(' 'x12,' 'x12,
 901                                              $OC,"\\s\+\|\$",80);
 902         }
 903
 904         # Reference lines
 905         my $count = 1;
 906         foreach my $ref ( $seq->annotation->get_Annotations('reference') ) {
 907             $temp_line = "REFERENCE   $count";
 908             if ($ref->start) {
 909                 $temp_line .= sprintf ("  (%s %d to %d)",
 910                                        ($seq->alphabet() eq "protein" ?
 911                                         "residues" : "bases"),
 912                                        $ref->start,$ref->end);
 913             } elsif ($ref->gb_reference) {
 914                 $temp_line .= sprintf ("  (%s)", $ref->gb_reference);
 915             }
 916             $self->_print("$temp_line\n");
 917             $self->_write_line_GenBank_regex("  AUTHORS   ",' 'x12,
 918                                              $ref->authors,"\\s\+\|\$",80);
 919             $self->_write_line_GenBank_regex("  CONSRTM   ",' 'x12,
 920                                              $ref->consortium,"\\s\+\|\$",80) if $ref->consortium;
 921             $self->_write_line_GenBank_regex("  TITLE     "," "x12,
 922                                              $ref->title,"\\s\+\|\$",80);
 923             $self->_write_line_GenBank_regex("  JOURNAL   "," "x12,
 924                                              $ref->location,"\\s\+\|\$",80);
 925             if( $ref->medline) {
 926                 $self->_write_line_GenBank_regex("  MEDLINE   "," "x12,
 927                                                  $ref->medline, "\\s\+\|\$",80);
 928                 # I am assuming that pubmed entries only exist when there
 929                 # are also MEDLINE entries due to the indentation
 930             }
 931             # This could be a wrong assumption
 932             if( $ref->pubmed ) {
 933                 $self->_write_line_GenBank_regex("   PUBMED   "," "x12,
 934                                                  $ref->pubmed, "\\s\+\|\$",
 935                                                  80);
 936             }
 937             # put remark at the end
 938             if ($ref->comment) {
 939                 $self->_write_line_GenBank_regex("  REMARK    "," "x12,
 940                                                  $ref->comment,"\\s\+\|\$",80);
 941             }
 942             $count++;
 943         }
 944
 945         # Comment lines
 946         foreach my $comment ( $seq->annotation->get_Annotations('comment') ) {
 947             $self->_write_line_GenBank_regex("COMMENT     "," "x12,
 948                                              $comment->text,"\\s\+\|\$",80);
 949         }
 950         $self->_print("FEATURES             Location/Qualifiers\n");
 951
 952         if( defined $self->_post_sort ) {
 953             # we need to read things into an array. Process. Sort them. Print 'em
 954
 955             my $post_sort_func = $self->_post_sort();
 956             my @fth;
 957
 958             foreach my $sf ( $seq->top_SeqFeatures ) {
 959                 push(@fth,Bio::SeqIO::FTHelper::from_SeqFeature($sf,$seq));
 960             }
 961
 962             @fth = sort { &$post_sort_func($a,$b) } @fth;
 963
 964             foreach my $fth ( @fth ) {
 965                 $self->_print_GenBank_FTHelper($fth);
 966             }
 967         } else {
 968             # not post sorted. And so we can print as we get them.
 969             # lower memory load...
 970
 971             foreach my $sf ( $seq->top_SeqFeatures ) {
 972                 my @fth = Bio::SeqIO::FTHelper::from_SeqFeature($sf,$seq);
 973                 foreach my $fth ( @fth ) {
 974                     if( ! $fth->isa('Bio::SeqIO::FTHelper') ) {
 975                         $sf->throw("Cannot process FTHelper... $fth");
 976                     }
 977                     $self->_print_GenBank_FTHelper($fth);
 978                 }
 979             }
 980         }
 981
 982         # deal with WGS; WGS_SCAFLD present only if WGS is also present
 983         if($seq->annotation->get_Annotations('wgs')) {
 984             foreach my $wgs
 985                 (map {$seq->annotation->get_Annotations($_)} qw(wgs wgs_scaffold)) {
 986                     $self->_print(sprintf ("%-11s %s\n",uc($wgs->tagname),
 987                                            $wgs->value));
 988                 }
 989             $self->_show_dna(0);
 990         }
 991         if($seq->annotation->get_Annotations('contig')) {
 992             my $ct = 0;
 993             my $cline;
 994             foreach my $contig ($seq->annotation->get_Annotations('contig')) {
 995                 unless ($ct) {
 996                     $cline = uc($contig->tagname)."      ".$contig->value."\n";
 997                 } else {
 998                     $cline = "            ".$contig->value."\n";
 999                 }
1000                 $self->_print($cline);
1001                 $ct++;
1002             }
1003             $self->_show_dna(0);
1004         }
1005         if( $seq->length == 0 ) { $self->_show_dna(0) }
1006
1007         if( $self->_show_dna() == 0 ) {
1008             $self->_print("\n//\n");
1009             return;
1010         }
1011
1012         # finished printing features.
1013
1014         $str =~ tr/A-Z/a-z/;
1015
1016         my ($o) = $seq->annotation->get_Annotations('origin');
1017         $self->_print(sprintf("%-12s%s\n",
1018                               'ORIGIN', $o ? $o->value : ''));
1019         # print out the sequence
1020         my $nuc = 60;           # Number of nucleotides per line
1021         my $whole_pat = 'a10' x 6; # Pattern for unpacking a whole line
1022         my $out_pat   = 'A11' x 6; # Pattern for packing a line
1023         my $length = length($str);
1024
1025         # Calculate the number of nucleotides which fit on whole lines
1026         my $whole = int($length / $nuc) * $nuc;
1027
1028         # Print the whole lines
1029         my $i;
1030         for ($i = 0; $i < $whole; $i += $nuc) {
1031             my $blocks = pack $out_pat,
1032             unpack $whole_pat,
1033             substr($str, $i, $nuc);
1034             chop $blocks;
1035             $self->_print(sprintf("%9d $blocks\n", $i + $nuc - 59));
1036         }
1037
1038         # Print the last line
1039         if (my $last = substr($str, $i)) {
1040             my $last_len = length($last);
1041             my $last_pat = 'a10' x int($last_len / 10) .
1042                 'a'. $last_len % 10;
1043             my $blocks = pack $out_pat,
1044             unpack($last_pat, $last);
1045             $blocks =~ s/ +$//;
1046             $self->_print(sprintf("%9d $blocks\n",
1047                                   $length - $last_len + 1));
1048         }
1049
1050         $self->_print("//\n");
1051
1052         $self->flush if $self->_flush_on_write && defined $self->_fh;
1053         return 1;
1054     }
1055 }
1056
1057 =head2 _print_GenBank_FTHelper
1058
1059  Title   : _print_GenBank_FTHelper
1060  Usage   :
1061  Function:
1062  Example :
1063  Returns :
1064  Args    :
1065
1066
1067 =cut
1068
1069 sub _print_GenBank_FTHelper {
1070     my ($self,$fth) = @_;
1071
1072     if( ! ref $fth || ! $fth->isa('Bio::SeqIO::FTHelper') ) {
1073         $fth->warn("$fth is not a FTHelper class. Attempting to print, but there could be tears!");
1074     }
1075     my $spacer = (length $fth->key >= 15) ? ' ' : '';
1076     $self->_write_line_GenBank_regex(sprintf("     %-16s%s",$fth->key,$spacer),
1077                                      " "x21,
1078                                      $fth->loc,"\,\|\$",80);
1079     foreach my $tag ( keys %{$fth->field} ) {
1080         foreach my $value ( @{$fth->field->{$tag}} ) {
1081             $value =~ s/\"/\"\"/g;
1082             if ($value eq "_no_value") {
1083                 $self->_write_line_GenBank_regex(" "x21,
1084                                                  " "x21,
1085                                                  "/$tag","\.\|\$",80);
1086             }
1087             # there are almost 3x more quoted qualifier values and they
1088             # are more common too so we take quoted ones first
1089             elsif (!$FTQUAL_NO_QUOTE{$tag}) {
1090                 my ($pat) = ($value =~ /\s/ ? '\s|$' : '.|$');
1091                 $self->_write_line_GenBank_regex(" "x21,
1092                                                  " "x21,
1093                                                  "/$tag=\"$value\"",$pat,80);
1094
1095             } else {
1096                 $self->_write_line_GenBank_regex(" "x21,
1097                                                  " "x21,
1098                                                  "/$tag=$value","\.\|\$",80);
1099             }
1100         }
1101     }
1102 }
1103
1104 =head2 _read_GenBank_References
1105
1106  Title   : _read_GenBank_References
1107  Usage   :
1108  Function: Reads references from GenBank format. Internal function really
1109  Returns :
1110  Args    :
1111
1112 =cut
1113
1114 sub _read_GenBank_References {
1115     my ($self,$buffer) = @_;
1116     my (@refs);
1117     my $ref;
1118
1119     # assumme things are starting with RN
1120
1121     if( $$buffer !~ /^REFERENCE/ ) {
1122         warn("Not parsing line '$$buffer' which maybe important");
1123     }
1124
1125     $_ = $$buffer;
1126
1127     my (@title,@loc,@authors,@consort,@com,@medline,@pubmed);
1128
1129   REFLOOP: while( defined($_) || defined($_ = $self->_readline) ) {
1130       if (/^\s{2}AUTHORS\s+(.*)/o) {
1131           push (@authors, $1);
1132           while ( defined($_ = $self->_readline) ) {
1133               /^\s{9,}(.*)/o && do { push (@authors, $1);next;};
1134               last;
1135           }
1136           $ref->authors(join(' ', @authors));
1137       }
1138       if (/^\s{2}CONSRTM\s+(.*)/o) {
1139           push (@consort, $1);
1140           while ( defined($_ = $self->_readline) ) {
1141               /^\s{9,}(.*)/o && do { push (@consort, $1);next;};
1142               last;
1143           }
1144           $ref->consortium(join(' ', @consort));
1145       }
1146       if (/^\s{2}TITLE\s+(.*)/o)  {
1147           push (@title, $1);
1148           while ( defined($_ = $self->_readline) ) {
1149               /^\s{9,}(.*)/o && do { push (@title, $1);
1150                                      next;
1151                                  };
1152               last;
1153           }
1154           $ref->title(join(' ', @title));
1155       }
1156       if (/^\s{2}JOURNAL\s+(.*)/o) {
1157           push(@loc, $1);
1158           while ( defined($_ = $self->_readline) ) {
1159               # we only match when there are at least 4 spaces
1160               # there is probably a better way to match this
1161               # as it assumes that the describing tag is short enough
1162               /^\s{9,}(.*)/o && do { push(@loc, $1);
1163                                      next;
1164                                  };
1165               last;
1166           }
1167           $ref->location(join(' ', @loc));
1168           redo REFLOOP;
1169       }
1170       if (/^\s{2}REMARK\s+(.*)/o) {
1171           push (@com, $1);
1172           while ( defined($_ = $self->_readline) ) {
1173               /^\s{9,}(.*)/o && do { push(@com, $1);
1174                                      next;
1175                                  };
1176               last;
1177           }
1178           $ref->comment(join(' ', @com));
1179           redo REFLOOP;
1180       }
1181       if( /^\s{2}MEDLINE\s+(.*)/ ) {
1182           push(@medline,$1);
1183           while ( defined($_ = $self->_readline) ) {
1184               /^\s{9,}(.*)/ && do { push(@medline, $1);
1185                                     next;
1186                                 };
1187               last;
1188           }
1189           $ref->medline(join(' ', @medline));
1190           redo REFLOOP;
1191       }
1192       if( /^\s{3}PUBMED\s+(.*)/ ) {
1193           push(@pubmed,$1);
1194           while ( defined($_ = $self->_readline) ) {
1195               /^\s{9,}(.*)/ && do { push(@pubmed, $1);
1196                                     next;
1197                                 };
1198               last;
1199           }
1200           $ref->pubmed(join(' ', @pubmed));
1201           redo REFLOOP;
1202       }
1203
1204       /^REFERENCE/o && do {
1205           # store current reference
1206           $self->_add_ref_to_array(\@refs,$ref) if defined $ref;
1207           # reset
1208           @authors = ();
1209           @title = ();
1210           @loc = ();
1211           @com = ();
1212           @pubmed = ();
1213           @medline = ();
1214           # create the new reference object
1215           $ref = Bio::Annotation::Reference->new(-tagname => 'reference');
1216           # check whether start and end base is given
1217           if (/^REFERENCE\s+\d+\s+\([a-z]+ (\d+) to (\d+)\)/){
1218               $ref->start($1);
1219               $ref->end($2);
1220           } elsif (/^REFERENCE\s+\d+\s+\((.*)\)/) {
1221               $ref->gb_reference($1);
1222           }
1223       };
1224
1225       /^(FEATURES)|(COMMENT)/o && last;
1226
1227       $_ = undef;              # Empty $_ to trigger read of next line
1228   }
1229
1230     # store last reference
1231     $self->_add_ref_to_array(\@refs,$ref) if defined $ref;
1232
1233     $$buffer = $_;
1234
1235     #print "\nnumber of references found: ", $#refs+1,"\n";
1236
1237     return @refs;
1238 }
1239
1240 #
1241 # This is undocumented as it shouldn't be called by anywhere else as
1242 # read_GenBank_References. For those who still want to know:
1243 #
1244 # Purpose: adds a Reference object to an array of Reference objects, takes
1245 #     care of possible cleanups to be done (currently, only author and title
1246 #     will be chopped of trailing semicolons).
1247 # Parameters:
1248 #     a reference to an array of Reference objects
1249 #     the Reference object to be added
1250 # Returns: nothing
1251 #
1252 sub _add_ref_to_array {
1253     my ($self, $refs, $ref) = @_;
1254
1255     # first, polish author and title by removing possible trailing semicolons
1256     my $au = $ref->authors();
1257     my $title = $ref->title();
1258     $au =~ s/;\s*$//g if $au;
1259     $title =~ s/;\s*$//g if $title;
1260     $ref->authors($au);
1261     $ref->title($title);
1262     # the rest should be clean already, so go ahead and add it
1263     push(@{$refs}, $ref);
1264 }
1265
1266 =head2 _read_GenBank_Species
1267
1268  Title   : _read_GenBank_Species
1269  Usage   :
1270  Function: Reads the GenBank Organism species and classification
1271            lines. Able to deal with unconvential Organism naming
1272            formats, and varietas in plants
1273  Example : ORGANISM  unknown marine gamma proteobacterium NOR5
1274            $genus = undef
1275            $species = unknown marine gamma proteobacterium NOR5
1276
1277            ORGANISM  Drosophila sp. 'white tip scutellum'
1278            $genus = Drosophila
1279            $species = sp. 'white tip scutellum'
1280            (yes, this really is a species and that is its name)
1281            $subspecies = undef
1282
1283            ORGANISM  Ajellomyces capsulatus var. farciminosus
1284            $genus = Ajellomyces
1285            $species = capsulatus
1286            $subspecies = var. farciminosus
1287
1288            ORGANISM  Hepatitis delta virus
1289            $genus = undef (though this virus has a genus in its lineage, we
1290                            cannot know that without a database lookup)
1291            $species = Hepatitis delta virus
1292
1293  Returns : A Bio::Species object
1294  Args    : A reference to the current line buffer
1295
1296 =cut
1297
1298 sub _read_GenBank_Species {
1299     my ($self, $buffer) = @_;
1300
1301     my @unkn_names = ('other', 'unknown organism', 'not specified', 'not shown',
1302                       'Unspecified', 'Unknown', 'None', 'unclassified',
1303                       'unidentified organism', 'not supplied');
1304     # dictionary of synonyms for taxid 32644
1305     my @unkn_genus = ('unknown','unclassified','uncultured','unidentified');
1306     # all above can be part of valid species name
1307
1308     my $line = $$buffer;
1309
1310     my( $sub_species, $species, $genus, $sci_name, $common, $class_lines,
1311         $source_flag, $abbr_name, $organelle, $sl );
1312     my %source = map { $_ => 1 } qw(SOURCE ORGANISM CLASSIFICATION);
1313     # upon first entering the loop, we must not read a new line -- the SOURCE
1314     # line is already in the buffer (HL 05/10/2000)
1315     my ($ann, $tag, $data);
1316     while (defined($line) || defined($line = $self->_readline())) {
1317         # de-HTMLify (links that may be encountered here don't contain
1318         # escaped '>', so a simple-minded approach suffices)
1319         $line =~ s{<[^>]+>}{}g;
1320         if ($line =~ m{^(?:\s{0,2})(\w+)\s+(.+)?$}ox) {
1321             ($tag, $data) = ($1, $2 || '');
1322             last if ($tag && !exists $source{$tag});
1323         } else {
1324             return unless $tag;
1325             ($data = $line) =~ s{^\s+}{};
1326             chomp $data;
1327             $tag = 'CLASSIFICATION' if ($tag ne 'CLASSIFICATION' && $tag eq 'ORGANISM' &&  $line =~ m{[;\.]+});
1328         }
1329         (exists $ann->{$tag}) ? ($ann->{$tag} .= ' '.$data) : ($ann->{$tag} .= $data);
1330         $line = undef;
1331     }
1332
1333     ($sl, $class_lines, $sci_name) = ($ann->{SOURCE}, $ann->{CLASSIFICATION}, $ann->{ORGANISM});
1334
1335     $$buffer = $line;
1336
1337     $sci_name || return;
1338
1339     # parse out organelle, common name, abbreviated name if present;
1340     # this should catch everything, but falls back to
1341     # entire SOURCE line just in case
1342     if ($sl =~ m{^
1343                      (mitochondrion|chloroplast|plastid)?
1344                      \s*(.*?)
1345                      \s*(?: \( (.*?) \) )?\.?
1346                      $
1347                  }xms ){
1348         ($organelle, $abbr_name, $common) = ($1, $2, $3); # optional
1349     } else {
1350         $abbr_name = $sl;       # nothing caught; this is a backup!
1351     }
1352
1353     # Convert data in classification lines into classification array.
1354     # only split on ';' or '.' so that classification that is 2 or more words will
1355     # still get matched, use map() to remove trailing/leading/intervening spaces
1356     my @class = map { s/^\s+//; s/\s+$//; s/\s{2,}/ /g; $_; } split /(?<!subgen)[;\.]+/, $class_lines;
1357
1358     # do we have a genus?
1359     my $possible_genus =  quotemeta($class[-1])
1360        . ($class[-2] ? "|" . quotemeta($class[-2]) : '');
1361     if ($sci_name =~ /^($possible_genus)/) {
1362         $genus = $1;
1363         ($species) = $sci_name =~ /^$genus\s+(.+)/;
1364     }
1365     else {
1366         $species = $sci_name;
1367     }
1368
1369     # is this organism of rank species or is it lower?
1370     # (we don't catch everything lower than species, but it doesn't matter -
1371     # this is just so we abide by previous behaviour whilst not calling a
1372     # species a subspecies)
1373     if ($species && $species =~ /subsp\.|var\./) {
1374         ($species, $sub_species) = $species =~ /(.+)\s+((?:subsp\.|var\.).+)/;
1375     }
1376
1377     # Don't make a species object if it's empty or "Unknown" or "None"
1378     # return unless $genus and  $genus !~ /^(Unknown|None)$/oi;
1379     # Don't make a species object if it belongs to taxid 32644
1380 #       my $unkn = grep { $_ =~ /^\Q$sl\E$/; } @unkn_names;
1381     my $unkn = grep { $_ eq $sl } @unkn_names;
1382     return unless ($species || $genus) and $unkn == 0;
1383
1384     # Bio::Species array needs array in Species -> Kingdom direction
1385     push(@class, $sci_name);
1386     @class = reverse @class;
1387
1388     my $make = Bio::Species->new();
1389     $make->scientific_name($sci_name);
1390     $make->classification(@class) if @class > 0;
1391     $make->common_name( $common ) if $common;
1392     $make->name('abbreviated', $abbr_name) if $abbr_name;
1393     $make->organelle($organelle) if $organelle;
1394     #$make->sub_species( $sub_species ) if $sub_species;
1395     return $make;
1396 }
1397
1398 =head2 _read_FTHelper_GenBank
1399
1400  Title   : _read_FTHelper_GenBank
1401  Usage   : _read_FTHelper_GenBank($buffer)
1402  Function: reads the next FT key line
1403  Example :
1404  Returns : Bio::SeqIO::FTHelper object
1405  Args    : filehandle and reference to a scalar
1406
1407 =cut
1408
1409 sub _read_FTHelper_GenBank {
1410     my ($self,$buffer) = @_;
1411
1412     my ($key,                   # The key of the feature
1413         $loc                    # The location line from the feature
1414         );
1415     my @qual = ();        # An array of lines making up the qualifiers
1416
1417     if ($$buffer =~ /^\s{5}(\S+)\s+(.+?)\s*$/o) {
1418         $key = $1;
1419         $loc = $2;
1420         # Read all the lines up to the next feature
1421         while ( defined($_ = $self->_readline) ) {
1422             if (/^(\s+)(.+?)\s*$/o) {
1423                                 # Lines inside features are preceded by 21 spaces
1424                                 # A new feature is preceded by 5 spaces
1425                 if (length($1) > 6) {
1426                     # Add to qualifiers if we're in the qualifiers, or if it's
1427                     # the first qualifier
1428                     if (@qual || (index($2,'/') == 0)) {
1429                         push(@qual, $2);
1430                     }
1431                     # We're still in the location line, so append to location
1432                     else {
1433                         $loc .= $2;
1434                     }
1435                 } else {
1436                     # We've reached the start of the next feature
1437                     last;
1438                 }
1439             } else {
1440                                 # We're at the end of the feature table
1441                 last;
1442             }
1443         }
1444     } else {
1445         # No feature key
1446         $self->debug("no feature key!\n");
1447         # change suggested by JDiggans to avoid infinite loop-
1448         # see bugreport 1062.
1449         # reset buffer to prevent infinite loop
1450         $$buffer = $self->_readline();
1451         return;
1452     }
1453
1454     # Put the first line of the next feature into the buffer
1455     $$buffer = $_;
1456
1457     # Make the new FTHelper object
1458     my $out = Bio::SeqIO::FTHelper->new();
1459     $out->verbose($self->verbose());
1460     $out->key($key);
1461     $out->loc($loc);
1462
1463     # Now parse and add any qualifiers.  (@qual is kept
1464     # intact to provide informative error messages.)
1465   QUAL:
1466     for (my $i = 0; $i < @qual; $i++) {
1467         $_ = $qual[$i];
1468         my( $qualifier, $value ) = (m{^/([^=]+)(?:=(.+))?})
1469             or $self->warn("cannot see new qualifier in feature $key: ".
1470                            $qual[$i]);
1471         $qualifier = '' unless( defined $qualifier);
1472         if (defined $value) {
1473             # Do we have a quoted value?
1474             if (substr($value, 0, 1) eq '"') {
1475                                 # Keep adding to value until we find the trailing quote
1476                                 # and the quotes are balanced
1477                 while ($value !~ /\"$/ or $value =~ tr/"/"/ % 2) {
1478                     if($i >= $#qual) {
1479                         $self->warn("Unbalanced quote in:\n" .
1480                                     join("\n", @qual) .
1481                                     "No further qualifiers will " .
1482                                     "be added for this feature");
1483                         last QUAL;
1484                     }
1485                     $i++; # modifying a for-loop variable inside of the loop
1486                     # is not the best programming style ...
1487                     my $next = $qual[$i];
1488
1489                     # add to value with a space unless the value appears
1490                     # to be a sequence (translation for example)
1491                     # if(($value.$next) =~ /[^A-Za-z\"\-]/o) {
1492                     # changed to explicitly look for translation tag - cjf 06/8/29
1493                     if ($qualifier ne 'translation') {
1494                         $value .= " ";
1495                     }
1496                     $value .= $next;
1497                 }
1498                                 # Trim leading and trailing quotes
1499                 $value =~ s/^"|"$//g;
1500                                 # Undouble internal quotes
1501                 $value =~ s/""/\"/g;
1502             } elsif ( $value =~ /^\(/ ) { # values quoted by ()s
1503                                 # Keep adding to value until we find the trailing bracket
1504                                 # and the ()s are balanced
1505                 my $left = ($value =~ tr/\(/\(/); # count left parens
1506                 my $right = ($value =~ tr/\)/\)/); # count right parens
1507
1508                 while( $left != $right ) { # was "$value !~ /\)$/ or $left != $right"
1509                     if( $i >= $#qual) {
1510                         $self->warn("Unbalanced parens in:\n".
1511                                     join("\n", @qual).
1512                                     "\nNo further qualifiers will ".
1513                                     "be added for this feature");
1514                         last QUAL;
1515                     }
1516                     $i++;
1517                     my $next = $qual[$i];
1518                     $value .= $next;
1519                     $left += ($next =~ tr/\(/\(/);
1520                     $right += ($next =~ tr/\)/\)/);
1521                 }
1522             }
1523         } else {
1524             $value = '_no_value';
1525         }
1526         # Store the qualifier
1527         $out->field->{$qualifier} ||= [];
1528         push(@{$out->field->{$qualifier}},$value);
1529     }
1530     return $out;
1531 }
1532
1533 =head2 _write_line_GenBank
1534
1535  Title   : _write_line_GenBank
1536  Usage   :
1537  Function: internal function
1538  Example :
1539  Returns :
1540  Args    :
1541
1542 =cut
1543
1544 sub _write_line_GenBank {
1545     my ($self,$pre1,$pre2,$line,$length) = @_;
1546
1547     $length || $self->throw("Miscalled write_line_GenBank without length. Programming error!");
1548     my $subl = $length - length $pre2;
1549     my $linel = length $line;
1550     my $i;
1551
1552     my $subr = substr($line,0,$length - length $pre1);
1553
1554     $self->_print("$pre1$subr\n");
1555     for($i= ($length - length $pre1);$i < $linel;  $i += $subl) {
1556         $subr = substr($line,$i,$subl);
1557         $self->_print("$pre2$subr\n");
1558     }
1559
1560 }
1561
1562 =head2 _write_line_GenBank_regex
1563
1564  Title   : _write_line_GenBank_regex
1565  Usage   :
1566  Function: internal function for writing lines of specified
1567            length, with different first and the next line
1568            left hand headers and split at specific points in the
1569            text
1570  Example :
1571  Returns : nothing
1572  Args    : file handle,
1573            first header,
1574            second header,
1575            text-line,
1576            regex for line breaks,
1577            total line length
1578
1579
1580 =cut
1581
1582 sub _write_line_GenBank_regex {
1583     my ($self,$pre1,$pre2,$line,$regex,$length) = @_;
1584
1585     #print STDOUT "Going to print with $line!\n";
1586
1587     $length || $self->throw( "Miscalled write_line_GenBank without length. Programming error!");
1588
1589     my $subl = $length - (length $pre1) - 2;
1590     my @lines = ();
1591
1592   CHUNK: while($line) {
1593       foreach my $pat ($regex, '[,;\.\/-]\s|'.$regex, '[,;\.\/-]|'.$regex) {
1594           if($line =~ m/^(.{1,$subl})($pat)(.*)/ ) {
1595               my $l = $1.$2;
1596               my $newl = $3;
1597               $line = substr($line,length($l));
1598               # be strict about not padding spaces according to
1599               # genbank format
1600               $l =~ s/\s+$//;
1601               push(@lines, $l);
1602               next CHUNK;
1603           }
1604       }
1605       # if we get here none of the patterns matched $subl or less chars
1606       $self->warn("trouble dissecting \"$line\"\n     into chunks ".
1607                   "of $subl chars or less - this tag won't print right");
1608       # insert a space char to prevent infinite loops
1609       $line = substr($line,0,$subl) . " " . substr($line,$subl);
1610   }
1611     my $s = shift @lines;
1612     $self->_print("$pre1$s\n") if $s;
1613     foreach my $s ( @lines ) {
1614         $self->_print("$pre2$s\n");
1615     }
1616 }
1617
1618 =head2 _post_sort
1619
1620  Title   : _post_sort
1621  Usage   : $obj->_post_sort($newval)
1622  Function:
1623  Returns : value of _post_sort
1624  Args    : newvalue (optional)
1625
1626
1627 =cut
1628
1629 sub _post_sort {
1630     my ($obj,$value) = @_;
1631     if( defined $value) {
1632         $obj->{'_post_sort'} = $value;
1633     }
1634     return $obj->{'_post_sort'};
1635 }
1636
1637
1638 =head2 _show_dna
1639
1640  Title   : _show_dna
1641  Usage   : $obj->_show_dna($newval)
1642  Function:
1643  Returns : value of _show_dna
1644  Args    : newvalue (optional)
1645
1646
1647 =cut
1648
1649 sub _show_dna {
1650     my ($obj,$value) = @_;
1651     if( defined $value) {
1652         $obj->{'_show_dna'} = $value;
1653     }
1654     return $obj->{'_show_dna'};
1655 }
1656
1657 =head2 _id_generation_func
1658
1659  Title   : _id_generation_func
1660  Usage   : $obj->_id_generation_func($newval)
1661  Function:
1662  Returns : value of _id_generation_func
1663  Args    : newvalue (optional)
1664
1665
1666 =cut
1667
1668 sub _id_generation_func {
1669     my ($obj,$value) = @_;
1670     if( defined $value ) {
1671         $obj->{'_id_generation_func'} = $value;
1672     }
1673     return $obj->{'_id_generation_func'};
1674 }
1675
1676
1677 =head2 _ac_generation_func
1678
1679  Title   : _ac_generation_func
1680  Usage   : $obj->_ac_generation_func($newval)
1681  Function:
1682  Returns : value of _ac_generation_func
1683  Args    : newvalue (optional)
1684
1685 =cut
1686
1687 sub _ac_generation_func {
1688     my ($obj,$value) = @_;
1689     if( defined $value ) {
1690         $obj->{'_ac_generation_func'} = $value;
1691     }
1692     return $obj->{'_ac_generation_func'};
1693 }
1694
1695
1696 =head2 _sv_generation_func
1697
1698  Title   : _sv_generation_func
1699  Usage   : $obj->_sv_generation_func($newval)
1700  Function:
1701  Returns : value of _sv_generation_func
1702  Args    : newvalue (optional)
1703
1704
1705 =cut
1706
1707 sub _sv_generation_func {
1708     my ($obj,$value) = @_;
1709     if( defined $value ) {
1710         $obj->{'_sv_generation_func'} = $value;
1711     }
1712     return $obj->{'_sv_generation_func'};
1713
1714 }
1715
1716
1717 =head2 _kw_generation_func
1718
1719  Title   : _kw_generation_func
1720  Usage   : $obj->_kw_generation_func($newval)
1721  Function:
1722  Returns : value of _kw_generation_func
1723  Args    : newvalue (optional)
1724
1725 =cut
1726
1727
1728 sub _kw_generation_func {
1729     my ($obj,$value) = @_;
1730     if( defined $value ) {
1731         $obj->{'_kw_generation_func'} = $value;
1732     }
1733     return $obj->{'_kw_generation_func'};
1734 }
1735
1736 1;