Bio/SeqIO/genbank.pm

   1 #
   2 # BioPerl module for Bio::SeqIO::genbank
   3 #
   4 # Please direct questions and support issues to <bioperl-l@bioperl.org>
   5 #
   6 # Cared for by Bioperl project bioperl-l(at)bioperl.org
   7 #
   8 # Copyright Elia Stupka and contributors see AUTHORS section
   9 #
  10 # You may distribute this module under the same terms as perl itself
  11
  12 # POD documentation - main docs before the code
  13
  14 =head1 NAME
  15
  16 Bio::SeqIO::genbank - GenBank sequence input/output stream
  17
  18 =head1 SYNOPSIS
  19
  20 It is probably best not to use this object directly, but
  21 rather go through the SeqIO handler:
  22
  23     $stream = Bio::SeqIO->new(-file => $filename,
  24                               -format => 'GenBank');
  25
  26     while ( my $seq = $stream->next_seq() ) {
  27             # do something with $seq
  28     }
  29
  30
  31 =head1 DESCRIPTION
  32
  33 This object can transform Bio::Seq objects to and from GenBank flat
  34 file databases.
  35
  36 There is some flexibility here about how to write GenBank output
  37 that is not fully documented.
  38
  39 =head2 Optional functions
  40
  41 =over 3
  42
  43 =item _show_dna()
  44
  45 (output only) shows the dna or not
  46
  47 =item _post_sort()
  48
  49 (output only) provides a sorting func which is applied to the FTHelpers
  50 before printing
  51
  52 =item _id_generation_func()
  53
  54 This is function which is called as
  55
  56    print "ID   ", $func($seq), "\n";
  57
  58 To generate the ID line. If it is not there, it generates a sensible ID
  59 line using a number of tools.
  60
  61 If you want to output annotations in Genbank format they need to be
  62 stored in a Bio::Annotation::Collection object which is accessible
  63 through the Bio::SeqI interface method L<annotation()|annotation>.
  64
  65 The following are the names of the keys which are pulled from a
  66 L<Bio::Annotation::Collection> object:
  67
  68  reference       - Should contain Bio::Annotation::Reference objects
  69  comment         - Should contain Bio::Annotation::Comment objects
  70  dblink          - Should contain a Bio::Annotation::DBLink object
  71  segment         - Should contain a Bio::Annotation::SimpleValue object
  72  origin          - Should contain a Bio::Annotation::SimpleValue object
  73  wgs             - Should contain a Bio::Annotation::SimpleValue object
  74
  75 =back
  76
  77 =head1 Where does the data go?
  78
  79 Data parsed in Bio::SeqIO::genbank is stored in a variety of data
  80 fields in the sequence object that is returned. Here is a partial list
  81 of fields.
  82
  83 Items listed as RichSeq or Seq or PrimarySeq and then NAME() tell you
  84 the top level object which defines a function called NAME() which
  85 stores this information.
  86
  87 Items listed as Annotation 'NAME' tell you the data is stored the
  88 associated Bio::AnnotationCollectionI object which is associated with
  89 Bio::Seq objects.  If it is explicitly requested that no annotations
  90 should be stored when parsing a record of course they will not be
  91 available when you try and get them.  If you are having this problem
  92 look at the type of SeqBuilder that is being used to contruct your
  93 sequence object.
  94
  95  Comments             Annotation 'comment'
  96  References           Annotation 'reference'
  97  Segment              Annotation 'segment'
  98  Origin               Annotation 'origin'
  99  Dbsource             Annotation 'dblink'
 100
 101  Accessions           PrimarySeq accession_number()
 102  Secondary accessions RichSeq get_secondary_accessions()
 103  GI number            PrimarySeq primary_id()
 104  LOCUS                PrimarySeq display_id()
 105  Keywords             RichSeq get_keywords()
 106  Dates                RichSeq get_dates()
 107  Molecule             RichSeq molecule()
 108  Seq Version          RichSeq seq_version()
 109  PID                  RichSeq pid()
 110  Division             RichSeq division()
 111  Features             Seq get_SeqFeatures()
 112  Alphabet             PrimarySeq alphabet()
 113  Definition           PrimarySeq description() or desc()
 114  Version              PrimarySeq version()
 115
 116  Sequence             PrimarySeq seq()
 117
 118 There is more information in the Feature-Annotation HOWTO about each
 119 field and how it is mapped to the Sequence object
 120 L<http://bioperl.open-bio.org/wiki/HOWTO:Feature-Annotation>.
 121
 122 =head1 FEEDBACK
 123
 124 =head2 Mailing Lists
 125
 126 User feedback is an integral part of the evolution of this and other
 127 Bioperl modules. Send your comments and suggestions preferably to one
 128 of the Bioperl mailing lists.  Your participation is much appreciated.
 129
 130   bioperl-l@bioperl.org                  - General discussion
 131   http://bioperl.org/wiki/Mailing_lists  - About the mailing lists
 132
 133 =head2 Support
 134
 135 Please direct usage questions or support issues to the mailing list:
 136
 137 I<bioperl-l@bioperl.org>
 138
 139 rather than to the module maintainer directly. Many experienced and
 140 reponsive experts will be able look at the problem and quickly
 141 address it. Please include a thorough description of the problem
 142 with code and data examples if at all possible.
 143
 144 =head2 Reporting Bugs
 145
 146 Report bugs to the Bioperl bug tracking system to help us keep track
 147 the bugs and their resolution. Bug reports can be submitted via the web:
 148
 149   https://redmine.open-bio.org/projects/bioperl/
 150
 151 =head1 AUTHOR - Bioperl Project
 152
 153 bioperl-l at bioperl.org
 154
 155 Original author Elia Stupka, elia -at- tigem.it
 156
 157 =head1 CONTRIBUTORS
 158
 159 Ewan Birney birney at ebi.ac.uk
 160 Jason Stajich jason at bioperl.org
 161 Chris Mungall cjm at fruitfly.bdgp.berkeley.edu
 162 Lincoln Stein lstein at cshl.org
 163 Heikki Lehvaslaiho, heikki at ebi.ac.uk
 164 Hilmar Lapp, hlapp at gmx.net
 165 Donald G. Jackson, donald.jackson at bms.com
 166 James Wasmuth, james.wasmuth at ed.ac.uk
 167 Brian Osborne, bosborne at alum.mit.edu
 168 Chris Fields, cjfields at bioperl dot org
 169
 170 =head1 APPENDIX
 171
 172 The rest of the documentation details each of the object
 173 methods. Internal methods are usually preceded with a _
 174
 175 =cut
 176
 177 # Let the code begin...
 178
 179 package Bio::SeqIO::genbank;
 180 use strict;
 181
 182 use Bio::SeqIO::FTHelper;
 183 use Bio::SeqFeature::Generic;
 184 use Bio::Species;
 185 use Bio::Seq::SeqFactory;
 186 use Bio::Annotation::Collection;
 187 use Bio::Annotation::Comment;
 188 use Bio::Annotation::Reference;
 189 use Bio::Annotation::DBLink;
 190
 191 use base qw(Bio::SeqIO);
 192
 193 # Note that a qualifier that exceeds one line (i.e. a long label) will
 194 # automatically be quoted regardless:
 195
 196 our $FTQUAL_LINE_LENGTH = 60;
 197
 198 our %FTQUAL_NO_QUOTE = map {$_ => 1} qw(
 199     anticodon           citation
 200     codon               codon_start
 201     cons_splice         direction
 202     evidence            label
 203     mod_base            number
 204     rpt_type            rpt_unit
 205     transl_except       transl_table
 206     usedin
 207     );
 208
 209 our %DBSOURCE = map {$_ => 1} qw(
 210     EchoBASE     IntAct    SWISS-2DPAGE    ECO2DBASE    ECOGENE    TIGRFAMs
 211     TIGR    GO    InterPro    Pfam    PROSITE    SGD    GermOnline
 212     HSSP    PhosSite    Ensembl    RGD    AGD    ArrayExpress    KEGG
 213     H-InvDB    HGNC    LinkHub    PANTHER    PRINTS    SMART    SMR
 214     MGI    MIM    RZPD-ProtExp    ProDom    MEROPS    TRANSFAC    Reactome
 215     UniGene    GlycoSuiteDB    PIRSF    HSC-2DPAGE    PHCI-2DPAGE
 216     PMMA-2DPAGE    Siena-2DPAGE    Rat-heart-2DPAGE    Aarhus/Ghent-2DPAGE
 217     Biocyc    MetaCyc    Biocyc:Metacyc    GenomeReviews    FlyBase
 218     TMHOBP    COMPLUYEAST-2DPAGE    OGP    DictyBase    HAMAP
 219     PhotoList    Gramene    WormBase    WormPep    Genew    ZFIN
 220     PeroxiBase    MaizeDB    TAIR    DrugBank    REBASE    HPA
 221     swissprot    GenBank    GenPept    REFSEQ    embl    PDB    UniProtKB
 222     DIP    PeptideAtlas    PRIDE    CYGD    HOGENOME    Gene3D
 223     Project);
 224
 225 our %VALID_MOLTYPE = map {$_ => 1} qw(NA DNA RNA tRNA rRNA cDNA cRNA ms-DNA
 226     mRNA  uRNA  ss-RNA  ss-DNA  snRNA snoRNA PRT);
 227
 228 our %VALID_ALPHABET = (
 229     'bp' => 'dna',
 230     'aa' => 'protein',
 231     'rc' => '' # rc = release candidate; file has no sequences
 232 );
 233
 234 sub _initialize {
 235     my($self,@args) = @_;
 236
 237     $self->SUPER::_initialize(@args);
 238     # hash for functions for decoding keys.
 239     $self->{'_func_ftunit_hash'} = {};
 240     $self->_show_dna(1); # sets this to one by default. People can change it
 241     if( ! defined $self->sequence_factory ) {
 242             $self->sequence_factory(Bio::Seq::SeqFactory->new
 243                             (-verbose => $self->verbose(),
 244                              -type => 'Bio::Seq::RichSeq'));
 245     }
 246 }
 247
 248 =head2 next_seq
 249
 250  Title   : next_seq
 251  Usage   : $seq = $stream->next_seq()
 252  Function: returns the next sequence in the stream
 253  Returns : Bio::Seq object
 254  Args    :
 255
 256 =cut
 257
 258 sub next_seq {
 259     my ( $self, @args ) = @_;
 260     my %args    = @args;
 261     my $builder = $self->sequence_builder();
 262     my $seq;
 263     my %params;
 264
 265   RECORDSTART:
 266     while (1) {
 267         my $buffer;
 268         my ( @acc,        @features );
 269         my ( $display_id, $annotation );
 270         my $species;
 271
 272         # initialize; we may come here because of starting over
 273         @features   = ();
 274         $annotation = undef;
 275         @acc        = ();
 276         $species    = undef;
 277         %params     = ( -verbose => $self->verbose );    # reset hash
 278         local ($/) = "\n";
 279         while ( defined( $buffer = $self->_readline() ) ) {
 280             last if index( $buffer, 'LOCUS       ' ) == 0;
 281         }
 282         return unless defined $buffer;                   # end of file
 283         $buffer =~ /^LOCUS\s+(\S.*)$/o
 284           || $self->throw(
 285 "GenBank stream with bad LOCUS line. Not GenBank in my book. Got '$buffer'"
 286           );
 287
 288         my @tokens = split( ' ', $1 );
 289
 290         # this is important to have the id for display in e.g. FTHelper,
 291         # otherwise you won't know which entry caused an error
 292         $display_id = shift(@tokens);
 293         $params{'-display_id'} = $display_id;
 294
 295         # may still be useful if we don't want the seq
 296         my $seqlength = shift(@tokens);
 297         if ( exists $VALID_ALPHABET{$seqlength} ) {
 298
 299             # moved one token too far.  No locus name?
 300             $self->warn(
 301 "Bad LOCUS name?  Changing [$params{'-display_id'}] to 'unknown' and length to $display_id"
 302             );
 303             $params{'-display_id'} = 'unknown';
 304             $params{'-length'}     = $display_id;
 305
 306             # add token back...
 307             unshift @tokens, $seqlength;
 308         }
 309         else {
 310             $params{'-length'} = $seqlength;
 311         }
 312
 313 # the alphabet of the entry
 314 # shouldn't assign alphabet unless one is specifically designated (such as for rc files)
 315         my $alphabet = lc( shift @tokens );
 316         $params{'-alphabet'} =
 317           ( exists $VALID_ALPHABET{$alphabet} )
 318           ? $VALID_ALPHABET{$alphabet}
 319           : $self->warn("Unknown alphabet: $alphabet");
 320
 321         # for aa there is usually no 'molecule' (mRNA etc)
 322         if ( $params{'-alphabet'} eq 'protein' ) {
 323             $params{'-molecule'} = 'PRT';
 324         }
 325         else {
 326             $params{'-molecule'} = shift(@tokens);
 327         }
 328
 329         # take care of lower case issues
 330         if ( $params{'-molecule'} eq 'dna' || $params{'-molecule'} eq 'rna' ) {
 331             $params{'-molecule'} = uc $params{'-molecule'};
 332         }
 333         $self->debug( "Unrecognized molecule type:" . $params{'-molecule'} )
 334           if !exists( $VALID_MOLTYPE{ $params{'-molecule'} } );
 335         my $circ = shift(@tokens);
 336         if ( $circ eq 'circular' ) {
 337             $params{'-is_circular'} = 1;
 338             $params{'-division'}    = shift(@tokens);
 339         }
 340         else {
 341             # 'linear' or 'circular' may actually be omitted altogether
 342             $params{'-division'} =
 343               ( CORE::length($circ) == 3 ) ? $circ : shift(@tokens);
 344         }
 345         my $date = join( ' ', @tokens );    # we lump together the rest
 346
 347         # this is per request bug #1513
 348         # we can handle
 349         # 9-10-2003
 350         # 9-10-03
 351         # 09-10-2003
 352         # 09-10-03
 353         if ( $date =~ s/\s*((\d{1,2})-(\w{3})-(\d{2,4})).*/$1/ ) {
 354             if ( length($date) < 11 ) {
 355
 356                 # improperly formatted date
 357                 # But we'll be nice and fix it for them
 358                 my ( $d, $m, $y ) = ( $2, $3, $4 );
 359                 if ( length($d) == 1 ) {
 360                     $d = "0$d";
 361                 }
 362
 363                 # guess the century here
 364                 if ( length($y) == 2 ) {
 365                     if ( $y > 60 ) {    # arbitrarily guess that '60' means 1960
 366                         $y = "19$y";
 367                     }
 368                     else {
 369                         $y = "20$y";
 370                     }
 371                     $self->warn(
 372 "Date was malformed, guessing the century for $date to be $y\n"
 373                     );
 374                 }
 375                 $params{'-dates'} = [ join( '-', $d, $m, $y ) ];
 376             }
 377             else {
 378                 $params{'-dates'} = [$date];
 379             }
 380         }
 381
 382         # set them all at once
 383         $builder->add_slot_value(%params);
 384         %params = ();
 385
 386         # parse the rest if desired, otherwise start over
 387         if ( !$builder->want_object() ) {
 388             $builder->make_object();
 389             next RECORDSTART;
 390         }
 391
 392         # set up annotation depending on what the builder wants
 393         if ( $builder->want_slot('annotation') ) {
 394             $annotation = Bio::Annotation::Collection->new();
 395         }
 396         $buffer = $self->_readline();
 397         until ( !defined($buffer) ) {
 398             $_ = $buffer;
 399
 400             # Description line(s)
 401             if (/^DEFINITION\s+(\S.*\S)/) {
 402                 my @desc = ($1);
 403                 while ( defined( $_ = $self->_readline ) ) {
 404                     if (/^\s+(.*)/) { push( @desc, $1 ); next }
 405                     last;
 406                 }
 407                 $builder->add_slot_value( -desc => join( ' ', @desc ) );
 408
 409                 # we'll continue right here because DEFINITION always comes
 410                 # at the top of the entry
 411                 $buffer = $_;
 412             }
 413
 414             # accession number (there can be multiple accessions)
 415             if (/^ACCESSION\s+(\S.*\S)/) {
 416                 push( @acc, split( /\s+/, $1 ) );
 417                 while ( defined( $_ = $self->_readline ) ) {
 418                     /^\s+(.*)/ && do { push( @acc, split( /\s+/, $1 ) ); next };
 419                     last;
 420                 }
 421                 $buffer = $_;
 422                 next;
 423             }
 424
 425             # PID
 426             elsif (/^PID\s+(\S+)/) {
 427                 $params{'-pid'} = $1;
 428             }
 429
 430             # Version number
 431             elsif (/^VERSION\s+(\S.+)$/) {
 432                 my ( $acc, $gi ) = split( ' ', $1 );
 433                 if ( $acc =~ /^\w+\.(\d+)/ ) {
 434                     $params{'-version'}     = $1;
 435                     $params{'-seq_version'} = $1;
 436                 }
 437                 if ( $gi && ( index( $gi, "GI:" ) == 0 ) ) {
 438                     $params{'-primary_id'} = substr( $gi, 3 );
 439                 }
 440             }
 441
 442             # Keywords
 443             elsif (/^KEYWORDS\s+(\S.*)/) {
 444                 my @kw = split( /\s*\;\s*/, $1 );
 445                 while ( defined( $_ = $self->_readline ) ) {
 446                     chomp;
 447                     /^\s+(.*)/
 448                       && do { push( @kw, split( /\s*\;\s*/, $1 ) ); next };
 449                     last;
 450                 }
 451
 452                 @kw && $kw[-1] =~ s/\.$//;
 453                 $params{'-keywords'} = \@kw;
 454                 $buffer = $_;
 455                 next;
 456             }
 457
 458             # Organism name and phylogenetic information
 459             elsif (/^SOURCE\s+\S/) {
 460                 if ( $builder->want_slot('species') ) {
 461                     $species = $self->_read_GenBank_Species( \$buffer );
 462                     $builder->add_slot_value( -species => $species );
 463                 }
 464                 else {
 465                     while ( defined( $buffer = $self->_readline() ) ) {
 466                         last if substr( $buffer, 0, 1 ) ne ' ';
 467                     }
 468                 }
 469                 next;
 470             }
 471
 472             # References
 473             elsif (/^REFERENCE\s+\S/) {
 474                 if ($annotation) {
 475                     my @refs = $self->_read_GenBank_References( \$buffer );
 476                     foreach my $ref (@refs) {
 477                         $annotation->add_Annotation( 'reference', $ref );
 478                     }
 479                 }
 480                 else {
 481                     while ( defined( $buffer = $self->_readline() ) ) {
 482                         last if substr( $buffer, 0, 1 ) ne ' ';
 483                     }
 484                 }
 485                 next;
 486             }
 487
 488             # Project
 489             elsif (/^PROJECT\s+(\S.*)/) {
 490                 if ($annotation) {
 491                     my $project =
 492                       Bio::Annotation::SimpleValue->new( -value => $1 );
 493                     $annotation->add_Annotation( 'project', $project );
 494                 }
 495             }
 496
 497             # Comments
 498             elsif (/^COMMENT\s+(\S.*)/) {
 499                 if ($annotation) {
 500                     my $comment = $1;
 501                     while ( defined( $_ = $self->_readline ) ) {
 502                         last if (/^\S/);
 503                         $comment .= $_;
 504                     }
 505                     $comment =~ s/\n/ /g;
 506                     $comment =~ s/  +/ /g;
 507                     $annotation->add_Annotation(
 508                         'comment',
 509                         Bio::Annotation::Comment->new(
 510                             -text    => $comment,
 511                             -tagname => 'comment'
 512                         )
 513                     );
 514                     $buffer = $_;
 515                 }
 516                 else {
 517                     while ( defined( $buffer = $self->_readline() ) ) {
 518                         last if substr( $buffer, 0, 1 ) ne ' ';
 519                     }
 520                 }
 521                 next;
 522             }
 523
 524             # Corresponding Genbank nucleotide id, Genpept only
 525             elsif (/^DB(?:SOURCE|LINK)\s+(\S.+)/) {
 526                 if ($annotation) {
 527                     my $dbsource = $1;
 528                     while ( defined( $_ = $self->_readline ) ) {
 529                         last if (/^\S/);
 530                         $dbsource .= $_;
 531                     }
 532
 533                     # deal with UniProKB dbsources
 534                     if ( $dbsource =~
 535                         s/(UniProt(?:KB)?|swissprot):\s+locus\s+(\S+)\,.+\n// )
 536                     {
 537                         $annotation->add_Annotation(
 538                             'dblink',
 539                             Bio::Annotation::DBLink->new(
 540                                 -primary_id => $2,
 541                                 -database   => $1,
 542                                 -tagname    => 'dblink'
 543                             )
 544                         );
 545                         if ( $dbsource =~ s/\s+created:\s+([^\.]+)\.\n// ) {
 546                             $annotation->add_Annotation(
 547                                 'swissprot_dates',
 548                                 Bio::Annotation::SimpleValue->new(
 549                                     -tagname => 'date_created',
 550                                     -value   => $1
 551                                 )
 552                             );
 553                         }
 554                         while ( $dbsource =~
 555 s/\s+(sequence|annotation)\s+updated:\s+([^\.]+)\.\n//g
 556                           )
 557                         {
 558                             $annotation->add_Annotation(
 559                                 'swissprot_dates',
 560                                 Bio::Annotation::SimpleValue->new(
 561                                     -tagname => 'date_updated',
 562                                     -value   => $2
 563                                 )
 564                             );
 565                         }
 566                         $dbsource =~ s/\n/ /g;
 567                         if ( $dbsource =~
 568                             s/\s+xrefs:\s+((?:\S+,\s+)+\S+)\s+xrefs/xrefs/ )
 569                         {
 570                             # will use $i to determine even or odd
 571                             # for swissprot the accessions are paired
 572                             my $i = 0;
 573                             for my $dbsrc ( split( /,\s+/, $1 ) ) {
 574                                 if (   $dbsrc =~ /(\S+)\.(\d+)/
 575                                     || $dbsrc =~ /(\S+)/ )
 576                                 {
 577                                     my ( $id, $version ) = ( $1, $2 );
 578                                     $version = '' unless defined $version;
 579                                     my $db;
 580                                     if ( $id =~ /^\d\S{3}/ ) {
 581                                         $db = 'PDB';
 582                                     }
 583                                     else {
 584                                         $db =
 585                                           ( $i++ % 2 ) ? 'GenPept' : 'GenBank';
 586                                     }
 587                                     $annotation->add_Annotation(
 588                                         'dblink',
 589                                         Bio::Annotation::DBLink->new(
 590                                             -primary_id => $id,
 591                                             -version    => $version,
 592                                             -database   => $db,
 593                                             -tagname    => 'dblink'
 594                                         )
 595                                     );
 596                                 }
 597                             }
 598                         }
 599                         elsif (
 600                             $dbsource =~ s/\s+xrefs:\s+(.+)\s+xrefs/xrefs/i )
 601                         {
 602                  # download screwed up and ncbi didn't put acc in for gi numbers
 603                             my $i = 0;
 604                             for my $id ( split( /\,\s+/, $1 ) ) {
 605                                 my ( $acc, $db );
 606                                 if ( $id =~ /gi:\s+(\d+)/ ) {
 607                                     $acc = $1;
 608                                     $db = ( $i++ % 2 ) ? 'GenPept' : 'GenBank';
 609                                 }
 610                                 elsif ( $id =~ /pdb\s+accession\s+(\S+)/ ) {
 611                                     $acc = $1;
 612                                     $db  = 'PDB';
 613                                 }
 614                                 else {
 615                                     $acc = $id;
 616                                     $db  = '';
 617                                 }
 618                                 $annotation->add_Annotation(
 619                                     'dblink',
 620                                     Bio::Annotation::DBLink->new(
 621                                         -primary_id => $acc,
 622                                         -database   => $db,
 623                                         -tagname    => 'dblink'
 624                                     )
 625                                 );
 626                             }
 627                         }
 628                         else {
 629                             $self->debug("Cannot match $dbsource\n");
 630                         }
 631                         if (
 632                             $dbsource =~
 633                             s/xrefs\s+\(non\-sequence\s+databases\):\s+
 634                 ((?:\S+,\s+)+\S+)//x
 635                           )
 636                         {
 637                             for my $id ( split( /\,\s+/, $1 ) ) {
 638                                 my $db;
 639
 640                                 # this is because GenBank dropped the spaces!!!
 641                                 # I'm sure we're not going to get this right
 642                                 ##if( $id =~ s/^://i ) {
 643                                 ##    $db = $1;
 644                                 ##}
 645                                 $db = substr( $id, 0, index( $id, ':' ) );
 646                                 if ( !exists $DBSOURCE{$db} ) {
 647                                     $db = '';    # do we want 'GenBank' here?
 648                                 }
 649                                 $id = substr( $id, index( $id, ':' ) + 1 );
 650                                 $annotation->add_Annotation(
 651                                     'dblink',
 652                                     Bio::Annotation::DBLink->new(
 653                                         -primary_id => $id,
 654                                         -database   => $db,
 655                                         -tagname    => 'dblink'
 656                                     )
 657                                 );
 658                             }
 659                         }
 660
 661                     }
 662                     else {
 663                         if ( $dbsource =~
 664                             /^(\S*?):?\s*accession\s+(\S+)\.(\d+)/ )
 665                         {
 666                             my ( $db, $id, $version ) = ( $1, $2, $3 );
 667                             $annotation->add_Annotation(
 668                                 'dblink',
 669                                 Bio::Annotation::DBLink->new(
 670                                     -primary_id => $id,
 671                                     -version    => $version,
 672                                     -database   => $db || 'GenBank',
 673                                     -tagname    => 'dblink'
 674                                 )
 675                             );
 676                         }
 677                         elsif ( $dbsource =~ /^(\S*?):?\s*accession\s+(\S+)/ ) {
 678                             my ( $db, $id ) = ( $1, $2 );
 679                             $annotation->add_Annotation(
 680                                 'dblink',
 681                                 Bio::Annotation::DBLink->new(
 682                                     -primary_id => $id,
 683                                     -database   => $db || 'GenBank',
 684                                     -tagname    => 'dblink'
 685                                 )
 686                             );
 687                         }
 688                         elsif ( $dbsource =~ /(\S+)([\.:])\s*(\S+)/ ) {
 689                             my ( $db, $version );
 690                             my @ids = ();
 691                             if ( $2 eq ':' ) {
 692                                 $db = $1;
 693
 694           # Genbank 192 release notes say this: "The second field can consist of
 695           #     multiple comma-separated identifiers, if a sequence record has
 696           #     multiple DBLINK cross-references of a given type."
 697           #     For example: DBLINK      Project:100,200,300"
 698                                 @ids = split( /,/, $3 );
 699                             }
 700                             else {
 701                                 ( $db, $version ) = ( 'GenBank', $3 );
 702                                 $ids[0] = $1;
 703                             }
 704
 705                             foreach my $id (@ids) {
 706                                 $annotation->add_Annotation(
 707                                     'dblink',
 708                                     Bio::Annotation::DBLink->new(
 709                                         -primary_id => $id,
 710                                         -version    => $version,
 711                                         -database   => $db,
 712                                         -tagname    => 'dblink'
 713                                     )
 714                                 );
 715                             }
 716                         }
 717                         else {
 718                             $self->warn(
 719                                 "Unrecognized DBSOURCE data: $dbsource\n");
 720                         }
 721                     }
 722
 723                     $buffer = $_;
 724                 }
 725                 else {
 726                     while ( defined( $buffer = $self->_readline() ) ) {
 727                         last if substr( $buffer, 0, 1 ) ne ' ';
 728                     }
 729                 }
 730                 next;
 731             }
 732
 733             # Exit at start of Feature table, or start of sequence
 734             last if (/^(FEATURES|ORIGIN)/);
 735
 736             # Get next line and loop again
 737             $buffer = $self->_readline;
 738         }
 739         return unless defined $buffer;
 740
 741         # add them all at once for efficiency
 742         $builder->add_slot_value(
 743             -accession_number     => shift(@acc),
 744             -secondary_accessions => \@acc,
 745             %params
 746         );
 747         $builder->add_slot_value( -annotation => $annotation ) if $annotation;
 748         %params = ();    # reset before possible re-use to avoid setting twice
 749
 750         # start over if we don't want to continue with this entry
 751         if ( !$builder->want_object() ) {
 752             $builder->make_object();
 753             next RECORDSTART;
 754         }
 755
 756         # some "minimal" formats may not necessarily have a feature table
 757         if ( $builder->want_slot('features') && defined($_) && /^FEATURES/o ) {
 758
 759             # need to read the first line of the feature table
 760             $buffer = $self->_readline;
 761
 762             # DO NOT read lines in the while condition -- this is done as a side
 763             # effect in _read_FTHelper_GenBank!
 764
 765             #       part of new circular spec:
 766             #       commented out for now until kinks worked out
 767             #my $sourceEnd = 0;
 768             #$sourceEnd = $2 if ($buffer =~ /(\d+?)\.\.(\d+?)$/);
 769
 770             while ( defined($buffer) ) {
 771
 772                 # check immediately -- not at the end of the loop
 773                 # note: GenPept entries obviously do not have a BASE line
 774                 last if ( $buffer =~ /^BASE|ORIGIN|CONTIG|WGS/o );
 775
 776                 # slurp in one feature at a time -- at return, the start of
 777                 # the next feature will have been read already, so we need
 778                 # to pass a reference, and the called method must set this
 779                 # to the last line read before returning
 780
 781                 my $ftunit = $self->_read_FTHelper_GenBank( \$buffer );
 782
 783                 #               implement new circular spec: features that cross the origin are now
 784                 #               seamless instead of being 2 separate joined features
 785                 #               commented out until kinks get worked out
 786                 #if ((! $args{'-nojoin'}) && $ftunit->{'loc'} =~ /^join\((\d+?)\.\.(\d+?),(\d+?)..(\d+?)\)$/
 787                 #&& $sourceEnd == $2 && $3 == 1) {
 788                 #my $start = $1;
 789                 #my $end = $2 + $4;
 790                 #$ftunit->{'loc'} = "$start..$end";
 791                 #}
 792
 793                 # fix suggested by James Diggans
 794
 795                 if ( !defined $ftunit ) {
 796
 797                     # GRRRR. We have fallen over. Try to recover
 798                     $self->warn( "Unexpected error in feature table for "
 799                           . $params{'-display_id'}
 800                           . " Skipping feature, attempting to recover" );
 801                     unless ( ( $buffer =~ /^\s{5,5}\S+/o )
 802                         or ( $buffer =~ /^\S+/o ) )
 803                     {
 804                         $buffer = $self->_readline();
 805                     }
 806                     next;    # back to reading FTHelpers
 807                 }
 808
 809                 # process ftunit
 810                 my $feat =
 811                   $ftunit->_generic_seqfeature( $self->location_factory(),
 812                     $display_id );
 813
 814                 # add taxon_id from source if available
 815                 if (
 816                        $species
 817                     && ( $feat->primary_tag eq 'source' )
 818                     && $feat->has_tag('db_xref')
 819                     && (
 820                         !$species->ncbi_taxid()
 821                         || (   $species->ncbi_taxid
 822                             && $species->ncbi_taxid =~ /^list/ )
 823                     )
 824                   )
 825                 {
 826                     foreach my $tagval ( $feat->get_tag_values('db_xref') ) {
 827                         if ( index( $tagval, "taxon:" ) == 0 ) {
 828                             $species->ncbi_taxid( substr( $tagval, 6 ) );
 829                             last;
 830                         }
 831                     }
 832                 }
 833
 834                 # add feature to list of features
 835                 push( @features, $feat );
 836             }
 837             $builder->add_slot_value( -features => \@features );
 838             $_ = $buffer;
 839         }
 840
 841         if ( defined($_) ) {
 842             # CONTIG lines: TODO, this needs to be cleaned up
 843             if (/^CONTIG\s+(.*)/o) {
 844                 my $ctg = $1;
 845                 while ( defined( $_ = $self->_readline)) {
 846                     last if m{^ORIGIN|//}o;
 847                     s/\s+(.*)/$1/;
 848                     $ctg .= $_;
 849                 }
 850                 if ($ctg) {
 851                     $annotation->add_Annotation(
 852                         Bio::Annotation::SimpleValue->new(
 853                             -tagname => 'contig',
 854                             -value   => $ctg
 855                         )
 856                     );
 857                 }
 858             }
 859             elsif (/^WGS|WGS_SCAFLD\s+/o) {    # catch WGS/WGS_SCAFLD lines
 860                 while ( $_ =~ s/(^WGS|WGS_SCAFLD)\s+// ) {    # gulp lines
 861                     chomp;
 862                     $annotation->add_Annotation(
 863                         Bio::Annotation::SimpleValue->new(
 864                             -value   => $_,
 865                             -tagname => lc($1)
 866                         )
 867                     );
 868                     $_ = $self->_readline;
 869                 }
 870             }
 871             elsif ( !m{^ORIGIN|//}o ) {    # advance to the sequence, if any
 872                 while ( defined( $_ = $self->_readline ) ) {
 873                     last if m{^(ORIGIN|//)};
 874                 }
 875             }
 876         }
 877         if ( !$builder->want_object() ) {
 878             $builder->make_object();        # implicit end-of-object
 879             next RECORDSTART;
 880         }
 881         if ( $builder->want_slot('seq') ) {
 882             # the fact that we want a sequence does not necessarily mean that
 883             # there also is a sequence ...
 884             if ( defined($_) && s/^ORIGIN\s+// ) {
 885                 if ( $annotation && length($_) > 0 ) {
 886                     $annotation->add_Annotation(
 887                         'origin',
 888                         Bio::Annotation::SimpleValue->new(
 889                             -tagname => 'origin',
 890                             -value   => $_
 891                         )
 892                     );
 893                 }
 894                 my $seqc = '';
 895                 while ( defined( $_ = $self->_readline ) ) {
 896                     last if m{^//};
 897                     $_ = uc($_);
 898                     s/[^A-Za-z]//g;
 899                     $seqc .= $_;
 900                 }
 901
 902                 $builder->add_slot_value( -seq => $seqc );
 903             }
 904         }
 905         elsif ( defined($_) && ( substr( $_, 0, 2 ) ne '//' ) ) {
 906
 907             # advance to the end of the record
 908             while ( defined( $_ = $self->_readline ) ) {
 909                 last if substr( $_, 0, 2 ) eq '//';
 910             }
 911         }
 912
 913         # Unlikely, but maybe the sequence is so weird that we don't want it
 914         # anymore. We don't want to return undef if the stream's not exhausted
 915         # yet.
 916         $seq = $builder->make_object();
 917         next RECORDSTART unless $seq;
 918         last RECORDSTART;
 919     }    # end while RECORDSTART
 920
 921     return $seq;
 922 }
 923
 924 =head2 write_seq
 925
 926  Title   : write_seq
 927  Usage   : $stream->write_seq($seq)
 928  Function: writes the $seq object (must be seq) to the stream
 929  Returns : 1 for success and 0 for error
 930  Args    : array of 1 to n Bio::SeqI objects
 931
 932
 933 =cut
 934
 935 sub write_seq {
 936     my ($self,@seqs) = @_;
 937
 938     foreach my $seq ( @seqs ) {
 939         $self->throw("Attempting to write with no seq!") unless defined $seq;
 940
 941         if( ! ref $seq || ! $seq->isa('Bio::SeqI') ) {
 942             $self->warn(" $seq is not a SeqI compliant module. Attempting to dump, but may fail!");
 943         }
 944
 945         my $str = $seq->seq;
 946
 947         my ($div, $mol);
 948         my $len = $seq->length();
 949
 950         if ( $seq->can('division') ) {
 951             $div = $seq->division;
 952         }
 953         if( !defined $div || ! $div ) { $div = 'UNK'; }
 954         my $alpha = $seq->alphabet;
 955         if( !$seq->can('molecule') || ! defined ($mol = $seq->molecule()) ) {
 956             $mol =  $alpha || 'DNA';
 957         }
 958
 959         my $circular = 'linear  ';
 960         $circular = 'circular' if $seq->is_circular;
 961
 962         local($^W) = 0; # supressing warnings about uninitialized fields.
 963
 964         my $temp_line;
 965         if( $self->_id_generation_func ) {
 966             $temp_line = &{$self->_id_generation_func}($seq);
 967         } else {
 968             my $date = '';
 969             if( $seq->can('get_dates') ) {
 970                 ($date) = $seq->get_dates();
 971             }
 972
 973             $self->warn("No whitespace allowed in GenBank display id [". $seq->display_id. "]")
 974                 if $seq->display_id =~ /\s/;
 975
 976             $temp_line = sprintf ("%-12s%-15s%13s %s%4s%-8s%-8s %3s %-s\n",
 977                                   'LOCUS', $seq->id(),$len,
 978                                   (lc($alpha) eq 'protein') ? ('aa','', '') :
 979                                   ('bp', '',$mol),$circular,$div,$date);
 980         }
 981
 982         $self->_print($temp_line);
 983         $self->_write_line_GenBank_regex("DEFINITION  ", "            ",
 984                                          $seq->desc(),"\\s\+\|\$",80);
 985
 986         # if there, write the accession line
 987
 988         if( $self->_ac_generation_func ) {
 989             $temp_line = &{$self->_ac_generation_func}($seq);
 990             $self->_print("ACCESSION   $temp_line\n");
 991         } else {
 992             my @acc = ();
 993             push(@acc, $seq->accession_number());
 994             if( $seq->isa('Bio::Seq::RichSeqI') ) {
 995                 push(@acc, $seq->get_secondary_accessions());
 996             }
 997             $self->_print("ACCESSION   ", join(" ", @acc), "\n");
 998             # otherwise - cannot print <sigh>
 999         }
1000
1001         # if PID defined, print it
1002         if($seq->isa('Bio::Seq::RichSeqI') && $seq->pid()) {
1003             $self->_print("PID         ", $seq->pid(), "\n");
1004         }
1005
1006         # if there, write the version line
1007
1008         if( defined $self->_sv_generation_func() ) {
1009             $temp_line = &{$self->_sv_generation_func}($seq);
1010             if( $temp_line ) {
1011                 $self->_print("VERSION     $temp_line\n");
1012             }
1013         } else {
1014             if($seq->isa('Bio::Seq::RichSeqI') && defined($seq->seq_version)) {
1015                 my $id = $seq->primary_id(); # this may be a GI number
1016                 $self->_print("VERSION     ",
1017                               $seq->accession_number(), ".", $seq->seq_version,
1018                               ($id && ($id =~ /^\d+$/) ? "  GI:".$id : ""),
1019                               "\n");
1020             }
1021         }
1022
1023     # if there, write the PROJECT line
1024         for my $proj ( $seq->annotation->get_Annotations('project') ) {
1025                 $self->_print("PROJECT     ".$proj->value."\n");
1026     }
1027
1028         # if there, write the DBSOURCE line
1029         foreach my $ref ( $seq->annotation->get_Annotations('dblink') ) {
1030         my ($db, $id) = ($ref->database, $ref->primary_id);
1031         my $prefix = $db eq 'Project' ? 'DBLINK' : 'DBSOURCE';
1032         my $text =  $db eq 'GenBank' ? ''           :
1033                     $db eq 'Project' ? "$db:$id"    :  "$db accession $id";
1034             $self->_print(sprintf ("%-11s %s\n",$prefix, $text));
1035         }
1036
1037         # if there, write the keywords line
1038         if( defined $self->_kw_generation_func() ) {
1039             $temp_line = &{$self->_kw_generation_func}($seq);
1040             $self->_print("KEYWORDS    $temp_line\n");
1041         } else {
1042             if( $seq->can('keywords') ) {
1043                 my $kw = $seq->keywords;
1044                 $kw .= '.' if( $kw !~ /\.$/ );
1045                 $self->_print("KEYWORDS    $kw\n");
1046             }
1047         }
1048
1049         # SEGMENT if it exists
1050         foreach my $ref ( $seq->annotation->get_Annotations('segment') ) {
1051             $self->_print(sprintf ("%-11s %s\n",'SEGMENT',
1052                                    $ref->value));
1053         }
1054
1055         # Organism lines
1056         if (my $spec = $seq->species) {
1057             my ($on, $sn, $cn) = ($spec->can('organelle') ? $spec->organelle : '',
1058                                   $spec->scientific_name,
1059                                   $spec->common_name);
1060             my @classification;
1061         if ($spec->isa('Bio::Species')) {
1062             @classification = $spec->classification;
1063             shift(@classification);
1064         } else {
1065             # Bio::Taxon should have a DB handle of some type attached, so
1066             # derive the classification from that
1067             my $node = $spec;
1068             while ($node) {
1069                 $node = $node->ancestor || last;
1070                 unshift(@classification, $node->node_name);
1071                 #$node eq $root && last;
1072             }
1073             @classification = reverse @classification;
1074         }
1075             my $abname = $spec->name('abbreviated') ? # from genbank file
1076                 $spec->name('abbreviated')->[0] : $sn;
1077             my $sl = $on ? "$on "            : '';
1078             $sl   .= $cn ? $abname." ($cn)" : "$abname";
1079
1080             $self->_write_line_GenBank_regex("SOURCE      ", ' 'x12, $sl, "\\s\+\|\$",80);
1081             $self->_print("  ORGANISM  ", $spec->scientific_name, "\n");
1082             my $OC = join('; ', (reverse(@classification))) .'.';
1083             $self->_write_line_GenBank_regex(' 'x12,' 'x12,
1084                                              $OC,"\\s\+\|\$",80);
1085         }
1086
1087         # Reference lines
1088         my $count = 1;
1089         foreach my $ref ( $seq->annotation->get_Annotations('reference') ) {
1090             $temp_line = "REFERENCE   $count";
1091             if ($ref->start) {
1092                 $temp_line .= sprintf ("  (%s %d to %d)",
1093                                        ($seq->alphabet() eq "protein" ?
1094                                         "residues" : "bases"),
1095                                        $ref->start,$ref->end);
1096             } elsif ($ref->gb_reference) {
1097                 $temp_line .= sprintf ("  (%s)", $ref->gb_reference);
1098             }
1099             $self->_print("$temp_line\n");
1100             $self->_write_line_GenBank_regex("  AUTHORS   ",' 'x12,
1101                                              $ref->authors,"\\s\+\|\$",80);
1102             $self->_write_line_GenBank_regex("  CONSRTM   ",' 'x12,
1103                                              $ref->consortium,"\\s\+\|\$",80) if $ref->consortium;
1104             $self->_write_line_GenBank_regex("  TITLE     "," "x12,
1105                                              $ref->title,"\\s\+\|\$",80);
1106             $self->_write_line_GenBank_regex("  JOURNAL   "," "x12,
1107                                              $ref->location,"\\s\+\|\$",80);
1108             if( $ref->medline) {
1109                 $self->_write_line_GenBank_regex("  MEDLINE   "," "x12,
1110                                                  $ref->medline, "\\s\+\|\$",80);
1111                 # I am assuming that pubmed entries only exist when there
1112                 # are also MEDLINE entries due to the indentation
1113             }
1114             # This could be a wrong assumption
1115             if( $ref->pubmed ) {
1116                 $self->_write_line_GenBank_regex("   PUBMED   "," "x12,
1117                                                  $ref->pubmed, "\\s\+\|\$",
1118                                                  80);
1119             }
1120             # put remark at the end
1121             if ($ref->comment) {
1122                 $self->_write_line_GenBank_regex("  REMARK    "," "x12,
1123                                                  $ref->comment,"\\s\+\|\$",80);
1124             }
1125             $count++;
1126         }
1127
1128         # Comment lines
1129         foreach my $comment ( $seq->annotation->get_Annotations('comment') ) {
1130             $self->_write_line_GenBank_regex("COMMENT     "," "x12,
1131                                              $comment->text,"\\s\+\|\$",80);
1132         }
1133
1134     # FEATURES section
1135         $self->_print("FEATURES             Location/Qualifiers\n");
1136
1137         if( defined $self->_post_sort ) {
1138             # we need to read things into an array. Process. Sort them. Print 'em
1139
1140             my $post_sort_func = $self->_post_sort();
1141             my @fth;
1142
1143             foreach my $sf ( $seq->top_SeqFeatures ) {
1144                   push(@fth,Bio::SeqIO::FTHelper::from_SeqFeature($sf,$seq));
1145             }
1146
1147             @fth = sort { &$post_sort_func($a,$b) } @fth;
1148
1149             foreach my $fth ( @fth ) {
1150                   $self->_print_GenBank_FTHelper($fth);
1151             }
1152         } else {
1153             # not post sorted. And so we can print as we get them.
1154             # lower memory load...
1155
1156             foreach my $sf ( $seq->top_SeqFeatures ) {
1157                   my @fth = Bio::SeqIO::FTHelper::from_SeqFeature($sf,$seq);
1158                   foreach my $fth ( @fth ) {
1159                       if( ! $fth->isa('Bio::SeqIO::FTHelper') ) {
1160                           $sf->throw("Cannot process FTHelper... $fth");
1161                       }
1162                       $self->_print_GenBank_FTHelper($fth);
1163                   }
1164            }
1165         }
1166
1167         # deal with WGS; WGS_SCAFLD present only if WGS is also present
1168         if($seq->annotation->get_Annotations('wgs')) {
1169             foreach my $wgs
1170                 (map {$seq->annotation->get_Annotations($_)} qw(wgs wgs_scaffold)) {
1171                     $self->_print(sprintf ("%-11s %s\n",uc($wgs->tagname),
1172                                            $wgs->value));
1173                 }
1174             $self->_show_dna(0);
1175         }
1176         if($seq->annotation->get_Annotations('contig')) {
1177             my $ct = 0;
1178             my $cline;
1179             foreach my $contig ($seq->annotation->get_Annotations('contig')) {
1180                 unless ($ct) {
1181                     $cline = uc($contig->tagname)."      ".$contig->value."\n";
1182                 } else {
1183                     $cline = "            ".$contig->value."\n";
1184                 }
1185                 $self->_print($cline);
1186                 $ct++;
1187             }
1188             $self->_show_dna(0);
1189         }
1190         if( $seq->length == 0 ) { $self->_show_dna(0) }
1191
1192         if( $self->_show_dna() == 0 ) {
1193             $self->_print("\n//\n");
1194             return;
1195         }
1196
1197         # finished printing features.
1198
1199         $str =~ tr/A-Z/a-z/;
1200
1201         my ($o) = $seq->annotation->get_Annotations('origin');
1202         $self->_print(sprintf("%-12s%s\n",
1203                               'ORIGIN', $o ? $o->value : ''));
1204         # print out the sequence
1205         my $nuc = 60;           # Number of nucleotides per line
1206         my $whole_pat = 'a10' x 6; # Pattern for unpacking a whole line
1207         my $out_pat   = 'A11' x 6; # Pattern for packing a line
1208         my $length = length($str);
1209
1210         # Calculate the number of nucleotides which fit on whole lines
1211         my $whole = int($length / $nuc) * $nuc;
1212
1213         # Print the whole lines
1214         my $i;
1215         for ($i = 0; $i < $whole; $i += $nuc) {
1216             my $blocks = pack $out_pat,
1217             unpack $whole_pat,
1218             substr($str, $i, $nuc);
1219             chop $blocks;
1220             $self->_print(sprintf("%9d $blocks\n", $i + $nuc - 59));
1221         }
1222
1223         # Print the last line
1224         if (my $last = substr($str, $i)) {
1225             my $last_len = length($last);
1226             my $last_pat = 'a10' x int($last_len / 10) .
1227                 'a'. $last_len % 10;
1228             my $blocks = pack $out_pat,
1229             unpack($last_pat, $last);
1230             $blocks =~ s/ +$//;
1231             $self->_print(sprintf("%9d $blocks\n",
1232                                   $length - $last_len + 1));
1233         }
1234
1235         $self->_print("//\n");
1236
1237         $self->flush if $self->_flush_on_write && defined $self->_fh;
1238         return 1;
1239     }
1240 }
1241
1242 =head2 _print_GenBank_FTHelper
1243
1244  Title   : _print_GenBank_FTHelper
1245  Usage   :
1246  Function:
1247  Example :
1248  Returns :
1249  Args    :
1250
1251 =cut
1252
1253 sub _print_GenBank_FTHelper {
1254     my ( $self, $fth ) = @_;
1255
1256     if ( !ref $fth || !$fth->isa('Bio::SeqIO::FTHelper') ) {
1257         $fth->warn(
1258             "$fth is not a FTHelper class. Attempting to print but there could be issues"
1259         );
1260     }
1261
1262     my $spacer = ( length $fth->key >= 15 ) ? ' ' : '';
1263     $self->_write_line_GenBank_regex(
1264         sprintf( "     %-16s%s", $fth->key, $spacer ),
1265         " " x 21, $fth->loc, "\,\|\$", 80 );
1266
1267     foreach my $tag ( keys %{ $fth->field } ) {
1268         # Account for hash structure in Annotation::DBLink, not the expected array
1269         if ( $tag eq 'db_xref' && grep /HASH/, @{ $fth->field->{$tag} }) {
1270             for my $ref ( @{ $fth->field->{$tag} } ) {
1271                 my $db = $ref->{'database'};
1272                 my $id = $ref->{'primary_id'};
1273                 $self->_write_line_GenBank_regex( " " x 21, " " x 21,
1274                         "/$tag=\"$db:$id\"", "\.\|\$", 80 );
1275             }
1276         }
1277         # The usual case, where all values are found in an array
1278         else {
1279             foreach my $value ( @{ $fth->field->{$tag} } ) {
1280                 $value =~ s/\"/\"\"/g;
1281                 if ( $value eq "_no_value" ) {
1282                     $self->_write_line_GenBank_regex( " " x 21, " " x 21,
1283                         "/$tag", "\.\|\$", 80 );
1284                 }
1285
1286                # There are almost 3x more quoted qualifier values and they
1287                # are more common too so we take quoted ones first.
1288                # Long qualifiers, that will be line wrapped, are always quoted
1289                 elsif ( !$FTQUAL_NO_QUOTE{$tag}
1290                     or length("/$tag=$value") >= $FTQUAL_LINE_LENGTH )
1291                 {
1292                     my ($pat) = ( $value =~ /\s/ ? '\s|$' : '.|$' );
1293                     $self->_write_line_GenBank_regex( " " x 21, " " x 21,
1294                         "/$tag=\"$value\"", $pat, 80 );
1295
1296                 }
1297                 else {
1298                     $self->_write_line_GenBank_regex( " " x 21, " " x 21,
1299                         "/$tag=$value", "\.\|\$", 80 );
1300                 }
1301             }
1302         }
1303     }
1304 }
1305
1306
1307 =head2 _read_GenBank_References
1308
1309  Title   : _read_GenBank_References
1310  Usage   :
1311  Function: Reads references from GenBank format. Internal function really
1312  Returns :
1313  Args    :
1314
1315 =cut
1316
1317 sub _read_GenBank_References {
1318     my ($self,$buffer) = @_;
1319     my (@refs);
1320     my $ref;
1321
1322     # assumme things are starting with RN
1323
1324     if( $$buffer !~ /^REFERENCE/ ) {
1325         warn("Not parsing line '$$buffer' which maybe important");
1326     }
1327
1328     $_ = $$buffer;
1329
1330     my (@title,@loc,@authors,@consort,@com,@medline,@pubmed);
1331
1332   REFLOOP: while( defined($_) || defined($_ = $self->_readline) ) {
1333       if (/^\s{2}AUTHORS\s+(.*)/o) {
1334           push (@authors, $1);
1335           while ( defined($_ = $self->_readline) ) {
1336               /^\s{9,}(.*)/o && do { push (@authors, $1);next;};
1337               last;
1338           }
1339           $ref->authors(join(' ', @authors));
1340       }
1341       if (/^\s{2}CONSRTM\s+(.*)/o) {
1342           push (@consort, $1);
1343           while ( defined($_ = $self->_readline) ) {
1344               /^\s{9,}(.*)/o && do { push (@consort, $1);next;};
1345               last;
1346           }
1347           $ref->consortium(join(' ', @consort));
1348       }
1349       if (/^\s{2}TITLE\s+(.*)/o)  {
1350           push (@title, $1);
1351           while ( defined($_ = $self->_readline) ) {
1352               /^\s{9,}(.*)/o && do { push (@title, $1);
1353                                      next;
1354                                  };
1355               last;
1356           }
1357           $ref->title(join(' ', @title));
1358       }
1359       if (/^\s{2}JOURNAL\s+(.*)/o) {
1360           push(@loc, $1);
1361           while ( defined($_ = $self->_readline) ) {
1362               # we only match when there are at least 4 spaces
1363               # there is probably a better way to match this
1364               # as it assumes that the describing tag is short enough
1365               /^\s{9,}(.*)/o && do { push(@loc, $1);
1366                                      next;
1367                                  };
1368               last;
1369           }
1370           $ref->location(join(' ', @loc));
1371           redo REFLOOP;
1372       }
1373       if (/^\s{2}REMARK\s+(.*)/o) {
1374           push (@com, $1);
1375           while ( defined($_ = $self->_readline) ) {
1376               /^\s{9,}(.*)/o && do { push(@com, $1);
1377                                      next;
1378                                  };
1379               last;
1380           }
1381           $ref->comment(join(' ', @com));
1382           redo REFLOOP;
1383       }
1384       if( /^\s{2}MEDLINE\s+(.*)/ ) {
1385           push(@medline,$1);
1386           while ( defined($_ = $self->_readline) ) {
1387               /^\s{9,}(.*)/ && do { push(@medline, $1);
1388                                     next;
1389                                 };
1390               last;
1391           }
1392           $ref->medline(join(' ', @medline));
1393           redo REFLOOP;
1394       }
1395       if( /^\s{3}PUBMED\s+(.*)/ ) {
1396           push(@pubmed,$1);
1397           while ( defined($_ = $self->_readline) ) {
1398               /^\s{9,}(.*)/ && do { push(@pubmed, $1);
1399                                     next;
1400                                 };
1401               last;
1402           }
1403           $ref->pubmed(join(' ', @pubmed));
1404           redo REFLOOP;
1405       }
1406
1407       /^REFERENCE/o && do {
1408           # store current reference
1409           $self->_add_ref_to_array(\@refs,$ref) if defined $ref;
1410           # reset
1411           @authors = ();
1412           @title = ();
1413           @loc = ();
1414           @com = ();
1415           @pubmed = ();
1416           @medline = ();
1417           # create the new reference object
1418           $ref = Bio::Annotation::Reference->new(-tagname => 'reference');
1419           # check whether start and end base is given
1420           if (/^REFERENCE\s+\d+\s+\([a-z]+ (\d+) to (\d+)\)/){
1421               $ref->start($1);
1422               $ref->end($2);
1423           } elsif (/^REFERENCE\s+\d+\s+\((.*)\)/) {
1424               $ref->gb_reference($1);
1425           }
1426       };
1427
1428       /^(FEATURES)|(COMMENT)/o && last;
1429
1430       $_ = undef;              # Empty $_ to trigger read of next line
1431   }
1432
1433     # store last reference
1434     $self->_add_ref_to_array(\@refs,$ref) if defined $ref;
1435
1436     $$buffer = $_;
1437
1438     #print "\nnumber of references found: ", $#refs+1,"\n";
1439
1440     return @refs;
1441 }
1442
1443 =head2 _add_ref_to_array
1444
1445 Title: _add_ref_to_array
1446 Usage:
1447 Function: Adds a Reference object to an array of Reference objects, takes
1448           care of possible cleanups to be done (currently, only author and title
1449           will be chopped of trailing semicolons).
1450 Args:     A reference to an array of Reference objects and
1451           the Reference object to be added
1452 Returns: nothing
1453
1454 =cut
1455
1456 sub _add_ref_to_array {
1457     my ($self, $refs, $ref) = @_;
1458
1459     # first, polish author and title by removing possible trailing semicolons
1460     my $au = $ref->authors();
1461     my $title = $ref->title();
1462     $au =~ s/;\s*$//g if $au;
1463     $title =~ s/;\s*$//g if $title;
1464     $ref->authors($au);
1465     $ref->title($title);
1466     # the rest should be clean already, so go ahead and add it
1467     push(@{$refs}, $ref);
1468 }
1469
1470 =head2 _read_GenBank_Species
1471
1472  Title   : _read_GenBank_Species
1473  Usage   :
1474  Function: Reads the GenBank Organism species and classification
1475            lines. Able to deal with unconvential Organism naming
1476            formats, and varietas in plants
1477  Example : ORGANISM  unknown marine gamma proteobacterium NOR5
1478            $genus = undef
1479            $species = unknown marine gamma proteobacterium NOR5
1480
1481            ORGANISM  Drosophila sp. 'white tip scutellum'
1482            $genus = Drosophila
1483            $species = sp. 'white tip scutellum'
1484            (yes, this really is a species and that is its name)
1485            $subspecies = undef
1486
1487            ORGANISM  Ajellomyces capsulatus var. farciminosus
1488            $genus = Ajellomyces
1489            $species = capsulatus
1490            $subspecies = var. farciminosus
1491
1492            ORGANISM  Hepatitis delta virus
1493            $genus = undef (though this virus has a genus in its lineage, we
1494                            cannot know that without a database lookup)
1495            $species = Hepatitis delta virus
1496
1497  Returns : A Bio::Species object
1498  Args    : A reference to the current line buffer
1499
1500 =cut
1501
1502 sub _read_GenBank_Species {
1503     my ($self, $buffer) = @_;
1504
1505     my @unkn_names = ('other', 'unknown organism', 'not specified', 'not shown',
1506                       'Unspecified', 'Unknown', 'None', 'unclassified',
1507                       'unidentified organism', 'not supplied');
1508     # dictionary of synonyms for taxid 32644
1509     my @unkn_genus = ('unknown','unclassified','uncultured','unidentified');
1510     # all above can be part of valid species name
1511
1512     my $line = $$buffer;
1513
1514     my( $sub_species, $species, $genus, $sci_name, $common, $class_lines,
1515         $source_flag, $abbr_name, $organelle, $sl );
1516     my %source = map { $_ => 1 } qw(SOURCE ORGANISM CLASSIFICATION);
1517     # upon first entering the loop, we must not read a new line -- the SOURCE
1518     # line is already in the buffer (HL 05/10/2000)
1519     my ($ann, $tag, $data);
1520     while (defined($line) || defined($line = $self->_readline())) {
1521         # de-HTMLify (links that may be encountered here don't contain
1522         # escaped '>', so a simple-minded approach suffices)
1523         $line =~ s{<[^>]+>}{}g;
1524         if ($line =~ m{^(?:\s{0,2})(\w+)\s+(.+)?$}ox) {
1525             ($tag, $data) = ($1, $2 || '');
1526             last if ($tag && !exists $source{$tag});
1527         } else {
1528             return unless $tag;
1529             ($data = $line) =~ s{^\s+}{};
1530             chomp $data;
1531             $tag = 'CLASSIFICATION' if ($tag ne 'CLASSIFICATION' && $tag eq 'ORGANISM' &&  $line =~ m{[;\.]+});
1532         }
1533         (exists $ann->{$tag}) ? ($ann->{$tag} .= ' '.$data) : ($ann->{$tag} .= $data);
1534         $line = undef;
1535     }
1536
1537     ($sl, $class_lines, $sci_name) = ($ann->{SOURCE}, $ann->{CLASSIFICATION}, $ann->{ORGANISM});
1538
1539     $$buffer = $line;
1540
1541     $sci_name || return;
1542
1543     # parse out organelle, common name, abbreviated name if present;
1544     # this should catch everything, but falls back to
1545     # entire SOURCE line just in case
1546     if ($sl =~ m{^
1547                      (mitochondrion|chloroplast|plastid)?
1548                      \s*(.*?)
1549                      \s*(?: \( (.*?) \) )?\.?
1550                      $
1551                  }xms ){
1552         ($organelle, $abbr_name, $common) = ($1, $2, $3); # optional
1553     } else {
1554         $abbr_name = $sl;       # nothing caught; this is a backup!
1555     }
1556
1557     # Convert data in classification lines into classification array.
1558     # only split on ';' or '.' so that classification that is 2 or more words will
1559     # still get matched, use map() to remove trailing/leading/intervening spaces
1560     my @class = map { s/^\s+//; s/\s+$//; s/\s{2,}/ /g; $_; } split /(?<!subgen)[;\.]+/, $class_lines;
1561
1562     # do we have a genus?
1563     my $possible_genus =  quotemeta($class[-1])
1564        . ($class[-2] ? "|" . quotemeta($class[-2]) : '');
1565     if ($sci_name =~ /^($possible_genus)/) {
1566         $genus = $1;
1567         ($species) = $sci_name =~ /^$genus\s+(.+)/;
1568     }
1569     else {
1570         $species = $sci_name;
1571     }
1572
1573     # is this organism of rank species or is it lower?
1574     # (we don't catch everything lower than species, but it doesn't matter -
1575     # this is just so we abide by previous behaviour whilst not calling a
1576     # species a subspecies)
1577     if ($species && $species =~ /(.+)\s+((?:subsp\.|var\.).+)/) {
1578         ($species, $sub_species) = ($1, $2);
1579     }
1580
1581     # Don't make a species object if it's empty or "Unknown" or "None"
1582     # return unless $genus and  $genus !~ /^(Unknown|None)$/oi;
1583     # Don't make a species object if it belongs to taxid 32644
1584 #       my $unkn = grep { $_ =~ /^\Q$sl\E$/; } @unkn_names;
1585     my $unkn = grep { $_ eq $sl } @unkn_names;
1586     return unless ($species || $genus) and $unkn == 0;
1587
1588     # Bio::Species array needs array in Species -> Kingdom direction
1589     push(@class, $sci_name);
1590     @class = reverse @class;
1591
1592     my $make = Bio::Species->new();
1593     $make->scientific_name($sci_name);
1594     $make->classification(@class) if @class > 0;
1595     $make->common_name( $common ) if $common;
1596     $make->name('abbreviated', $abbr_name) if $abbr_name;
1597     $make->organelle($organelle) if $organelle;
1598     #$make->sub_species( $sub_species ) if $sub_species;
1599     return $make;
1600 }
1601
1602 =head2 _read_FTHelper_GenBank
1603
1604  Title   : _read_FTHelper_GenBank
1605  Usage   : _read_FTHelper_GenBank($buffer)
1606  Function: reads the next FT key line
1607  Example :
1608  Returns : Bio::SeqIO::FTHelper object
1609  Args    : filehandle and reference to a scalar
1610
1611 =cut
1612
1613 sub _read_FTHelper_GenBank {
1614     my ($self,$buffer) = @_;
1615
1616     my ($key,                   # The key of the feature
1617         $loc                    # The location line from the feature
1618         );
1619     my @qual = ();        # An array of lines making up the qualifiers
1620
1621     if ($$buffer =~ /^\s{5}(\S+)\s+(.+?)\s*$/o) {
1622         $key = $1;
1623         $loc = $2;
1624         # Read all the lines up to the next feature
1625         while ( defined($_ = $self->_readline) ) {
1626             if (/^(\s+)(.+?)\s*$/o) {
1627                                 # Lines inside features are preceded by 21 spaces
1628                                 # A new feature is preceded by 5 spaces
1629                 if (length($1) > 6) {
1630                     # Add to qualifiers if we're in the qualifiers, or if it's
1631                     # the first qualifier
1632                     if (@qual || (index($2,'/') == 0)) {
1633                         push(@qual, $2);
1634                     }
1635                     # We're still in the location line, so append to location
1636                     else {
1637                         $loc .= $2;
1638                     }
1639                 } else {
1640                     # We've reached the start of the next feature
1641                     last;
1642                 }
1643             } else {
1644                                 # We're at the end of the feature table
1645                 last;
1646             }
1647         }
1648     } else {
1649         # No feature key
1650         $self->debug("no feature key!\n");
1651         # change suggested by JDiggans to avoid infinite loop-
1652         # see bugreport 1062.
1653         # reset buffer to prevent infinite loop
1654         $$buffer = $self->_readline();
1655         return;
1656     }
1657
1658     # Put the first line of the next feature into the buffer
1659     $$buffer = $_;
1660
1661     # Make the new FTHelper object
1662     my $out = Bio::SeqIO::FTHelper->new();
1663     $out->verbose($self->verbose());
1664     $out->key($key);
1665     $out->loc($loc);
1666
1667     # Now parse and add any qualifiers.  (@qual is kept
1668     # intact to provide informative error messages.)
1669   QUAL:
1670     for (my $i = 0; $i < @qual; $i++) {
1671         $_ = $qual[$i];
1672         my( $qualifier, $value ) = (m{^/([^=]+)(?:=(.+))?})
1673             or $self->warn("cannot see new qualifier in feature $key: ".
1674                            $qual[$i]);
1675         $qualifier = '' unless( defined $qualifier);
1676         if (defined $value) {
1677             # Do we have a quoted value?
1678             if (substr($value, 0, 1) eq '"') {
1679                                 # Keep adding to value until we find the trailing quote
1680                                 # and the quotes are balanced
1681                 while ($value !~ /\"$/ or $value =~ tr/"/"/ % 2) {
1682                     if($i >= $#qual) {
1683                         $self->warn("Unbalanced quote in:\n" .
1684                                     join("\n", @qual) .
1685                                     "No further qualifiers will " .
1686                                     "be added for this feature");
1687                         last QUAL;
1688                     }
1689                     $i++; # modifying a for-loop variable inside of the loop
1690                     # is not the best programming style ...
1691                     my $next = $qual[$i];
1692
1693                     # add to value with a space unless the value appears
1694                     # to be a sequence (translation for example)
1695                     # if(($value.$next) =~ /[^A-Za-z\"\-]/o) {
1696                     # changed to explicitly look for translation tag - cjf 06/8/29
1697                     if ($qualifier !~ /^translation$/i ) {
1698                         $value .= " ";
1699                     }
1700                     $value .= $next;
1701                 }
1702                                 # Trim leading and trailing quotes
1703                 $value =~ s/^"|"$//g;
1704                                 # Undouble internal quotes
1705                 $value =~ s/""/\"/g;
1706             } elsif ( $value =~ /^\(/ ) { # values quoted by ()s
1707                                 # Keep adding to value until we find the trailing bracket
1708                                 # and the ()s are balanced
1709                 my $left = ($value =~ tr/\(/\(/); # count left parens
1710                 my $right = ($value =~ tr/\)/\)/); # count right parens
1711
1712                 while( $left != $right ) { # was "$value !~ /\)$/ or $left != $right"
1713                     if( $i >= $#qual) {
1714                         $self->warn("Unbalanced parens in:\n".
1715                                     join("\n", @qual).
1716                                     "\nNo further qualifiers will ".
1717                                     "be added for this feature");
1718                         last QUAL;
1719                     }
1720                     $i++;
1721                     my $next = $qual[$i];
1722                     $value .= $next;
1723                     $left += ($next =~ tr/\(/\(/);
1724                     $right += ($next =~ tr/\)/\)/);
1725                 }
1726             }
1727         } else {
1728             $value = '_no_value';
1729         }
1730         # Store the qualifier
1731         $out->field->{$qualifier} ||= [];
1732         push(@{$out->field->{$qualifier}},$value);
1733     }
1734     return $out;
1735 }
1736
1737 =head2 _write_line_GenBank
1738
1739  Title   : _write_line_GenBank
1740  Usage   :
1741  Function: internal function
1742  Example :
1743  Returns :
1744  Args    :
1745
1746 =cut
1747
1748 sub _write_line_GenBank {
1749     my ($self,$pre1,$pre2,$line,$length) = @_;
1750
1751     $length || $self->throw("Miscalled write_line_GenBank without length. Programming error!");
1752     my $subl = $length - length $pre2;
1753     my $linel = length $line;
1754     my $i;
1755
1756     my $subr = substr($line,0,$length - length $pre1);
1757
1758     $self->_print("$pre1$subr\n");
1759     for($i= ($length - length $pre1);$i < $linel;  $i += $subl) {
1760         $subr = substr($line,$i,$subl);
1761         $self->_print("$pre2$subr\n");
1762     }
1763
1764 }
1765
1766 =head2 _write_line_GenBank_regex
1767
1768  Title   : _write_line_GenBank_regex
1769  Usage   :
1770  Function: internal function for writing lines of specified
1771            length, with different first and the next line
1772            left hand headers and split at specific points in the
1773            text
1774  Example :
1775  Returns : nothing
1776  Args    : file handle,
1777            first header,
1778            second header,
1779            text-line,
1780            regex for line breaks,
1781            total line length
1782
1783
1784 =cut
1785
1786 sub _write_line_GenBank_regex {
1787     my ($self,$pre1,$pre2,$line,$regex,$length) = @_;
1788
1789     #print STDOUT "Going to print with $line!\n";
1790
1791     $length || $self->throw( "Miscalled write_line_GenBank without length. Programming error!");
1792
1793     my $subl = $length - (length $pre1) - 2;
1794     my @lines = ();
1795
1796     CHUNK: while($line) {
1797         foreach my $pat ($regex, '[,;\.\/-]\s|'.$regex, '[,;\.\/-]|'.$regex) {
1798           if($line =~ m/^(.{0,$subl})($pat)(.*)/ ) {
1799               my $l = $1.$2;
1800               $line = substr($line,length($l));
1801               # be strict about not padding spaces according to
1802               # genbank format
1803               $l =~ s/\s+$//;
1804               next CHUNK if ($l eq '');
1805               push(@lines, $l);
1806               next CHUNK;
1807           }
1808         }
1809         # if we get here none of the patterns matched $subl or less chars
1810         $self->warn("trouble dissecting \"$line\"\n     into chunks ".
1811             "of $subl chars or less - this tag won't print right");
1812         # insert a space char to prevent infinite loops
1813         $line = substr($line,0,$subl) . " " . substr($line,$subl);
1814     }
1815     my $s = shift @lines;
1816     $self->_print("$pre1$s\n") if $s;
1817     foreach my $s ( @lines ) {
1818         $self->_print("$pre2$s\n");
1819     }
1820 }
1821
1822 =head2 _post_sort
1823
1824  Title   : _post_sort
1825  Usage   : $obj->_post_sort($newval)
1826  Function:
1827  Returns : value of _post_sort
1828  Args    : newvalue (optional)
1829
1830
1831 =cut
1832
1833 sub _post_sort {
1834     my ($obj,$value) = @_;
1835     if( defined $value) {
1836         $obj->{'_post_sort'} = $value;
1837     }
1838     return $obj->{'_post_sort'};
1839 }
1840
1841
1842 =head2 _show_dna
1843
1844  Title   : _show_dna
1845  Usage   : $obj->_show_dna($newval)
1846  Function:
1847  Returns : value of _show_dna
1848  Args    : newvalue (optional)
1849
1850
1851 =cut
1852
1853 sub _show_dna {
1854     my ($obj,$value) = @_;
1855     if( defined $value) {
1856         $obj->{'_show_dna'} = $value;
1857     }
1858     return $obj->{'_show_dna'};
1859 }
1860
1861 =head2 _id_generation_func
1862
1863  Title   : _id_generation_func
1864  Usage   : $obj->_id_generation_func($newval)
1865  Function:
1866  Returns : value of _id_generation_func
1867  Args    : newvalue (optional)
1868
1869
1870 =cut
1871
1872 sub _id_generation_func {
1873     my ($obj,$value) = @_;
1874     if( defined $value ) {
1875         $obj->{'_id_generation_func'} = $value;
1876     }
1877     return $obj->{'_id_generation_func'};
1878 }
1879
1880
1881 =head2 _ac_generation_func
1882
1883  Title   : _ac_generation_func
1884  Usage   : $obj->_ac_generation_func($newval)
1885  Function:
1886  Returns : value of _ac_generation_func
1887  Args    : newvalue (optional)
1888
1889 =cut
1890
1891 sub _ac_generation_func {
1892     my ($obj,$value) = @_;
1893     if( defined $value ) {
1894         $obj->{'_ac_generation_func'} = $value;
1895     }
1896     return $obj->{'_ac_generation_func'};
1897 }
1898
1899
1900 =head2 _sv_generation_func
1901
1902  Title   : _sv_generation_func
1903  Usage   : $obj->_sv_generation_func($newval)
1904  Function:
1905  Returns : value of _sv_generation_func
1906  Args    : newvalue (optional)
1907
1908
1909 =cut
1910
1911 sub _sv_generation_func {
1912     my ($obj,$value) = @_;
1913     if( defined $value ) {
1914         $obj->{'_sv_generation_func'} = $value;
1915     }
1916     return $obj->{'_sv_generation_func'};
1917
1918 }
1919
1920
1921 =head2 _kw_generation_func
1922
1923  Title   : _kw_generation_func
1924  Usage   : $obj->_kw_generation_func($newval)
1925  Function:
1926  Returns : value of _kw_generation_func
1927  Args    : newvalue (optional)
1928
1929 =cut
1930
1931
1932 sub _kw_generation_func {
1933     my ($obj,$value) = @_;
1934     if( defined $value ) {
1935         $obj->{'_kw_generation_func'} = $value;
1936     }
1937     return $obj->{'_kw_generation_func'};
1938 }
1939
1940 1;