genbank.pm: in "_read_GenBank_Species", adjusted a regex to
[bioperl-live.git] / Bio / SeqIO / genbank.pm
blob194a6a65fc67e2e0575a944e4e119191ee6fa13f
2 # BioPerl module for Bio::SeqIO::genbank
4 # Please direct questions and support issues to <bioperl-l@bioperl.org>
6 # Cared for by Bioperl project bioperl-l(at)bioperl.org
8 # Copyright Elia Stupka and contributors see AUTHORS section
10 # You may distribute this module under the same terms as perl itself
12 # POD documentation - main docs before the code
14 =head1 NAME
16 Bio::SeqIO::genbank - GenBank sequence input/output stream
18 =head1 SYNOPSIS
20 It is probably best not to use this object directly, but
21 rather go through the SeqIO handler:
23 $stream = Bio::SeqIO->new(-file => $filename,
24 -format => 'GenBank');
26 while ( my $seq = $stream->next_seq ) {
27 # do something with $seq
31 =head1 DESCRIPTION
33 This object can transform Bio::Seq objects to and from GenBank flat
34 file databases.
36 There is some flexibility here about how to write GenBank output
37 that is not fully documented.
39 =head2 Optional functions
41 =over 3
43 =item _show_dna()
45 (output only) shows the dna or not
47 =item _post_sort()
49 (output only) provides a sorting func which is applied to the FTHelpers
50 before printing
52 =item _id_generation_func()
54 This is function which is called as
56 print "ID ", $func($seq), "\n";
58 To generate the ID line. If it is not there, it generates a sensible ID
59 line using a number of tools.
61 If you want to output annotations in Genbank format they need to be
62 stored in a Bio::Annotation::Collection object which is accessible
63 through the Bio::SeqI interface method L<annotation()|annotation>.
65 The following are the names of the keys which are pulled from a
66 L<Bio::Annotation::Collection> object:
68 reference - Should contain Bio::Annotation::Reference objects
69 comment - Should contain Bio::Annotation::Comment objects
70 dblink - Should contain a Bio::Annotation::DBLink object
71 segment - Should contain a Bio::Annotation::SimpleValue object
72 origin - Should contain a Bio::Annotation::SimpleValue object
73 wgs - Should contain a Bio::Annotation::SimpleValue object
75 =back
77 =head1 Where does the data go?
79 Data parsed in Bio::SeqIO::genbank is stored in a variety of data
80 fields in the sequence object that is returned. Here is a partial list
81 of fields.
83 Items listed as RichSeq or Seq or PrimarySeq and then NAME() tell you
84 the top level object which defines a function called NAME() which
85 stores this information.
87 Items listed as Annotation 'NAME' tell you the data is stored the
88 associated Bio::AnnotationCollectionI object which is associated with
89 Bio::Seq objects. If it is explicitly requested that no annotations
90 should be stored when parsing a record of course they will not be
91 available when you try and get them. If you are having this problem
92 look at the type of SeqBuilder that is being used to contruct your
93 sequence object.
95 Comments Annotation 'comment'
96 References Annotation 'reference'
97 Segment Annotation 'segment'
98 Origin Annotation 'origin'
99 Dbsource Annotation 'dblink'
101 Accessions PrimarySeq accession_number()
102 Secondary accessions RichSeq get_secondary_accessions()
103 GI number PrimarySeq primary_id()
104 LOCUS PrimarySeq display_id()
105 Keywords RichSeq get_keywords()
106 Dates RichSeq get_dates()
107 Molecule RichSeq molecule()
108 Seq Version RichSeq seq_version()
109 PID RichSeq pid()
110 Division RichSeq division()
111 Features Seq get_SeqFeatures()
112 Alphabet PrimarySeq alphabet()
113 Definition PrimarySeq description() or desc()
114 Version PrimarySeq version()
116 Sequence PrimarySeq seq()
118 There is more information in the Feature-Annotation HOWTO about each
119 field and how it is mapped to the Sequence object
120 L<http://bioperl.open-bio.org/wiki/HOWTO:Feature-Annotation>.
122 =head1 FEEDBACK
124 =head2 Mailing Lists
126 User feedback is an integral part of the evolution of this and other
127 Bioperl modules. Send your comments and suggestions preferably to one
128 of the Bioperl mailing lists. Your participation is much appreciated.
130 bioperl-l@bioperl.org - General discussion
131 http://bioperl.org/wiki/Mailing_lists - About the mailing lists
133 =head2 Support
135 Please direct usage questions or support issues to the mailing list:
137 I<bioperl-l@bioperl.org>
139 rather than to the module maintainer directly. Many experienced and
140 reponsive experts will be able look at the problem and quickly
141 address it. Please include a thorough description of the problem
142 with code and data examples if at all possible.
144 =head2 Reporting Bugs
146 Report bugs to the Bioperl bug tracking system to help us keep track
147 the bugs and their resolution. Bug reports can be submitted via the web:
149 https://github.com/bioperl/bioperl-live/issues
151 =head1 AUTHOR - Bioperl Project
153 bioperl-l at bioperl.org
155 Original author Elia Stupka, elia -at- tigem.it
157 =head1 CONTRIBUTORS
159 Ewan Birney birney at ebi.ac.uk
160 Jason Stajich jason at bioperl.org
161 Chris Mungall cjm at fruitfly.bdgp.berkeley.edu
162 Lincoln Stein lstein at cshl.org
163 Heikki Lehvaslaiho, heikki at ebi.ac.uk
164 Hilmar Lapp, hlapp at gmx.net
165 Donald G. Jackson, donald.jackson at bms.com
166 James Wasmuth, james.wasmuth at ed.ac.uk
167 Brian Osborne, bosborne at alum.mit.edu
168 Chris Fields, cjfields at bioperl dot org
170 =head1 APPENDIX
172 The rest of the documentation details each of the object
173 methods. Internal methods are usually preceded with a _
175 =cut
177 # Let the code begin...
179 package Bio::SeqIO::genbank;
180 use strict;
182 use Bio::SeqIO::FTHelper;
183 use Bio::SeqFeature::Generic;
184 use Bio::Species;
185 use Bio::Seq::SeqFactory;
186 use Bio::Annotation::Collection;
187 use Bio::Annotation::Comment;
188 use Bio::Annotation::Reference;
189 use Bio::Annotation::DBLink;
191 use base qw(Bio::SeqIO);
193 # Note that a qualifier that exceeds one line (i.e. a long label) will
194 # automatically be quoted regardless:
196 our $FTQUAL_LINE_LENGTH = 60;
198 our %FTQUAL_NO_QUOTE = map {$_ => 1} qw(
199 anticodon citation
200 codon codon_start
201 cons_splice direction
202 evidence label
203 mod_base number
204 rpt_type rpt_unit
205 transl_except transl_table
206 usedin
209 our %DBSOURCE = map {$_ => 1} qw(
210 EchoBASE IntAct SWISS-2DPAGE ECO2DBASE ECOGENE TIGRFAMs
211 TIGR GO InterPro Pfam PROSITE SGD GermOnline
212 HSSP PhosSite Ensembl RGD AGD ArrayExpress KEGG
213 H-InvDB HGNC LinkHub PANTHER PRINTS SMART SMR
214 MGI MIM RZPD-ProtExp ProDom MEROPS TRANSFAC Reactome
215 UniGene GlycoSuiteDB PIRSF HSC-2DPAGE PHCI-2DPAGE
216 PMMA-2DPAGE Siena-2DPAGE Rat-heart-2DPAGE Aarhus/Ghent-2DPAGE
217 Biocyc MetaCyc Biocyc:Metacyc GenomeReviews FlyBase
218 TMHOBP COMPLUYEAST-2DPAGE OGP DictyBase HAMAP
219 PhotoList Gramene WormBase WormPep Genew ZFIN
220 PeroxiBase MaizeDB TAIR DrugBank REBASE HPA
221 swissprot GenBank GenPept REFSEQ embl PDB UniProtKB
222 DIP PeptideAtlas PRIDE CYGD HOGENOME Gene3D
223 Project);
225 our %VALID_MOLTYPE = map {$_ => 1} qw(NA DNA RNA tRNA rRNA cDNA cRNA ms-DNA
226 mRNA uRNA ss-RNA ss-DNA snRNA snoRNA PRT);
228 our %VALID_ALPHABET = (
229 'bp' => 'dna',
230 'aa' => 'protein',
231 'rc' => '' # rc = release candidate; file has no sequences
234 sub _initialize {
235 my($self, @args) = @_;
237 $self->SUPER::_initialize(@args);
238 # hash for functions for decoding keys.
239 $self->{'_func_ftunit_hash'} = {};
240 $self->_show_dna(1); # sets this to one by default. People can change it
241 if ( not defined $self->sequence_factory ) {
242 $self->sequence_factory
243 (Bio::Seq::SeqFactory->new(-verbose => $self->verbose,
244 -type => 'Bio::Seq::RichSeq'));
248 =head2 next_seq
250 Title : next_seq
251 Usage : $seq = $stream->next_seq()
252 Function: returns the next sequence in the stream
253 Returns : Bio::Seq object
254 Args :
256 =cut
258 sub next_seq {
259 my ($self, @args) = @_;
260 my %args = @args;
261 my $builder = $self->sequence_builder;
262 my $seq;
263 my %params;
265 RECORDSTART:
266 while (1) {
267 my $buffer;
268 my ( @acc, @features );
269 my ( $display_id, $annotation );
270 my $species;
272 # initialize; we may come here because of starting over
273 @features = ();
274 $annotation = undef;
275 @acc = ();
276 $species = undef;
277 %params = ( -verbose => $self->verbose ); # reset hash
278 local ($/) = "\n";
279 while ( defined( $buffer = $self->_readline ) ) {
280 last if index( $buffer, 'LOCUS ' ) == 0;
282 return unless defined $buffer; # end of file
283 $buffer =~ /^LOCUS\s+(\S.*)$/o
284 or $self->throw( "GenBank stream with bad LOCUS line. "
285 . "Not GenBank in my book. Got '$buffer'");
286 my @tokens = split( ' ', $1 );
288 # this is important to have the id for display in e.g. FTHelper,
289 # otherwise you won't know which entry caused an error
290 $display_id = shift @tokens;
291 $params{'-display_id'} = $display_id;
293 # may still be useful if we don't want the seq
294 my $seqlength = shift @tokens;
295 if ( exists $VALID_ALPHABET{$seqlength} ) {
296 # moved one token too far. No locus name?
297 $self->warn( "Bad LOCUS name? Changing [$params{'-display_id'}] "
298 . "to 'unknown' and length to '$display_id'"
300 $params{'-display_id'} = 'unknown';
301 $params{'-length'} = $display_id;
303 # add token back...
304 unshift @tokens, $seqlength;
306 else {
307 $params{'-length'} = $seqlength;
310 # the alphabet of the entry
311 # shouldn't assign alphabet unless one
312 # is specifically designated (such as for rc files)
313 my $alphabet = lc( shift @tokens );
314 $params{'-alphabet'} =
315 ( exists $VALID_ALPHABET{$alphabet} )
316 ? $VALID_ALPHABET{$alphabet}
317 : $self->warn("Unknown alphabet: $alphabet");
319 # for aa there is usually no 'molecule' (mRNA etc)
320 if ( $params{'-alphabet'} eq 'protein' ) {
321 $params{'-molecule'} = 'PRT';
323 else {
324 $params{'-molecule'} = shift(@tokens);
327 # take care of lower case issues
328 if ( $params{'-molecule'} eq 'dna' or $params{'-molecule'} eq 'rna' ) {
329 $params{'-molecule'} = uc $params{'-molecule'};
331 $self->debug( "Unrecognized molecule type: " . $params{'-molecule'} )
332 if not exists( $VALID_MOLTYPE{ $params{'-molecule'} } );
334 my $circ = shift @tokens;
335 if ( $circ eq 'circular' ) {
336 $params{'-is_circular'} = 1;
337 $params{'-division'} = shift @tokens;
339 else {
340 # 'linear' or 'circular' may actually be omitted altogether
341 $params{'-division'} =
342 ( CORE::length($circ) == 3 ) ? $circ : shift @tokens;
344 my $date = join( ' ', @tokens ); # we lump together the rest
346 # this is per request bug #1513
347 # we can handle:
348 # 9-10-2003
349 # 9-10-03
350 # 09-10-2003
351 # 09-10-03
352 if ( $date =~ s/\s*((\d{1,2})-(\w{3})-(\d{2,4})).*/$1/ ) {
353 if ( length($date) < 11 ) {
354 # improperly formatted date
355 # But we'll be nice and fix it for them
356 my ( $d, $m, $y ) = ( $2, $3, $4 );
357 if ( length($d) == 1 ) {
358 $d = "0$d";
361 # guess the century here
362 if ( length($y) == 2 ) {
363 if ( $y > 60 ) { # arbitrarily guess that '60' means 1960
364 $y = "19$y";
366 else {
367 $y = "20$y";
369 $self->warn( "Date was malformed, guessing the "
370 . "century for $date to be $y\n"
373 $params{'-dates'} = [ join( '-', $d, $m, $y ) ];
375 else {
376 $params{'-dates'} = [$date];
380 # set them all at once
381 $builder->add_slot_value(%params);
382 %params = ();
384 # parse the rest if desired, otherwise start over
385 if ( not $builder->want_object ) {
386 $builder->make_object;
387 next RECORDSTART;
390 # set up annotation depending on what the builder wants
391 if ( $builder->want_slot('annotation') ) {
392 $annotation = Bio::Annotation::Collection->new;
395 $buffer = $self->_readline;
396 while ( defined( my $line = $buffer ) ) {
397 # Description line(s)
398 if ($line =~ /^DEFINITION\s+(\S.*\S)/) {
399 my @desc = ($1);
400 while ( defined( $line = $self->_readline ) ) {
401 if ($line =~ /^\s+(.*)/) {
402 push( @desc, $1 );
403 next;
405 last;
407 $builder->add_slot_value( -desc => join( ' ', @desc ) );
409 # we'll continue right here because DEFINITION
410 # always comes at the top of the entry
411 $buffer = $line;
414 # accession number (there can be multiple accessions)
415 if ($line =~ /^ACCESSION\s+(\S.*\S)/) {
416 push( @acc, split( /\s+/, $1 ) );
417 while ( defined( $line = $self->_readline ) ) {
418 if ($line =~ /^\s+(.*)/) {
419 push( @acc, split( /\s+/, $1 ) );
420 next;
422 last;
424 $buffer = $line;
425 next;
428 # PID
429 elsif ($line =~ /^PID\s+(\S+)/) {
430 $params{'-pid'} = $1;
433 # Version number
434 elsif ($line =~ /^VERSION\s+(\S.+)$/) {
435 my ( $acc, $gi ) = split( ' ', $1 );
436 if ( $acc =~ /^\w+\.(\d+)/ ) {
437 $params{'-version'} = $1;
438 $params{'-seq_version'} = $1;
440 if ( $gi && ( index( $gi, "GI:" ) == 0 ) ) {
441 $params{'-primary_id'} = substr( $gi, 3 );
445 # Keywords
446 elsif ($line =~ /^KEYWORDS\s+(\S.*)/) {
447 my @kw = split( /\s*\;\s*/, $1 );
448 while ( defined( $line = $self->_readline ) ) {
449 chomp $line;
450 if ($line =~ /^\s+(.*)/) {
451 push( @kw, split( /\s*\;\s*/, $1 ) );
452 next;
454 last;
457 @kw && $kw[-1] =~ s/\.$//;
458 $params{'-keywords'} = \@kw;
459 $buffer = $line;
460 next;
463 # Organism name and phylogenetic information
464 elsif ($line =~ /^SOURCE\s+\S/) {
465 if ( $builder->want_slot('species') ) {
466 $species = $self->_read_GenBank_Species( \$buffer );
467 $builder->add_slot_value( -species => $species );
469 else {
470 while ( defined( $buffer = $self->_readline ) ) {
471 last if substr( $buffer, 0, 1 ) ne ' ';
474 next;
477 # References
478 elsif ($line =~ /^REFERENCE\s+\S/) {
479 if ($annotation) {
480 my @refs = $self->_read_GenBank_References( \$buffer );
481 foreach my $ref (@refs) {
482 $annotation->add_Annotation( 'reference', $ref );
485 else {
486 while ( defined( $buffer = $self->_readline ) ) {
487 last if substr( $buffer, 0, 1 ) ne ' ';
490 next;
493 # Project
494 elsif ($line =~ /^PROJECT\s+(\S.*)/) {
495 if ($annotation) {
496 my $project =
497 Bio::Annotation::SimpleValue->new( -value => $1 );
498 $annotation->add_Annotation( 'project', $project );
502 # Comments
503 elsif ($line =~ /^COMMENT\s+(\S.*)/) {
504 if ($annotation) {
505 my $comment = $1;
506 while ( defined( $line = $self->_readline ) ) {
507 last if ($line =~ /^\S/);
508 $comment .= $line;
510 $comment =~ s/\n/ /g;
511 $comment =~ s/ +/ /g;
512 $annotation->add_Annotation(
513 'comment',
514 Bio::Annotation::Comment->new(
515 -text => $comment,
516 -tagname => 'comment'
519 $buffer = $line;
521 else {
522 while ( defined( $buffer = $self->_readline ) ) {
523 last if substr( $buffer, 0, 1 ) ne ' ';
526 next;
529 # Corresponding Genbank nucleotide id, Genpept only
530 elsif ($line =~ /^DB(?:SOURCE|LINK)\s+(\S.+)/) {
531 if ($annotation) {
532 my $dbsource = $1;
533 while ( defined( $line = $self->_readline ) ) {
534 last if ($line =~ /^\S/);
535 $dbsource .= $line;
538 # deal with UniProKB dbsources
539 if ( $dbsource =~
540 s/(UniProt(?:KB)?|swissprot):\s+locus\s+(\S+)\,.+\n//
542 $annotation->add_Annotation(
543 'dblink',
544 Bio::Annotation::DBLink->new(
545 -primary_id => $2,
546 -database => $1,
547 -tagname => 'dblink'
550 if ( $dbsource =~ s/\s+created:\s+([^\.]+)\.\n// ) {
551 $annotation->add_Annotation(
552 'swissprot_dates',
553 Bio::Annotation::SimpleValue->new(
554 -tagname => 'date_created',
555 -value => $1
559 while ( $dbsource =~
560 s/\s+(sequence|annotation)\s+updated:\s+([^\.]+)\.\n//g
562 $annotation->add_Annotation(
563 'swissprot_dates',
564 Bio::Annotation::SimpleValue->new(
565 -tagname => 'date_updated',
566 -value => $2
570 $dbsource =~ s/\n/ /g;
571 if ( $dbsource =~
572 s/\s+xrefs:\s+((?:\S+,\s+)+\S+)\s+xrefs/xrefs/
574 # will use $i to determine even or odd
575 # for swissprot the accessions are paired
576 my $i = 0;
577 for my $dbsrc ( split( /,\s+/, $1 ) ) {
578 if ( $dbsrc =~ /(\S+)\.(\d+)/
579 or $dbsrc =~ /(\S+)/
581 my ( $id, $version ) = ( $1, $2 );
582 $version = '' unless defined $version;
583 my $db = ( $id =~ /^\d\S{3}/ ) ? 'PDB'
584 : ( $i++ % 2 ) ? 'GenPept'
585 : 'GenBank';
587 $annotation->add_Annotation(
588 'dblink',
589 Bio::Annotation::DBLink->new(
590 -primary_id => $id,
591 -version => $version,
592 -database => $db,
593 -tagname => 'dblink'
599 elsif ( $dbsource =~ s/\s+xrefs:\s+(.+)\s+xrefs/xrefs/i ) {
600 # download screwed up and ncbi didn't put acc in for gi numbers
601 my $i = 0;
602 for my $id ( split( /\,\s+/, $1 ) ) {
603 my ( $acc, $db );
604 if ( $id =~ /gi:\s+(\d+)/ ) {
605 $acc = $1;
606 $db = ( $i++ % 2 ) ? 'GenPept' : 'GenBank';
608 elsif ( $id =~ /pdb\s+accession\s+(\S+)/ ) {
609 $acc = $1;
610 $db = 'PDB';
612 else {
613 $acc = $id;
614 $db = '';
616 $annotation->add_Annotation(
617 'dblink',
618 Bio::Annotation::DBLink->new(
619 -primary_id => $acc,
620 -database => $db,
621 -tagname => 'dblink'
626 else {
627 $self->debug("Cannot match $dbsource\n");
629 if ( $dbsource =~ s/xrefs\s+
630 \(non\-sequence\s+databases\):\s+
631 ((?:\S+,\s+)+\S+)//x
633 for my $id ( split( /\,\s+/, $1 ) ) {
634 my $db;
636 # this is because GenBank dropped the spaces!!!
637 # I'm sure we're not going to get this right
638 ##if ( $id =~ s/^://i ) {
639 ## $db = $1;
641 $db = substr( $id, 0, index( $id, ':' ) );
642 if ( not exists $DBSOURCE{$db} ) {
643 $db = ''; # do we want 'GenBank' here?
645 $id = substr( $id, index( $id, ':' ) + 1 );
646 $annotation->add_Annotation(
647 'dblink',
648 Bio::Annotation::DBLink->new(
649 -primary_id => $id,
650 -database => $db,
651 -tagname => 'dblink'
657 else {
658 if ( $dbsource =~
659 /^(\S*?):?\s*accession\s+(\S+)\.(\d+)/
661 my ( $db, $id, $version ) = ( $1, $2, $3 );
662 $annotation->add_Annotation(
663 'dblink',
664 Bio::Annotation::DBLink->new(
665 -primary_id => $id,
666 -version => $version,
667 -database => $db || 'GenBank',
668 -tagname => 'dblink'
672 elsif ( $dbsource =~ /^(\S*?):?\s*accession\s+(\S+)/ ) {
673 my ( $db, $id ) = ( $1, $2 );
674 $annotation->add_Annotation(
675 'dblink',
676 Bio::Annotation::DBLink->new(
677 -primary_id => $id,
678 -database => $db || 'GenBank',
679 -tagname => 'dblink'
683 elsif ( $dbsource =~ /(\S+)([\.:])\s*(\S+)/ ) {
684 my ( $db, $version );
685 my @ids = ();
686 if ( $2 eq ':' ) {
687 $db = $1;
689 # Genbank 192 release notes say this: "The second
690 # field can consist of multiple comma-separated
691 # identifiers, if a sequence record has multiple
692 # DBLINK cross-references of a given type."
693 # For example: DBLINK Project:100,200,300"
694 @ids = split( /,/, $3 );
696 else {
697 ( $db, $version ) = ( 'GenBank', $3 );
698 $ids[0] = $1;
701 foreach my $id (@ids) {
702 $annotation->add_Annotation(
703 'dblink',
704 Bio::Annotation::DBLink->new(
705 -primary_id => $id,
706 -version => $version,
707 -database => $db,
708 -tagname => 'dblink'
713 else {
714 $self->warn(
715 "Unrecognized DBSOURCE data: $dbsource\n");
718 $buffer = $line;
720 else {
721 while ( defined( $buffer = $self->_readline ) ) {
722 last if substr( $buffer, 0, 1 ) ne ' ';
725 next;
728 # Exit at start of Feature table, or start of sequence
729 if ($line =~ /^(FEATURES|ORIGIN)/) {
730 my $trap;
732 last if ($line =~ /^(FEATURES|ORIGIN)/);
734 # Get next line and loop again
735 $buffer = $self->_readline;
737 return unless defined $buffer;
739 # add them all at once for efficiency
740 $builder->add_slot_value(
741 -accession_number => shift(@acc),
742 -secondary_accessions => \@acc,
743 %params
745 $builder->add_slot_value( -annotation => $annotation ) if $annotation;
746 %params = (); # reset before possible re-use to avoid setting twice
748 # start over if we don't want to continue with this entry
749 if ( not $builder->want_object ) {
750 $builder->make_object;
751 next RECORDSTART;
754 # some "minimal" formats may not necessarily have a feature table
755 if ( $builder->want_slot('features')
756 and defined $buffer
757 and $buffer =~ /^FEATURES/o
759 # need to read the first line of the feature table
760 $buffer = $self->_readline;
762 # DO NOT read lines in the while condition -- this is done
763 # as a side effect in _read_FTHelper_GenBank!
765 # part of new circular spec:
766 # commented out for now until kinks worked out
767 #my $sourceEnd = 0;
768 #$sourceEnd = $2 if ($buffer =~ /(\d+?)\.\.(\d+?)$/);
770 while ( defined $buffer ) {
771 # check immediately -- not at the end of the loop
772 # note: GenPept entries obviously do not have a BASE line
773 last if ( $buffer =~ /^BASE|ORIGIN|CONTIG|WGS/o );
775 # slurp in one feature at a time -- at return, the start of
776 # the next feature will have been read already, so we need
777 # to pass a reference, and the called method must set this
778 # to the last line read before returning
779 my $ftunit = $self->_read_FTHelper_GenBank( \$buffer );
781 # implement new circular spec: features that cross the origin are now
782 # seamless instead of being 2 separate joined features
783 # commented out until kinks get worked out
784 #if ((! $args{'-nojoin'}) && $ftunit->{'loc'} =~ /^join\((\d+?)\.\.(\d+?),(\d+?)..(\d+?)\)$/
785 #&& $sourceEnd == $2 && $3 == 1) {
786 #my $start = $1;
787 #my $end = $2 + $4;
788 #$ftunit->{'loc'} = "$start..$end";
791 # fix suggested by James Diggans
793 if ( not defined $ftunit ) {
794 # GRRRR. We have fallen over. Try to recover
795 $self->warn( "Unexpected error in feature table for "
796 . $params{'-display_id'}
797 . " Skipping feature, attempting to recover" );
799 unless ( $buffer =~ /^\s{5,5}\S+/o
800 or $buffer =~ /^\S+/o
802 $buffer = $self->_readline;
804 next; # back to reading FTHelpers
807 # process ftunit
808 my $feat =
809 $ftunit->_generic_seqfeature( $self->location_factory,
810 $display_id );
812 # add taxon_id from source if available
813 if ( $species
814 and $feat->primary_tag eq 'source'
815 and $feat->has_tag('db_xref')
816 and ( not $species->ncbi_taxid
817 or ( $species->ncbi_taxid
818 and $species->ncbi_taxid =~ /^list/ ) )
820 foreach my $tagval ( $feat->get_tag_values('db_xref') ) {
821 if ( index( $tagval, "taxon:" ) == 0 ) {
822 $species->ncbi_taxid( substr( $tagval, 6 ) );
823 last;
828 # add feature to list of features
829 push( @features, $feat );
831 $builder->add_slot_value( -features => \@features );
834 if ( defined $buffer ) {
835 # CONTIG lines: TODO, this needs to be cleaned up
836 if ($buffer =~/^CONTIG\s+(.*)/o) {
837 my $ctg = $1;
838 while ( defined( $buffer = $self->_readline ) ) {
839 last if $buffer =~ m{^ORIGIN|//}o;
840 $buffer =~ s/\s+(.*)/$1/;
841 $ctg .= $buffer;
843 if ($ctg) {
844 $annotation->add_Annotation(
845 Bio::Annotation::SimpleValue->new(
846 -tagname => 'contig',
847 -value => $ctg
852 elsif ($buffer =~ /^WGS|WGS_SCAFLD\s+/o) { # catch WGS/WGS_SCAFLD lines
853 while ( $buffer =~ s/(^WGS|WGS_SCAFLD)\s+// ) { # gulp lines
854 chomp $buffer;
855 $annotation->add_Annotation(
856 Bio::Annotation::SimpleValue->new(
857 -value => $buffer,
858 -tagname => lc $1
861 $buffer = $self->_readline;
864 elsif ( $buffer !~ m{^ORIGIN|//}o ) { # advance to the sequence, if any
865 while ( defined( $buffer = $self->_readline ) ) {
866 last if $buffer =~ m{^(ORIGIN|//)};
870 if ( not $builder->want_object ) {
871 $builder->make_object; # implicit end-of-object
872 next RECORDSTART;
874 if ( $builder->want_slot('seq') ) {
875 # the fact that we want a sequence does not necessarily mean that
876 # there also is a sequence ...
877 if ( defined $buffer and $buffer =~ s/^ORIGIN\s+// ) {
878 if ( $annotation and length($buffer) > 0 ) {
879 $annotation->add_Annotation(
880 'origin',
881 Bio::Annotation::SimpleValue->new(
882 -tagname => 'origin',
883 -value => $buffer
887 my $seqc = '';
888 while ( defined( $buffer = $self->_readline ) ) {
889 last if $buffer =~ m{^//};
890 $buffer = uc $buffer;
891 $buffer =~ s/[^A-Za-z]//g;
892 $seqc .= $buffer;
895 $builder->add_slot_value( -seq => $seqc );
898 elsif ( defined($buffer) and ( substr( $buffer, 0, 2 ) ne '//' ) ) {
899 # advance to the end of the record
900 while ( defined( $buffer = $self->_readline ) ) {
901 last if substr( $buffer, 0, 2 ) eq '//';
905 # Unlikely, but maybe the sequence is so weird that we don't want it
906 # anymore. We don't want to return undef if the stream's not exhausted
907 # yet.
908 $seq = $builder->make_object;
909 next RECORDSTART unless $seq;
910 last RECORDSTART;
911 } # end while RECORDSTART
913 return $seq;
916 =head2 write_seq
918 Title : write_seq
919 Usage : $stream->write_seq($seq)
920 Function: writes the $seq object (must be seq) to the stream
921 Returns : 1 for success and 0 for error
922 Args : array of 1 to n Bio::SeqI objects
924 =cut
926 sub write_seq {
927 my ($self,@seqs) = @_;
929 foreach my $seq ( @seqs ) {
930 $self->throw("Attempting to write with no seq!") unless defined $seq;
932 if ( not ref $seq or not $seq->isa('Bio::SeqI') ) {
933 $self->warn(" $seq is not a SeqI compliant module. Attempting to dump, but may fail!");
936 my $str = $seq->seq;
937 my $len = $seq->length;
938 my $alpha = $seq->alphabet;
940 my ($div, $mol);
941 if ( not $seq->can('division')
942 or not defined($div = $seq->division)
944 $div = 'UNK';
946 if ( not $seq->can('molecule')
947 or not defined ($mol = $seq->molecule)
949 $mol = $alpha || 'DNA';
952 my $circular = ($seq->is_circular) ? 'circular' : 'linear ';
954 local($^W) = 0; # supressing warnings about uninitialized fields.
956 my $temp_line;
957 if ( $self->_id_generation_func ) {
958 $temp_line = &{$self->_id_generation_func}($seq);
960 else {
961 my $date = '';
962 if ( $seq->can('get_dates') ) {
963 ($date) = $seq->get_dates;
966 $self->warn("No whitespace allowed in GenBank display id [". $seq->display_id. "]")
967 if $seq->display_id =~ /\s/;
969 my @data = ( lc($alpha) eq 'protein' ) ? ('aa', '', '') : ('bp', '', $mol);
970 $temp_line = sprintf ("%-12s%-15s%13s %s%4s%-8s%-8s %3s %-s\n",
971 'LOCUS', $seq->id, $len,
972 @data, $circular, $div, $date);
975 $self->_print($temp_line);
976 $self->_write_line_GenBank_regex("DEFINITION ", " ",
977 $seq->desc, "\\s\+\|\$",80);
979 # if there, write the accession line
981 if ( $self->_ac_generation_func ) {
982 $temp_line = &{$self->_ac_generation_func}($seq);
983 $self->_print("ACCESSION $temp_line\n");
985 else {
986 my @acc = ();
987 push @acc, $seq->accession_number;
988 if ( $seq->isa('Bio::Seq::RichSeqI') ) {
989 push @acc, $seq->get_secondary_accessions;
991 $self->_print("ACCESSION ", join(" ", @acc), "\n");
992 # otherwise - cannot print <sigh>
995 # if PID defined, print it
996 if ($seq->isa('Bio::Seq::RichSeqI') and $seq->pid) {
997 $self->_print("PID ", $seq->pid, "\n");
1000 # if there, write the version line
1001 if ( defined $self->_sv_generation_func ) {
1002 $temp_line = &{$self->_sv_generation_func}($seq);
1003 if ( $temp_line ) {
1004 $self->_print("VERSION $temp_line\n");
1007 elsif ($seq->isa('Bio::Seq::RichSeqI') and defined($seq->seq_version)) {
1008 my $id = $seq->primary_id; # this may be a GI number
1009 my $data = (defined $id and $id =~ /^\d+$/) ? " GI:$id" : "";
1010 $self->_print("VERSION ",
1011 $seq->accession_number, ".",
1012 $seq->seq_version, $data, "\n");
1015 # if there, write the PROJECT line
1016 for my $proj ( $seq->annotation->get_Annotations('project') ) {
1017 $self->_print("PROJECT ".$proj->value."\n");
1020 # if there, write the DBSOURCE line
1021 foreach my $ref ( $seq->annotation->get_Annotations('dblink') ) {
1022 my ($db, $id) = ($ref->database, $ref->primary_id);
1023 my $prefix = $db eq 'Project' ? 'DBLINK' : 'DBSOURCE';
1024 my $text = $db eq 'GenBank' ? ''
1025 : $db eq 'Project' ? "$db:$id"
1026 : "$db accession $id";
1027 $self->_print(sprintf ("%-11s %s\n", $prefix, $text));
1030 # if there, write the keywords line
1031 if ( defined $self->_kw_generation_func ) {
1032 $temp_line = &{$self->_kw_generation_func}($seq);
1033 $self->_print("KEYWORDS $temp_line\n");
1035 elsif ( $seq->can('keywords') ) {
1036 my $kw = $seq->keywords;
1037 $kw .= '.' if ( $kw !~ /\.$/ );
1038 $self->_print("KEYWORDS $kw\n");
1041 # SEGMENT if it exists
1042 foreach my $ref ( $seq->annotation->get_Annotations('segment') ) {
1043 $self->_print(sprintf ("%-11s %s\n",'SEGMENT',
1044 $ref->value));
1047 # Organism lines
1048 if (my $spec = $seq->species) {
1049 my ($on, $sn, $cn) = ($spec->can('organelle') ? $spec->organelle : '',
1050 $spec->scientific_name,
1051 $spec->common_name);
1052 my @classification;
1053 if ($spec->isa('Bio::Species')) {
1054 @classification = $spec->classification;
1055 shift @classification;
1057 else {
1058 # Bio::Taxon should have a DB handle of some type attached, so
1059 # derive the classification from that
1060 my $node = $spec;
1061 while ($node) {
1062 $node = $node->ancestor || last;
1063 unshift @classification, $node->node_name;
1064 #$node eq $root && last;
1066 @classification = reverse @classification;
1068 my $abname = $spec->name('abbreviated') ? # from genbank file
1069 $spec->name('abbreviated')->[0] : $sn;
1070 my $sl = $on ? "$on " : '';
1071 $sl .= $cn ? "$abname ($cn)" : $abname;
1073 $self->_write_line_GenBank_regex("SOURCE ", ' 'x12, $sl, "\\s\+\|\$", 80);
1074 $self->_print(" ORGANISM ", $spec->scientific_name, "\n");
1075 my $OC = join('; ', reverse @classification) . '.';
1076 $self->_write_line_GenBank_regex(' 'x12,' 'x12, $OC, "\\s\+\|\$", 80);
1079 # Reference lines
1080 my $count = 1;
1081 foreach my $ref ( $seq->annotation->get_Annotations('reference') ) {
1082 $temp_line = "REFERENCE $count";
1083 if ($ref->start) {
1084 $temp_line .= sprintf (" (%s %d to %d)",
1085 ($seq->alphabet() eq "protein" ?
1086 "residues" : "bases"),
1087 $ref->start, $ref->end);
1089 elsif ($ref->gb_reference) {
1090 $temp_line .= sprintf (" (%s)", $ref->gb_reference);
1092 $self->_print("$temp_line\n");
1093 $self->_write_line_GenBank_regex(" AUTHORS ", ' 'x12,
1094 $ref->authors, "\\s\+\|\$", 80);
1095 $self->_write_line_GenBank_regex(" CONSRTM ", ' 'x12,
1096 $ref->consortium, "\\s\+\|\$", 80) if $ref->consortium;
1097 $self->_write_line_GenBank_regex(" TITLE ", ' 'x12,
1098 $ref->title, "\\s\+\|\$", 80);
1099 $self->_write_line_GenBank_regex(" JOURNAL ", ' 'x12,
1100 $ref->location, "\\s\+\|\$", 80);
1101 if ( $ref->medline) {
1102 $self->_write_line_GenBank_regex(" MEDLINE ", ' 'x12,
1103 $ref->medline, "\\s\+\|\$", 80);
1104 # I am assuming that pubmed entries only exist when there
1105 # are also MEDLINE entries due to the indentation
1107 # This could be a wrong assumption
1108 if ( $ref->pubmed ) {
1109 $self->_write_line_GenBank_regex(" PUBMED ", ' 'x12,
1110 $ref->pubmed, "\\s\+\|\$", 80);
1112 # put remark at the end
1113 if ($ref->comment) {
1114 $self->_write_line_GenBank_regex(" REMARK ", ' 'x12,
1115 $ref->comment, "\\s\+\|\$", 80);
1117 $count++;
1120 # Comment lines
1121 foreach my $comment ( $seq->annotation->get_Annotations('comment') ) {
1122 $self->_write_line_GenBank_regex("COMMENT ", ' 'x12,
1123 $comment->text, "\\s\+\|\$", 80);
1126 # FEATURES section
1127 $self->_print("FEATURES Location/Qualifiers\n");
1129 if ( defined $self->_post_sort ) {
1130 # we need to read things into an array. Process. Sort them. Print 'em
1131 my $post_sort_func = $self->_post_sort;
1132 my @fth;
1134 foreach my $sf ( $seq->top_SeqFeatures ) {
1135 push @fth, Bio::SeqIO::FTHelper::from_SeqFeature($sf, $seq);
1138 @fth = sort { &$post_sort_func($a, $b) } @fth;
1140 foreach my $fth ( @fth ) {
1141 $self->_print_GenBank_FTHelper($fth);
1144 else {
1145 # not post sorted. And so we can print as we get them.
1146 # lower memory load...
1147 foreach my $sf ( $seq->top_SeqFeatures ) {
1148 my @fth = Bio::SeqIO::FTHelper::from_SeqFeature($sf, $seq);
1149 foreach my $fth ( @fth ) {
1150 if ( ! $fth->isa('Bio::SeqIO::FTHelper') ) {
1151 $sf->throw("Cannot process FTHelper... $fth");
1153 $self->_print_GenBank_FTHelper($fth);
1158 # deal with WGS; WGS_SCAFLD present only if WGS is also present
1159 if ($seq->annotation->get_Annotations('wgs')) {
1160 foreach my $wgs (map {$seq->annotation->get_Annotations($_)}
1161 qw(wgs wgs_scaffold)
1163 $self->_print(sprintf ("%-11s %s\n",
1164 uc($wgs->tagname),
1165 $wgs->value));
1167 $self->_show_dna(0);
1169 if ($seq->annotation->get_Annotations('contig')) {
1170 my $ct = 0;
1171 my $cline;
1172 foreach my $contig ($seq->annotation->get_Annotations('contig')) {
1173 unless ($ct) {
1174 $cline = uc($contig->tagname) . " " . $contig->value . "\n";
1176 else {
1177 $cline = " " . $contig->value . "\n";
1179 $self->_print($cline);
1180 $ct++;
1182 $self->_show_dna(0);
1184 if ( $seq->length == 0 ) {
1185 $self->_show_dna(0);
1188 if ( $self->_show_dna == 0 ) {
1189 $self->_print("\n//\n");
1190 return;
1193 # finished printing features.
1195 $str =~ tr/A-Z/a-z/;
1197 my ($o) = $seq->annotation->get_Annotations('origin');
1198 $self->_print(sprintf("%-12s%s\n",
1199 'ORIGIN', $o ? $o->value : ''));
1200 # print out the sequence
1201 my $nuc = 60; # Number of nucleotides per line
1202 my $whole_pat = 'a10' x 6; # Pattern for unpacking a whole line
1203 my $out_pat = 'A11' x 6; # Pattern for packing a line
1204 my $length = length $str;
1206 # Calculate the number of nucleotides which fit on whole lines
1207 my $whole = int($length / $nuc) * $nuc;
1209 # Print the whole lines
1210 my $i;
1211 for ($i = 0; $i < $whole; $i += $nuc) {
1212 my $blocks = pack $out_pat,
1213 unpack $whole_pat,
1214 substr($str, $i, $nuc);
1215 chop $blocks;
1216 $self->_print(sprintf("%9d $blocks\n", $i + $nuc - 59));
1219 # Print the last line
1220 if (my $last = substr($str, $i)) {
1221 my $last_len = length($last);
1222 my $last_pat = 'a10' x int($last_len / 10)
1223 . 'a' . $last_len % 10;
1224 my $blocks = pack $out_pat,
1225 unpack($last_pat, $last);
1226 $blocks =~ s/ +$//;
1227 $self->_print(sprintf("%9d $blocks\n",
1228 $length - $last_len + 1));
1231 $self->_print("//\n");
1233 $self->flush if $self->_flush_on_write && defined $self->_fh;
1234 return 1;
1238 =head2 _print_GenBank_FTHelper
1240 Title : _print_GenBank_FTHelper
1241 Usage :
1242 Function:
1243 Example :
1244 Returns :
1245 Args :
1247 =cut
1249 sub _print_GenBank_FTHelper {
1250 my ( $self, $fth ) = @_;
1252 if ( not ref $fth or not $fth->isa('Bio::SeqIO::FTHelper') ) {
1253 $fth->warn(
1254 "$fth is not a FTHelper class. Attempting to print but there could be issues"
1258 my $spacer = ( length $fth->key >= 15 ) ? ' ' : '';
1259 $self->_write_line_GenBank_regex(
1260 sprintf( " %-16s%s", $fth->key, $spacer ),
1261 " " x 21, $fth->loc, "\,\|\$", 80 );
1263 foreach my $tag ( keys %{ $fth->field } ) {
1264 # Account for hash structure in Annotation::DBLink, not the expected array
1265 if ( $tag eq 'db_xref' and grep /HASH/, @{ $fth->field->{$tag} }) {
1266 for my $ref ( @{ $fth->field->{$tag} } ) {
1267 my $db = $ref->{'database'};
1268 my $id = $ref->{'primary_id'};
1269 $self->_write_line_GenBank_regex
1270 ( " " x 21, " " x 21,
1271 "/$tag=\"$db:$id\"", "\.\|\$", 80 );
1274 # The usual case, where all values are found in an array
1275 else {
1276 foreach my $value ( @{ $fth->field->{$tag} } ) {
1277 $value =~ s/\"/\"\"/g;
1278 if ( $value eq "_no_value" ) {
1279 $self->_write_line_GenBank_regex
1280 ( " " x 21, " " x 21,
1281 "/$tag", "\.\|\$", 80 );
1284 # There are almost 3x more quoted qualifier values and they
1285 # are more common too so we take quoted ones first.
1286 # Long qualifiers, that will be line wrapped, are always quoted
1287 elsif ( not $FTQUAL_NO_QUOTE{$tag}
1288 or length("/$tag=$value") >= $FTQUAL_LINE_LENGTH
1290 my ($pat) = ( $value =~ /\s/ ? '\s|$' : '.|$' );
1291 $self->_write_line_GenBank_regex
1292 ( " " x 21, " " x 21,
1293 "/$tag=\"$value\"", $pat, 80 );
1295 else {
1296 $self->_write_line_GenBank_regex
1297 ( " " x 21, " " x 21,
1298 "/$tag=$value", "\.\|\$", 80 );
1305 =head2 _read_GenBank_References
1307 Title : _read_GenBank_References
1308 Usage :
1309 Function: Reads references from GenBank format. Internal function really
1310 Returns :
1311 Args :
1313 =cut
1315 sub _read_GenBank_References {
1316 my ($self, $buffer) = @_;
1317 my (@refs);
1318 my $ref;
1320 # assumme things are starting with RN
1321 if ( $$buffer !~ /^REFERENCE/ ) {
1322 warn("Not parsing line '$$buffer' which maybe important");
1325 my $line = $$buffer;
1327 my (@title,@loc,@authors,@consort,@com,@medline,@pubmed);
1329 REFLOOP:
1330 while( defined($line) or defined($line = $self->_readline) ) {
1331 if ($line =~ /^\s{2}AUTHORS\s+(.*)/o) {
1332 push @authors, $1;
1333 while ( defined($line = $self->_readline) ) {
1334 if ($line =~ /^\s{9,}(.*)/o) {
1335 push @authors, $1;
1336 next;
1338 last;
1340 $ref->authors(join(' ', @authors));
1343 if ($line =~ /^\s{2}CONSRTM\s+(.*)/o) {
1344 push @consort, $1;
1345 while ( defined($line = $self->_readline) ) {
1346 if ($line =~ /^\s{9,}(.*)/o) {
1347 push @consort, $1;
1348 next;
1350 last;
1352 $ref->consortium(join(' ', @consort));
1355 if ($line =~ /^\s{2}TITLE\s+(.*)/o) {
1356 push @title, $1;
1357 while ( defined($line = $self->_readline) ) {
1358 if ($line =~ /^\s{9,}(.*)/o) {
1359 push @title, $1;
1360 next;
1362 last;
1364 $ref->title(join(' ', @title));
1367 if ($line =~ /^\s{2}JOURNAL\s+(.*)/o) {
1368 push @loc, $1;
1369 while ( defined($line = $self->_readline) ) {
1370 # we only match when there are at least 4 spaces
1371 # there is probably a better way to match this
1372 # as it assumes that the describing tag is short enough
1373 if ($line =~ /^\s{9,}(.*)/o) {
1374 push @loc, $1;
1375 next;
1377 last;
1379 $ref->location(join(' ', @loc));
1380 redo REFLOOP;
1383 if ($line =~ /^\s{2}REMARK\s+(.*)/o) {
1384 push @com, $1;
1385 while ( defined($line = $self->_readline) ) {
1386 if ($line =~ /^\s{9,}(.*)/o) {
1387 push @com, $1;
1388 next;
1390 last;
1392 $ref->comment(join(' ', @com));
1393 redo REFLOOP;
1396 if ( $line =~ /^\s{2}MEDLINE\s+(.*)/ ) {
1397 push @medline, $1;
1398 while ( defined($line = $self->_readline) ) {
1399 if ($line =~ /^\s{9,}(.*)/) {
1400 push @medline, $1;
1401 next;
1403 last;
1405 $ref->medline(join(' ', @medline));
1406 redo REFLOOP;
1409 if ( $line =~ /^\s{3}PUBMED\s+(.*)/ ) {
1410 push @pubmed, $1;
1411 while ( defined($line = $self->_readline) ) {
1412 if ($line =~ /^\s{9,}(.*)/) {
1413 push @pubmed, $1;
1414 next;
1416 last;
1418 $ref->pubmed(join(' ', @pubmed));
1419 redo REFLOOP;
1422 if ( $line =~ /^REFERENCE/o ) {
1423 # store current reference
1424 $self->_add_ref_to_array(\@refs,$ref) if defined $ref;
1426 # reset
1427 @authors = ();
1428 @title = ();
1429 @loc = ();
1430 @com = ();
1431 @pubmed = ();
1432 @medline = ();
1434 # create the new reference object
1435 $ref = Bio::Annotation::Reference->new(-tagname => 'reference');
1437 # check whether start and end base is given
1438 if ($line =~ /^REFERENCE\s+\d+\s+\([a-z]+ (\d+) to (\d+)\)/){
1439 $ref->start($1);
1440 $ref->end($2);
1442 elsif ($line =~ /^REFERENCE\s+\d+\s+\((.*)\)/) {
1443 $ref->gb_reference($1);
1447 last if ($line =~ /^(FEATURES)|(COMMENT)/o);
1449 $line = undef; # Empty $line to trigger read of next line
1452 # store last reference
1453 $self->_add_ref_to_array(\@refs, $ref) if defined $ref;
1455 $$buffer = $line;
1457 #print "\nnumber of references found: ", $#refs+1,"\n";
1459 return @refs;
1462 =head2 _add_ref_to_array
1464 Title: _add_ref_to_array
1465 Usage:
1466 Function: Adds a Reference object to an array of Reference objects, takes
1467 care of possible cleanups to be done (currently, only author and title
1468 will be chopped of trailing semicolons).
1469 Args: A reference to an array of Reference objects and
1470 the Reference object to be added
1471 Returns: nothing
1473 =cut
1475 sub _add_ref_to_array {
1476 my ($self, $refs, $ref) = @_;
1478 # first, polish author and title by removing possible trailing semicolons
1479 my $au = $ref->authors;
1480 my $title = $ref->title;
1481 $au =~ s/;\s*$//g if $au;
1482 $title =~ s/;\s*$//g if $title;
1483 $ref->authors($au);
1484 $ref->title($title);
1485 # the rest should be clean already, so go ahead and add it
1486 push @{$refs}, $ref;
1489 =head2 _read_GenBank_Species
1491 Title : _read_GenBank_Species
1492 Usage :
1493 Function: Reads the GenBank Organism species and classification
1494 lines. Able to deal with unconvential Organism naming
1495 formats, and varietas in plants
1496 Example : ORGANISM unknown marine gamma proteobacterium NOR5
1497 $genus = undef
1498 $species = unknown marine gamma proteobacterium NOR5
1500 ORGANISM Drosophila sp. 'white tip scutellum'
1501 $genus = Drosophila
1502 $species = sp. 'white tip scutellum'
1503 (yes, this really is a species and that is its name)
1504 $subspecies = undef
1506 ORGANISM Ajellomyces capsulatus var. farciminosus
1507 $genus = Ajellomyces
1508 $species = capsulatus
1509 $subspecies = var. farciminosus
1511 ORGANISM Hepatitis delta virus
1512 $genus = undef (though this virus has a genus in its lineage, we
1513 cannot know that without a database lookup)
1514 $species = Hepatitis delta virus
1516 Returns : A Bio::Species object
1517 Args : A reference to the current line buffer
1519 =cut
1521 sub _read_GenBank_Species {
1522 my ($self, $buffer) = @_;
1524 my @unkn_names = ('other', 'unknown organism', 'not specified', 'not shown',
1525 'Unspecified', 'Unknown', 'None', 'unclassified',
1526 'unidentified organism', 'not supplied');
1527 # dictionary of synonyms for taxid 32644
1528 my @unkn_genus = ('unknown', 'unclassified', 'uncultured', 'unidentified');
1529 # all above can be part of valid species name
1531 my $line = $$buffer;
1533 my( $sub_species, $species, $genus, $sci_name, $common,
1534 $class_lines, $source_flag, $abbr_name, $organelle, $sl );
1535 my %source = map { $_ => 1 } qw(SOURCE ORGANISM CLASSIFICATION);
1537 # upon first entering the loop, we must not read a new line -- the SOURCE
1538 # line is already in the buffer (HL 05/10/2000)
1539 my ($ann, $tag, $data);
1540 while (defined($line) or defined($line = $self->_readline)) {
1541 # de-HTMLify (links that may be encountered here don't contain
1542 # escaped '>', so a simple-minded approach suffices)
1543 $line =~ s{<[^>]+>}{}g;
1544 if ($line =~ m{^(?:\s{0,2})(\w+)\s+(.+)?$}ox) {
1545 ($tag, $data) = ($1, $2 || '');
1546 last if ($tag and not exists $source{$tag});
1548 else {
1549 return unless $tag;
1550 ($data = $line) =~ s{^\s+}{};
1551 chomp $data;
1552 $tag = 'CLASSIFICATION' if ( $tag ne 'CLASSIFICATION'
1553 and $tag eq 'ORGANISM'
1554 # Don't match "str." or "var." (NC_021815)
1555 and $line =~ m{(?<!\bstr|\bvar)[;\.]+});
1557 (exists $ann->{$tag}) ? ($ann->{$tag} .= ' '.$data) : ($ann->{$tag} .= $data);
1558 $line = undef;
1561 ($sl, $class_lines, $sci_name) = ($ann->{SOURCE}, $ann->{CLASSIFICATION}, $ann->{ORGANISM});
1563 $$buffer = $line;
1565 $sci_name or return;
1567 # parse out organelle, common name, abbreviated name if present;
1568 # this should catch everything, but falls back to
1569 # entire SOURCE line just in case
1570 if ($sl =~ m{^(mitochondrion|chloroplast|plastid)?
1571 \s*(.*?)
1572 \s*(?: \( (.*?) \) )?\.?
1574 }xms
1576 ($organelle, $abbr_name, $common) = ($1, $2, $3); # optional
1578 else {
1579 $abbr_name = $sl; # nothing caught; this is a backup!
1582 # Convert data in classification lines into classification array.
1583 # only split on ';' or '.' so that classification that is 2 or more words will
1584 # still get matched, use map() to remove trailing/leading/intervening spaces
1585 my @class = map { $_ =~ s/^\s+//;
1586 $_ =~ s/\s+$//;
1587 $_ =~ s/\s{2,}/ /g;
1588 $_; }
1589 split /(?<!subgen)[;\.]+/, $class_lines;
1591 # do we have a genus?
1592 my $possible_genus = quotemeta($class[-1])
1593 . ($class[-2] ? "|" . quotemeta($class[-2]) : '');
1594 if ($sci_name =~ /^($possible_genus)/) {
1595 $genus = $1;
1596 ($species) = $sci_name =~ /^$genus\s+(.+)/;
1598 else {
1599 $species = $sci_name;
1602 # is this organism of rank species or is it lower?
1603 # (we don't catch everything lower than species, but it doesn't matter -
1604 # this is just so we abide by previous behaviour whilst not calling a
1605 # species a subspecies)
1606 if ($species and $species =~ /(.+)\s+((?:subsp\.|var\.).+)/) {
1607 ($species, $sub_species) = ($1, $2);
1610 # Don't make a species object if it's empty or "Unknown" or "None"
1611 # return unless $genus and $genus !~ /^(Unknown|None)$/oi;
1612 # Don't make a species object if it belongs to taxid 32644
1613 # my $unkn = grep { $_ =~ /^\Q$sl\E$/; } @unkn_names;
1614 my $unkn = grep { $_ eq $sl } @unkn_names;
1615 return unless (defined $species or defined $genus) and $unkn == 0;
1617 # Bio::Species array needs array in Species -> Kingdom direction
1618 push @class, $sci_name;
1619 @class = reverse @class;
1621 my $make = Bio::Species->new;
1622 $make->scientific_name($sci_name);
1623 $make->classification(@class) if @class > 0;
1624 $make->common_name( $common ) if $common;
1625 $make->name('abbreviated', $abbr_name) if $abbr_name;
1626 $make->organelle($organelle) if $organelle;
1627 #$make->sub_species( $sub_species ) if $sub_species;
1628 return $make;
1631 =head2 _read_FTHelper_GenBank
1633 Title : _read_FTHelper_GenBank
1634 Usage : _read_FTHelper_GenBank($buffer)
1635 Function: reads the next FT key line
1636 Example :
1637 Returns : Bio::SeqIO::FTHelper object
1638 Args : filehandle and reference to a scalar
1640 =cut
1642 sub _read_FTHelper_GenBank {
1643 my ($self, $buffer) = @_;
1645 my ($key, # The key of the feature
1646 $loc # The location line from the feature
1648 my @qual = (); # An array of lines making up the qualifiers
1650 if ($$buffer =~ /^\s{5}(\S+)\s+(.+?)\s*$/o) {
1651 $key = $1;
1652 $loc = $2;
1653 # Read all the lines up to the next feature
1654 while ( defined(my $line = $self->_readline) ) {
1655 if ($line =~ /^(\s+)(.+?)\s*$/o) {
1656 # Lines inside features are preceded by 21 spaces
1657 # A new feature is preceded by 5 spaces
1658 if (length($1) > 6) {
1659 # Add to qualifiers if we're in the qualifiers, or if it's
1660 # the first qualifier
1661 if (@qual or (index($2,'/') == 0)) {
1662 push @qual, $2;
1664 # We're still in the location line, so append to location
1665 else {
1666 $loc .= $2;
1669 else {
1670 # We've reached the start of the next feature
1671 # Put the first line of the next feature into the buffer
1672 $$buffer = $line;
1673 last;
1676 else {
1677 # We're at the end of the feature table
1678 # Put the first line of the next feature into the buffer
1679 $$buffer = $line;
1680 last;
1684 else {
1685 # No feature key
1686 $self->debug("no feature key!\n");
1687 # change suggested by JDiggans to avoid infinite loop-
1688 # see bugreport 1062.
1689 # reset buffer to prevent infinite loop
1690 $$buffer = $self->_readline;
1691 return;
1694 # Make the new FTHelper object
1695 my $out = Bio::SeqIO::FTHelper->new;
1696 $out->verbose($self->verbose);
1697 $out->key($key);
1698 $out->loc($loc);
1700 # Now parse and add any qualifiers. (@qual is kept
1701 # intact to provide informative error messages.)
1702 QUAL:
1703 for (my $i = 0; $i < @qual; $i++) {
1704 my $data = $qual[$i];
1705 my ( $qualifier, $value ) = ($data =~ m{^/([^=]+)(?:=(.+))?})
1706 or $self->warn( "cannot see new qualifier in feature $key: "
1707 . $qual[$i]);
1708 $qualifier = '' unless( defined $qualifier );
1710 if (defined $value) {
1711 # Do we have a quoted value?
1712 if (substr($value, 0, 1) eq '"') {
1713 # Keep adding to value until we find the trailing quote
1714 # and the quotes are balanced
1715 while ($value !~ /\"$/ or $value =~ tr/"/"/ % 2) {
1716 if ($i >= $#qual) {
1717 $self->warn( "Unbalanced quote in:\n"
1718 . join("\n", @qual)
1719 . "No further qualifiers will "
1720 . "be added for this feature");
1721 last QUAL;
1723 # modifying a for-loop variable inside of the loop
1724 # is not the best programming style ...
1725 $i++;
1726 my $next = $qual[$i];
1728 # add to value with a space unless the value appears
1729 # to be a sequence (translation for example)
1730 # if (($value.$next) =~ /[^A-Za-z\"\-]/o) {
1731 # changed to explicitly look for translation tag - cjf 06/8/29
1732 if ($qualifier !~ /^translation$/i ) {
1733 $value .= " ";
1735 $value .= $next;
1737 # Trim leading and trailing quotes
1738 $value =~ s/^"|"$//g;
1739 # Undouble internal quotes
1740 $value =~ s/""/\"/g;
1742 elsif ( $value =~ /^\(/ ) { # values quoted by ()s
1743 # Keep adding to value until we find the trailing bracket
1744 # and the ()s are balanced
1745 my $left = ($value =~ tr/\(/\(/); # count left parens
1746 my $right = ($value =~ tr/\)/\)/); # count right parens
1748 while( $left != $right ) { # was "$value !~ /\)$/ or $left != $right"
1749 if ( $i >= $#qual) {
1750 $self->warn( "Unbalanced parens in:\n"
1751 . join("\n", @qual)
1752 . "\nNo further qualifiers will "
1753 . "be added for this feature");
1754 last QUAL;
1756 $i++;
1757 my $next = $qual[$i];
1758 $value .= $next;
1759 $left += ($next =~ tr/\(/\(/);
1760 $right += ($next =~ tr/\)/\)/);
1764 else {
1765 $value = '_no_value';
1767 # Store the qualifier
1768 $out->field->{$qualifier} ||= [];
1769 push @{$out->field->{$qualifier}}, $value;
1771 return $out;
1774 =head2 _write_line_GenBank
1776 Title : _write_line_GenBank
1777 Usage :
1778 Function: internal function
1779 Example :
1780 Returns :
1781 Args :
1783 =cut
1785 sub _write_line_GenBank {
1786 my ($self, $pre1, $pre2, $line, $length) = @_;
1788 $length or $self->throw("Miscalled write_line_GenBank without length. Programming error!");
1789 my $subl = $length - length $pre2;
1790 my $linel = length $line;
1791 my $i;
1793 my $subr = substr($line,0,$length - length $pre1);
1795 $self->_print("$pre1$subr\n");
1796 for($i = ($length - length $pre1); $i < $linel; $i += $subl) {
1797 $subr = substr($line, $i, $subl);
1798 $self->_print("$pre2$subr\n");
1802 =head2 _write_line_GenBank_regex
1804 Title : _write_line_GenBank_regex
1805 Usage :
1806 Function: internal function for writing lines of specified
1807 length, with different first and the next line
1808 left hand headers and split at specific points in the
1809 text
1810 Example :
1811 Returns : nothing
1812 Args : file handle,
1813 first header,
1814 second header,
1815 text-line,
1816 regex for line breaks,
1817 total line length
1819 =cut
1821 sub _write_line_GenBank_regex {
1822 my ($self, $pre1, $pre2, $line, $regex, $length) = @_;
1824 #print STDOUT "Going to print with $line!\n";
1826 $length or $self->throw("Miscalled write_line_GenBank without length. Programming error!");
1828 my $subl = $length - (length $pre1) - 2;
1829 my @lines = ();
1831 CHUNK:
1832 while ($line) {
1833 foreach my $pat ($regex, '[,;\.\/-]\s|'.$regex, '[,;\.\/-]|'.$regex) {
1834 if ($line =~ m/^(.{0,$subl})($pat)(.*)/ ) {
1835 my $l = $1 . $2;
1836 $line = substr($line, length $l);
1837 # be strict about not padding spaces according to
1838 # genbank format
1839 $l =~ s/\s+$//;
1840 next CHUNK if ($l eq '');
1841 push @lines, $l;
1842 next CHUNK;
1845 # if we get here none of the patterns matched $subl or less chars
1846 $self->warn( "trouble dissecting \"$line\"\n into chunks "
1847 . "of $subl chars or less - this tag won't print right");
1848 # insert a space char to prevent infinite loops
1849 $line = substr($line, 0, $subl) . " " . substr($line, $subl);
1851 my $s = shift @lines;
1852 $self->_print("$pre1$s\n") if $s;
1853 foreach my $s ( @lines ) {
1854 $self->_print("$pre2$s\n");
1858 =head2 _post_sort
1860 Title : _post_sort
1861 Usage : $obj->_post_sort($newval)
1862 Function:
1863 Returns : value of _post_sort
1864 Args : newvalue (optional)
1866 =cut
1868 sub _post_sort {
1869 my ($obj,$value) = @_;
1870 if ( defined $value) {
1871 $obj->{'_post_sort'} = $value;
1873 return $obj->{'_post_sort'};
1876 =head2 _show_dna
1878 Title : _show_dna
1879 Usage : $obj->_show_dna($newval)
1880 Function:
1881 Returns : value of _show_dna
1882 Args : newvalue (optional)
1884 =cut
1886 sub _show_dna {
1887 my ($obj,$value) = @_;
1888 if ( defined $value) {
1889 $obj->{'_show_dna'} = $value;
1891 return $obj->{'_show_dna'};
1894 =head2 _id_generation_func
1896 Title : _id_generation_func
1897 Usage : $obj->_id_generation_func($newval)
1898 Function:
1899 Returns : value of _id_generation_func
1900 Args : newvalue (optional)
1902 =cut
1904 sub _id_generation_func {
1905 my ($obj,$value) = @_;
1906 if ( defined $value ) {
1907 $obj->{'_id_generation_func'} = $value;
1909 return $obj->{'_id_generation_func'};
1912 =head2 _ac_generation_func
1914 Title : _ac_generation_func
1915 Usage : $obj->_ac_generation_func($newval)
1916 Function:
1917 Returns : value of _ac_generation_func
1918 Args : newvalue (optional)
1920 =cut
1922 sub _ac_generation_func {
1923 my ($obj,$value) = @_;
1924 if ( defined $value ) {
1925 $obj->{'_ac_generation_func'} = $value;
1927 return $obj->{'_ac_generation_func'};
1930 =head2 _sv_generation_func
1932 Title : _sv_generation_func
1933 Usage : $obj->_sv_generation_func($newval)
1934 Function:
1935 Returns : value of _sv_generation_func
1936 Args : newvalue (optional)
1938 =cut
1940 sub _sv_generation_func {
1941 my ($obj,$value) = @_;
1942 if ( defined $value ) {
1943 $obj->{'_sv_generation_func'} = $value;
1945 return $obj->{'_sv_generation_func'};
1948 =head2 _kw_generation_func
1950 Title : _kw_generation_func
1951 Usage : $obj->_kw_generation_func($newval)
1952 Function:
1953 Returns : value of _kw_generation_func
1954 Args : newvalue (optional)
1956 =cut
1958 sub _kw_generation_func {
1959 my ($obj,$value) = @_;
1960 if ( defined $value ) {
1961 $obj->{'_kw_generation_func'} = $value;
1963 return $obj->{'_kw_generation_func'};