sync w/ main trunk
[bioperl-live.git] / Bio / SeqIO / genbank.pm
blob6aa6cacf35d12a06b10a272ccf1129126a78940a
1 # $Id$
3 # BioPerl module for Bio::SeqIO::genbank
5 # Please direct questions and support issues to <bioperl-l@bioperl.org>
7 # Cared for by Bioperl project bioperl-l(at)bioperl.org
9 # Copyright Elia Stupka and contributors see AUTHORS section
11 # You may distribute this module under the same terms as perl itself
13 # POD documentation - main docs before the code
15 =head1 NAME
17 Bio::SeqIO::genbank - GenBank sequence input/output stream
19 =head1 SYNOPSIS
21 It is probably best not to use this object directly, but
22 rather go through the SeqIO handler:
24 $stream = Bio::SeqIO->new(-file => $filename,
25 -format => 'GenBank');
27 while ( my $seq = $stream->next_seq() ) {
28 # do something with $seq
32 =head1 DESCRIPTION
34 This object can transform Bio::Seq objects to and from GenBank flat
35 file databases.
37 There is some flexibility here about how to write GenBank output
38 that is not fully documented.
40 =head2 Optional functions
42 =over 3
44 =item _show_dna()
46 (output only) shows the dna or not
48 =item _post_sort()
50 (output only) provides a sorting func which is applied to the FTHelpers
51 before printing
53 =item _id_generation_func()
55 This is function which is called as
57 print "ID ", $func($seq), "\n";
59 To generate the ID line. If it is not there, it generates a sensible ID
60 line using a number of tools.
62 If you want to output annotations in Genbank format they need to be
63 stored in a Bio::Annotation::Collection object which is accessible
64 through the Bio::SeqI interface method L<annotation()|annotation>.
66 The following are the names of the keys which are pulled from a
67 L<Bio::Annotation::Collection> object:
69 reference - Should contain Bio::Annotation::Reference objects
70 comment - Should contain Bio::Annotation::Comment objects
71 dblink - Should contain a Bio::Annotation::DBLink object
72 segment - Should contain a Bio::Annotation::SimpleValue object
73 origin - Should contain a Bio::Annotation::SimpleValue object
74 wgs - Should contain a Bio::Annotation::SimpleValue object
76 =back
78 =head1 Where does the data go?
80 Data parsed in Bio::SeqIO::genbank is stored in a variety of data
81 fields in the sequence object that is returned. Here is a partial list
82 of fields.
84 Items listed as RichSeq or Seq or PrimarySeq and then NAME() tell you
85 the top level object which defines a function called NAME() which
86 stores this information.
88 Items listed as Annotation 'NAME' tell you the data is stored the
89 associated Bio::AnnotationCollectionI object which is associated with
90 Bio::Seq objects. If it is explictly requested that no annotations
91 should be stored when parsing a record of course they will not be
92 available when you try and get them. If you are having this problem
93 look at the type of SeqBuilder that is being used to contruct your
94 sequence object.
96 Comments Annotation 'comment'
97 References Annotation 'reference'
98 Segment Annotation 'segment'
99 Origin Annotation 'origin'
100 Dbsource Annotation 'dblink'
102 Accessions PrimarySeq accession_number()
103 Secondary accessions RichSeq get_secondary_accessions()
104 GI number PrimarySeq primary_id()
105 LOCUS PrimarySeq display_id()
106 Keywords RichSeq get_keywords()
107 Dates RichSeq get_dates()
108 Molecule RichSeq molecule()
109 Seq Version RichSeq seq_version()
110 PID RichSeq pid()
111 Division RichSeq division()
112 Features Seq get_SeqFeatures()
113 Alphabet PrimarySeq alphabet()
114 Definition PrimarySeq description() or desc()
115 Version PrimarySeq version()
117 Sequence PrimarySeq seq()
119 There is more information in the Feature-Annotation HOWTO about each
120 field and how it is mapped to the Sequence object
121 L<http://bioperl.open-bio.org/wiki/HOWTO:Feature-Annotation>.
123 =head1 FEEDBACK
125 =head2 Mailing Lists
127 User feedback is an integral part of the evolution of this and other
128 Bioperl modules. Send your comments and suggestions preferably to one
129 of the Bioperl mailing lists. Your participation is much appreciated.
131 bioperl-l@bioperl.org - General discussion
132 http://bioperl.org/wiki/Mailing_lists - About the mailing lists
134 =head2 Support
136 Please direct usage questions or support issues to the mailing list:
138 L<bioperl-l@bioperl.org>
140 rather than to the module maintainer directly. Many experienced and
141 reponsive experts will be able look at the problem and quickly
142 address it. Please include a thorough description of the problem
143 with code and data examples if at all possible.
145 =head2 Reporting Bugs
147 Report bugs to the Bioperl bug tracking system to help us keep track
148 the bugs and their resolution. Bug reports can be submitted via the web:
150 http://bugzilla.open-bio.org/
152 =head1 AUTHOR - Bioperl Project
154 bioperl-l at bioperl.org
156 Original author Elia Stupka, elia -at- tigem.it
158 =head1 CONTRIBUTORS
160 Ewan Birney birney at ebi.ac.uk
161 Jason Stajich jason at bioperl.org
162 Chris Mungall cjm at fruitfly.bdgp.berkeley.edu
163 Lincoln Stein lstein at cshl.org
164 Heikki Lehvaslaiho, heikki at ebi.ac.uk
165 Hilmar Lapp, hlapp at gmx.net
166 Donald G. Jackson, donald.jackson at bms.com
167 James Wasmuth, james.wasmuth at ed.ac.uk
168 Brian Osborne, bosborne at alum.mit.edu
169 Chris Fields, cjfields at bioperl dot org
171 =head1 APPENDIX
173 The rest of the documentation details each of the object
174 methods. Internal methods are usually preceded with a _
176 =cut
178 # Let the code begin...
180 package Bio::SeqIO::genbank;
181 use strict;
183 use Bio::SeqIO::FTHelper;
184 use Bio::SeqFeature::Generic;
185 use Bio::Species;
186 use Bio::Seq::SeqFactory;
187 use Bio::Annotation::Collection;
188 use Bio::Annotation::Comment;
189 use Bio::Annotation::Reference;
190 use Bio::Annotation::DBLink;
192 use base qw(Bio::SeqIO);
194 our %FTQUAL_NO_QUOTE = map {$_ => 1} qw(
195 anticodon citation
196 codon codon_start
197 cons_splice direction
198 evidence label
199 mod_base number
200 rpt_type rpt_unit
201 transl_except transl_table
202 usedin
205 our %DBSOURCE = map {$_ => 1} qw(
206 EchoBASE IntAct SWISS-2DPAGE ECO2DBASE ECOGENE TIGRFAMs
207 TIGR GO InterPro Pfam PROSITE SGD GermOnline
208 HSSP PhosSite Ensembl RGD AGD ArrayExpress KEGG
209 H-InvDB HGNC LinkHub PANTHER PRINTS SMART SMR
210 MGI MIM RZPD-ProtExp ProDom MEROPS TRANSFAC Reactome
211 UniGene GlycoSuiteDB PIRSF HSC-2DPAGE PHCI-2DPAGE
212 PMMA-2DPAGE Siena-2DPAGE Rat-heart-2DPAGE Aarhus/Ghent-2DPAGE
213 Biocyc MetaCyc Biocyc:Metacyc GenomeReviews FlyBase
214 TMHOBP COMPLUYEAST-2DPAGE OGP DictyBase HAMAP
215 PhotoList Gramene WormBase WormPep Genew ZFIN
216 PeroxiBase MaizeDB TAIR DrugBank REBASE HPA
217 swissprot GenBank GenPept REFSEQ embl PDB UniProtKB
218 DIP PeptideAtlas PRIDE CYGD HOGENOME Gene3D);
220 our %VALID_ALPHABET = (
221 'bp' => 'dna',
222 'aa' => 'protein',
223 'rc' => '' # rc = release candidate; file has no sequences
226 sub _initialize {
227 my($self,@args) = @_;
229 $self->SUPER::_initialize(@args);
230 # hash for functions for decoding keys.
231 $self->{'_func_ftunit_hash'} = {};
232 $self->_show_dna(1); # sets this to one by default. People can change it
233 if( ! defined $self->sequence_factory ) {
234 $self->sequence_factory(Bio::Seq::SeqFactory->new
235 (-verbose => $self->verbose(),
236 -type => 'Bio::Seq::RichSeq'));
240 =head2 next_seq
242 Title : next_seq
243 Usage : $seq = $stream->next_seq()
244 Function: returns the next sequence in the stream
245 Returns : Bio::Seq object
246 Args :
248 =cut
250 sub next_seq {
251 my ($self,@args) = @_;
252 my $builder = $self->sequence_builder();
253 my $seq;
254 my %params;
256 RECORDSTART:
257 while (1) {
258 my $buffer;
259 my (@acc, @features);
260 my ($display_id, $annotation);
261 my $species;
263 # initialize; we may come here because of starting over
264 @features = ();
265 $annotation = undef;
266 @acc = ();
267 $species = undef;
268 %params = (-verbose => $self->verbose); # reset hash
269 local($/) = "\n";
270 while(defined($buffer = $self->_readline())) {
271 last if index($buffer,'LOCUS ') == 0;
273 return unless defined $buffer; # end of file
274 $buffer =~ /^LOCUS\s+(\S.*)$/o ||
275 $self->throw("GenBank stream with bad LOCUS line. Not GenBank in my book. Got '$buffer'");
277 my @tokens = split(' ', $1);
279 # this is important to have the id for display in e.g. FTHelper,
280 # otherwise you won't know which entry caused an error
281 $display_id = shift(@tokens);
282 $params{'-display_id'} = $display_id;
283 # may still be useful if we don't want the seq
284 my $seqlength = shift(@tokens);
285 if (exists $VALID_ALPHABET{$seqlength}) {
286 # moved one token too far. No locus name?
287 $self->warn("Bad LOCUS name? Changing [$params{'-display_id'}] to 'unknown' and length to $display_id");
288 $params{'-display_id'} = 'unknown';
289 $params{'-length'} = $display_id;
290 # add token back...
291 unshift @tokens, $seqlength;
292 } else {
293 $params{'-length'} = $seqlength;
295 # the alphabet of the entry
296 # shouldn't assign alphabet unless one is specifically designated (such as for rc files)
297 my $alphabet = lc(shift @tokens);
298 $params{'-alphabet'} = (exists $VALID_ALPHABET{$alphabet}) ? $VALID_ALPHABET{$alphabet} :
299 $self->warn("Unknown alphabet: $alphabet");
300 # for aa there is usually no 'molecule' (mRNA etc)
301 if (($params{'-alphabet'} eq 'dna') || (@tokens > 2)) {
302 $params{'-molecule'} = shift(@tokens);
303 my $circ = shift(@tokens);
304 if ($circ eq 'circular') {
305 $params{'-is_circular'} = 1;
306 $params{'-division'} = shift(@tokens);
307 } else {
308 # 'linear' or 'circular' may actually be omitted altogether
309 $params{'-division'} =
310 (CORE::length($circ) == 3 ) ? $circ : shift(@tokens);
312 } else {
313 $params{'-molecule'} = 'PRT' if($params{'-alphabet'} eq 'aa');
314 $params{'-division'} = shift(@tokens);
316 my $date = join(' ', @tokens); # we lump together the rest
318 # this is per request bug #1513
319 # we can handle
320 # 9-10-2003
321 # 9-10-03
322 # 09-10-2003
323 # 09-10-03
324 if($date =~ s/\s*((\d{1,2})-(\w{3})-(\d{2,4})).*/$1/) {
325 if( length($date) < 11 ) {
326 # improperly formatted date
327 # But we'll be nice and fix it for them
328 my ($d,$m,$y) = ($2,$3,$4);
329 if( length($d) == 1 ) {
330 $d = "0$d";
332 # guess the century here
333 if( length($y) == 2 ) {
334 if( $y > 60 ) { # arbitrarily guess that '60' means 1960
335 $y = "19$y";
336 } else {
337 $y = "20$y";
339 $self->warn("Date was malformed, guessing the century for $date to be $y\n");
341 $params{'-dates'} = [join('-',$d,$m,$y)];
342 } else {
343 $params{'-dates'} = [$date];
346 # set them all at once
347 $builder->add_slot_value(%params);
348 %params = ();
350 # parse the rest if desired, otherwise start over
351 if(! $builder->want_object()) {
352 $builder->make_object();
353 next RECORDSTART;
356 # set up annotation depending on what the builder wants
357 if($builder->want_slot('annotation')) {
358 $annotation = Bio::Annotation::Collection->new();
360 $buffer = $self->_readline();
361 until( !defined ($buffer) ) {
362 $_ = $buffer;
363 # Description line(s)
364 if (/^DEFINITION\s+(\S.*\S)/) {
365 my @desc = ($1);
366 while ( defined($_ = $self->_readline) ) {
367 if( /^\s+(.*)/ ) { push (@desc, $1); next };
368 last;
370 $builder->add_slot_value(-desc => join(' ', @desc));
371 # we'll continue right here because DEFINITION always comes
372 # at the top of the entry
373 $buffer= $_;
375 # accession number (there can be multiple accessions)
376 if( /^ACCESSION\s+(\S.*\S)/ ) {
377 push(@acc, split(/\s+/,$1));
378 while( defined($_ = $self->_readline) ) {
379 /^\s+(.*)/ && do { push (@acc, split(/\s+/,$1)); next };
380 last;
382 $buffer = $_;
383 next;
385 # PID
386 elsif( /^PID\s+(\S+)/ ) {
387 $params{'-pid'} = $1;
389 # Version number
390 elsif( /^VERSION\s+(\S.+)$/ ) {
391 my ($acc,$gi) = split(' ',$1);
392 if($acc =~ /^\w+\.(\d+)/) {
393 $params{'-version'} = $1;
394 $params{'-seq_version'} = $1;
396 if($gi && (index($gi,"GI:") == 0)) {
397 $params{'-primary_id'} = substr($gi,3);
400 # Keywords
401 elsif( /^KEYWORDS\s+(\S.*)/ ) {
402 my @kw = split(/\s*\;\s*/,$1);
403 while( defined($_ = $self->_readline) ) {
404 chomp;
405 /^\s+(.*)/ && do { push (@kw, split(/\s*\;\s*/,$1)); next };
406 last;
409 @kw && $kw[-1] =~ s/\.$//;
410 $params{'-keywords'} = \@kw;
411 $buffer = $_;
412 next;
414 # Organism name and phylogenetic information
415 elsif (/^SOURCE\s+\S/) {
416 if($builder->want_slot('species')) {
417 $species = $self->_read_GenBank_Species(\$buffer);
418 $builder->add_slot_value(-species => $species);
419 } else {
420 while(defined($buffer = $self->_readline())) {
421 last if substr($buffer,0,1) ne ' ';
424 next;
426 # References
427 elsif (/^REFERENCE\s+\S/) {
428 if($annotation) {
429 my @refs = $self->_read_GenBank_References(\$buffer);
430 foreach my $ref ( @refs ) {
431 $annotation->add_Annotation('reference',$ref);
433 } else {
434 while(defined($buffer = $self->_readline())) {
435 last if substr($buffer,0,1) ne ' ';
438 next;
440 # Project
441 elsif (/^PROJECT\s+(\S.*)/) {
442 if ($annotation) {
443 my $project = new Bio::Annotation::SimpleValue->new(-value => $1);
444 $annotation->add_Annotation('project',$project);
447 # Comments
448 elsif (/^COMMENT\s+(\S.*)/) {
449 if($annotation) {
450 my $comment = $1;
451 while (defined($_ = $self->_readline)) {
452 last if (/^\S/);
453 $comment .= $_;
455 $comment =~ s/\n/ /g;
456 $comment =~ s/ +/ /g;
457 $annotation->add_Annotation('comment',
458 Bio::Annotation::Comment->new(-text => $comment,
459 -tagname => 'comment'));
460 $buffer = $_;
461 } else {
462 while(defined($buffer = $self->_readline())) {
463 last if substr($buffer,0,1) ne ' ';
466 next;
468 # Corresponding Genbank nucleotide id, Genpept only
469 elsif( /^DBSOURCE\s+(\S.+)/ ) {
470 if ($annotation) {
471 my $dbsource = $1;
472 while (defined($_ = $self->_readline)) {
473 last if (/^\S/);
474 $dbsource .= $_;
476 # deal with UniProKB dbsources
477 if( $dbsource =~ s/(UniProtKB|swissprot):\s+locus\s+(\S+)\,.+\n// ) {
478 $annotation->add_Annotation
479 ('dblink',
480 Bio::Annotation::DBLink->new
481 (-primary_id => $2,
482 -database => $1,
483 -tagname => 'dblink'));
484 if( $dbsource =~ s/\s+created:\s+([^\.]+)\.\n// ) {
485 $annotation->add_Annotation
486 ('swissprot_dates',
487 Bio::Annotation::SimpleValue->new
488 (-tagname => 'date_created',
489 -value => $1));
491 while( $dbsource =~ s/\s+(sequence|annotation)\s+updated:\s+([^\.]+)\.\n//g ) {
492 $annotation->add_Annotation
493 ('swissprot_dates',
494 Bio::Annotation::SimpleValue->new
495 (-tagname => 'date_updated',
496 -value => $2));
498 $dbsource =~ s/\n/ /g;
499 if( $dbsource =~ s/\s+xrefs:\s+((?:\S+,\s+)+\S+)\s+xrefs/xrefs/ ) {
500 # will use $i to determine even or odd
501 # for swissprot the accessions are paired
502 my $i = 0;
503 for my $dbsrc ( split(/,\s+/,$1) ) {
504 if( $dbsrc =~ /(\S+)\.(\d+)/ ||
505 $dbsrc =~ /(\S+)/ ) {
506 my ($id,$version) = ($1,$2);
507 $version ='' unless defined $version;
508 my $db;
509 if( $id =~ /^\d\S{3}/) {
510 $db = 'PDB';
511 } else {
512 $db = ($i++ % 2 ) ? 'GenPept' : 'GenBank';
514 $annotation->add_Annotation
515 ('dblink',
516 Bio::Annotation::DBLink->new
517 (-primary_id => $id,
518 -version => $version,
519 -database => $db,
520 -tagname => 'dblink'));
523 } elsif( $dbsource =~ s/\s+xrefs:\s+(.+)\s+xrefs/xrefs/i ) {
524 # download screwed up and ncbi didn't put acc in for gi numbers
525 my $i = 0;
526 for my $id ( split(/\,\s+/,$1) ) {
527 my ($acc,$db);
528 if( $id =~ /gi:\s+(\d+)/ ) {
529 $acc= $1;
530 $db = ($i++ % 2 ) ? 'GenPept' : 'GenBank';
531 } elsif( $id =~ /pdb\s+accession\s+(\S+)/ ) {
532 $acc= $1;
533 $db = 'PDB';
534 } else {
535 $acc= $id;
536 $db = '';
538 $annotation->add_Annotation
539 ('dblink',
540 Bio::Annotation::DBLink->new
541 (-primary_id => $acc,
542 -database => $db,
543 -tagname => 'dblink'));
545 } else {
546 $self->debug("Cannot match $dbsource\n");
548 if( $dbsource =~ s/xrefs\s+\(non\-sequence\s+databases\):\s+
549 ((?:\S+,\s+)+\S+)//x ) {
550 for my $id ( split(/\,\s+/,$1) ) {
551 my $db;
552 # this is because GenBank dropped the spaces!!!
553 # I'm sure we're not going to get this right
554 ##if( $id =~ s/^://i ) {
555 ## $db = $1;
557 $db = substr($id,0,index($id,':'));
558 if (! exists $DBSOURCE{ $db }) {
559 $db = ''; # do we want 'GenBank' here?
561 $id = substr($id,index($id,':')+1);
562 $annotation->add_Annotation
563 ('dblink',
564 Bio::Annotation::DBLink->new
565 (-primary_id => $id,
566 -database => $db,
567 -tagname => 'dblink'));
571 } else {
572 if( $dbsource =~ /^(\S*?):?\s*accession\s+(\S+)\.(\d+)/ ) {
573 my ($db,$id,$version) = ($1,$2,$3);
574 $annotation->add_Annotation
575 ('dblink',
576 Bio::Annotation::DBLink->new
577 (-primary_id => $id,
578 -version => $version,
579 -database => $db || 'GenBank',
580 -tagname => 'dblink'));
581 } elsif ( $dbsource =~ /(\S+)\.(\d+)/ ) {
582 my ($id,$version) = ($1,$2);
583 $annotation->add_Annotation
584 ('dblink',
585 Bio::Annotation::DBLink->new
586 (-primary_id => $id,
587 -version => $version,
588 -database => 'GenBank',
589 -tagname => 'dblink'));
590 } else {
591 $self->warn("Unrecognized DBSOURCE data: $dbsource\n");
595 $buffer = $_;
596 } else {
597 while(defined($buffer = $self->_readline())) {
598 last if substr($buffer,0,1) ne ' ';
601 next;
603 # Exit at start of Feature table, or start of sequence
604 last if( /^(FEATURES|ORIGIN)/ );
605 # Get next line and loop again
606 $buffer = $self->_readline;
608 return unless defined $buffer;
610 # add them all at once for efficiency
611 $builder->add_slot_value(-accession_number => shift(@acc),
612 -secondary_accessions => \@acc,
613 %params);
614 $builder->add_slot_value(-annotation => $annotation) if $annotation;
615 %params = (); # reset before possible re-use to avoid setting twice
617 # start over if we don't want to continue with this entry
618 if(! $builder->want_object()) {
619 $builder->make_object();
620 next RECORDSTART;
622 # some "minimal" formats may not necessarily have a feature table
623 if($builder->want_slot('features') && defined($_) && /^FEATURES/o) {
624 # need to read the first line of the feature table
625 $buffer = $self->_readline;
626 # DO NOT read lines in the while condition -- this is done as a side
627 # effect in _read_FTHelper_GenBank!
628 while( defined($buffer) ) {
629 # check immediately -- not at the end of the loop
630 # note: GenPept entries obviously do not have a BASE line
631 last if( $buffer =~ /^BASE|ORIGIN|CONTIG|WGS/o);
633 # slurp in one feature at a time -- at return, the start of
634 # the next feature will have been read already, so we need
635 # to pass a reference, and the called method must set this
636 # to the last line read before returning
638 my $ftunit = $self->_read_FTHelper_GenBank(\$buffer);
640 # fix suggested by James Diggans
642 if( !defined $ftunit ) {
643 # GRRRR. We have fallen over. Try to recover
644 $self->warn("Unexpected error in feature table for ".$params{'-display_id'}." Skipping feature, attempting to recover");
645 unless( ($buffer =~ /^\s{5,5}\S+/o) or
646 ($buffer =~ /^\S+/o)) {
647 $buffer = $self->_readline();
649 next; # back to reading FTHelpers
652 # process ftunit
653 my $feat =
654 $ftunit->_generic_seqfeature($self->location_factory(),
655 $display_id);
656 # add taxon_id from source if available
657 if($species && ($feat->primary_tag eq 'source') &&
658 $feat->has_tag('db_xref') && (! $species->ncbi_taxid() ||
659 ($species->ncbi_taxid && $species->ncbi_taxid =~ /^list/))) {
660 foreach my $tagval ($feat->get_tag_values('db_xref')) {
661 if(index($tagval,"taxon:") == 0) {
662 $species->ncbi_taxid(substr($tagval,6));
663 last;
667 # add feature to list of features
668 push(@features, $feat);
670 $builder->add_slot_value(-features => \@features);
671 $_ = $buffer;
673 if( defined ($_) ) {
674 if( /^CONTIG/o ) {
675 my @contig;
676 while($_ !~ m{^//}) { # end of file
677 $_ =~ /^(?:CONTIG)?\s+(.*)/;
678 $annotation->add_Annotation(
679 Bio::Annotation::SimpleValue->new(-value => $1,
680 -tagname => 'contig'));
681 $_ = $self->_readline;
683 $self->_pushback($_);
684 } elsif( /^WGS|WGS_SCAFLD\s+/o ) { # catch WGS/WGS_SCAFLD lines
685 while($_ =~ s/(^WGS|WGS_SCAFLD)\s+//){ # gulp lines
686 chomp;
687 $annotation->add_Annotation(
688 Bio::Annotation::SimpleValue->new(-value => $_,
689 -tagname => lc($1)));
690 $_ = $self->_readline;
692 } elsif(! m{^(ORIGIN|//)} ) { # advance to the sequence, if any
693 while (defined( $_ = $self->_readline) ) {
694 last if m{^(ORIGIN|//)};
698 if(! $builder->want_object()) {
699 $builder->make_object(); # implicit end-of-object
700 next RECORDSTART;
702 if($builder->want_slot('seq')) {
703 # the fact that we want a sequence does not necessarily mean that
704 # there also is a sequence ...
705 if(defined($_) && s/^ORIGIN\s+//) {
706 chomp;
707 if( $annotation && length($_) > 0 ) {
708 $annotation->add_Annotation('origin',
709 Bio::Annotation::SimpleValue->new(-tagname => 'origin',
710 -value => $_));
712 my $seqc = '';
713 while( defined($_ = $self->_readline) ) {
714 m{^//} && last;
715 $_ = uc($_);
716 s/[^A-Za-z]//g;
717 $seqc .= $_;
719 #$self->debug("sequence length is ". length($seqc) ."\n");
720 $builder->add_slot_value(-seq => $seqc);
722 } elsif ( defined($_) && (substr($_,0,2) ne '//')) {
723 # advance to the end of the record
724 while( defined($_ = $self->_readline) ) {
725 last if substr($_,0,2) eq '//';
728 # Unlikely, but maybe the sequence is so weird that we don't want it
729 # anymore. We don't want to return undef if the stream's not exhausted
730 # yet.
731 $seq = $builder->make_object();
732 next RECORDSTART unless $seq;
733 last RECORDSTART;
734 } # end while RECORDSTART
736 return $seq;
739 =head2 write_seq
741 Title : write_seq
742 Usage : $stream->write_seq($seq)
743 Function: writes the $seq object (must be seq) to the stream
744 Returns : 1 for success and 0 for error
745 Args : array of 1 to n Bio::SeqI objects
748 =cut
750 sub write_seq {
751 my ($self,@seqs) = @_;
753 foreach my $seq ( @seqs ) {
754 $self->throw("Attempting to write with no seq!") unless defined $seq;
756 if( ! ref $seq || ! $seq->isa('Bio::SeqI') ) {
757 $self->warn(" $seq is not a SeqI compliant module. Attempting to dump, but may fail!");
760 my $str = $seq->seq;
762 my ($div, $mol);
763 my $len = $seq->length();
765 if ( $seq->can('division') ) {
766 $div=$seq->division;
768 if( !defined $div || ! $div ) { $div = 'UNK'; }
769 my $alpha = $seq->alphabet;
770 if( !$seq->can('molecule') || ! defined ($mol = $seq->molecule()) ) {
771 $mol = $alpha || 'DNA';
774 my $circular = 'linear ';
775 $circular = 'circular' if $seq->is_circular;
777 local($^W) = 0; # supressing warnings about uninitialized fields.
779 my $temp_line;
780 if( $self->_id_generation_func ) {
781 $temp_line = &{$self->_id_generation_func}($seq);
782 } else {
783 my $date = '';
784 if( $seq->can('get_dates') ) {
785 ($date) = $seq->get_dates();
788 $self->warn("No whitespace allowed in GenBank display id [". $seq->display_id. "]")
789 if $seq->display_id =~ /\s/;
791 $temp_line = sprintf ("%-12s%-15s%13s %s%4s%-8s%-8s %3s %-s",
792 'LOCUS', $seq->id(),$len,
793 (lc($alpha) eq 'protein') ? ('aa','', '') :
794 ('bp', '',$mol),$circular,
795 $div,$date);
798 $self->_print("$temp_line\n");
799 $self->_write_line_GenBank_regex("DEFINITION ", " ",
800 $seq->desc(),"\\s\+\|\$",80);
802 # if there, write the accession line
804 if( $self->_ac_generation_func ) {
805 $temp_line = &{$self->_ac_generation_func}($seq);
806 $self->_print("ACCESSION $temp_line\n");
807 } else {
808 my @acc = ();
809 push(@acc, $seq->accession_number());
810 if( $seq->isa('Bio::Seq::RichSeqI') ) {
811 push(@acc, $seq->get_secondary_accessions());
813 $self->_print("ACCESSION ", join(" ", @acc), "\n");
814 # otherwise - cannot print <sigh>
817 # if PID defined, print it
818 if($seq->isa('Bio::Seq::RichSeqI') && $seq->pid()) {
819 $self->_print("PID ", $seq->pid(), "\n");
822 # if there, write the version line
824 if( defined $self->_sv_generation_func() ) {
825 $temp_line = &{$self->_sv_generation_func}($seq);
826 if( $temp_line ) {
827 $self->_print("VERSION $temp_line\n");
829 } else {
830 if($seq->isa('Bio::Seq::RichSeqI') && defined($seq->seq_version)) {
831 my $id = $seq->primary_id(); # this may be a GI number
832 $self->_print("VERSION ",
833 $seq->accession_number(), ".", $seq->seq_version,
834 ($id && ($id =~ /^\d+$/) ? " GI:".$id : ""),
835 "\n");
839 # if there, write the PROJECT line
840 for my $proj ( $seq->annotation->get_Annotations('project') ) {
841 $self->_print("PROJECT ".$proj->value."\n");
844 # if there, write the DBSOURCE line
845 foreach my $ref ( $seq->annotation->get_Annotations('dblink') ) {
846 # if ($ref->comment eq 'DBSOURCE') {
847 my $text = $ref->display_text(
848 sub{($ref->database eq 'GenBank' ? '' : $_[0]->database.' ').
849 'accession '.$_[0]->primary_id});
850 $self->_print("DBSOURCE $text\n");
854 # if there, write the keywords line
855 if( defined $self->_kw_generation_func() ) {
856 $temp_line = &{$self->_kw_generation_func}($seq);
857 $self->_print("KEYWORDS $temp_line\n");
858 } else {
859 if( $seq->can('keywords') ) {
860 my $kw = $seq->keywords;
861 $kw .= '.' if( $kw !~ /\.$/ );
862 $self->_print("KEYWORDS $kw\n");
866 # SEGMENT if it exists
867 foreach my $ref ( $seq->annotation->get_Annotations('segment') ) {
868 $self->_print(sprintf ("%-11s %s\n",'SEGMENT',
869 $ref->value));
872 # Organism lines
873 if (my $spec = $seq->species) {
874 my ($on, $sn, $cn) = ($spec->can('organelle') ? $spec->organelle : '',
875 $spec->scientific_name,
876 $spec->common_name);
877 my @classification;
878 if ($spec->isa('Bio::Species')) {
879 @classification = $spec->classification;
880 shift(@classification);
881 } else {
882 # Bio::Taxon should have a DB handle of some type attached, so
883 # derive the classification from that
884 my $node = $spec;
885 while ($node) {
886 $node = $node->ancestor || last;
887 unshift(@classification, $node->node_name);
888 #$node eq $root && last;
890 @classification = reverse @classification;
892 my $abname = $spec->name('abbreviated') ? # from genbank file
893 $spec->name('abbreviated')->[0] : $sn;
894 my $sl = $on ? "$on " : '';
895 $sl .= $cn ? $abname." ($cn)" : "$abname";
897 $self->_write_line_GenBank_regex("SOURCE ", ' 'x12, $sl, "\\s\+\|\$",80);
898 $self->_print(" ORGANISM ", $spec->scientific_name, "\n");
899 my $OC = join('; ', (reverse(@classification))) .'.';
900 $self->_write_line_GenBank_regex(' 'x12,' 'x12,
901 $OC,"\\s\+\|\$",80);
904 # Reference lines
905 my $count = 1;
906 foreach my $ref ( $seq->annotation->get_Annotations('reference') ) {
907 $temp_line = "REFERENCE $count";
908 if ($ref->start) {
909 $temp_line .= sprintf (" (%s %d to %d)",
910 ($seq->alphabet() eq "protein" ?
911 "residues" : "bases"),
912 $ref->start,$ref->end);
913 } elsif ($ref->gb_reference) {
914 $temp_line .= sprintf (" (%s)", $ref->gb_reference);
916 $self->_print("$temp_line\n");
917 $self->_write_line_GenBank_regex(" AUTHORS ",' 'x12,
918 $ref->authors,"\\s\+\|\$",80);
919 $self->_write_line_GenBank_regex(" CONSRTM ",' 'x12,
920 $ref->consortium,"\\s\+\|\$",80) if $ref->consortium;
921 $self->_write_line_GenBank_regex(" TITLE "," "x12,
922 $ref->title,"\\s\+\|\$",80);
923 $self->_write_line_GenBank_regex(" JOURNAL "," "x12,
924 $ref->location,"\\s\+\|\$",80);
925 if( $ref->medline) {
926 $self->_write_line_GenBank_regex(" MEDLINE "," "x12,
927 $ref->medline, "\\s\+\|\$",80);
928 # I am assuming that pubmed entries only exist when there
929 # are also MEDLINE entries due to the indentation
931 # This could be a wrong assumption
932 if( $ref->pubmed ) {
933 $self->_write_line_GenBank_regex(" PUBMED "," "x12,
934 $ref->pubmed, "\\s\+\|\$",
935 80);
937 # put remark at the end
938 if ($ref->comment) {
939 $self->_write_line_GenBank_regex(" REMARK "," "x12,
940 $ref->comment,"\\s\+\|\$",80);
942 $count++;
945 # Comment lines
946 foreach my $comment ( $seq->annotation->get_Annotations('comment') ) {
947 $self->_write_line_GenBank_regex("COMMENT "," "x12,
948 $comment->text,"\\s\+\|\$",80);
950 $self->_print("FEATURES Location/Qualifiers\n");
952 if( defined $self->_post_sort ) {
953 # we need to read things into an array. Process. Sort them. Print 'em
955 my $post_sort_func = $self->_post_sort();
956 my @fth;
958 foreach my $sf ( $seq->top_SeqFeatures ) {
959 push(@fth,Bio::SeqIO::FTHelper::from_SeqFeature($sf,$seq));
962 @fth = sort { &$post_sort_func($a,$b) } @fth;
964 foreach my $fth ( @fth ) {
965 $self->_print_GenBank_FTHelper($fth);
967 } else {
968 # not post sorted. And so we can print as we get them.
969 # lower memory load...
971 foreach my $sf ( $seq->top_SeqFeatures ) {
972 my @fth = Bio::SeqIO::FTHelper::from_SeqFeature($sf,$seq);
973 foreach my $fth ( @fth ) {
974 if( ! $fth->isa('Bio::SeqIO::FTHelper') ) {
975 $sf->throw("Cannot process FTHelper... $fth");
977 $self->_print_GenBank_FTHelper($fth);
982 # deal with WGS; WGS_SCAFLD present only if WGS is also present
983 if($seq->annotation->get_Annotations('wgs')) {
984 foreach my $wgs
985 (map {$seq->annotation->get_Annotations($_)} qw(wgs wgs_scaffold)) {
986 $self->_print(sprintf ("%-11s %s\n",uc($wgs->tagname),
987 $wgs->value));
989 $self->_show_dna(0);
991 if($seq->annotation->get_Annotations('contig')) {
992 my $ct = 0;
993 my $cline;
994 foreach my $contig ($seq->annotation->get_Annotations('contig')) {
995 unless ($ct) {
996 $cline = uc($contig->tagname)." ".$contig->value."\n";
997 } else {
998 $cline = " ".$contig->value."\n";
1000 $self->_print($cline);
1001 $ct++;
1003 $self->_show_dna(0);
1005 if( $seq->length == 0 ) { $self->_show_dna(0) }
1007 if( $self->_show_dna() == 0 ) {
1008 $self->_print("\n//\n");
1009 return;
1012 # finished printing features.
1014 $str =~ tr/A-Z/a-z/;
1016 my ($o) = $seq->annotation->get_Annotations('origin');
1017 $self->_print(sprintf("%-12s%s\n",
1018 'ORIGIN', $o ? $o->value : ''));
1019 # print out the sequence
1020 my $nuc = 60; # Number of nucleotides per line
1021 my $whole_pat = 'a10' x 6; # Pattern for unpacking a whole line
1022 my $out_pat = 'A11' x 6; # Pattern for packing a line
1023 my $length = length($str);
1025 # Calculate the number of nucleotides which fit on whole lines
1026 my $whole = int($length / $nuc) * $nuc;
1028 # Print the whole lines
1029 my $i;
1030 for ($i = 0; $i < $whole; $i += $nuc) {
1031 my $blocks = pack $out_pat,
1032 unpack $whole_pat,
1033 substr($str, $i, $nuc);
1034 chop $blocks;
1035 $self->_print(sprintf("%9d $blocks\n", $i + $nuc - 59));
1038 # Print the last line
1039 if (my $last = substr($str, $i)) {
1040 my $last_len = length($last);
1041 my $last_pat = 'a10' x int($last_len / 10) .
1042 'a'. $last_len % 10;
1043 my $blocks = pack $out_pat,
1044 unpack($last_pat, $last);
1045 $blocks =~ s/ +$//;
1046 $self->_print(sprintf("%9d $blocks\n",
1047 $length - $last_len + 1));
1050 $self->_print("//\n");
1052 $self->flush if $self->_flush_on_write && defined $self->_fh;
1053 return 1;
1057 =head2 _print_GenBank_FTHelper
1059 Title : _print_GenBank_FTHelper
1060 Usage :
1061 Function:
1062 Example :
1063 Returns :
1064 Args :
1067 =cut
1069 sub _print_GenBank_FTHelper {
1070 my ($self,$fth) = @_;
1072 if( ! ref $fth || ! $fth->isa('Bio::SeqIO::FTHelper') ) {
1073 $fth->warn("$fth is not a FTHelper class. Attempting to print, but there could be tears!");
1075 my $spacer = (length $fth->key >= 15) ? ' ' : '';
1076 $self->_write_line_GenBank_regex(sprintf(" %-16s%s",$fth->key,$spacer),
1077 " "x21,
1078 $fth->loc,"\,\|\$",80);
1079 foreach my $tag ( keys %{$fth->field} ) {
1080 foreach my $value ( @{$fth->field->{$tag}} ) {
1081 $value =~ s/\"/\"\"/g;
1082 if ($value eq "_no_value") {
1083 $self->_write_line_GenBank_regex(" "x21,
1084 " "x21,
1085 "/$tag","\.\|\$",80);
1087 # there are almost 3x more quoted qualifier values and they
1088 # are more common too so we take quoted ones first
1089 elsif (!$FTQUAL_NO_QUOTE{$tag}) {
1090 my ($pat) = ($value =~ /\s/ ? '\s|$' : '.|$');
1091 $self->_write_line_GenBank_regex(" "x21,
1092 " "x21,
1093 "/$tag=\"$value\"",$pat,80);
1095 } else {
1096 $self->_write_line_GenBank_regex(" "x21,
1097 " "x21,
1098 "/$tag=$value","\.\|\$",80);
1104 =head2 _read_GenBank_References
1106 Title : _read_GenBank_References
1107 Usage :
1108 Function: Reads references from GenBank format. Internal function really
1109 Returns :
1110 Args :
1112 =cut
1114 sub _read_GenBank_References {
1115 my ($self,$buffer) = @_;
1116 my (@refs);
1117 my $ref;
1119 # assumme things are starting with RN
1121 if( $$buffer !~ /^REFERENCE/ ) {
1122 warn("Not parsing line '$$buffer' which maybe important");
1125 $_ = $$buffer;
1127 my (@title,@loc,@authors,@consort,@com,@medline,@pubmed);
1129 REFLOOP: while( defined($_) || defined($_ = $self->_readline) ) {
1130 if (/^\s{2}AUTHORS\s+(.*)/o) {
1131 push (@authors, $1);
1132 while ( defined($_ = $self->_readline) ) {
1133 /^\s{9,}(.*)/o && do { push (@authors, $1);next;};
1134 last;
1136 $ref->authors(join(' ', @authors));
1138 if (/^\s{2}CONSRTM\s+(.*)/o) {
1139 push (@consort, $1);
1140 while ( defined($_ = $self->_readline) ) {
1141 /^\s{9,}(.*)/o && do { push (@consort, $1);next;};
1142 last;
1144 $ref->consortium(join(' ', @consort));
1146 if (/^\s{2}TITLE\s+(.*)/o) {
1147 push (@title, $1);
1148 while ( defined($_ = $self->_readline) ) {
1149 /^\s{9,}(.*)/o && do { push (@title, $1);
1150 next;
1152 last;
1154 $ref->title(join(' ', @title));
1156 if (/^\s{2}JOURNAL\s+(.*)/o) {
1157 push(@loc, $1);
1158 while ( defined($_ = $self->_readline) ) {
1159 # we only match when there are at least 4 spaces
1160 # there is probably a better way to match this
1161 # as it assumes that the describing tag is short enough
1162 /^\s{9,}(.*)/o && do { push(@loc, $1);
1163 next;
1165 last;
1167 $ref->location(join(' ', @loc));
1168 redo REFLOOP;
1170 if (/^\s{2}REMARK\s+(.*)/o) {
1171 push (@com, $1);
1172 while ( defined($_ = $self->_readline) ) {
1173 /^\s{9,}(.*)/o && do { push(@com, $1);
1174 next;
1176 last;
1178 $ref->comment(join(' ', @com));
1179 redo REFLOOP;
1181 if( /^\s{2}MEDLINE\s+(.*)/ ) {
1182 push(@medline,$1);
1183 while ( defined($_ = $self->_readline) ) {
1184 /^\s{9,}(.*)/ && do { push(@medline, $1);
1185 next;
1187 last;
1189 $ref->medline(join(' ', @medline));
1190 redo REFLOOP;
1192 if( /^\s{3}PUBMED\s+(.*)/ ) {
1193 push(@pubmed,$1);
1194 while ( defined($_ = $self->_readline) ) {
1195 /^\s{9,}(.*)/ && do { push(@pubmed, $1);
1196 next;
1198 last;
1200 $ref->pubmed(join(' ', @pubmed));
1201 redo REFLOOP;
1204 /^REFERENCE/o && do {
1205 # store current reference
1206 $self->_add_ref_to_array(\@refs,$ref) if defined $ref;
1207 # reset
1208 @authors = ();
1209 @title = ();
1210 @loc = ();
1211 @com = ();
1212 @pubmed = ();
1213 @medline = ();
1214 # create the new reference object
1215 $ref = Bio::Annotation::Reference->new(-tagname => 'reference');
1216 # check whether start and end base is given
1217 if (/^REFERENCE\s+\d+\s+\([a-z]+ (\d+) to (\d+)\)/){
1218 $ref->start($1);
1219 $ref->end($2);
1220 } elsif (/^REFERENCE\s+\d+\s+\((.*)\)/) {
1221 $ref->gb_reference($1);
1225 /^(FEATURES)|(COMMENT)/o && last;
1227 $_ = undef; # Empty $_ to trigger read of next line
1230 # store last reference
1231 $self->_add_ref_to_array(\@refs,$ref) if defined $ref;
1233 $$buffer = $_;
1235 #print "\nnumber of references found: ", $#refs+1,"\n";
1237 return @refs;
1241 # This is undocumented as it shouldn't be called by anywhere else as
1242 # read_GenBank_References. For those who still want to know:
1244 # Purpose: adds a Reference object to an array of Reference objects, takes
1245 # care of possible cleanups to be done (currently, only author and title
1246 # will be chopped of trailing semicolons).
1247 # Parameters:
1248 # a reference to an array of Reference objects
1249 # the Reference object to be added
1250 # Returns: nothing
1252 sub _add_ref_to_array {
1253 my ($self, $refs, $ref) = @_;
1255 # first, polish author and title by removing possible trailing semicolons
1256 my $au = $ref->authors();
1257 my $title = $ref->title();
1258 $au =~ s/;\s*$//g if $au;
1259 $title =~ s/;\s*$//g if $title;
1260 $ref->authors($au);
1261 $ref->title($title);
1262 # the rest should be clean already, so go ahead and add it
1263 push(@{$refs}, $ref);
1266 =head2 _read_GenBank_Species
1268 Title : _read_GenBank_Species
1269 Usage :
1270 Function: Reads the GenBank Organism species and classification
1271 lines. Able to deal with unconvential Organism naming
1272 formats, and varietas in plants
1273 Example : ORGANISM unknown marine gamma proteobacterium NOR5
1274 $genus = undef
1275 $species = unknown marine gamma proteobacterium NOR5
1277 ORGANISM Drosophila sp. 'white tip scutellum'
1278 $genus = Drosophila
1279 $species = sp. 'white tip scutellum'
1280 (yes, this really is a species and that is its name)
1281 $subspecies = undef
1283 ORGANISM Ajellomyces capsulatus var. farciminosus
1284 $genus = Ajellomyces
1285 $species = capsulatus
1286 $subspecies = var. farciminosus
1288 ORGANISM Hepatitis delta virus
1289 $genus = undef (though this virus has a genus in its lineage, we
1290 cannot know that without a database lookup)
1291 $species = Hepatitis delta virus
1293 Returns : A Bio::Species object
1294 Args : A reference to the current line buffer
1296 =cut
1298 sub _read_GenBank_Species {
1299 my ($self, $buffer) = @_;
1301 my @unkn_names = ('other', 'unknown organism', 'not specified', 'not shown',
1302 'Unspecified', 'Unknown', 'None', 'unclassified',
1303 'unidentified organism', 'not supplied');
1304 # dictionary of synonyms for taxid 32644
1305 my @unkn_genus = ('unknown','unclassified','uncultured','unidentified');
1306 # all above can be part of valid species name
1308 my $line = $$buffer;
1310 my( $sub_species, $species, $genus, $sci_name, $common, $class_lines,
1311 $source_flag, $abbr_name, $organelle, $sl );
1312 my %source = map { $_ => 1 } qw(SOURCE ORGANISM CLASSIFICATION);
1313 # upon first entering the loop, we must not read a new line -- the SOURCE
1314 # line is already in the buffer (HL 05/10/2000)
1315 my ($ann, $tag, $data);
1316 while (defined($line) || defined($line = $self->_readline())) {
1317 # de-HTMLify (links that may be encountered here don't contain
1318 # escaped '>', so a simple-minded approach suffices)
1319 $line =~ s{<[^>]+>}{}g;
1320 if ($line =~ m{^(?:\s{0,2})(\w+)\s+(.+)?$}ox) {
1321 ($tag, $data) = ($1, $2 || '');
1322 last if ($tag && !exists $source{$tag});
1323 } else {
1324 return unless $tag;
1325 ($data = $line) =~ s{^\s+}{};
1326 chomp $data;
1327 $tag = 'CLASSIFICATION' if ($tag ne 'CLASSIFICATION' && $tag eq 'ORGANISM' && $line =~ m{[;\.]+});
1329 (exists $ann->{$tag}) ? ($ann->{$tag} .= ' '.$data) : ($ann->{$tag} .= $data);
1330 $line = undef;
1333 ($sl, $class_lines, $sci_name) = ($ann->{SOURCE}, $ann->{CLASSIFICATION}, $ann->{ORGANISM});
1335 $$buffer = $line;
1337 $sci_name || return;
1339 # parse out organelle, common name, abbreviated name if present;
1340 # this should catch everything, but falls back to
1341 # entire SOURCE line just in case
1342 if ($sl =~ m{^
1343 (mitochondrion|chloroplast|plastid)?
1344 \s*(.*?)
1345 \s*(?: \( (.*?) \) )?\.?
1347 }xms ){
1348 ($organelle, $abbr_name, $common) = ($1, $2, $3); # optional
1349 } else {
1350 $abbr_name = $sl; # nothing caught; this is a backup!
1353 # Convert data in classification lines into classification array.
1354 # only split on ';' or '.' so that classification that is 2 or more words will
1355 # still get matched, use map() to remove trailing/leading/intervening spaces
1356 my @class = map { s/^\s+//; s/\s+$//; s/\s{2,}/ /g; $_; } split /(?<!subgen)[;\.]+/, $class_lines;
1358 # do we have a genus?
1359 my $possible_genus = quotemeta($class[-1])
1360 . ($class[-2] ? "|" . quotemeta($class[-2]) : '');
1361 if ($sci_name =~ /^($possible_genus)/) {
1362 $genus = $1;
1363 ($species) = $sci_name =~ /^$genus\s+(.+)/;
1365 else {
1366 $species = $sci_name;
1369 # is this organism of rank species or is it lower?
1370 # (we don't catch everything lower than species, but it doesn't matter -
1371 # this is just so we abide by previous behaviour whilst not calling a
1372 # species a subspecies)
1373 if ($species && $species =~ /subsp\.|var\./) {
1374 ($species, $sub_species) = $species =~ /(.+)\s+((?:subsp\.|var\.).+)/;
1377 # Don't make a species object if it's empty or "Unknown" or "None"
1378 # return unless $genus and $genus !~ /^(Unknown|None)$/oi;
1379 # Don't make a species object if it belongs to taxid 32644
1380 # my $unkn = grep { $_ =~ /^\Q$sl\E$/; } @unkn_names;
1381 my $unkn = grep { $_ eq $sl } @unkn_names;
1382 return unless ($species || $genus) and $unkn == 0;
1384 # Bio::Species array needs array in Species -> Kingdom direction
1385 push(@class, $sci_name);
1386 @class = reverse @class;
1388 my $make = Bio::Species->new();
1389 $make->scientific_name($sci_name);
1390 $make->classification(@class) if @class > 0;
1391 $make->common_name( $common ) if $common;
1392 $make->name('abbreviated', $abbr_name) if $abbr_name;
1393 $make->organelle($organelle) if $organelle;
1394 #$make->sub_species( $sub_species ) if $sub_species;
1395 return $make;
1398 =head2 _read_FTHelper_GenBank
1400 Title : _read_FTHelper_GenBank
1401 Usage : _read_FTHelper_GenBank($buffer)
1402 Function: reads the next FT key line
1403 Example :
1404 Returns : Bio::SeqIO::FTHelper object
1405 Args : filehandle and reference to a scalar
1407 =cut
1409 sub _read_FTHelper_GenBank {
1410 my ($self,$buffer) = @_;
1412 my ($key, # The key of the feature
1413 $loc # The location line from the feature
1415 my @qual = (); # An array of lines making up the qualifiers
1417 if ($$buffer =~ /^\s{5}(\S+)\s+(.+?)\s*$/o) {
1418 $key = $1;
1419 $loc = $2;
1420 # Read all the lines up to the next feature
1421 while ( defined($_ = $self->_readline) ) {
1422 if (/^(\s+)(.+?)\s*$/o) {
1423 # Lines inside features are preceded by 21 spaces
1424 # A new feature is preceded by 5 spaces
1425 if (length($1) > 6) {
1426 # Add to qualifiers if we're in the qualifiers, or if it's
1427 # the first qualifier
1428 if (@qual || (index($2,'/') == 0)) {
1429 push(@qual, $2);
1431 # We're still in the location line, so append to location
1432 else {
1433 $loc .= $2;
1435 } else {
1436 # We've reached the start of the next feature
1437 last;
1439 } else {
1440 # We're at the end of the feature table
1441 last;
1444 } else {
1445 # No feature key
1446 $self->debug("no feature key!\n");
1447 # change suggested by JDiggans to avoid infinite loop-
1448 # see bugreport 1062.
1449 # reset buffer to prevent infinite loop
1450 $$buffer = $self->_readline();
1451 return;
1454 # Put the first line of the next feature into the buffer
1455 $$buffer = $_;
1457 # Make the new FTHelper object
1458 my $out = Bio::SeqIO::FTHelper->new();
1459 $out->verbose($self->verbose());
1460 $out->key($key);
1461 $out->loc($loc);
1463 # Now parse and add any qualifiers. (@qual is kept
1464 # intact to provide informative error messages.)
1465 QUAL:
1466 for (my $i = 0; $i < @qual; $i++) {
1467 $_ = $qual[$i];
1468 my( $qualifier, $value ) = (m{^/([^=]+)(?:=(.+))?})
1469 or $self->warn("cannot see new qualifier in feature $key: ".
1470 $qual[$i]);
1471 $qualifier = '' unless( defined $qualifier);
1472 if (defined $value) {
1473 # Do we have a quoted value?
1474 if (substr($value, 0, 1) eq '"') {
1475 # Keep adding to value until we find the trailing quote
1476 # and the quotes are balanced
1477 while ($value !~ /\"$/ or $value =~ tr/"/"/ % 2) {
1478 if($i >= $#qual) {
1479 $self->warn("Unbalanced quote in:\n" .
1480 join("\n", @qual) .
1481 "No further qualifiers will " .
1482 "be added for this feature");
1483 last QUAL;
1485 $i++; # modifying a for-loop variable inside of the loop
1486 # is not the best programming style ...
1487 my $next = $qual[$i];
1489 # add to value with a space unless the value appears
1490 # to be a sequence (translation for example)
1491 # if(($value.$next) =~ /[^A-Za-z\"\-]/o) {
1492 # changed to explicitly look for translation tag - cjf 06/8/29
1493 if ($qualifier ne 'translation') {
1494 $value .= " ";
1496 $value .= $next;
1498 # Trim leading and trailing quotes
1499 $value =~ s/^"|"$//g;
1500 # Undouble internal quotes
1501 $value =~ s/""/\"/g;
1502 } elsif ( $value =~ /^\(/ ) { # values quoted by ()s
1503 # Keep adding to value until we find the trailing bracket
1504 # and the ()s are balanced
1505 my $left = ($value =~ tr/\(/\(/); # count left parens
1506 my $right = ($value =~ tr/\)/\)/); # count right parens
1508 while( $left != $right ) { # was "$value !~ /\)$/ or $left != $right"
1509 if( $i >= $#qual) {
1510 $self->warn("Unbalanced parens in:\n".
1511 join("\n", @qual).
1512 "\nNo further qualifiers will ".
1513 "be added for this feature");
1514 last QUAL;
1516 $i++;
1517 my $next = $qual[$i];
1518 $value .= $next;
1519 $left += ($next =~ tr/\(/\(/);
1520 $right += ($next =~ tr/\)/\)/);
1523 } else {
1524 $value = '_no_value';
1526 # Store the qualifier
1527 $out->field->{$qualifier} ||= [];
1528 push(@{$out->field->{$qualifier}},$value);
1530 return $out;
1533 =head2 _write_line_GenBank
1535 Title : _write_line_GenBank
1536 Usage :
1537 Function: internal function
1538 Example :
1539 Returns :
1540 Args :
1542 =cut
1544 sub _write_line_GenBank {
1545 my ($self,$pre1,$pre2,$line,$length) = @_;
1547 $length || $self->throw("Miscalled write_line_GenBank without length. Programming error!");
1548 my $subl = $length - length $pre2;
1549 my $linel = length $line;
1550 my $i;
1552 my $subr = substr($line,0,$length - length $pre1);
1554 $self->_print("$pre1$subr\n");
1555 for($i= ($length - length $pre1);$i < $linel; $i += $subl) {
1556 $subr = substr($line,$i,$subl);
1557 $self->_print("$pre2$subr\n");
1562 =head2 _write_line_GenBank_regex
1564 Title : _write_line_GenBank_regex
1565 Usage :
1566 Function: internal function for writing lines of specified
1567 length, with different first and the next line
1568 left hand headers and split at specific points in the
1569 text
1570 Example :
1571 Returns : nothing
1572 Args : file handle,
1573 first header,
1574 second header,
1575 text-line,
1576 regex for line breaks,
1577 total line length
1580 =cut
1582 sub _write_line_GenBank_regex {
1583 my ($self,$pre1,$pre2,$line,$regex,$length) = @_;
1585 #print STDOUT "Going to print with $line!\n";
1587 $length || $self->throw( "Miscalled write_line_GenBank without length. Programming error!");
1589 my $subl = $length - (length $pre1) - 2;
1590 my @lines = ();
1592 CHUNK: while($line) {
1593 foreach my $pat ($regex, '[,;\.\/-]\s|'.$regex, '[,;\.\/-]|'.$regex) {
1594 if($line =~ m/^(.{1,$subl})($pat)(.*)/ ) {
1595 my $l = $1.$2;
1596 my $newl = $3;
1597 $line = substr($line,length($l));
1598 # be strict about not padding spaces according to
1599 # genbank format
1600 $l =~ s/\s+$//;
1601 push(@lines, $l);
1602 next CHUNK;
1605 # if we get here none of the patterns matched $subl or less chars
1606 $self->warn("trouble dissecting \"$line\"\n into chunks ".
1607 "of $subl chars or less - this tag won't print right");
1608 # insert a space char to prevent infinite loops
1609 $line = substr($line,0,$subl) . " " . substr($line,$subl);
1611 my $s = shift @lines;
1612 $self->_print("$pre1$s\n") if $s;
1613 foreach my $s ( @lines ) {
1614 $self->_print("$pre2$s\n");
1618 =head2 _post_sort
1620 Title : _post_sort
1621 Usage : $obj->_post_sort($newval)
1622 Function:
1623 Returns : value of _post_sort
1624 Args : newvalue (optional)
1627 =cut
1629 sub _post_sort {
1630 my ($obj,$value) = @_;
1631 if( defined $value) {
1632 $obj->{'_post_sort'} = $value;
1634 return $obj->{'_post_sort'};
1638 =head2 _show_dna
1640 Title : _show_dna
1641 Usage : $obj->_show_dna($newval)
1642 Function:
1643 Returns : value of _show_dna
1644 Args : newvalue (optional)
1647 =cut
1649 sub _show_dna {
1650 my ($obj,$value) = @_;
1651 if( defined $value) {
1652 $obj->{'_show_dna'} = $value;
1654 return $obj->{'_show_dna'};
1657 =head2 _id_generation_func
1659 Title : _id_generation_func
1660 Usage : $obj->_id_generation_func($newval)
1661 Function:
1662 Returns : value of _id_generation_func
1663 Args : newvalue (optional)
1666 =cut
1668 sub _id_generation_func {
1669 my ($obj,$value) = @_;
1670 if( defined $value ) {
1671 $obj->{'_id_generation_func'} = $value;
1673 return $obj->{'_id_generation_func'};
1677 =head2 _ac_generation_func
1679 Title : _ac_generation_func
1680 Usage : $obj->_ac_generation_func($newval)
1681 Function:
1682 Returns : value of _ac_generation_func
1683 Args : newvalue (optional)
1685 =cut
1687 sub _ac_generation_func {
1688 my ($obj,$value) = @_;
1689 if( defined $value ) {
1690 $obj->{'_ac_generation_func'} = $value;
1692 return $obj->{'_ac_generation_func'};
1696 =head2 _sv_generation_func
1698 Title : _sv_generation_func
1699 Usage : $obj->_sv_generation_func($newval)
1700 Function:
1701 Returns : value of _sv_generation_func
1702 Args : newvalue (optional)
1705 =cut
1707 sub _sv_generation_func {
1708 my ($obj,$value) = @_;
1709 if( defined $value ) {
1710 $obj->{'_sv_generation_func'} = $value;
1712 return $obj->{'_sv_generation_func'};
1717 =head2 _kw_generation_func
1719 Title : _kw_generation_func
1720 Usage : $obj->_kw_generation_func($newval)
1721 Function:
1722 Returns : value of _kw_generation_func
1723 Args : newvalue (optional)
1725 =cut
1728 sub _kw_generation_func {
1729 my ($obj,$value) = @_;
1730 if( defined $value ) {
1731 $obj->{'_kw_generation_func'} = $value;
1733 return $obj->{'_kw_generation_func'};