2 # BioPerl module for Bio::SeqIO::bsml
4 # Cared for by Charles Tilford (tilfordc@bms.com)
5 # Copyright (C) Charles Tilford 2001
7 # This library is free software; you can redistribute it and/or
8 # modify it under the terms of the GNU Lesser General Public
9 # License as published by the Free Software Foundation; either
10 # version 2.1 of the License, or (at your option) any later version.
12 # This library is distributed in the hope that it will be useful,
13 # but WITHOUT ANY WARRANTY; without even the implied warranty of
14 # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
15 # Lesser General Public License for more details.
17 # You should have received a copy of the GNU Lesser General Public
18 # License along with this library; if not, write to the Free Software
19 # Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
20 # Also at: http://www.gnu.org/copyleft/lesser.html
23 # Much of the basic documentation in this module has been
24 # cut-and-pasted from the embl.pm (Ewan Birney) SeqIO module.
29 Bio::SeqIO::bsml - BSML sequence input/output stream
33 It is probably best not to use this object directly, but rather go
34 through the SeqIO handler system. To read a BSML file:
36 $stream = Bio::SeqIO->new( -file => $filename, -format => 'bsml');
38 while ( my $bioSeqObj = $stream->next_seq() ) {
39 # do something with $bioSeqObj
42 To write a Seq object to the current file handle in BSML XML format:
44 $stream->write_seq( -seq => $seqObj);
46 If instead you would like a XML::DOM object containing the BSML, use:
48 my $newXmlObject = $stream->to_bsml( -seq => $seqObj);
52 In addition to parts of the Bio:: hierarchy, this module uses:
58 This object can transform Bio::Seq objects to and from BSML (XML)
63 2/1/02 - I have changed the API to more closely match argument
64 passing used by other BioPerl methods ( -tag => value ). Internal
65 methods are using the same API, but you should not be calling those
72 User feedback is an integral part of the evolution of this and other
73 Bioperl modules. Send your comments and suggestions preferably to one
74 of the Bioperl mailing lists. Your participation is much appreciated.
76 bioperl-l@bioperl.org - General discussion
77 http://bioperl.org/wiki/Mailing_lists - About the mailing lists
81 Report bugs to the Bioperl bug tracking system to help us keep track
82 the bugs and their resolution.
83 Bug reports can be submitted via the web:
85 http://bugzilla.open-bio.org/
87 =head2 Things Still to Do
89 * The module now uses the new Collection.pm system. However,
90 Annotations associated with a Feature object still seem to use the
91 old system, so parsing with the old methods are included..
93 * Generate Seq objects with no sequence data but an assigned
94 length. This appears to be an issue with Bio::Seq. It is possible
95 (and reasonable) to make a BSML document with features but no
98 * Support <Seq-data-import>. Do not know how commonly this is used.
100 * Some features are awaiting implementation in later versions of
103 * Nested feature support
105 * Complex feature (ie joins)
107 * Unambiguity in strand (ie -1,0,1, not just 'complement' )
109 * More friendly dblink structures
111 * Location.pm (or RangeI::union?) appears to have a bug when 'expand'
114 * More intelligent hunting for sequence and feature titles? It is not
115 terribly clear where the most appropriate field is located, better
116 grepping (eg looking for a reasonable count for spaces and numbers)
117 may allow for titles better than "AE008041".
119 =head1 AUTHOR - Charles Tilford
121 Bristol-Myers Squibb Bioinformatics
123 Email tilfordc@bms.com
125 I have developed the BSML specific code for this package, but have used
126 code from other SeqIO packages for much of the nuts-and-bolts. In particular
127 I have used code from the embl.pm module either directly or as a framework
128 for many of the subroutines that are common to SeqIO modules.
132 package Bio
::SeqIO
::bsml
;
135 use Bio
::SeqFeature
::Generic
;
138 use Bio
::Seq
::SeqFactory
;
139 use Bio
::Annotation
::Collection
;
140 use Bio
::Annotation
::Comment
;
141 use Bio
::Annotation
::Reference
;
142 use Bio
::Annotation
::DBLink
;
144 use base
qw(Bio::SeqIO);
146 my $idcounter = {}; # Used to generate unique id values
147 my $nvtoken = ": "; # The token used if a name/value pair has to be stuffed
154 # LS: this seems to get overwritten on line 1317, generating a redefinition error. Dead code?
155 # CAT: This was inappropriately added in revision 1.10 - I added the check for existance of a sequence factory to the actual _initialize
157 # my($self,@args) = @_;
158 # $self->SUPER::_initialize(@args);
159 # if( ! defined $self->sequence_factory ) {
160 # $self->sequence_factory(Bio::Seq::SeqFactory->new(-verbose => $self->verbose(), -type => 'Bio::Seq::RichSeq'));
167 Usage : my $bioSeqObj = $stream->next_seq
168 Function: Retrieves the next sequence from a SeqIO::bsml stream.
169 Returns : A reference to a Bio::Seq::RichSeq object
177 my $bioSeq = $self->sequence_factory->create(-verbose
=>$self->verbose());
179 unless (exists $self->{'domtree'}) {
180 $self->throw("A BSML document has not yet been parsed.");
183 my $dom = $self->{'domtree'};
184 my $seqElements = $dom->getElementsByTagName ("Sequence");
185 if ($self->{'current_node'} == $seqElements->getLength ) {
186 # There are no more <Sequence>s to process
189 my $xmlSeq = $seqElements->item($self->{'current_node'});
191 # Assume that title attribute contains the best display id
192 if (my $val = $xmlSeq->getAttribute( "title")) {
193 $bioSeq->display_id($val);
196 # Set the molecule type
197 if (my $val = $xmlSeq->getAttribute( "molecule" )) {
198 my %mol = ('dna' => 'DNA', 'rna' => 'RNA', 'aa' => 'protein');
199 $bioSeq->molecule($mol{ lc($val) });
202 # Set the accession number
203 if (my $val = $xmlSeq->getAttribute( "ic-acckey" )) {
204 $bioSeq->accession_number($val);
207 # Get the sequence data for the element
208 if (my $seqData = &FIRSTDATA
($xmlSeq->getElementsByTagName("Seq-data")
210 # Sequence data exists, transfer to the Seq object
211 # Remove white space and CRs (not neccesary?)
212 $seqData =~ s/[\s\n\r]//g;
213 $bioSeq->seq($seqData);
214 } elsif (my $import = $xmlSeq->getElementsByTagName("Seq-dataimport")
216 #>>>> # What about <Seq-data-import> ??
218 } elsif (my $val = $xmlSeq->getAttribute("length")) {
219 # No sequence defined, set the length directly
221 #>>>> # This does not appear to work - length is apparently calculated
222 # from the sequence. How to make a "virtual" sequence??? Such
223 # creatures are common in BSML...
224 $bioSeq->length($val);
227 my $species = Bio
::Species
->new();
228 my @classification = ();
230 # Peruse the generic <Attributes> - those that are direct children of
231 # the <Sequence> or the <Feature-tables> element
232 # Sticky wicket here - data not controlled by schema, could be anything
234 my %specs = ('common_name' => 'y',
237 'sub_species' => 'y',
240 'add_date' => [ qw(date date-created date-last-updated)],
241 'keywords' => [ 'keyword', ],
242 'seq_version' => [ 'version' ],
243 'division' => [ 'division' ],
244 'add_secondary_accession' => ['accession'],
246 'primary_id' => [ 'primary.id', 'primary_id' ],
249 my $floppies = &GETFLOPPIES
($xmlSeq);
250 for my $attr (@
{$floppies}) {
251 # Don't want to get attributes from <Feature> or <Table> elements yet
252 my $parent = $attr->getParentNode->getNodeName;
253 next unless($parent eq "Sequence" || $parent eq "Feature-tables");
255 my ($name, $content) = &FLOPPYVALS
($attr);
257 if (exists $specs{$name}) { # It looks like part of species...
258 $species->$name($content);
262 # Cycle through the Seq methods:
263 for my $method (keys %seqMap) {
264 # Cycle through potential matching attributes:
265 for my $match (@
{$seqMap{$method}}) {
266 # If the <Attribute> name matches one of the keys,
267 # set $value, unless it has already been set
268 $value ||= $content if ($name =~ /$match/i);
272 if( $method eq 'seq_version'&& $value =~ /\S+\.(\d+)/ ) {
273 # hack for the fact that data in version is actually
277 $bioSeq->$method($value);
281 if( $name eq 'database-xref' ) {
282 my ($link_id,$link_db) = split(/:/,$value);
283 push @links, Bio
::Annotation
::DBLink
->new(-primary_id
=> $link_id,
284 -database
=> $link_db);
286 next if ($value ne "");
288 if ($name =~ /^species$/i) { # Uh, it's the species designation?
289 if ($content =~ / /) {
290 # Assume that a full species name has been provided
291 # This will screw up if the last word is the subspecies...
292 my @break = split " ", $content;
293 @classification = reverse @break;
295 $classification[0] = $content;
299 if ($name =~ /sub[_ ]?species/i) { # Should be the subspecies...
300 $species->sub_species( $content );
303 if ($name =~ /classification/i) { # Should be species classification
304 # We will assume that there are spaces separating the terms:
305 my @bits = split " ", $content;
306 # Now make sure there is not other cruft as well (eg semi-colons)
307 for my $i (0..$#bits) {
308 $bits[$i] =~ /(\w+)/;
311 $species->classification( @bits );
314 if ($name =~ /comment/) {
315 my $com = Bio
::Annotation
::Comment
->new('-text' => $content);
316 # $bioSeq->annotation->add_Comment($com);
317 $bioSeq->annotation->add_Annotation('comment', $com);
320 # Description line - collect all descriptions for later assembly
321 if ($name =~ /descr/) {
322 push @seqDesc, $content;
325 # Ok, we have no idea what this attribute is. Dump to SimpleValue
326 my $simp = Bio
::Annotation
::SimpleValue
->new( -value
=> $content);
327 $bioSeq->annotation->add_Annotation($name, $simp);
329 unless ($#seqDesc < 0) {
330 $bioSeq->desc( join "; ", @seqDesc);
333 #>>>> This should be modified so that any IDREF associated with the
334 # <Reference> is then used to associate the reference with the
335 # appropriate Feature
337 # Extract out <Reference>s associated with the sequence
340 -title
=> "RefTitle",
341 -authors
=> "RefAuthors",
342 -location
=> "RefJournal",
344 for my $ref ( $xmlSeq->getElementsByTagName ("Reference") ) {
346 for my $tag (keys %tags) {
347 my $rt = &FIRSTDATA
($ref->getElementsByTagName($tags{$tag})
350 $rt =~ s/^[\s\r\n]+//; # Kill leading space
351 $rt =~ s/[\s\r\n]+$//; # Kill trailing space
352 $rt =~ s/[\s\r\n]+/ /; # Collapse internal space runs
353 $refVals{$tag} = $rt;
355 my $reference = Bio
::Annotation
::Reference
->new( %refVals );
357 # Pull out any <Reference> information hidden in <Attributes>
359 comment
=> [ 'comment', 'remark' ],
360 medline
=> [ 'medline', ],
361 pubmed
=> [ 'pubmed' ],
362 start
=> [ 'start', 'begin' ],
363 end
=> [ 'stop', 'end' ],
366 my $floppies = &GETFLOPPIES
($ref);
367 for my $attr (@
{$floppies}) {
368 my ($name, $content) = &FLOPPYVALS
($attr);
370 # Cycle through the Seq methods:
371 for my $method (keys %refMap) {
372 # Cycle through potential matching attributes:
373 for my $match (@
{$refMap{$method}}) {
374 # If the <Attribute> name matches one of the keys,
375 # set $value, unless it has already been set
376 $value ||= $content if ($name =~ /$match/i);
379 my $str = '$reference->' . $method . "($value)";
384 next if ($value ne "");
385 # Don't know what the <Attribute> is, dump it to comments:
386 push @refCom, $name . $nvtoken . $content;
388 unless ($#refCom < 0) {
389 # Random stuff was found, tack it to the comment field
390 my $exist = $reference->comment;
391 $exist .= join ", ", @refCom;
392 $reference->comment($exist);
394 push @refs, $reference;
396 $bioSeq->annotation->add_Annotation('reference' => $_) for @refs;
397 my $ann_col = $bioSeq->annotation;
398 # Extract the <Feature>s for this <Sequence>
399 for my $feat ( $xmlSeq->getElementsByTagName("Feature") ) {
400 $bioSeq->add_SeqFeature( $self->_parse_bsml_feature($feat) );
403 $species->classification( @classification );
404 $bioSeq->species( $species );
406 $bioSeq->annotation->add_Annotation('dblink' => $_) for @links;
408 $self->{'current_node'}++;
411 #- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
412 # Get all the <Attribute> and <Qualifier> children for an object, and
413 # return them as an array reference
414 # ('floppy' since these elements have poor/no schema control)
419 my $attributes = $obj->getElementsByTagName ("Attribute");
420 for (my $i = 0; $i < $attributes->getLength; $i++) {
421 push @floppies, $attributes->item($i);
423 my $qualifiers = $obj->getElementsByTagName ("Qualifier");
424 for (my $i = 0; $i < $qualifiers->getLength; $i++) {
425 push @floppies, $qualifiers->item($i);
429 #- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
430 # Given a DOM <Attribute> or <Qualifier> object, return the [name, value] pair
435 if ($obj->getNodeName eq "Attribute") {
436 $name = $obj->getAttribute('name');
437 $value = $obj->getAttribute('content');
438 } elsif ($obj->getNodeName eq "Qualifier") {
439 # Wheras <Attribute>s require both 'name' and 'content' attributes,
440 # <Qualifier>s can technically have either blank (and sometimes do)
441 my $n = $obj->getAttribute('value-type');
442 $name = $n if ($n ne "");
443 my $v = $obj->getAttribute('value');
444 $value = $v if ($v ne "");
446 return ($name, $value);
448 #- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
449 # Returns the value of the first TEXT_NODE encountered below an element
450 # Rational - avoid grabbing a comment rather than the PCDATA. Not foolproof...
453 return unless ($element);
455 my $hopefuls = $element->getChildNodes;
457 for (my $i = 0; $i < $hopefuls->getLength; $i++) {
458 if ($hopefuls->item($i)->getNodeType ==
459 XML
::DOM
::Node
::TEXT_NODE
() ) {
460 $data = $hopefuls->item($i)->getNodeValue;
466 #- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
467 # Just collapses whitespace runs in a string
470 $string =~ s/[\s\r\n]+/ /g;
477 Usage : my $domDoc = $obj->to_bsml(@args)
478 Function: Generates an XML structure for one or more Bio::Seq objects.
479 If $seqref is an array ref, the XML tree generated will include
480 all the sequences in the array.
481 Returns : A reference to the XML DOM::Document object generated / modified
482 Args : Argument array in form of -key => val. Recognized keys:
484 -seq A Bio::Seq reference, or an array reference of many of them
486 -xmldoc Specifies an existing XML DOM document to add the sequences
487 to. If included, then only data (no page formatting) will
488 be added. If not, a new XML::DOM::Document will be made,
489 and will be populated with both <Sequence> data, as well as
490 <Page> display elements.
492 -nodisp Do not generate <Display> elements, or any children
493 thereof, even if -xmldoc is not set.
495 -skipfeat If set to 'all', all <Feature>s will be skipped. If it is
496 a hash reference, any <Feature> with a class matching a key
497 in the hash will be skipped - for example, to skip 'source'
498 and 'score' features, use:
500 -skipfeat => { source => 'Y', score => 'Y' }
502 -skiptags As above: if set to 'all', no tags are included, and if a
503 hash reference, those specific tags will be ignored.
505 Skipping some or all tags and features can result in
506 noticable speed improvements.
508 -nodata If true, then <Seq-data> will not be included. This may be
509 useful if you just want annotations and do not care about
510 the raw ACTG information.
512 -return Default is 'xml', which will return a reference to the BSML
513 XML object. If set to 'seq' will return an array ref of the
514 <Sequence> objects added (rather than the whole XML object)
516 -close Early BSML browsers will crash if an element *could* have
517 children but does not, and is closed as an empty element
518 e.g. <Styles/>. If -close is true, then such tags are given
519 a comment child to explicitly close them e.g. <Styles><!--
520 --></Styles>. This is default true, set to "0" if you do
521 not want this behavior.
523 Examples : my $domObj = $stream->to_bsml( -seq => \@fourCoolSequenceObjects,
524 -skipfeat => { source => 1 },
527 # Or add sequences to an existing BSML document:
528 $stream->to_bsml( -seq => \@fourCoolSequenceObjects,
529 -skipfeat => { source => 1 },
530 -xmldoc => $myBsmlDocumentInProgress, );
536 my $args = $self->_parseparams( -close => 1,
539 $args->{NODISP
} ||= $args->{NODISPLAY
};
540 my $seqref = $args->{SEQ
};
541 $seqref = (ref($seqref) eq 'ARRAY') ?
$seqref : [ $seqref ];
543 #############################
544 # Basic BSML XML Components #
545 #############################
548 my ($bsmlElem, $defsElem, $seqsElem, $dispElem);
549 if ($args->{XMLDOC
}) {
550 # The user has provided an existing XML DOM object
551 $xml = $args->{XMLDOC
};
552 unless ($xml->isa("XML::DOM::Document")) {
553 $self->throw('SeqIO::bsml.pm error:\n'.
554 'When calling ->to_bsml( { xmldoc => $myDoc }), $myDoc \n' .
555 'should be an XML::DOM::Document object, or an object that\n'.
556 'inherits from that class (like BsmlHelper.pm)');
559 # The user has not provided a new document, make one from scratch
560 $xml = XML
::DOM
::Document
->new();
561 $xml->setXMLDecl( $xml->createXMLDecl("1.0") );
562 my $url = "http://www.labbook.com/dtd/bsml2_2.dtd";
563 my $doc = $xml->createDocumentType("Bsml",$url);
564 $xml->setDoctype($doc);
565 $bsmlElem = $self->_addel( $xml, 'Bsml');
566 $defsElem = $self->_addel( $bsmlElem, 'Definitions');
567 $seqsElem = $self->_addel( $defsElem, 'Sequences');
568 unless ($args->{NODISP
}) {
569 $dispElem = $self->_addel( $bsmlElem, 'Display');
570 my $stylElem = $self->_addel( $dispElem, 'Styles');
571 my $style = $self->_addel( $stylElem, 'Style', {
572 type
=> "text/css" });
574 qq(Interval
-widget
{ display
: "1"; }\n) .
575 qq(Feature
{ display
-auto
: "1"; });
576 $style->appendChild( $xml->createTextNode($styleText) );
580 # Establish fundamental BSML elements, if they do not already exist
581 $bsmlElem ||= $xml->getElementsByTagName("Bsml")->item(0);
582 $defsElem ||= $xml->getElementsByTagName("Definitions")->item(0);
583 $seqsElem ||= $xml->getElementsByTagName("Sequences")->item(0);
589 # Map over Bio::Seq to BSML
590 my %mol = ('dna' => 'DNA', 'rna' => 'RNA', 'protein' => 'AA');
593 for my $bioSeq (@
{$seqref}) {
594 my $xmlSeq = $xml->createElement("Sequence");
595 my $FTs = $xml->createElement("Feature-tables");
597 # Array references to hold <Reference> objects:
598 my $seqRefs = []; my $featRefs = [];
599 # Array references to hold <Attribute> values (not objects):
601 push @
{$seqDesc}, ["comment" , "This file generated to BSML 2.2 standards - joins will be collapsed to a single feature enclosing all members of the join"];
602 push @
{$seqDesc}, ["description" , eval{$bioSeq->desc}];
603 for my $kwd ( eval{$bioSeq->get_keywords} ) {
604 push @
{$seqDesc}, ["keyword" , $kwd];
606 push @
{$seqDesc}, ["keyword" , eval{$bioSeq->keywords}];
607 push @
{$seqDesc}, ["version" , eval{
608 join(".", $bioSeq->accession_number, $bioSeq->seq_version); }];
609 push @
{$seqDesc}, ["division" , eval{$bioSeq->division}];
610 push @
{$seqDesc}, ["pid" , eval{$bioSeq->pid}];
611 # push @{$seqDesc}, ["bio_object" , ref($bioSeq)];
612 push @
{$seqDesc}, ["primary_id" , eval{$bioSeq->primary_id}];
613 for my $dt (eval{$bioSeq->get_dates()} ) {
614 push @
{$seqDesc}, ["date" , $dt];
616 for my $ac (eval{$bioSeq->get_secondary_accessions()} ) {
617 push @
{$seqDesc}, ["secondary_accession" , $ac];
620 # Determine the accession number and a unique identifier
621 my $acc = $bioSeq->accession_number eq "unknown" ?
622 "" : $bioSeq->accession_number;
624 my $pi = $bioSeq->primary_id;
625 if ($pi && $pi !~ /Bio::/) {
626 # Not sure I understand what primary_id is... It sometimes
627 # is a string describing a reference to a BioSeq object...
628 $id = "SEQ" . $bioSeq->primary_id;
630 # Nothing useful found, make a new unique ID
631 $id = $acc || ("SEQ-io" . $idcounter->{Sequence
}++);
633 # print "$id->",ref($bioSeq->primary_id),"\n";
634 # An id field with spaces is interpreted as an idref - kill the spaces
636 # Map over <Sequence> attributes
637 my %attr = ( 'title' => $bioSeq->display_id,
638 'length' => $bioSeq->length,
641 'representation' => 'raw',
643 $attr{molecule
} = $mol{ lc($bioSeq->molecule) } if $bioSeq->can('molecule');
646 for my $a (keys %attr) {
647 $xmlSeq->setAttribute($a, $attr{$a}) if (defined $attr{$a} &&
650 # Orphaned Attributes:
651 $xmlSeq->setAttribute('topology', 'circular')
652 if ($bioSeq->is_circular);
653 # <Sequence> strand, locus
655 $self->_add_page($xml, $xmlSeq) if ($dispElem);
660 # Check for Bio::Annotations on the * <Sequence> *.
661 $self->_parse_annotation( -xml
=> $xml, -obj
=> $bioSeq,
662 -desc
=> $seqDesc, -refs
=> $seqRefs);
664 # Incorporate species data
665 if (ref($bioSeq->species) eq 'Bio::Species') {
666 # Need to peer into Bio::Species ...
667 my @specs = ('common_name', 'genus', 'species', 'sub_species');
668 for my $sp (@specs) {
669 next unless (my $val = $bioSeq->species()->$sp());
670 push @
{$seqDesc}, [$sp , $val];
672 push @
{$seqDesc}, ['classification',
673 (join " ", $bioSeq->species->classification) ];
674 # Species::binomial will return "genus species sub_species" ...
675 } elsif (my $val = $bioSeq->species) {
676 # Ok, no idea what it is, just dump it in there...
677 push @
{$seqDesc}, ["species", $val];
680 # Add the description <Attribute>s for the <Sequence>
681 for my $seqD (@
{$seqDesc}) {
682 $self->_addel($xmlSeq, "Attribute", {
683 name
=> $seqD->[0], content
=> $seqD->[1]}) if ($seqD->[1]);
686 # If sequence references were added, make a Feature-table for them
687 unless ($#{$seqRefs} < 0) {
688 my $seqFT = $self->_addel($FTs, "Feature-table", {
689 title
=> "Sequence References", });
690 for my $feat (@
{$seqRefs}) {
691 $seqFT->appendChild($feat);
695 # This is the appropriate place to add <Feature-tables>
696 $xmlSeq->appendChild($FTs);
702 #>>>> # Perhaps it is better to loop through top_Seqfeatures?...
703 #>>>> # ...however, BSML does not have a hierarchy for Features
705 if (defined $args->{SKIPFEAT
} &&
706 $args->{SKIPFEAT
} eq 'all') {
707 $args->{SKIPFEAT
} = { all
=> 1};
708 } else { $args->{SKIPFEAT
} ||= {} }
709 for my $class (keys %{$args->{SKIPFEAT
}}) {
710 $args->{SKIPFEAT
}{lc($class)} = $args->{SKIPFEAT
}{$class};
712 # Loop through all the features
713 my @features = $bioSeq->all_SeqFeatures();
714 if (@features && !$args->{SKIPFEAT
}{all
}) {
715 my $ft = $self->_addel($FTs, "Feature-table", {
716 title
=> "Features", });
717 for my $bioFeat (@features ) {
719 my $class = lc($bioFeat->primary_tag);
720 # The user may have specified to ignore this type of feature
721 next if ($args->{SKIPFEAT
}{$class});
722 my $id = "FEAT-io" . $idcounter->{Feature
}++;
723 my $xmlFeat = $self->_addel( $ft, 'Feature', {
726 'value-type' => $bioFeat->source_tag });
727 # Check for Bio::Annotations on the * <Feature> *.
728 $self->_parse_annotation( -xml
=> $xml, -obj
=> $bioFeat,
729 -desc
=> $featDesc, -id
=> $id,
730 -refs
=>$featRefs, );
731 # Add the description stuff for the <Feature>
732 for my $de (@
{$featDesc}) {
733 $self->_addel($xmlFeat, "Attribute", {
734 name
=> $de->[0], content
=> $de->[1]}) if ($de->[1]);
736 $self->_parse_location($xml, $xmlFeat, $bioFeat);
738 # loop through the tags, add them as <Qualifiers>
739 next if (defined $args->{SKIPTAGS
} &&
740 $args->{SKIPTAGS
} =~ /all/i);
741 # Tags can consume a lot of CPU cycles, and can often be
742 # rather non-informative, so -skiptags can allow total or
743 # selective omission of tags.
744 for my $tag ($bioFeat->all_tags()) {
745 next if (exists $args->{SKIPTAGS
}{$tag});
746 for my $val ($bioFeat->each_tag_value($tag)) {
747 $self->_addel( $xmlFeat, 'Qualifier', {
748 'value-type' => $tag ,
760 if ( (my $data = $bioSeq->seq) && !$args->{NODATA
} ) {
761 my $d = $self->_addel($xmlSeq, 'Seq-data');
762 $d->appendChild( $xml->createTextNode($data) );
765 # If references were added, make a Feature-table for them
766 unless ($#{$featRefs} < 0) {
767 my $seqFT = $self->_addel($FTs, "Feature-table", {
768 title
=> "Feature References", });
769 for my $feat (@
{$featRefs}) {
770 $seqFT->appendChild($feat);
774 # Place the completed <Sequence> tree as a child of <Sequences>
775 $seqsElem->appendChild($xmlSeq);
776 push @xmlSequences, $xmlSeq;
779 # Prevent browser crashes by explicitly closing empty elements:
780 if ($args->{CLOSE
}) {
781 my @problemChild = ('Sequences', 'Sequence', 'Feature-tables',
782 'Feature-table', 'Screen', 'View',);
783 for my $kid (@problemChild) {
784 for my $prob ($xml->getElementsByTagName($kid)) {
785 unless ($prob->hasChildNodes) {
787 $xml->createComment(" Must close <$kid> explicitly "));
793 if (defined $args->{RETURN
} &&
794 $args->{RETURN
} =~ /seq/i) {
795 return \
@xmlSequences;
804 Usage : $obj->write_seq(@args)
805 Function: Prints out an XML structure for one or more Bio::Seq objects.
806 If $seqref is an array ref, the XML tree generated will include
807 all the sequences in the array. This method is fairly simple,
808 most of the processing is performed within to_bsml.
809 Returns : A reference to the XML object generated / modified
810 Args : Argument array. Recognized keys:
812 -seq A Bio::Seq reference, or an array reference of many of them
814 Alternatively, the method may be called simply as...
816 $obj->write_seq( $bioseq )
818 ... if only a single argument is passed, it is assumed that
819 it is the sequence object (can also be an array ref of
822 -printmime If true prints "Content-type: $mimetype\n\n" at top of
823 document, where $mimetype is the value designated by this
824 key. For generic XML use text/xml, for BSML use text/x-bsml
826 -return This option will be supressed, since the nature of this
827 method is to print out the XML document. If you wish to
828 retrieve the <Sequence> objects generated, use the to_bsml
835 my $args = $self->_parseparams( @_);
837 # If only a single value is passed, assume it is the seq object
840 # Build a BSML XML DOM object based on the sequence(s)
841 my $xml = $self->to_bsml( @_,
843 # Convert to a string
844 my $out = $xml->toString;
845 # Print after putting a return after each element - more readable
847 $self->_print("Content-type: " . $args->{PRINTMIME
} . "\n\n")
848 if ($args->{PRINTMIME
});
849 $self->_print( $out );
850 # Return the DOM tree in case the user wants to do something with it
852 $self->flush if $self->_flush_on_write && defined $self->_fh;
856 =head1 INTERNAL METHODS
857 #-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-
859 The following methods are used for internal processing, and should probably
860 not be accessed by the user.
862 =head2 _parse_location
864 Title : _parse_location
865 Usage : $obj->_parse_location($xmlDocument, $parentElem, $SeqFeatureObj)
866 Function: Adds <Interval-loc> and <Site-loc> children to <$parentElem> based
867 on locations / sublocations found in $SeqFeatureObj. If
868 sublocations exist, the original location will be ignored.
869 Returns : An array ref containing the elements added to the parent.
870 These will have already been added to <$parentElem>
871 Args : 0 The DOM::Document being modified
872 1 The DOM::Element parent that you want to add to
873 2 Reference to the Bio::SeqFeature being analyzed
877 ###############################
878 # <Interval-loc> & <Site-loc> #
879 ###############################
881 sub _parse_location
{
883 my ($xml, $xmlFeat, $bioFeat) = @_;
884 my $bioLoc = $bioFeat->location;
886 if (ref($bioLoc) =~ /Split/) {
887 @locations = $bioLoc->sub_Location;
888 # BSML 2.2 does not recognize / support joins. For this reason,
889 # we will just use the upper-level location. The line below can
890 # be deleted or commented out if/when BSML 3 supports complex
891 # interval deffinitions:
892 @locations = ($bioLoc);
894 @locations = ($bioLoc);
898 # Add the site or interval positional information:
899 for my $loc (@locations) {
900 my ($start, $end) = ($loc->start, $loc->end);
902 # Strand information is not well described in BSML
903 $locAttr{complement
} = 1 if ($loc->strand == -1);
904 if ($start ne "" && ($start == $end || $end eq "")) {
905 $locAttr{sitepos
} = $start;
906 push @added, $self->_addel($xmlFeat,'Site-loc',\
%locAttr);
907 } elsif ($start ne "" && $end ne "") {
909 # The feature is on the complementary strand
910 ($start, $end) = ($end, $start);
911 $locAttr{complement
} = 1;
913 $locAttr{startpos
} = $start;
914 $locAttr{endpos
} = $end;
915 push @added, $self->_addel($xmlFeat,'Interval-loc',\
%locAttr);
917 warn "Failure to parse SeqFeature location. Start = '$start' & End = '$end'";
923 =head2 _parse_bsml_feature
925 Title : _parse_bsml_feature
926 Usage : $obj->_parse_bsml_feature($xmlFeature )
927 Function: Will examine the <Feature> element provided by $xmlFeature and
928 return a generic seq feature.
929 Returns : Bio::SeqFeature::Generic
930 Args : 0 XML::DOM::Element <Feature> being analyzed.
934 sub _parse_bsml_feature
{
938 my $basegsf = Bio
::SeqFeature
::Generic
->new();
943 # Use the class as the primary tag value, if it is present
944 if ( my $val = $feat->getAttribute("class") ) {
945 $basegsf->primary_tag($val);
948 # Positional information is in <Interval-loc>s or <Site-loc>s
949 # We need to grab these in order, to try to recreate joins...
951 for my $kid ($feat->getChildNodes) {
952 my $nodeName = $kid->getNodeName;
953 next unless ($nodeName eq "Interval-loc" ||
954 $nodeName eq "Site-loc");
955 push @locations, $kid;
957 if ($#locations == 0) {
958 # There is only one location specified
959 $self->_parse_bsml_location($locations[0], $basegsf);
960 } elsif ($#locations > 0) {
961 #>>>> # This is not working, I think the error is somewhere downstream
962 # of add_sub_SeqFeature, probably in RangeI::union ?
963 # The sub features are added fine, but the EXPANDed parent feature
964 # location has a messed up start - Bio::SeqFeature::Generic ref
965 # instead of an integer - and an incorrect end - the end of the first
966 # sub feature added, not of the union of all of them.
968 # Also, the SeqIO::genbank.pm output is odd - the sub features appear
969 # to be listed with the *previous* feature, not this one.
971 for my $location (@locations) {
972 my $subgsf = $self->_parse_bsml_location($location);
973 # print "start ", $subgsf->start,"\n";
974 # print "end ", $subgsf->end,"\n";
975 $basegsf->add_sub_SeqFeature($subgsf, 'EXPAND');
977 # print $feat->getAttribute('id'),"\n";
978 # print $basegsf->primary_tag,"\n";
981 # What to do if there are no locations? Nothing needed?
984 # Look at any <Attribute>s or <Qualifier>s that are present:
985 my $floppies = &GETFLOPPIES
($feat);
986 for my $attr (@
{$floppies}) {
987 my ($name, $content) = &FLOPPYVALS
($attr);
988 # Don't know what the object is, dump it to a tag:
989 $basegsf->add_tag_value(lc($name), $content);
992 # Mostly this helps with debugging, but may be of utility...
993 # Add a tag holding the BSML id value
994 if ( (my $val = $feat->getAttribute('id')) &&
995 !$basegsf->has_tag('bsml-id')) {
996 # Decided that this got a little sloppy...
997 # $basegsf->add_tag_value("bsml-id", $val);
1002 =head2 _parse_bsml_location
1004 Title : _parse_bsml_location
1005 Usage : $obj->_parse_bsml_feature( $intOrSiteLoc, $gsfObject )
1006 Function: Will examine the <Interval-loc> or <Site-loc> element provided
1007 Returns : Bio::SeqFeature::Generic
1008 Args : 0 XML::DOM::Element <Interval/Site-loc> being analyzed.
1009 1 Optional SeqFeature::Generic to use
1013 sub _parse_bsml_location
{
1015 my ($loc, $gsf) = @_;
1017 $gsf ||= Bio
::SeqFeature
::Generic
->new();
1018 my $type = $loc->getNodeName;
1020 if ($type eq 'Interval-loc') {
1021 $start = $loc->getAttribute('startpos');
1022 $end = $loc->getAttribute('endpos');
1023 } elsif ($type eq 'Site-loc') {
1024 $start = $end = $loc->getAttribute('sitepos');
1026 warn "Unknown location type '$type', could not make GSF\n";
1029 $gsf->start($start);
1032 # BSML does not have an explicit method to set undefined strand
1033 if (my $s = $loc->getAttribute("complement")) {
1040 # We're setting "strand nonspecific" here - bad idea?
1041 # In most cases the user likely meant it to be on the + strand
1048 =head2 _parse_reference
1050 Title : _parse_reference
1051 Usage : $obj->_parse_reference(@args )
1052 Function: Makes a new <Reference> object from a ::Reference, which is
1053 then stored in an array provide by -refs. It will be
1054 appended to the XML tree later.
1056 Args : Argument array. Recognized keys:
1058 -xml The DOM::Document being modified
1060 -refobj The Annotation::Reference Object
1062 -refs An array reference to hold the new <Reference> DOM object
1064 -id Optional. If the XML id for the 'calling' element is
1065 provided, it will be placed in any <Reference> refs
1070 sub _parse_reference
{
1072 my $args = $self->_parseparams( @_);
1073 my ($xml, $ref, $refRef) = ($args->{XML
}, $args->{REFOBJ
}, $args->{REFS
});
1079 my $xmlRef = $xml->createElement("Reference");
1080 #>> This may not be the right way to make a BSML dbxref...
1081 if (my $link = $ref->medline) {
1082 $xmlRef->setAttribute('dbxref', $link);
1085 # Make attributes for some of the characteristics
1086 my %stuff = ( start
=> $ref->start,
1089 comment
=> $ref->comment,
1090 pubmed
=> $ref->pubmed,
1092 for my $s (keys %stuff) {
1093 $self->_addel($xmlRef, "Attribute", {
1094 name
=> $s, content
=> $stuff{$s} }) if ($stuff{$s});
1096 $xmlRef->setAttribute('refs', $args->{ID
}) if ($args->{ID
});
1097 # Add the basic information
1098 # Should probably check for content before creation...
1099 $self->_addel($xmlRef, "RefAuthors")->
1100 appendChild
( $xml->createTextNode(&STRIP
($ref->authors)) );
1101 $self->_addel($xmlRef, "RefTitle")->
1102 appendChild
( $xml->createTextNode(&STRIP
($ref->title)) );
1103 $self->_addel($xmlRef, "RefJournal")->
1104 appendChild
( $xml->createTextNode(&STRIP
($ref->location)) );
1105 # References will be added later in a <Feature-Table>
1106 push @
{$refRef}, $xmlRef;
1109 =head2 _parse_annotation
1111 Title : _parse_annotation
1112 Usage : $obj->_parse_annotation(@args )
1113 Function: Will examine any Annotations found in -obj. Data found in
1114 ::Comment and ::DBLink structures, as well as Annotation
1115 description fields are stored in -desc for later
1116 generation of <Attribute>s. <Reference> objects are generated
1117 from ::References, and are stored in -refs - these will
1118 be appended to the XML tree later.
1120 Args : Argument array. Recognized keys:
1122 -xml The DOM::Document being modified
1124 -obj Reference to the Bio object being analyzed
1126 -descr An array reference for holding description text items
1128 -refs An array reference to hold <Reference> DOM objects
1130 -id Optional. If the XML id for the 'calling' element is
1131 provided, it will be placed in any <Reference> refs
1136 sub _parse_annotation
{
1138 my $args = $self->_parseparams( @_);
1139 my ($xml, $obj, $descRef, $refRef) =
1140 ( $args->{XML
}, $args->{OBJ
}, $args->{DESC
}, $args->{REFS
} );
1141 # No good place to put any of this (except for references). Most stuff
1142 # just gets dumped to <Attribute>s
1143 my $ann = $obj->annotation;
1144 return unless ($ann);
1145 # use BMS::Branch; my $debug = BMS::Branch->new( ); warn "$obj :"; $debug->branch($ann);
1146 unless (ref($ann) =~ /Collection/) {
1147 # Old style annotation. It seems that Features still use this
1149 $self->_parse_annotation_old(@_);
1153 for my $key ($ann->get_all_annotation_keys()) {
1154 for my $thing ($ann->get_Annotations($key)) {
1155 if ($key eq 'description') {
1156 push @
{$descRef}, ["description" , $thing->value];
1157 } elsif ($key eq 'comment') {
1158 push @
{$descRef}, ["comment" , $thing->text];
1159 } elsif ($key eq 'dblink') {
1160 # DBLinks get dumped to attributes, too
1161 push @
{$descRef}, ["db_xref" , $thing->database . ":"
1162 . $thing->primary_id ];
1163 if (my $com = $thing->comment) {
1164 push @
{$descRef}, ["link" , $com->text ];
1167 } elsif ($key eq 'reference') {
1168 $self->_parse_reference( @_, -refobj
=> $thing );
1169 } elsif (ref($thing) =~ /SimpleValue/) {
1170 push @
{$descRef}, [$key , $thing->value];
1173 push @
{$descRef}, ["error", "bsml.pm did not understand ".
1174 "'$key' = '$thing'" ];
1180 =head2 _parse_annotation_old
1182 Title : _parse_annotation_old
1183 Usage : $obj->_parse_annotation_old(@args)
1184 Function: As above, but for the old Annotation system.
1185 Apparently needed because Features are still using the old-style
1188 Args : Argument array. Recognized keys:
1190 -xml The DOM::Document being modified
1192 -obj Reference to the Bio object being analyzed
1194 -descr An array reference for holding description text items
1196 -refs An array reference to hold <Reference> DOM objects
1198 -id Optional. If the XML id for the 'calling' element is
1199 provided, it will be placed in any <Reference> refs
1208 sub _parse_annotation_old
{
1210 my $args = $self->_parseparams( @_);
1211 my ($xml, $obj, $descRef, $refRef) =
1212 ( $args->{XML
}, $args->{OBJ
}, $args->{DESC
}, $args->{REFS
} );
1213 # No good place to put any of this (except for references). Most stuff
1214 # just gets dumped to <Attribute>s
1215 if (my $ann = $obj->annotation) {
1216 push @
{$descRef}, ["annotation", $ann->description];
1217 for my $com ($ann->each_Comment) {
1218 push @
{$descRef}, ["comment" , $com->text];
1221 # Gene names just get dumped to <Attribute name="gene">
1222 for my $gene ($ann->each_gene_name) {
1223 push @
{$descRef}, ["gene" , $gene];
1226 # DBLinks get dumped to attributes, too
1227 for my $link ($ann->each_DBLink) {
1228 push @
{$descRef}, ["db_xref" ,
1229 $link->database . ":" . $link->primary_id ];
1230 if (my $com = $link->comment) {
1231 push @
{$descRef}, ["link" , $com->text ];
1235 # References get produced and temporarily held
1236 for my $ref ($ann->each_Reference) {
1237 $self->_parse_reference( @_, -refobj
=> $ref );
1245 Usage : $obj->_add_page($xmlDocument, $xmlSequenceObject)
1246 Function: Adds a simple <Page> and <View> structure for a <Sequence>
1247 Returns : a reference to the newly created <Page>
1248 Args : 0 The DOM::Document being modified
1249 1 Reference to the <Sequence> object
1255 my ($xml, $seq) = @_;
1256 my $disp = $xml->getElementsByTagName("Display")->item(0);
1257 my $page = $self->_addel($disp, "Page");
1258 my ($width, $height) = ( 7.8, 5.5);
1259 my $screen = $self->_addel($page, "Screen", {
1260 width
=> $width, height
=> $height, });
1261 # $screen->appendChild($xml->createComment("Must close explicitly"));
1262 my $view = $self->_addel($page, "View", {
1263 seqref
=> $seq->getAttribute('id'),
1264 title
=> $seq->getAttribute('title'),
1266 title2
=> "{LENGTH} {UNIT}",
1268 $self->_addel($view, "View-line-widget", {
1269 shape
=> 'horizontal',
1270 hcenter
=> $width/2 + 0.7,
1271 'linear-length' => $width - 2,
1273 $self->_addel($view, "View-axis-widget");
1281 Usage : $obj->_addel($parentElem, 'ChildName',
1282 { anAttr => 'someValue', anotherAttr => 'aValue',})
1283 Function: Add an element with attribute values to a DOM tree
1284 Returns : a reference to the newly added element
1285 Args : 0 The DOM::Element parent that you want to add to
1286 1 The name of the new child element
1287 2 Optional hash reference containing
1288 attribute name => attribute value assignments
1294 my ($root, $name, $attr) = @_;
1296 # Find the DOM::Document for the parent
1297 my $doc = $root->getOwnerDocument || $root;
1298 my $elem = $doc->createElement($name);
1299 for my $a (keys %{$attr}) {
1300 $elem->setAttribute($a, $attr->{$a});
1302 $root->appendChild($elem);
1309 Usage : $obj->_show_dna($newval)
1310 Function: (cut-and-pasted directly from embl.pm)
1311 Returns : value of _show_dna
1312 Args : newvalue (optional)
1320 $obj->{'_show_dna'} = $value;
1322 return $obj->{'_show_dna'};
1328 Usage : $dom = $obj->_initialize(@args)
1329 Function: Coppied from embl.pm, and augmented with initialization of the
1332 Args : -file => the XML file to be parsed
1337 my($self,@args) = @_;
1339 $self->SUPER::_initialize
(@args);
1340 # hash for functions for decoding keys.
1341 $self->{'_func_ftunit_hash'} = {};
1342 $self->_show_dna(1); # sets this to one by default. People can change it
1344 my %param = @args; # From SeqIO.pm
1345 @param{ map { lc $_ } keys %param } = values %param; # lowercase keys
1346 if ( exists $param{-file
} && $param{-file
} !~ /^>/) {
1347 # Is it blasphemy to add your own keys to an object in another package?
1348 # domtree => the parsed DOM tree retruned by XML::DOM
1349 $self->{'domtree'} = $self->_parse_xml( $param{-file
} );
1350 # current_node => the <Sequence> node next in line for next_seq
1351 $self->{'current_node'} = 0;
1354 $self->sequence_factory( Bio
::Seq
::SeqFactory
->new
1355 ( -verbose
=> $self->verbose(),
1356 -type
=> 'Bio::Seq::RichSeq'))
1357 if( ! defined $self->sequence_factory );
1363 Title : _parseparams
1364 Usage : my $paramHash = $obj->_parseparams(@args)
1365 Function: Borrowed from Bio::Parse.pm, who borrowed it from CGI.pm
1366 Lincoln Stein -> Richard Resnick -> here
1367 Returns : A hash reference of the parameter keys (uppercase) pointing to
1369 Args : An array of key, value pairs. Easiest to pass values as:
1370 -key1 => value1, -key2 => value2, etc
1371 Leading "-" are removed.
1380 # Hacked out from Parse.pm
1381 # The next few lines strip out the '-' characters which
1382 # preceed the keys, and capitalizes them.
1383 for (my $i=0;$i<@param;$i+=2) {
1384 $param[$i]=~s/^\-//;
1385 $param[$i]=~tr/a-z/A-Z/;
1387 pop @param if @param %2; # not an even multiple
1395 Usage : $dom = $obj->_parse_xml($filename)
1396 Function: uses XML::DOM to construct a DOM tree from the BSML document
1397 Returns : a reference to the parsed DOM tree
1398 Args : 0 Path to the XML file needing to be parsed
1407 $self->throw("Could not parse non-existant XML file '$file'.");
1410 my $parser = new XML
::DOM
::Parser
;
1411 my $doc = $parser->parsefile ($file);
1417 # Reports off the net imply that DOM::Parser will memory leak if you
1418 # do not explicitly dispose of it:
1419 # http://aspn.activestate.com/ASPN/Mail/Message/perl-xml/788458
1420 my $dom = $self->{'domtree'};
1421 # For some reason the domtree can get undef-ed somewhere...
1422 $dom->dispose if ($dom);
1426 =head1 TESTING SCRIPT
1428 The following script may be used to test the conversion process. You
1429 will need a file of the format you wish to test. The script will
1430 convert the file to BSML, store it in /tmp/bsmltemp, read that file
1431 into a new SeqIO stream, and write it back as the original
1432 format. Comparison of this second file to the original input file
1433 will allow you to track where data may be lost or corrupted. Note
1434 that you will need to specify $readfile and $readformat.
1437 # Tests preservation of details during round-trip conversion:
1438 # $readformat -> BSML -> $readformat
1439 my $tempspot = "/tmp/bsmltemp"; # temp folder to hold generated files
1440 my $readfile = "rps4y.embl"; # The name of the file you want to test
1441 my $readformat = "embl"; # The format of the file being tested
1443 system "mkdir $tempspot" unless (-d $tempspot);
1444 # Make Seq object from the $readfile
1445 my $biostream = Bio::SeqIO->new( -file => "$readfile" );
1446 my $seq = $biostream->next_seq();
1448 # Write BSML from SeqObject
1449 my $bsmlout = Bio::SeqIO->new( -format => 'bsml',
1450 -file => ">$tempspot/out.bsml");
1451 warn "\nBSML written to $tempspot/out.bsml\n";
1452 $bsmlout->write_seq($seq);
1453 # Need to kill object for following code to work... Why is this so?
1456 # Make Seq object from BSML
1457 my $bsmlin = Bio::SeqIO->new( -file => "$tempspot/out.bsml",
1459 my $seq2 = $bsmlin->next_seq();
1461 # Write format back from Seq Object
1462 my $genout = Bio::SeqIO->new( -format => $readformat,
1463 -file => ">$tempspot/out.$readformat");
1464 $genout->write_seq($seq2);
1465 warn "$readformat written to $tempspot/out.$readformat\n";
1468 # Join information (not possible in BSML 2.2)
1469 # Sequence type (??)