2 # bioperl module for Bio::SeqFeature::Tools::Unflattener
4 # Please direct questions and support issues to <bioperl-l@bioperl.org>
6 # Cared for by Chris Mungall <cjm@fruitfly.org>
8 # Copyright Chris Mungall
10 # You may distribute this module under the same terms as perl itself
12 # POD documentation - main docs before the code
16 Bio::SeqFeature::Tools::Unflattener - turns flat list of genbank-sourced features into a nested SeqFeatureI hierarchy
20 # standard / generic use - unflatten a genbank record
22 use Bio::SeqFeature::Tools::Unflattener;
24 # generate an Unflattener object
25 $unflattener = Bio::SeqFeature::Tools::Unflattener->new;
27 # first fetch a genbank SeqI object
29 Bio::SeqIO->new(-file=>'AE003644.gbk',
32 Bio::SeqIO->new(-format=>'asciitree');
33 while ($seq = $seqio->next_seq()) {
35 # get top level unflattended SeqFeatureI objects
36 $unflattener->unflatten_seq(-seq=>$seq,
38 $out->write_seq($seq);
40 @top_sfs = $seq->get_SeqFeatures;
41 foreach my $sf (@top_sfs) {
42 # do something with top-level features (eg genes)
49 Most GenBank entries for annotated genomic DNA contain a B<flat> list
50 of features. These features can be parsed into an equivalent flat list
51 of L<Bio::SeqFeatureI> objects using the standard L<Bio::SeqIO>
52 classes. However, it is often desirable to B<unflatten> this list into
53 something resembling actual B<gene models>, in which genes, mRNAs and CDSs
54 are B<nested> according to the nature of the gene model.
56 The BioPerl object model allows us to store these kind of associations
57 between SeqFeatures in B<containment hierarchies> -- any SeqFeatureI
58 object can contain nested SeqFeatureI objects. The
59 Bio::SeqFeature::Tools::Unflattener object facilitates construction of
60 these hierarchies from the underlying GenBank flat-feature-list
63 For example, if you were to look at a typical GenBank DNA entry, say,
64 B<AE003644>, you would see a flat list of features:
83 These features have sequence locations, but it is not immediately
84 clear how to write code such that each mRNA is linked to the
85 appropriate CDS (other than relying on IDs which is very bad)
87 We would like to convert the above list into the B<containment
88 hierarchy>, shown below:
113 Where each feature is nested underneath its container. Note that exons
114 have been automatically inferred (even for tRNA genes).
116 We do this using a call on a L<Bio::SeqFeature::Tools::Unflattener>
119 @sfs = $unflattener->unflatten_seq(-seq=>$seq);
121 This would return a list of the B<top level> (i.e. container)
122 SeqFeatureI objects - in this case, genes. Other top level features
123 are possible; for instance, the B<source> feature which is always
124 present, and other features such as B<variation> or B<misc_feature>
127 The containment hierarchy can be accessed using the get_SeqFeature()
128 call on any feature object - see L<Bio::SeqFeature::FeatureHolderI>.
129 The following code will traverse the containment hierarchy for a
133 $sf = shift; # $sf isa Bio::SeqfeatureI
135 # ...do something with $sf!
137 # depth first traversal of containment tree
138 @contained_sfs = $sf->get_SeqFeatures;
139 traverse($_) foreach @contained_sfs;
142 Once you have built the hierarchy, you can do neat stuff like turn the
143 features into 'rich' feature objects (eg
144 L<Bio::SeqFeature::Gene::GeneStructure>) or convert to a suitable
145 format such as GFF3 or chadoxml (after mapping to the Sequence
146 Ontology); this step is not described here.
150 Due to the quixotic nature of how features are stored in
151 GenBank/EMBL/DDBJ, there is no guarantee that the default behaviour of
152 this module will produce perfect results. Sometimes it is hard or
153 impossible to build a correct containment hierarchy if the information
154 provided is simply too lossy, as is often the case. If you care deeply
155 about your data, you should always manually inspect the resulting
156 containment hierarchy; you may have to customise the algorithm for
157 building the hierarchy, or even manually tweak the resulting
158 hierarchy. This is explained in more detail further on in the document.
160 However, if you are satisfied with the default behaviour, then you do
161 not need to read any further. Just make sure you set the parameter
162 B<use_magic> - this will invoke incantations which will magically
163 produce good results no matter what the idiosyncracies of the
164 particular GenBank record in question.
168 $unflattener->unflatten_seq(-seq=>$seq,
171 The success of this depends on the phase of the moon at the time the
172 entry was submitted to GenBank. Note that the magical recipe is being
173 constantly improved, so the results of invoking magic may vary
174 depending on the bioperl release.
176 If you are skeptical of magic, or you wish to exact fine grained
177 control over how the entry is unflattened, or you simply wish to
178 understand more about how this crazy stuff works, then read on!
180 =head1 PROBLEMATIC DATA AND INCONSISTENCIES
182 Occasionally the Unflattener will have problems with certain
183 records. For example, the record may contain inconsistent data - maybe
184 there is an B<exon> entry that has no corresponding B<mRNA> location.
186 The default behaviour is to throw an exception reporting the problem,
187 if the problem is relatively serious - for example, inconsistent data.
189 You can exert more fine grained control over this - perhaps you want
190 the Unflattener to do the best it can, and report any problems. This
191 can be done - refer to the methods.
203 This is the default algorithm; you should be able to override any part
206 The core of the algorithm is in two parts
210 =item Partitioning the flat feature list into groups
212 =item Resolving the feature containment hierarchy for each group
216 There are other optional steps after the completion of these two
217 steps, such as B<inferring exons>; we now describe in more detail what
220 =head2 Partitioning into groups
222 First of all the flat feature list is partitioned into B<group>s.
224 The default way of doing this is to use the B<gene> attribute; if we
225 look at two features from GenBank accession AE003644.3:
230 /note="last curated on Thu Dec 13 16:51:32 PST 2001"
232 /db_xref="FLYBASE:FBgn0005771"
233 mRNA join(20111..20584,20887..23268)
237 /db_xref="FLYBASE:FBgn0005771"
239 Both these features share the same /gene tag which is "noc", so they
240 correspond to the same gene model (the CDS feature is not shown, but
241 this also has a tag-value /gene="noc").
243 Not all groups need to correspond to gene models, but this is the most
244 common use case; later on we shall describe how to customise the
247 Sometimes other tags have to be used; for instance, if you look at the
248 entire record for AE003644.3 you will see you actually need the use the
249 /locus_tag attribute. This attribute is actually B<not present> in
252 You can override this:
254 $collection->unflatten_seq(-seq=>$seq, -group_tag=>'locus_tag');
256 Alternatively, if you B<-use_magic>, the object will try and make a
257 guess as to what the correct group_tag should be.
259 At the end of this step, we should have a list of groups - there is no
260 structure within a group; the group just serves to partition the flat
261 features. For the example data above, we would have the following groups.
267 [ gene mRNA mRNA mRNA CDS CDS CDS ]
269 =head3 Multicopy Genes
271 Multicopy genes are usually rRNAs or tRNAs that are duplicated across
272 the genome. Because they are functionally equivalent, and usually have
273 the same sequence, they usually have the same group_tag (ie gene
274 symbol); they often have a /note tag giving copy number. This means
275 they will end up in the same group. This is undesirable, because they
276 are spatially disconnected.
278 There is another step, which involves splitting spatially disconnected
279 groups into distinct groups
283 [gene-rrn3 rRNA-rrn3 gene-rrn3 rRNA-rrn3]
287 [gene-rrn3 rRNA-rrn3] [gene-rrn3 rRNA-rrn3]
289 based on the coordinates
293 The next step is to add some structure to each group, by making
294 B<containment hierarchies>, trees that represent how the features
297 =head2 Resolving the containment hierarchy
299 After the grouping is done, we end up with a list of groups which
300 probably contain features of type 'gene', 'mRNA', 'CDS' and so on.
302 Singleton groups (eg the 'source' feature) are ignored at this stage.
304 Each group is itself flat; we need to add an extra level of
305 organisation. Usually this is because different spliceforms
306 (represented by the 'mRNA' feature) can give rise to different
307 protein products (indicated by the 'CDS' feature). We want to correctly
308 associate mRNAs to CDSs.
310 We want to go from a group like this:
312 [ gene mRNA mRNA mRNA CDS CDS CDS ]
314 to a containment hierarchy like this:
324 In which each CDS is nested underneath the correct corresponding mRNA.
326 For entries that contain no alternate splicing, this is simple; we
331 Must resolve to the tree
337 How can we do this in entries with alternate splicing? The bad
338 news is that there is no guaranteed way of doing this correctly for
339 any GenBank entry. Occasionally the submission will have been done in
340 such a way as to reconstruct the containment hierarchy. However, this
341 is not consistent across databank entries, so no generic solution can
342 be provided by this object. This module does provide the framework
343 within which you can customise a solution for the particular dataset
344 you are interested in - see later.
346 The good news is that there is an inference we can do that should
347 produce pretty good results the vast majority of the time. It uses
348 splice coordinate data - this is the default behaviour of this module,
349 and is described in detail below.
351 =head2 Using splice site coordinates to infer containment
353 If an mRNA is to be the container for a CDS, then the splice site
354 coordinates (or intron coordinates, depending on how you look at it)
355 of the CDS must fit inside the splice site coordinates of the mRNA.
357 Ambiguities can still arise, but the results produced should still be
358 reasonable and consistent at the sequence level. Look at this fake
361 mRNA XXX---XX--XXXXXX--XXXX join(1..3,7..8,11..16,19..23)
362 mRNA XXX-------XXXXXX--XXXX join(1..3,11..16,19..23)
363 CDS XXXX--XX join(13..16,19..20)
364 CDS XXXX--XX join(13..16,19..20)
366 [obviously the positions have been scaled down]
368 We cannot unambiguously match mRNA with CDS based on splice sites,
369 since both CDS share the splice site locations 16^17 and
370 18^19. However, the consequences of making a wrong match are probably
371 not very severe. Any annotation data attached to the first CDS is
372 probably identical to the seconds CDS, other than identifiers.
374 The default behaviour of this module is to make an arbitrary call
375 where it is ambiguous (the mapping will always be bijective; i.e. one
376 mRNA -E<gt> one CDS).
378 [TODO: NOTE: not tested on EMBL data, which may not be bijective; ie two
379 mRNAs can share the same CDS??]
381 This completes the building of the containment hierarchy; other
384 =head1 POST-GROUPING STEPS
386 =head2 Inferring exons from mRNAs
388 This step always occurs if B<-use_magic> is invoked.
390 In a typical GenBank entry, the exons are B<implicit>. That is they
391 can be inferred from the mRNA location.
395 mRNA join(20111..20584,20887..23268)
397 This tells us that this particular transcript has two exons. In
398 bioperl, the mRNA feature will have a 'split location'.
402 $unflattener->feature_from_splitloc(-seq=>$seq);
404 This will generate the necessary exon features, and nest them under
405 the appropriate mRNAs. Note that the mRNAs will no longer have split
406 locations - they will have simple locations spanning the extent of the
407 exons. This is intentional, to avoid redundancy.
409 Occasionally a GenBank entry will have both implicit exons (from the
410 mRNA location) B<and> explicit exon features.
412 In this case, exons will still be transferred. Tag-value data from the
413 explicit exon will be transferred to the implicit exon. If exons are
414 shared between mRNAs these will be represented by different
415 objects. Any inconsistencies between implicit and explicit will be
418 =head3 tRNAs and other noncoding RNAs
420 exons will also be generated from these features
422 =head2 Inferring mRNAs from CDS
424 Some GenBank entries represent gene models using features of type
425 gene, mRNA and CDS; some entries just use gene and CDS.
427 If we only have gene and CDS, then the containment hierarchies will
433 If we want the containment hierarchies to be uniform, like this
439 Then we must create an mRNA feature. This will have identical
440 coordinates to the CDS. The assumption is that there is either no
441 untranslated region, or it is unknown.
443 To do this, we can call
445 $unflattener->infer_mRNA_from_CDS(-seq=>$seq);
447 This is taken care of automatically, if B<-use_magic> is invoked.
451 =head2 Customising the grouping of features
453 The default behaviour is suited mostly to building models of protein
454 coding genes and noncoding genes from genbank genomic DNA submissions.
456 You can change the tag used to partition the feature by passing in a
457 different group_tag argument - see the unflatten_seq() method
459 Other behaviour may be desirable. For example, even though SNPs
460 (features of type 'variation' in GenBank) are not actually part of the
461 gene model, it may be desirable to group SNPs that overlap or are
464 It should certainly be possible to extend this module to do
465 this. However, I have yet to code this part!!! If anyone would find
466 this useful let me know.
468 In the meantime, you could write your own grouping subroutine, and
469 feed the results into unflatten_groups() [see the method documentation
472 =head2 Customising the resolution of the containment hierarchy
474 Once the flat list of features has been partitioned into groups, the
475 method unflatten_group() is called on each group to build a tree.
477 The algorithm for doing this is described above; ambiguities are
478 resolved by using splice coordinates. As discussed, this can be
481 Some submissions may contain information in tags/attributes that hint
482 as to the mapping that needs to be made between the features.
484 For example, with the Drosophila Melanogaster release 3 submission, we
485 see that CDS features in alternately spliced mRNAs have a form like
488 CDS join(145588..145686,145752..146156,146227..146493)
490 /note="CG32954 gene product from transcript CG32954-RA"
491 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
493 /product="CG32954-PA"
494 /protein_id="AAF53403.1"
495 /db_xref="GI:7298167"
496 /db_xref="FLYBASE:FBgn0052954"
497 /translation="MSFTLTNKNVIFVAGLGGIGLDTSKELLKRDLKNLVILDRIENP..."
499 Here the /note tag provides the clue we need to link CDS to mRNA
500 (highlighted with ^^^^). We just need to find the mRNA with the tag
502 /product="CG32954-RA"
504 I have no idea how consistent this practice is across submissions; it
505 is consistent for the fruitfly genome submission.
507 We can customise the behaviour of unflatten_group() by providing our
508 own resolver method. This obviously requires a bit of extra
509 programming, but there is no way to get around this.
511 Here is an example of how to pass in your own resolver; this example
512 basically checks the parent (container) /product tag to see if it
513 matches the required string in the child (contained) /note tag.
515 $unflattener->unflatten_seq(-seq=>$seq,
516 -group_tag=>'locus_tag',
517 -resolver_method=>sub {
519 my ($sf, @candidate_container_sfs) = @_;
520 if ($sf->has_tag('note')) {
521 my @notes = $sf->get_tag_values('note');
522 my @trnames = map {/from transcript\s+(.*)/;
524 @trnames = grep {$_} @trnames;
527 $self->throw("UNRESOLVABLE");
529 elsif (@trnames == 1) {
530 $trname = $trnames[0];
533 $self->throw("AMBIGUOUS: @trnames");
538 $_->has_tag('product') ?
539 $_->get_tag_values('product') :
542 } @candidate_container_sfs;
543 if (@container_sfs == 0) {
544 $self->throw("UNRESOLVABLE");
546 elsif (@container_sfs == 1) {
548 return $container_sfs[0];
551 $self->throw("AMBIGUOUS");
556 the resolver method is only called when there is more than one spliceform.
558 =head2 Parsing mRNA records
560 Some of the entries in sequence databanks are for mRNA sequences as
561 well as genomic DNA. We may want to build models from these too.
563 NOT YET DONE - IN PROGRESS!!!
565 Open question - what would these look like?
567 Ideally we would like a way of combining a mRNA record with the
568 corresponding SeFeature entry from the appropriate genomic DNA
569 record. This could be problemmatic in some cases - for example, the
570 mRNA sequences may not match 100% (due to differences in strain,
571 assembly problems, sequencing problems, etc). What then...?
575 Feature table description
577 http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html
583 User feedback is an integral part of the evolution of this and other
584 Bioperl modules. Send your comments and suggestions preferably to the
585 Bioperl mailing lists Your participation is much appreciated.
587 bioperl-l@bioperl.org - General discussion
588 http://bioperl.org/wiki/Mailing_lists - About the mailing lists
592 Please direct usage questions or support issues to the mailing list:
594 I<bioperl-l@bioperl.org>
596 rather than to the module maintainer directly. Many experienced and
597 reponsive experts will be able look at the problem and quickly
598 address it. Please include a thorough description of the problem
599 with code and data examples if at all possible.
601 =head2 Reporting Bugs
603 report bugs to the Bioperl bug tracking system to help us keep track
604 the bugs and their resolution. Bug reports can be submitted via the
607 https://github.com/bioperl/bioperl-live/issues
609 =head1 AUTHOR - Chris Mungall
611 Email: cjm@fruitfly.org
615 The rest of the documentation details each of the object
616 methods. Internal methods are usually preceded with a _
621 # Let the code begin...
623 package Bio
::SeqFeature
::Tools
::Unflattener
;
627 # Object preamble - inherits from Bio::Root::Root
628 use Bio
::Location
::Simple
;
629 use Bio
::SeqFeature
::Generic
;
633 use base
qw(Bio::Root::Root);
638 Usage : $unflattener = Bio::SeqFeature::Tools::Unflattener->new();
639 $unflattener->unflatten_seq(-seq=>$seq);
640 Function: constructor
642 Returns : a new Bio::SeqFeature::Tools::Unflattener
647 -seq : A L<Bio::SeqI> object (optional)
648 the sequence to unflatten; this can also be passed in
649 when we call unflatten_seq()
651 -group_tag : a string representing the /tag used to partition flat features
652 (see discussion above)
658 my($class,@args) = @_;
659 my $self = $class->SUPER::new
(@args);
661 my($seq, $group_tag, $trust_grouptag) =
662 $self->_rearrange([qw(SEQ
668 $seq && $self->seq($seq);
669 $group_tag && $self->group_tag($group_tag);
670 # $self->{'trust_grouptag'}= $trust_grouptag if($trust_grouptag); #dgg suggestion
671 return $self; # success - we hope!
676 return if $self->{_reported_problems
};
677 return if $self->{_ignore_problems
};
678 my @probs = $self->get_problems;
679 if (!$self->{_problems_reported
} &&
682 "WARNING: There are UNREPORTED PROBLEMS.\n".
683 "You may wish to use the method report_problems(), \n",
684 "or ignore_problems() on the Unflattener object\n");
692 Usage : $unflattener->seq($newval)
695 Returns : value of seq (a Bio::SeqI)
696 Args : on set, new value (a Bio::SeqI, optional)
698 The Bio::SeqI object should hold a flat list of Bio::SeqFeatureI
699 objects; this is the list that will be unflattened.
701 The sequence object can also be set when we call unflatten_seq()
708 return $self->{'seq'} = shift if @_;
709 return $self->{'seq'};
715 Usage : $unflattener->group_tag($newval)
718 Returns : value of group_tag (a scalar)
719 Args : on set, new value (a scalar or undef, optional)
721 This is the tag that will be used to collect elements from the flat
722 feature list into groups; for instance, if we look at two typical
728 /note="last curated on Thu Dec 13 16:51:32 PST 2001"
730 /db_xref="FLYBASE:FBgn0005771"
731 mRNA join(20111..20584,20887..23268)
735 /db_xref="FLYBASE:FBgn0005771"
737 We can see that these comprise the same gene model because they share
738 the same /gene attribute; we want to collect these together in groups.
740 Setting group_tag is optional. The default is to use 'gene'. In the
741 example above, we could also use /locus_tag
748 return $self->{'group_tag'} = shift if @_;
749 return $self->{'group_tag'};
755 Usage : $unflattener->partonomy({mRNA=>'gene', CDS=>'mRNA')
758 Returns : value of partonomy (a scalar)
759 Args : on set, new value (a scalar or undef, optional)
761 A hash representing the containment structure that the seq_feature
762 nesting should conform to; each key represents the contained (child)
763 type; each value represents the container (parent) type.
770 return $self->{'partonomy'} = shift if @_;
771 if (!$self->{'partonomy'}) {
772 $self->{'partonomy'} = $self->_default_partonomy;
774 return $self->{'partonomy'};
777 sub _default_partonomy
{
790 pseudoexon
=> 'pseudogene',
791 pseudointron
=> 'pseudogene',
792 pseudotranscript
=> 'pseudogene',
796 =head2 structure_type
798 Title : structure_type
799 Usage : $unflattener->structure_type($newval)
802 Returns : value of structure_type (a scalar)
803 Args : on set, new value (an int or undef, optional)
805 GenBank entries conform to different flavours, or B<structure
806 types>. Some have mRNAs, some do not.
808 Right now there are only two base structure types defined. If you set
809 the structure type, then appropriate unflattening action will be
810 taken. The presence or absence of explicit exons does not affect the
813 If you invoke B<-use_magic> then this will be set automatically, based
814 on the content of the record.
818 =item Type 0 (DEFAULT)
827 with this structure type, we want the seq_features to be nested like this
834 exons and introns are implicit from the mRNA 'join' location
836 to get exons from the mRNAs, you will need this call (see below)
838 $unflattener->feature_from_splitloc(-seq=>$seq);
850 there are no mRNA features
852 with this structure type, we want the seq_features to be nested like this
859 exon and intron may or may not be present; they may be implicit from
860 the CDS 'join' location
869 return $self->{'structure_type'} = shift if @_;
870 return $self->{'structure_type'};
876 Usage : @probs = get_problems()
877 Function: Get the list of problem(s) for this object.
879 Returns : An array of [severity, description] pairs
882 In the course of unflattening a record, problems may occur. Some of
883 these problems are non-fatal, and can be ignored.
885 Problems are represented as arrayrefs containing a pair [severity,
888 severity is a number, the higher, the more severe the problem
890 the description is a text string
897 return @
{$self->{'_problems'}} if exists($self->{'_problems'});
901 =head2 clear_problems
903 Title : clear_problems
905 Function: resets the problem list to empty
914 my ($self,@args) = @_;
915 $self->{'_problems'} = [];
925 $self->{'_problems'} = [] unless exists($self->{'_problems'});
926 if ($self->verbose > 0) {
927 warn( "PROBLEM: $_\n") foreach @_;
929 push(@
{$self->{'_problems'}}, @_);
936 my ($severity, $desc, @sfs) = @_;
938 foreach my $sf (@sfs) {
940 sprintf("\nSF [$sf]: ". $sf->location->to_FTstring . "; %s\n",
945 $sf->get_tag_values($_) : ()
946 } qw(locus_tag gene product label)));
949 my $thresh = $self->error_threshold;
950 if ($severity > $thresh) {
951 $self->{_problems_reported
} = 1;
952 $self->throw("PROBLEM, SEVERITY==$severity\n$desc");
954 $self->add_problem([$severity, $desc]);
958 =head2 report_problems
960 Title : report_problems
961 Usage : $unflattener->report_problems(\*STDERR);
965 Args : FileHandle (defaults to STDERR)
971 my ($self, $fh) = @_;
976 foreach my $problem ($self->get_problems) {
977 my ($sev, $desc) = @
$problem;
978 printf $fh "PROBLEM, SEVERITY==$sev\n$desc\n";
980 $self->{_problems_reported
} = 1;
984 =head2 ignore_problems
986 Title : ignore_problems
987 Usage : $obj->ignore_problems();
993 Unflattener is very particular about problems it finds along the
994 way. If you have set the error_threshold such that less severe
995 problems do not cause exceptions, Unflattener still expects you to
996 report_problems() at the end, so that the user of the module is aware
997 of any inconsistencies or problems with the data. In fact, a warning
998 will be produced if there are unreported problems. To silence, this
999 warning, call the ignore_problems() method before the Unflattener
1000 object is destroyed.
1004 sub ignore_problems
{
1006 $self->{_ignore_problems
} = 1;
1011 =head2 error_threshold
1013 Title : error_threshold
1014 Usage : $obj->error_threshold($severity)
1017 Returns : value of error_threshold (a scalar)
1018 Args : on set, new value (an integer)
1020 Sets the threshold above which errors cause this module to throw an
1021 exception. The default is 0; all problems with a severity E<gt> 0 will
1024 If you raise the threshold to 1, then the unflattening process will be
1025 more lax; problems of severity==1 are generally non-fatal, but may
1026 indicate that the results should be inspected, for example, to make
1027 sure there is no data loss.
1031 sub error_threshold
{
1034 return $self->{'error_threshold'} = shift if @_;
1035 return $self->{'error_threshold'} || 0;
1042 # given a type (eg mRNA), will return the container type (eg gene)
1043 sub get_container_type
{
1044 my ($self,$type) = @_;
1045 my @roots = $self->_get_partonomy_roots;
1046 if (grep {$_ eq $type} @roots) {
1047 # it is a root - no parents/containers
1050 my $ch = $self->partonomy;
1051 my $ctype = $ch->{$type};
1053 # asterix acts as a wild card
1054 $ctype = $ch->{'*'};
1059 # get root node of partonomy hierarchy (usually gene)
1060 sub _get_partonomy_roots
{
1062 my $ch = $self->partonomy;
1063 my @parents = values %$ch;
1064 # find parents that do not have parents themselves
1065 return grep {!$ch->{$_}} @parents;
1070 =head2 unflatten_seq
1072 Title : unflatten_seq
1073 Usage : @sfs = $unflattener->unflatten_seq($seq);
1074 Function: turns a flat list of features into a list of holder features
1076 Returns : list of Bio::SeqFeatureI objects
1079 partitions a list of features then arranges them in a nested tree; see
1080 above for full explanation.
1082 note - the Bio::SeqI object passed in will be modified
1086 -seq : a Bio::SeqI object; must contain Bio::SeqFeatureI objects
1087 (this is optional if seq has already been set)
1089 -use_magic: if TRUE (ie non-zero) then magic will be invoked;
1090 see discussion above.
1092 -resolver_method: a CODE reference
1093 see the documentation above for an example of
1094 a subroutine that can be used to resolve hierarchies
1097 this is optional - if nothing is supplied, a default
1098 subroutine will be used (see below)
1100 -group_tag: a string
1101 [ see the group_tag() method ]
1102 this overrides the default group_tag which is 'gene'
1109 my ($self,@args) = @_;
1111 my($seq, $resolver_method, $group_tag, $partonomy,
1112 $structure_type, $resolver_tag, $use_magic, $noinfer) =
1113 $self->_rearrange([qw(SEQ
1124 # seq we want to unflatten
1125 $seq = $seq || $self->seq;
1131 # prevent bad argument combinations
1133 defined($structure_type)) {
1134 $self->throw("You cannot set both -partonomy and -structure_type\n".
1135 "(the former is implied by the latter)");
1138 # remember the current value of partonomy, to reset later
1139 my $old_partonomy = $self->partonomy;
1140 $self->partonomy($partonomy) if defined $partonomy;
1142 # remember old structure_type
1143 my $old_structure_type = $self->structure_type;
1144 $self->structure_type($structure_type) if defined $structure_type;
1146 # if we are sourcing our data from genbank, all the
1147 # features should be flat (eq no sub_SeqFeatures)
1148 my @flat_seq_features = $seq->get_SeqFeatures;
1149 my @all_seq_features = $seq->get_all_SeqFeatures;
1152 if (@all_seq_features > @flat_seq_features) {
1153 $self->throw("It looks as if this sequence has already been unflattened");
1155 if (@all_seq_features < @flat_seq_features) {
1156 $self->throw("ASSERTION ERROR: something is seriously wrong with your features");
1159 # tag for ungrouping; usually /gene or /locus_tag
1160 # for example: /gene="foo"
1161 $group_tag = $group_tag || $self->group_tag;
1163 # use magic to guess the group tag
1164 my @sfs_with_locus_tag =
1165 grep {$_->has_tag("locus_tag")} @flat_seq_features;
1166 my @sfs_with_gene_tag =
1167 grep {$_->has_tag("gene")} @flat_seq_features;
1168 my @sfs_with_product_tag =
1169 grep {$_->has_tag("product")} @flat_seq_features;
1171 # if ($group_tag && $self->{'trust_grouptag'}) { # dgg suggestion
1175 if (@sfs_with_locus_tag) {
1176 # dgg note: would like to -use_magic with -group_tag = 'gene' for ensembl genomes
1177 # where ensembl gene FT have both /locus_tag and /gene, but mRNA, CDS have /gene only
1178 if ($group_tag && $group_tag ne 'locus_tag') {
1179 $self->throw("You have explicitly set group_tag to be '$group_tag'\n".
1180 "However, I detect that some features use /locus_tag\n".
1181 "I believe that this is the correct group_tag to use\n".
1182 "You can resolve this by either NOT setting -group_tag\n".
1183 "OR you can unset -use_magic to regain control");
1186 # use /locus_tag instead of /gene tag for grouping
1187 # see GenBank entry AE003677 (version 3) for an example
1188 $group_tag = 'locus_tag';
1189 if ($self->verbose > 0) {
1190 warn "Set group tag to: $group_tag\n";
1194 # on rare occasions, records will have no /gene or /locus_tag
1195 # but it WILL have /product tags. These serve the same purpose
1196 # for grouping. For an example, see AY763288 (also in t/data)
1197 if (@sfs_with_locus_tag==0 &&
1198 @sfs_with_gene_tag==0 &&
1199 @sfs_with_product_tag>0 &&
1201 $group_tag = 'product';
1202 if ($self->verbose > 0) {
1203 warn "Set group tag to: $group_tag\n";
1209 $group_tag = 'gene';
1212 # ------------------------------
1213 # GROUP FEATURES using $group_tag
1214 # collect features into unstructured groups
1215 # ------------------------------
1218 # we want to generate a list of groups;
1219 # each group is a list of SeqFeatures; this
1220 # group probably (but not necessarily)
1221 # corresponds to a gene model.
1223 # this array will look something like this:
1224 # ([$f1], [$f2, $f3, $f4], ...., [$f97, $f98, $f99])
1226 # there are also 'singleton' groups, with one member.
1227 # for instance, the 'source' feature is in a singleton group;
1228 # the same with others such as 'misc_feature'
1232 # --------------------
1233 # we hope that the genbank record allows us to group by some grouping
1235 # for instance, most of the time a gene model can be grouped using
1236 # the gene tag - that is where you see
1238 # in a genbank record
1239 # --------------------
1241 # keep an index of groups by their
1243 my %group_by_tag = ();
1246 # iterate through all features, putting them into groups
1247 foreach my $sf (@flat_seq_features) {
1248 if (!$sf->has_tag($group_tag)) {
1250 # this is an ungroupable feature;
1251 # add it to a group of its own
1252 push(@groups, [$sf]);
1256 my @group_tagvals = $sf->get_tag_values($group_tag);
1257 if (@group_tagvals > 1) {
1259 # currently something can only belong to one group
1261 ">1 value for /$group_tag: @group_tagvals\n".
1262 "At this time this module is not equipped to handle this adequately", $sf);
1264 # get value of group tag
1265 my $gtv = shift @group_tagvals;
1266 $gtv || $self->throw("Empty /$group_tag vals not allowed!");
1268 # is this a new group?
1269 my $group = $group_by_tag{$gtv};
1271 # this group has been encountered before - add current
1272 # sf to the end of the group
1276 # new group; add to index and create new group
1277 $group = [$sf]; # currently one member; probably more to come
1278 $group_by_tag{$gtv} = $group;
1279 push(@groups, $group);
1284 # as well as having the same group_tag, a group should be spatially
1285 # connected. if not, then the group should be split into subgroups.
1286 # this turns out to be necessary in the case of multicopy genes.
1287 # the standard way to represent these is as spatially disconnected
1288 # gene models (usually a 'gene' feature and some kind of RNA feature)
1289 # with the same group tag; the code below will split these into
1290 # seperate groups, one per copy.
1291 @groups = map { $self->_split_group_if_disconnected($_) } @groups;
1293 # remove any duplicates; most of the time the method below has
1294 # no effect. there are some unusual genbank records for which
1295 # duplicate removal is necessary. see the comments in the
1296 # _remove_duplicates_from_group() method if you want to know
1298 foreach my $group (@groups) {
1299 $self->_remove_duplicates_from_group($group);
1304 # PSEUDOGENES, PSEUDOEXONS AND PSEUDOINTRONS
1305 # these are indicated with the /pseudo tag
1306 # these are mapped to a different type; they should NOT
1307 # be treated as normal genes
1308 foreach my $sf (@all_seq_features) {
1309 if ($sf->has_tag('pseudo')) {
1310 my $type = $sf->primary_tag;
1311 # SO type is typically the same as the normal
1312 # type but preceeded by "pseudo"
1313 if ($type eq 'misc_RNA' || $type eq 'mRNA') {
1314 # dgg: see TypeMapper; both pseudo mRNA,misc_RNA should be pseudogenic_transcript
1315 $sf->primary_tag("pseudotranscript");
1318 $sf->primary_tag("pseudo$type");
1322 # now some of the post-processing that follows which applies to
1323 # genes will NOT be applied to pseudogenes; this is deliberate
1324 # for example, gene models are normalised to be gene-transcript-exon
1325 # for pseudogenes we leave them as pseudogene-pseudoexon
1328 my $need_to_infer_exons = 0;
1329 my $need_to_infer_mRNAs = 0;
1330 my @removed_exons = ();
1332 if (defined($structure_type)) {
1333 $self->throw("Can't combine use_magic AND setting structure_type");
1336 scalar(grep {$_->primary_tag eq 'exon'} @flat_seq_features);
1338 scalar(grep {$_->primary_tag eq 'exon'} @flat_seq_features);
1340 scalar(grep {$_->primary_tag eq 'mRNA'} @flat_seq_features);
1341 my $n_mrnas_attached_to_gene =
1342 scalar(grep {$_->primary_tag eq 'mRNA' &&
1343 $_->has_tag($group_tag)} @flat_seq_features);
1345 scalar(grep {$_->primary_tag eq 'CDS'} @flat_seq_features);
1347 scalar(grep {$_->primary_tag =~ /RNA/} @flat_seq_features);
1348 # Are there any CDS features in the record?
1352 # - a pc gene model should contain at the least a CDS
1354 # Are there any mRNA features in the record?
1355 if ($n_mrnas == 0) {
1357 # looks like structure_type == 1
1358 $structure_type = 1;
1359 $need_to_infer_mRNAs = 1;
1361 elsif ($n_mrnas_attached_to_gene == 0) {
1363 # $n_mrnas_attached_to_gene = 0
1365 # The entries _do_ contain mRNA features,
1366 # but none of them are part of a group/gene, i.e. they
1369 # this is an annoying weird file that has some floating
1371 # eg ftp.ncbi.nih.gov/genomes/Schizosaccharomyces_pombe/
1373 if ($self->verbose) {
1374 my @floating_mrnas =
1375 grep {$_->primary_tag eq 'mRNA' &&
1376 !$_->has_tag($group_tag)} @flat_seq_features;
1377 printf STDERR
"Unattached mRNAs:\n";
1378 foreach my $mrna (@floating_mrnas) {
1379 $self->_write_sf_detail($mrna);
1381 printf STDERR
"Don't know how to deal with these; filter at source?\n";
1384 foreach (@flat_seq_features) {
1385 if ($_->primary_tag eq 'mRNA') {
1386 # what should we do??
1388 # I think for pombe we just have to filter
1389 # out bogus mRNAs prior to starting
1393 # looks like structure_type == 2
1394 $structure_type = 2;
1395 $need_to_infer_mRNAs = 1;
1400 # we always infer exons in magic mode
1401 $need_to_infer_exons = 1;
1404 # this doesn't seem to be any kind of protein coding gene model
1405 if ( $n_rnas > 0 ) {
1406 $need_to_infer_exons = 1;
1410 $need_to_infer_exons = 0 if $noinfer; #NML
1412 if ($need_to_infer_exons) {
1413 # remove exons and introns from group -
1414 # we will infer exons later, and we
1415 # can always infer introns from exons
1416 foreach my $group (@groups) {
1419 my $type = $_->primary_tag();
1420 if ($type eq 'exon') {
1421 # keep track of all removed exons,
1422 # so we can do a sanity check later
1423 push(@removed_exons, $_);
1425 $type ne 'exon' && $type ne 'intron'
1428 # get rid of any groups that have zero members
1429 @groups = grep {scalar(@
$_)} @groups;
1432 # --- END OF MAGIC ---
1435 if (grep {!scalar(@
$_)} @groups) {
1436 $self->throw("ASSERTION ERROR: empty group");
1440 if ($self->verbose > 0) {
1441 printf STDERR
"GROUPS:\n";
1442 foreach my $group (@groups) {
1443 $self->_write_group($group, $group_tag);
1448 # --------- FINISHED GROUPING -------------
1451 # TYPE CONTAINMENT HIERARCHY (aka partonomy)
1452 # set the containment hierarchy if desired
1453 # see docs for structure_type() method
1454 if ($structure_type) {
1455 if ($structure_type == 1) {
1464 $self->throw("structure_type $structure_type is currently unknown");
1468 # see if we have an obvious resolver_tag
1470 foreach my $sf (@all_seq_features) {
1471 if ($sf->has_tag('derived_from')) {
1472 $resolver_tag = 'derived_from';
1478 # point all feature types without a container type to the root type.
1480 # for example, if we have an unanticipated feature_type, say
1481 # 'aberration', this should by default point to the parent 'gene'
1482 foreach my $group (@groups) {
1485 foreach my $sf (@sfs) {
1486 my $type = $sf->primary_tag;
1487 next if $type eq 'gene';
1488 my $container_type = $self->get_container_type($type);
1489 if (!$container_type) {
1490 $self->partonomy->{$type} = 'gene';
1497 # we have done the first part of the unflattening.
1498 # we now have a list of groups; each group is a list of seqfeatures.
1499 # the actual group itself is flat; we may want to unflatten this further;
1500 # for instance, a gene model can contain multiple mRNAs and CDSs. We may want
1501 # to link the correct mRNA to the correct CDS via the bioperl sub_SeqFeature tree.
1503 # what we would end up with would be
1509 my @top_sfs = $self->unflatten_groups(-groups
=>\
@groups,
1510 -resolver_method
=>$resolver_method,
1511 -resolver_tag
=>$resolver_tag);
1514 $self->partonomy($old_partonomy);
1517 $self->structure_type($old_structure_type);
1519 # modify the original Seq object - the top seqfeatures are now
1520 # the top features from each group
1521 $seq->remove_SeqFeatures;
1522 $seq->add_SeqFeature($_) foreach @top_sfs;
1524 # --------- FINISHED UNFLATTENING -------------
1526 # lets see if there are any post-unflattening tasks we need to do
1531 if ($need_to_infer_mRNAs) {
1532 if ($self->verbose > 0) {
1533 printf STDERR
"** INFERRING mRNA from CDS\n";
1535 $self->infer_mRNA_from_CDS(-seq
=>$seq, -noinfer
=>$noinfer);
1539 if ($need_to_infer_exons) {
1541 # infer exons, one group/gene at a time
1542 foreach my $sf (@top_sfs) {
1543 my @sub_sfs = ($sf, $sf->get_all_SeqFeatures);
1544 $self->feature_from_splitloc(-features
=>\
@sub_sfs);
1547 # some exons are stated explicitly; ie there is an "exon" feature
1548 # most exons are inferred; ie there is a "mRNA" feature with
1551 # if there were exons explicitly stated in the entry, we need to
1554 # make sure these exons are consistent with the inferred exons
1557 # transfer annotation (tag-vals) from the explicit exon to the
1559 if (@removed_exons) {
1560 my @allfeats = $seq->get_all_SeqFeatures;
1562 # find all the inferred exons that are children of mRNA
1563 my @mrnas = grep {$_->primary_tag eq 'mRNA'} @allfeats;
1565 grep {$_->primary_tag eq 'exon'}
1566 map {$_->get_SeqFeatures} @mrnas;
1568 my %exon_h = (); # index of exons by location;
1570 # there CAN be >1 exon at a location; we can represent these redundantly
1571 # (ie as a tree, not a graph)
1572 push(@
{$exon_h{$self->_locstr($_)}}, $_) foreach @exons;
1573 my @problems = (); # list of problems;
1575 # [$severity, $description] pair
1577 my ($n_exons, $n_removed_exons) =
1578 (scalar(keys %exon_h), scalar(@removed_exons));
1579 foreach my $removed_exon (@removed_exons) {
1580 my $locstr = $self->_locstr($removed_exon);
1581 my $inferred_exons = $exon_h{$locstr};
1582 delete $exon_h{$locstr};
1583 if ($inferred_exons) {
1584 my %exons_done = ();
1585 foreach my $exon (@
$inferred_exons) {
1587 # make sure we don't move stuff twice
1588 next if $exons_done{$exon};
1589 $exons_done{$exon} = 1;
1591 # we need to tranfer any tag-values from the explicit
1592 # exon to the implicit exon
1593 foreach my $tag ($removed_exon->get_all_tags) {
1594 my @vals = $removed_exon->get_tag_values($tag);
1595 if (!$exon->can("add_tag_value")) {
1596 # I'm puzzled as to what should be done here;
1597 # SeqFeatureIs are not necessarily mutable,
1598 # but we know that in practice the implementing
1600 $self->throw("The SeqFeature object does not ".
1601 "implement add_tag_value()");
1603 $exon->add_tag_value($tag, @vals);
1608 # no exons inferred at $locstr
1611 "there is a conflict with exons; there was an explicitly ".
1612 "stated exon with location $locstr, yet I cannot generate ".
1613 "this exon from the supplied mRNA locations\n"]);
1616 # do we have any inferred exons left over, that were not
1617 # covered in the explicit exons?
1619 # TODO - we ignore this problem for now
1622 sprintf("There are some inferred exons that are not in the ".
1623 "explicit exon list; they are the exons at locations:\n".
1624 join("\n", keys %exon_h)."\n")]);
1627 # report any problems
1629 my $thresh = $self->error_threshold;
1630 my @bad_problems = grep {$_->[0] > $thresh} @problems;
1631 if (@bad_problems) {
1632 printf STDERR
"PROBLEM:\n";
1633 $self->_write_hier(\
@top_sfs);
1634 # TODO - allow more fine grained control over this
1635 $self->{_problems_reported
} = 1;
1636 $self->throw(join("\n",
1637 map {"@$_"} @bad_problems));
1639 $self->problem(@
$_) foreach @problems;
1643 # --- end of inferring exons --
1645 # return new top level features; this can also
1647 # $seq->get_SeqFeatures();
1649 return $seq->get_SeqFeatures;
1652 # _split_group_if_disconnected([@sfs])
1654 # as well as having the same group_tag, a group should be spatially
1655 # connected. if not, then the group should be split into subgroups.
1656 # this turns out to be necessary in the case of multicopy genes.
1657 # the standard way to represent these is as spatially disconnected
1658 # gene models (usually a 'gene' feature and some kind of RNA feature)
1659 # with the same group tag; the code below will split these into
1660 # seperate groups, one per copy.
1662 sub _split_group_if_disconnected
{
1667 Bio
::Range
->disconnected_ranges(@sfs);
1670 $self->throw("ASSERTION ERROR");
1672 elsif (@ranges == 1) {
1673 # no need to split the group
1678 # split the group into disconnected ranges
1679 if ($self->verbose > 0) {
1680 printf STDERR
"GROUP PRE-SPLIT:\n";
1681 $self->_write_group($group, $self->group_tag);
1687 $_->intersection($range);
1690 if ($self->verbose > 0) {
1691 printf STDERR
"SPLIT GROUPS:\n";
1692 $self->_write_group($_, $self->group_tag) foreach @groups;
1698 sub _remove_duplicates_from_group
{
1702 # ::: WEIRD BOUNDARY CASE CODE :::
1703 # for some reason, there are some gb records with two gene
1704 # features for one gene; for example, see ATF14F8.gbk
1705 # in the t/data directory
1707 # in this case, we get rid of one of the genes
1709 my @genes = grep {$_->primary_tag eq 'gene'} @
$group;
1711 # OK, if we look at ATF14F8.gbk we see that some genes
1712 # just exist as a single location, some exist as a multisplit location;
1719 # gene complement(join(16790..19855,20136..20912,21378..21497,
1720 # 21654..21876,22204..22400,22527..23158,23335..23448,
1721 # 23538..23938,24175..24536,24604..24715,24889..24984,
1722 # 25114..25171,25257..25329,25544..25589,25900..26018,
1726 # the former is the 'standard' way of representing the gene in genbank;
1727 # the latter is redundant with the CDS entry. So we shall get rid of
1728 # the latter with the following filter
1730 if ($self->verbose > 0) {
1731 printf STDERR
"REMOVING DUPLICATES:\n";
1736 my $loc = $_->location;
1737 if ($loc->isa("Bio::Location::SplitLocationI")) {
1738 my @locs = $loc->each_Location;
1752 # OK, that didn't work. Our only resort is to just pick one at random
1753 @genes = ($genes[0]);
1756 @genes == 1 || $self->throw("ASSERTION ERROR");
1758 ($genes[0], grep {$_->primary_tag ne 'gene'} @
$group);
1761 # its a dirty job but someone's gotta do it
1766 =head2 unflatten_groups
1768 Title : unflatten_groups
1770 Function: iterates over groups, calling unflatten_group() [see below]
1772 Returns : list of Bio::SeqFeatureI objects that are holders
1777 -groups: list of list references; inner list is of Bio::SeqFeatureI objects
1778 e.g. ( [$sf1], [$sf2, $sf3, $sf4], [$sf5, ...], ...)
1780 -resolver_method: a CODE reference
1781 see the documentation above for an example of
1782 a subroutine that can be used to resolve hierarchies
1785 this is optional - a default subroutine will be used
1788 NOTE: You should not need to call this method, unless you want fine
1789 grained control over how the unflattening process.
1793 sub unflatten_groups
{
1794 my ($self,@args) = @_;
1795 my($groups, $resolver_method, $resolver_tag) =
1796 $self->_rearrange([qw(GROUPS
1802 # this is just a simple wrapper for unflatten_group()
1805 $self->unflatten_group(-group
=>$_,
1806 -resolver_method
=>$resolver_method,
1807 -resolver_tag
=>$resolver_tag)
1811 =head2 unflatten_group
1813 Title : unflatten_group
1815 Function: nests a group of features into a feature containment hierarchy
1817 Returns : Bio::SeqFeatureI objects that holds other features
1822 -group: reference to list of Bio::SeqFeatureI objects
1824 -resolver_method: a CODE reference
1825 see the documentation above for an example of
1826 a subroutine that can be used to resolve hierarchies
1829 this is optional - a default subroutine will be used
1832 NOTE: You should not need to call this method, unless you want fine
1833 grained control over how the unflattening process.
1837 sub unflatten_group
{
1838 my ($self,@args) = @_;
1840 my($group, $resolver_method, $resolver_tag) =
1841 $self->_rearrange([qw(GROUP
1847 if ($self->verbose > 0) {
1848 printf STDERR
"UNFLATTENING GROUP:\n";
1849 $self->_write_group($group, $self->group_tag);
1854 # we can safely ignore singletons (e.g. [source])
1855 return $sfs[0] if @sfs == 1;
1857 my $partonomy = $self->partonomy;
1859 # $resolver_method is a reference to a SUB that will resolve
1860 # ambiguous parent/child containment; for example, determining
1861 # which mRNAs go with which CDSs
1862 $resolver_method = $resolver_method || \
&_resolve_container_for_sf
;
1864 # TAG BASED RESOLVING OF HIERARCHIES
1866 # if the user specifies $resolver_tag, then we use this tag
1867 # to pair up ambiguous parents and children;
1869 # for example, the CDS feature may have a resolver tag of /derives_from
1870 # which is a 'foreign key' into the /label tag of the mRNA feature
1872 # this kind of tag-based resolution is possible for a certain subset
1873 # of genbank records
1875 # if no resolver tag is specified, we revert to the normal
1877 if ($resolver_tag) {
1878 my $backup_resolver_method = $resolver_method;
1879 # closure: $resolver_tag is remembered by this sub
1882 my ($self, $sf, @possible_container_sfs) = @_;
1883 my @container_sfs = ();
1884 if ($sf->has_tag($resolver_tag)) {
1885 my ($resolver_tagval) = $sf->get_tag_values($resolver_tag);
1886 # if a feature has a resolver_tag (e.g. /derives_from)
1887 # this specifies the /product, /symbol or /label for the
1892 $self->_write_sf($_) if $self->verbose > 0;
1893 foreach my $tag (qw(product symbol label)) {
1894 if ($_->has_tag($tag)) {
1896 $_->get_tag_values($tag);
1897 if (grep {$_ eq $resolver_tagval} @vals) {
1904 } @possible_container_sfs;
1907 return $backup_resolver_method->($sf, @possible_container_sfs);
1909 return map {$_=>0} @container_sfs;
1911 $resolver_method = $sub;
1914 # CONDITION: $resolver_tag is NOT set
1915 $self->throw("assertion error") if $resolver_tag;
1917 # we have now set $resolver_method to a subroutine for
1918 # disambiguatimng parent/child relationships. we will
1919 # now build the whole containment hierarchy for this group
1922 # FIND TOP/ROOT SEQFEATURES
1924 # find all the features for which there is no
1925 # containing feature type (eg genes)
1928 !$self->get_container_type($_->primary_tag);
1931 # CONDITION: there must be at most one root
1933 $self->_write_group($group, $self->group_tag);
1934 printf STDERR
"TOP SFS:\n";
1935 $self->_write_sf($_) foreach @top_sfs;
1936 $self->throw("multiple top-sfs in group");
1938 my $top_sf = $top_sfs[0];
1940 # CREATE INDEX OF SEQFEATURES BY TYPE
1941 my %sfs_by_type = ();
1942 foreach my $sf (@sfs) {
1943 push(@
{$sfs_by_type{$sf->primary_tag}}, $sf);
1946 # containment index; keyed by child; lookup parent
1947 # note: this index uses the stringified object reference of
1948 # the object as a surrogate lookup key
1950 my %container = (); # child -> parent
1952 # ALGORITHM: build containment graph
1954 # find all possible containers for each SF;
1955 # for instance, for a CDS, the possible containers are all
1956 # the mRNAs in the same group. For a mRNA, the possible
1957 # containers are any SFs of type 'gene' (should only be 1).
1958 # (these container-type mappings can be overridden)
1960 # contention is resolved by checking coordinates of splice sites
1961 # (this is the default, but can be overridden)
1963 # most of the time, there is no problem identifying a unique
1964 # parent for every child; this can be ambiguous when constructing
1965 # CDS to mRNA relationships with lots of alternate splicing
1967 # a hash of child->parent relationships is constructed (%container)
1968 # any mappings that need further resolution (eg CDS to mRNA) are
1969 # placed in %unresolved
1972 # (keyed by stringified object reference of child seqfeature)
1973 my %unresolved = (); # child -> [parent,score] to be resolved
1975 # index of seqfeatures by their stringified object reference;
1976 # this is essentially a way of 'reviving' an object from its stringified
1978 # (see NOTE ON USING OBJECTS AS KEYS IN HASHES, below)
1979 my %idxsf = map {$_=>$_} @sfs;
1981 foreach my $sf (@sfs) {
1982 my $type = $sf->primary_tag;
1984 # container type (e.g. the container type for CDS is usually mRNA)
1985 my $container_type =
1986 $self->get_container_type($type);
1987 if ($container_type) {
1989 my @possible_container_sfs =
1990 @
{$sfs_by_type{$container_type} || []};
1991 # we now have a list of possible containers
1992 # (eg for a CDS in an alternately spliced gene, this
1993 # would be a list of all the mRNAs for this gene)
1995 if (!@possible_container_sfs) {
1999 if (@possible_container_sfs == 1) {
2000 # this is the easy situation, whereby the containment
2001 # hierarchy is unambiguous. this will probably be the
2002 # case if the genbank record has no alternate splicing
2005 # ONE OPTION ONLY - resolved!
2006 $container{$sf} = $possible_container_sfs[0];
2010 # MULTIPLE CONTAINER CHOICES
2011 $self->throw("ASSERTION ERROR") unless @possible_container_sfs > 1;
2013 # push this onto the %unresolved graph, and deal with it
2016 # for now we hardcode things such that the only type
2017 # with ambiguous parents is a CDS; if this is violated,
2018 # it has a weak problem class of '1' so the API user
2019 # can easily set things to ignore these
2020 if ($sf->primary_tag ne 'CDS') {
2022 "multiple container choice for non-CDS; ".
2023 "CDS to mRNA should be the only ".
2024 "relationships requiring resolving",
2028 # previously we set the SUB $resolver_method
2029 $self->throw("ASSERTION ERROR")
2030 unless $resolver_method;
2032 # $resolver_method will assign scores to
2033 # parent/child combinations; later on we
2034 # will use these scores to find the optimal
2035 # parent/child pairings
2037 # the default $resolver_method uses splice sites to
2038 # score possible parent/child matches
2041 $resolver_method->($self, $sf, @possible_container_sfs);
2042 if (!%container_sfh) {
2044 "no containers possible for SeqFeature of ".
2045 "type: $type; this SF is being placed at ".
2048 # RESOLVED! (sort of - placed at root/gene level)
2049 $container{$sf} = $top_sf;
2051 # this sort of thing happens if the record is
2052 # badly messed up and there is absolutely no indication
2053 # of where to put the CDS. Perhaps we should just
2054 # place it with a random mRNA?
2056 foreach my $jsf (keys %container_sfh) {
2058 # add [score, parent] pairs to the %unresolved
2059 # lookup table/graph
2060 push(@
{$unresolved{$sf}},
2061 [$idxsf{$jsf}, $container_sfh{$jsf} || 0]);
2068 # not container type for $sf->primary_tag
2071 # $sf must be a root/top node (eg gene)
2077 # CODE CURRENTLY DISABLED
2079 # we require a 1:1 mapping between mRNAs and CDSs;
2080 # create artificial duplicates if we can't do this...
2082 my %childh = map {$_=>1} keys %unresolved;
2083 my %parenth = map {$_->[0]=>1} map {@
$_} values %unresolved;
2084 if ($self->verbose > 0) {
2085 printf STDERR
"MATCHING %d CHILDREN TO %d PARENTS\n",
2086 scalar(keys %childh), scalar(keys %parenth);
2088 # 99.99% of the time in genbank genomic record of structure type 0, we
2089 # see one CDS for every mRNA; one exception is the S Pombe
2090 # genome, which is all CDS, bar a few spurious mRNAs; we have to
2091 # filter out the spurious mRNAs in this case
2093 # another strange case is in the mouse genome, NT_078847.1
2094 # for Pcdh13 you will notice there is 4 mRNAs and 5 CDSs.
2096 # I'm at a loss for a really clever thing to do here. I think the
2097 # best thing is to create duplicate features to preserve the 1:1 mapping
2098 # my $suffix_id = 1;
2099 # while (keys %childh > keys %parenth) {
2106 if ($self->verbose > 0 && scalar(keys %unresolved)) {
2107 printf STDERR
"UNRESOLVED PAIRS:\n";
2108 foreach my $childsf (keys %unresolved) {
2109 my @poss = @
{$unresolved{$childsf}};
2110 foreach my $p (@poss) {
2111 my $parentsf = $p->[0];
2112 $childsf = $idxsf{$childsf};
2113 my @clabels = ($childsf->get_tagset_values(qw(protein_id label product)), "?");
2114 my @plabels = ($parentsf->get_tagset_values(qw(transcript_id label product)), "?");
2116 (" PAIR: $clabels[0] => $plabels[0] (of %d)\n",
2120 } # -- end of verbose
2122 # Now we have to fully resolve the containment hierarchy; remember,
2123 # the graph %container has the fully resolved child->parent links;
2125 # the graph %unresolved is keyed by children missing parents; we
2126 # need to put all these orphans in the %container graph
2128 # we do this using the scores in %unresolved, with the
2129 # find_best_matches() algorithm
2130 my $unresolved_problem_reported = 0;
2133 $self->find_best_matches(\
%unresolved, []);
2135 my ($g) = $sfs[0]->get_tagset_values($self->group_tag || 'gene');
2137 "Could not resolve hierarchy for $g");
2139 $unresolved_problem_reported = 1;
2141 foreach my $pair (@
$new_pairs) {
2142 if ($self->verbose > 0) {
2143 printf STDERR
" resolved pair @$pair\n";
2145 $container{$pair->[0]} = $pair->[1];
2146 delete $unresolved{$pair->[0]};
2150 # CONDITION: containment hierarchy resolved
2152 $self->throw("UNRESOLVED: %unresolved")
2153 unless $unresolved_problem_reported;
2156 # make nested SeqFeature hierarchy from @containment_pairs
2157 # ie put child SeqFeatures into parent SeqFeatures
2159 foreach my $sf (@sfs) {
2160 my $container_sf = $container{$sf};
2161 if ($container_sf) {
2162 # make $sf nested inside $container_sf
2164 # first check if the container spatially contains the containee
2165 if ($container_sf->contains($sf)) {
2167 $container_sf->add_SeqFeature($sf);
2170 # weird case - the container does NOT spatially
2171 # contain the containee;
2172 # we expand and throw a warning
2174 # for an example of this see ZFP91-CNTF dicistronic gene
2175 # in NCBI chrom 11 build 34.3
2177 "Container feature does not spatially contain ".
2178 "subfeature. Perhaps this is a dicistronic gene? ".
2179 "I am expanding the parent feature",
2182 $container_sf->add_SeqFeature($sf, 'EXPAND');
2190 } # -- end of unflatten_group
2193 # A NOTE ON USING OBJECTS AS KEYS IN HASHES (stringified objects)
2195 # Often we with to use seqfeatures as keys in a hashtable; because seqfeatures
2196 # in bioperl have no unique ID, we use a surrogate ID in the form of the
2197 # stringified object references - this is just what you get if you say
2201 # this is guaranteed to be unique (within a particular perl execution)
2203 # often we want to 'revive' the objects used as keys in a hash - once the
2204 # objects are used as keys, remember it is the *strings* used as keys and
2205 # not the object itself, so the object needs to be revived using another
2206 # hashtable that looks like this
2208 # %sfidx = map { $_ => $_ } @sfs
2213 # recursively finds the best set of pairings from a matrix of possible pairings
2215 # tries to make sure nothing is unpaired
2217 # given a matrix of POSSIBLE matches
2218 # (matrix expressed as hash/lookup; keyed by child object; val = [parent, score]
2221 sub find_best_matches
{
2224 my $pairs = shift; # [child,parent] pairs already selected
2226 my $verbose = $self->verbose;
2227 #################################print "I";
2229 printf STDERR
"find_best_matches: (/%d)\n", scalar(@
$pairs);
2232 my %selected_children = map {($_->[0]=>1)} @
$pairs;
2233 my %selected_parents = map {($_->[1]=>1)} @
$pairs;
2235 # make a copy of the matrix with the portions still to be
2237 my %unresolved_parents = ();
2241 printf STDERR
" $_ : %s\n", join("; ", map {"[@$_]"} @
{$matrix->{$_}});
2243 if ($selected_children{$_}) {
2249 !$selected_parents{$_->[0]}
2251 $unresolved_parents{$_} = 1 foreach @parents;
2257 my @I = keys %unresolved;
2259 return $pairs if !scalar(keys %unresolved_parents);
2260 # NECESSARY CONDITION:
2261 # all possible parents have a child match
2263 return $pairs if !scalar(@I);
2264 # NECESSARY CONDITION:
2265 # all possible children have a parent match
2267 # give those with fewest choices highest priority
2269 # n possible parents
2270 scalar(@
{$unresolved{$a}})
2272 scalar(@
{$unresolved{$b}}) ;
2277 my @J = @
{$unresolved{$csf}}; # array of [parent, score]
2279 # sort by score, highest first
2285 # select pair(s) from remaining matrix of possible pairs
2286 # by iterating through possible parents
2288 my $successful_pairs;
2289 foreach my $j (@J) {
2290 my ($psf, $score) = @
$j;
2291 # would selecting $csf, $psf as a pair
2292 # remove all choices from another?
2294 foreach my $sf (@I) {
2295 if (!grep {$_->[0] ne $psf} @
{$unresolved{$sf}}) {
2296 # $psf was the only parent choice for $sf
2302 my $pair = [$csf, $psf];
2303 my $new_pairs = [@
$pairs, $pair];
2304 my $set = $self->find_best_matches($matrix, $new_pairs);
2306 $successful_pairs = $set;
2312 return $successful_pairs if $successful_pairs;
2317 # ----------------------------------------------
2318 # writes a group to stdout
2320 # mostly for logging/debugging
2321 # ----------------------------------------------
2325 my $group_tag = shift || 'gene';
2327 my $f = $group->[0];
2329 if ($f->has_tag($group_tag)) {
2330 ($label) = $f->get_tag_values($group_tag);
2332 if( $self->verbose > 0 ) {
2333 printf STDERR
(" GROUP [%s]:%s\n",
2336 map { $_->primary_tag } @
$group));
2344 printf STDERR
"TYPE:%s\n", $sf->primary_tag;
2348 sub _write_sf_detail
{
2351 printf STDERR
"TYPE:%s\n", $sf->primary_tag;
2352 my @locs = $sf->location->each_Location;
2353 printf STDERR
" %s,%s [%s]\n", $_->start, $_->end, $_->strand foreach @locs;
2359 my @sfs = @
{shift || []};
2360 my $indent = shift || 0;
2361 if( $self->verbose > 0 ) {
2362 foreach my $sf (@sfs) {
2364 if ($sf->has_tag('product')) {
2365 ($label) = $sf->get_tag_values('product');
2367 printf STDERR
"%s%s $label\n", ' ' x
$indent, $sf->primary_tag;
2368 my @sub_sfs = $sf->sub_SeqFeature;
2369 $self->_write_hier(\
@sub_sfs, $indent+1);
2374 # -----------------------------------------------
2376 # returns all possible containers for an SF based
2377 # on splice site coordinates; splice site coords
2379 # -----------------------------------------------
2380 sub _resolve_container_for_sf
{
2381 my ($self, $sf, @possible_container_sfs) = @_;
2383 my @coords = $self->_get_splice_coords_for_sf($sf);
2384 my $start = $sf->start;
2386 my $splice_uniq_str = "@coords";
2388 my @sf_score_pairs = ();
2389 # a CDS is contained by a mRNA if the locations of the splice
2390 # coordinates are identical
2391 foreach (@possible_container_sfs) {
2392 my @container_coords = $self->_get_splice_coords_for_sf($_);
2394 !$splice_uniq_str ||
2395 index("@container_coords", $splice_uniq_str) > -1;
2397 # the container cannot be smaller than the thing contained
2398 if ($_->start > $start || $_->end < $end) {
2404 # SPECIAL CASE FOR /ribosomal_slippage
2405 # See: https://www.ncbi.nlm.nih.gov/collab/FT/
2406 if (!$inside && $sf->has_tag('ribosomal_slippage')) {
2407 if ($self->verbose > 0) {
2408 printf STDERR
" Checking for ribosomal_slippage\n";
2411 # TODO: rewrite this to match introns;
2412 # each slippage will be a "fake" small CDS exon
2413 my @transcript_splice_sites = @container_coords;
2414 my @cds_splice_sites = @coords;
2415 ##printf STDERR "xxTR SSs: @transcript_splice_sites :: %s\n", $_->get_tag_values('product');
2416 ##printf STDERR "xxCD SSs: @cds_splice_sites :: %s\n\n", $sf->get_tag_values('product');
2418 # find the the first splice site within the CDS
2419 while (scalar(@transcript_splice_sites) &&
2420 $transcript_splice_sites[0] < $cds_splice_sites[0]) {
2421 shift @transcript_splice_sites;
2424 ##print STDERR "TR SSs: @transcript_splice_sites\n";
2425 ##print STDERR "CD SSs: @cds_splice_sites\n\n";
2427 if (!(scalar(@transcript_splice_sites)) ||
2428 $transcript_splice_sites[0] == $cds_splice_sites[0]) {
2430 # we will now try and align all splice remaining sites in the transcript and CDS;
2431 # any splice site that can't be aligned is assumed to be a ribosomal slippage
2435 $inside = 1; # innocent until proven guilty..
2436 while (@cds_splice_sites) {
2437 if (!@transcript_splice_sites) {
2439 # ribosomal slippage is after the last transcript splice site
2440 # Example: (NC_00007, isoform 3 of PEG10)
2441 # mRNA join(85682..85903,92646..99007)
2442 # mRNA join(85682..85903,92646..99007)
2443 # CDS join(85899..85903,92646..93825,93825..94994)
2445 # OR: None of the splice sites align;
2446 # may be a single CDS exon with one slippage inside it.
2447 # Example: (NC_00007, isoform 4 of PEG10)
2448 # mRNA join(85637..85892,92646..99007)
2449 # CDS join(92767..93825,93825..94994)
2451 # Yes, this code is repeated below...
2452 my $p1 = shift @cds_splice_sites;
2453 my $p2 = shift @cds_splice_sites;
2454 if ($self->verbose > 0) {
2455 printf STDERR
" Found the ribosomal_slippage: $p1..$p2\n";
2457 push(@slips, ($p2-$p1)-1);
2459 elsif ($cds_splice_sites[0] == $transcript_splice_sites[0]) {
2460 # splice sites align: this is not the slippage
2461 shift @cds_splice_sites;
2462 shift @transcript_splice_sites;
2463 ##print STDERR "MATCH\n";
2467 if ($cds_splice_sites[0] < $transcript_splice_sites[0]) {
2468 # potential slippage
2474 my $p1 = shift @cds_splice_sites;
2475 my $p2 = shift @cds_splice_sites;
2476 if ($self->verbose > 0) {
2477 printf STDERR
" Found the ribosomal_slippage: $p1..$p2\n";
2479 push(@slips, ($p2-$p1)-1);
2482 # not a potential ribosomal slippage
2483 $inside = 0; # guilty!
2484 ##print STDERR "FAIL\n";
2490 # TODO: this is currently completely arbitrary. How many ribosomal slippages do we allow?
2491 # perhaps we need some mini-statistical model here....?
2495 # TODO: this is currently completely arbitrary. What is the maximum size of a ribosomal slippage?
2496 # perhaps we need some mini-statistical model here....?
2497 if (grep {$_ > 2} @slips) {
2503 # not a ribosomal_slippage, sorry
2506 if ($self->verbose > 0) {
2507 printf STDERR
" Checking containment:[$inside] (@container_coords) IN ($splice_uniq_str)\n";
2510 # SCORE: matching (ss-scoords+2)/(n-container-ss-coords+2)
2512 (scalar(@coords)+2)/(scalar(@container_coords)+2);
2513 push(@sf_score_pairs,
2517 return @sf_score_pairs;
2520 sub _get_splice_coords_for_sf
{
2524 my @locs = $sf->location;
2525 if ($sf->location->isa("Bio::Location::SplitLocationI")) {
2526 @locs = $sf->location->each_Location;
2529 # get an ordered list of (start, end) positions
2533 # $_->strand > 0 ? ($_->start, $_->end) : ($_->end, $_->start)
2536 my @coords = map {($_->start, $_->end)} @locs;
2538 # remove first and last leaving only splice sites
2544 =head2 feature_from_splitloc
2546 Title : feature_from_splitloc
2547 Usage : $unflattener->feature_from_splitloc(-features=>$sfs);
2553 At this time all this method does is generate exons for mRNA or other RNA features
2557 -feature: a Bio::SeqFeatureI object (that conforms to Bio::FeatureHolderI)
2558 -seq: a Bio::SeqI object that contains Bio::SeqFeatureI objects
2559 -features: an arrayref of Bio::SeqFeatureI object
2564 sub feature_from_splitloc
{
2565 my ($self,@args) = @_;
2567 my($sf, $seq, $sfs) =
2568 $self->_rearrange([qw(FEATURE
2573 my @sfs = (@
{$sfs || []});
2574 push(@sfs, $sf) if $sf;
2576 $seq->isa("Bio::SeqI") || $self->throw("$seq NOT A SeqI");
2577 @sfs = $seq->get_all_SeqFeatures;
2579 my @exons = grep {$_->primary_tag eq 'exon'} @sfs;
2582 "There are already exons, so I will not infer exons");
2585 # index of features by type+location
2588 # infer for every feature
2589 foreach my $sf (@sfs) {
2591 $sf->isa("Bio::SeqFeatureI") || $self->throw("$sf NOT A SeqFeatureI");
2592 $sf->isa("Bio::FeatureHolderI") || $self->throw("$sf NOT A FeatureHolderI");
2594 my $type = $sf->primary_tag;
2595 next unless $type eq 'mRNA' or $type =~ /RNA/;
2597 # an mRNA from genbank will have a discontinuous location,
2598 # with each sub-location being equivalent to an exon
2599 my @locs = $sf->location;
2601 if ($sf->location->isa("Bio::Location::SplitLocationI")) {
2602 @locs = $sf->location->each_Location;
2608 $self->throw("ASSERTION ERROR: sf has no location objects");
2611 # make exons from locations
2614 my $subsf = Bio
::SeqFeature
::Generic
->new(-location
=>$_,
2615 -primary_tag
=>'exon');
2616 ## Provide seq_id to new feature:
2617 $subsf->seq_id($sf->seq_id) if $sf->seq_id;
2618 $subsf->source_tag($sf->source_tag) if $sf->source_tag;
2619 ## Transfer /locus_tag and /gene tag values to inferred
2620 ## features. TODO: Perhaps? this should not be done
2621 ## indiscriminantly but rather by virtue of the setting
2623 foreach my $tag (grep /gene|locus_tag/, $sf->get_all_tags) {
2624 my @vals = $sf->get_tag_values($tag);
2625 $subsf->add_tag_value($tag, @vals);
2628 my $locstr = 'exon::'.$self->_locstr($subsf);
2630 # re-use feature if type and location the same
2631 if ($loc_h{$locstr}) {
2632 $subsf = $loc_h{$locstr};
2635 $loc_h{$locstr} = $subsf;
2641 $self->_check_order_is_consistent($sf->location->strand,@subsfs);
2644 $sf->location(Bio
::Location
::Simple
->new());
2646 # we allow the exons to define the boundaries of the transcript
2647 $sf->add_SeqFeature($_, 'EXPAND') foreach @subsfs;
2650 if (!$sf->location->strand) {
2651 # correct weird bioperl bug in previous versions;
2652 # strand was not being set correctly
2653 $sf->location->strand($subsfs[0]->location->strand);
2661 #sub merge_features_with_same_loc {
2662 # my ($self,@args) = @_;
2665 # $self->_rearrange([qw(FEATURES
2669 # my @sfs = (@$sfs);
2671 # $seq->isa("Bio::SeqI") || $self->throw("$seq NOT A SeqI");
2672 # @sfs = $seq->get_all_SeqFeatures;
2677 # foreach my $sf (@sfs) {
2678 # my $type = $sf->primary_tag;
2679 # my $locstr = $self->_locstr($sf);
2680 ## $loc_h{$type.$locstr}
2681 # push(@{$exon_h{$self->_locstr($_)}}, $_) foreach @exons;
2685 =head2 infer_mRNA_from_CDS
2687 Title : infer_mRNA_from_CDS
2694 given a "type 1" containment hierarchy
2700 this will infer the uniform "type 0" containment hierarchy
2707 all the children of the CDS will be moved to the mRNA
2709 a "type 2" containment hierarchy is mixed type "0" and "1" (for
2710 example, see ftp.ncbi.nih.gov/genomes/Schizosaccharomyces_pombe/)
2714 sub infer_mRNA_from_CDS
{
2715 my ($self,@args) = @_;
2717 my($sf, $seq, $noinfer) =
2718 $self->_rearrange([qw(FEATURE
2725 $seq->isa("Bio::SeqI") || $self->throw("$seq NOT A SeqI");
2726 @sfs = $seq->get_all_SeqFeatures;
2729 foreach my $sf (@sfs) {
2731 $sf->isa("Bio::SeqFeatureI") || $self->throw("$sf NOT A SeqFeatureI");
2732 $sf->isa("Bio::FeatureHolderI") || $self->throw("$sf NOT A FeatureHolderI");
2733 if ($self->verbose > 0) {
2734 printf STDERR
" Checking $sf %s\n", $sf->primary_tag;
2737 if ($sf->primary_tag eq 'mRNA') {
2739 "Inferring mRNAs when there are already mRNAs present");
2742 my @cdsl = grep {$_->primary_tag eq 'CDS' } $sf->get_SeqFeatures;
2744 my @children = grep {$_->primary_tag ne 'CDS'} $sf->get_SeqFeatures;
2748 foreach my $cds (@cdsl) {
2750 if ($self->verbose > 0) {
2751 print " Inferring mRNA from CDS $cds\n";
2753 $self->_check_order_is_consistent($cds->location->strand,$cds->location->each_Location);
2755 my $loc = Bio
::Location
::Split
->new;
2756 foreach my $cdsexonloc ($cds->location->each_Location) {
2758 Bio
::Location
::Simple
->new(-start
=>$cdsexonloc->start,
2759 -end
=>$cdsexonloc->end,
2760 -strand
=>$cdsexonloc->strand);
2761 $loc->add_sub_Location($subloc);
2767 # share the same location
2769 Bio
::SeqFeature
::Generic
->new(-location
=>$loc,
2770 -primary_tag
=>'mRNA');
2772 ## Provide seq_id to new feature:
2773 $mrna->seq_id($cds->seq_id) if $cds->seq_id;
2774 $mrna->source_tag($cds->source_tag) if $cds->source_tag;
2776 $self->_check_order_is_consistent($mrna->location->strand,$mrna->location->each_Location);
2778 # make the mRNA hold the CDS; no EXPAND option,
2779 # the CDS cannot be wider than the mRNA
2780 $mrna->add_SeqFeature($cds);
2782 # mRNA steals children of CDS
2783 foreach my $subsf ($cds->get_SeqFeatures) {
2784 $mrna->add_SeqFeature($subsf);
2786 $cds->remove_SeqFeatures;
2787 push(@mrnas, $mrna);
2790 # change gene/CDS to gene/mRNA
2791 $sf->remove_SeqFeatures;
2792 $sf->add_SeqFeature($_) foreach (@mrnas, @children);
2802 Title : remove_types
2803 Usage : $unf->remove_types(-seq=>$seq, -types=>["mRNA"]);
2809 removes features of a set type
2811 useful for pre-filtering a genbank record; eg to get rid of STSs
2813 also, there is no way to unflatten
2814 ftp.ncbi.nih.gov/genomes/Schizosaccharomyces_pombe/ UNLESS the bogus
2815 mRNAs in these records are removed (or changed to a different type) -
2816 they just confuse things too much
2821 my ($self,@args) = @_;
2824 $self->_rearrange([qw(
2829 $seq->isa("Bio::SeqI") || $self->throw("$seq NOT A SeqI");
2830 my @sfs = $seq->get_all_SeqFeatures;
2831 my %rh = map {$_=>1} @
$types;
2832 @sfs = grep {!$rh{$_->primary_tag}} @sfs;
2833 $seq->remove_SeqFeatures;
2834 $seq->add_SeqFeature($_) foreach @sfs;
2839 # _check_order_is_consistent($strand,$ranges) RETURNS BOOL
2841 # note: the value of this test is moot - there are many valid,
2842 # if unusual cases where it would flag an anomaly. for example
2843 # transpliced genes such as mod(mdg4) in dmel on AE003744, and
2844 # the following spliced gene on NC_001284:
2846 # mRNA complement(join(20571..20717,21692..22086,190740..190761,
2847 # 140724..141939,142769..142998))
2849 # /note="trans-splicing, RNA editing"
2850 # /db_xref="GeneID:814567"
2852 # note how the exons are not in order
2853 # this will flag a level-3 warning, the user of this module
2854 # can ignore this and deal appropriately with the resulting
2856 sub _check_order_is_consistent
{
2859 my $parent_strand = shift; # this does nothing..?
2861 return unless @ranges;
2863 join(" ",map{sprintf("[%s,%s]",$_->start,$_->end)} @ranges);
2864 my $strand = $ranges[0]->strand;
2865 for (my $i=1; $i<@ranges;$i++) {
2866 if ($ranges[$i]->strand != $strand) {
2867 $self->problem(1,"inconsistent strands. Trans-spliced gene? Range: $rangestr");
2869 # mixed ranges - autopass
2870 # some mRNAs have exons on both strands; for
2871 # example, the dmel mod(mdg4) gene which is
2872 # trans-spliced (in actual fact two mRNAs)
2876 for (my $i=1; $i<@ranges;$i++) {
2877 my $rangeP = $ranges[$i-1];
2878 my $range = $ranges[$i];
2879 if ($rangeP->start > $range->end) {
2880 if ($self->seq->is_circular) {
2881 # see for example NC_006578.gbk
2882 # we make exceptions for circular genomes here.
2883 # see Re: [Gmod-ajax] flatfile-to-json.pl error with GFF
2887 # failed - but still get one more chance..
2889 $self->problem(2,"Ranges not in correct order. Strange ensembl genbank entry? Range: $rangestr");
2896 # sometimes (eg ensembl flavour genbank files)
2897 # exons on reverse strand listed in reverse order
2898 # eg join(complement(R1),...,complement(Rn))
2900 for (my $i=1; $i<@ranges;$i++) {
2901 my $rangeP = $ranges[$i-1];
2902 my $range = $ranges[$i];
2903 if ($rangeP->end < $range->start) {
2904 $self->problem(3,"inconsistent order. Range: $rangestr");
2912 # PRIVATE METHOD: _locstr($sf)
2914 # returns a location string for a feature; just the outer boundaries
2919 sprintf("%d..%d", $sf->start, $sf->end);
2922 sub iterate_containment_tree
{
2924 my $feature_holder = shift;
2926 $sub->($feature_holder);
2927 my @sfs = $feature_holder->get_SeqFeatures;
2928 $self->iterate_containment_tree($_) foreach @sfs;
2931 sub find_best_pairs
{
2936 for (my $j=0; $j < $size; $j++) {
2937 my $score = $matrix->[$i][$j];
2938 if (!defined($score)) {