maint: remove Travis stuff which has been replaced with Github actions (#325)
[bioperl-live.git] / lib / Bio / SeqFeature / Tools / Unflattener.pm
blob5a8b889440ab8861c8b651ae9719138512d61a8e
2 # bioperl module for Bio::SeqFeature::Tools::Unflattener
4 # Please direct questions and support issues to <bioperl-l@bioperl.org>
6 # Cared for by Chris Mungall <cjm@fruitfly.org>
8 # Copyright Chris Mungall
10 # You may distribute this module under the same terms as perl itself
12 # POD documentation - main docs before the code
14 =head1 NAME
16 Bio::SeqFeature::Tools::Unflattener - turns flat list of genbank-sourced features into a nested SeqFeatureI hierarchy
18 =head1 SYNOPSIS
20 # standard / generic use - unflatten a genbank record
21 use Bio::SeqIO;
22 use Bio::SeqFeature::Tools::Unflattener;
24 # generate an Unflattener object
25 $unflattener = Bio::SeqFeature::Tools::Unflattener->new;
27 # first fetch a genbank SeqI object
28 $seqio =
29 Bio::SeqIO->new(-file=>'AE003644.gbk',
30 -format=>'GenBank');
31 my $out =
32 Bio::SeqIO->new(-format=>'asciitree');
33 while ($seq = $seqio->next_seq()) {
35 # get top level unflattended SeqFeatureI objects
36 $unflattener->unflatten_seq(-seq=>$seq,
37 -use_magic=>1);
38 $out->write_seq($seq);
40 @top_sfs = $seq->get_SeqFeatures;
41 foreach my $sf (@top_sfs) {
42 # do something with top-level features (eg genes)
47 =head1 DESCRIPTION
49 Most GenBank entries for annotated genomic DNA contain a B<flat> list
50 of features. These features can be parsed into an equivalent flat list
51 of L<Bio::SeqFeatureI> objects using the standard L<Bio::SeqIO>
52 classes. However, it is often desirable to B<unflatten> this list into
53 something resembling actual B<gene models>, in which genes, mRNAs and CDSs
54 are B<nested> according to the nature of the gene model.
56 The BioPerl object model allows us to store these kind of associations
57 between SeqFeatures in B<containment hierarchies> -- any SeqFeatureI
58 object can contain nested SeqFeatureI objects. The
59 Bio::SeqFeature::Tools::Unflattener object facilitates construction of
60 these hierarchies from the underlying GenBank flat-feature-list
61 representation.
63 For example, if you were to look at a typical GenBank DNA entry, say,
64 B<AE003644>, you would see a flat list of features:
66 source
68 gene CG4491
69 mRNA CG4491-RA
70 CDS CG4491-PA
72 gene tRNA-Pro
73 tRNA tRNA-Pro
75 gene CG32954
76 mRNA CG32954-RA
77 mRNA CG32954-RC
78 mRNA CG32954-RB
79 CDS CG32954-PA
80 CDS CG32954-PB
81 CDS CG32954-PC
83 These features have sequence locations, but it is not immediately
84 clear how to write code such that each mRNA is linked to the
85 appropriate CDS (other than relying on IDs which is very bad)
87 We would like to convert the above list into the B<containment
88 hierarchy>, shown below:
90 source
91 gene
92 mRNA CG4491-RA
93 CDS CG4491-PA
94 exon
95 exon
96 gene
97 tRNA tRNA-Pro
98 exon
99 gene
100 mRNA CG32954-RA
101 CDS CG32954-PA
102 exon
103 exon
104 mRNA CG32954-RC
105 CDS CG32954-PC
106 exon
107 exon
108 mRNA CG32954-RB
109 CDS CG32954-PB
110 exon
111 exon
113 Where each feature is nested underneath its container. Note that exons
114 have been automatically inferred (even for tRNA genes).
116 We do this using a call on a L<Bio::SeqFeature::Tools::Unflattener>
117 object
119 @sfs = $unflattener->unflatten_seq(-seq=>$seq);
121 This would return a list of the B<top level> (i.e. container)
122 SeqFeatureI objects - in this case, genes. Other top level features
123 are possible; for instance, the B<source> feature which is always
124 present, and other features such as B<variation> or B<misc_feature>
125 types.
127 The containment hierarchy can be accessed using the get_SeqFeature()
128 call on any feature object - see L<Bio::SeqFeature::FeatureHolderI>.
129 The following code will traverse the containment hierarchy for a
130 feature:
132 sub traverse {
133 $sf = shift; # $sf isa Bio::SeqfeatureI
135 # ...do something with $sf!
137 # depth first traversal of containment tree
138 @contained_sfs = $sf->get_SeqFeatures;
139 traverse($_) foreach @contained_sfs;
142 Once you have built the hierarchy, you can do neat stuff like turn the
143 features into 'rich' feature objects (eg
144 L<Bio::SeqFeature::Gene::GeneStructure>) or convert to a suitable
145 format such as GFF3 or chadoxml (after mapping to the Sequence
146 Ontology); this step is not described here.
148 =head1 USING MAGIC
150 Due to the quixotic nature of how features are stored in
151 GenBank/EMBL/DDBJ, there is no guarantee that the default behaviour of
152 this module will produce perfect results. Sometimes it is hard or
153 impossible to build a correct containment hierarchy if the information
154 provided is simply too lossy, as is often the case. If you care deeply
155 about your data, you should always manually inspect the resulting
156 containment hierarchy; you may have to customise the algorithm for
157 building the hierarchy, or even manually tweak the resulting
158 hierarchy. This is explained in more detail further on in the document.
160 However, if you are satisfied with the default behaviour, then you do
161 not need to read any further. Just make sure you set the parameter
162 B<use_magic> - this will invoke incantations which will magically
163 produce good results no matter what the idiosyncracies of the
164 particular GenBank record in question.
166 For example
168 $unflattener->unflatten_seq(-seq=>$seq,
169 -use_magic=>1);
171 The success of this depends on the phase of the moon at the time the
172 entry was submitted to GenBank. Note that the magical recipe is being
173 constantly improved, so the results of invoking magic may vary
174 depending on the bioperl release.
176 If you are skeptical of magic, or you wish to exact fine grained
177 control over how the entry is unflattened, or you simply wish to
178 understand more about how this crazy stuff works, then read on!
180 =head1 PROBLEMATIC DATA AND INCONSISTENCIES
182 Occasionally the Unflattener will have problems with certain
183 records. For example, the record may contain inconsistent data - maybe
184 there is an B<exon> entry that has no corresponding B<mRNA> location.
186 The default behaviour is to throw an exception reporting the problem,
187 if the problem is relatively serious - for example, inconsistent data.
189 You can exert more fine grained control over this - perhaps you want
190 the Unflattener to do the best it can, and report any problems. This
191 can be done - refer to the methods.
193 error_threshold()
195 get_problems()
197 report_problems()
199 ignore_problems()
201 =head1 ALGORITHM
203 This is the default algorithm; you should be able to override any part
204 of it to customise.
206 The core of the algorithm is in two parts
208 =over
210 =item Partitioning the flat feature list into groups
212 =item Resolving the feature containment hierarchy for each group
214 =back
216 There are other optional steps after the completion of these two
217 steps, such as B<inferring exons>; we now describe in more detail what
218 is going on.
220 =head2 Partitioning into groups
222 First of all the flat feature list is partitioned into B<group>s.
224 The default way of doing this is to use the B<gene> attribute; if we
225 look at two features from GenBank accession AE003644.3:
227 gene 20111..23268
228 /gene="noc"
229 /locus_tag="CG4491"
230 /note="last curated on Thu Dec 13 16:51:32 PST 2001"
231 /map="35B2-35B2"
232 /db_xref="FLYBASE:FBgn0005771"
233 mRNA join(20111..20584,20887..23268)
234 /gene="noc"
235 /locus_tag="CG4491"
236 /product="CG4491-RA"
237 /db_xref="FLYBASE:FBgn0005771"
239 Both these features share the same /gene tag which is "noc", so they
240 correspond to the same gene model (the CDS feature is not shown, but
241 this also has a tag-value /gene="noc").
243 Not all groups need to correspond to gene models, but this is the most
244 common use case; later on we shall describe how to customise the
245 grouping.
247 Sometimes other tags have to be used; for instance, if you look at the
248 entire record for AE003644.3 you will see you actually need the use the
249 /locus_tag attribute. This attribute is actually B<not present> in
250 most records!
252 You can override this:
254 $collection->unflatten_seq(-seq=>$seq, -group_tag=>'locus_tag');
256 Alternatively, if you B<-use_magic>, the object will try and make a
257 guess as to what the correct group_tag should be.
259 At the end of this step, we should have a list of groups - there is no
260 structure within a group; the group just serves to partition the flat
261 features. For the example data above, we would have the following groups.
263 [ source ]
264 [ gene mRNA CDS ]
265 [ gene mRNA CDS ]
266 [ gene mRNA CDS ]
267 [ gene mRNA mRNA mRNA CDS CDS CDS ]
269 =head3 Multicopy Genes
271 Multicopy genes are usually rRNAs or tRNAs that are duplicated across
272 the genome. Because they are functionally equivalent, and usually have
273 the same sequence, they usually have the same group_tag (ie gene
274 symbol); they often have a /note tag giving copy number. This means
275 they will end up in the same group. This is undesirable, because they
276 are spatially disconnected.
278 There is another step, which involves splitting spatially disconnected
279 groups into distinct groups
281 this would turn this
283 [gene-rrn3 rRNA-rrn3 gene-rrn3 rRNA-rrn3]
285 into this
287 [gene-rrn3 rRNA-rrn3] [gene-rrn3 rRNA-rrn3]
289 based on the coordinates
291 =head3 What next?
293 The next step is to add some structure to each group, by making
294 B<containment hierarchies>, trees that represent how the features
295 interrelate
297 =head2 Resolving the containment hierarchy
299 After the grouping is done, we end up with a list of groups which
300 probably contain features of type 'gene', 'mRNA', 'CDS' and so on.
302 Singleton groups (eg the 'source' feature) are ignored at this stage.
304 Each group is itself flat; we need to add an extra level of
305 organisation. Usually this is because different spliceforms
306 (represented by the 'mRNA' feature) can give rise to different
307 protein products (indicated by the 'CDS' feature). We want to correctly
308 associate mRNAs to CDSs.
310 We want to go from a group like this:
312 [ gene mRNA mRNA mRNA CDS CDS CDS ]
314 to a containment hierarchy like this:
316 gene
317 mRNA
319 mRNA
321 mRNA
324 In which each CDS is nested underneath the correct corresponding mRNA.
326 For entries that contain no alternate splicing, this is simple; we
327 know that the group
329 [ gene mRNA CDS ]
331 Must resolve to the tree
333 gene
334 mRNA
337 How can we do this in entries with alternate splicing? The bad
338 news is that there is no guaranteed way of doing this correctly for
339 any GenBank entry. Occasionally the submission will have been done in
340 such a way as to reconstruct the containment hierarchy. However, this
341 is not consistent across databank entries, so no generic solution can
342 be provided by this object. This module does provide the framework
343 within which you can customise a solution for the particular dataset
344 you are interested in - see later.
346 The good news is that there is an inference we can do that should
347 produce pretty good results the vast majority of the time. It uses
348 splice coordinate data - this is the default behaviour of this module,
349 and is described in detail below.
351 =head2 Using splice site coordinates to infer containment
353 If an mRNA is to be the container for a CDS, then the splice site
354 coordinates (or intron coordinates, depending on how you look at it)
355 of the CDS must fit inside the splice site coordinates of the mRNA.
357 Ambiguities can still arise, but the results produced should still be
358 reasonable and consistent at the sequence level. Look at this fake
359 example:
361 mRNA XXX---XX--XXXXXX--XXXX join(1..3,7..8,11..16,19..23)
362 mRNA XXX-------XXXXXX--XXXX join(1..3,11..16,19..23)
363 CDS XXXX--XX join(13..16,19..20)
364 CDS XXXX--XX join(13..16,19..20)
366 [obviously the positions have been scaled down]
368 We cannot unambiguously match mRNA with CDS based on splice sites,
369 since both CDS share the splice site locations 16^17 and
370 18^19. However, the consequences of making a wrong match are probably
371 not very severe. Any annotation data attached to the first CDS is
372 probably identical to the seconds CDS, other than identifiers.
374 The default behaviour of this module is to make an arbitrary call
375 where it is ambiguous (the mapping will always be bijective; i.e. one
376 mRNA -E<gt> one CDS).
378 [TODO: NOTE: not tested on EMBL data, which may not be bijective; ie two
379 mRNAs can share the same CDS??]
381 This completes the building of the containment hierarchy; other
382 optional step follow
384 =head1 POST-GROUPING STEPS
386 =head2 Inferring exons from mRNAs
388 This step always occurs if B<-use_magic> is invoked.
390 In a typical GenBank entry, the exons are B<implicit>. That is they
391 can be inferred from the mRNA location.
393 For example:
395 mRNA join(20111..20584,20887..23268)
397 This tells us that this particular transcript has two exons. In
398 bioperl, the mRNA feature will have a 'split location'.
400 If we call
402 $unflattener->feature_from_splitloc(-seq=>$seq);
404 This will generate the necessary exon features, and nest them under
405 the appropriate mRNAs. Note that the mRNAs will no longer have split
406 locations - they will have simple locations spanning the extent of the
407 exons. This is intentional, to avoid redundancy.
409 Occasionally a GenBank entry will have both implicit exons (from the
410 mRNA location) B<and> explicit exon features.
412 In this case, exons will still be transferred. Tag-value data from the
413 explicit exon will be transferred to the implicit exon. If exons are
414 shared between mRNAs these will be represented by different
415 objects. Any inconsistencies between implicit and explicit will be
416 reported.
418 =head3 tRNAs and other noncoding RNAs
420 exons will also be generated from these features
422 =head2 Inferring mRNAs from CDS
424 Some GenBank entries represent gene models using features of type
425 gene, mRNA and CDS; some entries just use gene and CDS.
427 If we only have gene and CDS, then the containment hierarchies will
428 look like this:
430 gene
433 If we want the containment hierarchies to be uniform, like this
435 gene
436 mRNA
439 Then we must create an mRNA feature. This will have identical
440 coordinates to the CDS. The assumption is that there is either no
441 untranslated region, or it is unknown.
443 To do this, we can call
445 $unflattener->infer_mRNA_from_CDS(-seq=>$seq);
447 This is taken care of automatically, if B<-use_magic> is invoked.
449 =head1 ADVANCED
451 =head2 Customising the grouping of features
453 The default behaviour is suited mostly to building models of protein
454 coding genes and noncoding genes from genbank genomic DNA submissions.
456 You can change the tag used to partition the feature by passing in a
457 different group_tag argument - see the unflatten_seq() method
459 Other behaviour may be desirable. For example, even though SNPs
460 (features of type 'variation' in GenBank) are not actually part of the
461 gene model, it may be desirable to group SNPs that overlap or are
462 nearby gene models.
464 It should certainly be possible to extend this module to do
465 this. However, I have yet to code this part!!! If anyone would find
466 this useful let me know.
468 In the meantime, you could write your own grouping subroutine, and
469 feed the results into unflatten_groups() [see the method documentation
470 below]
472 =head2 Customising the resolution of the containment hierarchy
474 Once the flat list of features has been partitioned into groups, the
475 method unflatten_group() is called on each group to build a tree.
477 The algorithm for doing this is described above; ambiguities are
478 resolved by using splice coordinates. As discussed, this can be
479 ambiguous.
481 Some submissions may contain information in tags/attributes that hint
482 as to the mapping that needs to be made between the features.
484 For example, with the Drosophila Melanogaster release 3 submission, we
485 see that CDS features in alternately spliced mRNAs have a form like
486 this:
488 CDS join(145588..145686,145752..146156,146227..146493)
489 /locus_tag="CG32954"
490 /note="CG32954 gene product from transcript CG32954-RA"
491 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
492 /codon_start=1
493 /product="CG32954-PA"
494 /protein_id="AAF53403.1"
495 /db_xref="GI:7298167"
496 /db_xref="FLYBASE:FBgn0052954"
497 /translation="MSFTLTNKNVIFVAGLGGIGLDTSKELLKRDLKNLVILDRIENP..."
499 Here the /note tag provides the clue we need to link CDS to mRNA
500 (highlighted with ^^^^). We just need to find the mRNA with the tag
502 /product="CG32954-RA"
504 I have no idea how consistent this practice is across submissions; it
505 is consistent for the fruitfly genome submission.
507 We can customise the behaviour of unflatten_group() by providing our
508 own resolver method. This obviously requires a bit of extra
509 programming, but there is no way to get around this.
511 Here is an example of how to pass in your own resolver; this example
512 basically checks the parent (container) /product tag to see if it
513 matches the required string in the child (contained) /note tag.
515 $unflattener->unflatten_seq(-seq=>$seq,
516 -group_tag=>'locus_tag',
517 -resolver_method=>sub {
518 my $self = shift;
519 my ($sf, @candidate_container_sfs) = @_;
520 if ($sf->has_tag('note')) {
521 my @notes = $sf->get_tag_values('note');
522 my @trnames = map {/from transcript\s+(.*)/;
523 $1} @notes;
524 @trnames = grep {$_} @trnames;
525 my $trname;
526 if (@trnames == 0) {
527 $self->throw("UNRESOLVABLE");
529 elsif (@trnames == 1) {
530 $trname = $trnames[0];
532 else {
533 $self->throw("AMBIGUOUS: @trnames");
535 my @container_sfs =
536 grep {
537 my ($product) =
538 $_->has_tag('product') ?
539 $_->get_tag_values('product') :
540 ('');
541 $product eq $trname;
542 } @candidate_container_sfs;
543 if (@container_sfs == 0) {
544 $self->throw("UNRESOLVABLE");
546 elsif (@container_sfs == 1) {
547 # we got it!
548 return $container_sfs[0];
550 else {
551 $self->throw("AMBIGUOUS");
556 the resolver method is only called when there is more than one spliceform.
558 =head2 Parsing mRNA records
560 Some of the entries in sequence databanks are for mRNA sequences as
561 well as genomic DNA. We may want to build models from these too.
563 NOT YET DONE - IN PROGRESS!!!
565 Open question - what would these look like?
567 Ideally we would like a way of combining a mRNA record with the
568 corresponding SeFeature entry from the appropriate genomic DNA
569 record. This could be problemmatic in some cases - for example, the
570 mRNA sequences may not match 100% (due to differences in strain,
571 assembly problems, sequencing problems, etc). What then...?
573 =head1 SEE ALSO
575 Feature table description
577 http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html
579 =head1 FEEDBACK
581 =head2 Mailing Lists
583 User feedback is an integral part of the evolution of this and other
584 Bioperl modules. Send your comments and suggestions preferably to the
585 Bioperl mailing lists Your participation is much appreciated.
587 bioperl-l@bioperl.org - General discussion
588 http://bioperl.org/wiki/Mailing_lists - About the mailing lists
590 =head2 Support
592 Please direct usage questions or support issues to the mailing list:
594 I<bioperl-l@bioperl.org>
596 rather than to the module maintainer directly. Many experienced and
597 reponsive experts will be able look at the problem and quickly
598 address it. Please include a thorough description of the problem
599 with code and data examples if at all possible.
601 =head2 Reporting Bugs
603 report bugs to the Bioperl bug tracking system to help us keep track
604 the bugs and their resolution. Bug reports can be submitted via the
605 web:
607 https://github.com/bioperl/bioperl-live/issues
609 =head1 AUTHOR - Chris Mungall
611 Email: cjm@fruitfly.org
613 =head1 APPENDIX
615 The rest of the documentation details each of the object
616 methods. Internal methods are usually preceded with a _
618 =cut
621 # Let the code begin...
623 package Bio::SeqFeature::Tools::Unflattener;
625 use strict;
627 # Object preamble - inherits from Bio::Root::Root
628 use Bio::Location::Simple;
629 use Bio::SeqFeature::Generic;
630 use Bio::Range;
633 use base qw(Bio::Root::Root);
635 =head2 new
637 Title : new
638 Usage : $unflattener = Bio::SeqFeature::Tools::Unflattener->new();
639 $unflattener->unflatten_seq(-seq=>$seq);
640 Function: constructor
641 Example :
642 Returns : a new Bio::SeqFeature::Tools::Unflattener
643 Args : see below
645 Arguments
647 -seq : A L<Bio::SeqI> object (optional)
648 the sequence to unflatten; this can also be passed in
649 when we call unflatten_seq()
651 -group_tag : a string representing the /tag used to partition flat features
652 (see discussion above)
654 =cut
657 sub new {
658 my($class,@args) = @_;
659 my $self = $class->SUPER::new(@args);
661 my($seq, $group_tag, $trust_grouptag) =
662 $self->_rearrange([qw(SEQ
663 GROUP_TAG
664 TRUST_GROUPTAG
666 @args);
668 $seq && $self->seq($seq);
669 $group_tag && $self->group_tag($group_tag);
670 # $self->{'trust_grouptag'}= $trust_grouptag if($trust_grouptag); #dgg suggestion
671 return $self; # success - we hope!
674 sub DESTROY {
675 my $self = shift;
676 return if $self->{_reported_problems};
677 return if $self->{_ignore_problems};
678 my @probs = $self->get_problems;
679 if (!$self->{_problems_reported} &&
680 scalar(@probs)) {
681 $self->warn(
682 "WARNING: There are UNREPORTED PROBLEMS.\n".
683 "You may wish to use the method report_problems(), \n",
684 "or ignore_problems() on the Unflattener object\n");
686 return;
689 =head2 seq
691 Title : seq
692 Usage : $unflattener->seq($newval)
693 Function:
694 Example :
695 Returns : value of seq (a Bio::SeqI)
696 Args : on set, new value (a Bio::SeqI, optional)
698 The Bio::SeqI object should hold a flat list of Bio::SeqFeatureI
699 objects; this is the list that will be unflattened.
701 The sequence object can also be set when we call unflatten_seq()
703 =cut
705 sub seq{
706 my $self = shift;
708 return $self->{'seq'} = shift if @_;
709 return $self->{'seq'};
712 =head2 group_tag
714 Title : group_tag
715 Usage : $unflattener->group_tag($newval)
716 Function:
717 Example :
718 Returns : value of group_tag (a scalar)
719 Args : on set, new value (a scalar or undef, optional)
721 This is the tag that will be used to collect elements from the flat
722 feature list into groups; for instance, if we look at two typical
723 GenBank features:
725 gene 20111..23268
726 /gene="noc"
727 /locus_tag="CG4491"
728 /note="last curated on Thu Dec 13 16:51:32 PST 2001"
729 /map="35B2-35B2"
730 /db_xref="FLYBASE:FBgn0005771"
731 mRNA join(20111..20584,20887..23268)
732 /gene="noc"
733 /locus_tag="CG4491"
734 /product="CG4491-RA"
735 /db_xref="FLYBASE:FBgn0005771"
737 We can see that these comprise the same gene model because they share
738 the same /gene attribute; we want to collect these together in groups.
740 Setting group_tag is optional. The default is to use 'gene'. In the
741 example above, we could also use /locus_tag
743 =cut
745 sub group_tag{
746 my $self = shift;
748 return $self->{'group_tag'} = shift if @_;
749 return $self->{'group_tag'};
752 =head2 partonomy
754 Title : partonomy
755 Usage : $unflattener->partonomy({mRNA=>'gene', CDS=>'mRNA')
756 Function:
757 Example :
758 Returns : value of partonomy (a scalar)
759 Args : on set, new value (a scalar or undef, optional)
761 A hash representing the containment structure that the seq_feature
762 nesting should conform to; each key represents the contained (child)
763 type; each value represents the container (parent) type.
765 =cut
767 sub partonomy{
768 my $self = shift;
770 return $self->{'partonomy'} = shift if @_;
771 if (!$self->{'partonomy'}) {
772 $self->{'partonomy'} = $self->_default_partonomy;
774 return $self->{'partonomy'};
777 sub _default_partonomy{
778 return {
779 mRNA => 'gene',
780 tRNA => 'gene',
781 rRNA => 'gene',
782 scRNA => 'gene',
783 snRNA => 'gene',
784 snoRNA => 'gene',
785 misc_RNA => 'gene',
786 CDS => 'mRNA',
787 exon => 'mRNA',
788 intron => 'mRNA',
790 pseudoexon => 'pseudogene',
791 pseudointron => 'pseudogene',
792 pseudotranscript => 'pseudogene',
796 =head2 structure_type
798 Title : structure_type
799 Usage : $unflattener->structure_type($newval)
800 Function:
801 Example :
802 Returns : value of structure_type (a scalar)
803 Args : on set, new value (an int or undef, optional)
805 GenBank entries conform to different flavours, or B<structure
806 types>. Some have mRNAs, some do not.
808 Right now there are only two base structure types defined. If you set
809 the structure type, then appropriate unflattening action will be
810 taken. The presence or absence of explicit exons does not affect the
811 structure type.
813 If you invoke B<-use_magic> then this will be set automatically, based
814 on the content of the record.
816 =over
818 =item Type 0 (DEFAULT)
820 typically contains
822 source
823 gene
824 mRNA
827 with this structure type, we want the seq_features to be nested like this
829 gene
830 mRNA
832 exon
834 exons and introns are implicit from the mRNA 'join' location
836 to get exons from the mRNAs, you will need this call (see below)
838 $unflattener->feature_from_splitloc(-seq=>$seq);
840 =item Type 1
842 typically contains
844 source
845 gene
847 exon [optional]
848 intron [optional]
850 there are no mRNA features
852 with this structure type, we want the seq_features to be nested like this
854 gene
856 exon
857 intron
859 exon and intron may or may not be present; they may be implicit from
860 the CDS 'join' location
862 =back
864 =cut
866 sub structure_type{
867 my $self = shift;
869 return $self->{'structure_type'} = shift if @_;
870 return $self->{'structure_type'};
873 =head2 get_problems
875 Title : get_problems
876 Usage : @probs = get_problems()
877 Function: Get the list of problem(s) for this object.
878 Example :
879 Returns : An array of [severity, description] pairs
880 Args :
882 In the course of unflattening a record, problems may occur. Some of
883 these problems are non-fatal, and can be ignored.
885 Problems are represented as arrayrefs containing a pair [severity,
886 description]
888 severity is a number, the higher, the more severe the problem
890 the description is a text string
892 =cut
894 sub get_problems{
895 my $self = shift;
897 return @{$self->{'_problems'}} if exists($self->{'_problems'});
898 return ();
901 =head2 clear_problems
903 Title : clear_problems
904 Usage :
905 Function: resets the problem list to empty
906 Example :
907 Returns :
908 Args :
911 =cut
913 sub clear_problems{
914 my ($self,@args) = @_;
915 $self->{'_problems'} = [];
916 return;
920 # PRIVATE
921 # see get_problems
922 sub add_problem{
923 my $self = shift;
925 $self->{'_problems'} = [] unless exists($self->{'_problems'});
926 if ($self->verbose > 0) {
927 warn( "PROBLEM: $_\n") foreach @_;
929 push(@{$self->{'_problems'}}, @_);
932 # PRIVATE
933 # see get_problems
934 sub problem {
935 my $self = shift;
936 my ($severity, $desc, @sfs) = @_;
937 if (@sfs) {
938 foreach my $sf (@sfs) {
939 $desc .=
940 sprintf("\nSF [$sf]: ". $sf->location->to_FTstring . "; %s\n",
941 join('; ',
942 $sf->primary_tag,
943 map {
944 $sf->has_tag($_) ?
945 $sf->get_tag_values($_) : ()
946 } qw(locus_tag gene product label)));
949 my $thresh = $self->error_threshold;
950 if ($severity > $thresh) {
951 $self->{_problems_reported} = 1;
952 $self->throw("PROBLEM, SEVERITY==$severity\n$desc");
954 $self->add_problem([$severity, $desc]);
955 return;
958 =head2 report_problems
960 Title : report_problems
961 Usage : $unflattener->report_problems(\*STDERR);
962 Function:
963 Example :
964 Returns :
965 Args : FileHandle (defaults to STDERR)
968 =cut
970 sub report_problems{
971 my ($self, $fh) = @_;
973 if (!$fh) {
974 $fh = \*STDERR;
976 foreach my $problem ($self->get_problems) {
977 my ($sev, $desc) = @$problem;
978 printf $fh "PROBLEM, SEVERITY==$sev\n$desc\n";
980 $self->{_problems_reported} = 1;
981 return;
984 =head2 ignore_problems
986 Title : ignore_problems
987 Usage : $obj->ignore_problems();
988 Function:
989 Example :
990 Returns :
991 Args :
993 Unflattener is very particular about problems it finds along the
994 way. If you have set the error_threshold such that less severe
995 problems do not cause exceptions, Unflattener still expects you to
996 report_problems() at the end, so that the user of the module is aware
997 of any inconsistencies or problems with the data. In fact, a warning
998 will be produced if there are unreported problems. To silence, this
999 warning, call the ignore_problems() method before the Unflattener
1000 object is destroyed.
1002 =cut
1004 sub ignore_problems{
1005 my ($self) = @_;
1006 $self->{_ignore_problems} = 1;
1007 return;
1011 =head2 error_threshold
1013 Title : error_threshold
1014 Usage : $obj->error_threshold($severity)
1015 Function:
1016 Example :
1017 Returns : value of error_threshold (a scalar)
1018 Args : on set, new value (an integer)
1020 Sets the threshold above which errors cause this module to throw an
1021 exception. The default is 0; all problems with a severity E<gt> 0 will
1022 cause an exception.
1024 If you raise the threshold to 1, then the unflattening process will be
1025 more lax; problems of severity==1 are generally non-fatal, but may
1026 indicate that the results should be inspected, for example, to make
1027 sure there is no data loss.
1029 =cut
1031 sub error_threshold{
1032 my $self = shift;
1034 return $self->{'error_threshold'} = shift if @_;
1035 return $self->{'error_threshold'} || 0;
1040 # PRIVATE
1042 # given a type (eg mRNA), will return the container type (eg gene)
1043 sub get_container_type{
1044 my ($self,$type) = @_;
1045 my @roots = $self->_get_partonomy_roots;
1046 if (grep {$_ eq $type} @roots) {
1047 # it is a root - no parents/containers
1048 return;
1050 my $ch = $self->partonomy;
1051 my $ctype = $ch->{$type};
1052 if (!$ctype) {
1053 # asterix acts as a wild card
1054 $ctype = $ch->{'*'};
1056 return $ctype;
1059 # get root node of partonomy hierarchy (usually gene)
1060 sub _get_partonomy_roots {
1061 my $self = shift;
1062 my $ch = $self->partonomy;
1063 my @parents = values %$ch;
1064 # find parents that do not have parents themselves
1065 return grep {!$ch->{$_}} @parents;
1070 =head2 unflatten_seq
1072 Title : unflatten_seq
1073 Usage : @sfs = $unflattener->unflatten_seq($seq);
1074 Function: turns a flat list of features into a list of holder features
1075 Example :
1076 Returns : list of Bio::SeqFeatureI objects
1077 Args : see below
1079 partitions a list of features then arranges them in a nested tree; see
1080 above for full explanation.
1082 note - the Bio::SeqI object passed in will be modified
1084 Arguments
1086 -seq : a Bio::SeqI object; must contain Bio::SeqFeatureI objects
1087 (this is optional if seq has already been set)
1089 -use_magic: if TRUE (ie non-zero) then magic will be invoked;
1090 see discussion above.
1092 -resolver_method: a CODE reference
1093 see the documentation above for an example of
1094 a subroutine that can be used to resolve hierarchies
1095 within groups.
1097 this is optional - if nothing is supplied, a default
1098 subroutine will be used (see below)
1100 -group_tag: a string
1101 [ see the group_tag() method ]
1102 this overrides the default group_tag which is 'gene'
1106 =cut
1108 sub unflatten_seq{
1109 my ($self,@args) = @_;
1111 my($seq, $resolver_method, $group_tag, $partonomy,
1112 $structure_type, $resolver_tag, $use_magic, $noinfer) =
1113 $self->_rearrange([qw(SEQ
1114 RESOLVER_METHOD
1115 GROUP_TAG
1116 PARTONOMY
1117 STRUCTURE_TYPE
1118 RESOLVER_TAG
1119 USE_MAGIC
1120 NOINFER
1122 @args);
1124 # seq we want to unflatten
1125 $seq = $seq || $self->seq;
1126 if (!$self->seq) {
1127 $self->seq($seq);
1131 # prevent bad argument combinations
1132 if ($partonomy &&
1133 defined($structure_type)) {
1134 $self->throw("You cannot set both -partonomy and -structure_type\n".
1135 "(the former is implied by the latter)");
1138 # remember the current value of partonomy, to reset later
1139 my $old_partonomy = $self->partonomy;
1140 $self->partonomy($partonomy) if defined $partonomy;
1142 # remember old structure_type
1143 my $old_structure_type = $self->structure_type;
1144 $self->structure_type($structure_type) if defined $structure_type;
1146 # if we are sourcing our data from genbank, all the
1147 # features should be flat (eq no sub_SeqFeatures)
1148 my @flat_seq_features = $seq->get_SeqFeatures;
1149 my @all_seq_features = $seq->get_all_SeqFeatures;
1151 # sanity checks
1152 if (@all_seq_features > @flat_seq_features) {
1153 $self->throw("It looks as if this sequence has already been unflattened");
1155 if (@all_seq_features < @flat_seq_features) {
1156 $self->throw("ASSERTION ERROR: something is seriously wrong with your features");
1159 # tag for ungrouping; usually /gene or /locus_tag
1160 # for example: /gene="foo"
1161 $group_tag = $group_tag || $self->group_tag;
1162 if ($use_magic) {
1163 # use magic to guess the group tag
1164 my @sfs_with_locus_tag =
1165 grep {$_->has_tag("locus_tag")} @flat_seq_features;
1166 my @sfs_with_gene_tag =
1167 grep {$_->has_tag("gene")} @flat_seq_features;
1168 my @sfs_with_product_tag =
1169 grep {$_->has_tag("product")} @flat_seq_features;
1171 # if ($group_tag && $self->{'trust_grouptag'}) { # dgg suggestion
1174 # elsif
1175 if (@sfs_with_locus_tag) {
1176 # dgg note: would like to -use_magic with -group_tag = 'gene' for ensembl genomes
1177 # where ensembl gene FT have both /locus_tag and /gene, but mRNA, CDS have /gene only
1178 if ($group_tag && $group_tag ne 'locus_tag') {
1179 $self->throw("You have explicitly set group_tag to be '$group_tag'\n".
1180 "However, I detect that some features use /locus_tag\n".
1181 "I believe that this is the correct group_tag to use\n".
1182 "You can resolve this by either NOT setting -group_tag\n".
1183 "OR you can unset -use_magic to regain control");
1186 # use /locus_tag instead of /gene tag for grouping
1187 # see GenBank entry AE003677 (version 3) for an example
1188 $group_tag = 'locus_tag';
1189 if ($self->verbose > 0) {
1190 warn "Set group tag to: $group_tag\n";
1194 # on rare occasions, records will have no /gene or /locus_tag
1195 # but it WILL have /product tags. These serve the same purpose
1196 # for grouping. For an example, see AY763288 (also in t/data)
1197 if (@sfs_with_locus_tag==0 &&
1198 @sfs_with_gene_tag==0 &&
1199 @sfs_with_product_tag>0 &&
1200 !$group_tag) {
1201 $group_tag = 'product';
1202 if ($self->verbose > 0) {
1203 warn "Set group tag to: $group_tag\n";
1208 if (!$group_tag) {
1209 $group_tag = 'gene';
1212 # ------------------------------
1213 # GROUP FEATURES using $group_tag
1214 # collect features into unstructured groups
1215 # ------------------------------
1217 # -------------
1218 # we want to generate a list of groups;
1219 # each group is a list of SeqFeatures; this
1220 # group probably (but not necessarily)
1221 # corresponds to a gene model.
1223 # this array will look something like this:
1224 # ([$f1], [$f2, $f3, $f4], ...., [$f97, $f98, $f99])
1226 # there are also 'singleton' groups, with one member.
1227 # for instance, the 'source' feature is in a singleton group;
1228 # the same with others such as 'misc_feature'
1229 my @groups = ();
1230 # -------------
1232 # --------------------
1233 # we hope that the genbank record allows us to group by some grouping
1234 # tag.
1235 # for instance, most of the time a gene model can be grouped using
1236 # the gene tag - that is where you see
1237 # /gene="foo"
1238 # in a genbank record
1239 # --------------------
1241 # keep an index of groups by their
1242 # grouping tag
1243 my %group_by_tag = ();
1246 # iterate through all features, putting them into groups
1247 foreach my $sf (@flat_seq_features) {
1248 if (!$sf->has_tag($group_tag)) {
1249 # SINGLETON
1250 # this is an ungroupable feature;
1251 # add it to a group of its own
1252 push(@groups, [$sf]);
1254 else {
1255 # NON-SINGLETON
1256 my @group_tagvals = $sf->get_tag_values($group_tag);
1257 if (@group_tagvals > 1) {
1258 # sanity check:
1259 # currently something can only belong to one group
1260 $self->problem(2,
1261 ">1 value for /$group_tag: @group_tagvals\n".
1262 "At this time this module is not equipped to handle this adequately", $sf);
1264 # get value of group tag
1265 my $gtv = shift @group_tagvals;
1266 $gtv || $self->throw("Empty /$group_tag vals not allowed!");
1268 # is this a new group?
1269 my $group = $group_by_tag{$gtv};
1270 if ($group) {
1271 # this group has been encountered before - add current
1272 # sf to the end of the group
1273 push(@$group, $sf);
1275 else {
1276 # new group; add to index and create new group
1277 $group = [$sf]; # currently one member; probably more to come
1278 $group_by_tag{$gtv} = $group;
1279 push(@groups, $group);
1284 # as well as having the same group_tag, a group should be spatially
1285 # connected. if not, then the group should be split into subgroups.
1286 # this turns out to be necessary in the case of multicopy genes.
1287 # the standard way to represent these is as spatially disconnected
1288 # gene models (usually a 'gene' feature and some kind of RNA feature)
1289 # with the same group tag; the code below will split these into
1290 # seperate groups, one per copy.
1291 @groups = map { $self->_split_group_if_disconnected($_) } @groups;
1293 # remove any duplicates; most of the time the method below has
1294 # no effect. there are some unusual genbank records for which
1295 # duplicate removal is necessary. see the comments in the
1296 # _remove_duplicates_from_group() method if you want to know
1297 # the ugly details
1298 foreach my $group (@groups) {
1299 $self->_remove_duplicates_from_group($group);
1304 # PSEUDOGENES, PSEUDOEXONS AND PSEUDOINTRONS
1305 # these are indicated with the /pseudo tag
1306 # these are mapped to a different type; they should NOT
1307 # be treated as normal genes
1308 foreach my $sf (@all_seq_features) {
1309 if ($sf->has_tag('pseudo')) {
1310 my $type = $sf->primary_tag;
1311 # SO type is typically the same as the normal
1312 # type but preceeded by "pseudo"
1313 if ($type eq 'misc_RNA' || $type eq 'mRNA') {
1314 # dgg: see TypeMapper; both pseudo mRNA,misc_RNA should be pseudogenic_transcript
1315 $sf->primary_tag("pseudotranscript");
1317 else {
1318 $sf->primary_tag("pseudo$type");
1322 # now some of the post-processing that follows which applies to
1323 # genes will NOT be applied to pseudogenes; this is deliberate
1324 # for example, gene models are normalised to be gene-transcript-exon
1325 # for pseudogenes we leave them as pseudogene-pseudoexon
1327 # --- MAGIC ---
1328 my $need_to_infer_exons = 0;
1329 my $need_to_infer_mRNAs = 0;
1330 my @removed_exons = ();
1331 if ($use_magic) {
1332 if (defined($structure_type)) {
1333 $self->throw("Can't combine use_magic AND setting structure_type");
1335 my $n_introns =
1336 scalar(grep {$_->primary_tag eq 'exon'} @flat_seq_features);
1337 my $n_exons =
1338 scalar(grep {$_->primary_tag eq 'exon'} @flat_seq_features);
1339 my $n_mrnas =
1340 scalar(grep {$_->primary_tag eq 'mRNA'} @flat_seq_features);
1341 my $n_mrnas_attached_to_gene =
1342 scalar(grep {$_->primary_tag eq 'mRNA' &&
1343 $_->has_tag($group_tag)} @flat_seq_features);
1344 my $n_cdss =
1345 scalar(grep {$_->primary_tag eq 'CDS'} @flat_seq_features);
1346 my $n_rnas =
1347 scalar(grep {$_->primary_tag =~ /RNA/} @flat_seq_features);
1348 # Are there any CDS features in the record?
1349 if ($n_cdss > 0) {
1350 # YES
1352 # - a pc gene model should contain at the least a CDS
1354 # Are there any mRNA features in the record?
1355 if ($n_mrnas == 0) {
1356 # NO mRNAs:
1357 # looks like structure_type == 1
1358 $structure_type = 1;
1359 $need_to_infer_mRNAs = 1;
1361 elsif ($n_mrnas_attached_to_gene == 0) {
1362 # $n_mrnas > 0
1363 # $n_mrnas_attached_to_gene = 0
1365 # The entries _do_ contain mRNA features,
1366 # but none of them are part of a group/gene, i.e. they
1367 # are 'floating'
1369 # this is an annoying weird file that has some floating
1370 # mRNA features;
1371 # eg ftp.ncbi.nih.gov/genomes/Schizosaccharomyces_pombe/
1373 if ($self->verbose) {
1374 my @floating_mrnas =
1375 grep {$_->primary_tag eq 'mRNA' &&
1376 !$_->has_tag($group_tag)} @flat_seq_features;
1377 printf STDERR "Unattached mRNAs:\n";
1378 foreach my $mrna (@floating_mrnas) {
1379 $self->_write_sf_detail($mrna);
1381 printf STDERR "Don't know how to deal with these; filter at source?\n";
1384 foreach (@flat_seq_features) {
1385 if ($_->primary_tag eq 'mRNA') {
1386 # what should we do??
1388 # I think for pombe we just have to filter
1389 # out bogus mRNAs prior to starting
1393 # looks like structure_type == 2
1394 $structure_type = 2;
1395 $need_to_infer_mRNAs = 1;
1397 else {
1400 # we always infer exons in magic mode
1401 $need_to_infer_exons = 1;
1403 else {
1404 # this doesn't seem to be any kind of protein coding gene model
1405 if ( $n_rnas > 0 ) {
1406 $need_to_infer_exons = 1;
1410 $need_to_infer_exons = 0 if $noinfer; #NML
1412 if ($need_to_infer_exons) {
1413 # remove exons and introns from group -
1414 # we will infer exons later, and we
1415 # can always infer introns from exons
1416 foreach my $group (@groups) {
1417 @$group =
1418 grep {
1419 my $type = $_->primary_tag();
1420 if ($type eq 'exon') {
1421 # keep track of all removed exons,
1422 # so we can do a sanity check later
1423 push(@removed_exons, $_);
1425 $type ne 'exon' && $type ne 'intron'
1426 } @$group;
1428 # get rid of any groups that have zero members
1429 @groups = grep {scalar(@$_)} @groups;
1432 # --- END OF MAGIC ---
1434 # LOGICAL ASSERTION
1435 if (grep {!scalar(@$_)} @groups) {
1436 $self->throw("ASSERTION ERROR: empty group");
1439 # LOGGING
1440 if ($self->verbose > 0) {
1441 printf STDERR "GROUPS:\n";
1442 foreach my $group (@groups) {
1443 $self->_write_group($group, $group_tag);
1448 # --------- FINISHED GROUPING -------------
1451 # TYPE CONTAINMENT HIERARCHY (aka partonomy)
1452 # set the containment hierarchy if desired
1453 # see docs for structure_type() method
1454 if ($structure_type) {
1455 if ($structure_type == 1) {
1456 $self->partonomy(
1457 {CDS => 'gene',
1458 exon => 'CDS',
1459 intron => 'CDS',
1463 else {
1464 $self->throw("structure_type $structure_type is currently unknown");
1468 # see if we have an obvious resolver_tag
1469 if ($use_magic) {
1470 foreach my $sf (@all_seq_features) {
1471 if ($sf->has_tag('derived_from')) {
1472 $resolver_tag = 'derived_from';
1477 if ($use_magic) {
1478 # point all feature types without a container type to the root type.
1480 # for example, if we have an unanticipated feature_type, say
1481 # 'aberration', this should by default point to the parent 'gene'
1482 foreach my $group (@groups) {
1483 my @sfs = @$group;
1484 if (@sfs > 1) {
1485 foreach my $sf (@sfs) {
1486 my $type = $sf->primary_tag;
1487 next if $type eq 'gene';
1488 my $container_type = $self->get_container_type($type);
1489 if (!$container_type) {
1490 $self->partonomy->{$type} = 'gene';
1497 # we have done the first part of the unflattening.
1498 # we now have a list of groups; each group is a list of seqfeatures.
1499 # the actual group itself is flat; we may want to unflatten this further;
1500 # for instance, a gene model can contain multiple mRNAs and CDSs. We may want
1501 # to link the correct mRNA to the correct CDS via the bioperl sub_SeqFeature tree.
1503 # what we would end up with would be
1504 # gene1
1505 # mRNA-a
1506 # CDS-a
1507 # mRNA-b
1508 # CDS-b
1509 my @top_sfs = $self->unflatten_groups(-groups=>\@groups,
1510 -resolver_method=>$resolver_method,
1511 -resolver_tag=>$resolver_tag);
1513 # restore settings
1514 $self->partonomy($old_partonomy);
1516 # restore settings
1517 $self->structure_type($old_structure_type);
1519 # modify the original Seq object - the top seqfeatures are now
1520 # the top features from each group
1521 $seq->remove_SeqFeatures;
1522 $seq->add_SeqFeature($_) foreach @top_sfs;
1524 # --------- FINISHED UNFLATTENING -------------
1526 # lets see if there are any post-unflattening tasks we need to do
1530 # INFERRING mRNAs
1531 if ($need_to_infer_mRNAs) {
1532 if ($self->verbose > 0) {
1533 printf STDERR "** INFERRING mRNA from CDS\n";
1535 $self->infer_mRNA_from_CDS(-seq=>$seq, -noinfer=>$noinfer);
1538 # INFERRING exons
1539 if ($need_to_infer_exons) {
1541 # infer exons, one group/gene at a time
1542 foreach my $sf (@top_sfs) {
1543 my @sub_sfs = ($sf, $sf->get_all_SeqFeatures);
1544 $self->feature_from_splitloc(-features=>\@sub_sfs);
1547 # some exons are stated explicitly; ie there is an "exon" feature
1548 # most exons are inferred; ie there is a "mRNA" feature with
1549 # split locations
1551 # if there were exons explicitly stated in the entry, we need to
1552 # do two things:
1554 # make sure these exons are consistent with the inferred exons
1555 # (you never know)
1557 # transfer annotation (tag-vals) from the explicit exon to the
1558 # new inferred exon
1559 if (@removed_exons) {
1560 my @allfeats = $seq->get_all_SeqFeatures;
1562 # find all the inferred exons that are children of mRNA
1563 my @mrnas = grep {$_->primary_tag eq 'mRNA'} @allfeats;
1564 my @exons =
1565 grep {$_->primary_tag eq 'exon'}
1566 map {$_->get_SeqFeatures} @mrnas;
1568 my %exon_h = (); # index of exons by location;
1570 # there CAN be >1 exon at a location; we can represent these redundantly
1571 # (ie as a tree, not a graph)
1572 push(@{$exon_h{$self->_locstr($_)}}, $_) foreach @exons;
1573 my @problems = (); # list of problems;
1574 # each problem is a
1575 # [$severity, $description] pair
1576 my $problem = '';
1577 my ($n_exons, $n_removed_exons) =
1578 (scalar(keys %exon_h), scalar(@removed_exons));
1579 foreach my $removed_exon (@removed_exons) {
1580 my $locstr = $self->_locstr($removed_exon);
1581 my $inferred_exons = $exon_h{$locstr};
1582 delete $exon_h{$locstr};
1583 if ($inferred_exons) {
1584 my %exons_done = ();
1585 foreach my $exon (@$inferred_exons) {
1587 # make sure we don't move stuff twice
1588 next if $exons_done{$exon};
1589 $exons_done{$exon} = 1;
1591 # we need to tranfer any tag-values from the explicit
1592 # exon to the implicit exon
1593 foreach my $tag ($removed_exon->get_all_tags) {
1594 my @vals = $removed_exon->get_tag_values($tag);
1595 if (!$exon->can("add_tag_value")) {
1596 # I'm puzzled as to what should be done here;
1597 # SeqFeatureIs are not necessarily mutable,
1598 # but we know that in practice the implementing
1599 # class is mutable
1600 $self->throw("The SeqFeature object does not ".
1601 "implement add_tag_value()");
1603 $exon->add_tag_value($tag, @vals);
1607 else {
1608 # no exons inferred at $locstr
1609 push(@problems,
1610 [1,
1611 "there is a conflict with exons; there was an explicitly ".
1612 "stated exon with location $locstr, yet I cannot generate ".
1613 "this exon from the supplied mRNA locations\n"]);
1616 # do we have any inferred exons left over, that were not
1617 # covered in the explicit exons?
1618 if (keys %exon_h) {
1619 # TODO - we ignore this problem for now
1620 push(@problems,
1622 sprintf("There are some inferred exons that are not in the ".
1623 "explicit exon list; they are the exons at locations:\n".
1624 join("\n", keys %exon_h)."\n")]);
1627 # report any problems
1628 if (@problems) {
1629 my $thresh = $self->error_threshold;
1630 my @bad_problems = grep {$_->[0] > $thresh} @problems;
1631 if (@bad_problems) {
1632 printf STDERR "PROBLEM:\n";
1633 $self->_write_hier(\@top_sfs);
1634 # TODO - allow more fine grained control over this
1635 $self->{_problems_reported} = 1;
1636 $self->throw(join("\n",
1637 map {"@$_"} @bad_problems));
1639 $self->problem(@$_) foreach @problems;
1643 # --- end of inferring exons --
1645 # return new top level features; this can also
1646 # be retrieved via
1647 # $seq->get_SeqFeatures();
1648 # return @top_sfs;
1649 return $seq->get_SeqFeatures;
1652 # _split_group_if_disconnected([@sfs])
1654 # as well as having the same group_tag, a group should be spatially
1655 # connected. if not, then the group should be split into subgroups.
1656 # this turns out to be necessary in the case of multicopy genes.
1657 # the standard way to represent these is as spatially disconnected
1658 # gene models (usually a 'gene' feature and some kind of RNA feature)
1659 # with the same group tag; the code below will split these into
1660 # seperate groups, one per copy.
1662 sub _split_group_if_disconnected {
1663 my $self = shift;
1664 my $group = shift;
1665 my @sfs = @$group;
1666 my @ranges =
1667 Bio::Range->disconnected_ranges(@sfs);
1668 my @groups;
1669 if (@ranges == 0) {
1670 $self->throw("ASSERTION ERROR");
1672 elsif (@ranges == 1) {
1673 # no need to split the group
1674 @groups = ($group);
1676 else {
1677 # @ranges > 1
1678 # split the group into disconnected ranges
1679 if ($self->verbose > 0) {
1680 printf STDERR "GROUP PRE-SPLIT:\n";
1681 $self->_write_group($group, $self->group_tag);
1683 @groups =
1684 map {
1685 my $range = $_;
1686 [grep {
1687 $_->intersection($range);
1688 } @sfs]
1689 } @ranges;
1690 if ($self->verbose > 0) {
1691 printf STDERR "SPLIT GROUPS:\n";
1692 $self->_write_group($_, $self->group_tag) foreach @groups;
1695 return @groups;
1698 sub _remove_duplicates_from_group {
1699 my $self = shift;
1700 my $group = shift;
1702 # ::: WEIRD BOUNDARY CASE CODE :::
1703 # for some reason, there are some gb records with two gene
1704 # features for one gene; for example, see ATF14F8.gbk
1705 # in the t/data directory
1707 # in this case, we get rid of one of the genes
1709 my @genes = grep {$_->primary_tag eq 'gene'} @$group;
1710 if (@genes > 1) {
1711 # OK, if we look at ATF14F8.gbk we see that some genes
1712 # just exist as a single location, some exist as a multisplit location;
1714 # eg
1716 # gene 16790..26395
1717 # /gene="F14F8_60"
1718 # ...
1719 # gene complement(join(16790..19855,20136..20912,21378..21497,
1720 # 21654..21876,22204..22400,22527..23158,23335..23448,
1721 # 23538..23938,24175..24536,24604..24715,24889..24984,
1722 # 25114..25171,25257..25329,25544..25589,25900..26018,
1723 # 26300..26395))
1724 # /gene="F14F8_60"
1726 # the former is the 'standard' way of representing the gene in genbank;
1727 # the latter is redundant with the CDS entry. So we shall get rid of
1728 # the latter with the following filter
1730 if ($self->verbose > 0) {
1731 printf STDERR "REMOVING DUPLICATES:\n";
1734 @genes =
1735 grep {
1736 my $loc = $_->location;
1737 if ($loc->isa("Bio::Location::SplitLocationI")) {
1738 my @locs = $loc->each_Location;
1739 if (@locs > 1) {
1742 else {
1746 else {
1749 } @genes;
1751 if (@genes > 1) {
1752 # OK, that didn't work. Our only resort is to just pick one at random
1753 @genes = ($genes[0]);
1755 if (@genes) {
1756 @genes == 1 || $self->throw("ASSERTION ERROR");
1757 @$group =
1758 ($genes[0], grep {$_->primary_tag ne 'gene'} @$group);
1761 # its a dirty job but someone's gotta do it
1762 return;
1766 =head2 unflatten_groups
1768 Title : unflatten_groups
1769 Usage :
1770 Function: iterates over groups, calling unflatten_group() [see below]
1771 Example :
1772 Returns : list of Bio::SeqFeatureI objects that are holders
1773 Args : see below
1775 Arguments
1777 -groups: list of list references; inner list is of Bio::SeqFeatureI objects
1778 e.g. ( [$sf1], [$sf2, $sf3, $sf4], [$sf5, ...], ...)
1780 -resolver_method: a CODE reference
1781 see the documentation above for an example of
1782 a subroutine that can be used to resolve hierarchies
1783 within groups.
1785 this is optional - a default subroutine will be used
1788 NOTE: You should not need to call this method, unless you want fine
1789 grained control over how the unflattening process.
1791 =cut
1793 sub unflatten_groups{
1794 my ($self,@args) = @_;
1795 my($groups, $resolver_method, $resolver_tag) =
1796 $self->_rearrange([qw(GROUPS
1797 RESOLVER_METHOD
1798 RESOLVER_TAG
1800 @args);
1802 # this is just a simple wrapper for unflatten_group()
1803 return
1804 map {
1805 $self->unflatten_group(-group=>$_,
1806 -resolver_method=>$resolver_method,
1807 -resolver_tag=>$resolver_tag)
1808 } @$groups;
1811 =head2 unflatten_group
1813 Title : unflatten_group
1814 Usage :
1815 Function: nests a group of features into a feature containment hierarchy
1816 Example :
1817 Returns : Bio::SeqFeatureI objects that holds other features
1818 Args : see below
1820 Arguments
1822 -group: reference to list of Bio::SeqFeatureI objects
1824 -resolver_method: a CODE reference
1825 see the documentation above for an example of
1826 a subroutine that can be used to resolve hierarchies
1827 within groups
1829 this is optional - a default subroutine will be used
1832 NOTE: You should not need to call this method, unless you want fine
1833 grained control over how the unflattening process.
1835 =cut
1837 sub unflatten_group{
1838 my ($self,@args) = @_;
1840 my($group, $resolver_method, $resolver_tag) =
1841 $self->_rearrange([qw(GROUP
1842 RESOLVER_METHOD
1843 RESOLVER_TAG
1845 @args);
1847 if ($self->verbose > 0) {
1848 printf STDERR "UNFLATTENING GROUP:\n";
1849 $self->_write_group($group, $self->group_tag);
1852 my @sfs = @$group;
1854 # we can safely ignore singletons (e.g. [source])
1855 return $sfs[0] if @sfs == 1;
1857 my $partonomy = $self->partonomy;
1859 # $resolver_method is a reference to a SUB that will resolve
1860 # ambiguous parent/child containment; for example, determining
1861 # which mRNAs go with which CDSs
1862 $resolver_method = $resolver_method || \&_resolve_container_for_sf;
1864 # TAG BASED RESOLVING OF HIERARCHIES
1866 # if the user specifies $resolver_tag, then we use this tag
1867 # to pair up ambiguous parents and children;
1869 # for example, the CDS feature may have a resolver tag of /derives_from
1870 # which is a 'foreign key' into the /label tag of the mRNA feature
1872 # this kind of tag-based resolution is possible for a certain subset
1873 # of genbank records
1875 # if no resolver tag is specified, we revert to the normal
1876 # resolver_method
1877 if ($resolver_tag) {
1878 my $backup_resolver_method = $resolver_method;
1879 # closure: $resolver_tag is remembered by this sub
1880 my $sub =
1881 sub {
1882 my ($self, $sf, @possible_container_sfs) = @_;
1883 my @container_sfs = ();
1884 if ($sf->has_tag($resolver_tag)) {
1885 my ($resolver_tagval) = $sf->get_tag_values($resolver_tag);
1886 # if a feature has a resolver_tag (e.g. /derives_from)
1887 # this specifies the /product, /symbol or /label for the
1888 # parent feature
1889 @container_sfs =
1890 grep {
1891 my $match = 0;
1892 $self->_write_sf($_) if $self->verbose > 0;
1893 foreach my $tag (qw(product symbol label)) {
1894 if ($_->has_tag($tag)) {
1895 my @vals =
1896 $_->get_tag_values($tag);
1897 if (grep {$_ eq $resolver_tagval} @vals) {
1898 $match = 1;
1899 last;
1903 $match;
1904 } @possible_container_sfs;
1906 else {
1907 return $backup_resolver_method->($sf, @possible_container_sfs);
1909 return map {$_=>0} @container_sfs;
1911 $resolver_method = $sub;
1913 else {
1914 # CONDITION: $resolver_tag is NOT set
1915 $self->throw("assertion error") if $resolver_tag;
1917 # we have now set $resolver_method to a subroutine for
1918 # disambiguatimng parent/child relationships. we will
1919 # now build the whole containment hierarchy for this group
1922 # FIND TOP/ROOT SEQFEATURES
1924 # find all the features for which there is no
1925 # containing feature type (eg genes)
1926 my @top_sfs =
1927 grep {
1928 !$self->get_container_type($_->primary_tag);
1929 } @sfs;
1931 # CONDITION: there must be at most one root
1932 if (@top_sfs > 1) {
1933 $self->_write_group($group, $self->group_tag);
1934 printf STDERR "TOP SFS:\n";
1935 $self->_write_sf($_) foreach @top_sfs;
1936 $self->throw("multiple top-sfs in group");
1938 my $top_sf = $top_sfs[0];
1940 # CREATE INDEX OF SEQFEATURES BY TYPE
1941 my %sfs_by_type = ();
1942 foreach my $sf (@sfs) {
1943 push(@{$sfs_by_type{$sf->primary_tag}}, $sf);
1946 # containment index; keyed by child; lookup parent
1947 # note: this index uses the stringified object reference of
1948 # the object as a surrogate lookup key
1950 my %container = (); # child -> parent
1952 # ALGORITHM: build containment graph
1954 # find all possible containers for each SF;
1955 # for instance, for a CDS, the possible containers are all
1956 # the mRNAs in the same group. For a mRNA, the possible
1957 # containers are any SFs of type 'gene' (should only be 1).
1958 # (these container-type mappings can be overridden)
1960 # contention is resolved by checking coordinates of splice sites
1961 # (this is the default, but can be overridden)
1963 # most of the time, there is no problem identifying a unique
1964 # parent for every child; this can be ambiguous when constructing
1965 # CDS to mRNA relationships with lots of alternate splicing
1967 # a hash of child->parent relationships is constructed (%container)
1968 # any mappings that need further resolution (eg CDS to mRNA) are
1969 # placed in %unresolved
1971 # %unresolved index
1972 # (keyed by stringified object reference of child seqfeature)
1973 my %unresolved = (); # child -> [parent,score] to be resolved
1975 # index of seqfeatures by their stringified object reference;
1976 # this is essentially a way of 'reviving' an object from its stringified
1977 # reference
1978 # (see NOTE ON USING OBJECTS AS KEYS IN HASHES, below)
1979 my %idxsf = map {$_=>$_} @sfs;
1981 foreach my $sf (@sfs) {
1982 my $type = $sf->primary_tag;
1984 # container type (e.g. the container type for CDS is usually mRNA)
1985 my $container_type =
1986 $self->get_container_type($type);
1987 if ($container_type) {
1989 my @possible_container_sfs =
1990 @{$sfs_by_type{$container_type} || []};
1991 # we now have a list of possible containers
1992 # (eg for a CDS in an alternately spliced gene, this
1993 # would be a list of all the mRNAs for this gene)
1995 if (!@possible_container_sfs) {
1996 # root of hierarchy
1998 else {
1999 if (@possible_container_sfs == 1) {
2000 # this is the easy situation, whereby the containment
2001 # hierarchy is unambiguous. this will probably be the
2002 # case if the genbank record has no alternate splicing
2003 # within it
2005 # ONE OPTION ONLY - resolved!
2006 $container{$sf} = $possible_container_sfs[0];
2009 else {
2010 # MULTIPLE CONTAINER CHOICES
2011 $self->throw("ASSERTION ERROR") unless @possible_container_sfs > 1;
2013 # push this onto the %unresolved graph, and deal with it
2014 # later
2016 # for now we hardcode things such that the only type
2017 # with ambiguous parents is a CDS; if this is violated,
2018 # it has a weak problem class of '1' so the API user
2019 # can easily set things to ignore these
2020 if ($sf->primary_tag ne 'CDS') {
2021 $self->problem(1,
2022 "multiple container choice for non-CDS; ".
2023 "CDS to mRNA should be the only ".
2024 "relationships requiring resolving",
2025 $sf);
2028 # previously we set the SUB $resolver_method
2029 $self->throw("ASSERTION ERROR")
2030 unless $resolver_method;
2032 # $resolver_method will assign scores to
2033 # parent/child combinations; later on we
2034 # will use these scores to find the optimal
2035 # parent/child pairings
2037 # the default $resolver_method uses splice sites to
2038 # score possible parent/child matches
2040 my %container_sfh =
2041 $resolver_method->($self, $sf, @possible_container_sfs);
2042 if (!%container_sfh) {
2043 $self->problem(2,
2044 "no containers possible for SeqFeature of ".
2045 "type: $type; this SF is being placed at ".
2046 "root level",
2047 $sf);
2048 # RESOLVED! (sort of - placed at root/gene level)
2049 $container{$sf} = $top_sf;
2051 # this sort of thing happens if the record is
2052 # badly messed up and there is absolutely no indication
2053 # of where to put the CDS. Perhaps we should just
2054 # place it with a random mRNA?
2056 foreach my $jsf (keys %container_sfh) {
2058 # add [score, parent] pairs to the %unresolved
2059 # lookup table/graph
2060 push(@{$unresolved{$sf}},
2061 [$idxsf{$jsf}, $container_sfh{$jsf} || 0]);
2066 else {
2067 # CONDITION:
2068 # not container type for $sf->primary_tag
2070 # CONDITION:
2071 # $sf must be a root/top node (eg gene)
2075 if (0) {
2077 # CODE CURRENTLY DISABLED
2079 # we require a 1:1 mapping between mRNAs and CDSs;
2080 # create artificial duplicates if we can't do this...
2081 if (%unresolved) {
2082 my %childh = map {$_=>1} keys %unresolved;
2083 my %parenth = map {$_->[0]=>1} map {@$_} values %unresolved;
2084 if ($self->verbose > 0) {
2085 printf STDERR "MATCHING %d CHILDREN TO %d PARENTS\n",
2086 scalar(keys %childh), scalar(keys %parenth);
2088 # 99.99% of the time in genbank genomic record of structure type 0, we
2089 # see one CDS for every mRNA; one exception is the S Pombe
2090 # genome, which is all CDS, bar a few spurious mRNAs; we have to
2091 # filter out the spurious mRNAs in this case
2093 # another strange case is in the mouse genome, NT_078847.1
2094 # for Pcdh13 you will notice there is 4 mRNAs and 5 CDSs.
2095 # most unusual!
2096 # I'm at a loss for a really clever thing to do here. I think the
2097 # best thing is to create duplicate features to preserve the 1:1 mapping
2098 # my $suffix_id = 1;
2099 # while (keys %childh > keys %parenth) {
2105 # DEBUGGING CODE
2106 if ($self->verbose > 0 && scalar(keys %unresolved)) {
2107 printf STDERR "UNRESOLVED PAIRS:\n";
2108 foreach my $childsf (keys %unresolved) {
2109 my @poss = @{$unresolved{$childsf}};
2110 foreach my $p (@poss) {
2111 my $parentsf = $p->[0];
2112 $childsf = $idxsf{$childsf};
2113 my @clabels = ($childsf->get_tagset_values(qw(protein_id label product)), "?");
2114 my @plabels = ($parentsf->get_tagset_values(qw(transcript_id label product)), "?");
2115 printf STDERR
2116 (" PAIR: $clabels[0] => $plabels[0] (of %d)\n",
2117 scalar(@poss));
2120 } # -- end of verbose
2122 # Now we have to fully resolve the containment hierarchy; remember,
2123 # the graph %container has the fully resolved child->parent links;
2125 # the graph %unresolved is keyed by children missing parents; we
2126 # need to put all these orphans in the %container graph
2128 # we do this using the scores in %unresolved, with the
2129 # find_best_matches() algorithm
2130 my $unresolved_problem_reported = 0;
2131 if (%unresolved) {
2132 my $new_pairs =
2133 $self->find_best_matches(\%unresolved, []);
2134 if (!$new_pairs) {
2135 my ($g) = $sfs[0]->get_tagset_values($self->group_tag || 'gene');
2136 $self->problem(2,
2137 "Could not resolve hierarchy for $g");
2138 $new_pairs = [];
2139 $unresolved_problem_reported = 1;
2141 foreach my $pair (@$new_pairs) {
2142 if ($self->verbose > 0) {
2143 printf STDERR " resolved pair @$pair\n";
2145 $container{$pair->[0]} = $pair->[1];
2146 delete $unresolved{$pair->[0]};
2150 # CONDITION: containment hierarchy resolved
2151 if (%unresolved) {
2152 $self->throw("UNRESOLVED: %unresolved")
2153 unless $unresolved_problem_reported;
2156 # make nested SeqFeature hierarchy from @containment_pairs
2157 # ie put child SeqFeatures into parent SeqFeatures
2158 my @top = ();
2159 foreach my $sf (@sfs) {
2160 my $container_sf = $container{$sf};
2161 if ($container_sf) {
2162 # make $sf nested inside $container_sf
2164 # first check if the container spatially contains the containee
2165 if ($container_sf->contains($sf)) {
2166 # add containee
2167 $container_sf->add_SeqFeature($sf);
2169 else {
2170 # weird case - the container does NOT spatially
2171 # contain the containee;
2172 # we expand and throw a warning
2174 # for an example of this see ZFP91-CNTF dicistronic gene
2175 # in NCBI chrom 11 build 34.3
2176 $self->problem(1,
2177 "Container feature does not spatially contain ".
2178 "subfeature. Perhaps this is a dicistronic gene? ".
2179 "I am expanding the parent feature",
2180 $container_sf,
2181 $sf);
2182 $container_sf->add_SeqFeature($sf, 'EXPAND');
2185 else {
2186 push(@top, $sf);
2189 return @top;
2190 } # -- end of unflatten_group
2192 # -------
2193 # A NOTE ON USING OBJECTS AS KEYS IN HASHES (stringified objects)
2195 # Often we with to use seqfeatures as keys in a hashtable; because seqfeatures
2196 # in bioperl have no unique ID, we use a surrogate ID in the form of the
2197 # stringified object references - this is just what you get if you say
2199 # print "$sf\n";
2201 # this is guaranteed to be unique (within a particular perl execution)
2203 # often we want to 'revive' the objects used as keys in a hash - once the
2204 # objects are used as keys, remember it is the *strings* used as keys and
2205 # not the object itself, so the object needs to be revived using another
2206 # hashtable that looks like this
2208 # %sfidx = map { $_ => $_ } @sfs
2210 # -------
2213 # recursively finds the best set of pairings from a matrix of possible pairings
2215 # tries to make sure nothing is unpaired
2217 # given a matrix of POSSIBLE matches
2218 # (matrix expressed as hash/lookup; keyed by child object; val = [parent, score]
2221 sub find_best_matches {
2222 my $self = shift;
2223 my $matrix = shift;
2224 my $pairs = shift; # [child,parent] pairs already selected
2226 my $verbose = $self->verbose;
2227 #################################print "I";
2228 if ($verbose > 0) {
2229 printf STDERR "find_best_matches: (/%d)\n", scalar(@$pairs);
2232 my %selected_children = map {($_->[0]=>1)} @$pairs;
2233 my %selected_parents = map {($_->[1]=>1)} @$pairs;
2235 # make a copy of the matrix with the portions still to be
2236 # resolved
2237 my %unresolved_parents = ();
2238 my %unresolved =
2239 map {
2240 if ($verbose > 0) {
2241 printf STDERR " $_ : %s\n", join("; ", map {"[@$_]"} @{$matrix->{$_}});
2243 if ($selected_children{$_}) {
2246 else {
2247 my @parents =
2248 grep {
2249 !$selected_parents{$_->[0]}
2250 } @{$matrix->{$_}};
2251 $unresolved_parents{$_} = 1 foreach @parents;
2252 # new parents
2253 ($_ => [@parents]);
2255 } keys %$matrix;
2257 my @I = keys %unresolved;
2259 return $pairs if !scalar(keys %unresolved_parents);
2260 # NECESSARY CONDITION:
2261 # all possible parents have a child match
2263 return $pairs if !scalar(@I);
2264 # NECESSARY CONDITION:
2265 # all possible children have a parent match
2267 # give those with fewest choices highest priority
2268 @I = sort {
2269 # n possible parents
2270 scalar(@{$unresolved{$a}})
2272 scalar(@{$unresolved{$b}}) ;
2273 } @I;
2275 my $csf = shift @I;
2277 my @J = @{$unresolved{$csf}}; # array of [parent, score]
2279 # sort by score, highest first
2280 @J =
2281 sort {
2282 $b->[1] <=> $a->[1]
2283 } @J;
2285 # select pair(s) from remaining matrix of possible pairs
2286 # by iterating through possible parents
2288 my $successful_pairs;
2289 foreach my $j (@J) {
2290 my ($psf, $score) = @$j;
2291 # would selecting $csf, $psf as a pair
2292 # remove all choices from another?
2293 my $bad = 0;
2294 foreach my $sf (@I) {
2295 if (!grep {$_->[0] ne $psf} @{$unresolved{$sf}}) {
2296 # $psf was the only parent choice for $sf
2297 $bad = 1;
2298 last;
2301 if (!$bad) {
2302 my $pair = [$csf, $psf];
2303 my $new_pairs = [@$pairs, $pair];
2304 my $set = $self->find_best_matches($matrix, $new_pairs);
2305 if ($set) {
2306 $successful_pairs = $set;
2307 last;
2311 # success
2312 return $successful_pairs if $successful_pairs;
2313 # fail
2314 return 0;
2317 # ----------------------------------------------
2318 # writes a group to stdout
2320 # mostly for logging/debugging
2321 # ----------------------------------------------
2322 sub _write_group {
2323 my $self = shift;
2324 my $group = shift;
2325 my $group_tag = shift || 'gene';
2327 my $f = $group->[0];
2328 my $label = '?';
2329 if ($f->has_tag($group_tag)) {
2330 ($label) = $f->get_tag_values($group_tag);
2332 if( $self->verbose > 0 ) {
2333 printf STDERR (" GROUP [%s]:%s\n",
2334 $label,
2335 join(' ',
2336 map { $_->primary_tag } @$group));
2341 sub _write_sf {
2342 my $self = shift;
2343 my $sf = shift;
2344 printf STDERR "TYPE:%s\n", $sf->primary_tag;
2345 return;
2348 sub _write_sf_detail {
2349 my $self = shift;
2350 my $sf = shift;
2351 printf STDERR "TYPE:%s\n", $sf->primary_tag;
2352 my @locs = $sf->location->each_Location;
2353 printf STDERR " %s,%s [%s]\n", $_->start, $_->end, $_->strand foreach @locs;
2354 return;
2357 sub _write_hier {
2358 my $self = shift;
2359 my @sfs = @{shift || []};
2360 my $indent = shift || 0;
2361 if( $self->verbose > 0 ) {
2362 foreach my $sf (@sfs) {
2363 my $label = '?';
2364 if ($sf->has_tag('product')) {
2365 ($label) = $sf->get_tag_values('product');
2367 printf STDERR "%s%s $label\n", ' ' x $indent, $sf->primary_tag;
2368 my @sub_sfs = $sf->sub_SeqFeature;
2369 $self->_write_hier(\@sub_sfs, $indent+1);
2374 # -----------------------------------------------
2376 # returns all possible containers for an SF based
2377 # on splice site coordinates; splice site coords
2378 # must be contained
2379 # -----------------------------------------------
2380 sub _resolve_container_for_sf{
2381 my ($self, $sf, @possible_container_sfs) = @_;
2383 my @coords = $self->_get_splice_coords_for_sf($sf);
2384 my $start = $sf->start;
2385 my $end = $sf->end;
2386 my $splice_uniq_str = "@coords";
2388 my @sf_score_pairs = ();
2389 # a CDS is contained by a mRNA if the locations of the splice
2390 # coordinates are identical
2391 foreach (@possible_container_sfs) {
2392 my @container_coords = $self->_get_splice_coords_for_sf($_);
2393 my $inside =
2394 !$splice_uniq_str ||
2395 index("@container_coords", $splice_uniq_str) > -1;
2396 if ($inside) {
2397 # the container cannot be smaller than the thing contained
2398 if ($_->start > $start || $_->end < $end) {
2399 $inside = 0;
2404 # SPECIAL CASE FOR /ribosomal_slippage
2405 # See: https://www.ncbi.nlm.nih.gov/collab/FT/
2406 if (!$inside && $sf->has_tag('ribosomal_slippage')) {
2407 if ($self->verbose > 0) {
2408 printf STDERR " Checking for ribosomal_slippage\n";
2411 # TODO: rewrite this to match introns;
2412 # each slippage will be a "fake" small CDS exon
2413 my @transcript_splice_sites = @container_coords;
2414 my @cds_splice_sites = @coords;
2415 ##printf STDERR "xxTR SSs: @transcript_splice_sites :: %s\n", $_->get_tag_values('product');
2416 ##printf STDERR "xxCD SSs: @cds_splice_sites :: %s\n\n", $sf->get_tag_values('product');
2418 # find the the first splice site within the CDS
2419 while (scalar(@transcript_splice_sites) &&
2420 $transcript_splice_sites[0] < $cds_splice_sites[0]) {
2421 shift @transcript_splice_sites;
2424 ##print STDERR "TR SSs: @transcript_splice_sites\n";
2425 ##print STDERR "CD SSs: @cds_splice_sites\n\n";
2427 if (!(scalar(@transcript_splice_sites)) ||
2428 $transcript_splice_sites[0] == $cds_splice_sites[0]) {
2430 # we will now try and align all splice remaining sites in the transcript and CDS;
2431 # any splice site that can't be aligned is assumed to be a ribosomal slippage
2433 my @slips = ();
2434 my $in_exon = 1;
2435 $inside = 1; # innocent until proven guilty..
2436 while (@cds_splice_sites) {
2437 if (!@transcript_splice_sites) {
2439 # ribosomal slippage is after the last transcript splice site
2440 # Example: (NC_00007, isoform 3 of PEG10)
2441 # mRNA join(85682..85903,92646..99007)
2442 # mRNA join(85682..85903,92646..99007)
2443 # CDS join(85899..85903,92646..93825,93825..94994)
2445 # OR: None of the splice sites align;
2446 # may be a single CDS exon with one slippage inside it.
2447 # Example: (NC_00007, isoform 4 of PEG10)
2448 # mRNA join(85637..85892,92646..99007)
2449 # CDS join(92767..93825,93825..94994)
2451 # Yes, this code is repeated below...
2452 my $p1 = shift @cds_splice_sites;
2453 my $p2 = shift @cds_splice_sites;
2454 if ($self->verbose > 0) {
2455 printf STDERR " Found the ribosomal_slippage: $p1..$p2\n";
2457 push(@slips, ($p2-$p1)-1);
2459 elsif ($cds_splice_sites[0] == $transcript_splice_sites[0]) {
2460 # splice sites align: this is not the slippage
2461 shift @cds_splice_sites;
2462 shift @transcript_splice_sites;
2463 ##print STDERR "MATCH\n";
2465 else {
2466 # mismatch
2467 if ($cds_splice_sites[0] < $transcript_splice_sites[0]) {
2468 # potential slippage
2470 # ---TTTTTTTTTT----
2471 # ---CCCC--CCCC----
2474 my $p1 = shift @cds_splice_sites;
2475 my $p2 = shift @cds_splice_sites;
2476 if ($self->verbose > 0) {
2477 printf STDERR " Found the ribosomal_slippage: $p1..$p2\n";
2479 push(@slips, ($p2-$p1)-1);
2481 else {
2482 # not a potential ribosomal slippage
2483 $inside = 0; # guilty!
2484 ##print STDERR "FAIL\n";
2485 last;
2489 if ($inside) {
2490 # TODO: this is currently completely arbitrary. How many ribosomal slippages do we allow?
2491 # perhaps we need some mini-statistical model here....?
2492 if (@slips > 1) {
2493 $inside = 0;
2495 # TODO: this is currently completely arbitrary. What is the maximum size of a ribosomal slippage?
2496 # perhaps we need some mini-statistical model here....?
2497 if (grep {$_ > 2} @slips) {
2498 $inside = 0;
2502 else {
2503 # not a ribosomal_slippage, sorry
2506 if ($self->verbose > 0) {
2507 printf STDERR " Checking containment:[$inside] (@container_coords) IN ($splice_uniq_str)\n";
2509 if ($inside) {
2510 # SCORE: matching (ss-scoords+2)/(n-container-ss-coords+2)
2511 my $score =
2512 (scalar(@coords)+2)/(scalar(@container_coords)+2);
2513 push(@sf_score_pairs,
2514 $_=>$score);
2517 return @sf_score_pairs;
2520 sub _get_splice_coords_for_sf {
2521 my $self = shift;
2522 my $sf = shift;
2524 my @locs = $sf->location;
2525 if ($sf->location->isa("Bio::Location::SplitLocationI")) {
2526 @locs = $sf->location->each_Location;
2529 # get an ordered list of (start, end) positions
2531 # my @coords =
2532 # map {
2533 # $_->strand > 0 ? ($_->start, $_->end) : ($_->end, $_->start)
2534 # } @locs;
2536 my @coords = map {($_->start, $_->end)} @locs;
2538 # remove first and last leaving only splice sites
2539 pop @coords;
2540 shift @coords;
2541 return @coords;
2544 =head2 feature_from_splitloc
2546 Title : feature_from_splitloc
2547 Usage : $unflattener->feature_from_splitloc(-features=>$sfs);
2548 Function:
2549 Example :
2550 Returns :
2551 Args : see below
2553 At this time all this method does is generate exons for mRNA or other RNA features
2555 Arguments:
2557 -feature: a Bio::SeqFeatureI object (that conforms to Bio::FeatureHolderI)
2558 -seq: a Bio::SeqI object that contains Bio::SeqFeatureI objects
2559 -features: an arrayref of Bio::SeqFeatureI object
2562 =cut
2564 sub feature_from_splitloc{
2565 my ($self,@args) = @_;
2567 my($sf, $seq, $sfs) =
2568 $self->_rearrange([qw(FEATURE
2570 FEATURES
2572 @args);
2573 my @sfs = (@{$sfs || []});
2574 push(@sfs, $sf) if $sf;
2575 if ($seq) {
2576 $seq->isa("Bio::SeqI") || $self->throw("$seq NOT A SeqI");
2577 @sfs = $seq->get_all_SeqFeatures;
2579 my @exons = grep {$_->primary_tag eq 'exon'} @sfs;
2580 if (@exons) {
2581 $self->problem(2,
2582 "There are already exons, so I will not infer exons");
2585 # index of features by type+location
2586 my %loc_h = ();
2588 # infer for every feature
2589 foreach my $sf (@sfs) {
2591 $sf->isa("Bio::SeqFeatureI") || $self->throw("$sf NOT A SeqFeatureI");
2592 $sf->isa("Bio::FeatureHolderI") || $self->throw("$sf NOT A FeatureHolderI");
2594 my $type = $sf->primary_tag;
2595 next unless $type eq 'mRNA' or $type =~ /RNA/;
2597 # an mRNA from genbank will have a discontinuous location,
2598 # with each sub-location being equivalent to an exon
2599 my @locs = $sf->location;
2601 if ($sf->location->isa("Bio::Location::SplitLocationI")) {
2602 @locs = $sf->location->each_Location;
2605 if (!@locs) {
2606 use Data::Dumper;
2607 print Dumper $sf;
2608 $self->throw("ASSERTION ERROR: sf has no location objects");
2611 # make exons from locations
2612 my @subsfs =
2613 map {
2614 my $subsf = Bio::SeqFeature::Generic->new(-location=>$_,
2615 -primary_tag=>'exon');
2616 ## Provide seq_id to new feature:
2617 $subsf->seq_id($sf->seq_id) if $sf->seq_id;
2618 $subsf->source_tag($sf->source_tag) if $sf->source_tag;
2619 ## Transfer /locus_tag and /gene tag values to inferred
2620 ## features. TODO: Perhaps? this should not be done
2621 ## indiscriminantly but rather by virtue of the setting
2622 ## of group_tag.
2623 foreach my $tag (grep /gene|locus_tag/, $sf->get_all_tags) {
2624 my @vals = $sf->get_tag_values($tag);
2625 $subsf->add_tag_value($tag, @vals);
2628 my $locstr = 'exon::'.$self->_locstr($subsf);
2630 # re-use feature if type and location the same
2631 if ($loc_h{$locstr}) {
2632 $subsf = $loc_h{$locstr};
2634 else {
2635 $loc_h{$locstr} = $subsf;
2637 $subsf;
2638 } @locs;
2640 # PARANOID CHECK
2641 $self->_check_order_is_consistent($sf->location->strand,@subsfs);
2642 #----
2644 $sf->location(Bio::Location::Simple->new());
2646 # we allow the exons to define the boundaries of the transcript
2647 $sf->add_SeqFeature($_, 'EXPAND') foreach @subsfs;
2650 if (!$sf->location->strand) {
2651 # correct weird bioperl bug in previous versions;
2652 # strand was not being set correctly
2653 $sf->location->strand($subsfs[0]->location->strand);
2658 return;
2661 #sub merge_features_with_same_loc {
2662 # my ($self,@args) = @_;
2664 # my($sfs, $seq) =
2665 # $self->_rearrange([qw(FEATURES
2666 # SEQ
2667 # )],
2668 # @args);
2669 # my @sfs = (@$sfs);
2670 # if ($seq) {
2671 # $seq->isa("Bio::SeqI") || $self->throw("$seq NOT A SeqI");
2672 # @sfs = $seq->get_all_SeqFeatures;
2676 # my %loc_h = ();
2677 # foreach my $sf (@sfs) {
2678 # my $type = $sf->primary_tag;
2679 # my $locstr = $self->_locstr($sf);
2680 ## $loc_h{$type.$locstr}
2681 # push(@{$exon_h{$self->_locstr($_)}}, $_) foreach @exons;
2685 =head2 infer_mRNA_from_CDS
2687 Title : infer_mRNA_from_CDS
2688 Usage :
2689 Function:
2690 Example :
2691 Returns :
2692 Args :
2694 given a "type 1" containment hierarchy
2696 gene
2698 exon
2700 this will infer the uniform "type 0" containment hierarchy
2702 gene
2703 mRNA
2705 exon
2707 all the children of the CDS will be moved to the mRNA
2709 a "type 2" containment hierarchy is mixed type "0" and "1" (for
2710 example, see ftp.ncbi.nih.gov/genomes/Schizosaccharomyces_pombe/)
2712 =cut
2714 sub infer_mRNA_from_CDS{
2715 my ($self,@args) = @_;
2717 my($sf, $seq, $noinfer) =
2718 $self->_rearrange([qw(FEATURE
2720 NOINFER
2722 @args);
2723 my @sfs = ($sf);
2724 if ($seq) {
2725 $seq->isa("Bio::SeqI") || $self->throw("$seq NOT A SeqI");
2726 @sfs = $seq->get_all_SeqFeatures;
2729 foreach my $sf (@sfs) {
2731 $sf->isa("Bio::SeqFeatureI") || $self->throw("$sf NOT A SeqFeatureI");
2732 $sf->isa("Bio::FeatureHolderI") || $self->throw("$sf NOT A FeatureHolderI");
2733 if ($self->verbose > 0) {
2734 printf STDERR " Checking $sf %s\n", $sf->primary_tag;
2737 if ($sf->primary_tag eq 'mRNA') {
2738 $self->problem(2,
2739 "Inferring mRNAs when there are already mRNAs present");
2742 my @cdsl = grep {$_->primary_tag eq 'CDS' } $sf->get_SeqFeatures;
2743 if (@cdsl) {
2744 my @children = grep {$_->primary_tag ne 'CDS'} $sf->get_SeqFeatures;
2745 my @mrnas = ();
2748 foreach my $cds (@cdsl) {
2750 if ($self->verbose > 0) {
2751 print " Inferring mRNA from CDS $cds\n";
2753 $self->_check_order_is_consistent($cds->location->strand,$cds->location->each_Location);
2755 my $loc = Bio::Location::Split->new;
2756 foreach my $cdsexonloc ($cds->location->each_Location) {
2757 my $subloc =
2758 Bio::Location::Simple->new(-start=>$cdsexonloc->start,
2759 -end=>$cdsexonloc->end,
2760 -strand=>$cdsexonloc->strand);
2761 $loc->add_sub_Location($subloc);
2763 if ($noinfer) {
2764 push(@mrnas, $cds);
2766 else {
2767 # share the same location
2768 my $mrna =
2769 Bio::SeqFeature::Generic->new(-location=>$loc,
2770 -primary_tag=>'mRNA');
2772 ## Provide seq_id to new feature:
2773 $mrna->seq_id($cds->seq_id) if $cds->seq_id;
2774 $mrna->source_tag($cds->source_tag) if $cds->source_tag;
2776 $self->_check_order_is_consistent($mrna->location->strand,$mrna->location->each_Location);
2778 # make the mRNA hold the CDS; no EXPAND option,
2779 # the CDS cannot be wider than the mRNA
2780 $mrna->add_SeqFeature($cds);
2782 # mRNA steals children of CDS
2783 foreach my $subsf ($cds->get_SeqFeatures) {
2784 $mrna->add_SeqFeature($subsf);
2786 $cds->remove_SeqFeatures;
2787 push(@mrnas, $mrna);
2790 # change gene/CDS to gene/mRNA
2791 $sf->remove_SeqFeatures;
2792 $sf->add_SeqFeature($_) foreach (@mrnas, @children);
2795 return;
2800 =head2 remove_types
2802 Title : remove_types
2803 Usage : $unf->remove_types(-seq=>$seq, -types=>["mRNA"]);
2804 Function:
2805 Example :
2806 Returns :
2807 Args :
2809 removes features of a set type
2811 useful for pre-filtering a genbank record; eg to get rid of STSs
2813 also, there is no way to unflatten
2814 ftp.ncbi.nih.gov/genomes/Schizosaccharomyces_pombe/ UNLESS the bogus
2815 mRNAs in these records are removed (or changed to a different type) -
2816 they just confuse things too much
2818 =cut
2820 sub remove_types{
2821 my ($self,@args) = @_;
2823 my($seq, $types) =
2824 $self->_rearrange([qw(
2826 TYPES
2828 @args);
2829 $seq->isa("Bio::SeqI") || $self->throw("$seq NOT A SeqI");
2830 my @sfs = $seq->get_all_SeqFeatures;
2831 my %rh = map {$_=>1} @$types;
2832 @sfs = grep {!$rh{$_->primary_tag}} @sfs;
2833 $seq->remove_SeqFeatures;
2834 $seq->add_SeqFeature($_) foreach @sfs;
2835 return;
2839 # _check_order_is_consistent($strand,$ranges) RETURNS BOOL
2841 # note: the value of this test is moot - there are many valid,
2842 # if unusual cases where it would flag an anomaly. for example
2843 # transpliced genes such as mod(mdg4) in dmel on AE003744, and
2844 # the following spliced gene on NC_001284:
2846 # mRNA complement(join(20571..20717,21692..22086,190740..190761,
2847 # 140724..141939,142769..142998))
2848 # /gene="nad5"
2849 # /note="trans-splicing, RNA editing"
2850 # /db_xref="GeneID:814567"
2852 # note how the exons are not in order
2853 # this will flag a level-3 warning, the user of this module
2854 # can ignore this and deal appropriately with the resulting
2855 # unordered exons
2856 sub _check_order_is_consistent {
2857 my $self = shift;
2859 my $parent_strand = shift; # this does nothing..?
2860 my @ranges = @_;
2861 return unless @ranges;
2862 my $rangestr =
2863 join(" ",map{sprintf("[%s,%s]",$_->start,$_->end)} @ranges);
2864 my $strand = $ranges[0]->strand;
2865 for (my $i=1; $i<@ranges;$i++) {
2866 if ($ranges[$i]->strand != $strand) {
2867 $self->problem(1,"inconsistent strands. Trans-spliced gene? Range: $rangestr");
2868 return 1;
2869 # mixed ranges - autopass
2870 # some mRNAs have exons on both strands; for
2871 # example, the dmel mod(mdg4) gene which is
2872 # trans-spliced (in actual fact two mRNAs)
2875 my $pass = 1;
2876 for (my $i=1; $i<@ranges;$i++) {
2877 my $rangeP = $ranges[$i-1];
2878 my $range = $ranges[$i];
2879 if ($rangeP->start > $range->end) {
2880 if ($self->seq->is_circular) {
2881 # see for example NC_006578.gbk
2882 # we make exceptions for circular genomes here.
2883 # see Re: [Gmod-ajax] flatfile-to-json.pl error with GFF
2884 # 2010-07-26
2886 else {
2887 # failed - but still get one more chance..
2888 $pass = 0;
2889 $self->problem(2,"Ranges not in correct order. Strange ensembl genbank entry? Range: $rangestr");
2890 last;
2895 if (!$pass) {
2896 # sometimes (eg ensembl flavour genbank files)
2897 # exons on reverse strand listed in reverse order
2898 # eg join(complement(R1),...,complement(Rn))
2899 # where R1 > R2
2900 for (my $i=1; $i<@ranges;$i++) {
2901 my $rangeP = $ranges[$i-1];
2902 my $range = $ranges[$i];
2903 if ($rangeP->end < $range->start) {
2904 $self->problem(3,"inconsistent order. Range: $rangestr");
2905 return 0;
2909 return 1; # pass
2912 # PRIVATE METHOD: _locstr($sf)
2914 # returns a location string for a feature; just the outer boundaries
2915 sub _locstr {
2916 my $self = shift;
2917 my $sf = shift;
2918 return
2919 sprintf("%d..%d", $sf->start, $sf->end);
2922 sub iterate_containment_tree {
2923 my $self = shift;
2924 my $feature_holder = shift;
2925 my $sub = shift;
2926 $sub->($feature_holder);
2927 my @sfs = $feature_holder->get_SeqFeatures;
2928 $self->iterate_containment_tree($_) foreach @sfs;
2931 sub find_best_pairs {
2932 my $matrix = shift;
2933 my $size = shift;
2934 my $i = shift || 0;
2936 for (my $j=0; $j < $size; $j++) {
2937 my $score = $matrix->[$i][$j];
2938 if (!defined($score)) {
2939 next;