lib/Bio/SeqFeature/Tools/Unflattener.pm

   1 #
   2 # bioperl module for Bio::SeqFeature::Tools::Unflattener
   3 #
   4 # Please direct questions and support issues to <bioperl-l@bioperl.org>
   5 #
   6 # Cared for by Chris Mungall <cjm@fruitfly.org>
   7 #
   8 # Copyright Chris Mungall
   9 #
  10 # You may distribute this module under the same terms as perl itself
  11
  12 # POD documentation - main docs before the code
  13
  14 =head1 NAME
  15
  16 Bio::SeqFeature::Tools::Unflattener - turns flat list of genbank-sourced features into a nested SeqFeatureI hierarchy
  17
  18 =head1 SYNOPSIS
  19
  20   # standard / generic use - unflatten a genbank record
  21   use Bio::SeqIO;
  22   use Bio::SeqFeature::Tools::Unflattener;
  23
  24   # generate an Unflattener object
  25   $unflattener = Bio::SeqFeature::Tools::Unflattener->new;
  26
  27   # first fetch a genbank SeqI object
  28   $seqio =
  29     Bio::SeqIO->new(-file=>'AE003644.gbk',
  30                     -format=>'GenBank');
  31   my $out =
  32     Bio::SeqIO->new(-format=>'asciitree');
  33   while ($seq = $seqio->next_seq()) {
  34
  35     # get top level unflattended SeqFeatureI objects
  36     $unflattener->unflatten_seq(-seq=>$seq,
  37                                 -use_magic=>1);
  38     $out->write_seq($seq);
  39
  40     @top_sfs = $seq->get_SeqFeatures;
  41     foreach my $sf (@top_sfs) {
  42         # do something with top-level features (eg genes)
  43     }
  44   }
  45
  46
  47 =head1 DESCRIPTION
  48
  49 Most GenBank entries for annotated genomic DNA contain a B<flat> list
  50 of features. These features can be parsed into an equivalent flat list
  51 of L<Bio::SeqFeatureI> objects using the standard L<Bio::SeqIO>
  52 classes. However, it is often desirable to B<unflatten> this list into
  53 something resembling actual B<gene models>, in which genes, mRNAs and CDSs
  54 are B<nested> according to the nature of the gene model.
  55
  56 The BioPerl object model allows us to store these kind of associations
  57 between SeqFeatures in B<containment hierarchies> -- any SeqFeatureI
  58 object can contain nested SeqFeatureI objects. The
  59 Bio::SeqFeature::Tools::Unflattener object facilitates construction of
  60 these hierarchies from the underlying GenBank flat-feature-list
  61 representation.
  62
  63 For example, if you were to look at a typical GenBank DNA entry, say,
  64 B<AE003644>, you would see a flat list of features:
  65
  66   source
  67
  68   gene CG4491
  69   mRNA CG4491-RA
  70   CDS CG4491-PA
  71
  72   gene tRNA-Pro
  73   tRNA tRNA-Pro
  74
  75   gene CG32954
  76   mRNA CG32954-RA
  77   mRNA CG32954-RC
  78   mRNA CG32954-RB
  79   CDS CG32954-PA
  80   CDS CG32954-PB
  81   CDS CG32954-PC
  82
  83 These features have sequence locations, but it is not immediately
  84 clear how to write code such that each mRNA is linked to the
  85 appropriate CDS (other than relying on IDs which is very bad)
  86
  87 We would like to convert the above list into the B<containment
  88 hierarchy>, shown below:
  89
  90   source
  91   gene
  92     mRNA CG4491-RA
  93       CDS CG4491-PA
  94       exon
  95       exon
  96   gene
  97     tRNA tRNA-Pro
  98       exon
  99   gene
 100     mRNA CG32954-RA
 101       CDS CG32954-PA
 102       exon
 103       exon
 104     mRNA CG32954-RC
 105       CDS CG32954-PC
 106       exon
 107       exon
 108     mRNA CG32954-RB
 109       CDS CG32954-PB
 110       exon
 111       exon
 112
 113 Where each feature is nested underneath its container. Note that exons
 114 have been automatically inferred (even for tRNA genes).
 115
 116 We do this using a call on a L<Bio::SeqFeature::Tools::Unflattener>
 117 object
 118
 119   @sfs = $unflattener->unflatten_seq(-seq=>$seq);
 120
 121 This would return a list of the B<top level> (i.e. container)
 122 SeqFeatureI objects - in this case, genes. Other top level features
 123 are possible; for instance, the B<source> feature which is always
 124 present, and other features such as B<variation> or B<misc_feature>
 125 types.
 126
 127 The containment hierarchy can be accessed using the get_SeqFeature()
 128 call on any feature object - see L<Bio::SeqFeature::FeatureHolderI>.
 129 The following code will traverse the containment hierarchy for a
 130 feature:
 131
 132   sub traverse {
 133     $sf = shift;   #  $sf isa Bio::SeqfeatureI
 134
 135     # ...do something with $sf!
 136
 137     # depth first traversal of containment tree
 138     @contained_sfs = $sf->get_SeqFeatures;
 139     traverse($_) foreach @contained_sfs;
 140   }
 141
 142 Once you have built the hierarchy, you can do neat stuff like turn the
 143 features into 'rich' feature objects (eg
 144 L<Bio::SeqFeature::Gene::GeneStructure>) or convert to a suitable
 145 format such as GFF3 or chadoxml (after mapping to the Sequence
 146 Ontology); this step is not described here.
 147
 148 =head1 USING MAGIC
 149
 150 Due to the quixotic nature of how features are stored in
 151 GenBank/EMBL/DDBJ, there is no guarantee that the default behaviour of
 152 this module will produce perfect results. Sometimes it is hard or
 153 impossible to build a correct containment hierarchy if the information
 154 provided is simply too lossy, as is often the case. If you care deeply
 155 about your data, you should always manually inspect the resulting
 156 containment hierarchy; you may have to customise the algorithm for
 157 building the hierarchy, or even manually tweak the resulting
 158 hierarchy. This is explained in more detail further on in the document.
 159
 160 However, if you are satisfied with the default behaviour, then you do
 161 not need to read any further. Just make sure you set the parameter
 162 B<use_magic> - this will invoke incantations which will magically
 163 produce good results no matter what the idiosyncracies of the
 164 particular GenBank record in question.
 165
 166 For example
 167
 168   $unflattener->unflatten_seq(-seq=>$seq,
 169                               -use_magic=>1);
 170
 171 The success of this depends on the phase of the moon at the time the
 172 entry was submitted to GenBank. Note that the magical recipe is being
 173 constantly improved, so the results of invoking magic may vary
 174 depending on the bioperl release.
 175
 176 If you are skeptical of magic, or you wish to exact fine grained
 177 control over how the entry is unflattened, or you simply wish to
 178 understand more about how this crazy stuff works, then read on!
 179
 180 =head1 PROBLEMATIC DATA AND INCONSISTENCIES
 181
 182 Occasionally the Unflattener will have problems with certain
 183 records. For example, the record may contain inconsistent data - maybe
 184 there is an B<exon> entry that has no corresponding B<mRNA> location.
 185
 186 The default behaviour is to throw an exception reporting the problem,
 187 if the problem is relatively serious - for example, inconsistent data.
 188
 189 You can exert more fine grained control over this - perhaps you want
 190 the Unflattener to do the best it can, and report any problems. This
 191 can be done - refer to the methods.
 192
 193   error_threshold()
 194
 195   get_problems()
 196
 197   report_problems()
 198
 199   ignore_problems()
 200
 201 =head1 ALGORITHM
 202
 203 This is the default algorithm; you should be able to override any part
 204 of it to customise.
 205
 206 The core of the algorithm is in two parts
 207
 208 =over
 209
 210 =item Partitioning the flat feature list into groups
 211
 212 =item Resolving the feature containment hierarchy for each group
 213
 214 =back
 215
 216 There are other optional steps after the completion of these two
 217 steps, such as B<inferring exons>; we now describe in more detail what
 218 is going on.
 219
 220 =head2 Partitioning into groups
 221
 222 First of all the flat feature list is partitioned into B<group>s.
 223
 224 The default way of doing this is to use the B<gene> attribute; if we
 225 look at two features from GenBank accession AE003644.3:
 226
 227      gene            20111..23268
 228                      /gene="noc"
 229                      /locus_tag="CG4491"
 230                      /note="last curated on Thu Dec 13 16:51:32 PST 2001"
 231                      /map="35B2-35B2"
 232                      /db_xref="FLYBASE:FBgn0005771"
 233      mRNA            join(20111..20584,20887..23268)
 234                      /gene="noc"
 235                      /locus_tag="CG4491"
 236                      /product="CG4491-RA"
 237                      /db_xref="FLYBASE:FBgn0005771"
 238
 239 Both these features share the same /gene tag which is "noc", so they
 240 correspond to the same gene model (the CDS feature is not shown, but
 241 this also has a tag-value /gene="noc").
 242
 243 Not all groups need to correspond to gene models, but this is the most
 244 common use case; later on we shall describe how to customise the
 245 grouping.
 246
 247 Sometimes other tags have to be used; for instance, if you look at the
 248 entire record for AE003644.3 you will see you actually need the use the
 249 /locus_tag attribute. This attribute is actually B<not present> in
 250 most records!
 251
 252 You can override this:
 253
 254   $collection->unflatten_seq(-seq=>$seq, -group_tag=>'locus_tag');
 255
 256 Alternatively, if you B<-use_magic>, the object will try and make a
 257 guess as to what the correct group_tag should be.
 258
 259 At the end of this step, we should have a list of groups - there is no
 260 structure within a group; the group just serves to partition the flat
 261 features. For the example data above, we would have the following groups.
 262
 263   [ source ]
 264   [ gene mRNA CDS ]
 265   [ gene mRNA CDS ]
 266   [ gene mRNA CDS ]
 267   [ gene mRNA mRNA mRNA CDS CDS CDS ]
 268
 269 =head3 Multicopy Genes
 270
 271 Multicopy genes are usually rRNAs or tRNAs that are duplicated across
 272 the genome. Because they are functionally equivalent, and usually have
 273 the same sequence, they usually have the same group_tag (ie gene
 274 symbol); they often have a /note tag giving copy number. This means
 275 they will end up in the same group. This is undesirable, because they
 276 are spatially disconnected.
 277
 278 There is another step, which involves splitting spatially disconnected
 279 groups into distinct groups
 280
 281 this would turn this
 282
 283  [gene-rrn3 rRNA-rrn3 gene-rrn3 rRNA-rrn3]
 284
 285 into this
 286
 287  [gene-rrn3 rRNA-rrn3] [gene-rrn3 rRNA-rrn3]
 288
 289 based on the coordinates
 290
 291 =head3 What next?
 292
 293 The next step is to add some structure to each group, by making
 294 B<containment hierarchies>, trees that represent how the features
 295 interrelate
 296
 297 =head2 Resolving the containment hierarchy
 298
 299 After the grouping is done, we end up with a list of groups which
 300 probably contain features of type 'gene', 'mRNA', 'CDS' and so on.
 301
 302 Singleton groups (eg the 'source' feature) are ignored at this stage.
 303
 304 Each group is itself flat; we need to add an extra level of
 305 organisation. Usually this is because different spliceforms
 306 (represented by the 'mRNA' feature) can give rise to different
 307 protein products (indicated by the 'CDS' feature). We want to correctly
 308 associate mRNAs to CDSs.
 309
 310 We want to go from a group like this:
 311
 312   [ gene mRNA mRNA mRNA CDS CDS CDS ]
 313
 314 to a containment hierarchy like this:
 315
 316   gene
 317     mRNA
 318       CDS
 319     mRNA
 320       CDS
 321     mRNA
 322       CDS
 323
 324 In which each CDS is nested underneath the correct corresponding mRNA.
 325
 326 For entries that contain no alternate splicing, this is simple; we
 327 know that the group
 328
 329   [ gene mRNA CDS ]
 330
 331 Must resolve to the tree
 332
 333   gene
 334     mRNA
 335       CDS
 336
 337 How can we do this in entries with alternate splicing? The bad
 338 news is that there is no guaranteed way of doing this correctly for
 339 any GenBank entry. Occasionally the submission will have been done in
 340 such a way as to reconstruct the containment hierarchy. However, this
 341 is not consistent across databank entries, so no generic solution can
 342 be provided by this object. This module does provide the framework
 343 within which you can customise a solution for the particular dataset
 344 you are interested in - see later.
 345
 346 The good news is that there is an inference we can do that should
 347 produce pretty good results the vast majority of the time. It uses
 348 splice coordinate data - this is the default behaviour of this module,
 349 and is described in detail below.
 350
 351 =head2 Using splice site coordinates to infer containment
 352
 353 If an mRNA is to be the container for a CDS, then the splice site
 354 coordinates (or intron coordinates, depending on how you look at it)
 355 of the CDS must fit inside the splice site coordinates of the mRNA.
 356
 357 Ambiguities can still arise, but the results produced should still be
 358 reasonable and consistent at the sequence level. Look at this fake
 359 example:
 360
 361   mRNA    XXX---XX--XXXXXX--XXXX         join(1..3,7..8,11..16,19..23)
 362   mRNA    XXX-------XXXXXX--XXXX         join(1..3,11..16,19..23)
 363   CDS                 XXXX--XX           join(13..16,19..20)
 364   CDS                 XXXX--XX           join(13..16,19..20)
 365
 366 [obviously the positions have been scaled down]
 367
 368 We cannot unambiguously match mRNA with CDS based on splice sites,
 369 since both CDS share the splice site locations 16^17 and
 370 18^19. However, the consequences of making a wrong match are probably
 371 not very severe. Any annotation data attached to the first CDS is
 372 probably identical to the seconds CDS, other than identifiers.
 373
 374 The default behaviour of this module is to make an arbitrary call
 375 where it is ambiguous (the mapping will always be bijective; i.e. one
 376 mRNA -E<gt> one CDS).
 377
 378 [TODO: NOTE: not tested on EMBL data, which may not be bijective; ie two
 379 mRNAs can share the same CDS??]
 380
 381 This completes the building of the containment hierarchy; other
 382 optional step follow
 383
 384 =head1 POST-GROUPING STEPS
 385
 386 =head2 Inferring exons from mRNAs
 387
 388 This step always occurs if B<-use_magic> is invoked.
 389
 390 In a typical GenBank entry, the exons are B<implicit>. That is they
 391 can be inferred from the mRNA location.
 392
 393 For example:
 394
 395      mRNA            join(20111..20584,20887..23268)
 396
 397 This tells us that this particular transcript has two exons. In
 398 bioperl, the mRNA feature will have a 'split location'.
 399
 400 If we call
 401
 402   $unflattener->feature_from_splitloc(-seq=>$seq);
 403
 404 This will generate the necessary exon features, and nest them under
 405 the appropriate mRNAs. Note that the mRNAs will no longer have split
 406 locations - they will have simple locations spanning the extent of the
 407 exons. This is intentional, to avoid redundancy.
 408
 409 Occasionally a GenBank entry will have both implicit exons (from the
 410 mRNA location) B<and> explicit exon features.
 411
 412 In this case, exons will still be transferred. Tag-value data from the
 413 explicit exon will be transferred to the implicit exon. If exons are
 414 shared between mRNAs these will be represented by different
 415 objects. Any inconsistencies between implicit and explicit will be
 416 reported.
 417
 418 =head3 tRNAs and other noncoding RNAs
 419
 420 exons will also be generated from these features
 421
 422 =head2 Inferring mRNAs from CDS
 423
 424 Some GenBank entries represent gene models using features of type
 425 gene, mRNA and CDS; some entries just use gene and CDS.
 426
 427 If we only have gene and CDS, then the containment hierarchies will
 428 look like this:
 429
 430   gene
 431     CDS
 432
 433 If we want the containment hierarchies to be uniform, like this
 434
 435   gene
 436     mRNA
 437       CDS
 438
 439 Then we must create an mRNA feature. This will have identical
 440 coordinates to the CDS. The assumption is that there is either no
 441 untranslated region, or it is unknown.
 442
 443 To do this, we can call
 444
 445    $unflattener->infer_mRNA_from_CDS(-seq=>$seq);
 446
 447 This is taken care of automatically, if B<-use_magic> is invoked.
 448
 449 =head1 ADVANCED
 450
 451 =head2 Customising the grouping of features
 452
 453 The default behaviour is suited mostly to building models of protein
 454 coding genes and noncoding genes from genbank genomic DNA submissions.
 455
 456 You can change the tag used to partition the feature by passing in a
 457 different group_tag argument - see the unflatten_seq() method
 458
 459 Other behaviour may be desirable. For example, even though SNPs
 460 (features of type 'variation' in GenBank) are not actually part of the
 461 gene model, it may be desirable to group SNPs that overlap or are
 462 nearby gene models.
 463
 464 It should certainly be possible to extend this module to do
 465 this. However, I have yet to code this part!!! If anyone would find
 466 this useful let me know.
 467
 468 In the meantime, you could write your own grouping subroutine, and
 469 feed the results into unflatten_groups() [see the method documentation
 470 below]
 471
 472 =head2 Customising the resolution of the containment hierarchy
 473
 474 Once the flat list of features has been partitioned into groups, the
 475 method unflatten_group() is called on each group to build a tree.
 476
 477 The algorithm for doing this is described above; ambiguities are
 478 resolved by using splice coordinates. As discussed, this can be
 479 ambiguous.
 480
 481 Some submissions may contain information in tags/attributes that hint
 482 as to the mapping that needs to be made between the features.
 483
 484 For example, with the Drosophila Melanogaster release 3 submission, we
 485 see that CDS features in alternately spliced mRNAs have a form like
 486 this:
 487
 488      CDS             join(145588..145686,145752..146156,146227..146493)
 489                      /locus_tag="CG32954"
 490                      /note="CG32954 gene product from transcript CG32954-RA"
 491                                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^
 492                      /codon_start=1
 493                      /product="CG32954-PA"
 494                      /protein_id="AAF53403.1"
 495                      /db_xref="GI:7298167"
 496                      /db_xref="FLYBASE:FBgn0052954"
 497                      /translation="MSFTLTNKNVIFVAGLGGIGLDTSKELLKRDLKNLVILDRIENP..."
 498
 499 Here the /note tag provides the clue we need to link CDS to mRNA
 500 (highlighted with ^^^^). We just need to find the mRNA with the tag
 501
 502   /product="CG32954-RA"
 503
 504 I have no idea how consistent this practice is across submissions; it
 505 is consistent for the fruitfly genome submission.
 506
 507 We can customise the behaviour of unflatten_group() by providing our
 508 own resolver method. This obviously requires a bit of extra
 509 programming, but there is no way to get around this.
 510
 511 Here is an example of how to pass in your own resolver; this example
 512 basically checks the parent (container) /product tag to see if it
 513 matches the required string in the child (contained) /note tag.
 514
 515        $unflattener->unflatten_seq(-seq=>$seq,
 516                                  -group_tag=>'locus_tag',
 517                                  -resolver_method=>sub {
 518                                      my $self = shift;
 519                                      my ($sf, @candidate_container_sfs) = @_;
 520                                      if ($sf->has_tag('note')) {
 521                                          my @notes = $sf->get_tag_values('note');
 522                                          my @trnames = map {/from transcript\s+(.*)/;
 523                                                             $1} @notes;
 524                                          @trnames = grep {$_} @trnames;
 525                                          my $trname;
 526                                          if (@trnames == 0) {
 527                                              $self->throw("UNRESOLVABLE");
 528                                          }
 529                                          elsif (@trnames == 1) {
 530                                              $trname = $trnames[0];
 531                                          }
 532                                          else {
 533                                              $self->throw("AMBIGUOUS: @trnames");
 534                                          }
 535                                          my @container_sfs =
 536                                            grep {
 537                                                my ($product) =
 538                                                  $_->has_tag('product') ?
 539                                                    $_->get_tag_values('product') :
 540                                                      ('');
 541                                                $product eq $trname;
 542                                            } @candidate_container_sfs;
 543                                          if (@container_sfs == 0) {
 544                                              $self->throw("UNRESOLVABLE");
 545                                          }
 546                                          elsif (@container_sfs == 1) {
 547                                              # we got it!
 548                                              return $container_sfs[0];
 549                                          }
 550                                          else {
 551                                              $self->throw("AMBIGUOUS");
 552                                          }
 553                                      }
 554                                  });
 555
 556 the resolver method is only called when there is more than one spliceform.
 557
 558 =head2 Parsing mRNA records
 559
 560 Some of the entries in sequence databanks are for mRNA sequences as
 561 well as genomic DNA. We may want to build models from these too.
 562
 563 NOT YET DONE - IN PROGRESS!!!
 564
 565 Open question - what would these look like?
 566
 567 Ideally we would like a way of combining a mRNA record with the
 568 corresponding SeFeature entry from the appropriate genomic DNA
 569 record. This could be problemmatic in some cases - for example, the
 570 mRNA sequences may not match 100% (due to differences in strain,
 571 assembly problems, sequencing problems, etc). What then...?
 572
 573 =head1 SEE ALSO
 574
 575 Feature table description
 576
 577   http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html
 578
 579 =head1 FEEDBACK
 580
 581 =head2 Mailing Lists
 582
 583 User feedback is an integral part of the evolution of this and other
 584 Bioperl modules. Send your comments and suggestions preferably to the
 585 Bioperl mailing lists  Your participation is much appreciated.
 586
 587   bioperl-l@bioperl.org                         - General discussion
 588   http://bioperl.org/wiki/Mailing_lists  - About the mailing lists
 589
 590 =head2 Support
 591
 592 Please direct usage questions or support issues to the mailing list:
 593
 594 I<bioperl-l@bioperl.org>
 595
 596 rather than to the module maintainer directly. Many experienced and
 597 reponsive experts will be able look at the problem and quickly
 598 address it. Please include a thorough description of the problem
 599 with code and data examples if at all possible.
 600
 601 =head2 Reporting Bugs
 602
 603 report bugs to the Bioperl bug tracking system to help us keep track
 604 the bugs and their resolution.  Bug reports can be submitted via the
 605 web:
 606
 607   https://github.com/bioperl/bioperl-live/issues
 608
 609 =head1 AUTHOR - Chris Mungall
 610
 611 Email:  cjm@fruitfly.org
 612
 613 =head1 APPENDIX
 614
 615 The rest of the documentation details each of the object
 616 methods. Internal methods are usually preceded with a _
 617
 618 =cut
 619
 620
 621 # Let the code begin...
 622
 623 package Bio::SeqFeature::Tools::Unflattener;
 624
 625 use strict;
 626
 627 # Object preamble - inherits from Bio::Root::Root
 628 use Bio::Location::Simple;
 629 use Bio::SeqFeature::Generic;
 630 use Bio::Range;
 631
 632
 633 use base qw(Bio::Root::Root);
 634
 635 =head2 new
 636
 637  Title   : new
 638  Usage   : $unflattener = Bio::SeqFeature::Tools::Unflattener->new();
 639            $unflattener->unflatten_seq(-seq=>$seq);
 640  Function: constructor
 641  Example :
 642  Returns : a new Bio::SeqFeature::Tools::Unflattener
 643  Args    : see below
 644
 645 Arguments
 646
 647   -seq       : A L<Bio::SeqI> object (optional)
 648                the sequence to unflatten; this can also be passed in
 649                when we call unflatten_seq()
 650
 651   -group_tag : a string representing the /tag used to partition flat features
 652                (see discussion above)
 653
 654 =cut
 655
 656
 657 sub new {
 658     my($class,@args) = @_;
 659     my $self = $class->SUPER::new(@args);
 660
 661     my($seq, $group_tag, $trust_grouptag) =
 662         $self->_rearrange([qw(SEQ
 663                               GROUP_TAG
 664                               TRUST_GROUPTAG
 665                              )],
 666                           @args);
 667
 668     $seq  && $self->seq($seq);
 669     $group_tag  && $self->group_tag($group_tag);
 670     # $self->{'trust_grouptag'}= $trust_grouptag if($trust_grouptag); #dgg suggestion
 671     return $self; # success - we hope!
 672 }
 673
 674 sub DESTROY {
 675     my $self = shift;
 676     return if $self->{_reported_problems};
 677     return if $self->{_ignore_problems};
 678     my @probs = $self->get_problems;
 679     if (!$self->{_problems_reported} &&
 680         scalar(@probs)) {
 681         $self->warn(
 682             "WARNING: There are UNREPORTED PROBLEMS.\n".
 683             "You may wish to use the method report_problems(), \n",
 684             "or ignore_problems() on the Unflattener object\n");
 685     }
 686     return;
 687 }
 688
 689 =head2 seq
 690
 691  Title   : seq
 692  Usage   : $unflattener->seq($newval)
 693  Function:
 694  Example :
 695  Returns : value of seq (a Bio::SeqI)
 696  Args    : on set, new value (a Bio::SeqI, optional)
 697
 698 The Bio::SeqI object should hold a flat list of Bio::SeqFeatureI
 699 objects; this is the list that will be unflattened.
 700
 701 The sequence object can also be set when we call unflatten_seq()
 702
 703 =cut
 704
 705 sub seq{
 706     my $self = shift;
 707
 708     return $self->{'seq'} = shift if @_;
 709     return $self->{'seq'};
 710 }
 711
 712 =head2 group_tag
 713
 714  Title   : group_tag
 715  Usage   : $unflattener->group_tag($newval)
 716  Function:
 717  Example :
 718  Returns : value of group_tag (a scalar)
 719  Args    : on set, new value (a scalar or undef, optional)
 720
 721 This is the tag that will be used to collect elements from the flat
 722 feature list into groups; for instance, if we look at two typical
 723 GenBank features:
 724
 725      gene            20111..23268
 726                      /gene="noc"
 727                      /locus_tag="CG4491"
 728                      /note="last curated on Thu Dec 13 16:51:32 PST 2001"
 729                      /map="35B2-35B2"
 730                      /db_xref="FLYBASE:FBgn0005771"
 731      mRNA            join(20111..20584,20887..23268)
 732                      /gene="noc"
 733                      /locus_tag="CG4491"
 734                      /product="CG4491-RA"
 735                      /db_xref="FLYBASE:FBgn0005771"
 736
 737 We can see that these comprise the same gene model because they share
 738 the same /gene attribute; we want to collect these together in groups.
 739
 740 Setting group_tag is optional. The default is to use 'gene'. In the
 741 example above, we could also use /locus_tag
 742
 743 =cut
 744
 745 sub group_tag{
 746     my $self = shift;
 747
 748     return $self->{'group_tag'} = shift if @_;
 749     return $self->{'group_tag'};
 750 }
 751
 752 =head2 partonomy
 753
 754  Title   : partonomy
 755  Usage   : $unflattener->partonomy({mRNA=>'gene', CDS=>'mRNA')
 756  Function:
 757  Example :
 758  Returns : value of partonomy (a scalar)
 759  Args    : on set, new value (a scalar or undef, optional)
 760
 761 A hash representing the containment structure that the seq_feature
 762 nesting should conform to; each key represents the contained (child)
 763 type; each value represents the container (parent) type.
 764
 765 =cut
 766
 767 sub partonomy{
 768     my $self = shift;
 769
 770     return $self->{'partonomy'} = shift if @_;
 771     if (!$self->{'partonomy'}) {
 772         $self->{'partonomy'} = $self->_default_partonomy;
 773     }
 774     return $self->{'partonomy'};
 775 }
 776
 777 sub _default_partonomy{
 778     return {
 779             mRNA => 'gene',
 780             tRNA => 'gene',
 781             rRNA => 'gene',
 782             scRNA => 'gene',
 783             snRNA => 'gene',
 784             snoRNA => 'gene',
 785             misc_RNA => 'gene',
 786             CDS => 'mRNA',
 787             exon => 'mRNA',
 788             intron => 'mRNA',
 789
 790             pseudoexon => 'pseudogene',
 791             pseudointron => 'pseudogene',
 792             pseudotranscript => 'pseudogene',
 793            };
 794 }
 795
 796 =head2 structure_type
 797
 798  Title   : structure_type
 799  Usage   : $unflattener->structure_type($newval)
 800  Function:
 801  Example :
 802  Returns : value of structure_type (a scalar)
 803  Args    : on set, new value (an int or undef, optional)
 804
 805 GenBank entries conform to different flavours, or B<structure
 806 types>. Some have mRNAs, some do not.
 807
 808 Right now there are only two base structure types defined. If you set
 809 the structure type, then appropriate unflattening action will be
 810 taken.  The presence or absence of explicit exons does not affect the
 811 structure type.
 812
 813 If you invoke B<-use_magic> then this will be set automatically, based
 814 on the content of the record.
 815
 816 =over
 817
 818 =item Type 0 (DEFAULT)
 819
 820 typically contains
 821
 822   source
 823   gene
 824   mRNA
 825   CDS
 826
 827 with this structure type, we want the seq_features to be nested like this
 828
 829   gene
 830     mRNA
 831     CDS
 832       exon
 833
 834 exons and introns are implicit from the mRNA 'join' location
 835
 836 to get exons from the mRNAs, you will need this call (see below)
 837
 838   $unflattener->feature_from_splitloc(-seq=>$seq);
 839
 840 =item Type 1
 841
 842 typically contains
 843
 844   source
 845   gene
 846   CDS
 847   exon [optional]
 848   intron [optional]
 849
 850 there are no mRNA features
 851
 852 with this structure type, we want the seq_features to be nested like this
 853
 854   gene
 855     CDS
 856       exon
 857       intron
 858
 859 exon and intron may or may not be present; they may be implicit from
 860 the CDS 'join' location
 861
 862 =back
 863
 864 =cut
 865
 866 sub structure_type{
 867     my $self = shift;
 868
 869     return $self->{'structure_type'} = shift if @_;
 870     return $self->{'structure_type'};
 871 }
 872
 873 =head2 get_problems
 874
 875  Title   : get_problems
 876  Usage   : @probs = get_problems()
 877  Function: Get the list of problem(s) for this object.
 878  Example :
 879  Returns : An array of [severity, description] pairs
 880  Args    :
 881
 882 In the course of unflattening a record, problems may occur. Some of
 883 these problems are non-fatal, and can be ignored.
 884
 885 Problems are represented as arrayrefs containing a pair [severity,
 886 description]
 887
 888 severity is a number, the higher, the more severe the problem
 889
 890 the description is a text string
 891
 892 =cut
 893
 894 sub get_problems{
 895     my $self = shift;
 896
 897     return @{$self->{'_problems'}} if exists($self->{'_problems'});
 898     return ();
 899 }
 900
 901 =head2 clear_problems
 902
 903  Title   : clear_problems
 904  Usage   :
 905  Function: resets the problem list to empty
 906  Example :
 907  Returns :
 908  Args    :
 909
 910
 911 =cut
 912
 913 sub clear_problems{
 914    my ($self,@args) = @_;
 915    $self->{'_problems'} = [];
 916    return;
 917 }
 918
 919
 920 # PRIVATE
 921 # see get_problems
 922 sub add_problem{
 923     my $self = shift;
 924
 925     $self->{'_problems'} = [] unless exists($self->{'_problems'});
 926     if ($self->verbose > 0) {
 927         warn( "PROBLEM: $_\n") foreach @_;
 928     }
 929     push(@{$self->{'_problems'}}, @_);
 930 }
 931
 932 # PRIVATE
 933 # see get_problems
 934 sub problem {
 935     my $self = shift;
 936     my ($severity, $desc, @sfs) = @_;
 937     if (@sfs) {
 938         foreach my $sf (@sfs) {
 939             $desc .=
 940               sprintf("\nSF [$sf]: ". $sf->location->to_FTstring . "; %s\n",
 941                       join('; ',
 942                            $sf->primary_tag,
 943                            map {
 944                                $sf->has_tag($_) ?
 945                                  $sf->get_tag_values($_) : ()
 946                              } qw(locus_tag gene product label)));
 947         }
 948     }
 949     my $thresh = $self->error_threshold;
 950     if ($severity > $thresh) {
 951         $self->{_problems_reported} = 1;
 952         $self->throw("PROBLEM, SEVERITY==$severity\n$desc");
 953     }
 954     $self->add_problem([$severity, $desc]);
 955     return;
 956 }
 957
 958 =head2 report_problems
 959
 960  Title   : report_problems
 961  Usage   : $unflattener->report_problems(\*STDERR);
 962  Function:
 963  Example :
 964  Returns :
 965  Args    : FileHandle (defaults to STDERR)
 966
 967
 968 =cut
 969
 970 sub report_problems{
 971    my ($self, $fh) = @_;
 972
 973    if (!$fh) {
 974        $fh = \*STDERR;
 975    }
 976    foreach my $problem ($self->get_problems) {
 977        my ($sev, $desc) = @$problem;
 978        printf $fh "PROBLEM, SEVERITY==$sev\n$desc\n";
 979    }
 980    $self->{_problems_reported} = 1;
 981    return;
 982 }
 983
 984 =head2 ignore_problems
 985
 986  Title   : ignore_problems
 987  Usage   : $obj->ignore_problems();
 988  Function:
 989  Example :
 990  Returns :
 991  Args    :
 992
 993 Unflattener is very particular about problems it finds along the
 994 way. If you have set the error_threshold such that less severe
 995 problems do not cause exceptions, Unflattener still expects you to
 996 report_problems() at the end, so that the user of the module is aware
 997 of any inconsistencies or problems with the data. In fact, a warning
 998 will be produced if there are unreported problems. To silence, this
 999 warning, call the ignore_problems() method before the Unflattener
1000 object is destroyed.
1001
1002 =cut
1003
1004 sub ignore_problems{
1005    my ($self) = @_;
1006    $self->{_ignore_problems} = 1;
1007    return;
1008 }
1009
1010
1011 =head2 error_threshold
1012
1013  Title   : error_threshold
1014  Usage   : $obj->error_threshold($severity)
1015  Function:
1016  Example :
1017  Returns : value of error_threshold (a scalar)
1018  Args    : on set, new value (an integer)
1019
1020 Sets the threshold above which errors cause this module to throw an
1021 exception. The default is 0; all problems with a severity E<gt> 0 will
1022 cause an exception.
1023
1024 If you raise the threshold to 1, then the unflattening process will be
1025 more lax; problems of severity==1 are generally non-fatal, but may
1026 indicate that the results should be inspected, for example, to make
1027 sure there is no data loss.
1028
1029 =cut
1030
1031 sub error_threshold{
1032     my $self = shift;
1033
1034     return $self->{'error_threshold'} = shift if @_;
1035     return $self->{'error_threshold'} || 0;
1036 }
1037
1038
1039
1040 # PRIVATE
1041 #
1042 # given a type (eg mRNA), will return the container type (eg gene)
1043 sub get_container_type{
1044    my ($self,$type) = @_;
1045    my @roots = $self->_get_partonomy_roots;
1046    if (grep {$_ eq $type} @roots) {
1047        # it is a root - no parents/containers
1048        return;
1049    }
1050    my $ch = $self->partonomy;
1051    my $ctype = $ch->{$type};
1052    if (!$ctype) {
1053        # asterix acts as a wild card
1054        $ctype = $ch->{'*'};
1055    }
1056    return $ctype;
1057 }
1058
1059 # get root node of partonomy hierarchy (usually gene)
1060 sub _get_partonomy_roots {
1061     my $self = shift;
1062     my $ch = $self->partonomy;
1063     my @parents = values %$ch;
1064     # find parents that do not have parents themselves
1065     return grep {!$ch->{$_}} @parents;
1066 }
1067
1068
1069
1070 =head2 unflatten_seq
1071
1072  Title   : unflatten_seq
1073  Usage   : @sfs = $unflattener->unflatten_seq($seq);
1074  Function: turns a flat list of features into a list of holder features
1075  Example :
1076  Returns : list of Bio::SeqFeatureI objects
1077  Args    : see below
1078
1079 partitions a list of features then arranges them in a nested tree; see
1080 above for full explanation.
1081
1082 note - the Bio::SeqI object passed in will be modified
1083
1084 Arguments
1085
1086   -seq   :          a Bio::SeqI object; must contain Bio::SeqFeatureI objects
1087                     (this is optional if seq has already been set)
1088
1089   -use_magic:       if TRUE (ie non-zero) then magic will be invoked;
1090                     see discussion above.
1091
1092   -resolver_method: a CODE reference
1093                     see the documentation above for an example of
1094                     a subroutine that can be used to resolve hierarchies
1095                     within groups.
1096
1097                     this is optional - if nothing is supplied, a default
1098                     subroutine will be used (see below)
1099
1100   -group_tag:       a string
1101                     [ see the group_tag() method ]
1102                     this overrides the default group_tag which is 'gene'
1103
1104
1105
1106 =cut
1107
1108 sub unflatten_seq{
1109    my ($self,@args) = @_;
1110
1111     my($seq, $resolver_method, $group_tag, $partonomy,
1112        $structure_type, $resolver_tag, $use_magic, $noinfer) =
1113         $self->_rearrange([qw(SEQ
1114                               RESOLVER_METHOD
1115                               GROUP_TAG
1116                               PARTONOMY
1117                               STRUCTURE_TYPE
1118                               RESOLVER_TAG
1119                               USE_MAGIC
1120                               NOINFER
1121                              )],
1122                           @args);
1123
1124    # seq we want to unflatten
1125    $seq = $seq || $self->seq;
1126    if (!$self->seq) {
1127        $self->seq($seq);
1128    }
1129
1130
1131    # prevent bad argument combinations
1132    if ($partonomy &&
1133        defined($structure_type)) {
1134        $self->throw("You cannot set both -partonomy and -structure_type\n".
1135                     "(the former is implied by the latter)");
1136    }
1137
1138    # remember the current value of partonomy, to reset later
1139    my $old_partonomy = $self->partonomy;
1140    $self->partonomy($partonomy) if defined $partonomy;
1141
1142    # remember old structure_type
1143    my $old_structure_type = $self->structure_type;
1144    $self->structure_type($structure_type) if defined $structure_type;
1145
1146    # if we are sourcing our data from genbank, all the
1147    # features should be flat (eq no sub_SeqFeatures)
1148    my @flat_seq_features = $seq->get_SeqFeatures;
1149    my @all_seq_features = $seq->get_all_SeqFeatures;
1150
1151    # sanity checks
1152    if (@all_seq_features > @flat_seq_features) {
1153        $self->throw("It looks as if this sequence has already been unflattened");
1154    }
1155    if (@all_seq_features < @flat_seq_features) {
1156        $self->throw("ASSERTION ERROR: something is seriously wrong with your features");
1157    }
1158
1159    # tag for ungrouping; usually /gene or /locus_tag
1160    #     for example:        /gene="foo"
1161    $group_tag = $group_tag || $self->group_tag;
1162    if ($use_magic) {
1163        # use magic to guess the group tag
1164        my @sfs_with_locus_tag =
1165          grep {$_->has_tag("locus_tag")} @flat_seq_features;
1166        my @sfs_with_gene_tag =
1167          grep {$_->has_tag("gene")} @flat_seq_features;
1168        my @sfs_with_product_tag =
1169          grep {$_->has_tag("product")} @flat_seq_features;
1170
1171 #        if ($group_tag && $self->{'trust_grouptag'}) { # dgg suggestion
1172 #
1173 #         }
1174 #        elsif
1175        if (@sfs_with_locus_tag) {
1176         # dgg note: would like to -use_magic with -group_tag = 'gene' for ensembl genomes
1177         # where ensembl gene FT have both /locus_tag and /gene, but mRNA, CDS have /gene only
1178            if ($group_tag && $group_tag ne 'locus_tag') {
1179                $self->throw("You have explicitly set group_tag to be '$group_tag'\n".
1180                             "However, I detect that some features use /locus_tag\n".
1181                             "I believe that this is the correct group_tag to use\n".
1182                             "You can resolve this by either NOT setting -group_tag\n".
1183                             "OR you can unset -use_magic to regain control");
1184            }
1185
1186            # use /locus_tag instead of /gene tag for grouping
1187            # see GenBank entry AE003677 (version 3) for an example
1188            $group_tag = 'locus_tag';
1189            if ($self->verbose > 0) {
1190                warn "Set group tag to: $group_tag\n";
1191            }
1192        }
1193
1194        # on rare occasions, records will have no /gene or /locus_tag
1195        # but it WILL have /product tags. These serve the same purpose
1196        # for grouping. For an example, see AY763288 (also in t/data)
1197        if (@sfs_with_locus_tag==0 &&
1198            @sfs_with_gene_tag==0 &&
1199            @sfs_with_product_tag>0 &&
1200            !$group_tag) {
1201            $group_tag = 'product';
1202            if ($self->verbose > 0) {
1203                warn "Set group tag to: $group_tag\n";
1204            }
1205
1206        }
1207    }
1208    if (!$group_tag) {
1209        $group_tag = 'gene';
1210    }
1211
1212    # ------------------------------
1213    # GROUP FEATURES using $group_tag
1214    #     collect features into unstructured groups
1215    # ------------------------------
1216
1217    # -------------
1218    # we want to generate a list of groups;
1219    # each group is a list of SeqFeatures; this
1220    # group probably (but not necessarily)
1221    # corresponds to a gene model.
1222    #
1223    # this array will look something like this:
1224    # ([$f1], [$f2, $f3, $f4], ...., [$f97, $f98, $f99])
1225    #
1226    # there are also 'singleton' groups, with one member.
1227    # for instance, the 'source' feature is in a singleton group;
1228    # the same with others such as 'misc_feature'
1229    my @groups = ();
1230    # -------------
1231
1232    # --------------------
1233    # we hope that the genbank record allows us to group by some grouping
1234    # tag.
1235    # for instance, most of the time a gene model can be grouped using
1236    # the gene tag - that is where you see
1237    #                    /gene="foo"
1238    # in a genbank record
1239    # --------------------
1240
1241    # keep an index of groups by their
1242    # grouping tag
1243    my %group_by_tag = ();
1244
1245
1246    # iterate through all features, putting them into groups
1247    foreach my $sf (@flat_seq_features) {
1248        if (!$sf->has_tag($group_tag)) {
1249            # SINGLETON
1250            # this is an ungroupable feature;
1251            # add it to a group of its own
1252            push(@groups, [$sf]);
1253        }
1254        else {
1255            # NON-SINGLETON
1256            my @group_tagvals = $sf->get_tag_values($group_tag);
1257            if (@group_tagvals > 1) {
1258                # sanity check:
1259                # currently something can only belong to one group
1260                $self->problem(2,
1261                               ">1 value for /$group_tag: @group_tagvals\n".
1262                               "At this time this module is not equipped to handle this adequately", $sf);
1263            }
1264            # get value of group tag
1265            my $gtv = shift @group_tagvals;
1266            $gtv || $self->throw("Empty /$group_tag vals not allowed!");
1267
1268            # is this a new group?
1269            my $group = $group_by_tag{$gtv};
1270            if ($group) {
1271                # this group has been encountered before - add current
1272                # sf to the end of the group
1273                push(@$group, $sf);
1274            }
1275            else {
1276                # new group; add to index and create new group
1277                $group = [$sf];  # currently one member; probably more to come
1278                $group_by_tag{$gtv} = $group;
1279                push(@groups, $group);
1280            }
1281        }
1282    }
1283
1284    # as well as having the same group_tag, a group should be spatially
1285    # connected. if not, then the group should be split into subgroups.
1286    # this turns out to be necessary in the case of multicopy genes.
1287    # the standard way to represent these is as spatially disconnected
1288    # gene models (usually a 'gene' feature and some kind of RNA feature)
1289    # with the same group tag; the code below will split these into
1290    # seperate groups, one per copy.
1291    @groups = map { $self->_split_group_if_disconnected($_) } @groups;
1292
1293    # remove any duplicates; most of the time the method below has
1294    # no effect. there are some unusual genbank records for which
1295    # duplicate removal is necessary. see the comments in the
1296    # _remove_duplicates_from_group() method if you want to know
1297    # the ugly details
1298    foreach my $group (@groups) {
1299        $self->_remove_duplicates_from_group($group);
1300    }
1301
1302    # -
1303
1304    # PSEUDOGENES, PSEUDOEXONS AND PSEUDOINTRONS
1305    # these are indicated with the /pseudo tag
1306    # these are mapped to a different type; they should NOT
1307    # be treated as normal genes
1308    foreach my $sf (@all_seq_features) {
1309        if ($sf->has_tag('pseudo')) {
1310            my $type = $sf->primary_tag;
1311            # SO type is typically the same as the normal
1312            # type but preceeded by "pseudo"
1313            if ($type eq 'misc_RNA' || $type eq 'mRNA') {
1314             # dgg: see TypeMapper; both pseudo mRNA,misc_RNA should be pseudogenic_transcript
1315                $sf->primary_tag("pseudotranscript");
1316            }
1317            else {
1318                $sf->primary_tag("pseudo$type");
1319            }
1320        }
1321    }
1322    # now some of the post-processing that follows which applies to
1323    # genes will NOT be applied to pseudogenes; this is deliberate
1324    # for example, gene models are normalised to be gene-transcript-exon
1325    # for pseudogenes we leave them as pseudogene-pseudoexon
1326
1327    # --- MAGIC ---
1328    my $need_to_infer_exons = 0;
1329    my $need_to_infer_mRNAs = 0;
1330    my @removed_exons = ();
1331    if ($use_magic) {
1332        if (defined($structure_type)) {
1333            $self->throw("Can't combine use_magic AND setting structure_type");
1334        }
1335        my $n_introns =
1336          scalar(grep {$_->primary_tag eq 'exon'} @flat_seq_features);
1337        my $n_exons =
1338          scalar(grep {$_->primary_tag eq 'exon'} @flat_seq_features);
1339        my $n_mrnas =
1340          scalar(grep {$_->primary_tag eq 'mRNA'} @flat_seq_features);
1341        my $n_mrnas_attached_to_gene =
1342          scalar(grep {$_->primary_tag eq 'mRNA' &&
1343                         $_->has_tag($group_tag)} @flat_seq_features);
1344        my $n_cdss =
1345          scalar(grep {$_->primary_tag eq 'CDS'} @flat_seq_features);
1346        my $n_rnas =
1347          scalar(grep {$_->primary_tag =~ /RNA/} @flat_seq_features);
1348        # Are there any CDS features in the record?
1349        if ($n_cdss > 0) {
1350            # YES
1351
1352            # - a pc gene model should contain at the least a CDS
1353
1354            # Are there any mRNA features in the record?
1355            if ($n_mrnas == 0) {
1356                # NO mRNAs:
1357                # looks like structure_type == 1
1358                $structure_type = 1;
1359                $need_to_infer_mRNAs = 1;
1360            }
1361            elsif ($n_mrnas_attached_to_gene == 0) {
1362                # $n_mrnas > 0
1363                # $n_mrnas_attached_to_gene = 0
1364                #
1365                # The entries _do_ contain mRNA features,
1366                # but none of them are part of a group/gene, i.e. they
1367                # are 'floating'
1368
1369                # this is an annoying weird file that has some floating
1370                # mRNA features;
1371                # eg ftp.ncbi.nih.gov/genomes/Schizosaccharomyces_pombe/
1372
1373                if ($self->verbose) {
1374                    my @floating_mrnas =
1375                      grep {$_->primary_tag eq 'mRNA' &&
1376                              !$_->has_tag($group_tag)} @flat_seq_features;
1377                    printf STDERR "Unattached mRNAs:\n";
1378                    foreach my $mrna (@floating_mrnas) {
1379                        $self->_write_sf_detail($mrna);
1380                    }
1381                    printf STDERR "Don't know how to deal with these; filter at source?\n";
1382                }
1383
1384                foreach (@flat_seq_features) {
1385                    if ($_->primary_tag eq 'mRNA') {
1386                        # what should we do??
1387
1388                        # I think for pombe we just have to filter
1389                        # out bogus mRNAs prior to starting
1390                    }
1391                }
1392
1393                # looks like structure_type == 2
1394                $structure_type = 2;
1395                $need_to_infer_mRNAs = 1;
1396            }
1397            else {
1398            }
1399
1400            # we always infer exons in magic mode
1401            $need_to_infer_exons = 1;
1402        }
1403        else {
1404            # this doesn't seem to be any kind of protein coding gene model
1405            if ( $n_rnas > 0 ) {
1406                $need_to_infer_exons = 1;
1407            }
1408        }
1409
1410        $need_to_infer_exons = 0 if $noinfer; #NML
1411
1412        if ($need_to_infer_exons) {
1413            # remove exons and introns from group -
1414            # we will infer exons later, and we
1415            # can always infer introns from exons
1416            foreach my $group (@groups) {
1417                @$group =
1418                  grep {
1419                      my $type = $_->primary_tag();
1420                      if ($type eq 'exon') {
1421                          # keep track of all removed exons,
1422                          # so we can do a sanity check later
1423                          push(@removed_exons, $_);
1424                      }
1425                      $type ne 'exon' && $type ne 'intron'
1426                  } @$group;
1427            }
1428            # get rid of any groups that have zero members
1429            @groups = grep {scalar(@$_)} @groups;
1430        }
1431    }
1432    # --- END OF MAGIC ---
1433
1434    # LOGICAL ASSERTION
1435    if (grep {!scalar(@$_)} @groups) {
1436        $self->throw("ASSERTION ERROR: empty group");
1437    }
1438
1439    # LOGGING
1440    if ($self->verbose > 0) {
1441        printf STDERR "GROUPS:\n";
1442        foreach my $group (@groups) {
1443            $self->_write_group($group, $group_tag);
1444        }
1445    }
1446    # -
1447
1448    # --------- FINISHED GROUPING -------------
1449
1450
1451    # TYPE CONTAINMENT HIERARCHY (aka partonomy)
1452    # set the containment hierarchy if desired
1453    # see docs for structure_type() method
1454    if ($structure_type) {
1455        if ($structure_type == 1) {
1456            $self->partonomy(
1457                             {CDS => 'gene',
1458                              exon => 'CDS',
1459                              intron => 'CDS',
1460                             }
1461                            );
1462        }
1463        else {
1464            $self->throw("structure_type $structure_type is currently unknown");
1465        }
1466    }
1467
1468    # see if we have an obvious resolver_tag
1469    if ($use_magic) {
1470        foreach my $sf (@all_seq_features) {
1471            if ($sf->has_tag('derived_from')) {
1472                $resolver_tag = 'derived_from';
1473            }
1474        }
1475    }
1476
1477    if ($use_magic) {
1478        # point all feature types without a container type to the root type.
1479        #
1480        # for example, if we have an unanticipated feature_type, say
1481        # 'aberration', this should by default point to the parent 'gene'
1482        foreach my $group (@groups) {
1483            my @sfs = @$group;
1484            if (@sfs > 1) {
1485                foreach my $sf (@sfs) {
1486                    my $type = $sf->primary_tag;
1487                    next if $type eq 'gene';
1488                    my $container_type = $self->get_container_type($type);
1489                    if (!$container_type) {
1490                        $self->partonomy->{$type} = 'gene';
1491                    }
1492                }
1493            }
1494        }
1495    }
1496
1497    # we have done the first part of the unflattening.
1498    # we now have a list of groups; each group is a list of seqfeatures.
1499    # the actual group itself is flat; we may want to unflatten this further;
1500    # for instance, a gene model can contain multiple mRNAs and CDSs. We may want
1501    # to link the correct mRNA to the correct CDS via the bioperl sub_SeqFeature tree.
1502    #
1503    # what we would end up with would be
1504    #  gene1
1505    #    mRNA-a
1506    #      CDS-a
1507    #    mRNA-b
1508    #      CDS-b
1509    my @top_sfs = $self->unflatten_groups(-groups=>\@groups,
1510                                          -resolver_method=>$resolver_method,
1511                                          -resolver_tag=>$resolver_tag);
1512
1513    # restore settings
1514    $self->partonomy($old_partonomy);
1515
1516    # restore settings
1517    $self->structure_type($old_structure_type);
1518
1519    # modify the original Seq object - the top seqfeatures are now
1520    # the top features from each group
1521    $seq->remove_SeqFeatures;
1522    $seq->add_SeqFeature($_) foreach @top_sfs;
1523
1524    # --------- FINISHED UNFLATTENING -------------
1525
1526    # lets see if there are any post-unflattening tasks we need to do
1527
1528
1529
1530    # INFERRING mRNAs
1531    if ($need_to_infer_mRNAs) {
1532        if ($self->verbose > 0) {
1533            printf STDERR "** INFERRING mRNA from CDS\n";
1534        }
1535        $self->infer_mRNA_from_CDS(-seq=>$seq, -noinfer=>$noinfer);
1536    }
1537
1538    # INFERRING exons
1539    if ($need_to_infer_exons) {
1540
1541        # infer exons, one group/gene at a time
1542        foreach my $sf (@top_sfs) {
1543            my @sub_sfs = ($sf, $sf->get_all_SeqFeatures);
1544            $self->feature_from_splitloc(-features=>\@sub_sfs);
1545        }
1546
1547        # some exons are stated explicitly; ie there is an "exon" feature
1548        # most exons are inferred; ie there is a "mRNA" feature with
1549        # split locations
1550        #
1551        # if there were exons explicitly stated in the entry, we need to
1552        # do two things:
1553        #
1554        # make sure these exons are consistent with the inferred exons
1555        #  (you never know)
1556        #
1557        # transfer annotation (tag-vals) from the explicit exon to the
1558        # new inferred exon
1559        if (@removed_exons) {
1560            my @allfeats = $seq->get_all_SeqFeatures;
1561
1562            # find all the inferred exons that are children of mRNA
1563            my @mrnas =  grep {$_->primary_tag eq 'mRNA'} @allfeats;
1564            my @exons =
1565              grep {$_->primary_tag eq 'exon'}
1566                map {$_->get_SeqFeatures} @mrnas;
1567
1568            my %exon_h = ();        # index of exons by location;
1569
1570            # there CAN be >1 exon at a location; we can represent these redundantly
1571            # (ie as a tree, not a graph)
1572            push(@{$exon_h{$self->_locstr($_)}}, $_) foreach @exons;
1573            my @problems = ();      # list of problems;
1574                                    # each problem is a
1575                                    # [$severity, $description] pair
1576            my $problem = '';
1577            my ($n_exons, $n_removed_exons) =
1578              (scalar(keys %exon_h), scalar(@removed_exons));
1579            foreach my $removed_exon (@removed_exons) {
1580                my $locstr = $self->_locstr($removed_exon);
1581                my $inferred_exons = $exon_h{$locstr};
1582                delete $exon_h{$locstr};
1583                if ($inferred_exons) {
1584                    my %exons_done = ();
1585                    foreach my $exon (@$inferred_exons) {
1586
1587                        # make sure we don't move stuff twice
1588                        next if $exons_done{$exon};
1589                        $exons_done{$exon} = 1;
1590
1591                        # we need to tranfer any tag-values from the explicit
1592                        # exon to the implicit exon
1593                        foreach my $tag ($removed_exon->get_all_tags) {
1594                            my @vals = $removed_exon->get_tag_values($tag);
1595                            if (!$exon->can("add_tag_value")) {
1596                                # I'm puzzled as to what should be done here;
1597                                # SeqFeatureIs are not necessarily mutable,
1598                                # but we know that in practice the implementing
1599                                # class is mutable
1600                                $self->throw("The SeqFeature object does not ".
1601                                             "implement add_tag_value()");
1602                            }
1603                            $exon->add_tag_value($tag, @vals);
1604                        }
1605                    }
1606                }
1607                else {
1608                    # no exons inferred at $locstr
1609                    push(@problems,
1610                         [1,
1611                          "there is a conflict with exons; there was an explicitly ".
1612                          "stated exon with location $locstr, yet I cannot generate ".
1613                          "this exon from the supplied mRNA locations\n"]);
1614                }
1615            }
1616            # do we have any inferred exons left over, that were not
1617            # covered in the explicit exons?
1618            if (keys %exon_h) {
1619                # TODO - we ignore this problem for now
1620                push(@problems,
1621                     [1,
1622                      sprintf("There are some inferred exons that are not in the ".
1623                              "explicit exon list; they are the exons at locations:\n".
1624                              join("\n", keys %exon_h)."\n")]);
1625            }
1626
1627            # report any problems
1628            if (@problems) {
1629                my $thresh = $self->error_threshold;
1630                my @bad_problems = grep {$_->[0] > $thresh} @problems;
1631                if (@bad_problems) {
1632                    printf STDERR "PROBLEM:\n";
1633                    $self->_write_hier(\@top_sfs);
1634                    # TODO - allow more fine grained control over this
1635                    $self->{_problems_reported} = 1;
1636                    $self->throw(join("\n",
1637                                      map {"@$_"} @bad_problems));
1638                }
1639                $self->problem(@$_) foreach @problems;
1640            }
1641        }
1642    }
1643    # --- end of inferring exons --
1644
1645    # return new top level features; this can also
1646    # be retrieved via
1647    #   $seq->get_SeqFeatures();
1648 #   return @top_sfs;
1649    return $seq->get_SeqFeatures;
1650 }
1651
1652 # _split_group_if_disconnected([@sfs])
1653 #
1654 # as well as having the same group_tag, a group should be spatially
1655 # connected. if not, then the group should be split into subgroups.
1656 # this turns out to be necessary in the case of multicopy genes.
1657 # the standard way to represent these is as spatially disconnected
1658 # gene models (usually a 'gene' feature and some kind of RNA feature)
1659 # with the same group tag; the code below will split these into
1660 # seperate groups, one per copy.
1661
1662 sub _split_group_if_disconnected {
1663     my $self = shift;
1664     my $group = shift;
1665     my @sfs = @$group;
1666     my @ranges =
1667       Bio::Range->disconnected_ranges(@sfs);
1668     my @groups;
1669     if (@ranges == 0) {
1670         $self->throw("ASSERTION ERROR");
1671     }
1672     elsif (@ranges == 1) {
1673         # no need to split the group
1674         @groups = ($group);
1675     }
1676     else {
1677         # @ranges > 1
1678         # split the group into disconnected ranges
1679         if ($self->verbose > 0) {
1680             printf STDERR "GROUP PRE-SPLIT:\n";
1681             $self->_write_group($group, $self->group_tag);
1682         }
1683         @groups =
1684           map {
1685               my $range = $_;
1686               [grep {
1687                   $_->intersection($range);
1688               } @sfs]
1689           } @ranges;
1690         if ($self->verbose > 0) {
1691             printf STDERR "SPLIT GROUPS:\n";
1692             $self->_write_group($_, $self->group_tag) foreach @groups;
1693         }
1694     }
1695     return @groups;
1696 }
1697
1698 sub _remove_duplicates_from_group {
1699     my $self = shift;
1700     my $group = shift;
1701
1702     # ::: WEIRD BOUNDARY CASE CODE :::
1703     # for some reason, there are some gb records with two gene
1704     # features for one gene; for example, see ATF14F8.gbk
1705     # in the t/data directory
1706     #
1707     # in this case, we get rid of one of the genes
1708
1709     my @genes = grep {$_->primary_tag eq 'gene'} @$group;
1710     if (@genes > 1) {
1711         # OK, if we look at ATF14F8.gbk we see that some genes
1712         # just exist as a single location, some exist as a multisplit location;
1713         #
1714         # eg
1715
1716         #     gene            16790..26395
1717         #                     /gene="F14F8_60"
1718         #     ...
1719         #     gene            complement(join(16790..19855,20136..20912,21378..21497,
1720         #                     21654..21876,22204..22400,22527..23158,23335..23448,
1721         #                     23538..23938,24175..24536,24604..24715,24889..24984,
1722         #                     25114..25171,25257..25329,25544..25589,25900..26018,
1723         #                     26300..26395))
1724         #                     /gene="F14F8_60"
1725
1726         # the former is the 'standard' way of representing the gene in genbank;
1727         # the latter is redundant with the CDS entry. So we shall get rid of
1728         # the latter with the following filter
1729
1730         if ($self->verbose > 0) {
1731             printf STDERR "REMOVING DUPLICATES:\n";
1732         }
1733
1734         @genes =
1735           grep {
1736               my $loc = $_->location;
1737               if ($loc->isa("Bio::Location::SplitLocationI")) {
1738                   my @locs = $loc->each_Location;
1739                   if (@locs > 1) {
1740                       0;
1741                   }
1742                   else {
1743                       1;
1744                   }
1745               }
1746               else {
1747                   1;
1748               }
1749           } @genes;
1750
1751         if (@genes > 1) {
1752             # OK, that didn't work. Our only resort is to just pick one at random
1753             @genes = ($genes[0]);
1754         }
1755         if (@genes) {
1756             @genes == 1 || $self->throw("ASSERTION ERROR");
1757             @$group =
1758               ($genes[0], grep {$_->primary_tag ne 'gene'} @$group);
1759         }
1760     }
1761     # its a dirty job but someone's gotta do it
1762     return;
1763 }
1764
1765
1766 =head2 unflatten_groups
1767
1768  Title   : unflatten_groups
1769  Usage   :
1770  Function: iterates over groups, calling unflatten_group() [see below]
1771  Example :
1772  Returns : list of Bio::SeqFeatureI objects that are holders
1773  Args    : see below
1774
1775 Arguments
1776
1777   -groups:          list of list references; inner list is of Bio::SeqFeatureI objects
1778                     e.g.  ( [$sf1], [$sf2, $sf3, $sf4], [$sf5, ...], ...)
1779
1780   -resolver_method: a CODE reference
1781                     see the documentation above for an example of
1782                     a subroutine that can be used to resolve hierarchies
1783                     within groups.
1784
1785                     this is optional - a default subroutine will be used
1786
1787
1788 NOTE: You should not need to call this method, unless you want fine
1789 grained control over how the unflattening process.
1790
1791 =cut
1792
1793 sub unflatten_groups{
1794    my ($self,@args) = @_;
1795    my($groups, $resolver_method, $resolver_tag) =
1796      $self->_rearrange([qw(GROUPS
1797                            RESOLVER_METHOD
1798                            RESOLVER_TAG
1799                           )],
1800                           @args);
1801
1802    # this is just a simple wrapper for unflatten_group()
1803    return
1804      map {
1805          $self->unflatten_group(-group=>$_,
1806                                 -resolver_method=>$resolver_method,
1807                                 -resolver_tag=>$resolver_tag)
1808      } @$groups;
1809 }
1810
1811 =head2 unflatten_group
1812
1813  Title   : unflatten_group
1814  Usage   :
1815  Function: nests a group of features into a feature containment hierarchy
1816  Example :
1817  Returns : Bio::SeqFeatureI objects that holds other features
1818  Args    : see below
1819
1820 Arguments
1821
1822   -group:           reference to list of Bio::SeqFeatureI objects
1823
1824   -resolver_method: a CODE reference
1825                     see the documentation above for an example of
1826                     a subroutine that can be used to resolve hierarchies
1827                     within groups
1828
1829                     this is optional - a default subroutine will be used
1830
1831
1832 NOTE: You should not need to call this method, unless you want fine
1833 grained control over how the unflattening process.
1834
1835 =cut
1836
1837 sub unflatten_group{
1838    my ($self,@args) = @_;
1839
1840    my($group, $resolver_method, $resolver_tag) =
1841      $self->_rearrange([qw(GROUP
1842                            RESOLVER_METHOD
1843                            RESOLVER_TAG
1844                           )],
1845                           @args);
1846
1847    if ($self->verbose > 0) {
1848        printf STDERR "UNFLATTENING GROUP:\n";
1849        $self->_write_group($group, $self->group_tag);
1850    }
1851
1852    my @sfs = @$group;
1853
1854    # we can safely ignore singletons (e.g. [source])
1855    return $sfs[0] if @sfs == 1;
1856
1857    my $partonomy = $self->partonomy;
1858
1859    # $resolver_method is a reference to a SUB that will resolve
1860    # ambiguous parent/child containment; for example, determining
1861    # which mRNAs go with which CDSs
1862    $resolver_method = $resolver_method || \&_resolve_container_for_sf;
1863
1864    # TAG BASED RESOLVING OF HIERARCHIES
1865    #
1866    # if the user specifies $resolver_tag, then we use this tag
1867    # to pair up ambiguous parents and children;
1868    #
1869    # for example, the CDS feature may have a resolver tag of /derives_from
1870    # which is a 'foreign key' into the /label tag of the mRNA feature
1871    #
1872    # this kind of tag-based resolution is possible for a certain subset
1873    # of genbank records
1874    #
1875    # if no resolver tag is specified, we revert to the normal
1876    # resolver_method
1877    if ($resolver_tag) {
1878        my $backup_resolver_method = $resolver_method;
1879        # closure: $resolver_tag is remembered by this sub
1880        my $sub =
1881          sub {
1882              my ($self, $sf, @possible_container_sfs) = @_;
1883              my @container_sfs = ();
1884              if ($sf->has_tag($resolver_tag)) {
1885                  my ($resolver_tagval) = $sf->get_tag_values($resolver_tag);
1886                  # if a feature has a resolver_tag (e.g. /derives_from)
1887                  # this specifies the /product, /symbol or /label for the
1888                  # parent feature
1889                  @container_sfs =
1890                    grep {
1891                        my $match = 0;
1892                        $self->_write_sf($_) if $self->verbose > 0;
1893                        foreach my $tag (qw(product symbol label)) {
1894                            if ($_->has_tag($tag)) {
1895                                my @vals =
1896                                  $_->get_tag_values($tag);
1897                                if (grep {$_ eq $resolver_tagval} @vals) {
1898                                    $match = 1;
1899                                    last;
1900                                }
1901                            }
1902                        }
1903                        $match;
1904                    } @possible_container_sfs;
1905              }
1906              else {
1907                  return $backup_resolver_method->($sf, @possible_container_sfs);
1908              }
1909              return map {$_=>0} @container_sfs;
1910          };
1911        $resolver_method = $sub;
1912    }
1913    else {
1914        # CONDITION: $resolver_tag is NOT set
1915        $self->throw("assertion error") if $resolver_tag;
1916    }
1917    # we have now set $resolver_method to a subroutine for
1918    # disambiguatimng parent/child relationships. we will
1919    # now build the whole containment hierarchy for this group
1920
1921
1922    # FIND TOP/ROOT SEQFEATURES
1923    #
1924    # find all the features for which there is no
1925    # containing feature type (eg genes)
1926    my @top_sfs =
1927      grep {
1928          !$self->get_container_type($_->primary_tag);
1929      } @sfs;
1930
1931    # CONDITION: there must be at most one root
1932    if (@top_sfs > 1) {
1933        $self->_write_group($group, $self->group_tag);
1934        printf STDERR "TOP SFS:\n";
1935        $self->_write_sf($_) foreach @top_sfs;
1936        $self->throw("multiple top-sfs in group");
1937    }
1938    my $top_sf = $top_sfs[0];
1939
1940    # CREATE INDEX OF SEQFEATURES BY TYPE
1941    my %sfs_by_type = ();
1942    foreach my $sf (@sfs) {
1943        push(@{$sfs_by_type{$sf->primary_tag}}, $sf);
1944    }
1945
1946    # containment index; keyed by child; lookup parent
1947    # note: this index uses the stringified object reference of
1948    # the object as a surrogate lookup key
1949
1950    my %container = ();   # child -> parent
1951
1952    # ALGORITHM: build containment graph
1953    #
1954    # find all possible containers for each SF;
1955    # for instance, for a CDS, the possible containers are all
1956    # the mRNAs in the same group. For a mRNA, the possible
1957    # containers are any SFs of type 'gene' (should only be 1).
1958    # (these container-type mappings can be overridden)
1959    #
1960    # contention is resolved by checking coordinates of splice sites
1961    # (this is the default, but can be overridden)
1962    #
1963    # most of the time, there is no problem identifying a unique
1964    # parent for every child; this can be ambiguous when constructing
1965    # CDS to mRNA relationships with lots of alternate splicing
1966    #
1967    # a hash of child->parent relationships is constructed (%container)
1968    # any mappings that need further resolution (eg CDS to mRNA) are
1969    # placed in %unresolved
1970
1971    # %unresolved index
1972    # (keyed by stringified object reference of child seqfeature)
1973    my %unresolved = ();    # child -> [parent,score] to be resolved
1974
1975    # index of seqfeatures by their stringified object reference;
1976    # this is essentially a way of 'reviving' an object from its stringified
1977    # reference
1978    # (see NOTE ON USING OBJECTS AS KEYS IN HASHES, below)
1979    my %idxsf = map {$_=>$_} @sfs;
1980
1981    foreach my $sf (@sfs) {
1982        my $type = $sf->primary_tag;
1983
1984        # container type (e.g. the container type for CDS is usually mRNA)
1985        my $container_type =
1986          $self->get_container_type($type);
1987        if ($container_type) {
1988
1989            my @possible_container_sfs =
1990              @{$sfs_by_type{$container_type} || []};
1991            # we now have a list of possible containers
1992            # (eg for a CDS in an alternately spliced gene, this
1993            #  would be a list of all the mRNAs for this gene)
1994
1995            if (!@possible_container_sfs) {
1996                # root of hierarchy
1997            }
1998            else {
1999                if (@possible_container_sfs == 1) {
2000                    # this is the easy situation, whereby the containment
2001                    # hierarchy is unambiguous. this will probably be the
2002                    # case if the genbank record has no alternate splicing
2003                    # within it
2004
2005                    # ONE OPTION ONLY - resolved!
2006                    $container{$sf} = $possible_container_sfs[0];
2007
2008                }
2009                else {
2010                    # MULTIPLE CONTAINER CHOICES
2011                    $self->throw("ASSERTION ERROR") unless @possible_container_sfs > 1;
2012
2013                    # push this onto the %unresolved graph, and deal with it
2014                    # later
2015
2016                    # for now we hardcode things such that the only type
2017                    # with ambiguous parents is a CDS; if this is violated,
2018                    # it has a weak problem class of '1' so the API user
2019                    # can easily set things to ignore these
2020                    if ($sf->primary_tag ne 'CDS') {
2021                        $self->problem(1,
2022                                       "multiple container choice for non-CDS; ".
2023                                       "CDS to mRNA should be the only ".
2024                                       "relationships requiring resolving",
2025                                       $sf);
2026                    }
2027
2028                    # previously we set the SUB $resolver_method
2029                    $self->throw("ASSERTION ERROR")
2030                      unless $resolver_method;
2031
2032                    # $resolver_method will assign scores to
2033                    # parent/child combinations; later on we
2034                    # will use these scores to find the optimal
2035                    # parent/child pairings
2036
2037                    # the default $resolver_method uses splice sites to
2038                    # score possible parent/child matches
2039
2040                    my %container_sfh =
2041                      $resolver_method->($self, $sf, @possible_container_sfs);
2042                    if (!%container_sfh) {
2043                        $self->problem(2,
2044                                       "no containers possible for SeqFeature of ".
2045                                       "type: $type; this SF is being placed at ".
2046                                       "root level",
2047                                       $sf);
2048                        # RESOLVED! (sort of - placed at root/gene level)
2049                        $container{$sf} = $top_sf;
2050
2051                        # this sort of thing happens if the record is
2052                        # badly messed up and there is absolutely no indication
2053                        # of where to put the CDS. Perhaps we should just
2054                        # place it with a random mRNA?
2055                    }
2056                    foreach my $jsf (keys %container_sfh) {
2057
2058                        # add [score, parent] pairs to the %unresolved
2059                        # lookup table/graph
2060                        push(@{$unresolved{$sf}},
2061                             [$idxsf{$jsf}, $container_sfh{$jsf} || 0]);
2062                    }
2063                }
2064            }
2065        }
2066        else {
2067            # CONDITION:
2068            # not container type for $sf->primary_tag
2069
2070            # CONDITION:
2071            # $sf must be a root/top node (eg gene)
2072        }
2073    }
2074
2075    if (0) {
2076
2077        # CODE CURRENTLY DISABLED
2078
2079        # we require a 1:1 mapping between mRNAs and CDSs;
2080        # create artificial duplicates if we can't do this...
2081        if (%unresolved) {
2082            my %childh = map {$_=>1} keys %unresolved;
2083            my %parenth = map {$_->[0]=>1} map {@$_} values %unresolved;
2084            if ($self->verbose > 0) {
2085                printf STDERR "MATCHING %d CHILDREN TO %d PARENTS\n",
2086                  scalar(keys %childh), scalar(keys %parenth);
2087            }
2088            # 99.99% of the time in genbank genomic record of structure type 0, we
2089            # see one CDS for every mRNA; one exception is the S Pombe
2090            # genome, which is all CDS, bar a few spurious mRNAs; we have to
2091            # filter out the spurious mRNAs in this case
2092            #
2093            # another strange case is in the mouse genome, NT_078847.1
2094            # for Pcdh13 you will notice there is 4 mRNAs and 5 CDSs.
2095            # most unusual!
2096            # I'm at a loss for a really clever thing to do here. I think the
2097            # best thing is to create duplicate features to preserve the 1:1 mapping
2098            #       my $suffix_id = 1;
2099            #       while (keys %childh > keys %parenth) {
2100            #
2101            #       }
2102        }
2103    }
2104
2105    # DEBUGGING CODE
2106    if ($self->verbose > 0 && scalar(keys %unresolved)) {
2107        printf STDERR "UNRESOLVED PAIRS:\n";
2108        foreach my $childsf (keys %unresolved) {
2109            my @poss = @{$unresolved{$childsf}};
2110            foreach my $p (@poss) {
2111                my $parentsf = $p->[0];
2112                $childsf = $idxsf{$childsf};
2113                my @clabels = ($childsf->get_tagset_values(qw(protein_id label product)), "?");
2114                my @plabels = ($parentsf->get_tagset_values(qw(transcript_id label product)), "?");
2115                printf STDERR
2116                       ("  PAIR: $clabels[0] => $plabels[0]  (of %d)\n",
2117                        scalar(@poss));
2118            }
2119        }
2120    } # -- end of verbose
2121
2122    # Now we have to fully resolve the containment hierarchy; remember,
2123    # the graph %container has the fully resolved child->parent links;
2124    #
2125    # the graph %unresolved is keyed by children missing parents; we
2126    # need to put all these orphans in the %container graph
2127    #
2128    # we do this using the scores in %unresolved, with the
2129    # find_best_matches() algorithm
2130    my $unresolved_problem_reported = 0;
2131    if (%unresolved) {
2132        my $new_pairs =
2133          $self->find_best_matches(\%unresolved, []);
2134        if (!$new_pairs) {
2135            my ($g) = $sfs[0]->get_tagset_values($self->group_tag || 'gene');
2136            $self->problem(2,
2137                           "Could not resolve hierarchy for $g");
2138            $new_pairs = [];
2139            $unresolved_problem_reported = 1;
2140        }
2141        foreach my $pair (@$new_pairs) {
2142            if ($self->verbose > 0) {
2143                printf STDERR "  resolved pair @$pair\n";
2144            }
2145            $container{$pair->[0]} = $pair->[1];
2146            delete $unresolved{$pair->[0]};
2147        }
2148    }
2149
2150    # CONDITION: containment hierarchy resolved
2151    if (%unresolved) {
2152        $self->throw("UNRESOLVED: %unresolved")
2153          unless $unresolved_problem_reported;
2154    }
2155
2156    # make nested SeqFeature hierarchy from @containment_pairs
2157    # ie put child SeqFeatures into parent SeqFeatures
2158    my @top = ();
2159    foreach my $sf (@sfs) {
2160        my $container_sf = $container{$sf};
2161        if ($container_sf) {
2162            # make $sf nested inside $container_sf
2163
2164            # first check if the container spatially contains the containee
2165            if ($container_sf->contains($sf)) {
2166                # add containee
2167                $container_sf->add_SeqFeature($sf);
2168            }
2169            else {
2170                # weird case - the container does NOT spatially
2171                # contain the containee;
2172                # we expand and throw a warning
2173                #
2174                # for an example of this see ZFP91-CNTF dicistronic gene
2175                # in NCBI chrom 11 build 34.3
2176                $self->problem(1,
2177                               "Container feature does not spatially contain ".
2178                               "subfeature. Perhaps this is a dicistronic gene? ".
2179                               "I am expanding the parent feature",
2180                               $container_sf,
2181                               $sf);
2182                $container_sf->add_SeqFeature($sf, 'EXPAND');
2183            }
2184        }
2185        else {
2186            push(@top, $sf);
2187        }
2188    }
2189    return @top;
2190 } # -- end of unflatten_group
2191
2192 # -------
2193 # A NOTE ON USING OBJECTS AS KEYS IN HASHES (stringified objects)
2194 #
2195 # Often we with to use seqfeatures as keys in a hashtable; because seqfeatures
2196 # in bioperl have no unique ID, we use a surrogate ID in the form of the
2197 # stringified object references - this is just what you get if you say
2198 #
2199 #  print "$sf\n";
2200 #
2201 # this is guaranteed to be unique (within a particular perl execution)
2202 #
2203 # often we want to 'revive' the objects used as keys in a hash - once the
2204 # objects are used as keys, remember it is the *strings* used as keys and
2205 # not the object itself, so the object needs to be revived using another
2206 # hashtable that looks like this
2207 #
2208 #    %sfidx = map { $_ => $_ } @sfs
2209 #
2210 # -------
2211
2212
2213 # recursively finds the best set of pairings from a matrix of possible pairings
2214 #
2215 # tries to make sure nothing is unpaired
2216 #
2217 # given a matrix of POSSIBLE matches
2218 #  (matrix expressed as hash/lookup; keyed by child object; val = [parent, score]
2219 #
2220 #
2221 sub find_best_matches {
2222     my $self = shift;
2223     my $matrix = shift;
2224     my $pairs = shift;        # [child,parent] pairs already selected
2225
2226     my $verbose = $self->verbose;
2227     #################################print "I";
2228     if ($verbose > 0) {
2229         printf STDERR "find_best_matches: (/%d)\n", scalar(@$pairs);
2230     }
2231
2232     my %selected_children = map {($_->[0]=>1)} @$pairs;
2233     my %selected_parents = map {($_->[1]=>1)} @$pairs;
2234
2235     # make a copy of the matrix with the portions still to be
2236     # resolved
2237     my %unresolved_parents = ();
2238     my %unresolved =
2239       map {
2240           if ($verbose > 0) {
2241               printf STDERR "  $_ : %s\n", join("; ", map {"[@$_]"} @{$matrix->{$_}});
2242           }
2243           if ($selected_children{$_}) {
2244               ();
2245           }
2246           else {
2247               my @parents =
2248                 grep {
2249                     !$selected_parents{$_->[0]}
2250                 } @{$matrix->{$_}};
2251               $unresolved_parents{$_} = 1 foreach @parents;
2252               # new parents
2253               ($_ => [@parents]);
2254           }
2255       } keys %$matrix;
2256
2257     my @I = keys %unresolved;
2258
2259     return $pairs if !scalar(keys %unresolved_parents);
2260     # NECESSARY CONDITION:
2261     # all possible parents have a child match
2262
2263     return $pairs if !scalar(@I);
2264     # NECESSARY CONDITION:
2265     # all possible children have a parent match
2266
2267     # give those with fewest choices highest priority
2268     @I = sort {
2269         # n possible parents
2270         scalar(@{$unresolved{$a}})
2271           <=>
2272             scalar(@{$unresolved{$b}}) ;
2273     } @I;
2274
2275     my $csf = shift @I;
2276
2277     my @J = @{$unresolved{$csf}};  # array of [parent, score]
2278
2279     # sort by score, highest first
2280     @J =
2281       sort {
2282           $b->[1] <=> $a->[1]
2283       } @J;
2284
2285     # select pair(s) from remaining matrix of possible pairs
2286     # by iterating through possible parents
2287
2288     my $successful_pairs;
2289     foreach my $j (@J) {
2290         my ($psf, $score) = @$j;
2291         # would selecting $csf, $psf as a pair
2292         # remove all choices from another?
2293         my $bad = 0;
2294         foreach my $sf (@I) {
2295             if (!grep {$_->[0] ne $psf} @{$unresolved{$sf}}) {
2296                 # $psf was the only parent choice for $sf
2297                 $bad = 1;
2298                 last;
2299             }
2300         }
2301         if (!$bad) {
2302             my $pair = [$csf, $psf];
2303             my $new_pairs = [@$pairs, $pair];
2304             my $set = $self->find_best_matches($matrix, $new_pairs);
2305             if ($set) {
2306                 $successful_pairs = $set;
2307                 last;
2308             }
2309         }
2310     }
2311     # success
2312     return $successful_pairs if $successful_pairs;
2313     # fail
2314     return 0;
2315 }
2316
2317 # ----------------------------------------------
2318 # writes a group to stdout
2319 #
2320 # mostly for logging/debugging
2321 # ----------------------------------------------
2322 sub _write_group {
2323     my $self = shift;
2324     my $group = shift;
2325     my $group_tag = shift || 'gene';
2326
2327     my $f = $group->[0];
2328     my $label = '?';
2329     if ($f->has_tag($group_tag)) {
2330         ($label) = $f->get_tag_values($group_tag);
2331     }
2332     if( $self->verbose > 0 ) {
2333         printf STDERR ("  GROUP [%s]:%s\n",
2334                $label,
2335                join(' ',
2336                     map { $_->primary_tag } @$group));
2337     }
2338
2339 }
2340
2341 sub _write_sf {
2342     my $self = shift;
2343     my $sf = shift;
2344     printf STDERR "TYPE:%s\n", $sf->primary_tag;
2345     return;
2346 }
2347
2348 sub _write_sf_detail {
2349     my $self = shift;
2350     my $sf = shift;
2351     printf STDERR "TYPE:%s\n", $sf->primary_tag;
2352     my @locs = $sf->location->each_Location;
2353     printf STDERR "  %s,%s [%s]\n", $_->start, $_->end, $_->strand foreach @locs;
2354     return;
2355 }
2356
2357 sub _write_hier {
2358     my $self = shift;
2359     my @sfs = @{shift || []};
2360     my $indent = shift || 0;
2361     if( $self->verbose > 0 ) {
2362         foreach my $sf (@sfs) {
2363             my $label = '?';
2364             if ($sf->has_tag('product')) {
2365                 ($label) = $sf->get_tag_values('product');
2366             }
2367             printf STDERR "%s%s $label\n", '  ' x $indent, $sf->primary_tag;
2368             my @sub_sfs = $sf->sub_SeqFeature;
2369             $self->_write_hier(\@sub_sfs, $indent+1);
2370         }
2371     }
2372 }
2373
2374 # -----------------------------------------------
2375 #
2376 # returns all possible containers for an SF based
2377 # on splice site coordinates; splice site coords
2378 # must be contained
2379 # -----------------------------------------------
2380 sub _resolve_container_for_sf{
2381    my ($self, $sf, @possible_container_sfs) = @_;
2382
2383    my @coords = $self->_get_splice_coords_for_sf($sf);
2384    my $start = $sf->start;
2385    my $end = $sf->end;
2386    my $splice_uniq_str = "@coords";
2387
2388    my @sf_score_pairs = ();
2389    # a CDS is contained by a mRNA if the locations of the splice
2390    # coordinates are identical
2391    foreach (@possible_container_sfs) {
2392        my @container_coords = $self->_get_splice_coords_for_sf($_);
2393        my $inside =
2394          !$splice_uniq_str ||
2395            index("@container_coords", $splice_uniq_str) > -1;
2396        if ($inside) {
2397            # the container cannot be smaller than the thing contained
2398            if ($_->start > $start || $_->end < $end) {
2399                $inside = 0;
2400            }
2401        }
2402
2403
2404        # SPECIAL CASE FOR /ribosomal_slippage
2405        # See: https://www.ncbi.nlm.nih.gov/collab/FT/
2406        if (!$inside && $sf->has_tag('ribosomal_slippage')) {
2407            if ($self->verbose > 0) {
2408                printf STDERR "    Checking for ribosomal_slippage\n";
2409            }
2410
2411            # TODO: rewrite this to match introns;
2412            #  each slippage will be a "fake" small CDS exon
2413            my @transcript_splice_sites = @container_coords;
2414            my @cds_splice_sites = @coords;
2415            ##printf STDERR "xxTR SSs: @transcript_splice_sites :: %s\n", $_->get_tag_values('product');
2416            ##printf STDERR "xxCD SSs: @cds_splice_sites :: %s\n\n", $sf->get_tag_values('product');
2417
2418            # find the the first splice site within the CDS
2419            while (scalar(@transcript_splice_sites) &&
2420                   $transcript_splice_sites[0] < $cds_splice_sites[0]) {
2421                shift @transcript_splice_sites;
2422            }
2423
2424            ##print STDERR "TR SSs: @transcript_splice_sites\n";
2425            ##print STDERR "CD SSs: @cds_splice_sites\n\n";
2426
2427            if (!(scalar(@transcript_splice_sites)) ||
2428                  $transcript_splice_sites[0] == $cds_splice_sites[0]) {
2429
2430                # we will now try and align all splice remaining sites in the transcript and CDS;
2431                # any splice site that can't be aligned is assumed to be a ribosomal slippage
2432
2433                my @slips = ();
2434                my $in_exon = 1;
2435                $inside = 1;   # innocent until proven guilty..
2436                while (@cds_splice_sites) {
2437                    if (!@transcript_splice_sites) {
2438
2439                        # ribosomal slippage is after the last transcript splice site
2440                        # Example: (NC_00007, isoform 3 of PEG10)
2441                        #     mRNA            join(85682..85903,92646..99007)
2442                        #     mRNA            join(85682..85903,92646..99007)
2443                        #     CDS             join(85899..85903,92646..93825,93825..94994)
2444
2445                        # OR: None of the splice sites align;
2446                        #  may be a single CDS exon with one slippage inside it.
2447                        # Example: (NC_00007, isoform 4 of PEG10)
2448                        #     mRNA            join(85637..85892,92646..99007)
2449                        #     CDS             join(92767..93825,93825..94994)
2450
2451                        # Yes, this code is repeated below...
2452                        my $p1 = shift @cds_splice_sites;
2453                        my $p2 = shift @cds_splice_sites;
2454                        if ($self->verbose > 0) {
2455                            printf STDERR "    Found the ribosomal_slippage: $p1..$p2\n";
2456                        }
2457                        push(@slips, ($p2-$p1)-1);
2458                    }
2459                    elsif ($cds_splice_sites[0] == $transcript_splice_sites[0]) {
2460                        # splice sites align: this is not the slippage
2461                        shift @cds_splice_sites;
2462                        shift @transcript_splice_sites;
2463                        ##print STDERR "MATCH\n";
2464                    }
2465                    else {
2466                        # mismatch
2467                        if ($cds_splice_sites[0] < $transcript_splice_sites[0]) {
2468                            # potential slippage
2469                            #             v
2470                            # ---TTTTTTTTTT----
2471                            # ---CCCC--CCCC----
2472                            #       ^
2473
2474                            my $p1 = shift @cds_splice_sites;
2475                            my $p2 = shift @cds_splice_sites;
2476                            if ($self->verbose > 0) {
2477                                printf STDERR "    Found the ribosomal_slippage: $p1..$p2\n";
2478                            }
2479                            push(@slips, ($p2-$p1)-1);
2480                        }
2481                        else {
2482                            # not a potential ribosomal slippage
2483                            $inside = 0; # guilty!
2484                            ##print STDERR "FAIL\n";
2485                            last;
2486                        }
2487                    }
2488                }
2489                if ($inside) {
2490                    # TODO: this is currently completely arbitrary. How many ribosomal slippages do we allow?
2491                    # perhaps we need some mini-statistical model here....?
2492                    if (@slips > 1) {
2493                        $inside = 0;
2494                    }
2495                    # TODO: this is currently completely arbitrary. What is the maximum size of a ribosomal slippage?
2496                    # perhaps we need some mini-statistical model here....?
2497                    if (grep {$_ > 2} @slips) {
2498                        $inside = 0;
2499                    }
2500                }
2501            }
2502            else {
2503                # not a ribosomal_slippage, sorry
2504            }
2505        }
2506        if ($self->verbose > 0) {
2507            printf STDERR "    Checking containment:[$inside] (@container_coords) IN ($splice_uniq_str)\n";
2508        }
2509        if ($inside) {
2510            # SCORE: matching (ss-scoords+2)/(n-container-ss-coords+2)
2511            my $score =
2512              (scalar(@coords)+2)/(scalar(@container_coords)+2);
2513            push(@sf_score_pairs,
2514                 $_=>$score);
2515        }
2516    }
2517    return @sf_score_pairs;
2518 }
2519
2520 sub _get_splice_coords_for_sf {
2521     my $self = shift;
2522     my $sf = shift;
2523
2524    my @locs = $sf->location;
2525    if ($sf->location->isa("Bio::Location::SplitLocationI")) {
2526        @locs = $sf->location->each_Location;
2527    }
2528
2529    # get an ordered list of (start, end) positions
2530
2531 #   my @coords =
2532 #     map {
2533 #         $_->strand > 0 ? ($_->start, $_->end) : ($_->end, $_->start)
2534 #     } @locs;
2535
2536     my @coords = map {($_->start, $_->end)} @locs;
2537
2538    # remove first and last leaving only splice sites
2539    pop @coords;
2540    shift @coords;
2541     return @coords;
2542 }
2543
2544 =head2 feature_from_splitloc
2545
2546  Title   : feature_from_splitloc
2547  Usage   : $unflattener->feature_from_splitloc(-features=>$sfs);
2548  Function:
2549  Example :
2550  Returns :
2551  Args    : see below
2552
2553 At this time all this method does is generate exons for mRNA or other RNA features
2554
2555 Arguments:
2556
2557   -feature:    a Bio::SeqFeatureI object (that conforms to Bio::FeatureHolderI)
2558   -seq:        a Bio::SeqI object that contains Bio::SeqFeatureI objects
2559   -features:   an arrayref of Bio::SeqFeatureI object
2560
2561
2562 =cut
2563
2564 sub feature_from_splitloc{
2565    my ($self,@args) = @_;
2566
2567    my($sf, $seq, $sfs) =
2568      $self->_rearrange([qw(FEATURE
2569                            SEQ
2570                            FEATURES
2571                           )],
2572                           @args);
2573    my @sfs = (@{$sfs || []});
2574    push(@sfs, $sf) if $sf;
2575    if ($seq) {
2576        $seq->isa("Bio::SeqI") || $self->throw("$seq NOT A SeqI");
2577        @sfs = $seq->get_all_SeqFeatures;
2578    }
2579    my @exons = grep {$_->primary_tag eq 'exon'} @sfs;
2580    if (@exons) {
2581        $self->problem(2,
2582                       "There are already exons, so I will not infer exons");
2583    }
2584
2585    # index of features by type+location
2586    my %loc_h = ();
2587
2588    # infer for every feature
2589    foreach my $sf (@sfs) {
2590
2591        $sf->isa("Bio::SeqFeatureI") || $self->throw("$sf NOT A SeqFeatureI");
2592        $sf->isa("Bio::FeatureHolderI") || $self->throw("$sf NOT A FeatureHolderI");
2593
2594        my $type = $sf->primary_tag;
2595        next unless $type eq 'mRNA' or $type =~ /RNA/;
2596
2597        # an mRNA from genbank will have a discontinuous location,
2598        # with each sub-location being equivalent to an exon
2599        my @locs = $sf->location;
2600
2601        if ($sf->location->isa("Bio::Location::SplitLocationI")) {
2602            @locs = $sf->location->each_Location;
2603        }
2604
2605        if (!@locs) {
2606            use Data::Dumper;
2607            print Dumper $sf;
2608            $self->throw("ASSERTION ERROR: sf has no location objects");
2609        }
2610
2611        # make exons from locations
2612        my @subsfs =
2613          map {
2614              my $subsf = Bio::SeqFeature::Generic->new(-location=>$_,
2615                                                        -primary_tag=>'exon');
2616              ## Provide seq_id to new feature:
2617              $subsf->seq_id($sf->seq_id) if $sf->seq_id;
2618              $subsf->source_tag($sf->source_tag) if $sf->source_tag;
2619              ## Transfer /locus_tag and /gene tag values to inferred
2620              ## features.  TODO: Perhaps? this should not be done
2621              ## indiscriminantly but rather by virtue of the setting
2622              ## of group_tag.
2623              foreach my $tag (grep /gene|locus_tag/, $sf->get_all_tags) {
2624                  my @vals = $sf->get_tag_values($tag);
2625                  $subsf->add_tag_value($tag, @vals);
2626              }
2627
2628              my $locstr = 'exon::'.$self->_locstr($subsf);
2629
2630              # re-use feature if type and location the same
2631              if ($loc_h{$locstr}) {
2632                  $subsf = $loc_h{$locstr};
2633              }
2634              else {
2635                  $loc_h{$locstr} = $subsf;
2636              }
2637              $subsf;
2638          } @locs;
2639
2640        # PARANOID CHECK
2641        $self->_check_order_is_consistent($sf->location->strand,@subsfs);
2642        #----
2643
2644        $sf->location(Bio::Location::Simple->new());
2645
2646        # we allow the exons to define the boundaries of the transcript
2647        $sf->add_SeqFeature($_, 'EXPAND') foreach @subsfs;
2648
2649
2650        if (!$sf->location->strand) {
2651            # correct weird bioperl bug in previous versions;
2652            # strand was not being set correctly
2653            $sf->location->strand($subsfs[0]->location->strand);
2654        }
2655
2656
2657    }
2658    return;
2659 }
2660
2661 #sub merge_features_with_same_loc {
2662 #   my ($self,@args) = @_;
2663
2664 #   my($sfs, $seq) =
2665 #     $self->_rearrange([qw(FEATURES
2666 #                           SEQ
2667 #                          )],
2668 #                          @args);
2669 #   my @sfs = (@$sfs);
2670 #   if ($seq) {
2671 #       $seq->isa("Bio::SeqI") || $self->throw("$seq NOT A SeqI");
2672 #       @sfs = $seq->get_all_SeqFeatures;
2673 #   }
2674
2675
2676 #   my %loc_h = ();
2677 #   foreach my $sf (@sfs) {
2678 #       my $type = $sf->primary_tag;
2679 #       my $locstr = $self->_locstr($sf);
2680 ##       $loc_h{$type.$locstr}
2681 #       push(@{$exon_h{$self->_locstr($_)}}, $_) foreach @exons;
2682 #   }
2683 #}
2684
2685 =head2 infer_mRNA_from_CDS
2686
2687  Title   : infer_mRNA_from_CDS
2688  Usage   :
2689  Function:
2690  Example :
2691  Returns :
2692  Args    :
2693
2694 given a "type 1" containment hierarchy
2695
2696   gene
2697     CDS
2698       exon
2699
2700 this will infer the uniform "type 0" containment hierarchy
2701
2702   gene
2703     mRNA
2704       CDS
2705       exon
2706
2707 all the children of the CDS will be moved to the mRNA
2708
2709 a "type 2" containment hierarchy is mixed type "0" and "1" (for
2710 example, see ftp.ncbi.nih.gov/genomes/Schizosaccharomyces_pombe/)
2711
2712 =cut
2713
2714 sub infer_mRNA_from_CDS{
2715    my ($self,@args) = @_;
2716
2717    my($sf, $seq, $noinfer) =
2718      $self->_rearrange([qw(FEATURE
2719                            SEQ
2720                            NOINFER
2721                           )],
2722                           @args);
2723    my @sfs = ($sf);
2724    if ($seq) {
2725        $seq->isa("Bio::SeqI") || $self->throw("$seq NOT A SeqI");
2726        @sfs = $seq->get_all_SeqFeatures;
2727    }
2728
2729    foreach my $sf (@sfs) {
2730
2731        $sf->isa("Bio::SeqFeatureI") || $self->throw("$sf NOT A SeqFeatureI");
2732        $sf->isa("Bio::FeatureHolderI") || $self->throw("$sf NOT A FeatureHolderI");
2733        if ($self->verbose > 0) {
2734            printf STDERR "    Checking $sf %s\n", $sf->primary_tag;
2735        }
2736
2737        if ($sf->primary_tag eq 'mRNA') {
2738            $self->problem(2,
2739                           "Inferring mRNAs when there are already mRNAs present");
2740        }
2741
2742        my @cdsl = grep {$_->primary_tag eq 'CDS' } $sf->get_SeqFeatures;
2743        if (@cdsl) {
2744            my @children = grep {$_->primary_tag ne 'CDS'} $sf->get_SeqFeatures;
2745            my @mrnas = ();
2746
2747
2748            foreach my $cds (@cdsl) {
2749
2750                if ($self->verbose > 0) {
2751                    print "    Inferring mRNA from CDS $cds\n";
2752                }
2753                $self->_check_order_is_consistent($cds->location->strand,$cds->location->each_Location);
2754
2755                my $loc = Bio::Location::Split->new;
2756                foreach my $cdsexonloc ($cds->location->each_Location) {
2757                    my $subloc =
2758                      Bio::Location::Simple->new(-start=>$cdsexonloc->start,
2759                                                 -end=>$cdsexonloc->end,
2760                                                 -strand=>$cdsexonloc->strand);
2761                    $loc->add_sub_Location($subloc);
2762                }
2763                 if ($noinfer) {
2764                     push(@mrnas, $cds);
2765                 }
2766                 else {
2767 #                   share the same location
2768                     my $mrna =
2769                         Bio::SeqFeature::Generic->new(-location=>$loc,
2770                                 -primary_tag=>'mRNA');
2771
2772 ##                  Provide seq_id to new feature:
2773                     $mrna->seq_id($cds->seq_id) if $cds->seq_id;
2774                     $mrna->source_tag($cds->source_tag) if $cds->source_tag;
2775
2776                     $self->_check_order_is_consistent($mrna->location->strand,$mrna->location->each_Location);
2777
2778 #                   make the mRNA hold the CDS; no EXPAND option,
2779 #                   the CDS cannot be wider than the mRNA
2780                     $mrna->add_SeqFeature($cds);
2781
2782 #                   mRNA steals children of CDS
2783                     foreach my $subsf ($cds->get_SeqFeatures) {
2784                         $mrna->add_SeqFeature($subsf);
2785                     }
2786                     $cds->remove_SeqFeatures;
2787                     push(@mrnas, $mrna);
2788                 }
2789            }
2790 #          change gene/CDS to gene/mRNA
2791            $sf->remove_SeqFeatures;
2792            $sf->add_SeqFeature($_) foreach (@mrnas, @children);
2793        }
2794    }
2795    return;
2796
2797
2798 }
2799
2800 =head2 remove_types
2801
2802  Title   : remove_types
2803  Usage   : $unf->remove_types(-seq=>$seq, -types=>["mRNA"]);
2804  Function:
2805  Example :
2806  Returns :
2807  Args    :
2808
2809 removes features of a set type
2810
2811 useful for pre-filtering a genbank record; eg to get rid of STSs
2812
2813 also, there is no way to unflatten
2814 ftp.ncbi.nih.gov/genomes/Schizosaccharomyces_pombe/ UNLESS the bogus
2815 mRNAs in these records are removed (or changed to a different type) -
2816 they just confuse things too much
2817
2818 =cut
2819
2820 sub remove_types{
2821    my ($self,@args) = @_;
2822
2823    my($seq, $types) =
2824      $self->_rearrange([qw(
2825                            SEQ
2826                            TYPES
2827                           )],
2828                           @args);
2829    $seq->isa("Bio::SeqI") || $self->throw("$seq NOT A SeqI");
2830    my @sfs = $seq->get_all_SeqFeatures;
2831    my %rh = map {$_=>1} @$types;
2832    @sfs = grep {!$rh{$_->primary_tag}} @sfs;
2833    $seq->remove_SeqFeatures;
2834    $seq->add_SeqFeature($_) foreach @sfs;
2835    return;
2836 }
2837
2838
2839 # _check_order_is_consistent($strand,$ranges) RETURNS BOOL
2840 #
2841 # note: the value of this test is moot - there are many valid,
2842 # if unusual cases where it would flag an anomaly. for example
2843 # transpliced genes such as mod(mdg4) in dmel on AE003744, and
2844 # the following spliced gene on NC_001284:
2845 #
2846 #     mRNA            complement(join(20571..20717,21692..22086,190740..190761,
2847 #                     140724..141939,142769..142998))
2848 #                     /gene="nad5"
2849 #                     /note="trans-splicing, RNA editing"
2850 #                     /db_xref="GeneID:814567"
2851 #
2852 # note how the exons are not in order
2853 #  this will flag a level-3 warning, the user of this module
2854 #  can ignore this and deal appropriately with the resulting
2855 #  unordered exons
2856 sub _check_order_is_consistent {
2857     my $self = shift;
2858
2859     my $parent_strand = shift; # this does nothing..?
2860     my @ranges = @_;
2861     return unless @ranges;
2862     my $rangestr =
2863       join(" ",map{sprintf("[%s,%s]",$_->start,$_->end)} @ranges);
2864     my $strand = $ranges[0]->strand;
2865     for (my $i=1; $i<@ranges;$i++) {
2866         if ($ranges[$i]->strand != $strand) {
2867             $self->problem(1,"inconsistent strands. Trans-spliced gene? Range: $rangestr");
2868             return 1;
2869             # mixed ranges - autopass
2870             # some mRNAs have exons on both strands; for
2871             # example, the dmel mod(mdg4) gene which is
2872             # trans-spliced (in actual fact two mRNAs)
2873         }
2874     }
2875     my $pass = 1;
2876     for (my $i=1; $i<@ranges;$i++) {
2877         my $rangeP = $ranges[$i-1];
2878         my $range = $ranges[$i];
2879             if ($rangeP->start > $range->end) {
2880                 if ($self->seq->is_circular) {
2881                     # see for example NC_006578.gbk
2882                     # we make exceptions for circular genomes here.
2883                     # see Re: [Gmod-ajax] flatfile-to-json.pl error with GFF
2884                     # 2010-07-26
2885                 }
2886                 else {
2887                     # failed - but still get one more chance..
2888                     $pass = 0;
2889                     $self->problem(2,"Ranges not in correct order. Strange ensembl genbank entry? Range: $rangestr");
2890                     last;
2891                 }
2892             }
2893     }
2894
2895     if (!$pass) {
2896         # sometimes (eg ensembl flavour genbank files)
2897         # exons on reverse strand listed in reverse order
2898         # eg join(complement(R1),...,complement(Rn))
2899         # where R1 > R2
2900         for (my $i=1; $i<@ranges;$i++) {
2901             my $rangeP = $ranges[$i-1];
2902             my $range = $ranges[$i];
2903             if ($rangeP->end < $range->start) {
2904                 $self->problem(3,"inconsistent order. Range: $rangestr");
2905                 return 0;
2906             }
2907         }
2908     }
2909     return 1; # pass
2910 }
2911
2912 # PRIVATE METHOD: _locstr($sf)
2913 #
2914 # returns a location string for a feature; just the outer boundaries
2915 sub _locstr {
2916     my $self = shift;
2917     my $sf = shift;
2918     return
2919       sprintf("%d..%d", $sf->start, $sf->end);
2920 }
2921
2922 sub iterate_containment_tree {
2923     my $self = shift;
2924     my $feature_holder = shift;
2925     my $sub = shift;
2926     $sub->($feature_holder);
2927     my @sfs = $feature_holder->get_SeqFeatures;
2928     $self->iterate_containment_tree($_) foreach @sfs;
2929 }
2930
2931 sub find_best_pairs {
2932     my $matrix = shift;
2933     my $size = shift;
2934     my $i = shift || 0;
2935
2936     for (my $j=0; $j < $size; $j++) {
2937         my $score = $matrix->[$i][$j];
2938         if (!defined($score)) {
2939             next;
2940         }
2941
2942     }
2943
2944 }
2945
2946 1;