=head1 NAME

Bio::DB::GFF::Aggregator -- Aggregate GFF groups into composite features

=head1 SYNOPSIS

 use Bio::DB::GFF;

 my $agg1 = Bio::DB::GFF::Aggregator->new(-method      => 'cistron',
                                          -main_method => 'locus',
                                          -sub_parts   => ['allele','variant']
                                         );

 my $agg2 = Bio::DB::GFF::Aggregator->new(-method    => 'splice_group',
                                          -sub_parts => 'transcript');

 my $db = Bio::DB::GFF->new( -adaptor    => 'dbi::mysql',
                             -aggregator => [$agg1,$agg2],
                             -dsn        => 'dbi:mysql:elegans42',
                           );

=head1 DESCRIPTION

Bio::DB::GFF::Aggregator is used to aggregate GFF groups into
composite features.  Each composite feature has a "main part", the
top-level feature, and a series of zero or more subparts, retrieved
with the sub_SeqFeature() method.  The aggregator class is designed to
be subclassable, allowing a variety of GFF feature types to be
supported.

The base Bio::DB::GFF::Aggregator class is generic, and can be used to
create specific instances to be passed to the -aggregator argument of
the Bio::DB::GFF-E<gt>new() call.  The various subclasses of
Bio::DB::GFF::Aggregator are tuned for specific common feature types
such as clones, gapped alignments and transcripts.

Instances of Bio::DB::GFF::Aggregator have three attributes:

=over 3

=item *

method

This is the GFF method field of the composite feature as a whole.  For
example, "transcript" may be used for a composite feature created by
aggregating individual intron, exon and UTR features.

=item *

main method

Sometimes GFF groups are organized hierarchically, with one feature
logically containing another.  For example, in the C. elegans schema,
methods of type "Sequence:curated" correspond to regions covered by
curated genes.  There can be zero or one main methods.

=item *

subparts

This is a list of one or more methods that correspond to the component
features of the aggregates.  For example, in the C. elegans database,
the subparts of transcript are "intron", "exon" and "CDS".

=back

Aggregators have two main methods that can be overridden in
subclasses:

=over 4

=item *

disaggregate()

This method is called by the Adaptor object prior to fetching a list
of features.  The method is passed a reference to an array containing
the [method,source] pairs that the user has requested, and it adds to
that array the raw feature types that it would like the adaptor to
fetch.

=item *

aggregate()

This method is called by the Adaptor object after it has fetched
features.  The method is passed a list of raw features and is expected
to add its composite features to the list.

=back

The disaggregate() and aggregate() methods provided by the base
Aggregator class should be sufficient for many applications.  In this
case, it suffices for subclasses to override the following methods:

=over 4

=item *

method()

Return the default method for the composite feature as a whole.

=item *

main_name()

Return the default main method name.

=item *

part_names()

Return a list of subpart method names.

=back

Provided that method() and part_names() are overridden (and optionally
main_name() as well), then the bare name of the aggregator subclass
can be passed to the -aggregator argument of Bio::DB::GFF-E<gt>new().
For example, this is a small subclass that will aggregate features of
type "allele" and "polymorphism" into an aggregate named "mutant":

  package Bio::DB::GFF::Aggregator::mutant;

  use strict;
  use Bio::DB::GFF::Aggregator;

  use base qw(Bio::DB::GFF::Aggregator);

  sub method { 'mutant' }

  sub part_names {
    return qw(allele polymorphism);
  }

  1;

Once installed, this aggregator can be passed to Bio::DB::GFF-E<gt>new()
by name like so:

 my $db = Bio::DB::GFF->new( -adaptor    => 'dbi::mysql',
                             -aggregator => 'mutant',
                             -dsn        => 'dbi:mysql:elegans42',
                           );

=head1 API

The remainder of this document describes the public and private
methods implemented by this module.

=cut

package Bio::DB::GFF::Aggregator;

use strict;
use Bio::DB::GFF::Util::Rearrange;  # for rearrange()
use Bio::DB::GFF::Feature;

use base qw(Bio::Root::Root);

my $ALWAYS_TRUE = sub { 1 };

=head2 new

 Title   : new
 Usage   : $a = Bio::DB::GFF::Aggregator->new(@args)
 Function: create a new aggregator
 Returns : a Bio::DB::GFF::Aggregator object
 Args    : see below
 Status  : Public

This is the constructor for Bio::DB::GFF::Aggregator.  Named arguments
are as follows:

  -method           the method for the composite feature

  -main_method      the top-level raw feature, if any

  -sub_parts        the list of raw features that will form the subparts
                    of the composite feature (array reference or scalar)

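The argument parsing in new() also accepts a few spellings that the
table above does not list.  The following sketch is based only on the
rearrange() call in the constructor: -main_part is treated as a synonym
for -main_method, -sub_methods as a synonym for -sub_parts, and
-whole_object sets the flag returned by require_whole_object().  The
feature names are made up for illustration.

  use Bio::DB::GFF::Aggregator;

  # -main_part and -sub_methods are synonyms for
  # -main_method and -sub_parts
  my $agg = Bio::DB::GFF::Aggregator->new(-method       => 'cistron',
                                          -main_part    => 'locus',
                                          -sub_methods  => ['allele','variant'],
                                          -whole_object => 1,
                                         );

  print $agg->get_main_name, "\n";              # locus
  print join(',', $agg->get_part_names), "\n";  # allele,variant
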
=cut

sub new {
  my $class = shift;
  my ($method,$main,$sub_parts,$whole_object) = rearrange(['METHOD',
                                                           ['MAIN_PART','MAIN_METHOD'],
                                                           ['SUB_METHODS','SUB_PARTS'],
                                                           'WHOLE_OBJECT'
                                                          ],@_);
  return bless {
                method               => $method,
                main_method          => $main,
                sub_parts            => $sub_parts,
                require_whole_object => $whole_object,
               },$class;
}

=head2 disaggregate

 Title   : disaggregate
 Usage   : $a->disaggregate($types,$factory)
 Function: disaggregate type list into components
 Returns : a true value if this aggregator should be called to reaggregate
 Args    : see below
 Status  : Public

This method is called to disaggregate a list of types into the set of
low-level features to be retrieved from the GFF database.  The list of
types is passed as an array reference containing a series of
[method,source] pairs.  This method synthesizes a new set of
[method,source] pairs, and appends them to the list of requested
types, changing the list in situ.

Arguments:

  $types            reference to an array of [method,source] pairs

  $factory          reference to the Adaptor object that is calling
                    this method

Note that the API allows disaggregate() to remove types from the type
list.  This feature is probably not desirable and may be deprecated in
the future.

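To make the in situ rewriting concrete, here is a small sketch.  It
assumes only what the code in this module requires of the factory,
namely a parse_types() method that returns a reference to an array of
[method,source] pairs.  My::MockFactory is a made-up stand-in for a
real adaptor, and the type names are illustrative.

  use strict;
  use Bio::DB::GFF::Aggregator;

  package My::MockFactory;     # hypothetical, not part of BioPerl
  sub new { return bless {}, shift }
  sub parse_types {
    my $self = shift;
    # split "method:source" strings into [method,source] pairs
    return [ map { [ split /:/, $_, 2 ] } @_ ];
  }

  package main;

  my $agg = Bio::DB::GFF::Aggregator->new(-method    => 'transcript',
                                          -sub_parts => ['exon','intron']);

  my @types  = (['transcript','curated'], ['gene',undef]);
  my $wanted = $agg->disaggregate(\@types, My::MockFactory->new);

  # $wanted is true, and @types now holds
  #   ['gene',undef], ['exon','curated'], ['intron','curated']
  # so this prints "gene: exon:curated intron:curated"
  print join(' ', map { join ':', $_->[0], $_->[1] || '' } @types), "\n";

The composite type "transcript:curated" is replaced by its parts, each
inheriting the requested source, while the unrelated "gene" type is
passed through unchanged.
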
=cut

# this is called at the beginning to turn the pseudo-type
# into its component feature types
sub disaggregate {
  my $self    = shift;
  my $types   = shift;
  my $factory = shift;

  my $sub_features = $factory->parse_types($self->get_part_names);
  my $main_feature = $factory->parse_types($self->get_main_name);

  if (@$types) {
    my (@synthetic_types,@unchanged);
    foreach (@$types) {
      my ($method,$source) = @$_;
      if (lc $method eq lc $self->get_method) { # e.g. "transcript"
        push @synthetic_types,map { [$_->[0],$_->[1] || $source] } @$sub_features,@$main_feature;
      }
      else {
        push @unchanged,$_;
      }
    }
    # remember what we're searching for
    $self->components(\@synthetic_types);
    $self->passthru(\@unchanged);
    @$types = (@unchanged,@synthetic_types);
  }

  # we get here when no search types are listed
  else {
    my @stypes = map { [$_->[0],$_->[1]] } @$sub_features,@$main_feature;
    $self->components(\@stypes);
    $self->passthru(undef);
  }

  return $self->component_count > 0;
}

=head2 aggregate

 Title   : aggregate
 Usage   : $features = $a->aggregate($features,$factory)
 Function: aggregate a feature list into composite features
 Returns : an array reference containing modified features
 Args    : see below
 Status  : Public

This method is called to aggregate a list of raw GFF features into the
set of composite features.  The method is called with an array
reference to a set of Bio::DB::GFF::Feature objects.  It runs through
the list, creating new composite features when appropriate.  The method
result is an array reference containing the composite features.

Arguments:

  $features         reference to an array of Bio::DB::GFF::Feature objects

  $factory          reference to the Adaptor object that is calling
                    this method

NOTE: The reason that the function result contains the raw features as
well as the aggregated ones is to allow queries like this one:

  @features = $segment->features('exon','transcript:curated');

Assuming that "transcript" is the name of an aggregated feature and
that "exon" is one of its components, we do not want the transcript
aggregator to remove features of type "exon" because the user asked
for them explicitly.

=cut

sub aggregate {
  my $self     = shift;
  my $features = shift;
  my $factory  = shift;

  my $main_method = $self->get_main_name;
  my $matchsub    = $self->match_sub($factory) or return;
  my $strictmatch = $self->strict_match();
  my $passthru    = $self->passthru_sub($factory);

  my (%aggregates,@result);
  for my $feature (@$features) {

    if ($feature->group && $matchsub->($feature)) {
      my $key = $strictmatch->{lc $feature->method,lc $feature->source}
        ? join ($;,$feature->group,$feature->refseq,$feature->source)
        : join ($;,$feature->group,$feature->refseq);
      if ($main_method && lc $feature->method eq lc $main_method) {
        $aggregates{$key}{base} ||= $feature->clone;
      } else {
        push @{$aggregates{$key}{subparts}},$feature;
      }
      push @result,$feature if $passthru && $passthru->($feature);

    } else {
      push @result,$feature;
    }
  }

  # aggregate components
  my $pseudo_method        = $self->get_method;
  my $require_whole_object = $self->require_whole_object;
  foreach (keys %aggregates) {
    if ($require_whole_object && $self->components) {
      next unless $aggregates{$_}{base}; # && $aggregates{$_}{subparts};
    }
    my $base = $aggregates{$_}{base};
    unless ($base) { # no base, so create one
      my $first = $aggregates{$_}{subparts}[0];
      $base = $first->clone;     # to inherit parent coordinate system, etc
      $base->score(undef);
      $base->phase(undef);
    }
    $base->method($pseudo_method);
    $base->add_subfeature($_) foreach @{$aggregates{$_}{subparts}};
    $base->adjust_bounds;
    $base->compound(1);  # set the compound flag
    push @result,$base;
  }
  @$features = @result;
}

=head2 method

 Title   : method
 Usage   : $string = $a->method
 Function: get the method type for the composite feature
 Returns : a string
 Args    : none
 Status  : Protected

This method is called to get the method to be assigned to the
composite feature once it is aggregated.  It is called if the user did
not explicitly supply a -method argument when the aggregator was
created.

This is the method that should be overridden in aggregator subclasses.

=cut

# default method - override in subclasses
sub method {
  my $self = shift;
  $self->{method};
}

=head2 main_name

 Title   : main_name
 Usage   : $string = $a->main_name
 Function: get the method type for the "main" component of the feature
 Returns : a string
 Args    : none
 Status  : Protected

This method is called to get the method of the "main component" of the
composite feature.  It is called if the user did not explicitly supply
a -main_method argument when the aggregator was created.

This is the method that should be overridden in aggregator subclasses.

=cut

# no default main method
sub main_name {
  my $self = shift;
  return;
}

=head2 part_names

 Title   : part_names
 Usage   : @methods = $a->part_names
 Function: get the methods for the various non-main components of the feature
 Returns : a list of strings
 Args    : none
 Status  : Protected

This method is called to get the list of methods of the subparts of
the composite feature.  It is called if the user did not explicitly
supply a -sub_parts argument when the aggregator was created.

This is the method that should be overridden in aggregator subclasses.

=cut

# no default part names
sub part_names {
  my $self = shift;
  return;
}

=head2 require_whole_object

 Title   : require_whole_object
 Usage   : $bool = $a->require_whole_object
 Function: see below
 Returns : a boolean flag
 Args    : none
 Status  : Internal

This method returns true if the aggregator should refuse to aggregate
an object unless both its main part and its subparts are present.

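As a brief illustration (a sketch based on the accessor defined in this
module, with made-up feature names), the flag can be set with the
-whole_object constructor argument and changed later through the
accessor.  The sketch assumes that the accessor follows the same
pattern as components() and passthru(), returning the value that was in
effect before the call.

  use Bio::DB::GFF::Aggregator;

  my $agg = Bio::DB::GFF::Aggregator->new(-method       => 'transcript',
                                          -main_method  => 'gene',  # hypothetical names
                                          -sub_parts    => ['exon','intron'],
                                          -whole_object => 1);

  my $previous = $agg->require_whole_object(0);  # the old setting, 1
  my $current  = $agg->require_whole_object;     # the new setting, 0
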
=cut

sub require_whole_object {
  my $self = shift;
  my $d    = $self->{require_whole_object};
  $self->{require_whole_object} = shift if @_;
  $d;
}

=head2 match_sub

 Title   : match_sub
 Usage   : $coderef = $a->match_sub($factory)
 Function: generate a code reference that will match desired features
 Returns : a code reference
 Args    : see below
 Status  : Internal

This method is used internally to generate a code sub that will
quickly filter out the raw features that we're interested in
aggregating.  The returned sub accepts a Feature and returns true if
we should aggregate it, false otherwise.

=cut

#' make emacs happy

sub match_sub {
  my $self    = shift;
  my $factory = shift;
  my $types_to_aggregate = $self->components() or return;  # saved from disaggregate call
  return unless @$types_to_aggregate;
  return $factory->make_match_sub($types_to_aggregate);
}

=head2 strict_match

 Title   : strict_match
 Usage   : $strict = $a->strict_match
 Function: generate a hashref that indicates which subfeatures
           need to be tested strictly for matching sources before
           aggregating
 Returns : a hash ref
 Status  : Internal

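The returned hash is keyed on the lower-cased method and source joined
by Perl's subscript separator C<$;>, which is what a comma-separated
hash subscript produces.  The following standalone sketch (plain Perl,
no BioPerl classes; the type names are made up) mimics that bookkeeping
to show which lookups succeed:

  use strict;
  use warnings;

  # [method,source] pairs as components() would store them
  my @types = (['exon','curated'], ['intron',undef]);

  my %strict;
  for my $t (@types) {
    # only types with an explicit source are matched strictly
    $strict{lc $t->[0],lc $t->[1]}++ if defined $t->[1];
  }

  print exists $strict{'exon','curated'}         ? "strict\n" : "loose\n";  # strict
  print exists $strict{join $;,'exon','curated'} ? "strict\n" : "loose\n";  # strict
  print exists $strict{'intron',''}              ? "strict\n" : "loose\n";  # loose

Types that were requested without a source never enter the hash, so
features of those types are grouped without regard to their source.
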
=cut

sub strict_match {
  my $self = shift;
  my $types_to_aggregate = $self->components();
  my %strict;
  for my $t (@$types_to_aggregate) {
    $strict{lc $t->[0],lc $t->[1]}++ if defined $t->[1];
  }
  \%strict;
}

sub passthru_sub {
  my $self    = shift;
  my $factory = shift;
  my $passthru = $self->passthru() or return;
  return unless @$passthru;
  return $factory->make_match_sub($passthru);
}

=head2 components

 Title   : components
 Usage   : @array = $a->components([$components])
 Function: get/set stored list of parsed raw feature types
 Returns : an array in list context, an array ref in scalar context
 Args    : new arrayref of feature types
 Status  : Internal

This method is used internally to remember the parsed list of raw
features that we will aggregate.  The need for this subroutine is
seen when a user requests a composite feature of type
"clone:cosmid".  This generates a list of components in which the
source is appended to the method, like "clone_left_end:cosmid" and
"clone_right_end:cosmid".  components() stores this information for
later use.

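A short sketch of the calling conventions, using the type pairs from
the paragraph above.  In normal use the list is stored by
disaggregate() rather than by hand, so setting it directly here is
purely for illustration.

  use Bio::DB::GFF::Aggregator;

  my $agg = Bio::DB::GFF::Aggregator->new(-method => 'clone');

  $agg->components([ ['clone_left_end','cosmid'],
                     ['clone_right_end','cosmid'] ]);

  my @pairs = $agg->components;   # list context: the two [method,source] pairs
  my $ref   = $agg->components;   # scalar context: the stored array reference

  print scalar(@pairs), "\n";     # 2
  print $ref->[0][0], "\n";       # clone_left_end
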
=cut

sub components {
  my $self = shift;
  my $d = $self->{components};
  $self->{components} = shift if @_;
  return unless ref $d;
  return wantarray ? @$d : $d;
}

sub component_count {
  my @c = shift->components;
  scalar @c;
}

sub passthru {
  my $self = shift;
  my $d = $self->{passthru};
  $self->{passthru} = shift if @_;
  return unless ref $d;
  return wantarray ? @$d : $d;
}

sub clone {
  my $self = shift;
  my %new = %{$self};
  return bless \%new,ref($self);
}

=head2 get_part_names

 Title   : get_part_names
 Usage   : @array = $a->get_part_names
 Function: get list of sub-parts for this type of feature
 Returns : an array
 Args    : none
 Status  : Internal

This method is used internally to fetch the list of feature types that
form the components of the composite feature.  Type names in the
format "method:source" are recognized, as are bare "method" names and
Bio::DB::GFF::Typename objects.  It checks instance variables first,
and if they are not defined calls the part_names() method.

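The precedence can be seen with a small subclass.  This is a sketch
only: My::Aggregator::mutant is a hypothetical package modelled on the
"mutant" example in the DESCRIPTION, and it assumes that new() may be
called without arguments.

  use strict;
  use Bio::DB::GFF::Aggregator;

  package My::Aggregator::mutant;          # hypothetical subclass
  use base qw(Bio::DB::GFF::Aggregator);
  sub method     { 'mutant' }
  sub part_names { return qw(allele polymorphism) }

  package main;

  # no -sub_parts given: falls back to part_names()
  my $default = My::Aggregator::mutant->new;
  print join(',', $default->get_part_names), "\n";   # allele,polymorphism

  # an explicit -sub_parts overrides the class default
  my $custom = My::Aggregator::mutant->new(-sub_parts => 'allele');
  print join(',', $custom->get_part_names), "\n";    # allele
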
=cut

sub get_part_names {
  my $self = shift;
  if ($self->{sub_parts}) {
    return ref $self->{sub_parts} ? @{$self->{sub_parts}} : $self->{sub_parts};
  } else {
    return $self->part_names;
  }
}

=head2 get_main_name

 Title   : get_main_name
 Usage   : $string = $a->get_main_name
 Function: get the "main" method type for this feature
 Returns : a string
 Args    : none
 Status  : Internal

This method is used internally to fetch the type of the "main part" of
the feature.  It checks instance variables first, and if not defined
calls the main_name() method.

=cut

sub get_main_name {
  my $self = shift;
  return $self->{main_method} if defined $self->{main_method};
  return $self->main_name;
}

=head2 get_method

 Title   : get_method
 Usage   : $string = $a->get_method
 Function: get the method type for the composite feature
 Returns : a string
 Args    : none
 Status  : Internal

This method is used internally to fetch the type of the method that
will be assigned to the composite feature once it is synthesized.

=cut

sub get_method {
  my $self = shift;
  return $self->{method} if defined $self->{method};
  return $self->method;
}

1;

=head1 BUGS

None known yet.

=head1 SEE ALSO

L<Bio::DB::GFF>,
L<Bio::DB::GFF::Aggregator::alignment>,
L<Bio::DB::GFF::Aggregator::clone>,
L<Bio::DB::GFF::Aggregator::coding>,
L<Bio::DB::GFF::Aggregator::match>,
L<Bio::DB::GFF::Aggregator::processed_transcript>,
L<Bio::DB::GFF::Aggregator::transcript>,
L<Bio::DB::GFF::Aggregator::none>

=head1 AUTHOR

Lincoln Stein E<lt>lstein@cshl.orgE<gt>.

Copyright (c) 2001 Cold Spring Harbor Laboratory.

This library is free software; you can redistribute it and/or modify
it under the same terms as Perl itself.

=cut