GFF3Loader: add parent_id only if DEBUG is defined
package Bio::DB::SeqFeature::Store::GFF3Loader;


=head1 NAME

Bio::DB::SeqFeature::Store::GFF3Loader -- GFF3 file loader for Bio::DB::SeqFeature::Store

=head1 SYNOPSIS

  use Bio::DB::SeqFeature::Store;
  use Bio::DB::SeqFeature::Store::GFF3Loader;

  # Open the sequence database
  my $db = Bio::DB::SeqFeature::Store->new( -adaptor => 'DBI::mysql',
                                            -dsn     => 'dbi:mysql:test',
                                            -write   => 1 );

  my $loader = Bio::DB::SeqFeature::Store::GFF3Loader->new(-store   => $db,
                                                           -verbose => 1,
                                                           -fast    => 1);

  $loader->load('./my_genome.gff3');


=head1 DESCRIPTION

The Bio::DB::SeqFeature::Store::GFF3Loader object parses GFF3-format
sequence annotation files and loads Bio::DB::SeqFeature::Store
databases. For certain combinations of SeqFeature classes and
SeqFeature::Store databases it features a "fast load" mode which
accelerates the loading of GFF3 databases by a factor of 5-10.

The GFF3 file format has been extended very slightly to accommodate
Bio::DB::SeqFeature::Store. First, the loader recognizes a new
directive:

  # #index-subfeatures [0|1]

Note that you can place a space between the two #'s in order to
prevent GFF3 validators from complaining.

If this is true, then subfeatures are indexed (the default) so that
they can be retrieved with a query. See L<Bio::DB::SeqFeature::Store>
for an explanation of this. If false, then subfeatures can only be
accessed through their parent feature.

Second, the loader recognizes a new attribute tag called index, which
if present, controls indexing of the current feature. Example:

 ctg123 . TF_binding_site 1000 1012 . + . ID=tfbs00001;index=1

You can use this to turn indexing on and off, overriding the default
for a particular feature.

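
The interplay between the directive and the attribute can be sketched
as follows. This is a standalone illustration, not the loader's own
code: the helper name index_policy() is made up for this example, and
it ignores the loader's additional rule that top-level features are
always indexed.

```perl
use strict;
use warnings;

# Hypothetical helper: for each feature line, report the indexing
# flag that results from the file-wide default and any per-feature
# "index" attribute.
sub index_policy {
    my @lines   = @_;
    my $default = 1;    # subfeatures are indexed by default
    my @decisions;
    for my $line (@lines) {
        # matches "##index-subfeatures 0" and "# #index-subfeatures 0"
        if ($line =~ /^\#\s?\#\s*index-subfeatures\s+(\S+)/i) {
            $default = $1;
            next;
        }
        next if $line =~ /^\#/ or $line !~ /\S/;    # comments, blank lines
        my @col   = split /\t/, $line;
        my $attrs = defined $col[8] ? $col[8] : '';
        my ($id)    = $attrs =~ /(?:^|;)ID=([^;]+)/;
        my ($index) = $attrs =~ /(?:^|;)index=([^;]+)/i;
        push @decisions, [ $id, defined $index ? $index : $default ];
    }
    return @decisions;
}

my @result = index_policy(
    "# #index-subfeatures 0",
    join("\t", qw(ctg123 . exon 100 200 . + .), 'ID=exon1'),
    join("\t", qw(ctg123 . TF_binding_site 1000 1012 . + .), 'ID=tfbs00001;index=1'),
);
printf "%s => %s\n", @$_ for @result;
```

Here exon1 inherits the file-wide setting of 0, while tfbs00001
overrides it with index=1.
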
Note that the loader keeps a record -- in memory -- of each feature
that it has processed. If you find the loader running out of memory on
particularly large GFF3 files, please split the input file into
smaller pieces and do the load in steps.

=cut


# load utility - incrementally load the store based on GFF3 file
#
# two modes:
#   slow mode -- features can occur in any order in the GFF3 file
#   fast mode -- all features with same ID must be contiguous in GFF3 file

use strict;
use Carp 'croak';
use Bio::DB::GFF::Util::Rearrange;
use Bio::DB::SeqFeature::Store::LoadHelper;
use constant DEBUG => 0;

use base 'Bio::DB::SeqFeature::Store::Loader';


my %Special_attributes =(
                         Gap    => 1, Target => 1,
                         Parent => 1, Name   => 1,
                         Alias  => 1, ID     => 1,
                         index  => 1, Index  => 1,
                        );
my %Strandedness = ( '+'   => 1,
                     '-'   => -1,
                     '.'   => 0,
                     ''    => 0,
                     0     => 0,
                     1     => 1,
                     -1    => -1,
                     +1    => 1,
                     undef => 0,
                   );

=head2 new

 Title   : new
 Usage   : $loader = Bio::DB::SeqFeature::Store::GFF3Loader->new(@options)
 Function: create a new parser
 Returns : a Bio::DB::SeqFeature::Store::GFF3Loader gff3 parser and loader
 Args    : several - see below
 Status  : public

This method creates a new GFF3 loader and establishes its connection
with a Bio::DB::SeqFeature::Store database. Arguments are -name=E<gt>$value
pairs as described in this table:

 Name               Value
 ----               -----

 -store             A writable Bio::DB::SeqFeature::Store database handle.

 -seqfeature_class  The name of the type of Bio::SeqFeatureI object to create
                      and store in the database (Bio::DB::SeqFeature by default)

 -sf_class          A shorter alias for -seqfeature_class

 -verbose           Send progress information to standard error.

 -fast              If true, activate fast loading (see below)

 -chunk_size        Set the storage chunk size for nucleotide/protein sequences
                       (default 2000 bytes)

 -tmp               Indicate a temporary directory to use when loading non-normalized
                       features.

 -ignore_seqregion  Ignore ##sequence-region directives. The default is to create a
                       feature corresponding to the directive.

 -noalias_target    Don't create an Alias attribute for a target_id named in a
                       Target attribute. The default is to create an Alias
                       attribute containing the target_id found in a Target
                       attribute.

When you call new(), a connection to a Bio::DB::SeqFeature::Store
database should already have been established and the database
initialized (if appropriate).

Some combinations of Bio::SeqFeatures and Bio::DB::SeqFeature::Store
databases support a fast loading mode. Currently the only reliable
implementation of fast loading is the combination of DBI::mysql with
Bio::DB::SeqFeature. The other important restriction on fast loading
is the requirement that a feature that contains subfeatures must occur
in the GFF3 file before any of its subfeatures. Otherwise the
subfeatures that occurred before the parent feature will not be
attached to the parent correctly. This restriction does not apply to
normal (slow) loading.

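
Before enabling -fast it may be worth checking that a file respects
these ordering rules. The following is a minimal standalone sketch
(the helper fast_load_problems() is hypothetical and not part of this
module); it reports features whose Parent has not yet been seen, and
IDs that reappear after other features have intervened.

```perl
use strict;
use warnings;

# Hypothetical pre-flight check for fast loading.
sub fast_load_problems {
    my @lines = @_;
    my (%seen, @problems);
    my $previous_id = '';
    my $n = 0;
    for my $line (@lines) {
        $n++;
        next if $line =~ /^\#/ or $line !~ /\S/;    # directives, comments, blanks
        my $attrs = (split /\t/, $line)[8];
        next unless defined $attrs;
        my ($id)      = $attrs =~ /(?:^|;)ID=([^;]+)/;
        my ($parents) = $attrs =~ /(?:^|;)Parent=([^;]+)/;
        my @missing = grep { !$seen{$_} }
                      split /,/, (defined $parents ? $parents : '');
        push @problems, "line $n: parent @missing not yet seen" if @missing;
        if (defined $id) {
            push @problems, "line $n: ID $id is not contiguous"
                if $seen{$id} && $id ne $previous_id;
            $seen{$id}   = 1;
            $previous_id = $id;
        } else {
            $previous_id = '';
        }
    }
    return @problems;
}

my @problems = fast_load_problems(
    join("\t", qw(ctg123 . mRNA 100 900 . + .), 'ID=mrna1;Parent=gene1'),
    join("\t", qw(ctg123 . gene 100 900 . + .), 'ID=gene1'),
);
print "$_\n" for @problems;
```

In this input the mRNA precedes its gene, so one problem is reported
for line 1.
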
If you use an unnormalized feature class, such as
Bio::SeqFeature::Generic, then the loader needs to create a temporary
database in which to cache features until all their parts and subparts
have been seen. This temporary database uses the "berkeleydb"
adaptor. The -tmp option specifies the directory in which that
database will be created. If not present, it defaults to the system
default tmp directory specified by File::Spec-E<gt>tmpdir().

The -chunk_size option allows you to tune the representation of
DNA/Protein sequence in the Store database. By default, sequences are
split into 2000 base/residue chunks and then reassembled as
needed. This avoids the problem of pulling a whole chromosome into
memory in order to fetch a short subsequence from somewhere in the
middle. Depending on your usage patterns, you may wish to tune this
parameter using a chunk size that is larger or smaller than the
default.

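
The idea behind chunked storage can be sketched in plain Perl. This
is only an illustration under assumed names (store_chunks() and
fetch_subseq() are invented here, and a hash stands in for the
database); the actual storage layout is up to the Store adaptor.

```perl
use strict;
use warnings;

my $CHUNK_SIZE = 2000;    # the documented default

# Split a sequence into offset => chunk pairs.
sub store_chunks {
    my ($seq, $size) = @_;
    my %chunks;
    for (my $offset = 0; $offset < length $seq; $offset += $size) {
        $chunks{$offset} = substr($seq, $offset, $size);
    }
    return \%chunks;
}

# Fetch 1-based, inclusive coordinates, reading only the chunks
# that overlap the requested range.
sub fetch_subseq {
    my ($chunks, $size, $start, $end) = @_;
    my $first  = int(($start - 1) / $size) * $size;
    my $last   = int(($end   - 1) / $size) * $size;
    my $joined = '';
    for (my $offset = $first; $offset <= $last; $offset += $size) {
        $joined .= $chunks->{$offset} if defined $chunks->{$offset};
    }
    return substr($joined, $start - 1 - $first, $end - $start + 1);
}

my $seq    = join '', map { (qw(a c g t))[$_ % 4] } 0 .. 9999;
my $chunks = store_chunks($seq, $CHUNK_SIZE);
print fetch_subseq($chunks, $CHUNK_SIZE, 1995, 2006), "\n";
```

The request for positions 1995..2006 touches only two of the five
chunks. A larger chunk size means fewer reads for long ranges, while
a smaller one means less wasted data for short ranges.
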
=cut

sub new {
  my $class = shift;
  my $self  = $class->SUPER::new(@_);
  my ($ignore_seqregion) = rearrange(['IGNORE_SEQREGION'],@_);
  $self->ignore_seqregion($ignore_seqregion);
  my ($noalias_target) = rearrange(['NOALIAS_TARGET'],@_);
  $self->noalias_target($noalias_target);
  $self;
}

=head2 ignore_seqregion

  $ignore_it = $loader->ignore_seqregion([$new_flag])

Get or set the ignore_seqregion flag, which if true, will cause
GFF3 ##sequence-region directives to be ignored. The default behavior
is to create a feature corresponding to the region.

=cut

sub ignore_seqregion {
  my $self = shift;
  my $d    = $self->{ignore_seqregion};
  $self->{ignore_seqregion} = shift if @_;
  $d;
}

=head2 noalias_target

  $noalias_target = $loader->noalias_target([$new_flag])

Get or set the noalias_target flag, which if true, will disable the creation of
an Alias attribute for a target_id named in a Target attribute. The default is
to create an Alias attribute containing the target_id found in a Target
attribute.

=cut

sub noalias_target {
  my $self = shift;
  my $d    = $self->{noalias_target};
  $self->{noalias_target} = shift if @_;
  $d;
}

=head2 load

 Title   : load
 Usage   : $count = $loader->load(@ARGV)
 Function: load the indicated files or filehandles
 Returns : number of feature lines loaded
 Args    : list of files or filehandles
 Status  : public

Once the loader is created, invoke its load() method with a list of
GFF3 or FASTA file paths or previously-opened filehandles in order to
load them into the database. Compressed files ending with .gz, .Z and
.bz2 are automatically recognized and uncompressed on the fly. Paths
beginning with http: or ftp: are treated as URLs and opened using the
LWP GET program (which must be on your path).

FASTA files are recognized by their initial "E<gt>" character. Do not feed
the loader a file that is neither GFF3 nor FASTA; I don't know what
will happen, but it will probably not be what you expect.

=cut

# sub load { } inherited

=head2 accessors

The following read-only accessors return values passed or created during new():

 store()          the long-term Bio::DB::SeqFeature::Store object

 tmp_store()      the temporary Bio::DB::SeqFeature::Store object used
                    during loading

 sfclass()        the Bio::SeqFeatureI class

 fast()           whether fast loading is active

 seq_chunk_size() the sequence chunk size

 verbose()        verbose progress messages

=cut

# sub store          inherited
# sub tmp_store      inherited
# sub sfclass        inherited
# sub fast           inherited
# sub seq_chunk_size inherited
# sub verbose        inherited

=head2 Internal Methods

The following methods are used internally and may be overridden by
subclasses.

=over 4

=item default_seqfeature_class

  $class = $loader->default_seqfeature_class

Return the default SeqFeatureI class (Bio::DB::SeqFeature).

=cut

# sub default_seqfeature_class { } inherited

=item subfeatures_normalized

  $flag = $loader->subfeatures_normalized([$new_flag])

Get or set a flag that indicates that the subfeatures are
normalized. This is deduced from the SeqFeature class information.

=cut

# sub subfeatures_normalized { } inherited

=item subfeatures_in_table

  $flag = $loader->subfeatures_in_table([$new_flag])

Get or set a flag that indicates that feature/subfeature relationships
are stored in a table. This is deduced from the SeqFeature class and
Store information.

=cut

# sub subfeatures_in_table { } inherited

=item load_fh

  $count = $loader->load_fh($filehandle)

Load the GFF3 data at the other end of the filehandle and return true
if successful. Internally, load_fh() invokes:

  start_load();
  do_load($filehandle);
  finish_load();

=cut

# sub load_fh { } inherited

=item start_load, finish_load

These methods are called at the start and end of a filehandle load.

=cut

sub create_load_data { #overridden
  my $self = shift;
  $self->SUPER::create_load_data;
  $self->{load_data}{TemporaryID}      = "GFFLoad0000000";
  $self->{load_data}{IndexSubfeatures} = $self->index_subfeatures();
  $self->{load_data}{mode}             = 'gff';

  $self->{load_data}{Helper} =
    Bio::DB::SeqFeature::Store::LoadHelper->new($self->{tmpdir});
}

sub finish_load { #overridden
  my $self = shift;

  $self->store_current_feature();      # during fast loading, we will have a feature left at the very end
  $self->start_or_finish_sequence();   # finish any half-loaded sequences

  $self->msg("Building object tree...");
  my $start = $self->time();
  $self->build_object_tree;
  $self->msg(sprintf "%5.2fs\n",$self->time()-$start);

  if ($self->fast) {
    $self->msg("Loading bulk data into database...");
    $start = $self->time();
    $self->store->finish_bulk_update;
    $self->msg(sprintf "%5.2fs\n",$self->time()-$start);
  }
  eval {$self->store->commit};

  # don't delete load data so that caller can ask for the loaded IDs
  # $self->delete_load_data;
}

=item do_load

  $count = $loader->do_load($fh)

This is called by load_fh() to load the GFF3 file's filehandle and
return the number of lines loaded.

=cut

# sub do_load { } inherited

=item load_line

  $loader->load_line($data);

Load a line of a GFF3 file. You must bracket this with calls to
start_load() and finish_load()!

  $loader->start_load();
  $loader->load_line($_) while <FH>;
  $loader->finish_load();

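
Each line is dispatched according to its content and the current
parsing mode ('gff' or 'fasta'). The decision order can be sketched
as a standalone function; classify_line() is a made-up name used only
for this illustration, and it returns a label instead of calling the
handler methods.

```perl
use strict;
use warnings;

# Hypothetical classifier; $mode_ref carries the mode between calls.
sub classify_line {
    my ($line, $mode_ref) = @_;
    chomp $line;
    return 'blank' unless $line =~ /^\S/;
    # a tab, or a chrom.sizes-style "name length" line, switches to GFF mode
    $$mode_ref = 'gff' if $line =~ /\t/ or $line =~ /^\w+\s+\d+\s*$/;
    if ($line =~ /^\#\s?\#\s*(.+)/) { $$mode_ref = 'gff';   return 'directive' }
    if ($line =~ /^\#/)             { $$mode_ref = 'gff';   return 'comment' }
    if ($line =~ /^>\s*(\S+)/)      { $$mode_ref = 'fasta'; return 'fasta header' }
    return $$mode_ref eq 'fasta' ? 'sequence' : 'feature';
}

my $mode  = 'gff';
my @kinds = map { classify_line($_, \$mode) } (
    "##gff-version 3\n",
    "ctg123\t.\tgene\t1000\t9000\t.\t+\t.\tID=gene00001\n",
    ">ctg123\n",
    "cttctgggcgtacccgattctcgg\n",
);
print join(', ', @kinds), "\n";
```
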
=cut

sub load_line { #overridden
  my $self = shift;
  my $line = shift;

  chomp($line);
  my $load_data = $self->{load_data};
  $load_data->{line}++;

  return unless $line =~ /^\S/;     # blank line

  # if it has a tab in it or looks like a chrom.sizes file, switch to gff mode
  $load_data->{mode} = 'gff' if $line =~ /\t/
    or $line =~ /^\w+\s+\d+\s*$/;

  if ($line =~ /^\#\s?\#\s*(.+)/) {  ## meta instruction
    $load_data->{mode} = 'gff';
    $self->handle_meta($1);

  } elsif ($line =~ /^\#/) {
    $load_data->{mode} = 'gff';  # just to be safe
    return; # comment
  }

  elsif ($line =~ /^>\s*(\S+)/) { # FASTA lines are coming
    $load_data->{mode} = 'fasta';
    $self->start_or_finish_sequence($1);
  }

  elsif ($load_data->{mode} eq 'fasta') {
    $self->load_sequence($line);
  }

  elsif ($load_data->{mode} eq 'gff') {
    $self->handle_feature($line);
    if (++$load_data->{count} % 1000 == 0) {
      my $now = $self->time();
      my $nl = -t STDOUT && !$ENV{EMACS} ? "\r" : "\n";
      local $^W = 0; # kill uninit variable warning
      $self->msg(sprintf("%d features loaded in %5.2fs (%5.2fs/1000 features)...%s$nl",
                         $load_data->{count},$now - $load_data->{start_time},
                         $now - $load_data->{millenium_time},
                         ' ' x 80
                        ));
      $load_data->{millenium_time} = $now;
    }
  }

  else {
    $self->throw("I don't know what to do with this line:\n$line");
  }
}


=item handle_meta

  $loader->handle_meta($meta_directive)

This method is called to handle meta-directives such as
##sequence-region. The method will receive the directive with the
initial ## stripped off.

=cut

sub handle_meta {
  my $self = shift;
  my $instruction = shift;

  if ( $instruction =~ /^#$/ ) {
    $self->store_current_feature() ;      # during fast loading, we will have a feature left at the very end
    $self->start_or_finish_sequence();    # finish any half-loaded sequences
    if ( $self->store->can('handle_resolution_meta') ) {
      $self->store->handle_resolution_meta($instruction);
    }
    return;
  }

  if ($instruction =~ /sequence-region\s+(.+)\s+(-?\d+)\s+(-?\d+)/i
      && !$self->ignore_seqregion()) {
    my($ref,$start,$end,$strand) = $self->_remap($1,$2,$3,+1);
    my $feature = $self->sfclass->new(-name        => $ref,
                                      -seq_id      => $ref,
                                      -start       => $start,
                                      -end         => $end,
                                      -strand      => $strand,
                                      -primary_tag => 'region');
    $self->store->store($feature);
    return;
  }

  if ($instruction =~/index-subfeatures\s+(\S+)/i) {
    $self->{load_data}{IndexSubfeatures} = $1;
    $self->store->index_subfeatures($1);
    return;
  }

  if ( $self->store->can('handle_unrecognized_meta') ) {
    $self->store->handle_unrecognized_meta($instruction);
    return;
  }
}

=item handle_feature

  $loader->handle_feature($gff3_line)

This method is called to process a single GFF3 line. It manipulates
information stored in a data structure called $self-E<gt>{load_data}.

=cut

sub handle_feature { #overridden
  my $self     = shift;
  my $gff_line = shift;
  my $ld       = $self->{load_data};

  my $allow_whitespace = $self->allow_whitespace;

  # special case for a chrom.sizes-style line
  my @columns;
  if ($gff_line =~ /^(\w+)\s+(\d+)\s*$/) {
    @columns = ($1,undef,'chromosome',1,$2,undef,undef,undef,"Name=$1");
  } else {
    $gff_line =~ s/\s+/\t/g if $allow_whitespace;
    @columns = map {$_ eq '.' ? undef : $_ } split /\t/,$gff_line;
  }

  $self->invalid_gff($gff_line) if @columns < 4;
  $self->invalid_gff($gff_line) if @columns > 9 && $allow_whitespace;

  {
    local $^W = 0;
    if (@columns > 9) { #oops, split too much due to whitespace
      $columns[8] = join(' ',@columns[8..$#columns]);
    }
  }

  my ($refname,$source,$method,$start,$end,$score,$strand,$phase,$attributes) = @columns;

  $self->invalid_gff($gff_line) unless defined $refname;
  $self->invalid_gff($gff_line) unless !defined $start || $start =~ /^[\d.-]+$/;
  $self->invalid_gff($gff_line) unless !defined $end   || $end   =~ /^[\d.-]+$/;
  $self->invalid_gff($gff_line) unless defined $method;

  $strand = $Strandedness{$strand||0};
  my ($reserved,$unreserved) = $attributes ? $self->parse_attributes($attributes) : ();

  my $name = ($reserved->{Name} && $reserved->{Name}[0]);

  my $has_loadid = defined $reserved->{ID}[0];

  my $feature_id = defined $reserved->{ID}[0] ? $reserved->{ID}[0] : $ld->{TemporaryID}++;
  # avoid "my ... if ...", whose behavior is undefined when the condition is false
  my @parent_ids = defined $reserved->{Parent} ? @{$reserved->{Parent}} : ();

  my $index_it = $ld->{IndexSubfeatures};
  if (exists $reserved->{Index} || exists $reserved->{index}) {
    $index_it = $reserved->{Index}[0] || $reserved->{index}[0];
  }

  # Everything in the unreserved hash becomes an attribute, so we copy
  # some attributes over
  $unreserved->{Note}   = $reserved->{Note}   if exists $reserved->{Note};
  $unreserved->{Alias}  = $reserved->{Alias}  if exists $reserved->{Alias};
  $unreserved->{Target} = $reserved->{Target} if exists $reserved->{Target};
  $unreserved->{Gap}    = $reserved->{Gap}    if exists $reserved->{Gap};
  $unreserved->{load_id}= $reserved->{ID}     if exists $reserved->{ID};

  # mec@stowers-institute.org, wondering why not all attributes are
  # carried forward, adds ID tag in particular service of
  # round-tripping ID, which, though present in database as load_id
  # attribute, was getting lost as itself
  # $unreserved->{ID}= $reserved->{ID} if exists $reserved->{ID};

  # TEMPORARY HACKS TO SIMPLIFY DEBUGGING
  $feature_id = '' unless defined $feature_id;
  $name       = '' unless defined $name;  # prevent uninit variable warnings
  # push @{$unreserved->{Alias}},$feature_id if $has_loadid && $feature_id ne $name;
  $unreserved->{parent_id} = \@parent_ids if DEBUG && @parent_ids;

  # POSSIBLY A PERMANENT HACK -- TARGETS BECOME ALIASES
  # THIS IS TO ALLOW FOR TARGET-BASED LOOKUPS
  if (exists $reserved->{Target} && !$self->{noalias_target}) {
    my %aliases = map {$_=>1} @{$unreserved->{Alias}};
    for my $t (@{$reserved->{Target}}) {
      (my $tc = $t) =~ s/\s+.*$//;  # get rid of coordinates
      $name ||= $tc;
      push @{$unreserved->{Alias}},$tc unless $name eq $tc || $aliases{$tc};
    }
  }

  ($refname,$start,$end,$strand) = $self->_remap($refname,$start,$end,$strand) or return;

  my @args = (-display_name => $name,
              -seq_id       => $refname,
              -start        => $start,
              -end          => $end,
              -strand       => $strand || 0,
              -score        => $score,
              -phase        => $phase,
              -primary_tag  => $method || 'feature',
              -source       => $source,
              -tag          => $unreserved,
              -attributes   => $unreserved,
             );

  # Here's where we handle feature lines that have the same ID (multiple locations, not
  # parent/child relationships)

  my $old_feat;

  # Current feature is the same as the previous feature, which hasn't yet been loaded
  if (defined $ld->{CurrentID} && $ld->{CurrentID} eq $feature_id) {
    $old_feat = $ld->{CurrentFeature};
  }

  # Current feature is the same as a feature that was loaded earlier
  elsif (defined(my $id = $self->{load_data}{Helper}->local2global($feature_id))) {
    $old_feat = $self->fetch($feature_id)
      or $self->warn(<<END);
ID=$feature_id has been used more than once, but it cannot be found in the database.
This can happen if you have specified fast loading, but features sharing the same ID
are not contiguous in the GFF file. This will be loaded as a separate feature.
Line $.: "$_"
END
  }

  # contiguous feature, so add a segment
  warn $old_feat if defined $old_feat and !ref $old_feat;
  if (defined $old_feat) {
    # set this to 1 to disable split-location behavior
    if (0 && @parent_ids) {                  # If multiple features are held together by the same ID
      $feature_id = $ld->{TemporaryID}++;    # AND they have a Parent attribute, this causes an undesirable
    }                                        # additional layer of aggregation. Changing the ID fixes this.
    elsif (
           $old_feat->seq_id ne $refname ||
           $old_feat->start  != $start ||
           $old_feat->end    != $end # make sure endpoints are distinct
          )
      {
        $self->add_segment($old_feat,$self->sfclass->new(@args));
        return;
      }
  }

  # we get here if this is a new feature
  # first of all, store the current feature if it is there
  $self->store_current_feature() if defined $ld->{CurrentID};

  # now create the new feature
  # (index top-level features only if policy asks us to)
  my $feature = $self->sfclass->new(@args);
  $feature->object_store($self->store) if $feature->can('object_store');  # for lazy table features
  $ld->{CurrentFeature} = $feature;
  $ld->{CurrentID}      = $feature_id;

  my $top_level = !@parent_ids;
  my $has_id    = defined $reserved->{ID}[0];
  $index_it   ||= $top_level;

  my $helper = $ld->{Helper};
  $helper->indexit($feature_id=>1)  if $index_it;
  $helper->toplevel($feature_id=>1) if !$self->{fast}
                                       && $top_level;  # need to track top level features


  # remember parentage
  for my $parent (@parent_ids) {
    $helper->add_children($parent=>$feature_id);
  }

}

sub invalid_gff {
  my $self = shift;
  my $line = shift;
  $self->throw("invalid GFF line at line $self->{load_data}{line}.\n".$line);
}

=item allow_whitespace

  $allow_it = $loader->allow_whitespace([$newvalue]);

Get or set the allow_whitespace flag. If true, then GFF3 files are
allowed to be delimited with whitespace in addition to tabs.

=cut

sub allow_whitespace {
  my $self = shift;
  my $d    = $self->{allow_whitespace};
  $self->{allow_whitespace} = shift if @_;
  $d;
}

=item store_current_feature

  $loader->store_current_feature()

This method is called to store the currently active feature in the
database. It uses a data structure stored in $self-E<gt>{load_data}.

=cut

# sub store_current_feature { } inherited

=item build_object_tree

  $loader->build_object_tree()

This method gathers together features and subfeatures and builds the graph that connects them.

=cut

###
# put objects together
#
sub build_object_tree {
  my $self = shift;
  $self->subfeatures_in_table ? $self->build_object_tree_in_tables : $self->build_object_tree_in_features;
}

=item build_object_tree_in_tables

  $loader->build_object_tree_in_tables()

This method gathers together features and subfeatures and builds the
graph that connects them, assuming that parent/child relationships
will be stored in a database table.

=cut

sub build_object_tree_in_tables {
  my $self = shift;
  my $store  = $self->store;
  my $helper = $self->{load_data}{Helper};

  while (my ($load_id,$children) = $helper->each_family()) {

    my $parent_id = $helper->local2global($load_id);
    die $self->throw("$load_id doesn't have a primary id")
      unless defined $parent_id;

    my @children = map {$helper->local2global($_)} @$children;
    # this updates the table that keeps track of parent/child relationships,
    # but does not update the parent object -- so (start,end) had better be right!!!
    $store->add_SeqFeature($parent_id,@children);
  }

}


=item build_object_tree_in_features

  $loader->build_object_tree_in_features()

This method gathers together features and subfeatures and builds the
graph that connects them, assuming that parent/child relationships are
stored in the seqfeature objects themselves.

=cut

sub build_object_tree_in_features {
  my $self  = shift;
  my $store      = $self->store;
  my $tmp        = $self->tmp_store;
  my $ld         = $self->{load_data};
  my $normalized = $self->subfeatures_normalized;

  my $helper = $ld->{Helper};

  while (my $load_id = $helper->each_toplevel) {
    my $feature = $self->fetch($load_id)
      or $self->throw("$load_id (id="
                      .$helper->local2global($load_id)
                      .") should have a database entry, but doesn't");
    $self->attach_children($store,$ld,$load_id,$feature);
    # Indexed objects are updated, not created anew
    $feature->primary_id(undef) unless $helper->indexit($load_id);
    $store->store($feature);
  }

}

=item attach_children

  $loader->attach_children($store,$load_data,$load_id,$feature)

This recursively adds children to features and their subfeatures. It
is called when subfeatures are directly contained within other
features, rather than stored in a relational table.

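
The recursion can be pictured with plain hashes standing in for
feature objects. This is a sketch only: attach_children_sketch() and
the %children_of table are invented for the example, whereas the real
method fetches features from the store and calls add_SeqFeature() on
them.

```perl
use strict;
use warnings;

# Parent => children table, as might be collected from Parent= attributes.
my %children_of = (
    gene1 => ['mrna1'],
    mrna1 => ['exon1', 'exon2'],
);

# Depth-first: each child is completed (with its own children)
# before it is attached to its parent.
sub attach_children_sketch {
    my ($id, $children_of) = @_;
    my $node = { id => $id, subfeatures => [] };
    for my $child_id (@{ $children_of->{$id} || [] }) {
        push @{ $node->{subfeatures} },
             attach_children_sketch($child_id, $children_of);
    }
    return $node;
}

my $tree = attach_children_sketch('gene1', \%children_of);
print scalar @{ $tree->{subfeatures}[0]{subfeatures} }, " exons under mrna1\n";
```
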
=cut

sub attach_children {
  my $self = shift;
  my ($store,$ld,$load_id,$feature) = @_;

  my $children = $ld->{Helper}->children() or return;
  for my $child_id (@$children) {
    my $child = $self->fetch($child_id)
      or $self->throw("$child_id should have a database entry, but doesn't");
    $self->attach_children($store,$ld,$child_id,$child);   # recursive call
    $feature->add_SeqFeature($child);
  }
}

=item fetch

  my $feature = $loader->fetch($load_id)

Given a load ID (from the ID= attribute) this method returns the
feature from the temporary database or the permanent one, depending on
where it is stored.

=cut

sub fetch {
  my $self    = shift;
  my $load_id = shift;
  my $helper  = $self->{load_data}{Helper};
  my $id      = $helper->local2global($load_id);

  return
    ($self->subfeatures_normalized || $helper->indexit($load_id)
     ? $self->store->fetch($id)
     : $self->tmp_store->fetch($id)
    );
}

=item add_segment

  $loader->add_segment($parent,$child)

This method is used to add a split location to the parent.

=cut

sub add_segment {
  my $self = shift;
  my ($parent,$child) = @_;

  if ($parent->can('add_segment')) { # probably a lazy table feature
    # note: the method name passed to can() must not contain stray whitespace
    my $segment_count =  $parent->can('denormalized_segment_count') ? $parent->denormalized_segment_count
                       : $parent->can('denormalized_segments')      ? $parent->denormalized_segments
                       : $parent->can('segments')                   ? $parent->segments
                       : 0;
    unless ($segment_count) {  # convert into a segmented object
      my $segment;
      if ($parent->can('clone')) {
        $segment = $parent->clone;
      } else {
        my %clone = %$parent;
        $segment  = bless \%clone,ref $parent;
      }
      delete $segment->{segments};
      eval {$segment->object_store(undef) };
      $segment->primary_id(undef);

      # this updates the object and expands its start and end positions without writing
      # the segments into the database as individual objects
      $parent->add_segment($segment);
    }
    $parent->add_segment($child);
    1; # for debugging
  }

  # a conventional Bio::SeqFeature::Generic object - create a split location
  else {
    my $current_location = $parent->location;
    if ($current_location->can('add_sub_Location')) {
      $current_location->add_sub_Location($child->location);
    } else {
      eval "require Bio::Location::Split" unless Bio::Location::Split->can('add_sub_Location');
      my $new_location = Bio::Location::Split->new();
      $new_location->add_sub_Location($current_location);
      $new_location->add_sub_Location($child->location);
      $parent->location($new_location);
    }
  }
}

=item parse_attributes

  ($reserved,$unreserved) = $loader->parse_attributes($attribute_line)

This method parses the information contained in the $attribute_line
into two hashrefs, one containing the values of reserved attribute
tags (e.g. ID) and the other containing the values of unreserved ones.

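
As a rough standalone sketch of the same idea (the names
parse_attributes_sketch() and decode() are invented here; the list of
reserved tags mirrors %Special_attributes in this module):

```perl
use strict;
use warnings;

my %RESERVED = map { $_ => 1 } qw(Gap Target Parent Name Alias ID index Index);

# Decode %XX escapes, leaving everything else as-is.
sub decode {
    my $s = shift;
    $s =~ s/%([0-9a-fA-F]{2})/chr(hex($1))/ge;
    return $s;
}

sub parse_attributes_sketch {
    my $column9 = shift;
    my (%reserved, %unreserved);
    for my $pair (split /;/, $column9) {
        my ($tag, $value) = split /=/, $pair, 2;
        next unless defined $tag && defined $value;    # skip empty or valueless tags
        $tag = decode($tag);
        # split on commas BEFORE decoding, so an escaped %2C stays in its value
        my @values = map { decode($_) } split /,/, $value;
        my $bucket = $RESERVED{$tag} ? \%reserved : \%unreserved;
        push @{ $bucket->{$tag} }, @values;
    }
    return (\%reserved, \%unreserved);
}

my ($reserved, $unreserved) =
    parse_attributes_sketch('ID=mrna1;Parent=gene1,gene2;Note=hello%2C world');
print "Parents: @{ $reserved->{Parent} }\n";
print "Note: $unreserved->{Note}[0]\n";
```

Every value is stored as an array reference, because any tag may
carry several comma-separated values.
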
=cut

sub parse_attributes {
  my $self = shift;
  my $att  = shift;

  unless ($att =~ /=/) {  # ouch! must be a GFF line
    require Bio::DB::SeqFeature::Store::GFF2Loader
      unless Bio::DB::SeqFeature::Store::GFF2Loader->can('parse_attributes');
    return $self->Bio::DB::SeqFeature::Store::GFF2Loader::parse_attributes($att);
  }

  # split on the first '=' only, so that a stray '=' inside a value is kept
  my @pairs = map { my ($name,$value) = split /=/,$_,2;
                    [$self->unescape($name) => $value];
                  } split ';',$att;
  my (%reserved,%unreserved);
  foreach (@pairs) {
    my $tag = $_->[0];

    unless (defined $_->[1]) {
      warn "$tag does not have a value at GFF3 file line $.\n";
      next;
    }

    my @values = split ',',$_->[1];
    $_ = $self->unescape($_) for @values;
    if ($Special_attributes{$tag}) {  # reserved attribute
      push @{$reserved{$tag}},@values;
    } else {
      push @{$unreserved{$tag}},@values
    }
  }
  return (\%reserved,\%unreserved);
}

=item start_or_finish_sequence

  $loader->start_or_finish_sequence('Chr9')

This method is called at the beginning and end of a fasta section.

=cut

# sub start_or_finish_sequence { } inherited

=item load_sequence

  $loader->load_sequence('gatttcccaaa')

This method is called to load some amount of sequence after
start_or_finish_sequence() is first called.

=cut

# sub load_sequence { } inherited

=item open_fh

  my $io_file = $loader->open_fh($filehandle_or_path)

This method opens up the indicated file or pipe, using some
intelligence to recognize compressed files and URLs and doing the
right thing.

=cut

# sub open_fh { } inherited

# sub msg { } inherited

=item time

  my $time = $loader->time

This method returns the current time in seconds, using Time::HiRes if available.

=cut

# sub time { } inherited

=item unescape

  my $unescaped = $loader->unescape($escaped)

This is an internal utility. It is the same as CGI::Util::unescape,
but doesn't change pluses into spaces and ignores unicode escapes.

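
The behavior described above can be approximated by a single
substitution. This is a sketch of the idea, written as a plain
function named unescape_sketch(), and is not a copy of the inherited
implementation:

```perl
use strict;
use warnings;

sub unescape_sketch {
    my $text = shift;
    return undef unless defined $text;
    # decode two-digit %XX escapes only; "+" and %uXXXX are left untouched
    $text =~ s/%([0-9a-fA-F]{2})/chr(hex($1))/ge;
    return $text;
}

print unescape_sketch('a%3Bb+c'), "\n";
print unescape_sketch('%u00e9'),  "\n";
```
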
=cut

# sub unescape { } inherited

sub _remap {
  my $self = shift;
  my ($ref,$start,$end,$strand) = @_;
  my $mapper = $self->coordinate_mapper;
  return ($ref,$start,$end,$strand) unless $mapper;

  my ($newref,$coords) = $mapper->($ref,[$start,$end]);
  return unless defined $coords->[0];
  if ($coords->[0] > $coords->[1]) {
    @{$coords} = reverse(@{$coords});
    $strand *= -1;
  }

  return ($newref,@{$coords},$strand);
}

sub _indexit { # override
  my $self = shift;
  return $self->{load_data}{Helper}->indexit(@_);
}

sub _local2global { # override
  my $self = shift;
  return $self->{load_data}{Helper}->local2global(@_);
}

=item local_ids

  my $ids    = $self->local_ids;
  my $id_cnt = @$ids;

After performing a load, this returns an array ref containing all the
load file IDs that were contained within the file just loaded.

=cut

sub local_ids { # override
  my $self = shift;
  return $self->{load_data}{Helper}->local_ids(@_);
}

=item loaded_ids

  my $ids    = $loader->loaded_ids;
  my $id_cnt = @$ids;

After performing a load, this returns an array ref containing all the
feature primary ids that were created during the load.

=cut

sub loaded_ids { # override
  my $self = shift;
  return $self->{load_data}{Helper}->loaded_ids(@_);
}

1;

__END__

=back

=head1 BUGS

This is an early version, so there are certainly some bugs. Please
use the BioPerl bug tracking system to report bugs.

=head1 SEE ALSO

L<Bio::DB::SeqFeature::Store>,
L<Bio::DB::SeqFeature::Segment>,
L<Bio::DB::SeqFeature::NormalizedFeature>,
L<Bio::DB::SeqFeature::Store::GFF2Loader>,
L<Bio::DB::SeqFeature::Store::DBI::mysql>,
L<Bio::DB::SeqFeature::Store::berkeleydb>

=head1 AUTHOR

Lincoln Stein E<lt>lstein@cshl.orgE<gt>.

Copyright (c) 2006 Cold Spring Harbor Laboratory.

This library is free software; you can redistribute it and/or modify
it under the same terms as Perl itself.

=cut