package Bio::DB::SeqFeature::Store::GFF3Loader;

=head1 NAME

Bio::DB::SeqFeature::Store::GFF3Loader -- GFF3 file loader for Bio::DB::SeqFeature::Store

=head1 SYNOPSIS

  use Bio::DB::SeqFeature::Store;
  use Bio::DB::SeqFeature::Store::GFF3Loader;

  # Open the sequence database
  my $db      = Bio::DB::SeqFeature::Store->new( -adaptor => 'DBI::mysql',
                                                 -dsn     => 'dbi:mysql:test',
                                                 -write   => 1 );

  my $loader = Bio::DB::SeqFeature::Store::GFF3Loader->new(-store    => $db,
                                                           -verbose  => 1,
                                                           -fast     => 1);

  $loader->load('./my_genome.gff3');

=head1 DESCRIPTION

The Bio::DB::SeqFeature::Store::GFF3Loader object parses GFF3-format
sequence annotation files and loads Bio::DB::SeqFeature::Store
databases. For certain combinations of SeqFeature classes and
SeqFeature::Store databases it features a "fast load" mode that
accelerates loading by a factor of 5-10.

The GFF3 file format has been extended very slightly to accommodate
Bio::DB::SeqFeature::Store. First, the loader recognizes a new
directive:

  # #index-subfeatures [0|1]

Note that you can place a space between the two #'s in order to
prevent GFF3 validators from complaining.

If this is true, then subfeatures are indexed (the default) so that
they can be retrieved with a query. See L<Bio::DB::SeqFeature::Store>
for an explanation of this. If false, then subfeatures can only be
accessed through their parent feature.

Second, the loader recognizes a new attribute tag called index, which
if present, controls indexing of the current feature. Example:

 ctg123 . TF_binding_site 1000 1012 . + . ID=tfbs00001;index=1

You can use this to turn indexing on and off, overriding the default
for a particular feature.
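
The way the file-wide default and the per-feature tag combine can be
sketched as follows. This is a simplified standalone illustration modelled
on the logic in handle_feature(), not the loader's own code; the helper
name should_index() is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: decide whether a feature gets indexed.
#   $default   - current index-subfeatures setting
#   $attrs     - hashref of parsed column-9 attributes (tag => [values])
#   $top_level - true if the feature has no Parent attribute
sub should_index {
    my ($default, $attrs, $top_level) = @_;
    my $index_it = $default;
    for my $tag (qw(Index index)) {
        next unless exists $attrs->{$tag};
        $index_it = $attrs->{$tag}[0];    # per-feature override
        last;
    }
    # top-level features are indexed regardless of the setting
    return ($index_it || $top_level) ? 1 : 0;
}
```

For example, should_index(1, { index => [0] }, 0) turns indexing off for
a single subfeature even though the file-wide default is on.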

Note that the loader keeps a record -- in memory -- of each feature
that it has processed. If you find the loader running out of memory on
particularly large GFF3 files, please split the input file into
smaller pieces and do the load in steps.

=cut


# load utility - incrementally load the store based on GFF3 file
#
# two modes:
#   slow mode -- features can occur in any order in the GFF3 file
#   fast mode -- all features with same ID must be contiguous in GFF3 file

use strict;
use Carp 'croak';
use Bio::DB::GFF::Util::Rearrange;
use Bio::DB::SeqFeature::Store::LoadHelper;

use base 'Bio::DB::SeqFeature::Store::Loader';


my %Special_attributes =(
                         Gap    => 1, Target => 1,
                         Parent => 1, Name   => 1,
                         Alias  => 1, ID     => 1,
                         index  => 1, Index  => 1,
                        );
my %Strandedness = ( '+'  => 1,
                     '-'  => -1,
                     '.'  => 0,
                     ''   => 0,
                     0    => 0,
                     1    => 1,
                     -1   => -1,
                     +1   => 1,
                     undef => 0,
                   );

=head2 new

 Title   : new
 Usage   : $loader = Bio::DB::SeqFeature::Store::GFF3Loader->new(@options)
 Function: create a new parser
 Returns : a Bio::DB::SeqFeature::Store::GFF3Loader gff3 parser and loader
 Args    : several - see below
 Status  : public

This method creates a new GFF3 loader and establishes its connection
with a Bio::DB::SeqFeature::Store database. Arguments are -name=E<gt>$value
pairs as described in this table:

 Name               Value
 ----               -----

 -store             A writable Bio::DB::SeqFeature::Store database handle.

 -seqfeature_class  The name of the type of Bio::SeqFeatureI object to create
                      and store in the database (Bio::DB::SeqFeature by default)

 -sf_class          A shorter alias for -seqfeature_class

 -verbose           Send progress information to standard error.

 -fast              If true, activate fast loading (see below)

 -chunk_size        Set the storage chunk size for nucleotide/protein sequences
                       (default 2000 bytes)

 -tmp               Indicate a temporary directory to use when loading non-normalized
                       features.

 -ignore_seqregion  Ignore ##sequence-region directives. The default is to create a
                       feature corresponding to the directive.

 -noalias_target    Don't create an Alias attribute for a target_id named in a
                       Target attribute. The default is to create an Alias
                       attribute containing the target_id found in a Target
                       attribute.

When you call new(), a connection to a Bio::DB::SeqFeature::Store
database should already have been established and the database
initialized (if appropriate).

Some combinations of Bio::SeqFeatures and Bio::DB::SeqFeature::Store
databases support a fast loading mode. Currently the only reliable
implementation of fast loading is the combination of DBI::mysql with
Bio::DB::SeqFeature. The other important restriction on fast loading
is the requirement that a feature that contains subfeatures must occur
in the GFF3 file before any of its subfeatures. Otherwise the
subfeatures that occurred before the parent feature will not be
attached to the parent correctly. This restriction does not apply to
normal (slow) loading.
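
Before enabling fast loading it can be useful to check whether a file
satisfies the parent-before-child requirement. The following is a minimal
standalone sketch and not part of this module; the function name
orphans_before_parents() is hypothetical, and for brevity it does not
unescape attribute values.

```perl
use strict;
use warnings;

# Hypothetical pre-flight check: return the IDs of features that name a
# Parent which has not yet appeared at the point the child line is read.
sub orphans_before_parents {
    my ($fh) = @_;
    my (%seen, @orphans);
    while (my $line = <$fh>) {
        chomp $line;
        last if $line =~ /^>/ || $line =~ /^##FASTA/;    # sequence section
        next if $line =~ /^#/ || $line !~ /\t/;          # comments, blanks
        my @col = split /\t/, $line;
        next unless defined $col[8];
        my ($id)      = $col[8] =~ /(?:^|;)ID=([^;]+)/;
        my ($parents) = $col[8] =~ /(?:^|;)Parent=([^;]+)/;
        my @missing   = grep { !$seen{$_} } split /,/, ($parents // '');
        push @orphans, (defined $id ? $id : "line $.") if @missing;
        $seen{$id} = 1 if defined $id;
    }
    return @orphans;
}
```

If the returned list is non-empty, either reorder the file or load it
without -fast.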

If you use an unnormalized feature class, such as
Bio::SeqFeature::Generic, then the loader needs to create a temporary
database in which to cache features until all their parts and subparts
have been seen. This temporary database uses the "berkeleydb"
adaptor. The -tmp option specifies the directory in which that
database will be created. If not present, it defaults to the system
default tmp directory specified by File::Spec-E<gt>tmpdir().

The -chunk_size option allows you to tune the representation of
DNA/Protein sequence in the Store database. By default, sequences are
split into 2000 base/residue chunks and then reassembled as
needed. This avoids the problem of pulling a whole chromosome into
memory in order to fetch a short subsequence from somewhere in the
middle. Depending on your usage patterns, you may wish to tune this
parameter using a chunk size that is larger or smaller than the
default.
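
The idea behind chunked storage can be illustrated with a small
standalone sketch. It is not the Store's actual implementation; an
in-memory hash stands in for the database table, and the function names
chunk_sequence() and fetch_subseq() are hypothetical.

```perl
use strict;
use warnings;

# Split a sequence into fixed-size chunks keyed by their 0-based offset.
sub chunk_sequence {
    my ($seq, $chunk_size) = @_;
    my %chunks;
    for (my $offset = 0; $offset < length $seq; $offset += $chunk_size) {
        $chunks{$offset} = substr($seq, $offset, $chunk_size);
    }
    return \%chunks;
}

# Fetch the 1-based, inclusive range [$start,$end], reading only the
# chunks that overlap it.
sub fetch_subseq {
    my ($chunks, $chunk_size, $start, $end) = @_;
    my $first  = int(($start - 1) / $chunk_size) * $chunk_size;
    my $last   = int(($end   - 1) / $chunk_size) * $chunk_size;
    my $joined = '';
    for (my $offset = $first; $offset <= $last; $offset += $chunk_size) {
        $joined .= $chunks->{$offset} // '';
    }
    return substr($joined, $start - 1 - $first, $end - $start + 1);
}
```

A smaller chunk size means less data is read per short fetch but more
rows per sequence; a larger one reverses that trade-off.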

=cut

sub new {
    my $class = shift;
    my $self  = $class->SUPER::new(@_);
    my ($ignore_seqregion) = rearrange(['IGNORE_SEQREGION'],@_);
    $self->ignore_seqregion($ignore_seqregion);
    my ($noalias_target) = rearrange(['NOALIAS_TARGET'],@_);
    $self->noalias_target($noalias_target);
    $self;
}

=head2 ignore_seqregion

  $ignore_it = $loader->ignore_seqregion([$new_flag])

Get or set the ignore_seqregion flag, which if true, will cause
GFF3 ##sequence-region directives to be ignored. The default behavior
is to create a feature corresponding to the region.

=cut

sub ignore_seqregion {
    my $self = shift;
    my $d    = $self->{ignore_seqregion};
    $self->{ignore_seqregion} = shift if @_;
    $d;
}

=head2 noalias_target

  $noalias_target = $loader->noalias_target([$new_flag])

Get or set the noalias_target flag, which if true, will disable the creation of
an Alias attribute for a target_id named in a Target attribute. The default is
to create an Alias attribute containing the target_id found in a Target
attribute.

=cut

sub noalias_target {
    my $self = shift;
    my $d    = $self->{noalias_target};
    $self->{noalias_target} = shift if @_;
    $d;
}

=head2 load

 Title   : load
 Usage   : $count = $loader->load(@ARGV)
 Function: load the indicated files or filehandles
 Returns : number of feature lines loaded
 Args    : list of files or filehandles
 Status  : public

Once the loader is created, invoke its load() method with a list of
GFF3 or FASTA file paths or previously-opened filehandles in order to
load them into the database. Compressed files ending with .gz, .Z and
.bz2 are automatically recognized and uncompressed on the fly. Paths
beginning with http: or ftp: are treated as URLs and opened using the
LWP GET program (which must be on your path).

FASTA files are recognized by their initial "E<gt>" character. Do not feed
the loader a file that is neither GFF3 nor FASTA; I don't know what
will happen, but it will probably not be what you expect.

=cut

# sub load { } inherited

=head2 accessors

The following read-only accessors return values passed or created during new():

 store()          the long-term Bio::DB::SeqFeature::Store object

 tmp_store()      the temporary Bio::DB::SeqFeature::Store object used
                    during loading

 sfclass()        the Bio::SeqFeatureI class

 fast()           whether fast loading is active

 seq_chunk_size() the sequence chunk size

 verbose()        verbose progress messages

=cut

# sub store           inherited
# sub tmp_store       inherited
# sub sfclass         inherited
# sub fast            inherited
# sub seq_chunk_size  inherited
# sub verbose         inherited

=head2 Internal Methods

The following methods are used internally and may be overridden by
subclasses.

=over 4

=item default_seqfeature_class

  $class = $loader->default_seqfeature_class

Return the default SeqFeatureI class (Bio::DB::SeqFeature).

=cut

# sub default_seqfeature_class { } inherited

=item subfeatures_normalized

  $flag = $loader->subfeatures_normalized([$new_flag])

Get or set a flag that indicates that the subfeatures are
normalized. This is deduced from the SeqFeature class information.

=cut

# sub subfeatures_normalized { } inherited

=item subfeatures_in_table

  $flag = $loader->subfeatures_in_table([$new_flag])

Get or set a flag that indicates that feature/subfeature relationships
are stored in a table. This is deduced from the SeqFeature class and
Store information.

=cut

# sub subfeatures_in_table { } inherited

=item load_fh

  $count = $loader->load_fh($filehandle)

Load the GFF3 data at the other end of the filehandle and return true
if successful. Internally, load_fh() invokes:

  start_load();
  do_load($filehandle);
  finish_load();

=cut

# sub load_fh { } inherited

=item start_load, finish_load

These methods are called at the start and end of a filehandle load.

=cut

sub create_load_data { #overridden
    my $self = shift;
    $self->SUPER::create_load_data;
    $self->{load_data}{TemporaryID}      = "GFFLoad0000000";
    $self->{load_data}{IndexSubfeatures} = $self->index_subfeatures();
    $self->{load_data}{mode}             = 'gff';

    $self->{load_data}{Helper}           =
        Bio::DB::SeqFeature::Store::LoadHelper->new($self->{tmpdir});
}

sub finish_load { #overridden
  my $self  = shift;

  $self->store_current_feature();      # during fast loading, we will have a feature left at the very end
  $self->start_or_finish_sequence();   # finish any half-loaded sequences

  $self->msg("Building object tree...");
  my $start = $self->time();
  $self->build_object_tree;
  $self->msg(sprintf "%5.2fs\n",$self->time()-$start);

  if ($self->fast) {
    $self->msg("Loading bulk data into database...");
    $start = $self->time();
    $self->store->finish_bulk_update;
    $self->msg(sprintf "%5.2fs\n",$self->time()-$start);
  }
  eval {$self->store->commit};

  # don't delete load data so that caller can ask for the loaded IDs
  # $self->delete_load_data;
}

=item do_load

  $count = $loader->do_load($fh)

This is called by load_fh() to load the GFF3 file's filehandle and
return the number of lines loaded.

=cut

# sub do_load { } inherited

=item load_line

    $loader->load_line($data);

Load a line of a GFF3 file. You must bracket this with calls to
start_load() and finish_load()!

    $loader->start_load();
    $loader->load_line($_) while <FH>;
    $loader->finish_load();

=cut

sub load_line { #overridden
    my $self = shift;
    my $line = shift;

    chomp($line);
    my $load_data = $self->{load_data};
    $load_data->{line}++;

    return unless $line =~ /^\S/;     # blank line

    # if it has a tab in it or looks like a chrom.sizes file, switch to gff mode
    $load_data->{mode} = 'gff' if $line =~ /\t/
        or $line =~ /^\w+\s+\d+\s*$/;

    if ($line =~ /^\#\s?\#\s*(.+)/) {  ## meta instruction
      $load_data->{mode} = 'gff';
      $self->handle_meta($1);

    } elsif ($line =~ /^\#/) {
      $load_data->{mode} = 'gff';  # just to be safe
      return; # comment
    }

    elsif ($line =~ /^>\s*(\S+)/) { # FASTA lines are coming
      $load_data->{mode} = 'fasta';
      $self->start_or_finish_sequence($1);
    }

    elsif ($load_data->{mode} eq 'fasta') {
      $self->load_sequence($line);
    }

    elsif ($load_data->{mode} eq 'gff') {
      $self->handle_feature($line);
      if (++$load_data->{count} % 1000 == 0) {
        my $now = $self->time();
        my $nl = -t STDOUT && !$ENV{EMACS} ? "\r" : "\n";
        local $^W = 0; # kill uninit variable warning
        $self->msg(sprintf("%d features loaded in %5.2fs (%5.2fs/1000 features)...%s$nl",
                           $load_data->{count},$now - $load_data->{start_time},
                           $now - $load_data->{millenium_time},
                           ' ' x 80
                   ));
        $load_data->{millenium_time} = $now;
      }
    }

    else {
      $self->throw("I don't know what to do with this line:\n$line");
    }
}

=item handle_meta

  $loader->handle_meta($meta_directive)

This method is called to handle meta-directives such as
##sequence-region. The method will receive the directive with the
initial ## stripped off.
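
The directive handling can be sketched without a database. The
standalone function below reuses the regular expressions from load_line()
and handle_meta() to classify a line; the function name
classify_directive() and its return convention are hypothetical and exist
only for illustration.

```perl
use strict;
use warnings;

# Classify one GFF3 line as a meta-directive. Returns a list describing
# the directive, or an empty list if the line is not a directive.
sub classify_directive {
    my ($line) = @_;
    # "##" with an optional single space between the two #'s
    my ($instruction) = $line =~ /^\#\s?\#\s*(.+)/ or return;
    return ('resolution') if $instruction =~ /^#$/;    # the "###" directive
    if ($instruction =~ /sequence-region\s+(.+)\s+(-?\d+)\s+(-?\d+)/i) {
        return ('sequence-region', $1, $2, $3);
    }
    if ($instruction =~ /index-subfeatures\s+(\S+)/i) {
        return ('index-subfeatures', $1);
    }
    return ('unrecognized', $instruction);
}
```

In the loader itself, the "unrecognized" case is passed to the store's
handle_unrecognized_meta() method when the store provides one.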

=cut

sub handle_meta {
  my $self = shift;
  my $instruction = shift;

  if ( $instruction =~ /^#$/ ) {
    $self->store_current_feature() ;  # during fast loading, we will have a feature left at the very end
    $self->start_or_finish_sequence();    # finish any half-loaded sequences
    if ( $self->store->can('handle_resolution_meta') ) {
      $self->store->handle_resolution_meta($instruction);
    }
    return;
  }

  if ($instruction =~ /sequence-region\s+(.+)\s+(-?\d+)\s+(-?\d+)/i
      && !$self->ignore_seqregion()) {
      my($ref,$start,$end,$strand)    = $self->_remap($1,$2,$3,+1);
      my $feature = $self->sfclass->new(-name        => $ref,
                                        -seq_id      => $ref,
                                        -start       => $start,
                                        -end         => $end,
                                        -strand      => $strand,
                                        -primary_tag => 'region');
      $self->store->store($feature);
      return;
  }

  if ($instruction =~/index-subfeatures\s+(\S+)/i) {
    $self->{load_data}{IndexSubfeatures} = $1;
    $self->store->index_subfeatures($1);
    return;
  }

  if ( $self->store->can('handle_unrecognized_meta') ) {
    $self->store->handle_unrecognized_meta($instruction);
    return;
  }
}

=item handle_feature

  $loader->handle_feature($gff3_line)

This method is called to process a single GFF3 line. It manipulates
information stored in a data structure called $self-E<gt>{load_data}.

=cut

sub handle_feature { #overridden
  my $self     = shift;
  my $gff_line = shift;
  my $ld       = $self->{load_data};

  my $allow_whitespace = $self->allow_whitespace;

  # special case for a chrom.sizes-style line
  my @columns;
  if ($gff_line =~ /^(\w+)\s+(\d+)\s*$/) {
      @columns = ($1,undef,'chromosome',1,$2,undef,undef,undef,"Name=$1");
  } else {
      $gff_line    =~ s/\s+/\t/g if $allow_whitespace;
      @columns = map {$_ eq '.' ? undef : $_ } split /\t/,$gff_line;
  }

  $self->invalid_gff($gff_line) if @columns < 4;
  $self->invalid_gff($gff_line) if @columns > 9 && $allow_whitespace;

  {
    local $^W = 0;
    if (@columns > 9) { #oops, split too much due to whitespace
        $columns[8] = join(' ',@columns[8..$#columns]);
    }
  }

  my ($refname,$source,$method,$start,$end,$score,$strand,$phase,$attributes) = @columns;

  $self->invalid_gff($gff_line) unless defined $refname;
  $self->invalid_gff($gff_line) unless !defined $start || $start =~ /^[\d.-]+$/;
  $self->invalid_gff($gff_line) unless !defined $end   || $end   =~ /^[\d.-]+$/;
  $self->invalid_gff($gff_line) unless defined $method;

  $strand = $Strandedness{$strand||0};
  my ($reserved,$unreserved) = $attributes ? $self->parse_attributes($attributes) : ();

  my $name        = ($reserved->{Name} && $reserved->{Name}[0]);

  my $has_loadid  = defined $reserved->{ID}[0];

  my $feature_id  = defined $reserved->{ID}[0] ? $reserved->{ID}[0] : $ld->{TemporaryID}++;
  # avoid "my ... if", whose behavior is undefined when the condition is false
  my @parent_ids  = defined $reserved->{Parent} ? @{$reserved->{Parent}} : ();

  my $index_it = $ld->{IndexSubfeatures};
  if (exists $reserved->{Index} || exists $reserved->{index}) {
    $index_it = $reserved->{Index}[0] || $reserved->{index}[0];
  }

  # Everything in the unreserved hash becomes an attribute, so we copy
  # some attributes over
  $unreserved->{Note}   = $reserved->{Note}   if exists $reserved->{Note};
  $unreserved->{Alias}  = $reserved->{Alias}  if exists $reserved->{Alias};
  $unreserved->{Target} = $reserved->{Target} if exists $reserved->{Target};
  $unreserved->{Gap}    = $reserved->{Gap}    if exists $reserved->{Gap};
  $unreserved->{load_id}= $reserved->{ID}     if exists $reserved->{ID};

  # mec@stowers-institute.org, wondering why not all attributes are
  # carried forward, adds ID tag in particular service of
  # round-tripping ID, which, though present in database as load_id
  # attribute, was getting lost as itself
  # $unreserved->{ID}= $reserved->{ID}     if exists $reserved->{ID};

  # TEMPORARY HACKS TO SIMPLIFY DEBUGGING
  $feature_id = '' unless defined $feature_id;
  $name       = '' unless defined $name;  # prevent uninit variable warnings
  # push @{$unreserved->{Alias}},$feature_id  if $has_loadid && $feature_id ne $name;
  $unreserved->{parent_id} = \@parent_ids   if @parent_ids;

  # POSSIBLY A PERMANENT HACK -- TARGETS BECOME ALIASES
  # THIS IS TO ALLOW FOR TARGET-BASED LOOKUPS
  if (exists $reserved->{Target} && !$self->{noalias_target}) {
    my %aliases = map {$_=>1} @{$unreserved->{Alias}};
    for my $t (@{$reserved->{Target}}) {
      (my $tc = $t) =~ s/\s+.*$//;  # get rid of coordinates
      $name ||= $tc;
      push @{$unreserved->{Alias}},$tc unless $name eq $tc || $aliases{$tc};
    }
  }

  ($refname,$start,$end,$strand) = $self->_remap($refname,$start,$end,$strand) or return;

  my @args = (-display_name => $name,
              -seq_id       => $refname,
              -start        => $start,
              -end          => $end,
              -strand       => $strand || 0,
              -score        => $score,
              -phase        => $phase,
              -primary_tag  => $method || 'feature',
              -source       => $source,
              -tag          => $unreserved,
              -attributes   => $unreserved,
             );

  # Here's where we handle feature lines that have the same ID (multiple locations, not
  # parent/child relationships)

  my $old_feat;

  # Current feature is the same as the previous feature, which hasn't yet been loaded
  if (defined $ld->{CurrentID} && $ld->{CurrentID} eq $feature_id) {
    $old_feat = $ld->{CurrentFeature};
  }

  # Current feature is the same as a feature that was loaded earlier
  elsif (defined(my $id = $self->{load_data}{Helper}->local2global($feature_id))) {
    $old_feat = $self->fetch($feature_id)
      or $self->warn(<<END);
ID=$feature_id has been used more than once, but it cannot be found in the database.
This can happen if you have specified fast loading, but features sharing the same ID
are not contiguous in the GFF file. This will be loaded as a separate feature.
Line $.: "$_"
END
  }

  # contiguous feature, so add a segment
  warn $old_feat if defined $old_feat and !ref $old_feat;
  if (defined $old_feat) {
    # set this to 1 to disable split-location behavior
    if (0 && @parent_ids) {                  # If multiple features are held together by the same ID
      $feature_id = $ld->{TemporaryID}++;    # AND they have a Parent attribute, this causes an undesirable
    }                                        # additional layer of aggregation. Changing the ID fixes this.
    elsif (
        $old_feat->seq_id ne $refname ||
        $old_feat->start  != $start ||
        $old_feat->end    != $end # make sure endpoints are distinct
        )
    {
      $self->add_segment($old_feat,$self->sfclass->new(@args));
      return;
    }
  }

  # we get here if this is a new feature
  # first of all, store the current feature if it is there
  $self->store_current_feature() if defined $ld->{CurrentID};

  # now create the new feature
  # (index top-level features only if policy asks us to)
  my $feature = $self->sfclass->new(@args);
  $feature->object_store($self->store) if $feature->can('object_store');  # for lazy table features
  $ld->{CurrentFeature} = $feature;
  $ld->{CurrentID}      = $feature_id;

  my $top_level = !@parent_ids;
  my $has_id    = defined $reserved->{ID}[0];
  $index_it   ||= $top_level;

  my $helper = $ld->{Helper};
  $helper->indexit($feature_id=>1)  if $index_it;
  $helper->toplevel($feature_id=>1) if !$self->{fast}
                                       && $top_level;  # need to track top level features


  # remember parentage
  for my $parent (@parent_ids) {
      $helper->add_children($parent=>$feature_id);
  }

}

sub invalid_gff {
    my $self = shift;
    my $line = shift;
    $self->throw("invalid GFF line at line $self->{load_data}{line}.\n".$line);
}

=item allow_whitespace

   $allow_it = $loader->allow_whitespace([$newvalue]);

Get or set the allow_whitespace flag. If true, then GFF3 files are
allowed to be delimited with whitespace in addition to tabs.

=cut

sub allow_whitespace {
    my $self = shift;
    my $d    = $self->{allow_whitespace};
    $self->{allow_whitespace} = shift if @_;
    $d;
}

=item store_current_feature

  $loader->store_current_feature()

This method is called to store the currently active feature in the
database. It uses a data structure stored in $self-E<gt>{load_data}.

=cut

# sub store_current_feature { } inherited

=item build_object_tree

 $loader->build_object_tree()

This method gathers together features and subfeatures and builds the graph that connects them.

=cut

###
# put objects together
#
sub build_object_tree {
  my $self = shift;
  $self->subfeatures_in_table ? $self->build_object_tree_in_tables : $self->build_object_tree_in_features;
}

=item build_object_tree_in_tables

 $loader->build_object_tree_in_tables()

This method gathers together features and subfeatures and builds the
graph that connects them, assuming that parent/child relationships
will be stored in a database table.

=cut

sub build_object_tree_in_tables {
  my $self = shift;
  my $store = $self->store;
  my $helper = $self->{load_data}{Helper};

  while (my ($load_id,$children) = $helper->each_family()) {

      my $parent_id = $helper->local2global($load_id);
      die $self->throw("$load_id doesn't have a primary id")
          unless defined $parent_id;

      my @children  = map {$helper->local2global($_)} @$children;
      # this updates the table that keeps track of parent/child relationships,
      # but does not update the parent object -- so (start,end) had better be right!!!
      $store->add_SeqFeature($parent_id,@children);
  }

}

=item build_object_tree_in_features

 $loader->build_object_tree_in_features()

This method gathers together features and subfeatures and builds the
graph that connects them, assuming that parent/child relationships are
stored in the seqfeature objects themselves.

=cut

sub build_object_tree_in_features {
  my $self  = shift;
  my $store      = $self->store;
  my $tmp        = $self->tmp_store;
  my $ld         = $self->{load_data};
  my $normalized = $self->subfeatures_normalized;

  my $helper     = $ld->{Helper};

  while (my $load_id = $helper->each_toplevel) {
    my $feature  = $self->fetch($load_id)
      or $self->throw("$load_id (id="
                      .$helper->local2global($load_id)
                      .") should have a database entry, but doesn't");
    $self->attach_children($store,$ld,$load_id,$feature);
    # Indexed objects are updated, not created anew
    $feature->primary_id(undef) unless $helper->indexit($load_id);
    $store->store($feature);
  }

}

=item attach_children

 $loader->attach_children($store,$load_data,$load_id,$feature)

This recursively adds children to features and their subfeatures. It
is called when subfeatures are directly contained within other
features, rather than stored in a relational table.

=cut

sub attach_children {
  my $self = shift;
  my ($store,$ld,$load_id,$feature)  = @_;

  my $children   = $ld->{Helper}->children() or return;
  for my $child_id (@$children) {
      my $child = $self->fetch($child_id)
          or $self->throw("$child_id should have a database entry, but doesn't");
      $self->attach_children($store,$ld,$child_id,$child);   # recursive call
      $feature->add_SeqFeature($child);
  }
}

=item fetch

 my $feature = $loader->fetch($load_id)

Given a load ID (from the ID= attribute) this method returns the
feature from the temporary database or the permanent one, depending on
where it is stored.

=cut

sub fetch {
  my $self    = shift;
  my $load_id = shift;
  my $helper  = $self->{load_data}{Helper};
  my $id      = $helper->local2global($load_id);

  return
      ($self->subfeatures_normalized || $helper->indexit($load_id)
       ? $self->store->fetch($id)
       : $self->tmp_store->fetch($id)
      );
}

=item add_segment

 $loader->add_segment($parent,$child)

This method is used to add a split location to the parent.

=cut

sub add_segment {
  my $self = shift;
  my ($parent,$child) = @_;

  if ($parent->can('add_segment')) { # probably a lazy table feature
    my $segment_count =  $parent->can('denormalized_segment_count') ? $parent->denormalized_segment_count
                       : $parent->can('denormalized_segments')      ? $parent->denormalized_segments
                       : $parent->can('segments')                   ? $parent->segments
                       : 0;
    unless ($segment_count) {  # convert into a segmented object
      my $segment;
      if ($parent->can('clone')) {
        $segment = $parent->clone;
      } else {
        my %clone   = %$parent;
        $segment = bless \%clone,ref $parent;
      }
      delete $segment->{segments};
      eval {$segment->object_store(undef) };
      $segment->primary_id(undef);

      # this updates the object and expands its start and end positions without writing
      # the segments into the database as individual objects
      $parent->add_segment($segment);
    }
    $parent->add_segment($child);
    1; # for debugging
  }

  # a conventional Bio::SeqFeature::Generic object - create a split location
  else {
    my $current_location = $parent->location;
    if ($current_location->can('add_sub_Location')) {
      $current_location->add_sub_Location($child->location);
    } else {
      eval "require Bio::Location::Split" unless Bio::Location::Split->can('add_sub_Location');
      my $new_location = Bio::Location::Split->new();
      $new_location->add_sub_Location($current_location);
      $new_location->add_sub_Location($child->location);
      $parent->location($new_location);
    }
  }
}

=item parse_attributes

 ($reserved,$unreserved) = $loader->parse_attributes($attribute_line)

This method parses the information contained in the $attribute_line
into two hashrefs, one containing the values of reserved attribute
tags (e.g. ID) and the other containing the values of unreserved ones.
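
The split into reserved and unreserved tags can be sketched in isolation.
The standalone code below is a simplified illustration, not this method's
implementation: the names split_attributes() and unescape_gff3() are
hypothetical, pairs without a value are skipped silently rather than
warned about, and the GFF2 fallback is omitted.

```perl
use strict;
use warnings;

# Same reserved tags as this module's %Special_attributes
my %RESERVED = map { $_ => 1 } qw(Gap Target Parent Name Alias ID index Index);

# Decode %XX escapes; '+' is left untouched.
sub unescape_gff3 {
    my ($s) = @_;
    $s =~ s/%([0-9a-fA-F]{2})/chr hex $1/ge;
    return $s;
}

# Parse a column-9 attribute string into (\%reserved, \%unreserved),
# where each hash maps a tag to an arrayref of its values.
sub split_attributes {
    my ($att) = @_;
    my (%reserved, %unreserved);
    for my $pair (split /;/, $att) {
        my ($tag, $value) = split /=/, $pair, 2;
        next unless defined $tag && length $tag && defined $value;
        $tag = unescape_gff3($tag);
        # split on commas first, then unescape, so escaped commas survive
        my @values = map { unescape_gff3($_) } split /,/, $value;
        my $bucket = $RESERVED{$tag} ? \%reserved : \%unreserved;
        push @{ $bucket->{$tag} }, @values;
    }
    return (\%reserved, \%unreserved);
}
```

Given the attribute string "ID=tfbs00001;index=1;Note=binding%20site",
ID and index end up in the reserved hash and Note in the unreserved one.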

=cut

sub parse_attributes {
  my $self = shift;
  my $att  = shift;

  unless ($att =~ /=/) {  # ouch! must be a GFF line
      require Bio::DB::SeqFeature::Store::GFF2Loader
          unless Bio::DB::SeqFeature::Store::GFF2Loader->can('parse_attributes');
      return $self->Bio::DB::SeqFeature::Store::GFF2Loader::parse_attributes($att);
  }

  my @pairs =  map { my ($name,$value) = split '=';
                    [$self->unescape($name) => $value];
                   } split ';',$att;
  my (%reserved,%unreserved);
  foreach (@pairs) {
    my $tag    = $_->[0];

    unless (defined $_->[1]) {
      warn "$tag does not have a value at GFF3 file line $.\n";
      next;
    }

    my @values = split ',',$_->[1];
    map {$_ = $self->unescape($_);} @values;
    if ($Special_attributes{$tag}) {  # reserved attribute
      push @{$reserved{$tag}},@values;
    } else {
      push @{$unreserved{$tag}},@values
    }
  }
  return (\%reserved,\%unreserved);
}

=item start_or_finish_sequence

  $loader->start_or_finish_sequence('Chr9')

This method is called at the beginning and end of a fasta section.

=cut

# sub start_or_finish_sequence { } inherited

=item load_sequence

  $loader->load_sequence('gatttcccaaa')

This method is called to load some amount of sequence after
start_or_finish_sequence() is first called.

=cut

# sub load_sequence { } inherited

=item open_fh

 my $io_file = $loader->open_fh($filehandle_or_path)

This method opens up the indicated file or pipe, using some
intelligence to recognize compressed files and URLs and doing the
right thing.

=cut

# sub open_fh { } inherited

# sub msg { } inherited

=item time

 my $time = $loader->time

This method returns the current time in seconds, using Time::HiRes if available.

=cut

# sub time { } inherited

=item unescape

 my $unescaped = GFF3Loader::unescape($escaped)

This is an internal utility. It is the same as CGI::Util::unescape,
but doesn't change pluses into spaces and ignores unicode escapes.

=cut

# sub unescape { } inherited

sub _remap {
    my $self = shift;
    my ($ref,$start,$end,$strand) = @_;
    my $mapper = $self->coordinate_mapper;
    return ($ref,$start,$end,$strand) unless $mapper;

    my ($newref,$coords) = $mapper->($ref,[$start,$end]);
    return unless defined $coords->[0];
    if ($coords->[0] > $coords->[1]) {
        @{$coords} = reverse(@{$coords});
        $strand *= -1;
    }
    return ($newref,@{$coords},$strand);
}

sub _indexit { # override
    my $self      = shift;
    return $self->{load_data}{Helper}->indexit(@_);
}

sub _local2global { # override
    my $self      = shift;
    return $self->{load_data}{Helper}->local2global(@_);
}

=item local_ids

 my $ids    = $self->local_ids;
 my $id_cnt = @$ids;

After performing a load, this returns an array ref containing all the
load file IDs that were contained within the file just loaded.

=cut

sub local_ids { # override
    my $self = shift;
    return $self->{load_data}{Helper}->local_ids(@_);
}

=item loaded_ids

 my $ids    = $loader->loaded_ids;
 my $id_cnt = @$ids;

After performing a load, this returns an array ref containing all the
feature primary ids that were created during the load.

=cut

sub loaded_ids { # override
    my $self = shift;
    return $self->{load_data}{Helper}->loaded_ids(@_);
}

1;

__END__

=back

=head1 BUGS

This is an early version, so there are certainly some bugs. Please
use the BioPerl bug tracking system to report bugs.

=head1 SEE ALSO

L<Bio::DB::SeqFeature::Store>,
L<Bio::DB::SeqFeature::Segment>,
L<Bio::DB::SeqFeature::NormalizedFeature>,
L<Bio::DB::SeqFeature::Store::GFF2Loader>,
L<Bio::DB::SeqFeature::Store::DBI::mysql>,
L<Bio::DB::SeqFeature::Store::berkeleydb>

=head1 AUTHOR

Lincoln Stein E<lt>lstein@cshl.orgE<gt>.

Copyright (c) 2006 Cold Spring Harbor Laboratory.

This library is free software; you can redistribute it and/or modify
it under the same terms as Perl itself.

=cut