1 package Bio
::DB
::SeqFeature
::Store
::FeatureFileLoader
;
7 Bio::DB::SeqFeature::Store::FeatureFileLoader -- feature file loader for Bio::DB::SeqFeature::Store
11 use Bio::DB::SeqFeature::Store;
12 use Bio::DB::SeqFeature::Store::FeatureFileLoader;
14 # Open the sequence database
15 my $db = Bio::DB::SeqFeature::Store->new( -adaptor => 'DBI::mysql',
16 -dsn => 'dbi:mysql:test',
20 Bio::DB::SeqFeature::Store::FeatureFileLoader->new(-store => $db,
24 $loader->load('./my_genome.fff');
29 The Bio::DB::SeqFeature::Store::FeatureFileLoader object parsers
30 FeatureFile-format sequence annotation files and loads
31 Bio::DB::SeqFeature::Store databases. For certain combinations of
32 SeqFeature classes and SeqFeature::Store databases it features a "fast
33 load" mode which will greatly accelerate the loading of databases by a
36 FeatureFile Format (.fff) is very simple:
38 mRNA B0511.1 Chr1:1..100 Type=UTR;Note="putative primase"
39 mRNA B0511.1 Chr1:101..200,300..400,500..800 Type=CDS
40 mRNA B0511.1 Chr1:801..1000 Type=UTR
44 Cosmid B0511 3185..3294
45 Cosmid B0511 10946..11208
46 Cosmid B0511 13126..13511
47 Cosmid B0511 11394..11539
48 EST yk260e10.5 15569..15724
49 EST yk672a12.5 537..618,3187..3294
50 EST yk595e6.5 552..618
51 EST yk595e6.5 3187..3294
52 EST yk846e07.3 11015..11208
54 yk53c10.3 15000..15500,15700..15800
55 yk53c10.5 18892..19154
56 EST yk53c10.5 16032..16105
57 SwissProt PECANEX 13153-13656 Note="Swedish fish"
58 FGENESH "Predicted gene 1" 1-205,518-616,661-735,3187-3365,3436-3846 "Pfam domain"
61 There are up to four columns of WHITESPACE (not necessarily tab)
62 delimited text. Embedded whitespace must be escaped using shell
63 escaping rules (quoting the column or backslashing whitespace).
65 Column 1: The feature type. You may use type:subtype as a convention
68 Column 2: The feature name/ID.
70 Column 3: The position of this feature in base pair
71 coordinates. Ranges can be given as either
72 start-end or start..end. A chromosome position
73 can be specified using the format "reference:start..end".
74 A discontinuous feature can be specified by giving
75 multiple ranges separated by commas. Minus-strand features
76 are indicated by specifying a start > end.
78 Column 4: Comment/attribute field. A single Note can be given, or
79 a series of attribute=value pairs, separated by
80 spaces or semicolons, as in "score=23;type=transmembrane"
82 =head2 Specifying Positions and Ranges
84 A feature position is specified using a sequence ID (a genbank
85 accession number, a chromosome name, a contig, or any other meaningful
86 reference system, followed by a colon and a position range. Ranges are
87 two integers separated by double dots or the hyphen. Examples:
88 "Chr1:516..11208", "ctgA:1-5000". Negative coordinates are allowed, as
91 A discontinuous range ("split location") uses commas to separate the
94 Gene B0511.1 Chr1:516..619,3185..3294,10946..11208
96 In the case of a split location, the sequence id only has to appear in
97 front of the first range.
99 Alternatively, a split location can be indicated by repeating the
100 features type and name on multiple adjacent lines:
102 Gene B0511.1 Chr1:516..619
103 Gene B0511.1 Chr1:3185..3294
104 Gene B0511.1 Chr1:10946..11208
106 If all the locations are on the same reference sequence, you can
107 specify a default chromosome using a "reference=<seqid>":
110 Gene B0511.1 516..619
111 Gene B0511.1 3185..3294
112 Gene B0511.1 10946..11208
114 The default seqid is in effect until the next "reference" line
119 Tags can be added to features by adding a fourth column consisting of
122 Gene B0511.1 Chr1:516..619,3185..3294 Note="Putative primase"
124 Tags and their values take any form you want, and multiple tags can be
125 separated by semicolons. You can also repeat tags multiple times:
127 Gene B0511.1 Chr1:516..619,3185..3294 GO_Term=GO:100;GO_Term=GO:2087
129 Several tags have special meanings:
134 Type The primary tag for a subfeature.
135 Score The score of a feature or subfeature.
136 Phase The phase of a feature or subfeature.
137 URL A URL to link to (via the Bio::Graphics library).
138 Note A note to attach to the feature for display by the Bio::Graphics library.
140 For example, in the common case of an mRNA, you can use the "Type" tag
141 to distinguish the parts of the mRNA into UTR and CDS:
143 mRNA B0511.1 Chr1:1..100 Type=UTR
144 mRNA B0511.1 Chr1:101..200,300..400,500..800 Type=CDS
145 mRNA B0511.1 Chr1:801..1000 Type=UTR
147 The top level feature's primary tag will be "mRNA", and its subparts
148 will have types UTR and CDS as indicated. Additional tags that are
149 placed in the first line of the feature will be applied to the top
150 level. In this example, the note "Putative primase" will be applied to
151 the mRNA at the top level of the feature:
153 mRNA B0511.1 Chr1:1..100 Type=UTR;Note="Putative primase"
154 mRNA B0511.1 Chr1:101..200,300..400,500..800 Type=CDS
155 mRNA B0511.1 Chr1:801..1000 Type=UTR
157 =head2 Feature Groups
159 Features can be grouped so that they are rendered by the "group"
160 glyph. To start a group, create a two-column feature entry showing
161 the group type and a name for the group. Follow this with a list of
162 feature entries with a blank type. For example:
165 yk53c10.3 15000-15500,15700-15800
166 yk53c10.5 18892-19154
168 This example is declaring that the ESTs named yk53c10.3 and yk53c10.5
169 belong to the same group named yk53c10.
171 =head2 Comments and the #include Directive
173 Lines that begin with the # sign are treated as comments and
174 ignored. When a # sign appears within a line, everything to the right
175 of the symbol is also ignored, unless it looks like an HTML fragment or
180 glyph = generic # this comment is ignored
182 link = http://www.google.com/search?q=$name#results
184 Be careful, because the processing of # signs uses a regexp heuristic. To be safe,
185 always put a space after the # sign to make sure it is treated as a comment.
187 The special comment "#include 'filename'" acts like the C preprocessor
188 directive and will insert the comments of a named file into the
189 position at which it occurs. Relative paths will be treated relative
190 to the file in which the #include occurs. Nested #include directives
193 #include "/usr/local/share/my_directives.txt"
194 #include 'my_directives.txt'
195 #include chromosome3_features.gff3
197 You can enclose the file path in single or double quotes as shown
198 above. If there are no spaces in the filename the quotes are optional.
200 Include file processing is not very smart. Avoid creating circular
201 #include references. You have been warned!
205 Note that this loader always creates denormalized features such that
206 subfeatures and their parents are stored as one big database
207 object. The GFF3 format and its loader is usually preferred for both
208 space and execution efficiency.
218 use Text
::ParseWords
'shellwords','quotewords';
220 use base
'Bio::DB::SeqFeature::Store::Loader';
225 Usage : $loader = Bio::DB::SeqFeature::Store::FeatureFileLoader->new(@options)
226 Function: create a new parser
227 Returns : a Bio::DB::SeqFeature::Store::FeatureFileLoader parser and loader
228 Args : several - see below
231 This method creates a new FeatureFile loader and establishes its connection
232 with a Bio::DB::SeqFeature::Store database. Arguments are -name=E<gt>$value
233 pairs as described in this table:
238 -store A writeable Bio::DB::SeqFeature::Store database handle.
240 -seqfeature_class The name of the type of Bio::SeqFeatureI object to create
241 and store in the database (Bio::DB::SeqFeature by default)
243 -sf_class A shorter alias for -seqfeature_class
245 -verbose Send progress information to standard error.
247 -fast If true, activate fast loading (see below)
249 -chunk_size Set the storage chunk size for nucleotide/protein sequences
252 -tmp Indicate a temporary directory to use when loading non-normalized
255 When you call new(), a connection to a Bio::DB::SeqFeature::Store
256 database should already have been established and the database
257 initialized (if appropriate).
259 Some combinations of Bio::SeqFeatures and Bio::DB::SeqFeature::Store
260 databases support a fast loading mode. Currently the only reliable
261 implementation of fast loading is the combination of DBI::mysql with
262 Bio::DB::SeqFeature. The other important restriction on fast loading
263 is the requirement that a feature that contains subfeatures must occur
264 in the FeatureFile file before any of its subfeatures. Otherwise the
265 subfeatures that occurred before the parent feature will not be
266 attached to the parent correctly. This restriction does not apply to
267 normal (slow) loading.
269 If you use an unnormalized feature class, such as
270 Bio::SeqFeature::Generic, then the loader needs to create a temporary
271 database in which to cache features until all their parts and subparts
272 have been seen. This temporary databases uses the "bdb" adaptor. The
273 -tmp option specifies the directory in which that database will be
274 created. If not present, it defaults to the system default tmp
275 directory specified by File::Spec-E<gt>tmpdir().
277 The -chunk_size option allows you to tune the representation of
278 DNA/Protein sequence in the Store database. By default, sequences are
279 split into 2000 base/residue chunks and then reassembled as
280 needed. This avoids the problem of pulling a whole chromosome into
281 memory in order to fetch a short subsequence from somewhere in the
282 middle. Depending on your usage patterns, you may wish to tune this
283 parameter using a chunk size that is larger or smaller than the
288 # sub new {} inherited
293 Usage : $count = $loader->load(@ARGV)
294 Function: load the indicated files or filehandles
295 Returns : number of feature lines loaded
296 Args : list of files or filehandles
299 Once the loader is created, invoke its load() method with a list of
300 FeatureFile or FASTA file paths or previously-opened filehandles in order to
301 load them into the database. Compressed files ending with .gz, .Z and
302 .bz2 are automatically recognized and uncompressed on the fly. Paths
303 beginning with http: or ftp: are treated as URLs and opened using the
304 LWP GET program (which must be on your path).
306 FASTA files are recognized by their initial "E<gt>" character. Do not feed
307 the loader a file that is neither FeatureFile nor FASTA; I don't know what
308 will happen, but it will probably not be what you expect.
312 # sub load {} inherited
316 The following read-only accessors return values passed or created during new():
318 store() the long-term Bio::DB::SeqFeature::Store object
320 tmp_store() the temporary Bio::DB::SeqFeature::Store object used
323 sfclass() the Bio::SeqFeatureI class
325 fast() whether fast loading is active
327 seq_chunk_size() the sequence chunk size
329 verbose() verbose progress messages
333 # sub store {} inherited
334 # sub tmp_store {} inherited
335 # sub sfclass {} inherited
336 # sub fast {} inherited
337 # sub seq_chunk_size {} inherited
338 # sub verbose {} inherited
340 =head2 default_seqfeature_class
342 $class = $loader->default_seqfeature_class
344 Return the default SeqFeatureI class (Bio::Graphics::Feature).
348 sub default_seqfeature_class
{ #override
350 return 'Bio::Graphics::Feature';
356 $count = $loader->load_fh($filehandle)
358 Load the FeatureFile data at the other end of the filehandle and return true
359 if successful. Internally, load_fh() invokes:
362 do_load($filehandle);
367 # sub load_fh { } inherited
369 =head2 start_load, finish_load
371 These methods are called at the start and end of a filehandle load.
375 sub create_load_data
{
377 $self->SUPER::create_load_data
();
378 $self->{load_data
}{mode
} = 'fff';
379 $self->{load_data
}{CurrentGroup
} = undef;
385 $self->SUPER::finish_load
;
390 $loader->load_line($data);
392 Load a line of a FeatureFile file. You must bracket this with calls to
393 start_load() and finish_load()!
395 $loader->start_load();
396 $loader->load_line($_) while <FH>;
397 $loader->finish_load();
406 return unless $line =~ /\S/; # blank line
407 my $load_data = $self->{load_data
};
409 $load_data->{mode
} = 'fff' if /\s/; # if it has any whitespace in
410 # it, then back to fff mode
412 if ($line =~ /^\#\s?\#\s*([\#]+)/) { ## meta instruction
413 $load_data->{mode
} = 'fff';
414 $self->handle_meta($1);
416 } elsif ($line =~ /^\#/) {
417 $load_data->{mode
} = 'fff'; # just to be safe
421 elsif ($line =~ /^>\s*(\S+)/) { # FASTA lines are coming
422 $load_data->{mode
} = 'fasta';
423 $self->start_or_finish_sequence($1);
426 elsif ($load_data->{mode
} eq 'fasta') {
427 $self->load_sequence($line);
430 elsif ($load_data->{mode
} eq 'fff') {
431 $self->handle_feature($line);
432 if (++$load_data->{count
} % 1000 == 0) {
433 my $now = $self->time();
434 my $nl = -t STDOUT
&& !$ENV{EMACS
} ?
"\r" : "\n";
435 $self->msg(sprintf("%d features loaded in %5.2fs...$nl",
436 $load_data->{count
},$now - $load_data->{start_time
}));
437 $load_data->{start_time
} = $now;
442 $self->throw("I don't know what to do with this line:\n$line");
449 $loader->handle_meta($meta_directive)
451 This method is called to handle meta-directives such as
452 ##sequence-region. The method will receive the directive with the
453 initial ## stripped off.
457 # sub handle_meta { } inherited
459 =head2 handle_feature
461 $loader->handle_feature($gff3_line)
463 This method is called to process a single FeatureFile line. It manipulates
464 information stored a data structure called $self-E<gt>{load_data}.
472 my $ld = $self->{load_data
};
474 # handle reference line
475 if (/^reference\s*=\s*(.+)/) {
476 $ld->{reference
} = $1;
481 my @tokens = quotewords
('\s+',1,$_);
482 for (0..2) { # remove quotes from everything but last column
483 next unless defined $tokens[$_];
484 $tokens[$_] =~ s/^"//;
485 $tokens[$_] =~ s/"$//;
488 if (@tokens < 3) { # short line; assume a group identifier
489 $self->store_current_feature();
490 my $type = shift @tokens;
491 my $name = shift @tokens;
492 $ld->{CurrentGroup
} = $self->_make_indexed_feature($name,$type,'',{_ff_group
=>1});
493 $self->_indexit($name => 1);
497 my($type,$name,$strand,$bounds,$attributes);
499 if ($tokens[2] =~ /^([+-.]|[+-]?[01])$/) { # old version
500 ($type,$name,$strand,$bounds,$attributes) = @tokens;
501 } else { # new version
502 ($type,$name,$bounds,$attributes) = @tokens;
505 # handle case of there only being one value in the last column,
506 # in which case we treat it the same as Note="value"
507 my $attr = $self->parse_attributes($attributes);
509 # @parts is an array of ([ref,start,end],[ref,start,end],...)
510 my @parts = map { [/(?:(\w+):)?(-?\d+)(?:-|\.\.)(-?\d+)/]} split /(?:,| )\s*/,$bounds;
512 # deal with groups -- a group is ending if $type is defined
513 # and CurrentGroup is set
514 if ($type && $ld->{CurrentGroup
}) {
515 $self->_store_group();
518 $type = '' unless defined $type;
519 $name = '' unless defined $name;
520 $type ||= $ld->{CurrentGroup
}->primary_tag if $ld->{CurrentGroup
};
522 my $reference = $ld->{reference
} || 'ChrUN';
524 if (defined $_ && ref($_) eq 'ARRAY'
528 $strand ||= $_->[1] <= $_->[2] ?
'+' : '-';
529 ($_->[1],$_->[2]) = ($_->[2],$_->[1]) if $_->[1] > $_->[2];
531 $reference = $_->[0] if defined $_->[0];
532 $_ = [@
{$_}[1,2]]; # strip off the reference.
535 # now @parts is an array of [start,end] and $reference contains the seqid
537 # apply coordinate mapper
538 if ($self->{coordinate_mapper
} && $reference) {
539 my @remapped = $self->{coordinate_mapper
}->($reference,@parts);
540 ($reference,@parts) = @remapped if @remapped;
543 # either create a new feature or add a segment to it
544 my $feature = $ld->{CurrentFeature
};
546 $ld->{OldPartType
} = $ld->{PartType
};
547 $ld->{PartType
} = $attr->{Type
}[0] if exists $attr->{Type
};
548 $ld->{PartType
} = $attr->{type
}[0] if exists $attr->{type
};
549 $ld->{PartType
} ||= $type;
552 local $^W
= 0; # avoid uninit warning when display_name() is called
554 # if this is a different feature from what we have now, then we
555 # store the current one, and create a new one
556 if ($feature->display_name ne $name ||
557 $feature->method ne $type) {
558 $self->store_current_feature; # new feature, store old one
560 } else { # create a new multipart feature
561 $self->_multilevel_feature($feature,$ld->{OldPartType
})
562 unless $feature->get_SeqFeatures;
563 my $part = $self->_make_feature($name,
569 $feature->add_SeqFeature($part);
573 $feature ||= $self->_make_indexed_feature($name,
574 $type, # side effect is to set CurrentFeature
580 # add more segments to the current feature
582 for my $part (@parts) {
583 $type ||= $feature->primary_tag;
584 my $sp = $self->_make_feature($name,
590 $feature->add_SeqFeature($sp);
595 sub _multilevel_feature
{ # turn a single-level feature into a multilevel one
599 my %attributes = $f->attributes;
600 $attributes{Score
} = [$f->score] if defined $f->score;
601 $attributes{Phase
} = [$f->phase] if defined $f->phase;
602 my @args = ($f->display_name,
609 my $subpart = $self->_make_feature(@args);
610 $f->add_SeqFeature($subpart);
613 sub _make_indexed_feature
{
615 my $f = $self->_make_feature(@_);
616 my $name = $f->display_name;
617 $self->{load_data
}{CurrentFeature
} = $f;
618 $self->{load_data
}{CurrentID
} = $name;
619 $self->_indexit($name => 1);
625 my ($name,$type,$strand,$attributes,$ref,$start,$end) = @_;
627 # some basic error checking
628 $self->throw("invalid feature line: $_")
629 if ($ref && !defined $start)
630 or ($ref && !defined $end)
631 or ($start && $start !~ /^[-\d]+$/)
632 or ($end && $end !~ /^[-\d]+$/)
638 my @args = (-name
=> $name,
639 -strand
=> $strand eq '+' ?
1
646 -attributes
=> $attributes,
649 if (my ($method,$source) = $type =~ /(\S+):(\S+)/) {
650 push @args,(-primary_tag
=> $method,
653 push @args,(-primary_tag
=> $type);
656 push @args,(-seq_id
=> $ref) if defined $ref;
657 push @args,(-start
=> $start) if defined $start;
658 push @args,(-end
=> $end) if defined $end;
660 # pull out special attributes
661 if (my $score = $attributes->{Score
} || $attributes->{score
}) {
662 push @args,(-score
=> $score->[0]);
663 delete $attributes->{$_} foreach qw(Score score);
666 if (my $note = $attributes->{Note
} || $attributes->{note
}) {
667 push @args,(-desc
=> join '; ',@
$note);
668 delete $attributes->{$_} foreach qw(Note note);
671 if (my $url = $attributes->{url
} || $attributes->{Url
}) {
672 push @args,(-url
=> $url->[0]);
673 delete $attributes->{$_} foreach qw
(Url url
);
676 if (my $phase = $attributes->{phase
} || $attributes->{Phase
}) {
677 push @args,(-phase
=> $phase->[0]);
678 delete $attributes->{$_} foreach qw
(Phase phase
);
681 $self->_indexit($name=>1)
682 if $self->index_subfeatures && $name;
684 return $self->sfclass->new(@args);
687 =head2 store_current_feature
689 $loader->store_current_feature()
691 This method is called to store the currently active feature in the
692 database. It uses a data structure stored in $self-E<gt>{load_data}.
696 sub store_current_feature
{ # overridden
700 # if there is an open group, then we simply add the current
701 # feature to the group.
702 my $ld = $self->{load_data
};
703 if ($ld->{CurrentGroup
} && $ld->{CurrentFeature
}) {
704 $ld->{CurrentGroup
}->add_SeqFeature($ld->{CurrentFeature
})
705 unless $ld->{CurrentGroup
} eq $ld->{CurrentFeature
}; # paranoia - shouldn't happen
709 $self->SUPER::store_current_feature
();
715 my $ld = $self->{load_data
};
716 my $group = $ld->{CurrentGroup
} or return;
717 # if there is an unattached feature, then add it
718 $self->store_current_feature() if $ld->{CurrentFeature
};
719 $ld->{CurrentFeature
} = $group;
720 $ld->{CurrentID
} = $group->display_name;
721 $self->_indexit($ld->{CurrentID
} => 1);
722 undef $ld->{CurrentGroup
};
723 $self->store_current_feature();
726 =head2 build_object_tree
728 $loader->build_object_tree()
730 This method gathers together features and subfeatures and builds the
731 graph that connects them.
736 # put objects together
738 sub build_object_tree
{
739 croak
"We shouldn't be building an object tree in the FeatureFileLoader";
742 =head2 build_object_tree_in_tables
744 $loader->build_object_tree_in_tables()
746 This method gathers together features and subfeatures and builds the
747 graph that connects them, assuming that parent/child relationships
748 will be stored in a database table.
752 sub build_object_tree_in_tables
{
753 croak
"We shouldn't be building an object tree in the FeatureFileLoader";
756 =head2 build_object_tree_in_features
758 $loader->build_object_tree_in_features()
760 This method gathers together features and subfeatures and builds the
761 graph that connects them, assuming that parent/child relationships are
762 stored in the seqfeature objects themselves.
766 sub build_object_tree_in_features
{
767 croak
"We shouldn't be building an object tree in the FeatureFileLoader";
770 =head2 attach_children
772 $loader->attach_children($store,$load_data,$load_id,$feature)
774 This recursively adds children to features and their subfeatures. It
775 is called when subfeatures are directly contained within other
776 features, rather than stored in a relational table.
780 sub attach_children
{
781 croak
"We shouldn't be attaching children in the
785 =head2 parse_attributes
787 @attributes = $loader->parse_attributes($attribute_line)
789 This method parses the information contained in the $attribute_line
790 into a flattened hash (array). It may return one element, in which case it is
795 sub parse_attributes
{
799 $att ||= ''; # to prevent uninit variable warnings from quotewords()
801 my @pairs = quotewords
('[;\s]',1,$att);
803 for my $pair (@pairs) {
804 unless ($pair =~ /=/) {
805 push @
{$attributes{Note
}},(quotewords
('',0,$pair))[0] || $pair;
807 my ($tag,$value) = quotewords
('\s*=\s*',0,$pair);
808 $tag = 'Note' if $tag eq 'description';
809 push @
{$attributes{$tag}},$value;
815 =head2 start_or_finish_sequence
817 $loader->start_or_finish_sequence('Chr9')
819 This method is called at the beginning and end of a fasta section.
831 This is an early version, so there are certainly some bugs. Please
832 use the BioPerl bug tracking system to report bugs.
837 L<Bio::DB::SeqFeature::Store>,
838 L<Bio::DB::SeqFeature::Segment>,
839 L<Bio::DB::SeqFeature::NormalizedFeature>,
840 L<Bio::DB::SeqFeature::GFF3Loader>,
841 L<Bio::DB::SeqFeature::Store::DBI::mysql>,
842 L<Bio::DB::SeqFeature::Store::bdb>
846 Lincoln Stein E<lt>lstein@cshl.orgE<gt>.
848 Copyright (c) 2006 Cold Spring Harbor Laboratory.
850 This library is free software; you can redistribute it and/or modify
851 it under the same terms as Perl itself.