tag fourth (and hopefully last) alpha
[bioperl-live.git] / branch-1-6 / Bio / Tools / RNAMotif.pm
blobaac1e470e8eb36ddf1d85c9b5964d18138f8c52a
1 # $Id$
3 # BioPerl module for Bio::Tools::RNAMotif
5 # Please direct questions and support issues to <bioperl-l@bioperl.org>
7 # Cared for by Chris Fields <cjfields-at-uiuc-dot-edu>
9 # Copyright Chris Fields
11 # You may distribute this module under the same terms as perl itself
13 # POD documentation - main docs before the code
15 =head1 NAME
17 Bio::Tools::RNAMotif - A parser for RNAMotif output
19 =head1 SYNOPSIS
21 use Bio::Tools::RNAMotif;
22 my $parser = Bio::Tools::RNAMotif->new(-file => $rna_output,
23 -motiftag => 'protein_bind'
24 -desctag => 'TRAP_binding');
25 #parse the results
26 while( my $motif = $parser->next_prediction) {
27 # do something here
30 =head1 DESCRIPTION
32 Parses raw RNAMotif output. RNAMotif uses a RNA profile, consisting
33 of sequence and structural elements stored in a descriptor file, to
34 search for potential motifs in a DNA sequence file. For more
35 information, see:
37 Macke TJ, Ecker DJ, Gutell RR, Gautheret D, Case DA, Sampath R.
38 RNAMotif, an RNA secondary structure definition and search algorithm.
39 Nucleic Acids Res. 2001 Nov 15;29(22):4724-35.
40 http://www.scripps.edu/mb/case/casegr-sh-3.5.html.
42 This module is not currently complete. As is, it will parse raw
43 RNAMotif output (i.e. information not passed through the secondary
44 programs rmfmt or rm2ct) and pack information into
45 Bio::SeqFeature::Generic objects. Currently, parsing extra output
46 utilized by the sprintf() function in an RNAMotif descriptor is not
47 implemented; this information is instead packed into the score tag,
48 which can be accessed by using the following:
50 my ($score) = $feature->score;
52 If the score contains anything besides a digit, it will throw a
53 warning that sprintf() may have been used.
54 Several values have also been added in the 'tag' hash. These can be
55 accessed using the following syntax:
57 my ($entry) = $feature->get_Annotations('secstructure');
59 Added tags are :
61 descline - entire description line (in case the regex used for
62 sequence ID doesn't adequately catch the name
63 descfile - name of the descriptor file (may include path to file)
64 secstrucure - contains structural information from the descriptor
65 used as a query
66 sequence - sequence of motif, separated by spaces according to
67 matches to the structure in the descriptor (in
68 SecStructure).
70 See t/RNAMotif.t for example usage.
72 The clean_features method can also be used to return a list of seqfeatures (in a
73 Bio::SeqFeature::Collection object) that are within a particular region. RNAMotif
74 is prone with some descriptors to returning redundant hits; an attempt to rectify
75 this problem is attempted with RNAMotif's companion program rmprune, which returns
76 the structure with the longest helices (and theoretically the best scoring structure).
77 However, this doesn't take into account alternative foldings which may score better.
78 This method adds a bit more flexibility, giving the user the ability to screen folds
79 based on where the feature is found and the score. Passing a positive integer x
80 screens SeqFeatures based on the highest score within x bp, while a negative integer
81 screens based on the lowest score. So, to return the highest scoring values within
82 20 bp (likely using an arbitrary scroing system in the SCORE section of a descriptor
83 file), one could use:
85 $list = $obj->clean_features(20);
87 ... and returning the lowest scoring structures within the same region (when the
88 score is based on calculated free energies from efn2) can be accomplished
89 by the following:
91 $list = $obj->clean_features(-20);
93 If you wanted the best feature in a sequence, you could set this to a large number,
94 preferrably on that exceeds the bases in a sequence
96 $list = $obj->clean_features(10000000);
98 Each seqfeature in the collection can then be acted upon:
100 @sf = $list->get_all_features;
101 for my $f (@sf) {
102 # do crazy things here
105 At some point a more complicated feature object may be used to support
106 this data rather than forcing most of the information into tag/value
107 pairs in a SeqFeature::Generic. This will hopefully allow for more
108 flexible analysis of data (specifically RNA secondary structural
109 data). It works for now...
111 =head1 FEEDBACK
113 =head2 Mailing Lists
115 User feedback is an integral part of the evolution of this and other
116 Bioperl modules. Send your comments and suggestions preferably to
117 the Bioperl mailing list. Your participation is much appreciated.
119 bioperl-l@bioperl.org - General discussion
120 http://bioperl.org/wiki/Mailing_lists - About the mailing lists
122 =head2 Support
124 Please direct usage questions or support issues to the mailing list:
126 I<bioperl-l@bioperl.org>
128 rather than to the module maintainer directly. Many experienced and
129 reponsive experts will be able look at the problem and quickly
130 address it. Please include a thorough description of the problem
131 with code and data examples if at all possible.
133 =head2 Reporting Bugs
135 Report bugs to the Bioperl bug tracking system to help us keep track
136 of the bugs and their resolution. Bug reports can be submitted via the
137 web:
139 http://bugzilla.open-bio.org/
141 =head1 AUTHOR - Chris Fields
143 Email cjfields-at-uiuc-dot-edu
145 =head1 APPENDIX
147 The rest of the documentation details each of the object methods.
148 Internal methods are usually preceded with a _
150 =cut
152 # Let the code begin...
154 package Bio::Tools::RNAMotif;
155 use strict;
157 use Bio::SeqFeature::Generic;
158 use Bio::SeqFeature::Collection;
160 use base qw(Bio::Tools::AnalysisResult);
162 our($MotifTag,$SrcTag,$DescTag) = qw(misc_binding RNAMotif rnamotif);
164 =head2 new
166 Title : new
167 Usage : my $obj = Bio::Tools::RNAMotif->new();
168 Function: Builds a new Bio::Tools::RNAMotif object
169 Returns : an instance of Bio::Tools::RNAMotif
170 Args : -fh/-file for input filename
171 -motiftag => primary tag used in gene features (default 'misc_binding')
172 -desctag => tag used for display_name name (default 'rnamotif')
173 -srctag => source tag used in all features (default 'RNAMotif')
175 =cut
177 sub _initialize {
178 my($self,@args) = @_;
179 $self->warn('Use of this module is deprecated. Use Bio::SearchIO::rnamotif instead');
180 $self->SUPER::_initialize(@args);
181 my ($motiftag,$desctag,$srctag) = $self->SUPER::_rearrange([qw(MOTIFTAG
182 DESCTAG
183 SRCTAG
185 @args);
186 $self->motif_tag(defined $motiftag ? $motiftag : $MotifTag);
187 $self->source_tag(defined $srctag ? $srctag : $SrcTag);
188 $self->desc_tag(defined $desctag ? $desctag : $DescTag);
189 $self->{'_sec_structure' => '',
190 '_dfile' => ''};
193 =head2 motif_tag
195 Title : motif_tag
196 Usage : $obj->motif_tag($newval)
197 Function: Get/Set the value used for 'motif_tag', which is used for setting the
198 primary_tag.
199 Default is 'misc_binding' as set by the global $MotifTag.
200 'misc_binding' is used here because a conserved RNA motif is capable
201 of binding proteins (regulatory proteins), antisense RNA (siRNA),
202 small molecules (riboswitches), or nothing at all (tRNA,
203 terminators, etc.). It is recommended that this be changed to other
204 tags ('misc_RNA', 'protein_binding', 'tRNA', etc.) where appropriate.
205 For more information, see:
206 http://www.ncbi.nlm.nih.gov/collab/FT/index.html
207 Returns : value of motif_tag (a scalar)
208 Args : on set, new value (a scalar or undef, optional)
210 =cut
212 sub motif_tag{
213 my $self = shift;
215 return $self->{'motif_tag'} = shift if @_;
216 return $self->{'motif_tag'};
219 =head2 source_tag
221 Title : source_tag
222 Usage : $obj->source_tag($newval)
223 Function: Get/Set the value used for the 'source_tag'.
224 Default is 'RNAMotif' as set by the global $SrcTag
225 Returns : value of source_tag (a scalar)
226 Args : on set, new value (a scalar or undef, optional)
229 =cut
231 sub source_tag{
232 my $self = shift;
234 return $self->{'source_tag'} = shift if @_;
235 return $self->{'source_tag'};
239 =head2 desc_tag
241 Title : desc_tag
242 Usage : $obj->desc_tag($newval)
243 Function: Get/Set the value used for the query motif. This will be placed in
244 the tag '-display_name'. Default is 'rnamotif' as set by the global
245 $DescTag. Use this to manually set the descriptor (motif searched for).
246 Since there is no way for this module to tell what the motif is from the
247 name of the descriptor file or the RNAMotif output, this should
248 be set every time an RNAMotif object is instantiated for clarity
249 Returns : value of exon_tag (a scalar)
250 Args : on set, new value (a scalar or undef, optional)
252 =cut
254 sub desc_tag{
255 my $self = shift;
257 return $self->{'desc_tag'} = shift if @_;
258 return $self->{'desc_tag'};
261 =head2 analysis_method
263 Usage : $obj->analysis_method();
264 Purpose : Inherited method. Overridden to ensure that the name matches
265 /RNAMotif/i.
266 Returns : String
267 Argument : n/a
269 =cut
271 #-------------
272 sub analysis_method {
273 #-------------
274 my ($self, $method) = @_;
275 if($method && ($method !~ /RNAMotif/i)) {
276 $self->throw("method $method not supported in " . ref($self));
278 return $self->SUPER::analysis_method($method);
281 =head2 next_feature
283 Title : next_feature
284 Usage : while($gene = $obj->next_feature()) {
285 # do something
287 Function: Returns the next gene structure prediction of the RNAMotif result
288 file. Call this method repeatedly until FALSE is returned.
289 The returned object is actually a SeqFeatureI implementing object.
290 This method is required for classes implementing the
291 SeqAnalysisParserI interface, and is merely an alias for
292 next_prediction() at present.
293 Returns : A Bio::Tools::Prediction::Gene object.
294 Args : None (at present)
296 =cut
298 sub next_feature {
299 my ($self,@args) = @_;
300 # even though next_prediction doesn't expect any args (and this method
301 # does neither), we pass on args in order to be prepared if this changes
302 # ever
303 return $self->next_prediction(@args);
306 =head2 next_prediction
308 Title : next_prediction
309 Usage : while($gene = $obj->next_prediction()) {
310 # do something
312 Function: Returns the next gene structure prediction of the RNAMotif result
313 file. Call this method repeatedly until FALSE is returned.
314 Returns : A Bio::SeqFeature::Generic object
315 Args : None (at present)
317 =cut
319 sub next_prediction {
320 my ($self) = @_;
321 my ($motiftag,$srctag,$desctag) = ( $self->motif_tag,
322 $self->source_tag,
323 $self->desc_tag);
324 my ($score, $strand, $start, $length, $sequence, $end, $seqid, $description)=0;
325 while($_ = $self->_readline) {
326 while($_ =~ /^#RM/) { # header line
327 if(/^#RM\sdescr\s(.*)$/) { # contains sec structure
328 $self->{'_sec_structure'}=$1;
330 if(/^#RM\sdfile\s(.*)$/) { # contains dfile
331 $self->{'_dfile'}=$1;
333 $_ = $self->_readline;
335 if(m/^>((\S*)\s.*)$/) {
336 $seqid = $2;
337 $description = $1; # contains entire description line if needed
338 if($seqid =~ /(gb|emb|dbj|sp|pdb|bbs|ref|lcl)\|(.*)\|/) {
339 $seqid = $2; # pulls out gid
342 # start pulling out hit information...
343 # regex is m/^\S+\s+(\S+)\s+(\d+)\s+(\d+)\s+(\d+)\s(.*)$/
344 # m/^\S+\s+ # seqID, not needed
345 # (.+)\s+ # score, or extra info using sprintf()
346 # (\d+)\s+ # strand
347 # (\d+)\s+ # start
348 # (\d+)\s # length
349 # (.*)$/ # sequence, divided according to descriptor
350 if(m/^\S+\s+(.+)\s+(\d+)\s+(\d+)\s+(\d+)\s(.*)$/) {
351 ($score, $strand, $start, $length, $sequence, $end)=
352 ($1, $2, $3, $4, $5, 0);
353 if( $strand==0 ) {
354 $end = $start + $length -1;
355 $strand = 1;
356 } else {
357 $end = $start - $length + 1;
358 ($start, $end, $strand) = ($end, $start, -1);
360 my $gene = Bio::SeqFeature::Generic->new(-seq_id => $seqid,
361 -start => $start,
362 -end => $end,
363 -strand => $strand,
364 -score => $score,
365 -primary_tag => $motiftag,
366 -source_tag => $srctag,
367 -display_name => $desctag,
368 -tag => {
369 'descline' => $description,
370 'descfile' => $self->{'_dfile'},
371 'secstructure' => $self->{'_sec_structure'},
372 'sequence' => $sequence});
373 return $gene;
378 =head2 clean_features
380 Title : next_prediction
381 Usage : @list = $obj->clean_features(-10);
382 Function: Cleans (reduces redundant hits) based on score, position
383 Returns : a Bio::SeqFeature::Collection object
384 Args : Pos./Neg. integer (for highest/lowest scoring seqfeature within x bp).
385 Throws : Error unless digit is entered.
387 =cut
389 sub clean_features {
390 my $self = shift;
391 my $bp = shift;
392 $self->throw("No arg, need pos. or neg. integer") if !$bp;
393 $self->throw("Need pos. or neg. integer") if ($bp !~ /\-?\d/ || $bp =~ /\./);
394 my ($b, $sf2);
395 my @list = ();
396 my @features = ();
397 while (my $pred = $self->next_prediction) {
398 push @features, $pred;
400 while (@features) {
401 $b = shift @features if !defined($b);
402 $sf2 = shift @features;
403 # from same sequence?
404 if ($sf2) { # if there is no feature, then...
405 if ($b->seq_id == $sf2->seq_id) {
406 # close starts (probable redundant hit)?
407 if(abs(($b->start)-($sf2->start)) <= abs($bp)) {
408 # which score is better?
409 if( (($bp < 0) && ($b->score > $sf2->score)) || # lowest score
410 (($bp > 0) && ($b->score < $sf2->score)) ){ # highest score
411 $b = $sf2;
412 next;
413 } else {
414 next;
417 push @list, $b;
418 $b = $sf2;
422 push @list, $b if $b;
423 my $col = Bio::SeqFeature::Collection->new;
424 $col->add_features(\@list);
425 return $col;