branch-1-6/Bio/Tools/RNAMotif.pm

   1 # $Id$
   2 #
   3 # BioPerl module for Bio::Tools::RNAMotif
   4 #
   5 # Please direct questions and support issues to <bioperl-l@bioperl.org>
   6 #
   7 # Cared for by Chris Fields <cjfields-at-uiuc-dot-edu>
   8 #
   9 # Copyright Chris Fields
  10 #
  11 # You may distribute this module under the same terms as perl itself
  12
  13 # POD documentation - main docs before the code
  14
  15 =head1 NAME
  16
  17 Bio::Tools::RNAMotif - A parser for RNAMotif output
  18
  19 =head1 SYNOPSIS
  20
  21   use Bio::Tools::RNAMotif;
  22   my $parser = Bio::Tools::RNAMotif->new(-file => $rna_output,
  23                                         -motiftag => 'protein_bind'
  24                                         -desctag => 'TRAP_binding');
  25   #parse the results
  26   while( my $motif = $parser->next_prediction) {
  27     # do something here
  28   }
  29
  30 =head1 DESCRIPTION
  31
  32 Parses raw RNAMotif output.  RNAMotif uses a RNA profile, consisting
  33 of sequence and structural elements stored in a descriptor file, to
  34 search for potential motifs in a DNA sequence file.  For more
  35 information, see:
  36
  37 Macke TJ, Ecker DJ, Gutell RR, Gautheret D, Case DA, Sampath R.
  38 RNAMotif, an RNA secondary structure definition and search algorithm.
  39 Nucleic Acids Res. 2001 Nov 15;29(22):4724-35.
  40 http://www.scripps.edu/mb/case/casegr-sh-3.5.html.
  41
  42 This module is not currently complete.  As is, it will parse raw
  43 RNAMotif output (i.e. information not passed through the secondary
  44 programs rmfmt or rm2ct) and pack information into
  45 Bio::SeqFeature::Generic objects.  Currently, parsing extra output
  46 utilized by the sprintf() function in an RNAMotif descriptor is not
  47 implemented; this information is instead packed into the score tag,
  48 which can be accessed by using the following:
  49
  50   my ($score) = $feature->score;
  51
  52 If the score contains anything besides a digit, it will throw a
  53 warning that sprintf() may have been used.
  54 Several values have also been added in the 'tag' hash.  These can be
  55 accessed using the following syntax:
  56
  57   my ($entry) = $feature->get_Annotations('secstructure');
  58
  59 Added tags are :
  60
  61    descline     - entire description line (in case the regex used for
  62                   sequence ID doesn't adequately catch the name
  63    descfile     - name of the descriptor file (may include path to file)
  64    secstrucure  - contains structural information from the descriptor
  65                   used as a query
  66    sequence     - sequence of motif, separated by spaces according to
  67                   matches to the structure in the descriptor (in
  68                   SecStructure).
  69
  70 See t/RNAMotif.t for example usage.
  71
  72 The clean_features method can also be used to return a list of seqfeatures (in a
  73 Bio::SeqFeature::Collection object) that are within a particular region.   RNAMotif
  74 is prone with some descriptors to returning redundant hits; an attempt to rectify
  75 this problem is attempted with RNAMotif's companion program rmprune, which returns
  76 the structure with the longest helices (and theoretically the best scoring structure).
  77 However, this doesn't take into account alternative foldings which may score better.
  78 This method adds a bit more flexibility, giving the user the ability to screen folds
  79 based on where the feature is found and the score.  Passing a positive integer x
  80 screens SeqFeatures based on the highest score within x bp, while a negative integer
  81 screens based on the lowest score. So, to return the highest scoring values within
  82 20 bp (likely using an arbitrary scroing system in the SCORE section of a descriptor
  83 file), one could use:
  84
  85   $list = $obj->clean_features(20);
  86
  87 ... and returning the lowest scoring structures within the same region (when the
  88 score is based on calculated free energies from efn2) can be accomplished
  89 by the following:
  90
  91   $list = $obj->clean_features(-20);
  92
  93 If you wanted the best feature in a sequence, you could set this to a large number,
  94 preferrably on that exceeds the bases in a sequence
  95
  96   $list = $obj->clean_features(10000000);
  97
  98 Each seqfeature in the collection can then be acted upon:
  99
 100   @sf = $list->get_all_features;
 101   for my $f (@sf) {
 102     # do crazy things here
 103   }
 104
 105 At some point a more complicated feature object may be used to support
 106 this data rather than forcing most of the information into tag/value
 107 pairs in a SeqFeature::Generic.  This will hopefully allow for more
 108 flexible analysis of data (specifically RNA secondary structural
 109 data).  It works for now...
 110
 111 =head1 FEEDBACK
 112
 113 =head2 Mailing Lists
 114
 115 User feedback is an integral part of the evolution of this and other
 116 Bioperl modules. Send your comments and suggestions preferably to
 117 the Bioperl mailing list.  Your participation is much appreciated.
 118
 119   bioperl-l@bioperl.org                  - General discussion
 120   http://bioperl.org/wiki/Mailing_lists  - About the mailing lists
 121
 122 =head2 Support
 123
 124 Please direct usage questions or support issues to the mailing list:
 125
 126 I<bioperl-l@bioperl.org>
 127
 128 rather than to the module maintainer directly. Many experienced and
 129 reponsive experts will be able look at the problem and quickly
 130 address it. Please include a thorough description of the problem
 131 with code and data examples if at all possible.
 132
 133 =head2 Reporting Bugs
 134
 135 Report bugs to the Bioperl bug tracking system to help us keep track
 136 of the bugs and their resolution. Bug reports can be submitted via the
 137 web:
 138
 139   http://bugzilla.open-bio.org/
 140
 141 =head1 AUTHOR - Chris Fields
 142
 143 Email cjfields-at-uiuc-dot-edu
 144
 145 =head1 APPENDIX
 146
 147 The rest of the documentation details each of the object methods.
 148 Internal methods are usually preceded with a _
 149
 150 =cut
 151
 152 # Let the code begin...
 153
 154 package Bio::Tools::RNAMotif;
 155 use strict;
 156
 157 use Bio::SeqFeature::Generic;
 158 use Bio::SeqFeature::Collection;
 159
 160 use base qw(Bio::Tools::AnalysisResult);
 161
 162 our($MotifTag,$SrcTag,$DescTag) = qw(misc_binding RNAMotif rnamotif);
 163
 164 =head2 new
 165
 166  Title   : new
 167  Usage   : my $obj = Bio::Tools::RNAMotif->new();
 168  Function: Builds a new Bio::Tools::RNAMotif object
 169  Returns : an instance of Bio::Tools::RNAMotif
 170  Args    : -fh/-file for input filename
 171            -motiftag => primary tag used in gene features (default 'misc_binding')
 172            -desctag => tag used for display_name name (default 'rnamotif')
 173            -srctag  => source tag used in all features (default 'RNAMotif')
 174
 175 =cut
 176
 177 sub _initialize {
 178   my($self,@args) = @_;
 179   $self->warn('Use of this module is deprecated.  Use Bio::SearchIO::rnamotif instead');
 180   $self->SUPER::_initialize(@args);
 181   my ($motiftag,$desctag,$srctag) =  $self->SUPER::_rearrange([qw(MOTIFTAG
 182                                                                   DESCTAG
 183                                                                   SRCTAG
 184                                  )],
 185                                   @args);
 186   $self->motif_tag(defined $motiftag ? $motiftag : $MotifTag);
 187   $self->source_tag(defined $srctag ? $srctag : $SrcTag);
 188   $self->desc_tag(defined $desctag ? $desctag : $DescTag);
 189   $self->{'_sec_structure' => '',
 190           '_dfile' => ''};
 191 }
 192
 193 =head2 motif_tag
 194
 195  Title   : motif_tag
 196  Usage   : $obj->motif_tag($newval)
 197  Function: Get/Set the value used for 'motif_tag', which is used for setting the
 198            primary_tag.
 199            Default is 'misc_binding' as set by the global $MotifTag.
 200            'misc_binding' is used here because a conserved RNA motif is capable
 201            of binding proteins (regulatory proteins), antisense RNA (siRNA),
 202            small molecules (riboswitches), or nothing at all (tRNA,
 203            terminators, etc.).  It is recommended that this be changed to other
 204            tags ('misc_RNA', 'protein_binding', 'tRNA', etc.) where appropriate.
 205            For more information, see:
 206            http://www.ncbi.nlm.nih.gov/collab/FT/index.html
 207  Returns : value of motif_tag (a scalar)
 208  Args    : on set, new value (a scalar or undef, optional)
 209
 210 =cut
 211
 212 sub motif_tag{
 213     my $self = shift;
 214
 215     return $self->{'motif_tag'} = shift if @_;
 216     return $self->{'motif_tag'};
 217 }
 218
 219 =head2 source_tag
 220
 221  Title   : source_tag
 222  Usage   : $obj->source_tag($newval)
 223  Function: Get/Set the value used for the 'source_tag'.
 224            Default is 'RNAMotif' as set by the global $SrcTag
 225  Returns : value of source_tag (a scalar)
 226  Args    : on set, new value (a scalar or undef, optional)
 227
 228
 229 =cut
 230
 231 sub source_tag{
 232     my $self = shift;
 233
 234     return $self->{'source_tag'} = shift if @_;
 235     return $self->{'source_tag'};
 236 }
 237
 238
 239 =head2 desc_tag
 240
 241  Title   : desc_tag
 242  Usage   : $obj->desc_tag($newval)
 243  Function: Get/Set the value used for the query motif.  This will be placed in
 244            the tag '-display_name'.  Default is 'rnamotif' as set by the global
 245            $DescTag.  Use this to manually set the descriptor (motif searched for).
 246            Since there is no way for this module to tell what the motif is from the
 247            name of the descriptor file or the RNAMotif output, this should
 248            be set every time an RNAMotif object is instantiated for clarity
 249  Returns : value of exon_tag (a scalar)
 250  Args    : on set, new value (a scalar or undef, optional)
 251
 252 =cut
 253
 254 sub desc_tag{
 255     my $self = shift;
 256
 257     return $self->{'desc_tag'} = shift if @_;
 258     return $self->{'desc_tag'};
 259 }
 260
 261 =head2 analysis_method
 262
 263  Usage     : $obj->analysis_method();
 264  Purpose   : Inherited method. Overridden to ensure that the name matches
 265              /RNAMotif/i.
 266  Returns   : String
 267  Argument  : n/a
 268
 269 =cut
 270
 271 #-------------
 272 sub analysis_method {
 273 #-------------
 274     my ($self, $method) = @_;
 275     if($method && ($method !~ /RNAMotif/i)) {
 276     $self->throw("method $method not supported in " . ref($self));
 277     }
 278     return $self->SUPER::analysis_method($method);
 279 }
 280
 281 =head2 next_feature
 282
 283  Title   : next_feature
 284  Usage   : while($gene = $obj->next_feature()) {
 285                   # do something
 286            }
 287  Function: Returns the next gene structure prediction of the RNAMotif result
 288            file. Call this method repeatedly until FALSE is returned.
 289            The returned object is actually a SeqFeatureI implementing object.
 290            This method is required for classes implementing the
 291            SeqAnalysisParserI interface, and is merely an alias for
 292            next_prediction() at present.
 293  Returns : A Bio::Tools::Prediction::Gene object.
 294  Args    : None (at present)
 295
 296 =cut
 297
 298 sub next_feature {
 299     my ($self,@args) = @_;
 300     # even though next_prediction doesn't expect any args (and this method
 301     # does neither), we pass on args in order to be prepared if this changes
 302     # ever
 303     return $self->next_prediction(@args);
 304 }
 305
 306 =head2 next_prediction
 307
 308  Title   : next_prediction
 309  Usage   : while($gene = $obj->next_prediction()) {
 310                   # do something
 311            }
 312  Function: Returns the next gene structure prediction of the RNAMotif result
 313            file. Call this method repeatedly until FALSE is returned.
 314  Returns : A Bio::SeqFeature::Generic object
 315  Args    : None (at present)
 316
 317 =cut
 318
 319 sub next_prediction {
 320     my ($self) = @_;
 321     my ($motiftag,$srctag,$desctag) = ( $self->motif_tag,
 322                        $self->source_tag,
 323                        $self->desc_tag);
 324     my ($score, $strand, $start, $length, $sequence, $end, $seqid, $description)=0;
 325     while($_ = $self->_readline) {
 326         while($_ =~ /^#RM/) { # header line
 327             if(/^#RM\sdescr\s(.*)$/) { # contains sec structure
 328                 $self->{'_sec_structure'}=$1;
 329             }
 330             if(/^#RM\sdfile\s(.*)$/) { # contains dfile
 331                 $self->{'_dfile'}=$1;
 332             }
 333             $_ = $self->_readline;
 334         }
 335         if(m/^>((\S*)\s.*)$/) {
 336             $seqid = $2;
 337             $description = $1; # contains entire description line if needed
 338             if($seqid =~  /(gb|emb|dbj|sp|pdb|bbs|ref|lcl)\|(.*)\|/) {
 339                 $seqid = $2; # pulls out gid
 340             }
 341         }
 342         # start pulling out hit information...
 343         # regex is m/^\S+\s+(\S+)\s+(\d+)\s+(\d+)\s+(\d+)\s(.*)$/
 344         # m/^\S+\s+     # seqID, not needed
 345         # (.+)\s+       # score, or extra info using sprintf()
 346         # (\d+)\s+      # strand
 347         # (\d+)\s+      # start
 348         # (\d+)\s       # length
 349         # (.*)$/        # sequence, divided according to descriptor
 350         if(m/^\S+\s+(.+)\s+(\d+)\s+(\d+)\s+(\d+)\s(.*)$/) {
 351             ($score, $strand, $start, $length, $sequence, $end)=
 352                 ($1, $2, $3, $4, $5, 0);
 353             if( $strand==0 ) {
 354                 $end = $start + $length -1;
 355                 $strand = 1;
 356             } else {
 357                 $end = $start - $length + 1;
 358                 ($start, $end, $strand) = ($end, $start, -1);
 359             }
 360             my $gene = Bio::SeqFeature::Generic->new(-seq_id => $seqid,
 361                                                       -start  => $start,
 362                                                       -end    => $end,
 363                                                       -strand => $strand,
 364                                                       -score  => $score,
 365                                                       -primary_tag => $motiftag,
 366                                                       -source_tag  => $srctag,
 367                                                       -display_name => $desctag,
 368                                                       -tag     => {
 369                                                         'descline'       => $description,
 370                                                         'descfile'      => $self->{'_dfile'},
 371                                                         'secstructure'  => $self->{'_sec_structure'},
 372                                                         'sequence'       => $sequence});
 373             return $gene;
 374         }
 375     }
 376 }
 377
 378 =head2 clean_features
 379
 380  Title   : next_prediction
 381  Usage   : @list = $obj->clean_features(-10);
 382  Function: Cleans (reduces redundant hits) based on score, position
 383  Returns : a Bio::SeqFeature::Collection object
 384  Args    : Pos./Neg. integer (for highest/lowest scoring seqfeature within x bp).
 385  Throws  : Error unless digit is entered.
 386
 387 =cut
 388
 389 sub clean_features {
 390     my $self = shift;
 391     my $bp = shift;
 392     $self->throw("No arg, need pos. or neg. integer") if !$bp;
 393     $self->throw("Need pos. or neg. integer") if ($bp !~ /\-?\d/ || $bp =~ /\./);
 394     my ($b, $sf2);
 395     my @list = ();
 396     my @features = ();
 397     while (my $pred = $self->next_prediction) {
 398         push @features, $pred;
 399     }
 400     while (@features) {
 401         $b = shift @features if !defined($b);
 402         $sf2 = shift @features;
 403         # from same sequence?
 404         if ($sf2) { # if there is no feature, then...
 405             if ($b->seq_id == $sf2->seq_id) {
 406                 # close starts (probable redundant hit)?
 407                 if(abs(($b->start)-($sf2->start)) <= abs($bp)) {
 408                     # which score is better?
 409                     if( (($bp < 0) && ($b->score > $sf2->score))  ||  # lowest score
 410                         (($bp > 0) && ($b->score < $sf2->score)) ){   # highest score
 411                         $b = $sf2;
 412                         next;
 413                     } else {
 414                         next;
 415                     }
 416                 }
 417                 push @list, $b;
 418                 $b = $sf2;
 419             }
 420         }
 421     }
 422     push @list, $b if $b;
 423     my $col = Bio::SeqFeature::Collection->new;
 424     $col->add_features(\@list);
 425     return $col;
 426 }
 427
 428 1;