Sync'ed RichSeqI with the implementation. RichSeq provides backward
[bioperl-live.git] / Bio / Search / SearchUtils.pm
blobfe8e5aed5ca1f5b66391e094fed8104f643515f9
1 =head1 NAME
3 Bio::Search::SearchUtils - Utility functions for Bio::Search:: objects
5 =head1 SYNOPSIS
7 This module is just a collection of subroutines, not an object.
9 =head1 DESCRIPTION
11 The SearchUtils.pm module is a collection of subroutines used primarily by
12 Bio::Search::Hit::HitI objects for some of the additional
13 functionality, such as HSP tiling. Right now, the SearchUtils is just a
14 collection of methods, not an object.
16 =head1 AUTHOR
18 Steve Chervitz E<lt>sac@bioperl.orgE<gt>
20 =cut
24 package Bio::Search::SearchUtils;
26 use strict;
27 use vars qw($DEBUG);
28 $DEBUG = 1;
30 =head2 tile_hsps
32 Usage : tile_hsps( $sbjct );
33 : This is called automatically by methods in Bio::Search::Hit::GenericHit
34 : that rely on having tiled data.
35 Purpose : Collect statistics about the aligned sequences in a set of HSPs.
36 : Calculates the following data across all HSPs:
37 : -- total alignment length
38 : -- total identical residues
39 : -- total conserved residues
40 Returns : n/a
41 Argument : A Bio::Search::Hit::HitI object
42 Throws : n/a
43 Comments :
44 : This method performs more careful summing of data across
45 : all HSPs in the Sbjct object. Only HSPs that are in the same strand
46 : and frame are tiled. Simply summing the data from all HSPs
47 : in the same strand and frame will overestimate the actual
48 : length of the alignment if there is overlap between different HSPs
49 : (often the case).
51 : The strategy is to tile the HSPs and sum over the
52 : contigs, collecting data separately from overlapping and
53 : non-overlapping regions of each HSP. To facilitate this, the
54 : HSP.pm object now permits extraction of data from sub-sections
55 : of an HSP.
57 : Additional useful information is collected from the results
58 : of the tiling. It is possible that sub-sequences in
59 : different HSPs will overlap significantly. In this case, it
60 : is impossible to create a single unambiguous alignment by
61 : concatenating the HSPs. The ambiguity may indicate the
62 : presence of multiple, similar domains in one or both of the
63 : aligned sequences. This ambiguity is recorded using the
64 : ambiguous_aln() method.
66 : This method does not attempt to discern biologically
67 : significant vs. insignificant overlaps. The allowable amount of
68 : overlap can be set with the overlap() method or with the -OVERLAP
69 : parameter used when constructing the Hit object.
71 : For a given hit, both the query and the sbjct sequences are
72 : tiled independently.
74 : -- If only query sequence HSPs overlap,
75 : this may suggest multiple domains in the sbjct.
76 : -- If only sbjct sequence HSPs overlap,
77 : this may suggest multiple domains in the query.
78 : -- If both query & sbjct sequence HSPs overlap,
79 : this suggests multiple domains in both.
80 : -- If neither query & sbjct sequence HSPs overlap,
81 : this suggests either no multiple domains in either
82 : sequence OR that both sequences have the same
83 : distribution of multiple similar domains.
85 : This method can deal with the special case of when multiple
86 : HSPs exactly overlap.
88 : Efficiency concerns:
89 : Speed will be an issue for sequences with numerous HSPs.
91 Bugs : Currently, tile_hsps() does not properly account for
92 : the number of non-tiled but overlapping HSPs, which becomes a problem
93 : as overlap() grows. Large values overlap() may thus lead to
94 : incorrect statistics for some hits. For best results, keep overlap()
95 : below 5 (DEFAULT IS 2). For more about this, see the "HSP Tiling and
96 : Ambiguous Alignments" section in L<Bio::Search::Hit::GenericHit>.
98 See Also : L<_adjust_contigs>(), L<Bio::Search::Hit::GenericHit|Bio::Search::Hit::GenericHit>
100 =cut
102 #--------------
103 sub tile_hsps {
104 #--------------
105 my $sbjct = shift;
107 $sbjct->tiled_hsps(1);
108 $sbjct->gaps('query', 0);
109 $sbjct->gaps('hit', 0);
111 ## Simple summation scheme. Valid if there is only one HSP.
112 if( $sbjct->n == 1 or $sbjct->num_hsps == 1) {
113 my $hsp = $sbjct->hsp;
114 $sbjct->length_aln('query', $hsp->length('query'));
115 $sbjct->length_aln('hit', $hsp->length('sbjct'));
116 $sbjct->length_aln('total', $hsp->length('total'));
117 $sbjct->matches( $hsp->matches() );
118 $sbjct->gaps('query', $hsp->gaps('query'));
119 $sbjct->gaps('sbjct', $hsp->gaps('sbjct'));
121 # print "_tile_hsps(): single HSP, easy stats.\n";
122 return;
123 } else {
124 # print STDERR "Sbjct: _tile_hsps: summing multiple HSPs\n";
125 $sbjct->length_aln('query', 0);
126 $sbjct->length_aln('sbjct', 0);
127 $sbjct->length_aln('total', 0);
128 $sbjct->matches( 0, 0);
131 ## More than one HSP. Must tile HSPs.
132 # print "\nTiling HSPs for $sbjct\n";
133 my($hsp, $qstart, $qstop, $sstart, $sstop);
134 my($frame, $strand, $qstrand, $sstrand);
135 my(@qcontigs, @scontigs);
136 my $qoverlap = 0;
137 my $soverlap = 0;
138 my $max_overlap = $sbjct->overlap;
139 my $hit_qgaps = 0;
140 my $hit_sgaps = 0;
141 my $hit_len_aln = 0;
142 my %start_stop;
143 my $v = $sbjct->verbose;
144 foreach $hsp ( $sbjct->hsps() ) {
145 $sbjct->debug( sprintf(" HSP: %s %d..%d\n",$hsp->query->seq_id, $hsp->query->start, $hsp->hit->end)) if $v > 0; #$hsp->str('query');
146 # printf " Length = %d; Identical = %d; Conserved = %d; Conserved(1-10): %d",$hsp->length, $hsp->length(-TYPE=>'iden'),
147 # $hsp->length(-TYPE=>'cons'),
148 # $hsp->length(-TYPE=>'cons',
149 # -START=>0,-STOP=>10);
151 ($qstart, $qstop) = $hsp->range('query');
152 ($sstart, $sstop) = $hsp->range('sbjct');
153 $frame = $hsp->frame;
154 $frame = -1 unless defined $frame;
156 ($qstrand, $sstrand) = ($hsp->query->strand,
157 $hsp->hit->strand);
159 # Note: No correction for overlap.
161 my ($qgaps, $sgaps) = ($hsp->gaps('query'), $hsp->gaps('hit'));
162 $hit_qgaps += $qgaps;
163 $hit_sgaps += $sgaps;
164 $hit_len_aln += $hsp->length;
166 ## Collect contigs in the query sequence.
167 $qoverlap = &_adjust_contigs('query', $hsp, $qstart, $qstop,
168 \@qcontigs, $max_overlap, $frame,
169 $qstrand);
171 ## Collect contigs in the sbjct sequence
172 # (needed for domain data and gapped Blast).
173 $soverlap = &_adjust_contigs('sbjct', $hsp, $sstart, $sstop,
174 \@scontigs, $max_overlap, $frame,
175 $sstrand);
177 ## Collect overall start and stop data for query and
178 # sbjct over all HSPs.
179 unless ( defined $start_stop{'qstart'} ) {
180 $start_stop{'qstart'} = $qstart;
181 $start_stop{'qstop'} = $qstop;
182 $start_stop{'sstart'} = $sstart;
183 $start_stop{'sstop'} = $sstop;
184 } else {
185 $start_stop{'qstart'} = ($qstart < $start_stop{'qstart'} ?
186 $qstart : $start_stop{'qstart'} );
187 $start_stop{'qstop'} = ($qstop > $start_stop{'qstop'} ?
188 $qstop : $start_stop{'qstop'} );
189 $start_stop{'sstart'} = ($sstart < $start_stop{'sstart'} ?
190 $sstart : $start_stop{'sstart'} );
191 $start_stop{'sstop'} = ($sstop > $start_stop{'sstop'} ?
192 $sstop : $start_stop{'sstop'} );
196 # Store the collected data in the Hit object
197 $sbjct->gaps('query', $hit_qgaps);
198 $sbjct->gaps('hit', $hit_sgaps);
199 $sbjct->length_aln('total', $hit_len_aln);
201 $sbjct->start('query',$start_stop{'qstart'});
202 $sbjct->end('query', $start_stop{'$qstop'});
203 $sbjct->start('hit', $start_stop{'$sstart'});
204 $sbjct->end('hit', $start_stop{'$sstop'});
206 ## Collect data across the collected contigs.
208 $sbjct->debug( "\nQUERY CONTIGS:\n".
209 " gaps = $sbjct->{'_gaps_query'}\n");
211 # Account for strand/frame.
212 # Strategy: collect data on a per strand+frame basis and
213 # save the most significant one.
214 my (%qctg_dat);
215 foreach (@qcontigs) {
216 ($frame, $strand) = ($_->{'frame'}, $_->{'strand'});
218 if( $v > 0 ) {
219 $sbjct->debug(sprintf( "$frame/$strand len is getting %d for %d..%d\n",
220 ($_->{'stop'} - $_->{'start'} + 1), $_->{'start'}, $_->{'stop'}));
222 $qctg_dat{ "$frame$strand" }->{'length_aln_query'} += $_->{'stop'} - $_->{'start'} + 1;
223 $qctg_dat{ "$frame$strand" }->{'totalIdentical'} += $_->{'iden'};
224 $qctg_dat{ "$frame$strand" }->{'totalConserved'} += $_->{'cons'};
225 $qctg_dat{ "$frame$strand" }->{'qstrand'} = $strand;
228 # Find longest contig.
229 my @sortedkeys = sort { $qctg_dat{$b}->{'length_aln_query'}
230 <=> $qctg_dat{$a}->{'length_aln_query'} }
231 keys %qctg_dat;
233 # Save the largest to the sbjct:
234 my $longest = $sortedkeys[0];
235 $sbjct->debug( "longest is ". $qctg_dat{ $longest }->{'length_aln_query'}. "\n");
236 $sbjct->length_aln('query', $qctg_dat{ $longest }->{'length_aln_query'});
237 $sbjct->matches($qctg_dat{ $longest }->{'totalIdentical'},
238 $qctg_dat{ $longest }->{'totalConserved'});
239 $sbjct->strand('query', $qctg_dat{ $longest }->{'qstrand'});
241 ## Collect data for sbjct contigs. Important for gapped Blast.
242 ## The totalIdentical and totalConserved numbers will be the same
243 ## as determined for the query contigs.
245 $sbjct->debug( "\nSBJCT CONTIGS:\n".
246 " gaps = ". $sbjct->gaps('sbjct'). "\n");
247 my (%sctg_dat);
248 foreach(@scontigs) {
249 $sbjct->debug(" sbjct contig: $_->{'start'} - $_->{'stop'}\n".
250 " iden = $_->{'iden'}; cons = $_->{'cons'}\n");
251 ($frame, $strand) = ($_->{'frame'}, $_->{'strand'});
252 $sctg_dat{ "$frame$strand" }->{'length_aln_sbjct'} += $_->{'stop'} - $_->{'start'} + 1;
253 $sctg_dat{ "$frame$strand" }->{'frame'} = $frame;
254 $sctg_dat{ "$frame$strand" }->{'sstrand'} = $strand;
257 @sortedkeys = sort { $sctg_dat{ $b }->{'length_aln_sbjct'}
258 <=> $sctg_dat{ $a }->{'length_aln_sbjct'}
259 } keys %sctg_dat;
261 # Save the largest to the sbjct:
262 $longest = $sortedkeys[0];
264 $sbjct->length_aln('sbjct', $sctg_dat{ $longest }->{'length_aln_sbjct'});
265 $sbjct->frame( $sctg_dat{ $longest }->{'frame'} );
266 $sbjct->strand('hit', $sctg_dat{ $longest }->{'sstrand'});
268 if($qoverlap) {
269 if($soverlap) { $sbjct->ambiguous_aln('qs');
270 $sbjct->debug("\n*** AMBIGUOUS ALIGNMENT: Query and Sbjct\n\n");
272 else { $sbjct->ambiguous_aln('q');
273 $sbjct->debug( "\n*** AMBIGUOUS ALIGNMENT: Query\n\n");
275 } elsif($soverlap) {
276 $sbjct->ambiguous_aln('s');
277 $sbjct->debug( "\n*** AMBIGUOUS ALIGNMENT: Sbjct\n\n");
280 # Adjust length based on BLAST flavor.
281 my $prog = $sbjct->algorithm;
282 if($prog =~ /^(PSI)?T(BLAST|FAST)[NY]/oi ) {
283 $sbjct->length_aln('sbjct', $sbjct->length_aln('sbjct')/3);
284 } elsif($prog =~ /^(BLAST|FAST)(X|Y|XY)/oi ) {
285 $sbjct->length_aln('query', $sbjct->length_aln('query')/3);
286 } elsif($prog =~ /^T(BLAST|FAST)(X|Y|XY)/oi ) {
287 $sbjct->length_aln('query', $sbjct->length_aln('query')/3);
288 $sbjct->length_aln('sbjct', $sbjct->length_aln('sbjct')/3);
294 =head2 _adjust_contigs
296 Usage : n/a; called automatically during object construction.
297 Purpose : Builds HSP contigs for a given BLAST hit.
298 : Utility method called by _tile_hsps()
299 Returns :
300 Argument :
301 Throws : Exceptions propagated from Bio::Search::Hit::BlastHSP::matches()
302 : for invalid sub-sequence ranges.
303 Status : Experimental
304 Comments : This method does not currently support gapped alignments.
305 : Also, it does not keep track of the number of HSPs that
306 : overlap within the amount specified by overlap().
307 : This will lead to significant tracking errors for large
308 : overlap values.
310 See Also : L<tile_hsps>(), L<Bio::Search::Hit::BlastHSP::matches|Bio::Search::Hit::BlastHSP>
312 =cut
314 #-------------------
315 sub _adjust_contigs {
316 #-------------------
317 my ($seqType, $hsp, $start, $stop, $contigs_ref,
318 $max_overlap, $frame, $strand) = @_;
320 my $overlap = 0;
321 my ($numID, $numCons);
324 foreach ( @$contigs_ref) {
326 # Don't merge things unless they have matching strand/frame.
327 next unless ($_->{'frame'} == $frame &&
328 $_->{'strand'} == $strand);
330 ## Test special case of a nested HSP. Skip it.
331 if( $start >= $_->{'start'} && $stop <= $_->{'stop'}) {
332 # print SDTERR ( "----> Nested HSP. Skipping.\n");
333 $overlap = 1;
334 next;
337 ## Test for overlap at beginning of contig.
338 # to find left most
339 if($start < $_->{'start'} &&
340 $stop > ($_->{'start'} + $max_overlap) ) {
341 # print STDERR "----> Overlaps beg: existing beg,end: $_->{'start'},$_->{'stop'}, new beg,end: $start,$stop\n";
342 # Collect stats over the non-overlapping region.
343 eval {
344 ($numID, $numCons) = $hsp->matches(-SEQ =>$seqType,
345 -START =>$start,
346 -STOP =>$_->{'start'}-1);
348 if($@) { warn "\a\n$@\n"; }
349 else {
350 $_->{'start'} = $start; # Assign a new start coordinate to the contig
351 $_->{'iden'} += $numID; # and add new data to #identical, #conserved.
352 $_->{'cons'} += $numCons;
353 $overlap = 1;
357 ## Test for overlap at end of contig.
358 if($stop > $_->{'stop'} and
359 $start < ($_->{'stop'} - $max_overlap)) {
360 # print STDERR "----> Overlaps end: existing beg,end: $_->{'start'},$_->{'stop'}, new beg,end: $start,$stop\n";
361 # Collect stats over the non-overlapping region.
362 eval {
363 ($numID,$numCons) = $hsp->matches(-SEQ =>$seqType,
364 -START =>$_->{'stop'},
365 -STOP =>$stop);
367 if($@) { warn "\a\n$@\n"; }
368 else {
369 $_->{'stop'} = $stop; # Assign a new stop coordinate to the contig
370 $_->{'iden'} += $numID; # and add new data to #identical, #conserved.
371 $_->{'cons'} += $numCons;
372 $overlap = 1;
375 $overlap && do {
376 # print STDERR " New Contig data:\n";
377 # print STDERR " Contig: $_->{'start'} - $_->{'stop'}, iden= $_->{'iden'}, cons= $_->{'cons'}\n";
378 last;
381 ## If there is no overlap, add the complete HSP data.
382 ! $overlap && do {
383 ($numID,$numCons) = $hsp->matches(-SEQ=>$seqType);
384 push @$contigs_ref, {'start' =>$start, 'stop' =>$stop,
385 'iden' =>$numID, 'cons' =>$numCons,
386 'strand'=>$strand,'frame'=>$frame};
388 $overlap;
391 =head2 get_exponent
393 Usage : &get_exponent( number );
394 Purpose : Determines the power of 10 exponent of an integer, float,
395 : or scientific notation number.
396 Example : &get_exponent("4.0e-206");
397 : &get_exponent("0.00032");
398 : &get_exponent("10.");
399 : &get_exponent("1000.0");
400 : &get_exponent("e+83");
401 Argument : Float, Integer, or scientific notation number
402 Returns : Integer representing the exponent part of the number (+ or -).
403 : If argument == 0 (zero), return value is "-999".
404 Comments : Exponents are rounded up (less negative) if the mantissa is >= 5.
405 : Exponents are rounded down (more negative) if the mantissa is <= -5.
407 =cut
409 #------------------
410 sub get_exponent {
411 #------------------
412 my $data = shift;
414 my($num, $exp) = split /[eE]/, $data;
416 if( defined $exp) {
417 $num = 1 if not $num;
418 $num >= 5 and $exp++;
419 $num <= -5 and $exp--;
420 } elsif( $num == 0) {
421 $exp = -999;
422 } elsif( not $num =~ /\./) {
423 $exp = CORE::length($num) -1;
424 } else {
425 $exp = 0;
426 $num .= '0' if $num =~ /\.$/;
427 my ($c);
428 my $rev = 0;
429 if($num !~ /^0/) {
430 $num = reverse($num);
431 $rev = 1;
433 do { $c = chop($num);
434 $c == 0 && $exp++;
435 } while( $c ne '.');
437 $exp = -$exp if $num == 0 and not $rev;
438 $exp -= 1 if $rev;
440 return $exp;
443 =head2 collapse_nums
445 Usage : @cnums = collapse_nums( @numbers );
446 Purpose : Collapses a list of numbers into a set of ranges of consecutive terms:
447 : Useful for condensing long lists of consecutive numbers.
448 : EXPANDED:
449 : 1 2 3 4 5 6 10 12 13 14 15 17 18 20 21 22 24 26 30 31 32
450 : COLLAPSED:
451 : 1-6 10 12-15 17 18 20-22 24 26 30-32
452 Argument : List of numbers sorted numerically.
453 Returns : List of numbers mixed with ranges of numbers (see above).
454 Throws : n/a
456 See Also : L<Bio::Search::Hit::BlastHit::seq_inds()|Bio::Search::Hit::BlastHit>
458 =cut
460 #------------------
461 sub collapse_nums {
462 #------------------
463 # This is probably not the slickest connectivity algorithm, but will do for now.
464 my @a = @_;
465 my ($from, $to, $i, @ca, $consec);
467 $consec = 0;
468 for($i=0; $i < @a; $i++) {
469 not $from and do{ $from = $a[$i]; next; };
470 if($a[$i] == $a[$i-1]+1) {
471 $to = $a[$i];
472 $consec++;
473 } else {
474 if($consec == 1) { $from .= ",$to"; }
475 else { $from .= $consec>1 ? "\-$to" : ""; }
476 push @ca, split(',', $from);
477 $from = $a[$i];
478 $consec = 0;
479 $to = undef;
482 if(defined $to) {
483 if($consec == 1) { $from .= ",$to"; }
484 else { $from .= $consec>1 ? "\-$to" : ""; }
486 push @ca, split(',', $from) if $from;
488 @ca;
492 =head2 strip_blast_html
494 Usage : $boolean = &strip_blast_html( string_ref );
495 : This method is exported.
496 Purpose : Removes HTML formatting from a supplied string.
497 : Attempts to restore the Blast report to enable
498 : parsing by Bio::SearchIO::blast.pm
499 Returns : Boolean: true if string was stripped, false if not.
500 Argument : string_ref = reference to a string containing the whole Blast
501 : report containing HTML formatting.
502 Throws : Croaks if the argument is not a scalar reference.
503 Comments : Based on code originally written by Alex Dong Li
504 : (ali@genet.sickkids.on.ca).
505 : This method does some Blast-specific stripping
506 : (adds back a '>' character in front of each HSP
507 : alignment listing).
509 : THIS METHOD IS VERY SENSITIVE TO BLAST FORMATTING CHANGES!
511 : Removal of the HTML tags and accurate reconstitution of the
512 : non-HTML-formatted report is highly dependent on structure of
513 : the HTML-formatted version. For example, it assumes that first
514 : line of each alignment section (HSP listing) starts with a
515 : <a name=..> anchor tag. This permits the reconstruction of the
516 : original report in which these lines begin with a ">".
517 : This is required for parsing.
519 : If the structure of the Blast report itself is not intended to
520 : be a standard, the structure of the HTML-formatted version
521 : is even less so. Therefore, the use of this method to
522 : reconstitute parsable Blast reports from HTML-format versions
523 : should be considered a temorary solution.
525 See Also : B<Bio::Search::Processor::BlastIO::new()>
527 =cut
529 #--------------------
530 sub strip_blast_html {
531 #--------------------
532 # This may not best way to remove html tags. However, it is simple.
533 # it won't work under following conditions:
534 # 1) if quoted > appears in a tag (does this ever happen?)
535 # 2) if a tag is split over multiple lines and this method is
536 # used to process one line at a time.
538 my ($string_ref) = shift;
540 ref $string_ref eq 'SCALAR' or
541 croak ("Can't strip HTML: ".
542 "Argument is should be a SCALAR reference not a ${\ref $string_ref}\n");
544 my $str = $$string_ref;
545 my $stripped = 0;
547 # Removing "<a name =...>" and adding the '>' character for
548 # HSP alignment listings.
549 $str =~ s/(\A|\n)<a name ?=[^>]+> ?/>/sgi and $stripped = 1;
551 # Removing all "<>" tags.
552 $str =~ s/<[^>]+>|&nbsp//sgi and $stripped = 1;
554 # Re-uniting any lone '>' characters.
555 $str =~ s/(\A|\n)>\s+/\n\n>/sgi and $stripped = 1;
557 $$string_ref = $str;
558 $stripped;