tag fourth (and hopefully last) alpha
[bioperl-live.git] / branch-1-6 / Bio / Search / Hit / PsiBlastHit.pm
blob1df985092624bb41d433ddb1ff64bf711a48291f
1 #-----------------------------------------------------------------
2 # $Id$
4 # BioPerl module Bio::Search::Hit::PsiBlastHit
6 # (This module was originally called Bio::Tools::Blast::Sbjct)
8 # Please direct questions and support issues to <bioperl-l@bioperl.org>
10 # Cared for by Steve Chervitz <sac@bioperl.org>
12 # You may distribute this module under the same terms as perl itself
13 #-----------------------------------------------------------------
15 ## POD Documentation:
17 =head1 NAME
19 Bio::Search::Hit::PsiBlastHit - Bioperl BLAST Hit object
21 =head1 SYNOPSIS
23 See L<Bio::Search::Result::BlastResult>.
25 =head1 DESCRIPTION
27 The Bio::Search::Hit::PsiBlastHit.pm module encapsulates data and
28 methods for manipulating "hits" from a BLAST report. A BLAST hit is a
29 collection of HSPs along with other metadata such as sequence name and
30 score information. Hit objects are accessed via
31 L<Bio::Search::Result::BlastResult> objects after parsing a BLAST
32 report using the L<Bio::SearchIO> system.
34 In Blast lingo, the "sbjct" sequences are all the sequences in a
35 target database which were compared against a "query" sequence. The
36 terms "sbjct" and "hit" will be used interchangeably in this module.
37 All methods that take 'sbjct' as an argument also support 'hit' as a
38 synonym.
40 This module supports BLAST versions 1.x and 2.x, gapped and ungapped,
41 and PSI-BLAST.
43 The construction of PsiBlastHit objects is performed by
44 Bio::SearchIO::blast::PsiBlastHitFactory in a process that is
45 orchestrated by the Blast parser (L<Bio::SearchIO::blast>).
46 The resulting PsiBlastHits are then accessed via
47 L<Bio::Search::Result::BlastResult>). Therefore, you do not need to
48 use L<Bio::Search::Hit::PsiBlastHit>) directly. If you need to
49 construct PsiBlastHits directly, see the C<new()> function for details.
51 For L<Bio::SearchIO> BLAST parsing usage examples, see the
52 C<examples/search-blast> directory of the Bioperl distribution.
55 =head2 HSP Tiling and Ambiguous Alignments
57 If a Blast hit has more than one HSP, the Bio::Search::Hit::PsiBlastHit.pm
58 object has the ability to merge overlapping HSPs into contiguous
59 blocks. This permits the PsiBlastHit object to sum data across all HSPs
60 without counting data in the overlapping regions multiple times, which
61 would happen if data from each overlapping HSP are simply summed. HSP
62 tiling is performed automatically when methods of the PsiBlastHit object
63 that rely on tiled data are invoked. These include
64 L<frac_identical()|frac_identical>, L<frac_conserved()|frac_conserved>, L<gaps()|gaps>,
65 L<frac_aligned_query()|frac_aligned_query>, L<frac_aligned_hit()|frac_aligned_hit>,
66 L<num_unaligned_query()|num_unaligned_query>, L<num_unaligned_hit()|num_unaligned_hit>.
68 It also permits the assessment of an "ambiguous alignment" if the
69 query (or sbjct) sequences from different HSPs overlap
70 (see L<ambiguous_aln()|ambiguous_aln>). The existence
71 of an overlap could indicate a biologically interesting region in the
72 sequence, such as a repeated domain. The PsiBlastHit object uses the
73 C<-OVERLAP> parameter to determine when two sequences overlap; if this is
74 set to 2 -- the default -- then any two sbjct or query HSP sequences
75 must overlap by more than two residues to get merged into the same
76 contig and counted as an overlap. See the L<BUGS | BUGS> section below for
77 "issues" with HSP tiling.
80 The results of the HSP tiling is reported with the following ambiguity codes:
82 'q' = Query sequence contains multiple sub-sequences matching
83 a single region in the sbjct sequence.
85 's' = Subject (PsiBlastHit) sequence contains multiple sub-sequences matching
86 a single region in the query sequence.
88 'qs' = Both query and sbjct sequences contain more than one
89 sub-sequence with similarity to the other sequence.
92 For addition information about ambiguous BLAST alignments, see
93 L<Bio::Search::BlastUtils::tile_hsps()|Bio::Search::BlastUtils>
95 =head1 DEPENDENCIES
97 Bio::Search::Hit::PsiBlastHit.pm is a concrete class that inherits from
98 L<Bio::Root::Root> and L<Bio::Search::Hit::HitI>. and relies on
99 L<Bio::Search::HSP::BlastHSP>.
102 =head1 BUGS
104 One consequence of the HSP tiling is that methods that rely on HSP
105 tiling such as L<frac_identical()|frac_identical>, L<frac_conserved()|frac_conserved>, L<gaps()|gaps>
106 etc. may report misleading numbers when C<-OVERLAP> is set to a large
107 number. For example, say we have two HSPs and the query sequence tile
108 as follows:
110 1 8 22 30 40 60
111 Full seq: ------------------------------------------------------------
112 * ** * **
113 HSP1: --------------- (6 identical matches)
114 ** ** **
115 HSP2: ------------- (6 identical matches)
118 If C<-OVERLAP> is set to some number over 4, HSP1 and HSP2 will not be
119 tiled into a single contig and their numbers of identical matches will
120 be added, giving a total of 12, not 10 if they had be combined into
121 one contig. This can lead to number greater than 1.0 for methods
122 L<frac_identical()|frac_identical> and L<frac_conserved()|frac_conserved>. This is less of an issue
123 with gapped Blast since it tends to combine HSPs that would be listed
124 separately without gapping. (Fractions E<gt>1.0 can be viewed as a
125 signal for an interesting alignment that warrants further inspection,
126 thus turning this bug into a feature :-).
128 Using large values for C<-OVERLAP> can lead to incorrect numbers
129 reported by methods that rely on HSP tiling but can be useful if you
130 care more about detecting ambiguous alignments. Setting C<-OVERLAP>
131 to zero will lead to the most accurate numbers for the
132 tiling-dependent methods but will be useless for detecting overlapping
133 HSPs since all HSPs will appear to overlap.
136 =head1 SEE ALSO
138 Bio::Search::HSP::BlastHSP.pm - Blast HSP object.
139 Bio::Search::Result::BlastResult.pm - Blast Result object.
140 Bio::Search::Hit::HitI.pm - Interface implemented by PsiBlastHit.pm
141 Bio::Root::Root.pm - Base class for PsiBlastHit.pm
143 Links:
145 http://bio.perl.org/ - Bioperl Project Homepage
148 =head1 FEEDBACK
150 =head2 Mailing Lists
152 User feedback is an integral part of the evolution of this and other
153 Bioperl modules. Send your comments and suggestions preferably to one
154 of the Bioperl mailing lists. Your participation is much appreciated.
156 bioperl-l@bioperl.org - General discussion
157 http://bioperl.org/wiki/Mailing_lists - About the mailing lists
159 =head2 Support
161 Please direct usage questions or support issues to the mailing list:
163 I<bioperl-l@bioperl.org>
165 rather than to the module maintainer directly. Many experienced and
166 reponsive experts will be able look at the problem and quickly
167 address it. Please include a thorough description of the problem
168 with code and data examples if at all possible.
170 =head2 Reporting Bugs
172 Report bugs to the Bioperl bug tracking system to help us keep track
173 the bugs and their resolution. Bug reports can be submitted via the
174 web:
176 http://bugzilla.open-bio.org/
178 =head1 AUTHOR
180 Steve Chervitz E<lt>sac@bioperl.orgE<gt>
182 See L<the FEEDBACK section | FEEDBACK> for where to send bug reports and comments.
184 =head1 ACKNOWLEDGEMENTS
186 This software was originally developed in the Department of Genetics
187 at Stanford University. I would also like to acknowledge my
188 colleagues at Affymetrix for useful feedback.
190 =head1 COPYRIGHT
192 Copyright (c) 1996-2001 Steve Chervitz. All Rights Reserved.
194 =head1 DISCLAIMER
196 This software is provided "as is" without warranty of any kind.
198 =head1 APPENDIX
200 The rest of the documentation details each of the object methods.
201 Internal methods are usually preceded with a _
203 =cut
206 # Let the code begin...
208 package Bio::Search::Hit::PsiBlastHit;
210 use strict;
211 use Bio::Search::BlastUtils;
212 use vars qw(%SUMMARY_OFFSET);
214 use overload
215 '""' => \&to_string;
217 use base qw(Bio::Root::Root Bio::Search::Hit::HitI);
220 =head2 new
222 Usage : $hit = Bio::Search::Hit::PsiBlastHit->new( %named_params );
223 : Bio::Search::Hit::PsiBlastHit.pm objects are constructed
224 : automatically by Bio::SearchIO::PsiBlastHitFactory.pm,
225 : so there is no need for direct instantiation.
226 Purpose : Constructs a new PsiBlastHit object and Initializes key variables
227 : for the hit.
228 Returns : A Bio::Search::Hit::PsiBlastHit object
229 Argument : Named Parameters:
230 : Parameter keys are case-insensitive.
231 : -RAW_DATA => array reference holding raw BLAST report data
232 : for a single hit. This includes all lines
233 : within the HSP alignment listing section of a
234 : traditional BLAST or PSI-BLAST (non-XML) report,
235 : starting at (or just after) the leading '>'.
236 : -HOLD_RAW_DATA => boolean, should -RAW_DATA be saved within the object.
237 : -QUERY_LEN => Length of the query sequence
238 : -ITERATION => integer (PSI-BLAST iteration number in which hit was found)
239 : -OVERLAP => integer (maximum overlap between adjacent
240 : HSPs when tiling)
241 : -PROGRAM => string (type of Blast: BLASTP, BLASTN, etc)
242 : -SIGNIF => significance
243 : -IS_PVAL => boolean, true if -SIGNIF contains a P-value
244 : -SCORE => raw BLAST score
245 : -FOUND_AGAIN => boolean, true if this was a hit from the
246 : section of a PSI-BLAST with iteration > 1
247 : containing sequences that were also found
248 : in iteration 1.
249 Comments : This object accepts raw Blast report data not because it
250 : is required for parsing, but in order to retrieve it
251 : (only available if -HOLD_RAW_DATA is set to true).
253 See Also : L<Bio::Search::BlastUtils::tile_hsps()|Bio::Search::BlastUtils>, L<Bio::Root::Root::new()|Bio::Root::Root>
255 =cut
257 #-------------------
258 sub new {
259 #-------------------
260 my ($class, @args ) = @_;
261 my $self = $class->SUPER::new( @args );
263 my ($raw_data, $signif, $is_pval, $hold_raw);
265 ($self->{'_blast_program'}, $self->{'_query_length'}, $raw_data, $hold_raw,
266 $self->{'_overlap'}, $self->{'_iteration'}, $signif, $is_pval,
267 $self->{'_score'}, $self->{'_found_again'} ) =
268 $self->_rearrange( [qw(PROGRAM
269 QUERY_LEN
270 RAW_DATA
271 HOLD_RAW_DATA
272 OVERLAP
273 ITERATION
274 SIGNIF
275 IS_PVAL
276 SCORE
277 FOUND_AGAIN )], @args );
279 # TODO: Handle this in parser. Just pass in name parameter.
280 $self->_set_id( $raw_data->[0] );
282 if($is_pval) {
283 $self->{'_p'} = $signif;
284 } else {
285 $self->{'_expect'} = $signif;
288 if( $hold_raw ) {
289 $self->{'_hit_data'} = $raw_data;
292 return $self;
295 sub DESTROY {
296 my $self=shift;
297 #print STDERR "-->DESTROYING $self\n";
301 #=================================================
302 # Begin Bio::Search::Hit::HitI implementation
303 #=================================================
305 =head2 algorithm
307 Title : algorithm
308 Usage : $alg = $hit->algorithm();
309 Function: Gets the algorithm specification that was used to obtain the hit
310 For BLAST, the algorithm denotes what type of sequence was aligned
311 against what (BLASTN: dna-dna, BLASTP prt-prt, BLASTX translated
312 dna-prt, TBLASTN prt-translated dna, TBLASTX translated
313 dna-translated dna).
314 Returns : a scalar string
315 Args : none
317 =cut
319 #----------------
320 sub algorithm {
321 #----------------
322 my ($self,@args) = @_;
323 return $self->{'_blast_program'};
326 =head2 name
328 Usage : $hit->name([string]);
329 Purpose : Set/Get a string to identify the hit.
330 Example : $name = $hit->name;
331 : $hit->name('M81707');
332 Returns : String consisting of the hit's name or undef if not set.
333 Comments : The name is parsed out of the "Query=" line as the first chunk of
334 non-whitespace text. If you want the rest of the line, use
335 $hit->description().
337 See Also: L<accession()|accession>
339 =cut
343 #----------------
344 sub name {
345 #----------------
346 my $self = shift;
347 if (@_) {
348 my $name = shift;
349 $name =~ s/^\s+|(\s+|,)$//g;
350 $self->{'_name'} = $name;
352 return $self->{'_name'};
355 =head2 description
357 Usage : $hit_object->description( [integer] );
358 Purpose : Set/Get a description string for the hit.
359 This is parsed out of the "Query=" line as everything after
360 the first chunk of non-whitespace text. Use $hit->name()
361 to get the first chunk (the ID of the sequence).
362 Example : $description = $hit->description;
363 : $desc_60char = $hit->description(60);
364 Argument : Integer (optional) indicating the desired length of the
365 : description string to be returned.
366 Returns : String consisting of the hit's description or undef if not set.
368 =cut
372 #----------------
373 sub description {
374 #----------------
375 my( $self, $len ) = @_;
376 $len = (defined $len) ? $len : (CORE::length $self->{'_description'});
377 return substr( $self->{'_description'}, 0 ,$len );
380 =head2 accession
382 Title : accession
383 Usage : $acc = $hit->accession();
384 Function: Retrieve the accession (if available) for the hit
385 Returns : a scalar string (empty string if not set)
386 Args : none
387 Comments: Accession numbers are extracted based on the assumption that they
388 are delimited by | characters (NCBI-style). If this is not the case,
389 use the name() method and parse it as necessary.
391 See Also: L<name()|name>
393 =cut
395 #--------------------
396 sub accession {
397 #--------------------
398 my $self = shift;
399 if(@_) { $self->{'_accession'} = shift; }
400 $self->{'_accession'} || '';
403 =head2 raw_score
405 Usage : $hit_object->raw_score();
406 Purpose : Gets the BLAST score of the best HSP for the current Blast hit.
407 Example : $score = $hit_object->raw_score();
408 Returns : Integer
409 Argument : n/a
410 Throws : n/a
412 See Also : L<bits()|bits>
414 =cut
416 #----------
417 sub raw_score {
418 #----------
419 my $self = shift;
421 # The check for $self->{'_score'} is a remnant from the 'query' mode days
422 # in which the sbjct object would collect data from the description line only.
424 my ($score);
425 if(not defined($self->{'_score'})) {
426 $score = $self->hsp->score;
427 } else {
428 $score = $self->{'_score'};
430 return $score;
434 =head2 length
436 Usage : $hit_object->length();
437 Purpose : Get the total length of the hit sequence.
438 Example : $len = $hit_object->length();
439 Returns : Integer
440 Argument : n/a
441 Throws : n/a
442 Comments : Developer note: when using the built-in length function within
443 : this module, call it as CORE::length().
445 See Also : L<logical_length()|logical_length>, L<length_aln()|length_aln>
447 =cut
449 #-----------
450 sub length {
451 #-----------
452 my $self = shift;
453 return $self->{'_length'};
456 =head2 significance
458 Equivalent to L<signif()|signif>
460 =cut
462 #----------------
463 sub significance { shift->signif( @_ ); }
464 #----------------
467 =head2 next_hsp
469 Title : next_hsp
470 Usage : $hsp = $obj->next_hsp();
471 Function : returns the next available High Scoring Pair object
472 Example :
473 Returns : Bio::Search::HSP::BlastHSP or undef if finished
474 Args : none
476 =cut
478 #----------------
479 sub next_hsp {
480 #----------------
481 my $self = shift;
483 unless($self->{'_hsp_queue_started'}) {
484 $self->{'_hsp_queue'} = [$self->hsps()];
485 $self->{'_hsp_queue_started'} = 1;
487 pop @{$self->{'_hsp_queue'}};
490 #=================================================
491 # End Bio::Search::Hit::HitI implementation
492 #=================================================
495 # Providing a more explicit method for getting name of hit
496 # (corresponds with column name in HitTableWriter)
497 #----------------
498 sub hit_name {
499 #----------------
500 my $self = shift;
501 $self->name( @_ );
504 # Older method Delegates to description()
505 #----------------
506 sub desc {
507 #----------------
508 my $self = shift;
509 return $self->description( @_ );
512 # Providing a more explicit method for getting description of hit
513 # (corresponds with column name in HitTableWriter)
514 #----------------
515 sub hit_description {
516 #----------------
517 my $self = shift;
518 return $self->description( @_ );
521 =head2 score
523 Equivalent to L<raw_score()|raw_score>
525 =cut
527 #----------------
528 sub score { shift->raw_score( @_ ); }
529 #----------------
532 =head2 hit_length
534 Equivalent to L<length()|length>
536 =cut
538 # Providing a more explicit method for getting length of hit
539 #----------------
540 sub hit_length { shift->length( @_ ); }
541 #----------------
544 =head2 signif
546 Usage : $hit_object->signif( [format] );
547 Purpose : Get the P or Expect value for the best HSP of the given BLAST hit.
548 : The value returned is the one which is reported in the description
549 : section of the Blast report. For Blast1 and WU-Blast2, this
550 : is a P-value, for Blast2, it is an Expect value.
551 Example : $obj->signif() # returns 1.3e-34
552 : $obj->signif('exp') # returns -34
553 : $obj->signif('parts') # returns (1.3, -34)
554 Returns : Float or scientific notation number (the raw P/Expect value, DEFAULT).
555 : Integer if format == 'exp' (the magnitude of the base 10 exponent).
556 : 2-element list (float, int) if format == 'parts' and P/Expect value
557 : is in scientific notation (see Comments).
558 Argument : format: string of 'raw' | 'exp' | 'parts'
559 : 'raw' returns value given in report. Default. (1.2e-34)
560 : 'exp' returns exponent value only (34)
561 : 'parts' returns the decimal and exponent as a
562 : 2-element list (1.2, -34) (see Comments).
563 Throws : n/a
564 Comments : The signif() method provides a way to deal with the fact that
565 : Blast1 and Blast2 formats (and WU- vs. NCBI-BLAST) differ in
566 : what is reported in the description lines of each hit in the
567 : Blast report. The signif() method frees any client code from
568 : having to know if this is a P-value or an Expect value,
569 : making it easier to write code that can process both
570 : Blast1 and Blast2 reports. This is not necessarily a good thing,
571 : since one should always know when one is working with P-values or
572 : Expect values (hence the deprecated status).
573 : Use of expect() is recommended since all hits will have an Expect value.
575 : Using the 'parts' argument is not recommended since it will not
576 : work as expected if the expect value is not in scientific notation.
577 : That is, floats are not converted into sci notation before
578 : splitting into parts.
580 See Also : L<p()|p>, L<expect()|expect>, L<Bio::Search::BlastUtils::get_exponent()|Bio::Search::BlastUtils>
582 =cut
584 #-------------
585 sub signif {
586 #-------------
587 # Some duplication of logic for p(), expect() and signif() for the sake of performance.
588 my ($self, $fmt) = @_;
590 my $val = defined($self->{'_p'}) ? $self->{'_p'} : $self->{'_expect'};
592 # $val can be zero.
593 defined($val) or $self->throw("Can't get P- or Expect value: HSPs may not have been set.");
595 return $val if not $fmt or $fmt =~ /^raw/i;
596 ## Special formats: exponent-only or as list.
597 return &Bio::Search::BlastUtils::get_exponent($val) if $fmt =~ /^exp/i;
598 return (split (/eE/, $val)) if $fmt =~ /^parts/i;
600 ## Default: return the raw P/Expect-value.
601 return $val;
604 #----------------
605 sub raw_hit_data {
606 #----------------
607 my $self = shift;
608 my $data = '>';
609 # Need to add blank lines where we've removed them.
610 foreach( @{$self->{'_hit_data'}} ) {
611 if( $_ eq 'end') {
612 $data .= "\n";
614 else {
615 $data .= /^\s*(Score|Query)/ ? "\n$_" : $_;
618 return $data;
622 #=head2 _set_length
624 # Usage : $hit_object->_set_length( "233" );
625 # Purpose : Set the total length of the hit sequence.
626 # Example : $hit_object->_set_length( $len );
627 # Returns : n/a
628 # Argument : Integer (only when setting). Any commas will be stripped out.
629 # Throws : n/a
631 #=cut
633 #-----------
634 sub _set_length {
635 #-----------
636 my ($self, $len) = @_;
637 $len =~ s/,//g; # get rid of commas
638 $self->{'_length'} = $len;
641 #=head2 _set_description
643 # Usage : Private method; called automatically during construction
644 # Purpose : Sets the description of the hit sequence.
645 # : For sequence without descriptions, does not set any description.
646 # Argument : Array containing description (multiple lines).
647 # Comments : Processes the supplied description:
648 # 1. Join all lines into one string.
649 # 2. Remove sequence id at the beginning of description.
650 # 3. Removes junk charactes at begin and end of description.
652 #=cut
654 #--------------
655 sub _set_description {
656 #--------------
657 my( $self, @desc ) = @_;
658 my( $desc);
660 # print STDERR "PsiBlastHit: RAW DESC:\n@desc\n";
662 $desc = join(" ", @desc);
664 my $name = $self->name;
666 if($desc) {
667 $desc =~ s/^\s*\S+\s+//; # remove the sequence ID(s)
668 # This won't work if there's no description.
669 $desc =~ s/^\s*$name//; # ...but this should.
670 $desc =~ s/^[\s!]+//;
671 $desc =~ s/ \d+$//;
672 $desc =~ s/\.+$//;
673 $self->{'_description'} = $desc;
676 # print STDERR "PsiBlastHit: _set_description = $desc\n";
679 =head2 to_string
681 Title : to_string
682 Usage : print $hit->to_string;
683 Function: Returns a string representation for the Blast Hit.
684 Primarily intended for debugging purposes.
685 Example : see usage
686 Returns : A string of the form:
687 [PsiBlastHit] <name> <description>
688 e.g.:
689 [PsiBlastHit] emb|Z46660|SC9725 S.cerevisiae chromosome XIII cosmid
690 Args : None
692 =cut
694 #----------------
695 sub to_string {
696 #----------------
697 my $self = shift;
698 return "[PsiBlastHit] " . $self->name . " " . $self->description;
702 #=head2 _set_id
704 # Usage : Private method; automatically called by new()
705 # Purpose : Sets the name of the PsiBlastHit sequence from the BLAST summary line.
706 # : The identifier is assumed to be the first
707 # : chunk of non-whitespace characters in the description line
708 # : Does not assume any semantics in the structure of the identifier
709 # : (Formerly, this method attempted to extract database name from
710 # : the seq identifiers, but this was prone to break).
711 # Returns : n/a
712 # Argument : String containing description line of the hit from Blast report
713 # : or first line of an alignment section (with or without the leading '>').
714 # Throws : Warning if cannot locate sequence ID.
716 #See Also : L<new()|new>, L<accession()|accession>
718 #=cut
720 #---------------
721 sub _set_id {
722 #---------------
723 my( $self, $desc ) = @_;
725 # New strategy: Assume only that the ID is the first white space
726 # delimited chunk. Not attempting to extract accession & database name.
727 # Clients will have to interpret it as necessary.
728 if($desc =~ /^>?(\S+)\s*(.*)/) {
729 my ($name, $desc) = ($1, $2);
730 $self->name($name);
731 $self->{'_description'} = $desc;
732 # Note that this description comes from the summary section of the
733 # BLAST report and so may be truncated. The full description will be
734 # set from the alignment section. We're setting description here in case
735 # the alignment section isn't being parsed.
737 # Assuming accession is delimited with | symbols (NCBI-style)
738 my @pieces = split(/\|/,$name);
739 my $acc = pop @pieces;
740 $self->accession( $acc );
742 else {
743 $self->warn("Can't locate sequence identifier in summary line.", "Line = $desc");
744 $desc = 'Unknown sequence ID' if not $desc;
745 $self->name($desc);
750 =head2 ambiguous_aln
752 Usage : $ambig_code = $hit_object->ambiguous_aln();
753 Purpose : Sets/Gets ambiguity code data member.
754 Example : (see usage)
755 Returns : String = 'q', 's', 'qs', '-'
756 : 'q' = query sequence contains overlapping sub-sequences
757 : while sbjct does not.
758 : 's' = sbjct sequence contains overlapping sub-sequences
759 : while query does not.
760 : 'qs' = query and sbjct sequence contains overlapping sub-sequences
761 : relative to each other.
762 : '-' = query and sbjct sequence do not contains multiple domains
763 : relative to each other OR both contain the same distribution
764 : of similar domains.
765 Argument : n/a
766 Throws : n/a
767 Status : Experimental
769 See Also : L<Bio::Search::BlastUtils::tile_hsps>, L<HSP Tiling and Ambiguous Alignments>
771 =cut
773 #--------------------
774 sub ambiguous_aln {
775 #--------------------
776 my $self = shift;
777 if(@_) { $self->{'_ambiguous_aln'} = shift; }
778 $self->{'_ambiguous_aln'} || '-';
783 =head2 overlap
785 Usage : $blast_object->overlap( [integer] );
786 Purpose : Gets/Sets the allowable amount overlap between different HSP sequences.
787 Example : $blast_object->overlap(5);
788 : $overlap = $blast_object->overlap;
789 Returns : Integer.
790 Argument : integer.
791 Throws : n/a
792 Status : Experimental
793 Comments : Any two HSPs whose sequences overlap by less than or equal
794 : to the overlap() number of resides will be considered separate HSPs
795 : and will not get tiled by Bio::Search::BlastUtils::_adjust_contigs().
797 See Also : L<Bio::Search::BlastUtils::_adjust_contigs()|Bio::Search::BlastUtils>, L<BUGS | BUGS>
799 =cut
801 #-------------
802 sub overlap {
803 #-------------
804 my $self = shift;
805 if(@_) { $self->{'_overlap'} = shift; }
806 defined $self->{'_overlap'} ? $self->{'_overlap'} : 0;
814 =head2 bits
816 Usage : $hit_object->bits();
817 Purpose : Gets the BLAST bit score of the best HSP for the current Blast hit.
818 Example : $bits = $hit_object->bits();
819 Returns : Integer
820 Argument : n/a
821 Throws : Exception if bit score is not set.
822 Comments : For BLAST1, the non-bit score is listed in the summary line.
824 See Also : L<score()|score>
826 =cut
828 #---------
829 sub bits {
830 #---------
831 my $self = shift;
833 # The check for $self->{'_bits'} is a remnant from the 'query' mode days
834 # in which the sbjct object would collect data from the description line only.
836 my ($bits);
837 if(not defined($self->{'_bits'})) {
838 $bits = $self->hsp->bits;
839 } else {
840 $bits = $self->{'_bits'};
842 return $bits;
847 =head2 n
849 Usage : $hit_object->n();
850 Purpose : Gets the N number for the current Blast hit.
851 : This is the number of HSPs in the set which was ascribed
852 : the lowest P-value (listed on the description line).
853 : This number is not the same as the total number of HSPs.
854 : To get the total number of HSPs, use num_hsps().
855 Example : $n = $hit_object->n();
856 Returns : Integer
857 Argument : n/a
858 Throws : Exception if HSPs have not been set (BLAST2 reports).
859 Comments : Note that the N parameter is not reported in gapped BLAST2.
860 : Calling n() on such reports will result in a call to num_hsps().
861 : The num_hsps() method will count the actual number of
862 : HSPs in the alignment listing, which may exceed N in
863 : some cases.
865 See Also : L<num_hsps()|num_hsps>
867 =cut
869 #-----
870 sub n {
871 #-----
872 my $self = shift;
874 # The check for $self->{'_n'} is a remnant from the 'query' mode days
875 # in which the sbjct object would collect data from the description line only.
877 my ($n);
878 if(not defined($self->{'_n'})) {
879 $n = $self->hsp->n;
880 } else {
881 $n = $self->{'_n'};
883 $n ||= $self->num_hsps;
885 return $n;
890 =head2 frame
892 Usage : $hit_object->frame();
893 Purpose : Gets the reading frame for the best HSP after HSP tiling.
894 : This is only valid for BLASTX and TBLASTN/X reports.
895 Example : $frame = $hit_object->frame();
896 Returns : Integer (-2 .. +2)
897 Argument : n/a
898 Throws : Exception if HSPs have not been set (BLAST2 reports).
899 Comments : This method requires that all HSPs be tiled. If they have not
900 : already been tiled, they will be tiled first automatically..
901 : If you don't want the tiled data, iterate through each HSP
902 : calling frame() on each (use hsps() to get all HSPs).
904 See Also : L<hsps()|hsps>
906 =cut
908 #----------'
909 sub frame {
910 #----------
911 my $self = shift;
913 Bio::Search::BlastUtils::tile_hsps($self) if not $self->{'_tile_hsps'};
915 # The check for $self->{'_frame'} is a remnant from the 'query' mode days
916 # in which the sbjct object would collect data from the description line only.
918 my ($frame);
919 if(not defined($self->{'_frame'})) {
920 $frame = $self->hsp->frame('hit');
921 } else {
922 $frame = $self->{'_frame'};
924 return $frame;
931 =head2 p
933 Usage : $hit_object->p( [format] );
934 Purpose : Get the P-value for the best HSP of the given BLAST hit.
935 : (Note that P-values are not provided with NCBI Blast2 reports).
936 Example : $p = $sbjct->p;
937 : $p = $sbjct->p('exp'); # get exponent only.
938 : ($num, $exp) = $sbjct->p('parts'); # split sci notation into parts
939 Returns : Float or scientific notation number (the raw P-value, DEFAULT).
940 : Integer if format == 'exp' (the magnitude of the base 10 exponent).
941 : 2-element list (float, int) if format == 'parts' and P-value
942 : is in scientific notation (See Comments).
943 Argument : format: string of 'raw' | 'exp' | 'parts'
944 : 'raw' returns value given in report. Default. (1.2e-34)
945 : 'exp' returns exponent value only (34)
946 : 'parts' returns the decimal and exponent as a
947 : 2-element list (1.2, -34) (See Comments).
948 Throws : Warns if no P-value is defined. Uses expect instead.
949 Comments : Using the 'parts' argument is not recommended since it will not
950 : work as expected if the P-value is not in scientific notation.
951 : That is, floats are not converted into sci notation before
952 : splitting into parts.
954 See Also : L<expect()|expect>, L<signif()|signif>, L<Bio::Search::BlastUtils::get_exponent()|Bio::Search::BlastUtils>
956 =cut
958 #--------
959 sub p {
960 #--------
961 # Some duplication of logic for p(), expect() and signif() for the sake of performance.
962 my ($self, $fmt) = @_;
964 my $val = $self->{'_p'};
966 # $val can be zero.
967 if(not defined $val) {
968 # P-value not defined, must be a NCBI Blast2 report.
969 # Use expect instead.
970 $self->warn( "P-value not defined. Using expect() instead.");
971 $val = $self->{'_expect'};
974 return $val if not $fmt or $fmt =~ /^raw/i;
975 ## Special formats: exponent-only or as list.
976 return &Bio::Search::BlastUtils::get_exponent($val) if $fmt =~ /^exp/i;
977 return (split (/eE/, $val)) if $fmt =~ /^parts/i;
979 ## Default: return the raw P-value.
980 return $val;
985 =head2 expect
987 Usage : $hit_object->expect( [format] );
988 Purpose : Get the Expect value for the best HSP of the given BLAST hit.
989 Example : $e = $sbjct->expect;
990 : $e = $sbjct->expect('exp'); # get exponent only.
991 : ($num, $exp) = $sbjct->expect('parts'); # split sci notation into parts
992 Returns : Float or scientific notation number (the raw expect value, DEFAULT).
993 : Integer if format == 'exp' (the magnitude of the base 10 exponent).
994 : 2-element list (float, int) if format == 'parts' and Expect
995 : is in scientific notation (see Comments).
996 Argument : format: string of 'raw' | 'exp' | 'parts'
997 : 'raw' returns value given in report. Default. (1.2e-34)
998 : 'exp' returns exponent value only (34)
999 : 'parts' returns the decimal and exponent as a
1000 : 2-element list (1.2, -34) (see Comments).
1001 Throws : Exception if the Expect value is not defined.
1002 Comments : Using the 'parts' argument is not recommended since it will not
1003 : work as expected if the expect value is not in scientific notation.
1004 : That is, floats are not converted into sci notation before
1005 : splitting into parts.
1007 See Also : L<p()|p>, L<signif()|signif>, L<Bio::Search::BlastUtils::get_exponent()|Bio::Search::BlastUtils>
1009 =cut
1011 #-----------
1012 sub expect {
1013 #-----------
1014 # Some duplication of logic for p(), expect() and signif() for the sake of performance.
1015 my ($self, $fmt) = @_;
1017 my $val;
1019 # For Blast reports that list the P value on the description line,
1020 # getting the expect value requires fully parsing the HSP data.
1021 # For NCBI blast, there's no problem.
1022 if(not defined($self->{'_expect'})) {
1023 if( defined $self->{'_hsps'}) {
1024 $self->{'_expect'} = $val = $self->hsp->expect;
1025 } else {
1026 # If _expect is not set and _hsps are not set,
1027 # then this must be a P-value-based report that was
1028 # run without setting the HSPs (shallow parsing).
1029 $self->throw("Can't get expect value. HSPs have not been set.");
1031 } else {
1032 $val = $self->{'_expect'};
1035 # $val can be zero.
1036 defined($val) or $self->throw("Can't get Expect value.");
1038 return $val if not $fmt or $fmt =~ /^raw/i;
1039 ## Special formats: exponent-only or as list.
1040 return &Bio::Search::BlastUtils::get_exponent($val) if $fmt =~ /^exp/i;
1041 return (split (/eE/, $val)) if $fmt =~ /^parts/i;
1043 ## Default: return the raw Expect-value.
1044 return $val;
1048 =head2 hsps
1050 Usage : $hit_object->hsps();
1051 Purpose : Get a list containing all HSP objects.
1052 : Get the numbers of HSPs for the current hit.
1053 Example : @hsps = $hit_object->hsps();
1054 : $num = $hit_object->hsps(); # alternatively, use num_hsps()
1055 Returns : Array context : list of Bio::Search::HSP::BlastHSP.pm objects.
1056 : Scalar context: integer (number of HSPs).
1057 : (Equivalent to num_hsps()).
1058 Argument : n/a. Relies on wantarray
1059 Throws : Exception if the HSPs have not been collected.
1061 See Also : L<hsp()|hsp>, L<num_hsps()|num_hsps>
1063 =cut
1065 #---------
1066 sub hsps {
1067 #---------
1068 my $self = shift;
1070 if (not ref $self->{'_hsps'}) {
1071 $self->throw("Can't get HSPs: data not collected.");
1074 return wantarray
1075 # returning list containing all HSPs.
1076 ? @{$self->{'_hsps'}}
1077 # returning number of HSPs.
1078 : scalar(@{$self->{'_hsps'}});
1083 =head2 hsp
1085 Usage : $hit_object->hsp( [string] );
1086 Purpose : Get a single BlastHSP.pm object for the present PsiBlastHit.pm object.
1087 Example : $hspObj = $hit_object->hsp; # same as 'best'
1088 : $hspObj = $hit_object->hsp('best');
1089 : $hspObj = $hit_object->hsp('worst');
1090 Returns : Object reference for a Bio::Search::HSP::BlastHSP.pm object.
1091 Argument : String (or no argument).
1092 : No argument (default) = highest scoring HSP (same as 'best').
1093 : 'best' or 'first' = highest scoring HSP.
1094 : 'worst' or 'last' = lowest scoring HSP.
1095 Throws : Exception if the HSPs have not been collected.
1096 : Exception if an unrecognized argument is used.
1098 See Also : L<hsps()|hsps>, L<num_hsps>()
1100 =cut
1102 #----------
1103 sub hsp {
1104 #----------
1105 my( $self, $option ) = @_;
1106 $option ||= 'best';
1108 if (not ref $self->{'_hsps'}) {
1109 $self->throw("Can't get HSPs: data not collected.");
1112 my @hsps = @{$self->{'_hsps'}};
1114 return $hsps[0] if $option =~ /best|first|1/i;
1115 return $hsps[$#hsps] if $option =~ /worst|last/i;
1117 $self->throw("Can't get HSP for: $option\n" .
1118 "Valid arguments: 'best', 'worst'");
1123 =head2 num_hsps
1125 Usage : $hit_object->num_hsps();
1126 Purpose : Get the number of HSPs for the present Blast hit.
1127 Example : $nhsps = $hit_object->num_hsps();
1128 Returns : Integer
1129 Argument : n/a
1130 Throws : Exception if the HSPs have not been collected.
1132 See Also : L<hsps()|hsps>
1134 =cut
1136 #-------------
1137 sub num_hsps {
1138 #-------------
1139 my $self = shift;
1141 if (not defined $self->{'_hsps'}) {
1142 $self->throw("Can't get HSPs: data not collected.");
1145 return scalar(@{$self->{'_hsps'}});
1150 =head2 logical_length
1152 Usage : $hit_object->logical_length( [seq_type] );
1153 : (mostly intended for internal use).
1154 Purpose : Get the logical length of the hit sequence.
1155 : For query sequence of BLASTX and TBLASTX reports and the hit
1156 : sequence of TBLASTN and TBLASTX reports, the returned length
1157 : is the length of the would-be amino acid sequence (length/3).
1158 : For all other BLAST flavors, this function is the same as length().
1159 Example : $len = $hit_object->logical_length();
1160 Returns : Integer
1161 Argument : seq_type = 'query' or 'hit' or 'sbjct' (default = 'query')
1162 ('sbjct' is synonymous with 'hit')
1163 Throws : n/a
1164 Comments : This is important for functions like frac_aligned_query()
1165 : which need to operate in amino acid coordinate space when dealing
1166 : with T?BLASTX type reports.
1168 See Also : L<length()|length>, L<frac_aligned_query()|frac_aligned_query>, L<frac_aligned_hit()|frac_aligned_hit>
1170 =cut
1172 #--------------------
1173 sub logical_length {
1174 #--------------------
1175 my $self = shift;
1176 my $seqType = shift || 'query';
1177 $seqType = 'sbjct' if $seqType eq 'hit';
1179 my $length;
1181 # For the sbjct, return logical sbjct length
1182 if( $seqType eq 'sbjct' ) {
1183 $length = $self->{'_logical_length'} || $self->{'_length'};
1185 else {
1186 # Otherwise, return logical query length
1187 $length = $self->{'_query_length'};
1189 # Adjust length based on BLAST flavor.
1190 if($self->{'_blast_program'} =~ /T?BLASTX/ ) {
1191 $length /= 3;
1194 return $length;
1198 =head2 length_aln
1200 Usage : $hit_object->length_aln( [seq_type] );
1201 Purpose : Get the total length of the aligned region for query or sbjct seq.
1202 : This number will include all HSPs
1203 Example : $len = $hit_object->length_aln(); # default = query
1204 : $lenAln = $hit_object->length_aln('query');
1205 Returns : Integer
1206 Argument : seq_Type = 'query' or 'hit' or 'sbjct' (Default = 'query')
1207 ('sbjct' is synonymous with 'hit')
1208 Throws : Exception if the argument is not recognized.
1209 Comments : This method will report the logical length of the alignment,
1210 : meaning that for TBLAST[NX] reports, the length is reported
1211 : using amino acid coordinate space (i.e., nucleotides / 3).
1213 : This method requires that all HSPs be tiled. If they have not
1214 : already been tiled, they will be tiled first automatically..
1215 : If you don't want the tiled data, iterate through each HSP
1216 : calling length() on each (use hsps() to get all HSPs).
1218 See Also : L<length()|length>, L<frac_aligned_query()|frac_aligned_query>, L<frac_aligned_hit()|frac_aligned_hit>, L<gaps()|gaps>, L<Bio::Search::BlastUtils::tile_hsps()|Bio::Search::BlastUtils>, L<Bio::Search::HSP::BlastHSP::length()|Bio::Search::HSP::BlastHSP>
1220 =cut
1222 #---------------'
1223 sub length_aln {
1224 #---------------
1225 my( $self, $seqType ) = @_;
1227 $seqType ||= 'query';
1228 $seqType = 'sbjct' if $seqType eq 'hit';
1230 Bio::Search::BlastUtils::tile_hsps($self) if not $self->{'_tile_hsps'};
1232 my $data = $self->{'_length_aln_'.$seqType};
1234 ## If we don't have data, figure out what went wrong.
1235 if(!$data) {
1236 $self->throw("Can't get length aln for sequence type \"$seqType\"" .
1237 "Valid types are 'query', 'hit', 'sbjct' ('sbjct' = 'hit')");
1239 $data;
1243 =head2 gaps
1245 Usage : $hit_object->gaps( [seq_type] );
1246 Purpose : Get the number of gaps in the aligned query, sbjct, or both sequences.
1247 : Data is summed across all HSPs.
1248 Example : $qgaps = $hit_object->gaps('query');
1249 : $hgaps = $hit_object->gaps('hit');
1250 : $tgaps = $hit_object->gaps(); # default = total (query + hit)
1251 Returns : scalar context: integer
1252 : array context without args: two-element list of integers
1253 : (queryGaps, sbjctGaps)
1254 : Array context can be forced by providing an argument of 'list' or 'array'.
1256 : CAUTION: Calling this method within printf or sprintf is arrray context.
1257 : So this function may not give you what you expect. For example:
1258 : printf "Total gaps: %d", $hit->gaps();
1259 : Actually returns a two-element array, so what gets printed
1260 : is the number of gaps in the query, not the total
1262 Argument : seq_type: 'query' | 'hit' or 'sbjct' | 'total' | 'list' (default = 'total')
1263 ('sbjct' is synonymous with 'hit')
1264 Throws : n/a
1265 Comments : If you need data for each HSP, use hsps() and then interate
1266 : through each HSP object.
1267 : This method requires that all HSPs be tiled. If they have not
1268 : already been tiled, they will be tiled first automatically..
1269 : Not relying on wantarray since that will fail in situations
1270 : such as printf "%d", $hit->gaps() in which you might expect to
1271 : be printing the total gaps, but evaluates to array context.
1273 See Also : L<length_aln()|length_aln>
1275 =cut
1277 #----------
1278 sub gaps {
1279 #----------
1280 my( $self, $seqType ) = @_;
1282 $seqType ||= (wantarray ? 'list' : 'total');
1283 $seqType = 'sbjct' if $seqType eq 'hit';
1285 Bio::Search::BlastUtils::tile_hsps($self) if not $self->{'_tile_hsps'};
1287 $seqType = lc($seqType);
1289 if($seqType =~ /list|array/i) {
1290 return ($self->{'_gaps_query'}, $self->{'_gaps_sbjct'});
1293 if($seqType eq 'total') {
1294 return ($self->{'_gaps_query'} + $self->{'_gaps_sbjct'}) || 0;
1295 } else {
1296 return $self->{'_gaps_'.$seqType} || 0;
1302 =head2 matches
1304 Usage : $hit_object->matches( [class] );
1305 Purpose : Get the total number of identical or conserved matches
1306 : (or both) across all HSPs.
1307 : (Note: 'conservative' matches are indicated as 'positives'
1308 : in the Blast report.)
1309 Example : ($id,$cons) = $hit_object->matches(); # no argument
1310 : $id = $hit_object->matches('id');
1311 : $cons = $hit_object->matches('cons');
1312 Returns : Integer or a 2-element array of integers
1313 Argument : class = 'id' | 'cons' OR none.
1314 : If no argument is provided, both identical and conservative
1315 : numbers are returned in a two element list.
1316 : (Other terms can be used to refer to the conservative
1317 : matches, e.g., 'positive'. All that is checked is whether or
1318 : not the supplied string starts with 'id'. If not, the
1319 : conservative matches are returned.)
1320 Throws : Exception if the requested data cannot be obtained.
1321 Comments : If you need data for each HSP, use hsps() and then interate
1322 : through the HSP objects.
1323 : Does not rely on wantarray to return a list. Only checks for
1324 : the presence of an argument (no arg = return list).
1326 See Also : L<Bio::Search::HSP::BlastHSP::matches()|Bio::Search::HSP::BlastHSP>, L<hsps()|hsps>
1328 =cut
1330 #---------------
1331 sub matches {
1332 #---------------
1333 my( $self, $arg) = @_;
1334 my(@data,$data);
1336 if(!$arg) {
1337 @data = ($self->{'_totalIdentical'}, $self->{'_totalConserved'});
1339 return @data if @data;
1341 } else {
1343 if($arg =~ /^id/i) {
1344 $data = $self->{'_totalIdentical'};
1345 } else {
1346 $data = $self->{'_totalConserved'};
1348 return $data if $data;
1351 ## Something went wrong if we make it to here.
1352 $self->throw("Can't get identical or conserved data: no data.");
1356 =head2 start
1358 Usage : $sbjct->start( [seq_type] );
1359 Purpose : Gets the start coordinate for the query, sbjct, or both sequences
1360 : in the PsiBlastHit object. If there is more than one HSP, the lowest start
1361 : value of all HSPs is returned.
1362 Example : $qbeg = $sbjct->start('query');
1363 : $sbeg = $sbjct->start('hit');
1364 : ($qbeg, $sbeg) = $sbjct->start();
1365 Returns : scalar context: integer
1366 : array context without args: list of two integers (queryStart, sbjctStart)
1367 : Array context can be "induced" by providing an argument of 'list' or 'array'.
1368 Argument : In scalar context: seq_type = 'query' or 'hit' or 'sbjct' (default = 'query')
1369 ('sbjct' is synonymous with 'hit')
1370 Throws : n/a
1371 Comments : This method requires that all HSPs be tiled. If there is more than one
1372 : HSP and they have not already been tiled, they will be tiled first automatically..
1373 : Remember that the start and end coordinates of all HSPs are
1374 : normalized so that start < end. Strand information can be
1375 : obtained by calling $hit->strand().
1377 See Also : L<end()|end>, L<range()|range>, L<strand()|strand>, L<HSP Tiling and Ambiguous Alignments>, L<Bio::Search::HSP::BlastHSP::start|Bio::Search::HSP::BlastHSP>
1379 =cut
1381 #----------
1382 sub start {
1383 #----------
1384 my ($self, $seqType) = @_;
1386 $seqType ||= (wantarray ? 'list' : 'query');
1387 $seqType = 'sbjct' if $seqType eq 'hit';
1389 # If there is only one HSP, defer this call to the solitary HSP.
1390 if($self->num_hsps == 1) {
1391 return $self->hsp->start($seqType);
1392 } else {
1393 Bio::Search::BlastUtils::tile_hsps($self) if not $self->{'_tile_hsps'};
1394 if($seqType =~ /list|array/i) {
1395 return ($self->{'_queryStart'}, $self->{'_sbjctStart'});
1396 } else {
1397 ## Sensitive to member name changes.
1398 $seqType = "_\L$seqType\E";
1399 return $self->{$seqType.'Start'};
1405 =head2 end
1407 Usage : $sbjct->end( [seq_type] );
1408 Purpose : Gets the end coordinate for the query, sbjct, or both sequences
1409 : in the PsiBlastHit object. If there is more than one HSP, the largest end
1410 : value of all HSPs is returned.
1411 Example : $qend = $sbjct->end('query');
1412 : $send = $sbjct->end('hit');
1413 : ($qend, $send) = $sbjct->end();
1414 Returns : scalar context: integer
1415 : array context without args: list of two integers (queryEnd, sbjctEnd)
1416 : Array context can be "induced" by providing an argument of 'list' or 'array'.
1417 Argument : In scalar context: seq_type = 'query' or 'sbjct'
1418 : (case insensitive). If not supplied, 'query' is used.
1419 Throws : n/a
1420 Comments : This method requires that all HSPs be tiled. If there is more than one
1421 : HSP and they have not already been tiled, they will be tiled first automatically..
1422 : Remember that the start and end coordinates of all HSPs are
1423 : normalized so that start < end. Strand information can be
1424 : obtained by calling $hit->strand().
1426 See Also : L<start()|start>, L<range()|range>, L<strand()|strand>, L<HSP Tiling and Ambiguous Alignments>, L<Bio::Search::HSP::BlastHSP::end|Bio::Search::HSP::BlastHSP>
1428 =cut
1430 #----------
1431 sub end {
1432 #----------
1433 my ($self, $seqType) = @_;
1435 $seqType ||= (wantarray ? 'list' : 'query');
1436 $seqType = 'sbjct' if $seqType eq 'hit';
1438 # If there is only one HSP, defer this call to the solitary HSP.
1439 if($self->num_hsps == 1) {
1440 return $self->hsp->end($seqType);
1441 } else {
1442 Bio::Search::BlastUtils::tile_hsps($self) if not $self->{'_tile_hsps'};
1443 if($seqType =~ /list|array/i) {
1444 return ($self->{'_queryStop'}, $self->{'_sbjctStop'});
1445 } else {
1446 ## Sensitive to member name changes.
1447 $seqType = "_\L$seqType\E";
1448 return $self->{$seqType.'Stop'};
1453 =head2 range
1455 Usage : $sbjct->range( [seq_type] );
1456 Purpose : Gets the (start, end) coordinates for the query or sbjct sequence
1457 : in the HSP alignment.
1458 Example : ($qbeg, $qend) = $sbjct->range('query');
1459 : ($sbeg, $send) = $sbjct->range('hit');
1460 Returns : Two-element array of integers
1461 Argument : seq_type = string, 'query' or 'hit' or 'sbjct' (default = 'query')
1462 ('sbjct' is synonymous with 'hit')
1463 Throws : n/a
1465 See Also : L<start()|start>, L<end()|end>
1467 =cut
1469 #----------
1470 sub range {
1471 #----------
1472 my ($self, $seqType) = @_;
1473 $seqType ||= 'query';
1474 $seqType = 'sbjct' if $seqType eq 'hit';
1475 return ($self->start($seqType), $self->end($seqType));
1479 =head2 frac_identical
1481 Usage : $hit_object->frac_identical( [seq_type] );
1482 Purpose : Get the overall fraction of identical positions across all HSPs.
1483 : The number refers to only the aligned regions and does not
1484 : account for unaligned regions in between the HSPs, if any.
1485 Example : $frac_iden = $hit_object->frac_identical('query');
1486 Returns : Float (2-decimal precision, e.g., 0.75).
1487 Argument : seq_type: 'query' | 'hit' or 'sbjct' | 'total'
1488 : default = 'query' (but see comments below).
1489 : ('sbjct' is synonymous with 'hit')
1490 Throws : n/a
1491 Comments : Different versions of Blast report different values for the total
1492 : length of the alignment. This is the number reported in the
1493 : denominators in the stats section:
1494 : "Identical = 34/120 Positives = 67/120".
1495 : NCBI BLAST uses the total length of the alignment (with gaps)
1496 : WU-BLAST uses the length of the query sequence (without gaps).
1498 : Therefore, when called with an argument of 'total',
1499 : this method will report different values depending on the
1500 : version of BLAST used. Total does NOT take into account HSP
1501 : tiling, so it should not be used.
1503 : To get the fraction identical among only the aligned residues,
1504 : ignoring the gaps, call this method without an argument or
1505 : with an argument of 'query' or 'hit'.
1507 : If you need data for each HSP, use hsps() and then iterate
1508 : through the HSP objects.
1509 : This method requires that all HSPs be tiled. If they have not
1510 : already been tiled, they will be tiled first automatically.
1512 See Also : L<frac_conserved()|frac_conserved>, L<frac_aligned_query()|frac_aligned_query>, L<matches()|matches>, L<Bio::Search::BlastUtils::tile_hsps()|Bio::Search::BlastUtils>
1514 =cut
1516 #------------------
1517 sub frac_identical {
1518 #------------------
1519 my ($self, $seqType) = @_;
1520 $seqType ||= 'query';
1521 $seqType = 'sbjct' if $seqType eq 'hit';
1523 ## Sensitive to member name format.
1524 $seqType = lc($seqType);
1526 Bio::Search::BlastUtils::tile_hsps($self) if not $self->{'_tile_hsps'};
1528 sprintf( "%.2f", $self->{'_totalIdentical'}/$self->{'_length_aln_'.$seqType});
1533 =head2 frac_conserved
1535 Usage : $hit_object->frac_conserved( [seq_type] );
1536 Purpose : Get the overall fraction of conserved positions across all HSPs.
1537 : The number refers to only the aligned regions and does not
1538 : account for unaligned regions in between the HSPs, if any.
1539 Example : $frac_cons = $hit_object->frac_conserved('hit');
1540 Returns : Float (2-decimal precision, e.g., 0.75).
1541 Argument : seq_type: 'query' | 'hit' or 'sbjct' | 'total'
1542 : default = 'query' (but see comments below).
1543 : ('sbjct' is synonymous with 'hit')
1544 Throws : n/a
1545 Comments : Different versions of Blast report different values for the total
1546 : length of the alignment. This is the number reported in the
1547 : denominators in the stats section:
1548 : "Positives = 34/120 Positives = 67/120".
1549 : NCBI BLAST uses the total length of the alignment (with gaps)
1550 : WU-BLAST uses the length of the query sequence (without gaps).
1552 : Therefore, when called with an argument of 'total',
1553 : this method will report different values depending on the
1554 : version of BLAST used. Total does NOT take into account HSP
1555 : tiling, so it should not be used.
1557 : To get the fraction conserved among only the aligned residues,
1558 : ignoring the gaps, call this method without an argument or
1559 : with an argument of 'query' or 'hit'.
1561 : If you need data for each HSP, use hsps() and then interate
1562 : through the HSP objects.
1563 : This method requires that all HSPs be tiled. If they have not
1564 : already been tiled, they will be tiled first automatically.
1566 See Also : L<frac_identical()|frac_identical>, L<matches()|matches>, L<Bio::Search::BlastUtils::tile_hsps()|Bio::Search::BlastUtils>
1568 =cut
1570 #--------------------
1571 sub frac_conserved {
1572 #--------------------
1573 my ($self, $seqType) = @_;
1574 $seqType ||= 'query';
1575 $seqType = 'sbjct' if $seqType eq 'hit';
1577 ## Sensitive to member name format.
1578 $seqType = lc($seqType);
1580 Bio::Search::BlastUtils::tile_hsps($self) if not $self->{'_tile_hsps'};
1582 sprintf( "%.2f", $self->{'_totalConserved'}/$self->{'_length_aln_'.$seqType});
1588 =head2 frac_aligned_query
1590 Usage : $hit_object->frac_aligned_query();
1591 Purpose : Get the fraction of the query sequence which has been aligned
1592 : across all HSPs (not including intervals between non-overlapping
1593 : HSPs).
1594 Example : $frac_alnq = $hit_object->frac_aligned_query();
1595 Returns : Float (2-decimal precision, e.g., 0.75).
1596 Argument : n/a
1597 Throws : n/a
1598 Comments : If you need data for each HSP, use hsps() and then interate
1599 : through the HSP objects.
1600 : To compute the fraction aligned, the logical length of the query
1601 : sequence is used, meaning that for [T]BLASTX reports, the
1602 : full length of the query sequence is converted into amino acids
1603 : by dividing by 3. This is necessary because of the way
1604 : the lengths of aligned sequences are computed.
1605 : This method requires that all HSPs be tiled. If they have not
1606 : already been tiled, they will be tiled first automatically.
1608 See Also : L<frac_aligned_hit()|frac_aligned_hit>, L<logical_length()|logical_length>, L<length_aln()|length_aln>, L<Bio::Search::BlastUtils::tile_hsps()|Bio::Search::BlastUtils>
1610 =cut
1612 #----------------------
1613 sub frac_aligned_query {
1614 #----------------------
1615 my $self = shift;
1617 Bio::Search::BlastUtils::tile_hsps($self) if not $self->{'_tile_hsps'};
1619 sprintf( "%.2f", $self->{'_length_aln_query'}/$self->logical_length('query'));
1624 =head2 frac_aligned_hit
1626 Usage : $hit_object->frac_aligned_hit();
1627 Purpose : Get the fraction of the hit (sbjct) sequence which has been aligned
1628 : across all HSPs (not including intervals between non-overlapping
1629 : HSPs).
1630 Example : $frac_alnq = $hit_object->frac_aligned_hit();
1631 Returns : Float (2-decimal precision, e.g., 0.75).
1632 Argument : n/a
1633 Throws : n/a
1634 Comments : If you need data for each HSP, use hsps() and then interate
1635 : through the HSP objects.
1636 : To compute the fraction aligned, the logical length of the sbjct
1637 : sequence is used, meaning that for TBLAST[NX] reports, the
1638 : full length of the sbjct sequence is converted into amino acids
1639 : by dividing by 3. This is necessary because of the way
1640 : the lengths of aligned sequences are computed.
1641 : This method requires that all HSPs be tiled. If they have not
1642 : already been tiled, they will be tiled first automatically.
1644 See Also : L<frac_aligned_query()|frac_aligned_query>, L<matches()|matches>, , L<logical_length()|logical_length>, L<length_aln()|length_aln>, L<Bio::Search::BlastUtils::tile_hsps()|Bio::Search::BlastUtils>
1646 =cut
1648 #--------------------
1649 sub frac_aligned_hit {
1650 #--------------------
1651 my $self = shift;
1653 Bio::Search::BlastUtils::tile_hsps($self) if not $self->{'_tile_hsps'};
1655 sprintf( "%.2f", $self->{'_length_aln_sbjct'}/$self->logical_length('sbjct'));
1659 ## These methods are being maintained for backward compatibility.
1661 =head2 frac_aligned_sbjct
1663 Same as L<frac_aligned_hit()|frac_aligned_hit>
1665 =cut
1667 #----------------
1668 sub frac_aligned_sbjct { my $self=shift; $self->frac_aligned_hit(@_); }
1669 #----------------
1671 =head2 num_unaligned_sbjct
1673 Same as L<num_unaligned_hit()|num_unaligned_hit>
1675 =cut
1677 #----------------
1678 sub num_unaligned_sbjct { my $self=shift; $self->num_unaligned_hit(@_); }
1679 #----------------
1683 =head2 num_unaligned_hit
1685 Usage : $hit_object->num_unaligned_hit();
1686 Purpose : Get the number of the unaligned residues in the hit sequence.
1687 : Sums across all all HSPs.
1688 Example : $num_unaln = $hit_object->num_unaligned_hit();
1689 Returns : Integer
1690 Argument : n/a
1691 Throws : n/a
1692 Comments : See notes regarding logical lengths in the comments for frac_aligned_hit().
1693 : They apply here as well.
1694 : If you need data for each HSP, use hsps() and then interate
1695 : through the HSP objects.
1696 : This method requires that all HSPs be tiled. If they have not
1697 : already been tiled, they will be tiled first automatically..
1699 See Also : L<num_unaligned_query()|num_unaligned_query>, L<Bio::Search::BlastUtils::tile_hsps()|Bio::Search::BlastUtils>, L<frac_aligned_hit()|frac_aligned_hit>
1701 =cut
1703 #---------------------
1704 sub num_unaligned_hit {
1705 #---------------------
1706 my $self = shift;
1708 Bio::Search::BlastUtils::tile_hsps($self) if not $self->{'_tile_hsps'};
1710 my $num = $self->logical_length('sbjct') - $self->{'_length_aln_sbjct'};
1711 ($num < 0 ? 0 : $num );
1715 =head2 num_unaligned_query
1717 Usage : $hit_object->num_unaligned_query();
1718 Purpose : Get the number of the unaligned residues in the query sequence.
1719 : Sums across all all HSPs.
1720 Example : $num_unaln = $hit_object->num_unaligned_query();
1721 Returns : Integer
1722 Argument : n/a
1723 Throws : n/a
1724 Comments : See notes regarding logical lengths in the comments for frac_aligned_query().
1725 : They apply here as well.
1726 : If you need data for each HSP, use hsps() and then interate
1727 : through the HSP objects.
1728 : This method requires that all HSPs be tiled. If they have not
1729 : already been tiled, they will be tiled first automatically..
1731 See Also : L<num_unaligned_hit()|num_unaligned_hit>, L<frac_aligned_query()|frac_aligned_query>, L<Bio::Search::BlastUtils::tile_hsps()|Bio::Search::BlastUtils>
1733 =cut
1735 #-----------------------
1736 sub num_unaligned_query {
1737 #-----------------------
1738 my $self = shift;
1740 Bio::Search::BlastUtils::tile_hsps($self) if not $self->{'_tile_hsps'};
1742 my $num = $self->logical_length('query') - $self->{'_length_aln_query'};
1743 ($num < 0 ? 0 : $num );
1748 =head2 seq_inds
1750 Usage : $hit->seq_inds( seq_type, class, collapse );
1751 Purpose : Get a list of residue positions (indices) across all HSPs
1752 : for identical or conserved residues in the query or sbjct sequence.
1753 Example : @s_ind = $hit->seq_inds('query', 'identical');
1754 : @h_ind = $hit->seq_inds('hit', 'conserved');
1755 : @h_ind = $hit->seq_inds('hit', 'conserved', 1);
1756 Returns : Array of integers
1757 : May include ranges if collapse is non-zero.
1758 Argument : [0] seq_type = 'query' or 'hit' or 'sbjct' (default = 'query')
1759 : ('sbjct' is synonymous with 'hit')
1760 : [1] class = 'identical' or 'conserved' (default = 'identical')
1761 : (can be shortened to 'id' or 'cons')
1762 : (actually, anything not 'id' will evaluate to 'conserved').
1763 : [2] collapse = boolean, if non-zero, consecutive positions are merged
1764 : using a range notation, e.g., "1 2 3 4 5 7 9 10 11"
1765 : collapses to "1-5 7 9-11". This is useful for
1766 : consolidating long lists. Default = no collapse.
1767 Throws : n/a.
1768 Comments : Note that HSPs are not tiled for this. This could be a problem
1769 : for hits containing mutually exclusive HSPs.
1770 : TODO: Consider tiling and then reporting seq_inds for the
1771 : best HSP contig.
1773 See Also : L<Bio::Search::HSP::BlastHSP::seq_inds()|Bio::Search::HSP::BlastHSP>
1775 =cut
1777 #-------------
1778 sub seq_inds {
1779 #-------------
1780 my ($self, $seqType, $class, $collapse) = @_;
1782 $seqType ||= 'query';
1783 $class ||= 'identical';
1784 $collapse ||= 0;
1786 $seqType = 'sbjct' if $seqType eq 'hit';
1788 my (@inds, $hsp);
1789 foreach $hsp ($self->hsps) {
1790 # This will merge data for all HSPs together.
1791 push @inds, $hsp->seq_inds($seqType, $class);
1794 # Need to remove duplicates and sort the merged positions.
1795 if(@inds) {
1796 my %tmp = map { $_, 1 } @inds;
1797 @inds = sort {$a <=> $b} keys %tmp;
1800 $collapse ? &Bio::Search::BlastUtils::collapse_nums(@inds) : @inds;
1804 =head2 iteration
1806 Usage : $sbjct->iteration( );
1807 Purpose : Gets the iteration number in which the Hit was found.
1808 Example : $iteration_num = $sbjct->iteration();
1809 Returns : Integer greater than or equal to 1
1810 Non-PSI-BLAST reports will report iteration as 1, but this number
1811 is only meaningful for PSI-BLAST reports.
1812 Argument : none
1813 Throws : none
1815 See Also : L<found_again()|found_again>
1817 =cut
1819 #----------------
1820 sub iteration { shift->{'_iteration'} }
1821 #----------------
1824 =head2 found_again
1826 Usage : $sbjct->found_again;
1827 Purpose : Gets a boolean indicator whether or not the hit has
1828 been found in a previous iteration.
1829 This is only applicable to PSI-BLAST reports.
1831 This method indicates if the hit was reported in the
1832 "Sequences used in model and found again" section of the
1833 PSI-BLAST report or if it was reported in the
1834 "Sequences not found previously or not previously below threshold"
1835 section of the PSI-BLAST report. Only for hits in iteration > 1.
1837 Example : if( $sbjct->found_again()) { ... };
1838 Returns : Boolean (1 or 0) for PSI-BLAST report iterations greater than 1.
1839 Returns undef for PSI-BLAST report iteration 1 and non PSI_BLAST
1840 reports.
1841 Argument : none
1842 Throws : none
1844 See Also : L<found_again()|found_again>
1846 =cut
1848 #----------------
1849 sub found_again { shift->{'_found_again'} }
1850 #----------------
1853 =head2 strand
1855 Usage : $sbjct->strand( [seq_type] );
1856 Purpose : Gets the strand(s) for the query, sbjct, or both sequences
1857 : in the best HSP of the PsiBlastHit object after HSP tiling.
1858 : Only valid for BLASTN, TBLASTX, BLASTX-query, TBLASTN-hit.
1859 Example : $qstrand = $sbjct->strand('query');
1860 : $sstrand = $sbjct->strand('hit');
1861 : ($qstrand, $sstrand) = $sbjct->strand();
1862 Returns : scalar context: integer '1', '-1', or '0'
1863 : array context without args: list of two strings (queryStrand, sbjctStrand)
1864 : Array context can be "induced" by providing an argument of 'list' or 'array'.
1865 Argument : In scalar context: seq_type = 'query' or 'hit' or 'sbjct' (default = 'query')
1866 ('sbjct' is synonymous with 'hit')
1867 Throws : n/a
1868 Comments : This method requires that all HSPs be tiled. If they have not
1869 : already been tiled, they will be tiled first automatically..
1870 : If you don't want the tiled data, iterate through each HSP
1871 : calling strand() on each (use hsps() to get all HSPs).
1873 : Formerly (prior to 10/21/02), this method would return the
1874 : string "-1/1" for hits with HSPs on both strands.
1875 : However, now that strand and frame is properly being accounted
1876 : for during HSP tiling, it makes more sense for strand()
1877 : to return the strand data for the best HSP after tiling.
1879 : If you really want to know about hits on opposite strands,
1880 : you should be iterating through the HSPs using methods on the
1881 : HSP objects.
1883 : A possible use case where knowing whether a hit has HSPs
1884 : on both strands would be when filtering via SearchIO for hits with
1885 : this property. However, in this case it would be better to have a
1886 : dedicated method such as $hit->hsps_on_both_strands(). Similarly
1887 : for frame. This could be provided if there is interest.
1889 See Also : L<Bio::Search::HSP::BlastHSP::strand>()
1891 =cut
1893 #----------'
1894 sub strand {
1895 #----------
1896 my ($self, $seqType) = @_;
1898 Bio::Search::BlastUtils::tile_hsps($self) if not $self->{'_tile_hsps'};
1900 $seqType ||= (wantarray ? 'list' : 'query');
1901 $seqType = 'sbjct' if $seqType eq 'hit';
1903 my ($qstr, $hstr);
1904 # If there is only one HSP, defer this call to the solitary HSP.
1905 if($self->num_hsps == 1) {
1906 return $self->hsp->strand($seqType);
1908 elsif( defined $self->{'_qstrand'}) {
1909 # Get the data computed during hsp tiling.
1910 $qstr = $self->{'_qstrand'};
1911 $hstr = $self->{'_sstrand'};
1913 else {
1914 # otherwise, iterate through all HSPs collecting strand info.
1915 # This will return the string "-1/1" if there are HSPs on different strands.
1916 # NOTE: This was the pre-10/21/02 procedure which will no longer be used,
1917 # (unless the above elsif{} is commented out).
1918 my (%qstr, %hstr);
1919 foreach my $hsp( $self->hsps ) {
1920 my ( $q, $h ) = $hsp->strand();
1921 $qstr{ $q }++;
1922 $hstr{ $h }++;
1924 $qstr = join( '/', sort keys %qstr);
1925 $hstr = join( '/', sort keys %hstr);
1928 if($seqType =~ /list|array/i) {
1929 return ($qstr, $hstr);
1930 } elsif( $seqType eq 'query' ) {
1931 return $qstr;
1932 } else {
1933 return $hstr;
1939 __END__
1941 #####################################################################################
1942 # END OF CLASS #
1943 #####################################################################################
1946 =head1 FOR DEVELOPERS ONLY
1948 =head2 Data Members
1950 Information about the various data members of this module is provided for those
1951 wishing to modify or understand the code. Two things to bear in mind:
1953 =over 4
1955 =item 1 Do NOT rely on these in any code outside of this module.
1957 All data members are prefixed with an underscore to signify that they are private.
1958 Always use accessor methods. If the accessor doesn't exist or is inadequate,
1959 create or modify an accessor (and let me know, too!). (An exception to this might
1960 be for BlastHSP.pm which is more tightly coupled to PsiBlastHit.pm and
1961 may access PsiBlastHit data members directly for efficiency purposes, but probably
1962 should not).
1964 =item 2 This documentation may be incomplete and out of date.
1966 It is easy for these data member descriptions to become obsolete as
1967 this module is still evolving. Always double check this info and search
1968 for members not described here.
1970 =back
1972 An instance of Bio::Search::Hit::PsiBlastHit.pm is a blessed reference to a hash containing
1973 all or some of the following fields:
1975 FIELD VALUE
1976 --------------------------------------------------------------
1977 _hsps : Array ref for a list of Bio::Search::HSP::BlastHSP.pm objects.
1979 _db : Database identifier from the summary line.
1981 _desc : Description data for the hit from the summary line.
1983 _length : Total length of the hit sequence.
1985 _score : BLAST score.
1987 _bits : BLAST score (in bits). Matrix-independent.
1989 _p : BLAST P value. Obtained from summary section. (Blast1/WU-Blast only)
1991 _expect : BLAST Expect value. Obtained from summary section.
1993 _n : BLAST N value (number of HSPs) (Blast1/WU-Blast2 only)
1995 _frame : Reading frame for TBLASTN and TBLASTX analyses.
1997 _totalIdentical: Total number of identical aligned monomers.
1999 _totalConserved: Total number of conserved aligned monomers (a.k.a. "positives").
2001 _overlap : Maximum number of overlapping residues between adjacent HSPs
2002 : before considering the alignment to be ambiguous.
2004 _ambiguous_aln : Boolean. True if the alignment of all HSPs is ambiguous.
2006 _length_aln_query : Length of the aligned region of the query sequence.
2008 _length_aln_sbjct : Length of the aligned region of the sbjct sequence.
2011 =cut