Squash warning
[bioperl-live.git] / Bio / Search / HSP / PsiBlastHSP.pm
blobfd12591e49ab0817d4e034546edc3f301af1f216
1 #-----------------------------------------------------------------
2 # $Id$
4 # BioPerl module Bio::Search::HSP::PsiBlastHSP
6 # (This module was originally called Bio::Tools::Blast::HSP)
8 # Cared for by Steve Chervitz <sac@bioperl.org>
10 # You may distribute this module under the same terms as perl itself
11 #-----------------------------------------------------------------
13 ## POD Documentation:
15 =head1 NAME
17 Bio::Search::HSP::PsiBlastHSP - Bioperl BLAST High-Scoring Pair object
19 =head1 SYNOPSIS
21 See L<Bio::Search::Hit::BlastHit>.
23 =head1 DESCRIPTION
25 A Bio::Search::HSP::PsiBlastHSP object provides an interface to data
26 obtained in a single alignment section of a Blast report (known as a
27 "High-scoring Segment Pair"). This is essentially a pairwise
28 alignment with score information.
30 PsiBlastHSP objects are accessed via L<Bio::Search::Hit::BlastHit>
31 objects after parsing a BLAST report using the L<Bio::SearchIO>
32 system.
34 The construction of PsiBlastHSP objects is performed by
35 Bio::Factory::BlastHitFactory in a process that is
36 orchestrated by the Blast parser (L<Bio::SearchIO::psiblast>).
37 The resulting PsiBlastHSPs are then accessed via
38 L<Bio::Search::Hit::BlastHit>). Therefore, you do not need to
39 use L<Bio::Search::HSP::PsiBlastHSP>) directly. If you need to construct
40 PsiBlastHSPs directly, see the new() function for details.
42 For L<Bio::SearchIO> BLAST parsing usage examples, see the
43 C<examples/search-blast> directory of the Bioperl distribution.
46 =head2 Start and End coordinates
48 Sequence endpoints are swapped so that start is always less than
49 end. This affects For TBLASTN/X hits on the minus strand. Strand
50 information can be recovered using the strand() method. This
51 normalization step is standard Bioperl practice. It also facilitates
52 use of range information by methods such as match().
54 =over 1
56 =item * Supports BLAST versions 1.x and 2.x, gapped and ungapped.
58 =back
60 Bio::Search::HSP::PsiBlastHSP.pm has the ability to extract a list of all
61 residue indices for identical and conservative matches along both
62 query and sbjct sequences. Since this degree of detail is not always
63 needed, this behavior does not occur during construction of the PsiBlastHSP
64 object. These data will automatically be collected as necessary as
65 the PsiBlastHSP.pm object is used.
67 =head1 DEPENDENCIES
69 Bio::Search::HSP::PsiBlastHSP.pm is a concrete class that inherits from
70 L<Bio::SeqFeature::SimilarityPair> and L<Bio::Search::HSP::HSPI>.
71 L<Bio::Seq> and L<Bio::SimpleAlign> are employed for creating
72 sequence and alignment objects, respectively.
74 =head2 Relationship to L<Bio::SimpleAlign> and L<Bio::Seq>
76 PsiBlastHSP.pm can provide the query or sbjct sequence as a L<Bio::Seq>
77 object via the L<seq()|seq> method. The PsiBlastHSP.pm object can also create a
78 two-sequence L<Bio::SimpleAlign> alignment object using the the query
79 and sbjct sequences via the L<get_aln()|get_aln> method. Creation of alignment
80 objects is not automatic when constructing the PsiBlastHSP.pm object since
81 this level of functionality is not always required and would generate
82 a lot of extra overhead when crunching many reports.
85 =head1 FEEDBACK
87 =head2 Mailing Lists
89 User feedback is an integral part of the evolution of this and other
90 Bioperl modules. Send your comments and suggestions preferably to one
91 of the Bioperl mailing lists. Your participation is much appreciated.
93 bioperl-l@bioperl.org - General discussion
94 http://bioperl.org/wiki/Mailing_lists - About the mailing lists
96 =head2 Reporting Bugs
98 Report bugs to the Bioperl bug tracking system to help us keep track
99 the bugs and their resolution. Bug reports can be submitted via the
100 web:
102 http://bugzilla.open-bio.org/
104 =head1 AUTHOR
106 Steve Chervitz E<lt>sac-at-bioperl.orgE<gt>
108 See L<the FEEDBACK section | FEEDBACK> for where to send bug reports and comments.
110 =head1 ACKNOWLEDGEMENTS
112 This software was originally developed in the Department of Genetics
113 at Stanford University. I would also like to acknowledge my
114 colleagues at Affymetrix for useful feedback.
116 =head1 SEE ALSO
118 Bio::Search::Hit::BlastHit.pm - Blast hit object.
119 Bio::Search::Result::BlastResult.pm - Blast Result object.
120 Bio::Seq.pm - Biosequence object
122 =head2 Links:
124 http://bio.perl.org/ - Bioperl Project Homepage
126 =head1 COPYRIGHT
128 Copyright (c) 1996-2001 Steve Chervitz. All Rights Reserved.
130 =head1 DISCLAIMER
132 This software is provided "as is" without warranty of any kind.
134 =cut
137 # END of main POD documentation.
139 =head1 APPENDIX
141 The rest of the documentation details each of the object methods.
142 Internal methods are usually preceded with a _
144 =cut
146 # Let the code begin...
148 package Bio::Search::HSP::PsiBlastHSP;
150 use strict;
151 use Bio::SeqFeature::Similarity;
153 use vars qw($GAP_SYMBOL %STRAND_SYMBOL);
155 use overload
156 '""' => \&to_string;
158 use base qw(Bio::SeqFeature::SimilarityPair Bio::Search::HSP::HSPI);
160 $GAP_SYMBOL = '-'; # Need a more general way to handle gap symbols.
161 %STRAND_SYMBOL = ('Plus' => 1, 'Minus' => -1 );
164 =head2 new
166 Usage : $hsp = Bio::Search::HSP::PsiBlastHSP->new( %named_params );
167 : Bio::Search::HSP::PsiBlastHSP.pm objects are constructed
168 : automatically by Bio::SearchIO::BlastHitFactory.pm,
169 : so there is no need for direct instantiation.
170 Purpose : Constructs a new PsiBlastHSP object and Initializes key variables
171 : for the HSP.
172 Returns : A Bio::Search::HSP::PsiBlastHSP object
173 Argument : Named parameters:
174 : Parameter keys are case-insensitive.
175 : -RAW_DATA => array ref containing raw BLAST report data for
176 : for a single HSP. This includes all lines
177 : of the HSP alignment from a traditional BLAST
178 or PSI-BLAST (non-XML) report,
179 : -RANK => integer (1..n).
180 : -PROGRAM => string ('TBLASTN', 'BLASTP', etc.).
181 : -QUERY_NAME => string, id of query sequence
182 : -HIT_NAME => string, id of hit sequence
184 Comments : Having the raw data allows this object to do lazy parsing of
185 : the raw HSP data (i.e., not parsed until needed).
187 : Note that there is a fair amount of basic parsing that is
188 : currently performed in this module that would be more appropriate
189 : to do within a separate factory object.
190 : This parsing code will likely be relocated and more initialization
191 : parameters will be added to new().
193 See Also : L<Bio::SeqFeature::SimilarityPair::new()>, L<Bio::SeqFeature::Similarity::new()>
195 =cut
197 #----------------
198 sub new {
199 #----------------
200 my ($class, @args ) = @_;
202 my $self = $class->SUPER::new( @args );
203 # Initialize placeholders
204 $self->{'_queryGaps'} = $self->{'_sbjctGaps'} = 0;
205 my ($raw_data, $qname, $hname, $qlen, $hlen);
207 ($self->{'_prog'}, $self->{'_rank'}, $raw_data,
208 $qname, $hname) =
209 $self->_rearrange([qw( PROGRAM
210 RANK
211 RAW_DATA
212 QUERY_NAME
213 HIT_NAME
214 )], @args );
216 # _set_data() does a fair amount of parsing.
217 # This will likely change (see comment above.)
218 $self->_set_data( @{$raw_data} );
219 # Store the aligned query as sequence feature
220 my ($qb, $hb) = ($self->start());
221 my ($qe, $he) = ($self->end());
222 my ($qs, $hs) = ($self->strand());
223 my ($qf,$hf) = ($self->query->frame(),
224 $self->hit->frame);
226 $self->query( Bio::SeqFeature::Similarity->new (-start =>$qb,
227 -end =>$qe,
228 -strand =>$qs,
229 -bits =>$self->bits,
230 -score =>$self->score,
231 -frame =>$qf,
232 -seq_id => $qname,
233 -source =>$self->{'_prog'} ));
235 $self->hit( Bio::SeqFeature::Similarity->new (-start =>$hb,
236 -end =>$he,
237 -strand =>$hs,
238 -bits =>$self->bits,
239 -score =>$self->score,
240 -frame =>$hf,
241 -seq_id => $hname,
242 -source =>$self->{'_prog'} ));
244 # set lengths
245 $self->query->seqlength($qlen); # query
246 $self->hit->seqlength($hlen); # subject
248 $self->query->frac_identical($self->frac_identical('query'));
249 $self->hit->frac_identical($self->frac_identical('hit'));
250 return $self;
253 #sub DESTROY {
254 # my $self = shift;
255 # #print STDERR "--->DESTROYING $self\n";
259 # Title : _id_str;
260 # Purpose : Intended for internal use only to provide a string for use
261 # within exception messages to help users figure out which
262 # query/hit caused the problem.
263 # Returns : Short string with name of query and hit seq
264 sub _id_str {
265 my $self = shift;
266 if( not defined $self->{'_id_str'}) {
267 my $qname = $self->query->seqname;
268 my $hname = $self->hit->seqname;
269 $self->{'_id_str'} = "QUERY=\"$qname\" HIT=\"$hname\"";
271 return $self->{'_id_str'};
274 #=================================================
275 # Begin Bio::Search::HSP::HSPI implementation
276 #=================================================
278 =head2 algorithm
280 Title : algorithm
281 Usage : $alg = $hsp->algorithm();
282 Function: Gets the algorithm specification that was used to obtain the hsp
283 For BLAST, the algorithm denotes what type of sequence was aligned
284 against what (BLASTN: dna-dna, BLASTP prt-prt, BLASTX translated
285 dna-prt, TBLASTN prt-translated dna, TBLASTX translated
286 dna-translated dna).
287 Returns : a scalar string
288 Args : none
290 =cut
292 #----------------
293 sub algorithm {
294 #----------------
295 my ($self,@args) = @_;
296 return $self->{'_prog'};
302 =head2 signif()
304 Usage : $hsp_obj->signif()
305 Purpose : Get the P-value or Expect value for the HSP.
306 Returns : Float (0.001 or 1.3e-43)
307 : Returns P-value if it is defined, otherwise, Expect value.
308 Argument : n/a
309 Throws : n/a
310 Comments : Provided for consistency with BlastHit::signif()
311 : Support for returning the significance data in different
312 : formats (e.g., exponent only), is not provided for HSP objects.
313 : This is only available for the BlastHit or Blast object.
315 See Also : L<p()|p>, L<expect()|expect>, L<Bio::Search::Hit::BlastHit::signif()|Bio::Search::Hit::BlastHit>
317 =cut
319 #-----------
320 sub signif {
321 #-----------
322 my $self = shift;
323 my $val ||= defined($self->{'_p'}) ? $self->{'_p'} : $self->{'_expect'};
324 $val;
329 =head2 evalue
331 Usage : $hsp_obj->evalue()
332 Purpose : Get the Expect value for the HSP.
333 Returns : Float (0.001 or 1.3e-43)
334 Argument : n/a
335 Throws : n/a
336 Comments : Support for returning the expectation data in different
337 : formats (e.g., exponent only), is not provided for HSP objects.
338 : This is only available for the BlastHit or Blast object.
340 See Also : L<p()|p>
342 =cut
344 #----------
345 sub evalue { shift->{'_expect'} }
346 #----------
349 =head2 p
351 Usage : $hsp_obj->p()
352 Purpose : Get the P-value for the HSP.
353 Returns : Float (0.001 or 1.3e-43) or undef if not defined.
354 Argument : n/a
355 Throws : n/a
356 Comments : P-value is not defined with NCBI Blast2 reports.
357 : Support for returning the expectation data in different
358 : formats (e.g., exponent only) is not provided for HSP objects.
359 : This is only available for the BlastHit or Blast object.
361 See Also : L<expect()|expect>
363 =cut
365 #-----
366 sub p { my $self = shift; $self->{'_p'}; }
367 #-----
369 # alias
370 sub pvalue { shift->p(@_); }
372 =head2 length
374 Usage : $hsp->length( [seq_type] )
375 Purpose : Get the length of the aligned portion of the query or sbjct.
376 Example : $hsp->length('query')
377 Returns : integer
378 Argument : seq_type: 'query' | 'hit' or 'sbjct' | 'total' (default = 'total')
379 ('sbjct' is synonymous with 'hit')
380 Throws : n/a
381 Comments : 'total' length is the full length of the alignment
382 : as reported in the denominators in the alignment section:
383 : "Identical = 34/120 Positives = 67/120".
385 See Also : L<gaps()|gaps>
387 =cut
389 #-----------
390 sub length {
391 #-----------
392 ## Developer note: when using the built-in length function within
393 ## this module, call it as CORE::length().
394 my( $self, $seqType ) = @_;
395 $seqType ||= 'total';
396 $seqType = 'sbjct' if $seqType eq 'hit';
398 $seqType ne 'total' and $self->_set_seq_data() unless $self->{'_set_seq_data'};
400 ## Sensitive to member name format.
401 $seqType = "_\L$seqType\E";
402 $self->{$seqType.'Length'};
407 =head2 gaps
409 Usage : $hsp->gaps( [seq_type] )
410 Purpose : Get the number of gap characters in the query, sbjct, or total alignment.
411 : Also can return query gap chars and sbjct gap chars as a two-element list
412 : when in array context.
413 Example : $total_gaps = $hsp->gaps();
414 : ($qgaps, $sgaps) = $hsp->gaps();
415 : $qgaps = $hsp->gaps('query');
416 Returns : scalar context: integer
417 : array context without args: (int, int) = ('queryGaps', 'sbjctGaps')
418 Argument : seq_type: 'query' or 'hit' or 'sbjct' or 'total'
419 : ('sbjct' is synonymous with 'hit')
420 : (default = 'total', scalar context)
421 : Array context can be "induced" by providing an argument of 'list' or 'array'.
422 Throws : n/a
424 See Also : L<length()|length>, L<matches()|matches>
426 =cut
428 #---------
429 sub gaps {
430 #---------
431 my( $self, $seqType ) = @_;
433 $self->_set_seq_data() unless $self->{'_set_seq_data'};
435 $seqType ||= (wantarray ? 'list' : 'total');
436 $seqType = 'sbjct' if $seqType eq 'hit';
438 if($seqType =~ /list|array/i) {
439 return (($self->{'_queryGaps'} || 0), ($self->{'_sbjctGaps'} || 0));
442 if($seqType eq 'total') {
443 return ($self->{'_queryGaps'} + $self->{'_sbjctGaps'}) || 0;
444 } else {
445 ## Sensitive to member name format.
446 $seqType = "_\L$seqType\E";
447 return $self->{$seqType.'Gaps'} || 0;
452 =head2 frac_identical
454 Usage : $hsp_object->frac_identical( [seq_type] );
455 Purpose : Get the fraction of identical positions within the given HSP.
456 Example : $frac_iden = $hsp_object->frac_identical('query');
457 Returns : Float (2-decimal precision, e.g., 0.75).
458 Argument : seq_type: 'query' or 'hit' or 'sbjct' or 'total'
459 : ('sbjct' is synonymous with 'hit')
460 : default = 'total' (but see comments below).
461 Throws : n/a
462 Comments : Different versions of Blast report different values for the total
463 : length of the alignment. This is the number reported in the
464 : denominators in the stats section:
465 : "Identical = 34/120 Positives = 67/120".
466 : NCBI-BLAST uses the total length of the alignment (with gaps)
467 : WU-BLAST uses the length of the query sequence (without gaps).
468 : Therefore, when called without an argument or an argument of 'total',
469 : this method will report different values depending on the
470 : version of BLAST used.
472 : To get the fraction identical among only the aligned residues,
473 : ignoring the gaps, call this method with an argument of 'query'
474 : or 'sbjct' ('sbjct' is synonymous with 'hit').
476 See Also : L<frac_conserved()|frac_conserved>, L<num_identical()|num_identical>, L<matches()|matches>
478 =cut
480 #-------------------
481 sub frac_identical {
482 #-------------------
483 # The value is calculated as opposed to storing it from the parsed results.
484 # This saves storage and also permits flexibility in determining for which
485 # sequence (query or sbjct) the figure is to be calculated.
487 my( $self, $seqType ) = @_;
488 $seqType ||= 'total';
489 $seqType = 'sbjct' if $seqType eq 'hit';
491 if($seqType ne 'total') {
492 $self->_set_seq_data() unless $self->{'_set_seq_data'};
494 ## Sensitive to member name format.
495 $seqType = "_\L$seqType\E";
497 sprintf( "%.2f", $self->{'_numIdentical'}/$self->{$seqType.'Length'});
501 =head2 frac_conserved
503 Usage : $hsp_object->frac_conserved( [seq_type] );
504 Purpose : Get the fraction of conserved positions within the given HSP.
505 : (Note: 'conservative' positions are called 'positives' in the
506 : Blast report.)
507 Example : $frac_cons = $hsp_object->frac_conserved('query');
508 Returns : Float (2-decimal precision, e.g., 0.75).
509 Argument : seq_type: 'query' or 'hit' or 'sbjct' or 'total'
510 : ('sbjct' is synonymous with 'hit')
511 : default = 'total' (but see comments below).
512 Throws : n/a
513 Comments : Different versions of Blast report different values for the total
514 : length of the alignment. This is the number reported in the
515 : denominators in the stats section:
516 : "Identical = 34/120 Positives = 67/120".
517 : NCBI-BLAST uses the total length of the alignment (with gaps)
518 : WU-BLAST uses the length of the query sequence (without gaps).
519 : Therefore, when called without an argument or an argument of 'total',
520 : this method will report different values depending on the
521 : version of BLAST used.
523 : To get the fraction conserved among only the aligned residues,
524 : ignoring the gaps, call this method with an argument of 'query'
525 : or 'sbjct'.
527 See Also : L<frac_conserved()|frac_conserved>, L<num_conserved()|num_conserved>, L<matches()|matches>
529 =cut
531 #--------------------
532 sub frac_conserved {
533 #--------------------
534 # The value is calculated as opposed to storing it from the parsed results.
535 # This saves storage and also permits flexibility in determining for which
536 # sequence (query or sbjct) the figure is to be calculated.
538 my( $self, $seqType ) = @_;
539 $seqType ||= 'total';
540 $seqType = 'sbjct' if $seqType eq 'hit';
542 if($seqType ne 'total') {
543 $self->_set_seq_data() unless $self->{'_set_seq_data'};
546 ## Sensitive to member name format.
547 $seqType = "_\L$seqType\E";
549 sprintf( "%.2f", $self->{'_numConserved'}/$self->{$seqType.'Length'});
552 =head2 query_string
554 Title : query_string
555 Usage : my $qseq = $hsp->query_string;
556 Function: Retrieves the query sequence of this HSP as a string
557 Returns : string
558 Args : none
561 =cut
563 #----------------
564 sub query_string{ shift->seq_str('query'); }
565 #----------------
567 =head2 hit_string
569 Title : hit_string
570 Usage : my $hseq = $hsp->hit_string;
571 Function: Retrieves the hit sequence of this HSP as a string
572 Returns : string
573 Args : none
576 =cut
578 #----------------
579 sub hit_string{ shift->seq_str('hit'); }
580 #----------------
583 =head2 homology_string
585 Title : homology_string
586 Usage : my $homo_string = $hsp->homology_string;
587 Function: Retrieves the homology sequence for this HSP as a string.
588 : The homology sequence is the string of symbols in between the
589 : query and hit sequences in the alignment indicating the degree
590 : of conservation (e.g., identical, similar, not similar).
591 Returns : string
592 Args : none
594 =cut
596 #----------------
597 sub homology_string{ shift->seq_str('match'); }
598 #----------------
600 #=================================================
601 # End Bio::Search::HSP::HSPI implementation
602 #=================================================
604 # Older method delegating to method defined in HSPI.
606 =head2 expect
608 See L<Bio::Search::HSP::HSPI::expect()|Bio::Search::HSP::HSPI>
610 =cut
612 #----------
613 sub expect { shift->evalue( @_ ); }
614 #----------
617 =head2 rank
619 Usage : $hsp->rank( [string] );
620 Purpose : Get the rank of the HSP within a given Blast hit.
621 Example : $rank = $hsp->rank;
622 Returns : Integer (1..n) corresponding to the order in which the HSP
623 appears in the BLAST report.
625 =cut
629 #----------
630 sub rank { shift->{'_rank'} }
631 #----------
633 # For backward compatibility
634 #----------
635 sub name { shift->rank }
636 #----------
638 =head2 to_string
640 Title : to_string
641 Usage : print $hsp->to_string;
642 Function: Returns a string representation for the Blast HSP.
643 Primarily intended for debugging purposes.
644 Example : see usage
645 Returns : A string of the form:
646 [PsiBlastHSP] <rank>
647 e.g.:
648 [BlastHit] 1
649 Args : None
651 =cut
653 #----------
654 sub to_string {
655 #----------
656 my $self = shift;
657 return "[PsiBlastHSP] " . $self->rank();
661 #=head2 _set_data (Private method)
663 # Usage : called automatically during object construction.
664 # Purpose : Parses the raw HSP section from a flat BLAST report and
665 # sets the query sequence, sbjct sequence, and the "match" data
666 # : which consists of the symbols between the query and sbjct lines
667 # : in the alignment.
668 # Argument : Array (all lines for a single, complete HSP, from a raw,
669 # flat (i.e., non-XML) BLAST report)
670 # Throws : Propagates any exceptions from the methods called ("See Also")
672 #See Also : L<_set_seq()|_set_seq>, L<_set_score_stats()|_set_score_stats>, L<_set_match_stats()|_set_match_stats>, L<_initialize()|_initialize>
674 #=cut
676 #--------------
677 sub _set_data {
678 #--------------
679 my $self = shift;
680 my @data = @_;
681 my @queryList = (); # 'Query' = SEQUENCE USED TO QUERY THE DATABASE.
682 my @sbjctList = (); # 'Sbjct' = HOMOLOGOUS SEQUENCE FOUND IN THE DATABASE.
683 my @matchList = ();
684 my $matchLine = 0; # Alternating boolean: when true, load 'match' data.
685 my @linedat = ();
687 #print STDERR "PsiBlastHSP: set_data()\n";
689 my($line, $aln_row_len, $length_diff);
690 $length_diff = 0;
692 # Collecting data for all lines in the alignment
693 # and then storing the collections for possible processing later.
695 # Note that "match" lines may not be properly padded with spaces.
696 # This loop now properly handles such cases:
697 # Query: 1141 PSLVELTIRDCPRLEVGPMIRSLPKFPMLKKLDLAVANIIEEDLDVIGSLEELVIXXXXX 1200
698 # PSLVELTIRDCPRLEVGPMIRSLPKFPMLKKLDLAVANIIEEDLDVIGSLEELVI
699 # Sbjct: 1141 PSLVELTIRDCPRLEVGPMIRSLPKFPMLKKLDLAVANIIEEDLDVIGSLEELVILSLKL 1200
701 foreach $line( @data ) {
702 next if $line =~ /^\s*$/;
704 if( $line =~ /^ ?Score/ ) {
705 $self->_set_score_stats( $line );
706 } elsif( $line =~ /^ ?(Identities|Positives|Strand)/ ) {
707 $self->_set_match_stats( $line );
708 } elsif( $line =~ /^ ?Frame = ([\d+-]+)/ ) {
709 # Version 2.0.8 has Frame information on a separate line.
710 # Storing frame according to SeqFeature::Generic::frame()
711 # which does not contain strand info (use strand()).
712 my $frame = abs($1) - 1;
713 $self->frame( $frame );
714 } elsif( $line =~ /^(Query:?[\s\d]+)([^\s\d]+)/ ) {
715 push @queryList, $line;
716 $self->{'_match_indent'} = CORE::length $1;
717 $aln_row_len = (CORE::length $1) + (CORE::length $2);
718 $matchLine = 1;
719 } elsif( $matchLine ) {
720 # Pad the match line with spaces if necessary.
721 $length_diff = $aln_row_len - CORE::length $line;
722 $length_diff and $line .= ' 'x $length_diff;
723 push @matchList, $line;
724 $matchLine = 0;
725 } elsif( $line =~ /^Sbjct/ ) {
726 push @sbjctList, $line;
729 # Storing the query and sbjct lists in case they are needed later.
730 # We could make this conditional to save memory.
731 $self->{'_queryList'} = \@queryList;
732 $self->{'_sbjctList'} = \@sbjctList;
734 # Storing the match list in case it is needed later.
735 $self->{'_matchList'} = \@matchList;
737 if(not defined ($self->{'_numIdentical'})) {
738 my $id_str = $self->_id_str;
739 $self->throw( -text => "Can't parse match statistics. Possibly a new or unrecognized Blast format. ($id_str)");
742 if(!scalar @queryList or !scalar @sbjctList) {
743 my $id_str = $self->_id_str;
744 $self->throw( "Can't find query or sbjct alignment lines. Possibly unrecognized Blast format. ($id_str)");
749 #=head2 _set_score_stats (Private method)
751 # Usage : called automatically by _set_data()
752 # Purpose : Sets various score statistics obtained from the HSP listing.
753 # Argument : String with any of the following formats:
754 # : blast2: Score = 30.1 bits (66), Expect = 9.2
755 # : blast2: Score = 158.2 bits (544), Expect(2) = e-110
756 # : blast1: Score = 410 (144.3 bits), Expect = 1.7e-40, P = 1.7e-40
757 # : blast1: Score = 55 (19.4 bits), Expect = 5.3, Sum P(3) = 0.99
758 # Throws : Exception if the stats cannot be parsed, probably due to a change
759 # : in the Blast report format.
761 #See Also : L<_set_data()|_set_data>
763 #=cut
765 #--------------------
766 sub _set_score_stats {
767 #--------------------
768 my ($self, $data) = @_;
770 my ($expect, $p);
772 if($data =~ /Score = +([\d.e+-]+) bits \(([\d.e+-]+)\), +Expect = +([\d.e+-]+)/) {
773 # blast2 format n = 1
774 $self->bits($1);
775 $self->score($2);
776 $expect = $3;
777 } elsif($data =~ /Score = +([\d.e+-]+) bits \(([\d.e+-]+)\), +Expect\((\d+)\) = +([\d.e+-]+)/) {
778 # blast2 format n > 1
779 $self->bits($1);
780 $self->score($2);
781 $self->{'_n'} = $3;
782 $expect = $4;
784 } elsif($data =~ /Score = +([\d.e+-]+) \(([\d.e+-]+) bits\), +Expect = +([\d.e+-]+), P = +([\d.e-]+)/) {
785 # blast1 format, n = 1
786 $self->score($1);
787 $self->bits($2);
788 $expect = $3;
789 $p = $4;
791 } elsif($data =~ /Score = +([\d.e+-]+) \(([\d.e+-]+) bits\), +Expect = +([\d.e+-]+), +Sum P\((\d+)\) = +([\d.e-]+)/) {
792 # blast1 format, n > 1
793 $self->score($1);
794 $self->bits($2);
795 $expect = $3;
796 $self->{'_n'} = $4;
797 $p = $5;
799 } else {
800 my $id_str = $self->_id_str;
801 $self->throw(-class => 'Bio::Root::Exception',
802 -text => "Can't parse score statistics: unrecognized format. ($id_str)",
803 -value => $data);
805 $expect = "1$expect" if $expect =~ /^e/i;
806 $p = "1$p" if defined $p and $p=~ /^e/i;
808 $self->{'_expect'} = $expect;
809 $self->{'_p'} = $p || undef;
810 $self->significance( $p || $expect );
814 #=head2 _set_match_stats (Private method)
816 # Usage : Private method; called automatically by _set_data()
817 # Purpose : Sets various matching statistics obtained from the HSP listing.
818 # Argument : blast2: Identities = 23/74 (31%), Positives = 29/74 (39%), Gaps = 17/74 (22%)
819 # : blast2: Identities = 57/98 (58%), Positives = 74/98 (75%)
820 # : blast1: Identities = 87/204 (42%), Positives = 126/204 (61%)
821 # : blast1: Identities = 87/204 (42%), Positives = 126/204 (61%), Frame = -3
822 # : WU-blast: Identities = 310/553 (56%), Positives = 310/553 (56%), Strand = Minus / Plus
823 # Throws : Exception if the stats cannot be parsed, probably due to a change
824 # : in the Blast report format.
825 # Comments : The "Gaps = " data in the HSP header has a different meaning depending
826 # : on the type of Blast: for BLASTP, this number is the total number of
827 # : gaps in query+sbjct; for TBLASTN, it is the number of gaps in the
828 # : query sequence only. Thus, it is safer to collect the data
829 # : separately by examining the actual sequence strings as is done
830 # : in _set_seq().
832 #See Also : L<_set_data()|_set_data>, L<_set_seq()|_set_seq>
834 #=cut
836 #--------------------
837 sub _set_match_stats {
838 #--------------------
839 my ($self, $data) = @_;
841 if($data =~ m!Identities = (\d+)/(\d+)!) {
842 # blast1 or 2 format
843 $self->{'_numIdentical'} = $1;
844 $self->{'_totalLength'} = $2;
847 if($data =~ m!Positives = (\d+)/(\d+)!) {
848 # blast1 or 2 format
849 $self->{'_numConserved'} = $1;
850 $self->{'_totalLength'} = $2;
853 if($data =~ m!Frame = ([\d+-]+)!) {
854 $self->frame($1);
857 # Strand data is not always present in this line.
858 # _set_seq() will also set strand information.
859 if($data =~ m!Strand = (\w+) / (\w+)!) {
860 $self->{'_queryStrand'} = $1;
861 $self->{'_sbjctStrand'} = $2;
864 # if($data =~ m!Gaps = (\d+)/(\d+)!) {
865 # $self->{'_totalGaps'} = $1;
866 # } else {
867 # $self->{'_totalGaps'} = 0;
873 #=head2 _set_seq_data (Private method)
875 # Usage : called automatically when sequence data is requested.
876 # Purpose : Sets the HSP sequence data for both query and sbjct sequences.
877 # : Includes: start, stop, length, gaps, and raw sequence.
878 # Argument : n/a
879 # Throws : Propagates any exception thrown by _set_match_seq()
880 # Comments : Uses raw data stored by _set_data() during object construction.
881 # : These data are not always needed, so it is conditionally
882 # : executed only upon demand by methods such as gaps(), _set_residues(),
883 # : etc. _set_seq() does the dirty work.
885 #See Also : L<_set_seq()|_set_seq>
887 #=cut
889 #-----------------
890 sub _set_seq_data {
891 #-----------------
892 my $self = shift;
894 $self->_set_seq('query', @{$self->{'_queryList'}});
895 $self->_set_seq('sbjct', @{$self->{'_sbjctList'}});
897 # Liberate some memory.
898 @{$self->{'_queryList'}} = @{$self->{'_sbjctList'}} = ();
899 undef $self->{'_queryList'};
900 undef $self->{'_sbjctList'};
902 $self->{'_set_seq_data'} = 1;
907 #=head2 _set_seq (Private method)
909 # Usage : called automatically by _set_seq_data()
910 # : $hsp_obj->($seq_type, @data);
911 # Purpose : Sets sequence information for both the query and sbjct sequences.
912 # : Directly counts the number of gaps in each sequence (if gapped Blast).
913 # Argument : $seq_type = 'query' or 'sbjct'
914 # : @data = all seq lines with the form:
915 # : Query: 61 SPHNVKDRKEQNGSINNAISPTATANTSGSQQINIDSALRDRSSNVAAQPSLSDASSGSN 120
916 # Throws : Exception if data strings cannot be parsed, probably due to a change
917 # : in the Blast report format.
918 # Comments : Uses first argument to determine which data members to set
919 # : making this method sensitive data member name changes.
920 # : Behavior is dependent on the type of BLAST analysis (TBLASTN, BLASTP, etc).
921 # Warning : Sequence endpoints are normalized so that start < end. This affects HSPs
922 # : for TBLASTN/X hits on the minus strand. Normalization facilitates use
923 # : of range information by methods such as match().
925 #See Also : L<_set_seq_data()|_set_seq_data>, L<matches()|matches>, L<range()|range>, L<start()|start>, L<end()|end>
927 #=cut
929 #-------------
930 sub _set_seq {
931 #-------------
932 my $self = shift;
933 my $seqType = shift;
934 my @data = @_;
935 my @ranges = ();
936 my @sequence = ();
937 my $numGaps = 0;
939 foreach( @data ) {
940 if( m/(\d+) *([^\d\s]+) *(\d+)/) {
941 push @ranges, ( $1, $3 ) ;
942 push @sequence, $2;
943 #print STDERR "_set_seq found sequence \"$2\"\n";
944 } else {
945 $self->warn("Bad sequence data: $_");
949 if( !(scalar(@sequence) and scalar(@ranges))) {
950 my $id_str = $self->_id_str;
951 $self->throw("Can't set sequence: missing data. Possibly unrecognized Blast format. ($id_str)");
954 # Sensitive to member name changes.
955 $seqType = "_\L$seqType\E";
956 $self->{$seqType.'Start'} = $ranges[0];
957 $self->{$seqType.'Stop'} = $ranges[ $#ranges ];
958 $self->{$seqType.'Seq'} = \@sequence;
960 $self->{$seqType.'Length'} = abs($ranges[ $#ranges ] - $ranges[0]) + 1;
962 # Adjust lengths for BLASTX, TBLASTN, TBLASTX sequences
963 # Converting nucl coords to amino acid coords.
965 my $prog = $self->algorithm;
966 if($prog eq 'TBLASTN' and $seqType eq '_sbjct') {
967 $self->{$seqType.'Length'} /= 3;
968 } elsif($prog eq 'BLASTX' and $seqType eq '_query') {
969 $self->{$seqType.'Length'} /= 3;
970 } elsif($prog eq 'TBLASTX') {
971 $self->{$seqType.'Length'} /= 3;
974 if( $prog ne 'BLASTP' ) {
975 $self->{$seqType.'Strand'} = 'Plus' if $prog =~ /BLASTN/;
976 $self->{$seqType.'Strand'} = 'Plus' if ($prog =~ /BLASTX/ and $seqType eq '_query');
977 # Normalize sequence endpoints so that start < end.
978 # Reverse complement or 'minus strand' HSPs get flipped here.
979 if($self->{$seqType.'Start'} > $self->{$seqType.'Stop'}) {
980 ($self->{$seqType.'Start'}, $self->{$seqType.'Stop'}) =
981 ($self->{$seqType.'Stop'}, $self->{$seqType.'Start'});
982 $self->{$seqType.'Strand'} = 'Minus';
986 ## Count number of gaps in each seq. Only need to do this for gapped Blasts.
987 # if($self->{'_gapped'}) {
988 my $seqstr = join('', @sequence);
989 $seqstr =~ s/\s//g;
990 my $num_gaps = CORE::length($seqstr) - $self->{$seqType.'Length'};
991 $self->{$seqType.'Gaps'} = $num_gaps if $num_gaps > 0;
996 #=head2 _set_residues (Private method)
998 # Usage : called automatically when residue data is requested.
999 # Purpose : Sets the residue numbers representing the identical and
1000 # : conserved positions. These data are obtained by analyzing the
1001 # : symbols between query and sbjct lines of the alignments.
1002 # Argument : n/a
1003 # Throws : Propagates any exception thrown by _set_seq_data() and _set_match_seq().
1004 # Comments : These data are not always needed, so it is conditionally
1005 # : executed only upon demand by methods such as seq_inds().
1006 # : Behavior is dependent on the type of BLAST analysis (TBLASTN, BLASTP, etc).
1008 #See Also : L<_set_seq_data()|_set_seq_data>, L<_set_match_seq()|_set_match_seq>, seq_inds()
1010 #=cut
1012 #------------------
1013 sub _set_residues {
1014 #------------------
1015 my $self = shift;
1016 my @sequence = ();
1018 $self->_set_seq_data() unless $self->{'_set_seq_data'};
1020 # Using hashes to avoid saving duplicate residue numbers.
1021 my %identicalList_query = ();
1022 my %identicalList_sbjct = ();
1023 my %conservedList_query = ();
1024 my %conservedList_sbjct = ();
1026 my $aref = $self->_set_match_seq() if not ref $self->{'_matchSeq'};
1027 $aref ||= $self->{'_matchSeq'};
1028 my $seqString = join('', @$aref );
1030 my $qseq = join('',@{$self->{'_querySeq'}});
1031 my $sseq = join('',@{$self->{'_sbjctSeq'}});
1032 my $resCount_query = $self->{'_queryStop'} || 0;
1033 my $resCount_sbjct = $self->{'_sbjctStop'} || 0;
1035 my $prog = $self->algorithm;
1036 if($prog !~ /^BLASTP|^BLASTN/) {
1037 if($prog eq 'TBLASTN') {
1038 $resCount_sbjct /= 3;
1039 } elsif($prog eq 'BLASTX') {
1040 $resCount_query /= 3;
1041 } elsif($prog eq 'TBLASTX') {
1042 $resCount_query /= 3;
1043 $resCount_sbjct /= 3;
1047 my ($mchar, $schar, $qchar);
1048 while( $mchar = chop($seqString) ) {
1049 ($qchar, $schar) = (chop($qseq), chop($sseq));
1050 if( $mchar eq '+' ) {
1051 $conservedList_query{ $resCount_query } = 1;
1052 $conservedList_sbjct{ $resCount_sbjct } = 1;
1053 } elsif( $mchar ne ' ' ) {
1054 $identicalList_query{ $resCount_query } = 1;
1055 $identicalList_sbjct{ $resCount_sbjct } = 1;
1057 $resCount_query-- if $qchar ne $GAP_SYMBOL;
1058 $resCount_sbjct-- if $schar ne $GAP_SYMBOL;
1060 $self->{'_identicalRes_query'} = \%identicalList_query;
1061 $self->{'_conservedRes_query'} = \%conservedList_query;
1062 $self->{'_identicalRes_sbjct'} = \%identicalList_sbjct;
1063 $self->{'_conservedRes_sbjct'} = \%conservedList_sbjct;
1070 #=head2 _set_match_seq (Private method)
1072 # Usage : $hsp_obj->_set_match_seq()
1073 # Purpose : Set the 'match' sequence for the current HSP (symbols in between
1074 # : the query and sbjct lines.)
1075 # Returns : Array reference holding the match sequences lines.
1076 # Argument : n/a
1077 # Throws : Exception if the _matchList field is not set.
1078 # Comments : The match information is not always necessary. This method
1079 # : allows it to be conditionally prepared.
1080 # : Called by _set_residues>() and seq_str().
1082 #See Also : L<_set_residues()|_set_residues>, L<seq_str()|seq_str>
1084 #=cut
1086 #-------------------
1087 sub _set_match_seq {
1088 #-------------------
1089 my $self = shift;
1091 if( ! ref($self->{'_matchList'}) ) {
1092 my $id_str = $self->_id_str;
1093 $self->throw("Can't set HSP match sequence: No data ($id_str)");
1096 my @data = @{$self->{'_matchList'}};
1098 my(@sequence);
1099 foreach( @data ) {
1100 chomp($_);
1101 ## Remove leading spaces; (note: aln may begin with a space
1102 ## which is why we can't use s/^ +//).
1103 s/^ {$self->{'_match_indent'}}//;
1104 push @sequence, $_;
1106 # Liberate some memory.
1107 @{$self->{'_matchList'}} = undef;
1108 $self->{'_matchList'} = undef;
1110 $self->{'_matchSeq'} = \@sequence;
1112 return $self->{'_matchSeq'};
1116 =head2 n
1118 Usage : $hsp_obj->n()
1119 Purpose : Get the N value (num HSPs on which P/Expect is based).
1120 : This value is not defined with NCBI Blast2 with gapping.
1121 Returns : Integer or null string if not defined.
1122 Argument : n/a
1123 Throws : n/a
1124 Comments : The 'N' value is listed in parenthesis with P/Expect value:
1125 : e.g., P(3) = 1.2e-30 ---> (N = 3).
1126 : Not defined in NCBI Blast2 with gaps.
1127 : This typically is equal to the number of HSPs but not always.
1128 : To obtain the number of HSPs, use Bio::Search::Hit::BlastHit::num_hsps().
1130 See Also : L<Bio::SeqFeature::SimilarityPair::score()|Bio::SeqFeature::SimilarityPair>
1132 =cut
1134 #-----
1135 sub n { my $self = shift; $self->{'_n'} || ''; }
1136 #-----
1139 =head2 matches
1141 Usage : $hsp->matches([seq_type], [start], [stop]);
1142 Purpose : Get the total number of identical and conservative matches
1143 : in the query or sbjct sequence for the given HSP. Optionally can
1144 : report data within a defined interval along the seq.
1145 : (Note: 'conservative' matches are called 'positives' in the
1146 : Blast report.)
1147 Example : ($id,$cons) = $hsp_object->matches('hit');
1148 : ($id,$cons) = $hsp_object->matches('query',300,400);
1149 Returns : 2-element array of integers
1150 Argument : (1) seq_type = 'query' or 'hit' or 'sbjct' (default = query)
1151 : ('sbjct' is synonymous with 'hit')
1152 : (2) start = Starting coordinate (optional)
1153 : (3) stop = Ending coordinate (optional)
1154 Throws : Exception if the supplied coordinates are out of range.
1155 Comments : Relies on seq_str('match') to get the string of alignment symbols
1156 : between the query and sbjct lines which are used for determining
1157 : the number of identical and conservative matches.
1159 See Also : L<length()|length>, L<gaps()|gaps>, L<seq_str()|seq_str>, L<Bio::Search::Hit::BlastHit::_adjust_contigs()|Bio::Search::Hit::BlastHit>
1161 =cut
1163 #-----------
1164 sub matches {
1165 #-----------
1166 my( $self, %param ) = @_;
1167 my(@data);
1168 my($seqType, $beg, $end) = ($param{-SEQ}, $param{-START}, $param{-STOP});
1169 $seqType ||= 'query';
1170 $seqType = 'sbjct' if $seqType eq 'hit';
1172 my($start,$stop);
1174 if(!defined $beg && !defined $end) {
1175 ## Get data for the whole alignment.
1176 push @data, ($self->{'_numIdentical'}, $self->{'_numConserved'});
1177 } else {
1178 ## Get the substring representing the desired sub-section of aln.
1179 $beg ||= 0;
1180 $end ||= 0;
1181 ($start,$stop) = $self->range($seqType);
1182 if($beg == 0) { $beg = $start; $end = $beg+$end; }
1183 elsif($end == 0) { $end = $stop; $beg = $end-$beg; }
1185 if($end >= $stop) { $end = $stop; } ##ML changed from if (end >stop)
1186 else { $end += 1;} ##ML moved from commented position below, makes
1187 ##more sense here
1188 # if($end > $stop) { $end = $stop; }
1189 if($beg < $start) { $beg = $start; }
1190 # else { $end += 1;}
1192 # my $seq = substr($self->seq_str('match'), $beg-$start, ($end-$beg));
1194 ## ML: START fix for substr out of range error ------------------
1195 my $seq = "";
1196 my $prog = $self->algorithm;
1197 if (($prog eq 'TBLASTN') and ($seqType eq 'sbjct'))
1199 $seq = substr($self->seq_str('match'),
1200 int(($beg-$start)/3), int(($end-$beg+1)/3));
1202 } elsif (($prog eq 'BLASTX') and ($seqType eq 'query'))
1204 $seq = substr($self->seq_str('match'),
1205 int(($beg-$start)/3), int(($end-$beg+1)/3));
1206 } else {
1207 $seq = substr($self->seq_str('match'),
1208 $beg-$start, ($end-$beg));
1210 ## ML: End of fix for substr out of range error -----------------
1213 ## ML: debugging code
1214 ## This is where we get our exception. Try printing out the values going
1215 ## into this:
1217 # print STDERR
1218 # qq(*------------MY EXCEPTION --------------------\nSeq: ") ,
1219 # $self->seq_str("$seqType"), qq("\n),$self->rank,",( index:";
1220 # print STDERR $beg-$start, ", len: ", $end-$beg," ), (HSPRealLen:",
1221 # CORE::length $self->seq_str("$seqType");
1222 # print STDERR ", HSPCalcLen: ", $stop - $start +1 ," ),
1223 # ( beg: $beg, end: $end ), ( start: $start, stop: stop )\n";
1224 ## ML: END DEBUGGING CODE----------
1226 if(!CORE::length $seq) {
1227 my $id_str = $self->_id_str;
1228 $self->throw("Undefined $seqType sub-sequence ($beg,$end). Valid range = $start - $stop ($id_str)");
1230 ## Get data for a substring.
1231 # printf "Collecting HSP subsection data: beg,end = %d,%d; start,stop = %d,%d\n%s<---\n", $beg, $end, $start, $stop, $seq;
1232 # printf "Original match seq:\n%s\n",$self->seq_str('match');
1233 $seq =~ s/ //g; # remove space (no info).
1234 my $len_cons = CORE::length $seq;
1235 $seq =~ s/\+//g; # remove '+' characters (conservative substitutions)
1236 my $len_id = CORE::length $seq;
1237 push @data, ($len_id, $len_cons);
1238 # printf " HSP = %s\n id = %d; cons = %d\n", $self->rank, $len_id, $len_cons; <STDIN>;
1240 @data;
1244 =head2 num_identical
1246 Usage : $hsp_object->num_identical();
1247 Purpose : Get the number of identical positions within the given HSP.
1248 Example : $num_iden = $hsp_object->num_identical();
1249 Returns : integer
1250 Argument : n/a
1251 Throws : n/a
1253 See Also : L<num_conserved()|num_conserved>, L<frac_identical()|frac_identical>
1255 =cut
1257 #-------------------
1258 sub num_identical {
1259 #-------------------
1260 my( $self) = shift;
1262 $self->{'_numIdentical'};
1266 =head2 num_conserved
1268 Usage : $hsp_object->num_conserved();
1269 Purpose : Get the number of conserved positions within the given HSP.
1270 Example : $num_iden = $hsp_object->num_conserved();
1271 Returns : integer
1272 Argument : n/a
1273 Throws : n/a
1275 See Also : L<num_identical()|num_identical>, L<frac_conserved()|frac_conserved>
1277 =cut
1279 #-------------------
1280 sub num_conserved {
1281 #-------------------
1282 my( $self) = shift;
1284 $self->{'_numConserved'};
1289 =head2 range
1291 Usage : $hsp->range( [seq_type] );
1292 Purpose : Gets the (start, end) coordinates for the query or sbjct sequence
1293 : in the HSP alignment.
1294 Example : ($query_beg, $query_end) = $hsp->range('query');
1295 : ($hit_beg, $hit_end) = $hsp->range('hit');
1296 Returns : Two-element array of integers
1297 Argument : seq_type = string, 'query' or 'hit' or 'sbjct' (default = 'query')
1298 : ('sbjct' is synonymous with 'hit')
1299 Throws : n/a
1301 See Also : L<start()|start>, L<end()|end>
1303 =cut
1305 #----------
1306 sub range {
1307 #----------
1308 my ($self, $seqType) = @_;
1310 $self->_set_seq_data() unless $self->{'_set_seq_data'};
1312 $seqType ||= 'query';
1313 $seqType = 'sbjct' if $seqType eq 'hit';
1315 ## Sensitive to member name changes.
1316 $seqType = "_\L$seqType\E";
1318 return ($self->{$seqType.'Start'},$self->{$seqType.'Stop'});
1321 =head2 start
1323 Usage : $hsp->start( [seq_type] );
1324 Purpose : Gets the start coordinate for the query, sbjct, or both sequences
1325 : in the HSP alignment.
1326 : NOTE: Start will always be less than end.
1327 : To determine strand, use $hsp->strand()
1328 Example : $query_beg = $hsp->start('query');
1329 : $hit_beg = $hsp->start('hit');
1330 : ($query_beg, $hit_beg) = $hsp->start();
1331 Returns : scalar context: integer
1332 : array context without args: list of two integers
1333 Argument : In scalar context: seq_type = 'query' or 'hit' or 'sbjct' (default= 'query')
1334 : ('sbjct' is synonymous with 'hit')
1335 : Array context can be "induced" by providing an argument of 'list' or 'array'.
1336 Throws : n/a
1338 See Also : L<end()|end>, L<range()|range>
1340 =cut
1342 #----------
1343 sub start {
1344 #----------
1345 my ($self, $seqType) = @_;
1347 $seqType ||= (wantarray ? 'list' : 'query');
1348 $seqType = 'sbjct' if $seqType eq 'hit';
1350 $self->_set_seq_data() unless $self->{'_set_seq_data'};
1352 if($seqType =~ /list|array/i) {
1353 return ($self->{'_queryStart'}, $self->{'_sbjctStart'});
1354 } else {
1355 ## Sensitive to member name changes.
1356 $seqType = "_\L$seqType\E";
1357 return $self->{$seqType.'Start'};
1361 =head2 end
1363 Usage : $hsp->end( [seq_type] );
1364 Purpose : Gets the end coordinate for the query, sbjct, or both sequences
1365 : in the HSP alignment.
1366 : NOTE: Start will always be less than end.
1367 : To determine strand, use $hsp->strand()
1368 Example : $query_end = $hsp->end('query');
1369 : $hit_end = $hsp->end('hit');
1370 : ($query_end, $hit_end) = $hsp->end();
1371 Returns : scalar context: integer
1372 : array context without args: list of two integers
1373 Argument : In scalar context: seq_type = 'query' or 'hit' or 'sbjct' (default= 'query')
1374 : ('sbjct' is synonymous with 'hit')
1375 : Array context can be "induced" by providing an argument of 'list' or 'array'.
1376 Throws : n/a
1378 See Also : L<start()|start>, L<range()|range>, L<strand()|strand>
1380 =cut
1382 #----------
1383 sub end {
1384 #----------
1385 my ($self, $seqType) = @_;
1387 $seqType ||= (wantarray ? 'list' : 'query');
1388 $seqType = 'sbjct' if $seqType eq 'hit';
1390 $self->_set_seq_data() unless $self->{'_set_seq_data'};
1392 if($seqType =~ /list|array/i) {
1393 return ($self->{'_queryStop'}, $self->{'_sbjctStop'});
1394 } else {
1395 ## Sensitive to member name changes.
1396 $seqType = "_\L$seqType\E";
1397 return $self->{$seqType.'Stop'};
1403 =head2 strand
1405 Usage : $hsp_object->strand( [seq_type] )
1406 Purpose : Get the strand of the query or sbjct sequence.
1407 Example : print $hsp->strand('query');
1408 : ($query_strand, $hit_strand) = $hsp->strand();
1409 Returns : -1, 0, or 1
1410 : -1 = Minus strand, +1 = Plus strand
1411 : Returns 0 if strand is not defined, which occurs
1412 : for BLASTP reports, and the query of TBLASTN
1413 : as well as the hit if BLASTX reports.
1414 : In scalar context without arguments, returns queryStrand value.
1415 : In array context without arguments, returns a two-element list
1416 : of strings (queryStrand, sbjctStrand).
1417 : Array context can be "induced" by providing an argument of 'list' or 'array'.
1418 Argument : seq_type: 'query' or 'hit' or 'sbjct' or undef
1419 : ('sbjct' is synonymous with 'hit')
1420 Throws : n/a
1422 See Also : L<_set_seq()>, L<_set_match_stats()>
1424 =cut
1426 #-----------
1427 sub strand {
1428 #-----------
1429 my( $self, $seqType ) = @_;
1431 $seqType ||= (wantarray ? 'list' : 'query');
1432 $seqType = 'sbjct' if $seqType eq 'hit';
1434 ## Sensitive to member name format.
1435 $seqType = "_\L$seqType\E";
1437 # $seqType could be '_list'.
1438 $self->{'_queryStrand'} or $self->_set_seq_data() unless $self->{'_set_seq_data'};
1440 my $prog = $self->algorithm;
1442 if($seqType =~ /list|array/i) {
1443 my ($qstr, $hstr);
1444 if( $prog eq 'BLASTP') {
1445 $qstr = 0;
1446 $hstr = 0;
1448 elsif( $prog eq 'TBLASTN') {
1449 $qstr = 0;
1450 $hstr = $STRAND_SYMBOL{$self->{'_sbjctStrand'}};
1452 elsif( $prog eq 'BLASTX') {
1453 $qstr = $STRAND_SYMBOL{$self->{'_queryStrand'}};
1454 $hstr = 0;
1456 else {
1457 $qstr = $STRAND_SYMBOL{$self->{'_queryStrand'}} if defined $self->{'_queryStrand'};
1458 $hstr = $STRAND_SYMBOL{$self->{'_sbjctStrand'}} if defined $self->{'_sbjctStrand'};
1460 $qstr ||= 0;
1461 $hstr ||= 0;
1462 return ($qstr, $hstr);
1464 local $^W = 0;
1465 $STRAND_SYMBOL{$self->{$seqType.'Strand'}} || 0;
1469 =head2 seq
1471 Usage : $hsp->seq( [seq_type] );
1472 Purpose : Get the query or sbjct sequence as a Bio::Seq.pm object.
1473 Example : $seqObj = $hsp->seq('query');
1474 Returns : Object reference for a Bio::Seq.pm object.
1475 Argument : seq_type = 'query' or 'hit' or 'sbjct' (default = 'query').
1476 : ('sbjct' is synonymous with 'hit')
1477 Throws : Propagates any exception that occurs during construction
1478 : of the Bio::Seq.pm object.
1479 Comments : The sequence is returned in an array of strings corresponding
1480 : to the strings in the original format of the Blast alignment.
1481 : (i.e., same spacing).
1483 See Also : L<seq_str()|seq_str>, L<seq_inds()|seq_inds>, L<Bio::Seq>
1485 =cut
1487 #-------
1488 sub seq {
1489 #-------
1490 my($self,$seqType) = @_;
1491 $seqType ||= 'query';
1492 $seqType = 'sbjct' if $seqType eq 'hit';
1493 my $str = $self->seq_str($seqType);
1495 require Bio::Seq;
1497 Bio::Seq->new(-ID => $self->to_string,
1498 -SEQ => $str,
1499 -DESC => "$seqType sequence",
1503 =head2 seq_str
1505 Usage : $hsp->seq_str( seq_type );
1506 Purpose : Get the full query, sbjct, or 'match' sequence as a string.
1507 : The 'match' sequence is the string of symbols in between the
1508 : query and sbjct sequences.
1509 Example : $str = $hsp->seq_str('query');
1510 Returns : String
1511 Argument : seq_Type = 'query' or 'hit' or 'sbjct' or 'match'
1512 : ('sbjct' is synonymous with 'hit')
1513 Throws : Exception if the argument does not match an accepted seq_type.
1514 Comments : Calls _set_seq_data() to set the 'match' sequence if it has
1515 : not been set already.
1517 See Also : L<seq()|seq>, L<seq_inds()|seq_inds>, L<_set_match_seq()>
1519 =cut
1521 #------------
1522 sub seq_str {
1523 #------------
1524 my($self,$seqType) = @_;
1526 $seqType ||= 'query';
1527 $seqType = 'sbjct' if $seqType eq 'hit';
1528 ## Sensitive to member name changes.
1529 $seqType = "_\L$seqType\E";
1531 $self->_set_seq_data() unless $self->{'_set_seq_data'};
1533 if($seqType =~ /sbjct|query/) {
1534 my $seq = join('',@{$self->{$seqType.'Seq'}});
1535 $seq =~ s/\s+//g;
1536 return $seq;
1538 } elsif( $seqType =~ /match/i) {
1539 # Only need to call _set_match_seq() if the match seq is requested.
1540 my $aref = $self->_set_match_seq() unless ref $self->{'_matchSeq'};
1541 $aref = $self->{'_matchSeq'};
1543 return join('',@$aref);
1545 } else {
1546 my $id_str = $self->_id_str;
1547 $self->throw(-class => 'Bio::Root::BadParameter',
1548 -text => "Invalid or undefined sequence type: $seqType ($id_str)\n" .
1549 "Valid types: query, sbjct, match",
1550 -value => $seqType);
1554 =head2 seq_inds
1556 Usage : $hsp->seq_inds( seq_type, class, collapse );
1557 Purpose : Get a list of residue positions (indices) for all identical
1558 : or conserved residues in the query or sbjct sequence.
1559 Example : @s_ind = $hsp->seq_inds('query', 'identical');
1560 : @h_ind = $hsp->seq_inds('hit', 'conserved');
1561 : @h_ind = $hsp->seq_inds('hit', 'conserved', 1);
1562 Returns : List of integers
1563 : May include ranges if collapse is true.
1564 Argument : seq_type = 'query' or 'hit' or 'sbjct' (default = query)
1565 : ('sbjct' is synonymous with 'hit')
1566 : class = 'identical' or 'conserved' (default = identical)
1567 : (can be shortened to 'id' or 'cons')
1568 : (actually, anything not 'id' will evaluate to 'conserved').
1569 : collapse = boolean, if true, consecutive positions are merged
1570 : using a range notation, e.g., "1 2 3 4 5 7 9 10 11"
1571 : collapses to "1-5 7 9-11". This is useful for
1572 : consolidating long lists. Default = no collapse.
1573 Throws : n/a.
1574 Comments : Calls _set_residues() to set the 'match' sequence if it has
1575 : not been set already.
1577 See Also : L<seq()|seq>, L<_set_residues()>, L<Bio::Search::BlastUtils::collapse_nums()|Bio::Search::BlastUtils>, L<Bio::Search::Hit::BlastHit::seq_inds()|Bio::Search::Hit::BlastHit>
1579 =cut
1581 #---------------
1582 sub seq_inds {
1583 #---------------
1584 my ($self, $seqType, $class, $collapse) = @_;
1586 $seqType ||= 'query';
1587 $class ||= 'identical';
1588 $collapse ||= 0;
1589 $seqType = 'sbjct' if $seqType eq 'hit';
1591 $self->_set_residues() unless defined $self->{'_identicalRes_query'};
1593 $seqType = ($seqType !~ /^q/i ? 'sbjct' : 'query');
1594 $class = ($class !~ /^id/i ? 'conserved' : 'identical');
1596 ## Sensitive to member name changes.
1597 $seqType = "_\L$seqType\E";
1598 $class = "_\L$class\E";
1600 my @ary = sort { $a <=> $b } keys %{ $self->{"${class}Res$seqType"}};
1602 require Bio::Search::BlastUtils if $collapse;
1604 return $collapse ? &Bio::Search::BlastUtils::collapse_nums(@ary) : @ary;
1608 =head2 get_aln
1610 Usage : $hsp->get_aln()
1611 Purpose : Get a Bio::SimpleAlign object constructed from the query + sbjct
1612 : sequences of the present HSP object.
1613 Example : $aln_obj = $hsp->get_aln();
1614 Returns : Object reference for a Bio::SimpleAlign.pm object.
1615 Argument : n/a.
1616 Throws : Propagates any exception ocurring during the construction of
1617 : the Bio::SimpleAlign object.
1618 Comments : Requires Bio::SimpleAlign.
1619 : The Bio::SimpleAlign object is constructed from the query + sbjct
1620 : sequence objects obtained by calling seq().
1621 : Gap residues are included (see $GAP_SYMBOL).
1623 See Also : L<seq()|seq>, L<Bio::SimpleAlign>
1625 =cut
1627 #------------
1628 sub get_aln {
1629 #------------
1630 my $self = shift;
1632 require Bio::SimpleAlign;
1633 require Bio::LocatableSeq;
1634 my $qseq = $self->seq('query');
1635 my $sseq = $self->seq('sbjct');
1637 my $type = $self->algorithm =~ /P$|^T/ ? 'amino' : 'dna';
1638 my $aln = Bio::SimpleAlign->new();
1639 $aln->add_seq(Bio::LocatableSeq->new(-seq => $qseq->seq(),
1640 -id => 'query_'.$qseq->display_id(),
1641 -start => 1,
1642 -end => CORE::length($qseq)));
1644 $aln->add_seq(Bio::LocatableSeq->new(-seq => $sseq->seq(),
1645 -id => 'hit_'.$sseq->display_id(),
1646 -start => 1,
1647 -end => CORE::length($sseq)));
1649 return $aln;
1654 __END__
1657 =head1 FOR DEVELOPERS ONLY
1659 =head2 Data Members
1661 Information about the various data members of this module is provided for those
1662 wishing to modify or understand the code. Two things to bear in mind:
1664 =over 4
1666 =item 1 Do NOT rely on these in any code outside of this module.
1668 All data members are prefixed with an underscore to signify that they are private.
1669 Always use accessor methods. If the accessor doesn't exist or is inadequate,
1670 create or modify an accessor (and let me know, too!).
1672 =item 2 This documentation may be incomplete and out of date.
1674 It is easy for these data member descriptions to become obsolete as
1675 this module is still evolving. Always double check this info and search
1676 for members not described here.
1678 =back
1680 An instance of Bio::Search::HSP::PsiBlastHSP.pm is a blessed reference to a hash containing
1681 all or some of the following fields:
1683 FIELD VALUE
1684 --------------------------------------------------------------
1685 (member names are mostly self-explanatory)
1687 _score :
1688 _bits :
1689 _p :
1690 _n : Integer. The 'N' value listed in parenthesis with P/Expect value:
1691 : e.g., P(3) = 1.2e-30 ---> (N = 3).
1692 : Not defined in NCBI Blast2 with gaps.
1693 : To obtain the number of HSPs, use Bio::Search::Hit::BlastHit::num_hsps().
1694 _expect :
1695 _queryLength :
1696 _queryGaps :
1697 _queryStart :
1698 _queryStop :
1699 _querySeq :
1700 _sbjctLength :
1701 _sbjctGaps :
1702 _sbjctStart :
1703 _sbjctStop :
1704 _sbjctSeq :
1705 _matchSeq : String. Contains the symbols between the query and sbjct lines
1706 which indicate identical (letter) and conserved ('+') matches
1707 or a mismatch (' ').
1708 _numIdentical :
1709 _numConserved :
1710 _identicalRes_query :
1711 _identicalRes_sbjct :
1712 _conservedRes_query :
1713 _conservedRes_sbjct :
1714 _match_indent : The number of leading space characters on each line containing
1715 the match symbols. _match_indent is 13 in this example:
1716 Query: 285 QNSAPWGLARISHRERLNLGSFNKYLYDDDAG
1717 Q +APWGLARIS G+ + Y YD+ AG
1718 ^^^^^^^^^^^^^
1721 =cut