2 # BioPerl module for Bio::SearchIO::blast
4 # Please direct questions and support issues to <bioperl-l@bioperl.org>
6 # Cared for by Jason Stajich <jason@bioperl.org>
8 # Copyright Jason Stajich
10 # You may distribute this module under the same terms as perl itself
12 # POD documentation - main docs before the code
15 # PSI-BLAST full parsing support. Rollout of new
16 # model which will remove Steve's old psiblast driver
18 # Megablast parsing fix as reported by Neil Saunders
20 # Support bl2seq parsing
22 # Parse more blast statistics, lambda, entropy, etc
23 # from WU-BLAST in frame-specific manner
24 # 20060216 - cjf - fixed blast parsing for BLAST v2.2.13 output
25 # 20071104 - dmessina - added support for WUBLAST -echofilter
26 # 20071121 - cjf - fixed several bugs (bugs 2391, 2399, 2409)
30 Bio::SearchIO::blast - Event generator for event based parsing of
35 # Do not use this object directly - it is used as part of the
36 # Bio::SearchIO system.
39 my $searchio = Bio::SearchIO->new(-format => 'blast',
40 -file => 't/data/ecolitst.bls');
41 while( my $result = $searchio->next_result ) {
42 while( my $hit = $result->next_hit ) {
43 while( my $hsp = $hit->next_hsp ) {
51 This object encapsulated the necessary methods for generating events
52 suitable for building Bio::Search objects from a BLAST report file.
53 Read the L<Bio::SearchIO> for more information about how to use this.
55 This driver can parse:
61 NCBI produced plain text BLAST reports from blastall, this also
62 includes PSIBLAST, PSITBLASTN, RPSBLAST, and bl2seq reports. NCBI XML
63 BLAST output is parsed with the blastxml SearchIO driver
71 Jim Kent's BLAST-like output from his programs (BLASTZ, BLAT)
75 BLAST-like output from Paracel BTK output
81 Since I cannot differentiate between BLASTX and TBLASTN since bl2seq
82 doesn't report the algorithm used - I assume it is BLASTX by default -
83 you can supply the program type with -report_type in the SearchIO
86 my $parser = Bio::SearchIO->new(-format => 'blast',
87 -file => 'bl2seq.tblastn.report',
88 -report_type => 'tblastn');
90 This only really affects where the frame and strand information are
91 put - they will always be on the $hsp-E<gt>query instead of on the
92 $hsp-E<gt>hit part of the feature pair for blastx and tblastn bl2seq
93 produced reports. Hope that's clear...
99 User feedback is an integral part of the evolution of this and other
100 Bioperl modules. Send your comments and suggestions preferably to
101 the Bioperl mailing list. Your participation is much appreciated.
103 bioperl-l@bioperl.org - General discussion
104 http://bioperl.org/wiki/Mailing_lists - About the mailing lists
108 Please direct usage questions or support issues to the mailing list:
110 I<bioperl-l@bioperl.org>
112 rather than to the module maintainer directly. Many experienced and
113 reponsive experts will be able look at the problem and quickly
114 address it. Please include a thorough description of the problem
115 with code and data examples if at all possible.
117 =head2 Reporting Bugs
119 Report bugs to the Bioperl bug tracking system to help us keep track
120 of the bugs and their resolution. Bug reports can be submitted via the
123 https://redmine.open-bio.org/projects/bioperl/
125 =head1 AUTHOR - Jason Stajich
127 Email Jason Stajich jason-at-bioperl.org
131 Steve Chervitz sac-at-bioperl.org
135 The rest of the documentation details each of the object methods.
136 Internal methods are usually preceded with a _
140 # Let the code begin...'
142 package Bio
::SearchIO
::blast
;
144 use Bio
::SearchIO
::IteratedSearchResultEventBuilder
;
146 use vars
qw(%MAPPING %MODEMAP
147 $DEFAULT_BLAST_WRITER_CLASS
155 use base qw(Bio::SearchIO);
160 # mapping of NCBI Blast terms to Bioperl hash keys
162 'BlastOutput' => 'result',
163 'Iteration' => 'iteration',
168 # This should really be done more intelligently, like with
172 'Hsp_bit-score' => 'HSP-bits',
173 'Hsp_score' => 'HSP-score',
174 'Hsp_evalue' => 'HSP-evalue',
176 'Hsp_pvalue' => 'HSP-pvalue',
177 'Hsp_query-from' => 'HSP-query_start',
178 'Hsp_query-to' => 'HSP-query_end',
179 'Hsp_hit-from' => 'HSP-hit_start',
180 'Hsp_hit-to' => 'HSP-hit_end',
181 'Hsp_positive' => 'HSP-conserved',
182 'Hsp_identity' => 'HSP-identical',
183 'Hsp_gaps' => 'HSP-hsp_gaps',
184 'Hsp_hitgaps' => 'HSP-hit_gaps',
185 'Hsp_querygaps' => 'HSP-query_gaps',
186 'Hsp_qseq' => 'HSP-query_seq',
187 'Hsp_hseq' => 'HSP-hit_seq',
188 'Hsp_midline' => 'HSP-homology_seq',
189 'Hsp_align-len' => 'HSP-hsp_length',
190 'Hsp_query-frame' => 'HSP-query_frame',
191 'Hsp_hit-frame' => 'HSP-hit_frame',
192 'Hsp_links' => 'HSP-links',
193 'Hsp_group' => 'HSP-hsp_group',
194 'Hsp_features' => 'HSP-hit_features',
196 'Hit_id' => 'HIT-name',
197 'Hit_len' => 'HIT-length',
198 'Hit_accession' => 'HIT-accession',
199 'Hit_def' => 'HIT-description',
200 'Hit_signif' => 'HIT-significance',
201 # For NCBI blast, the description line contains bits.
202 # For WU-blast, the description line contains score.
203 'Hit_score' => 'HIT-score',
204 'Hit_bits' => 'HIT-bits',
206 'Iteration_iter-num' => 'ITERATION-number',
207 'Iteration_converged' => 'ITERATION-converged',
209 'BlastOutput_program' => 'RESULT-algorithm_name',
210 'BlastOutput_version' => 'RESULT-algorithm_version',
211 'BlastOutput_algorithm-reference' => 'RESULT-algorithm_reference',
212 'BlastOutput_rid' => 'RESULT-rid',
213 'BlastOutput_query-def' => 'RESULT-query_name',
214 'BlastOutput_query-len' => 'RESULT-query_length',
215 'BlastOutput_query-acc' => 'RESULT-query_accession',
216 'BlastOutput_query-gi' => 'RESULT-query_gi',
217 'BlastOutput_querydesc' => 'RESULT-query_description',
218 'BlastOutput_db' => 'RESULT-database_name',
219 'BlastOutput_db-len' => 'RESULT-database_entries',
220 'BlastOutput_db-let' => 'RESULT-database_letters',
221 'BlastOutput_inclusion-threshold' => 'RESULT-inclusion_threshold',
223 'Parameters_matrix' => { 'RESULT-parameters' => 'matrix' },
224 'Parameters_expect' => { 'RESULT-parameters' => 'expect' },
225 'Parameters_include' => { 'RESULT-parameters' => 'include' },
226 'Parameters_sc-match' => { 'RESULT-parameters' => 'match' },
227 'Parameters_sc-mismatch' => { 'RESULT-parameters' => 'mismatch' },
228 'Parameters_gap-open' => { 'RESULT-parameters' => 'gapopen' },
229 'Parameters_gap-extend' => { 'RESULT-parameters' => 'gapext' },
230 'Parameters_filter' => { 'RESULT-parameters' => 'filter' },
231 'Parameters_allowgaps' => { 'RESULT-parameters' => 'allowgaps' },
232 'Parameters_full_dbpath' => { 'RESULT-parameters' => 'full_dbpath' },
233 'Statistics_db-len' => { 'RESULT-statistics' => 'dbentries' },
234 'Statistics_db-let' => { 'RESULT-statistics' => 'dbletters' },
235 'Statistics_hsp-len' =>
236 { 'RESULT-statistics' => 'effective_hsplength' },
237 'Statistics_query-len' => { 'RESULT-statistics' => 'querylength' },
238 'Statistics_eff-space' => { 'RESULT-statistics' => 'effectivespace' },
239 'Statistics_eff-spaceused' =>
240 { 'RESULT-statistics' => 'effectivespaceused' },
241 'Statistics_eff-dblen' =>
242 { 'RESULT-statistics' => 'effectivedblength' },
243 'Statistics_kappa' => { 'RESULT-statistics' => 'kappa' },
244 'Statistics_lambda' => { 'RESULT-statistics' => 'lambda' },
245 'Statistics_entropy' => { 'RESULT-statistics' => 'entropy' },
246 'Statistics_gapped_kappa' => { 'RESULT-statistics' => 'kappa_gapped' },
247 'Statistics_gapped_lambda' =>
248 { 'RESULT-statistics' => 'lambda_gapped' },
249 'Statistics_gapped_entropy' =>
250 { 'RESULT-statistics' => 'entropy_gapped' },
252 'Statistics_framewindow' =>
253 { 'RESULT-statistics' => 'frameshiftwindow' },
254 'Statistics_decay' => { 'RESULT-statistics' => 'decayconst' },
256 'Statistics_hit_to_db' => { 'RESULT-statistics' => 'Hits_to_DB' },
257 'Statistics_num_suc_extensions' =>
258 { 'RESULT-statistics' => 'num_successful_extensions' },
259 'Statistics_length_adjustment' => { 'RESULT-statistics' => 'length_adjustment' },
260 'Statistics_number_of_hsps_better_than_expect_value_cutoff_without_gapping' =>
261 { 'RESULT-statistics' => 'number_of_hsps_better_than_expect_value_cutoff_without_gapping' },
262 'Statistics_number_of_hsps_gapped' => { 'RESULT-statistics' => 'number_of_hsps_gapped' },
263 'Statistics_number_of_hsps_successfully_gapped' => { 'RESULT-statistics' => 'number_of_hsps_successfully_gapped' },
266 'Statistics_DFA_states' => { 'RESULT-statistics' => 'num_dfa_states' },
267 'Statistics_DFA_size' => { 'RESULT-statistics' => 'dfa_size' },
268 'Statistics_noprocessors' =>
269 { 'RESULT-statistics' => 'no_of_processors' },
270 'Statistics_neighbortime' =>
271 { 'RESULT-statistics' => 'neighborhood_generate_time' },
272 'Statistics_starttime' => { 'RESULT-statistics' => 'start_time' },
273 'Statistics_endtime' => { 'RESULT-statistics' => 'end_time' },
276 # add WU-BLAST Frame-Based Statistics
277 for my $frame ( 0 .. 3 ) {
278 for my $strand ( '+', '-' ) {
280 qw(length efflength E S W T X X_gapped E2
284 $MAPPING{"Statistics_frame$strand$frame\_$ind"} =
285 { 'RESULT-statistics' => "Frame$strand$frame\_$ind" };
287 for my $val (qw(lambda kappa entropy )) {
288 for my $type (qw(used computed gapped)) {
289 my $key = "Statistics_frame$strand$frame\_$val\_$type";
291 { 'RESULT-statistics' =>
292 "Frame$strand$frame\_$val\_$type" };
293 $MAPPING{$key} = $val;
301 qw(T A X1 X2 X3 S1 S2 X1_bits X2_bits X3_bits
302 S1_bits S2_bits num_extensions
303 num_successful_extensions
304 seqs_better_than_cutoff
306 search_cputime total_cputime
307 search_actualtime total_actualtime
308 no_of_processors ctxfactor)
311 my $key = "Statistics_$stats";
312 my $val = { 'RESULT-statistics' => $stats };
313 $MAPPING{$key} = $val;
316 # add WU-BLAST Parameters
318 qw(span span1 span2 links warnings notes hspsepsmax
319 hspsepqmax topcomboN topcomboE postsw cpus wordmask
320 filter sort_by_pvalue sort_by_count sort_by_highscore
321 sort_by_totalscore sort_by_subjectlength noseqs gi qtype
325 my $key = "Parameters_$param";
326 my $val = { 'RESULT-parameters' => $param };
327 $MAPPING{$key} = $val;
330 $DEFAULT_BLAST_WRITER_CLASS = 'Bio::SearchIO::Writer::HitTableWriter';
331 $MAX_HSP_OVERLAP = 2; # Used when tiling multiple HSPs.
332 $DEFAULTREPORTTYPE = 'BLASTP'; # for bl2seq
338 Usage : my $obj = Bio::SearchIO::blast->new(%args);
339 Function: Builds a new Bio::SearchIO::blast object
340 Returns : Bio::SearchIO::blast
341 Args : Key-value pairs:
342 -fh/-file => filehandle/filename to BLAST file
344 -report_type => 'blastx', 'tblastn', etc -- only for bl2seq
345 reports when you want to distinguish between
346 tblastn and blastx reports (this only controls
347 where the frame information is put - on the query
349 -inclusion_threshold => e-value threshold for inclusion in the
350 PSI-BLAST score matrix model (blastpgp)
351 -signif => float or scientific notation number to be used
352 as a P- or Expect value cutoff
353 -score => integer or scientific notation number to be used
354 as a blast score value cutoff
355 -bits => integer or scientific notation number to be used
356 as a bit score value cutoff
357 -hit_filter => reference to a function to be used for
358 filtering hits based on arbitrary criteria.
359 All hits of each BLAST report must satisfy
360 this criteria to be retained.
361 If a hit fails this test, it is ignored.
362 This function should take a
363 Bio::Search::Hit::BlastHit.pm object as its first
364 argument and return true
365 if the hit should be retained.
366 Sample filter function:
367 -hit_filter => sub { $hit = shift;
369 (Note: -filt_func is synonymous with -hit_filter)
370 -overlap => integer. The amount of overlap to permit between
371 adjacent HSPs when tiling HSPs. A reasonable value is 2.
372 Default = $Bio::SearchIO::blast::MAX_HSP_OVERLAP.
374 The following criteria are not yet supported:
375 (these are probably best applied within this module rather than in the
376 event handler since they would permit the parser to take some shortcuts.)
378 -check_all_hits => boolean. Check all hits for significance against
379 significance criteria. Default = false.
380 If false, stops processing hits after the first
381 non-significant hit or the first hit that fails
382 the hit_filter call. This speeds parsing,
383 taking advantage of the fact that the hits
384 are processed in the order they appear in the report.
385 -min_query_len => integer to be used as a minimum for query sequence length.
386 Reports with query sequences below this length will
387 not be processed. Default = no minimum length.
388 -best => boolean. Only process the best hit of each report;
394 my ( $self, @args ) = @_;
395 $self->SUPER::_initialize
(@args);
397 # Blast reports require a specialized version of the SREB due to the
398 # possibility of iterations (PSI-BLAST). Forwarding all arguments to it. An
399 # issue here is that we want to set new default object factories if none are
402 my $handler = Bio
::SearchIO
::IteratedSearchResultEventBuilder
->new(@args);
403 $self->attach_EventHandler($handler);
405 # 2006-04-26 move this to the attach_handler function in this module so we
406 # can really reset the handler
407 # Optimization: caching
408 # the EventHandler since it is used a lot during the parse.
410 # $self->{'_handler_cache'} = $handler;
412 my ($rpttype ) = $self->_rearrange(
419 defined $rpttype && ( $self->{'_reporttype'} = $rpttype );
422 sub attach_EventHandler
{
423 my ($self,$handler) = @_;
425 $self->SUPER::attach_EventHandler
($handler);
427 # Optimization: caching the EventHandler since it is used a lot
430 $self->{'_handler_cache'} = $handler;
437 Usage : my $hit = $searchio->next_result;
438 Function: Returns the next Result from a search
439 Returns : Bio::Search::Result::ResultI object
446 my $v = $self->verbose;
449 $self->{'_seentop'} = 0; # start next report at top
451 my ( $reporttype, $seenquery, $reportline, $reportversion );
452 my ( $seeniteration, $found_again );
453 my $incl_threshold = $self->inclusion_threshold;
455 $self->start_document(); # let the fun begin...
457 my $gapped_stats = 0; # for switching between gapped/ungapped
459 local $_ = "\n"; #consistency
461 while ( defined( $_ = $self->_readline ) ) {
462 next if (/^\s+$/); # skip empty lines
463 next if (/CPU time:/);
465 next if (/[*]+\s+No hits found\s+[*]+/);
467 /^((?:\S+?)?BLAST[NPX]?)\s+(.+)$/i # NCBI BLAST, PSIBLAST
468 # RPSBLAST, MEGABLAST
469 || /^(P?GENEWISE|HFRAME|SWN|TSWN)\s+(.+)/i #Paracel BTK
472 ($reporttype, $reportversion) = ($1, $2);
473 # need to keep track of whether this is WU-BLAST
474 if ($reportversion && $reportversion =~ m{WashU$}) {
475 $self->{'_wublast'}++;
477 $self->debug("blast.pm: Start of new report: $reporttype, $reportversion\n");
478 if ( $self->{'_seentop'} ) {
479 # This handles multi-result input streams
480 $self->_pushback($_);
483 $self->_start_blastoutput;
484 if ($reporttype =~ /RPS-BLAST/) {
485 $reporttype .= '(BLASTP)'; # default RPS-BLAST type
487 $reportline = $_; # to fix the fact that RPS-BLAST output is wrong
490 'Name' => 'BlastOutput_program',
491 'Data' => $reporttype
497 'Name' => 'BlastOutput_version',
498 'Data' => $reportversion
503 'Name' => 'BlastOutput_inclusion-threshold',
504 'Data' => $incl_threshold
508 # parse the BLAST algorithm reference
509 elsif(/^Reference:\s+(.*)$/) {
510 # want to preserve newlines for the BLAST algorithm reference
511 my $algorithm_reference = "$1\n";
512 $_ = $self->_readline;
513 # while the current line, does not match an empty line, a RID:, a
514 # Database:, or a query definition line (Query=) we are still
515 # looking at the algorithm_reference, append it to what we parsed so
517 while($_ !~ /^$/ && $_ !~ /^RID:/ && $_ !~ /^Database:/ && $_ !~ /^Query=/) {
518 $algorithm_reference .= "$_";
519 $_ = $self->_readline;
521 # if we exited the while loop, we saw an empty line, a RID:, or a
522 # Database:, so push it back
523 $self->_pushback($_);
526 'Name' => 'BlastOutput_algorithm-reference',
527 'Data' => $algorithm_reference
531 # parse BLAST RID (Request ID)
532 elsif(/^RID:\s+(.*)$/) {
536 'Name' => 'BlastOutput_rid',
541 # added Windows workaround for bug 1985
542 elsif (/^(Searching|Results from round)/) {
543 next unless $1 =~ /Results from round/;
544 $self->debug("blast.pm: Possible psi blast iterations found...\n");
546 $self->in_element('hsp')
547 && $self->end_element( { 'Name' => 'Hsp' } );
548 $self->in_element('hit')
549 && $self->end_element( { 'Name' => 'Hit' } );
550 if ( defined $seeniteration ) {
551 $self->within_element('iteration')
552 && $self->end_element( { 'Name' => 'Iteration' } );
553 $self->start_element( { 'Name' => 'Iteration' } );
556 $self->start_element( { 'Name' => 'Iteration' } );
560 elsif (/^Query=\s*(.*)$/) {
562 $self->debug("blast.pm: Query= found...$_\n");
564 if ( defined $seenquery ) {
565 $self->_pushback($_);
566 $self->_pushback($reportline) if $reportline;
569 if ( !defined $reporttype ) {
570 $self->_start_blastoutput;
571 if ( defined $seeniteration ) {
572 $self->in_element('iteration')
573 && $self->end_element( { 'Name' => 'Iteration' } );
574 $self->start_element( { 'Name' => 'Iteration' } );
577 $self->start_element( { 'Name' => 'Iteration' } );
582 $_ = $self->_readline;
583 while ( defined($_) ) {
585 $self->_pushback($_);
588 # below line fixes length issue with BLAST v2.2.13; still works
590 if ( /\((\-?[\d,]+)\s+letters.*\)/ || /^Length=(\-?[\d,]+)/ ) {
597 $q .= ($q =~ /\w$/ && $_ =~ /^\w/) ?
" $_" : $_;
598 $q =~ s/\s+/ /g; # this catches the newline as well
602 $_ = $self->_readline;
605 my ( $nm, $desc ) = split( /\s+/, $q, 2 );
608 'Name' => 'BlastOutput_query-def',
614 'Name' => 'BlastOutput_query-len',
618 defined $desc && $desc =~ s/\s+$//;
621 'Name' => 'BlastOutput_querydesc',
625 my ( $gi, $acc, $version ) = $self->_get_seq_identifiers($nm);
626 $version = defined($version) && length($version) ?
".$version" : "";
629 'Name' => 'BlastOutput_query-acc',
630 'Data' => "$acc$version"
634 # these elements are dropped with some multiquery reports; add
638 'Name' => 'BlastOutput_db-len',
639 'Data' => $self->{'_blsdb_length'}
641 ) if $self->{'_blsdb_length'};
644 'Name' => 'BlastOutput_db-let',
645 'Data' => $self->{'_blsdb_letters'}
647 ) if $self->{'_blsdb_letters'};
650 'Name' => 'BlastOutput_db',
651 'Data' => $self->{'_blsdb'}
653 ) if $self->{'_blsdb_letters'};
655 # added check for WU-BLAST -echofilter option (bug 2388)
656 elsif (/^>Unfiltered[+-]1$/) {
657 # skip all of the lines of unfiltered sequence
658 while($_ !~ /^Database:/) {
659 $self->debug("Bypassing features line: $_");
660 $_ = $self->_readline;
662 $self->_pushback($_);
664 elsif (/Sequences producing significant alignments:/) {
665 $self->debug("blast.pm: Processing NCBI-BLAST descriptions\n");
668 # PSI-BLAST parsing needs to be fixed to specifically look
669 # for old vs new per iteration, as sorting based on duplication
670 # leads to bugs, see bug 1986
672 # The next line is not necessarily whitespace in psiblast reports.
673 # Also note that we must look for the end of this section by testing
674 # for a line with a leading >. Blank lines occur with this section
676 if ( !$self->in_element('iteration') ) {
677 $self->start_element( { 'Name' => 'Iteration' } );
680 # changed 8/28/2008 to exit hit table if blank line is found after an
685 while ( defined( my $descline = $self->_readline() ) ) {
686 if ($descline =~ m{^\s*$}) {
687 last DESCLINE
if $seen_block;
690 # any text match is part of block...
692 # GCG multiline oddness...
693 if ($descline =~ /^(\S+)\s+Begin:\s\d+\s+End:\s+\d+/xms) {
694 my ($id, $nextline) = ($1, $self->_readline);
695 $nextline =~ s{^!}{};
696 $descline = "$id $nextline";
698 # NCBI style hit table (no N)
699 if ($descline =~ /(?
<!cor
) # negative lookahead
700 (\d
*\
.?
(?
:[\
+\
-eE
]+)?\d
+) # number (float or scientific notation)
702 (\d
*\
.?
(?
:[\
+\
-eE
]+)?\d
+) # number (float or scientific notation)
705 my ( $score, $evalue ) = ($1, $2);
707 # Some data clean-up so e-value will appear numeric to perl
708 $evalue =~ s/^e/1e/i;
710 # This to handle no-HSP case
711 my @line = split ' ',$descline;
713 # we want to throw away the score, evalue
714 pop @line, pop @line;
716 # and N if it is present (of course they are not
717 # really in that order, but it doesn't matter
718 if ($3) { pop @line }
720 # add the last 2 entries s.t. we can reconstruct
721 # a minimal Hit object at the end of the day
722 push @hit_signifs, [ $evalue, $score, shift @line, join( ' ', @line ) ];
723 } elsif ($descline =~ /^CONVERGED/i) {
726 'Name' => 'Iteration_converged',
731 $self->_pushback($descline); # Catch leading > (end of section)
736 elsif (/Sequences producing High-scoring Segment Pairs:/) {
738 # This block is for WU-BLAST, so we don't have to check for psi-blast stuff
740 $self->debug("blast.pm: Processing WU-BLAST descriptions\n");
741 $_ = $self->_readline();
744 if ( !$self->in_element('iteration') ) {
745 $self->start_element( { 'Name' => 'Iteration' } );
748 while ( defined( $_ = $self->_readline() )
752 pop @line; # throw away first number which is for 'N'col
754 # add the last 2 entries to array s.t. we can reconstruct
755 # a minimal Hit object at the end of the day
757 [ pop @line, pop @line, shift @line, join( ' ', @line ) ];
761 elsif (/^Database:\s*(.+?)\s*$/) {
763 $self->debug("blast.pm: Database: $1\n");
765 while ( defined( $_ = $self->_readline ) ) {
767 /^\s
+(\
-?
[\d\
,]+|\S
+)\s
+sequences\
;
768 \s
+(\
-?
[\d
,]+|\S
+)\s
+ # Deal with NCBI 2.2.8 OSX problems
772 my ( $s, $l ) = ( $1, $2 );
777 'Name' => 'BlastOutput_db-len',
783 'Name' => 'BlastOutput_db-let',
787 # cache for next round in cases with multiple queries
788 $self->{'_blsdb'} = $db;
789 $self->{'_blsdb_length'} = $s;
790 $self->{'_blsdb_letters'} = $l;
800 'Name' => 'BlastOutput_db',
806 # move inside of a hit
807 elsif (/^>\s*(\S+)\s*(.*)?/) {
810 $self->debug("blast.pm: Hit: $1\n");
811 $self->in_element('hsp')
812 && $self->end_element( { 'Name' => 'Hsp' } );
813 $self->in_element('hit')
814 && $self->end_element( { 'Name' => 'Hit' } );
816 # special case when bl2seq reports don't have a leading
818 if ( !$self->within_element('result') ) {
819 $self->_start_blastoutput;
820 $self->start_element( { 'Name' => 'Iteration' } );
822 elsif ( !$self->within_element('iteration') ) {
823 $self->start_element( { 'Name' => 'Iteration' } );
825 $self->start_element( { 'Name' => 'Hit' } );
829 $self->debug("Starting a hit: $1 $2\n");
836 my ($gi, $acc, $version ) = $self->_get_seq_identifiers($id);
839 'Name' => 'Hit_accession',
843 # add hit significance (from the hit table)
844 # this is where Bug 1986 went awry
846 # Changed for Bug2409; hit->significance and hit->score/bits derived
847 # from HSPs, not hit table unless necessary
850 while (my $v = shift @hit_signifs) {
851 my $tableid = $v->[2];
852 if ($tableid !~ m{\Q$id\E}) {
853 $self->debug("Hit table ID $tableid doesn't match current hit id $id, checking next hit table entry...\n");
859 while ( defined( $_ = $self->_readline() ) ) {
862 if (/Length\s*=\s*([\d,]+)/) {
874 if ($restofline !~ /\s$/) { # bug #3235
875 s/^\s(?!\s)/\x01/; #new line to concatenate desc lines with <soh>
880 $restofline =~ s/\s+/ /g;
884 'Data' => $restofline
888 elsif (/\s+(Plus|Minus) Strand HSPs:/i) {
892 ( $self->in_element('hit') || $self->in_element('hsp') )
893 && # paracel genewise BTK
894 m
/Score\s
*=\s
*(\S
+)\s
*bits\s
* # Bit score
895 (?
:\
((\d
+)\
))?
, # Raw score
896 \s
+Log\
-Length\sScore\s
*=\s
*(\d
+) # Log-Length score
900 $self->in_element('hsp')
901 && $self->end_element( { 'Name' => 'Hsp' } );
902 $self->start_element( { 'Name' => 'Hsp' } );
904 $self->debug( "Got paracel genewise HSP score=$1\n");
906 # Some data clean-up so e-value will appear numeric to perl
907 my ( $bits, $score, $evalue ) = ( $1, $2, $3 );
908 $evalue =~ s/^e/1e/i;
911 'Name' => 'Hsp_score',
917 'Name' => 'Hsp_bit-score',
923 'Name' => 'Hsp_evalue',
929 ( $self->in_element('hit') || $self->in_element('hsp') )
930 && # paracel hframe BTK
931 m
/Score\s
*=\s
*([^,\s
]+), # Raw score
932 \s
*Expect\s
*=\s
*([^,\s
]+), # E-value
933 \s
*P
(?
:\
(\S
+\
))?\s
*=\s
*([^,\s
]+) # P-value
937 $self->in_element('hsp')
938 && $self->end_element( { 'Name' => 'Hsp' } );
939 $self->start_element( { 'Name' => 'Hsp' } );
941 $self->debug( "Got paracel hframe HSP score=$1\n");
943 # Some data clean-up so e-value will appear numeric to perl
944 my ( $score, $evalue, $pvalue ) = ( $1, $2, $3 );
945 $evalue = "1$evalue" if $evalue =~ /^e/;
946 $pvalue = "1$pvalue" if $pvalue =~ /^e/;
950 'Name' => 'Hsp_score',
956 'Name' => 'Hsp_evalue',
962 'Name' => 'Hsp_pvalue',
968 ( $self->in_element('hit') || $self->in_element('hsp') )
970 m
/Score\s
*=\s
*(\S
+)\s
* # Bit score
971 \
(([\d\
.]+)\s
*bits\
), # Raw score
972 \s
*Expect\s
*=\s
*([^,\s
]+), # E-value
974 P
(?
:\
(\d
+\
))?\s
*=\s
*([^,\s
]+) # P-value
975 (?
:\s
*,\s
+Group\s
*\
=\s
*(\d
+))?
# HSP Group
978 { # wu-blast HSP parse
979 $self->in_element('hsp')
980 && $self->end_element( { 'Name' => 'Hsp' } );
981 $self->start_element( { 'Name' => 'Hsp' } );
983 # Some data clean-up so e-value will appear numeric to perl
984 my ( $score, $bits, $evalue, $pvalue, $group ) =
985 ( $1, $2, $3, $4, $5 );
986 $evalue =~ s/^e/1e/i;
987 $pvalue =~ s/^e/1e/i;
991 'Name' => 'Hsp_score',
997 'Name' => 'Hsp_bit-score',
1003 'Name' => 'Hsp_evalue',
1009 'Name' => 'Hsp_pvalue',
1014 if ( defined $group ) {
1017 'Name' => 'Hsp_group',
1025 ( $self->in_element('hit') || $self->in_element('hsp') )
1026 && # ncbi blast, works with 2.2.17
1027 m/^\sFeatures\s\w+\sthis\spart/xmso
1028 # If the line begins with "Features in/flanking this part of subject sequence:"
1031 $self->in_element('hsp')
1032 && $self->end_element( { 'Name' => 'Hsp' } );
1034 $_ = $self->_readline;
1035 while($_ !~ /^\s*$/) {
1038 $_ = $self->_readline;
1040 $self->_pushback($_);
1041 $featline =~ s{(?:^\s+|\s+^)}{}g;
1042 $featline =~ s{\n}{;}g;
1043 $self->start_element( { 'Name' => 'Hsp' } );
1046 'Name' => 'Hsp_features',
1050 $self->{'_seen_hsp_features'} = 1;
1053 ( $self->in_element('hit') || $self->in_element('hsp') )
1054 && # ncbi blast, works with 2.2.17
1055 m
/Score\s
*=\s
*(\S
+)\s
*bits\s
* # Bit score
1056 (?
:\
((\d
+)\
))?
, # Missing for BLAT pseudo-BLAST fmt
1057 \s
*Expect
(?
:\
((\d
+\
+?
)\
))?\s
*=\s
*([^,\s
]+) # E-value
1060 { # parse NCBI blast HSP
1061 if( !$self->{'_seen_hsp_features'} ) {
1062 $self->in_element('hsp')
1063 && $self->end_element( { 'Name' => 'Hsp' } );
1064 $self->start_element( { 'Name' => 'Hsp' } );
1067 # Some data clean-up so e-value will appear numeric to perl
1068 my ( $bits, $score, $n, $evalue ) = ( $1, $2, $3, $4 );
1069 $evalue =~ s/^e/1e/i;
1072 'Name' => 'Hsp_score',
1078 'Name' => 'Hsp_bit-score',
1084 'Name' => 'Hsp_evalue',
1094 $score = '' unless defined $score; # deal with BLAT which
1095 # has no score only bits
1096 $self->debug("Got NCBI HSP score=$score, evalue $evalue\n");
1099 $self->in_element('hsp')
1100 && m/Identities\s*=\s*(\d+)\s*\/\s
*(\d
+)\s
*[\d\
%\
(\
)]+\s
*
1101 (?
:,\s
*Positives\s
*=\s
*(\d
+)\
/(\d
+)\s
*[\d\
%\
(\
)]+\s
*)?
# pos only valid for Protein alignments
1102 (?
:\
,\s
*Gaps\s
*=\s
*(\d
+)\
/(\d
+))?
# Gaps
1108 'Name' => 'Hsp_identity',
1114 'Name' => 'Hsp_align-len',
1121 'Name' => 'Hsp_positive',
1129 'Name' => 'Hsp_positive',
1137 'Name' => 'Hsp_gaps',
1143 $self->{'_Query'} = { 'begin' => 0, 'end' => 0 };
1144 $self->{'_Sbjct'} = { 'begin' => 0, 'end' => 0 };
1146 if (/(Frame\s*=\s*.+)$/) {
1148 # handle wu-blast Frame listing on same line
1149 $self->_pushback($1);
1152 elsif ( $self->in_element('hsp')
1153 && /Strand\s*=\s*(Plus|Minus)\s*\/\s
*(Plus
|Minus
)/i
)
1156 # consume this event ( we infer strand from start/end)
1157 if (!defined($reporttype)) {
1158 $self->{'_reporttype'} = $reporttype = 'BLASTN';
1159 $bl2seq_fix = 1; # special case to resubmit the algorithm
1164 elsif ( $self->in_element('hsp')
1165 && /Links\s*=\s*(\S+)/ox )
1169 'Name' => 'Hsp_links',
1174 elsif ( $self->in_element('hsp')
1175 && /Frame\s*=\s*([\+\-][1-3])\s*(\/\s
*([\
+\
-][1-3]))?
/ )
1178 # this is for bl2seq only
1179 unless ( defined $reporttype ) {
1181 if ( $1 && $2 ) { $reporttype = 'TBLASTX' }
1183 $reporttype = 'BLASTX';
1185 # we can't distinguish between BLASTX and TBLASTN straight from the report }
1187 $self->{'_reporttype'} = $reporttype;
1190 my ( $queryframe, $hitframe );
1191 if ( $reporttype eq 'TBLASTX' ) {
1192 ( $queryframe, $hitframe ) = ( $1, $2 );
1193 $hitframe =~ s/\/\s*//g
;
1195 elsif ( $reporttype eq 'TBLASTN' || $reporttype eq 'PSITBLASTN') {
1196 ( $hitframe, $queryframe ) = ( $1, 0 );
1198 elsif ( $reporttype eq 'BLASTX' || $reporttype eq 'RPS-BLAST(BLASTP)') {
1199 ( $queryframe, $hitframe ) = ( $1, 0 );
1200 # though NCBI doesn't report it, this is a special BLASTX-like
1201 # RPS-BLAST; should be handled differently
1202 if ($reporttype eq 'RPS-BLAST(BLASTP)') {
1205 'Name' => 'BlastOutput_program',
1206 'Data' => 'RPS-BLAST(BLASTX)'
1213 'Name' => 'Hsp_query-frame',
1214 'Data' => $queryframe
1220 'Name' => 'Hsp_hit-frame',
1225 elsif (/^Parameters:/
1226 || /^\s+Database:\s+?/
1230 || ( $self->in_element('hsp') && /WARNING|NOTE/ ) )
1233 # Note: Lambda check was necessary to parse
1234 # t/data/ecoli_domains.rpsblast AND to parse bl2seq
1235 $self->debug("blast.pm: found parameters section \n");
1237 $self->in_element('hsp')
1238 && $self->end_element( { 'Name' => 'Hsp' } );
1239 $self->in_element('hit')
1240 && $self->end_element( { 'Name' => 'Hit' } );
1242 # This is for the case when we specify -b 0 (or B=0 for WU-BLAST)
1243 # and still want to construct minimal Hit objects
1244 $self->_cleanup_hits(\
@hit_signifs) if scalar(@hit_signifs);
1245 $self->within_element('iteration')
1246 && $self->end_element( { 'Name' => 'Iteration' } );
1248 next if /^\s+Subset/;
1249 my $blast = (/^(\s+Database\:)|(\s*Lambda)/) ?
'ncbi' : 'wublast';
1250 if (/^\s*Histogram/) {
1256 # default is that gaps are allowed
1259 'Name' => 'Parameters_allowgaps',
1263 while ( defined( $_ = $self->_readline ) ) {
1265 /^((?:\S+)?BLAST[NPX]?)\s+(.+)$/i # NCBI BLAST, PSIBLAST
1266 # RPSBLAST, MEGABLAST
1267 || /^(P?GENEWISE|HFRAME|SWN|TSWN)\s+(.+)/i #Paracel BTK
1270 $self->_pushback($_);
1272 # let's handle this in the loop
1276 $self->_pushback($_);
1277 $self->_pushback($reportline) if $reportline;
1281 # here is where difference between wublast and ncbiblast
1282 # is better handled by different logic
1283 if ( /Number of Sequences:\s+([\d\,]+)/i
1284 || /of sequences in database:\s+(\-?[\d,]+)/i )
1290 'Name' => 'Statistics_db-len',
1295 elsif (/letters in database:\s+(\-?[\d,]+)/i) {
1300 'Name' => 'Statistics_db-let',
1305 elsif ( $blast eq 'btk' ) {
1308 elsif ( $blast eq 'wublast' ) {
1314 'Name' => 'Parameters_expect',
1322 'Name' => 'Parameters_allowgaps',
1327 elsif (/ctxfactor=(\S+)/) {
1330 'Name' => 'Statistics_ctxfactor',
1336 /(postsw|links|span[12]?|warnings|notes|gi|noseqs|qres|qype)/
1341 'Name' => "Parameters_$1",
1346 elsif (/(\S+)=(\S+)/) {
1349 'Name' => "Parameters_$1",
1354 elsif ( $last =~ /(Frame|Strand)\s+MatID\s+Matrix name/i ) {
1355 my $firstgapinfo = 1;
1357 while ( defined($_) && !/^\s+$/ ) {
1361 && s/Q=(\d+),R=(\d+)\s+//x )
1367 'Name' => 'Parameters_gap-open',
1373 'Name' => 'Parameters_gap-extend',
1385 next if $type eq 'n/a';
1387 warn "fields is empty for $type\n";
1393 "Statistics_frame$frame\_$type",
1394 'Data' => shift @fields
1400 my ( $frameo, $matid, $matrix, @fields ) =
1402 if ( !defined $frame ) {
1404 # keep some sort of default feature I guess
1405 # even though this is sort of wrong
1408 'Name' => 'Parameters_matrix',
1414 'Name' => 'Statistics_lambda',
1415 'Data' => $fields[0]
1420 'Name' => 'Statistics_kappa',
1421 'Data' => $fields[1]
1426 'Name' => 'Statistics_entropy',
1427 'Data' => $fields[2]
1442 my $f = $fields[$ii];
1443 next unless defined $f; # deal with n/a
1444 if ( $f eq 'same' ) {
1445 $f = $fields[ $ii - 3 ];
1451 "Statistics_frame$frame\_$type",
1460 $_ = $self->_readline;
1464 elsif ( $last =~ /(Frame|Strand)\s+MatID\s+Length/i ) {
1466 while ( defined($_) && !/^\s+/ ) {
1470 if ( @fields <= 3 ) {
1471 for my $type (qw(X_gapped E2_gapped S2)) {
1472 last unless @fields;
1476 "Statistics_frame$frame\_$type",
1477 'Data' => shift @fields
1493 "Statistics_frame$frame\_$type",
1494 'Data' => shift @fields
1499 $_ = $self->_readline;
1503 elsif (/(\S+\s+\S+)\s+DFA:\s+(\S+)\s+\((.+)\)/) {
1504 if ( $1 eq 'states in' ) {
1507 'Name' => 'Statistics_DFA_states',
1512 elsif ( $1 eq 'size of' ) {
1515 'Name' => 'Statistics_DFA_size',
1522 m
/^\s
+Time to generate neighborhood
:\s
+
1528 'Name' => 'Statistics_neighbortime',
1533 elsif (/processors\s+used:\s+(\d+)/) {
1536 'Name' => 'Statistics_noprocessors',
1542 m
/^\s
+(\S
+)\s
+cpu\s
+time:\s
+# cputype
1543 (\S
+\s
+\S
+\s
+\S
+) # cputime
1544 \s
+Elapsed
:\s
+(\S
+)/x
1547 my $cputype = lc($1);
1550 'Name' => "Statistics_$cputype\_cputime",
1556 'Name' => "Statistics_$cputype\_actualtime",
1561 elsif (/^\s+Start:/) {
1562 my ( $junk, $start, $stime, $end, $etime ) =
1563 split( /\s+(Start|End)\:\s+/, $_ );
1567 'Name' => 'Statistics_starttime',
1574 'Name' => 'Statistics_endtime',
1579 elsif (/^\s+Database:\s+(.+)$/) {
1582 'Name' => 'Parameters_full_dbpath',
1588 elsif (/^\s+Posted:\s+(.+)/) {
1593 'Name' => 'Statistics_posted_date',
1599 elsif ( $blast eq 'ncbi' ) {
1601 if (m/^Matrix:\s+(.+)\s*$/oxi) {
1604 'Name' => 'Parameters_matrix',
1613 $_ = $self->_readline;
1615 my ( $lambda, $kappa, $entropy ) = split;
1616 if ($gapped_stats) {
1619 'Name' => "Statistics_gapped_lambda",
1625 'Name' => "Statistics_gapped_kappa",
1631 'Name' => "Statistics_gapped_entropy",
1639 'Name' => "Statistics_lambda",
1645 'Name' => "Statistics_kappa",
1651 'Name' => "Statistics_entropy",
1657 elsif (m/effective\s+search\s+space\s+used:\s+(\d+)/oxi) {
1660 'Name' => 'Statistics_eff-spaceused',
1665 elsif (m/effective\s+search\s+space:\s+(\d+)/oxi) {
1668 'Name' => 'Statistics_eff-space',
1674 m
/Gap\s
+Penalties
:\s
+Existence
:\s
+(\d
+)\
,
1675 \s
+Extension
:\s
+(\d
+)/ox
1680 'Name' => 'Parameters_gap-open',
1686 'Name' => 'Parameters_gap-extend',
1691 elsif (/effective\s+HSP\s+length:\s+(\d+)/) {
1694 'Name' => 'Statistics_hsp-len',
1699 elsif (/Number\s+of\s+HSP's\s+better\s+than\s+(\S+)\s+without\s+gapping:\s+(\d+)/) {
1702 'Name' => 'Statistics_number_of_hsps_better_than_expect_value_cutoff_without_gapping',
1707 elsif (/Number\s+of\s+HSP's\s+gapped:\s+(\d+)/) {
1710 'Name' => 'Statistics_number_of_hsps_gapped',
1715 elsif (/Number\s+of\s+HSP's\s+successfully\s+gapped:\s+(\d+)/) {
1718 'Name' => 'Statistics_number_of_hsps_successfully_gapped',
1723 elsif (/Length\s+adjustment:\s+(\d+)/) {
1726 'Name' => 'Statistics_length_adjustment',
1731 elsif (/effective\s+length\s+of\s+query:\s+([\d\,]+)/i) {
1736 'Name' => 'Statistics_query-len',
1741 elsif (/effective\s+length\s+of\s+database:\s+([\d\,]+)/i) {
1746 'Name' => 'Statistics_eff-dblen',
1752 /^(T|A|X1|X2|X3|S1|S2):\s+(\d+(\.\d+)?)\s+(?:\(\s*(\d+\.\d+) bits\))?/
1759 'Name' => "Statistics_$1",
1766 'Name' => "Statistics_$1_bits",
1773 m
/frameshift\s
+window\
,
1774 \s
+decay\s
+const
:\s
+(\d
+)\
,\s
+([\
.\d
]+)/x
1779 'Name' => 'Statistics_framewindow',
1785 'Name' => 'Statistics_decay',
1790 elsif (m/^Number\s+of\s+Hits\s+to\s+DB:\s+(\S+)/ox) {
1793 'Name' => 'Statistics_hit_to_db',
1798 elsif (m/^Number\s+of\s+extensions:\s+(\S+)/ox) {
1801 'Name' => 'Statistics_num_extensions',
1807 m
/^Number\s
+of\s
+successful\s
+extensions
:\s
+
1813 'Name' => 'Statistics_num_suc_extensions',
1819 m
/^Number\s
+of\s
+sequences\s
+better\s
+than\s
+
1825 'Name' => 'Parameters_expect',
1831 'Name' => 'Statistics_seqs_better_than_cutoff',
1836 elsif (/^\s+Posted\s+date:\s+(.+)/) {
1841 'Name' => 'Statistics_posted_date',
1846 elsif ( !/^\s+$/ ) {
1847 #$self->debug( "unmatched stat $_");
1852 } elsif ( $self->in_element('hsp') ) {
1853 $self->debug("blast.pm: Processing HSP\n");
1854 # let's read 3 lines at a time;
1855 # bl2seq hackiness... Not sure I like
1856 $self->{'_reporttype'} ||= $DEFAULTREPORTTYPE;
1863 for ( my $i = 0 ; defined($_) && $i < 3 ; $i++ ) {
1864 # $self->debug("$i: $_") if $v;
1865 if ( ( $i == 0 && /^\s+$/) ||
1866 /^\s*(?:Lambda|Minus|Plus|Score)/i )
1868 $self->_pushback($_) if defined $_;
1869 $self->end_element( { 'Name' => 'Hsp' } );
1873 if (/^((Query|Sbjct):?\s+(\-?\d+)?\s*)(\S+)\s+(\-?\d+)?/) {
1874 my ( $full, $type, $start, $str, $end ) =
1875 ( $1, $2, $3, $4, $5 );
1877 if ( $str eq '-' ) {
1878 $i = 3 if $type eq 'Sbjct';
1881 $data{$type} = $str;
1883 $len = length($full);
1884 $self->{"\_$type"}->{'begin'} = $start
1885 unless $self->{"_$type"}->{'begin'};
1886 $self->{"\_$type"}->{'end'} = $end;
1887 } elsif (/^((Query|Sbjct):?\s+(\-?0+)\s*)/) {
1888 # Bug from NCBI's BLAST: unaligned output
1889 $_ = $self->_readline() for 0..1;
1892 $self->throw("no data for midline $_")
1893 unless ( defined $_ && defined $len );
1894 $data{'Mid'} = substr( $_, $len );
1896 $_ = $self->_readline();
1900 'Name' => 'Hsp_qseq',
1901 'Data' => $data{'Query'}
1906 'Name' => 'Hsp_hseq',
1907 'Data' => $data{'Sbjct'}
1912 'Name' => 'Hsp_midline',
1913 'Data' => $data{'Mid'}
1918 #$self->debug("blast.pm: unrecognized line $_");
1922 $self->debug("blast.pm: End of BlastOutput\n");
1923 if ( $self->{'_seentop'} ) {
1924 $self->within_element('hsp')
1925 && $self->end_element( { 'Name' => 'Hsp' } );
1926 $self->within_element('hit')
1927 && $self->end_element( { 'Name' => 'Hit' } );
1928 # cleanup extra hits
1929 $self->_cleanup_hits(\
@hit_signifs) if scalar(@hit_signifs);
1930 $self->within_element('iteration')
1931 && $self->end_element( { 'Name' => 'Iteration' } );
1935 'Name' => 'BlastOutput_program',
1936 'Data' => $reporttype
1940 $self->end_element( { 'Name' => 'BlastOutput' } );
1942 return $self->end_document();
1945 # Private method for internal use only.
1946 sub _start_blastoutput
{
1948 $self->start_element( { 'Name' => 'BlastOutput' } );
1949 $self->{'_seentop'} = 1;
1950 $self->{'_result_count'}++;
1951 $self->{'_handler_rc'} = undef;
1956 Title : _will_handle
1957 Usage : Private method. For internal use only.
1958 if( $self->_will_handle($type) ) { ... }
1959 Function: Provides an optimized way to check whether or not an element of a
1960 given type is to be handled.
1961 Returns : Reference to EventHandler object if the element type is to be handled.
1962 undef if the element type is not to be handled.
1963 Args : string containing type of element.
1971 Using the cached pointer to the EventHandler to minimize repeated
1976 Caching the will_handle status for each type that is encountered so
1977 that it only need be checked by calling
1978 handler-E<gt>will_handle($type) once.
1982 This does not lead to a major savings by itself (only 5-10%). In
1983 combination with other optimizations, or for large parse jobs, the
1984 savings good be significant.
1986 To test against the unoptimized version, remove the parentheses from
1987 around the third term in the ternary " ? : " operator and add two
1988 calls to $self-E<gt>_eventHandler().
1993 my ( $self, $type ) = @_;
1994 my $handler = $self->{'_handler_cache'};
1996 defined( $self->{'_will_handle_cache'}->{$type} )
1997 ?
$self->{'_will_handle_cache'}->{$type}
1998 : ( $self->{'_will_handle_cache'}->{$type} =
1999 $handler->will_handle($type) );
2001 return $will_handle ?
$handler : undef;
2004 =head2 start_element
2006 Title : start_element
2007 Usage : $eventgenerator->start_element
2008 Function: Handles a start element event
2010 Args : hashref with at least 2 keys 'Data' and 'Name'
2015 my ( $self, $data ) = @_;
2017 # we currently don't care about attributes
2018 my $nm = $data->{'Name'};
2019 my $type = $MODEMAP{$nm};
2021 my $handler = $self->_will_handle($type);
2023 my $func = sprintf( "start_%s", lc $type );
2024 $self->{'_handler_rc'} = $handler->$func( $data->{'Attributes'} );
2027 #$self->debug( # changed 4/29/2006 to play nice with other event handlers
2028 # "Bio::SearchIO::InternalParserError ".
2029 # "\nCan't handle elements of type \'$type.\'"
2032 unshift @
{ $self->{'_elements'} }, $type;
2033 if ( $type eq 'result' ) {
2034 $self->{'_values'} = {};
2035 $self->{'_result'} = undef;
2037 # cleanup some things
2038 if ( defined $self->{'_values'} ) {
2040 grep { /^\U$type\-/ }
2041 keys %{ $self->{'_values'} }
2044 delete $self->{'_values'}->{$k};
2054 Usage : $eventgenerator->end_element
2055 Function: Handles an end element event
2056 Returns : hashref with an element's worth of data
2057 Args : hashref with at least 2 keys 'Data' and 'Name'
2063 my ( $self, $data ) = @_;
2065 my $nm = $data->{'Name'};
2068 # cache these (TODO: we should probably cache all cross-report data)
2069 if ( $nm eq 'BlastOutput_program' ) {
2070 if ( $self->{'_last_data'} =~ /(t?blast[npx])/i ) {
2071 $self->{'_reporttype'} = uc $1;
2073 $self->{'_reporttype'} ||= $DEFAULTREPORTTYPE;
2076 if ( $nm eq 'BlastOutput_version' ) {
2077 $self->{'_reportversion'} = $self->{'_last_data'};
2080 # Hsps are sort of weird, in that they end when another
2081 # object begins so have to detect this in end_element for now
2082 if ( $nm eq 'Hsp' ) {
2083 foreach (qw(Hsp_qseq Hsp_midline Hsp_hseq Hsp_features)) {
2087 'Data' => $self->{'_last_hspdata'}->{$_}
2089 ) if defined $self->{'_last_hspdata'}->{$_};
2091 $self->{'_last_hspdata'} = {};
2094 'Name' => 'Hsp_query-from',
2095 'Data' => $self->{'_Query'}->{'begin'}
2100 'Name' => 'Hsp_query-to',
2101 'Data' => $self->{'_Query'}->{'end'}
2107 'Name' => 'Hsp_hit-from',
2108 'Data' => $self->{'_Sbjct'}->{'begin'}
2113 'Name' => 'Hsp_hit-to',
2114 'Data' => $self->{'_Sbjct'}->{'end'}
2118 # } elsif( $nm eq 'Iteration' ) {
2119 # Nothing special needs to be done here.
2121 if ( $type = $MODEMAP{$nm} ) {
2122 my $handler = $self->_will_handle($type);
2124 my $func = sprintf( "end_%s", lc $type );
2125 $rc = $handler->$func( $self->{'_reporttype'}, $self->{'_values'} );
2127 shift @
{ $self->{'_elements'} };
2130 elsif ( $MAPPING{$nm} ) {
2131 if ( ref( $MAPPING{$nm} ) =~ /hash/i ) {
2133 # this is where we shove in the data from the
2134 # hashref info about params or statistics
2135 my $key = ( keys %{ $MAPPING{$nm} } )[0];
2136 $self->{'_values'}->{$key}->{ $MAPPING{$nm}->{$key} } =
2137 $self->{'_last_data'};
2140 $self->{'_values'}->{ $MAPPING{$nm} } = $self->{'_last_data'};
2144 #$self->debug("blast.pm: unknown nm $nm, ignoring\n");
2146 $self->{'_last_data'} = ''; # remove read data if we are at
2148 $self->{'_result'} = $rc if ( defined $type && $type eq 'result' );
2149 $self->{'_seen_hsp_features'} = 0;
2156 Usage : $eventhandler->element({'Name' => $name, 'Data' => $str});
2157 Function: Convenience method that calls start_element, characters, end_element
2159 Args : Hash ref with the keys 'Name' and 'Data'
2165 my ( $self, $data ) = @_;
2166 # Note that start element isn't needed for character data
2167 # Not too SAX-y, though
2168 #$self->start_element($data);
2169 $self->characters($data);
2170 $self->end_element($data);
2176 Usage : $eventgenerator->characters($str)
2177 Function: Send a character events
2185 my ( $self, $data ) = @_;
2186 if ( $self->in_element('hsp')
2187 && $data->{'Name'} =~ /^Hsp\_(qseq|hseq|midline)$/ )
2189 $self->{'_last_hspdata'}->{ $data->{'Name'} } .= $data->{'Data'}
2190 if defined $data->{'Data'};
2192 return unless ( defined $data->{'Data'} && $data->{'Data'} !~ /^\s+$/ );
2193 $self->{'_last_data'} = $data->{'Data'};
2196 =head2 within_element
2198 Title : within_element
2199 Usage : if( $eventgenerator->within_element($element) ) {}
2200 Function: Test if we are within a particular element
2201 This is different than 'in' because within can be tested
2204 Args : string element name
2206 See Also: L<in_element>
2210 sub within_element
{
2211 my ( $self, $name ) = @_;
2213 if ( !defined $name && !defined $self->{'_elements'}
2214 || scalar @
{ $self->{'_elements'} } == 0 );
2215 foreach ( @
{ $self->{'_elements'} } ) {
2216 if ( $_ eq $name ) {
2226 Usage : if( $eventgenerator->in_element($element) ) {}
2227 Function: Test if we are in a particular element
2228 This is different than 'within_element' because within
2229 can be tested for a whole block.
2231 Args : string element name
2233 See Also: L<within_element>
2238 my ( $self, $name ) = @_;
2239 return 0 if !defined $self->{'_elements'}->[0];
2240 return ( $self->{'_elements'}->[0] eq $name );
2243 =head2 start_document
2245 Title : start_document
2246 Usage : $eventgenerator->start_document
2247 Function: Handle a start document event
2254 sub start_document
{
2256 $self->{'_lasttype'} = '';
2257 $self->{'_values'} = {};
2258 $self->{'_result'} = undef;
2259 $self->{'_elements'} = [];
2264 Title : end_document
2265 Usage : $eventgenerator->end_document
2266 Function: Handles an end document event
2267 Returns : Bio::Search::Result::ResultI object
2274 my ( $self, @args ) = @_;
2276 #$self->debug("blast.pm: end_document\n");
2277 return $self->{'_result'};
2281 my ( $self, $blast, @args ) = @_;
2283 if ( not defined( $self->writer ) ) {
2284 $self->warn("Writer not defined. Using a $DEFAULT_BLAST_WRITER_CLASS");
2285 $self->writer( $DEFAULT_BLAST_WRITER_CLASS->new() );
2287 $self->SUPER::write_result
( $blast, @args );
2292 return $self->{'_result_count'};
2295 sub report_count
{ shift->result_count }
2297 =head2 inclusion_threshold
2299 Title : inclusion_threshold
2300 Usage : my $incl_thresh = $isreb->inclusion_threshold;
2301 : $isreb->inclusion_threshold(1e-5);
2302 Function: Get/Set the e-value threshold for inclusion in the PSI-BLAST
2303 score matrix model (blastpgp) that was used for generating the reports
2305 Returns : number (real)
2306 Default value: $Bio::SearchIO::IteratedSearchResultEventBuilder::DEFAULT_INCLUSION_THRESHOLD
2307 Args : number (real) (e.g., 0.0001 or 1e-4 )
2311 =head2 max_significance
2313 Usage : $obj->max_significance();
2314 Purpose : Set/Get the P or Expect value used as significance screening cutoff.
2315 This is the value of the -signif parameter supplied to new().
2316 Hits with P or E-value above this are skipped.
2317 Returns : Scientific notation number with this format: 1.0e-05.
2318 Argument : Scientific notation number or float (when setting)
2319 Comments : Screening of significant hits uses the data provided on the
2320 : description line. For NCBI BLAST1 and WU-BLAST, this data
2321 : is P-value. for NCBI BLAST2 it is an Expect value.
2327 Synonym for L<max_significance()|max_significance>
2333 Usage : $obj->min_score();
2334 Purpose : Set/Get the Blast score used as screening cutoff.
2335 This is the value of the -score parameter supplied to new().
2336 Hits with scores below this are skipped.
2337 Returns : Integer or scientific notation number.
2338 Argument : Integer or scientific notation number (when setting)
2339 Comments : Screening of significant hits uses the data provided on the
2344 =head2 min_query_length
2346 Usage : $obj->min_query_length();
2347 Purpose : Gets the query sequence length used as screening criteria.
2348 This is the value of the -min_query_len parameter supplied to new().
2349 Hits with sequence length below this are skipped.
2355 =head2 best_hit_only
2357 Title : best_hit_only
2358 Usage : print "only getting best hit.\n" if $obj->best_hit_only;
2359 Purpose : Set/Get the indicator for whether or not to process only
2360 : the best BlastHit.
2361 Returns : Boolean (1 | 0)
2362 Argument : Boolean (1 | 0) (when setting)
2366 =head2 check_all_hits
2368 Title : check_all_hits
2369 Usage : print "checking all hits.\n" if $obj->check_all_hits;
2370 Purpose : Set/Get the indicator for whether or not to process all hits.
2371 : If false, the parser will stop processing hits after the
2372 : the first non-significance hit or the first hit that fails
2374 Returns : Boolean (1 | 0)
2375 Argument : Boolean (1 | 0) (when setting)
2379 # general private method used to make minimal hits from leftover
2380 # data in the hit table
2383 my ($self, $hits) = @_;
2384 while ( my $v = shift @
{ $hits }) {
2385 next unless defined $v;
2386 $self->start_element( { 'Name' => 'Hit' } );
2395 my ($gi, $acc, $version ) = $self->_get_seq_identifiers($id);
2398 'Name' => 'Hit_accession',
2405 'Name' => 'Hit_signif',
2409 if (exists $self->{'_wublast'}) {
2412 'Name' => 'Hit_score',
2419 'Name' => 'Hit_bits',
2428 'Name' => 'Hit_def',
2432 $self->end_element( { 'Name' => 'Hit' } );
2444 The following information is added
in hopes of increasing the
2445 maintainability of this code
. It runs the risk of becoming obsolete as
2446 the code gets updated
. As always
, double check against the actual
2447 source
. If you find any discrepencies
, please correct them
.
2448 [ This documentation added on
3 Jun
2003. ]
2450 The logic is the brainchild of Jason Stajich
, documented by Steve
2451 Chervitz
. Jason
: please check it over
and modify as you see fit
.
2454 Elmo wants to know
: How does this module unmarshall data from the input stream?
2455 (i
.e
., how does information from a raw input file get added to
2456 the correct Bioperl object?
)
2460 This answer is specific to SearchIO
::blast
, but may apply to other
2461 SearchIO
.pm subclasses as well
. The following description gives the
2462 basic idea
. The actual processing is a little more complex
for
2463 certain types of data
(HSP
, Report Parameters
).
2465 You can think of blast
::next_result
() as faking a SAX XML parser
,
2466 making a non
-XML document behave like its XML
. The overhead to
do this
2467 is quite substantial
(~650 lines of code instead of
~80 in
2470 0. First
, add a key
=> value pair
for the datum of interest to
%MAPPING
2472 'Foo_bar' => 'Foo-bar',
2474 1. next_result
() collects the datum of interest from the input stream
,
2475 and calls element
().
2477 $self->element({ 'Name' => 'Foo_bar',
2478 'Data' => $foobar});
2480 2. The element
() method is a convenience method that calls start_element
(),
2481 characters
(), and end_element
().
2483 3. start_element
() checks to see
if the event handler can handle a start_xxx
(),
2484 where xxx
= the
'Name' parameter passed into element
(), and calls start_xxx
()
2485 if so
. Otherwise
, start_element
() does
not do anything
.
2487 Data that will have such an event handler are
defined in %MODEMAP.
2488 Typically
, there are only handler methods
for the main parts of
2489 the search result
(e
.g
., Result
, Iteration
, Hit
, HSP
),
2490 which have corresponding Bioperl modules
. So
in this example
,
2491 there was an earlier call such as
$self->element({'Name'=>'Foo'})
2492 and the Foo_bar datum is meant to ultimately go into a Foo object
.
2494 The start_foo
() method
in the handler will typically
do any
2495 data initialization necessary to prepare
for creating a new Foo object
.
2496 Example
: SearchResultEventBuilder
::start_result
()
2498 4. characters
() takes the value of the
'Data' key from the hashref argument
in
2499 the elements
() call
and saves it
in a
local data member
:
2501 $self->{'_last_data'} = $data->{'Data'};
2503 5. end_element
() is like start_element
() in that it does the check
for whether
2504 the event handler can handle end_xxx
() and if so
, calls it
, passing
in
2505 the data collected from all of the characters
() calls that occurred
2506 since the start_xxx
() call
.
2508 If there isn
't any special handler for the data type specified by 'Name
',
2509 end_element() will place the data saved by characters() into another
2510 local data member that saves it in a hash with a key defined by %MAPPING.
2512 $nm = $data->{'Name
'};
2513 $self->{'_values
'}->{$MAPPING{$nm}} = $self->{'_last_data
'};
2515 In this case, $MAPPING{$nm} is 'Foo
-bar
'.
2517 end_element() finishes by resetting the local data member used by
2518 characters(). (i.e., $self->{'_last_data
'} = '';)
2520 6. When the next_result() method encounters the end of the Foo element in the
2521 input stream. It will invoke $self->end_element({'Name
'=>'Foo
'}).
2522 end_element() then sends all of the data in the $self->{'_values
'} hash.
2523 Note that $self->{'_values
'} is cleaned out during start_element(),
2524 keeping it at a resonable size.
2526 In the event handler, the end_foo() method takes the hash from end_element()
2527 and creates a new hash containing the same data, but having keys lacking
2528 the 'Foo
' prefix (e.g., 'Foo
-bar
' becomes '-bar
'). The handler's end_foo
()
2529 method then creates the Foo object
, passing
in this new hash as an argument
.
2530 Example
: SearchResultEventBuilder
::end_result
()
2532 7. Objects created from the data
in the search result are managed by
2533 the event handler which adds them to a ResultI object
(using API methods
2534 for that object
). The ResultI object gets passed back to
2535 SearchIO
::end_element
() when it calls end_result
().
2537 The ResultI object is then saved
in an internal data member of the
2538 SearchIO object
, which returns it at the end of next_result
()
2539 by calling end_document
().
2541 (Technical Note
: All objects created by end_xxx
() methods
in the event
2542 handler are returned to SearchIO
::end_element
(), but the SearchIO object
2543 only cares about the ResultI objects
.)
2545 (Sesame Street aficionados note
: This answer was NOT
given by Mr
. Noodle
;-P
)