3 # BioPerl module for Bio::SearchIO::blast
5 # Cared for by Jason Stajich <jason@bioperl.org>
7 # Copyright Jason Stajich
9 # You may distribute this module under the same terms as perl itself
11 # POD documentation - main docs before the code
14 # PSI-BLAST full parsing support. Rollout of new
15 # model which will remove Steve's old psiblast driver
17 # Megablast parsing fix as reported by Neil Saunders
19 # Support bl2seq parsing
21 # Parse more blast statistics, lambda, entropy, etc
22 # from WU-BLAST in frame-specific manner
23 # 20060216 - cjf - fixed blast parsing for BLAST v2.2.13 output
24 # 20071104 - dmessina - added support for WUBLAST -echofilter
25 # 20071121 - cjf - fixed several bugs (bugs 2391, 2399, 2409)
29 Bio::SearchIO::blast - Event generator for event based parsing of
34 # Do not use this object directly - it is used as part of the
35 # Bio::SearchIO system.
38 my $searchio = Bio::SearchIO->new(-format => 'blast',
39 -file => 't/data/ecolitst.bls');
40 while( my $result = $searchio->next_result ) {
41 while( my $hit = $result->next_hit ) {
42 while( my $hsp = $hit->next_hsp ) {
50 This object encapsulated the necessary methods for generating events
51 suitable for building Bio::Search objects from a BLAST report file.
52 Read the L<Bio::SearchIO> for more information about how to use this.
54 This driver can parse:
60 NCBI produced plain text BLAST reports from blastall, this also
61 includes PSIBLAST, PSITBLASTN, RPSBLAST, and bl2seq reports. NCBI XML
62 BLAST output is parsed with the blastxml SearchIO driver
70 Jim Kent's BLAST-like output from his programs (BLASTZ, BLAT)
74 BLAST-like output from Paracel BTK output
80 Since I cannot differentiate between BLASTX and TBLASTN since bl2seq
81 doesn't report the algorithm used - I assume it is BLASTX by default -
82 you can supply the program type with -report_type in the SearchIO
85 my $parser = Bio::SearchIO->new(-format => 'blast',
86 -file => 'bl2seq.tblastn.report',
87 -report_type => 'tblastn');
89 This only really affects where the frame and strand information are
90 put - they will always be on the $hsp-E<gt>query instead of on the
91 $hsp-E<gt>hit part of the feature pair for blastx and tblastn bl2seq
92 produced reports. Hope that's clear...
98 User feedback is an integral part of the evolution of this and other
99 Bioperl modules. Send your comments and suggestions preferably to
100 the Bioperl mailing list. Your participation is much appreciated.
102 bioperl-l@bioperl.org - General discussion
103 http://bioperl.org/wiki/Mailing_lists - About the mailing lists
105 =head2 Reporting Bugs
107 Report bugs to the Bioperl bug tracking system to help us keep track
108 of the bugs and their resolution. Bug reports can be submitted via the
111 http://bugzilla.open-bio.org/
113 =head1 AUTHOR - Jason Stajich
115 Email Jason Stajich jason-at-bioperl.org
119 Steve Chervitz sac-at-bioperl.org
123 The rest of the documentation details each of the object methods.
124 Internal methods are usually preceded with a _
128 # Let the code begin...'
130 package Bio
::SearchIO
::blast
;
132 use Bio
::SearchIO
::IteratedSearchResultEventBuilder
;
134 use vars
qw(%MAPPING %MODEMAP
135 $DEFAULT_BLAST_WRITER_CLASS
143 use base qw(Bio::SearchIO);
148 # mapping of NCBI Blast terms to Bioperl hash keys
150 'BlastOutput' => 'result',
151 'Iteration' => 'iteration',
156 # This should really be done more intelligently, like with
160 'Hsp_bit-score' => 'HSP-bits',
161 'Hsp_score' => 'HSP-score',
162 'Hsp_evalue' => 'HSP-evalue',
164 'Hsp_pvalue' => 'HSP-pvalue',
165 'Hsp_query-from' => 'HSP-query_start',
166 'Hsp_query-to' => 'HSP-query_end',
167 'Hsp_hit-from' => 'HSP-hit_start',
168 'Hsp_hit-to' => 'HSP-hit_end',
169 'Hsp_positive' => 'HSP-conserved',
170 'Hsp_identity' => 'HSP-identical',
171 'Hsp_gaps' => 'HSP-hsp_gaps',
172 'Hsp_hitgaps' => 'HSP-hit_gaps',
173 'Hsp_querygaps' => 'HSP-query_gaps',
174 'Hsp_qseq' => 'HSP-query_seq',
175 'Hsp_hseq' => 'HSP-hit_seq',
176 'Hsp_midline' => 'HSP-homology_seq',
177 'Hsp_align-len' => 'HSP-hsp_length',
178 'Hsp_query-frame' => 'HSP-query_frame',
179 'Hsp_hit-frame' => 'HSP-hit_frame',
180 'Hsp_links' => 'HSP-links',
181 'Hsp_group' => 'HSP-hsp_group',
183 'Hit_id' => 'HIT-name',
184 'Hit_len' => 'HIT-length',
185 'Hit_accession' => 'HIT-accession',
186 'Hit_def' => 'HIT-description',
187 'Hit_signif' => 'HIT-significance',
188 # For NCBI blast, the description line contains bits.
189 # For WU-blast, the description line contains score.
190 'Hit_score' => 'HIT-score',
191 'Hit_bits' => 'HIT-bits',
193 'Iteration_iter-num' => 'ITERATION-number',
194 'Iteration_converged' => 'ITERATION-converged',
196 'BlastOutput_program' => 'RESULT-algorithm_name',
197 'BlastOutput_version' => 'RESULT-algorithm_version',
198 'BlastOutput_query-def' => 'RESULT-query_name',
199 'BlastOutput_query-len' => 'RESULT-query_length',
200 'BlastOutput_query-acc' => 'RESULT-query_accession',
201 'BlastOutput_query-gi' => 'RESULT-query_gi',
202 'BlastOutput_querydesc' => 'RESULT-query_description',
203 'BlastOutput_db' => 'RESULT-database_name',
204 'BlastOutput_db-len' => 'RESULT-database_entries',
205 'BlastOutput_db-let' => 'RESULT-database_letters',
206 'BlastOutput_inclusion-threshold' => 'RESULT-inclusion_threshold',
208 'Parameters_matrix' => { 'RESULT-parameters' => 'matrix' },
209 'Parameters_expect' => { 'RESULT-parameters' => 'expect' },
210 'Parameters_include' => { 'RESULT-parameters' => 'include' },
211 'Parameters_sc-match' => { 'RESULT-parameters' => 'match' },
212 'Parameters_sc-mismatch' => { 'RESULT-parameters' => 'mismatch' },
213 'Parameters_gap-open' => { 'RESULT-parameters' => 'gapopen' },
214 'Parameters_gap-extend' => { 'RESULT-parameters' => 'gapext' },
215 'Parameters_filter' => { 'RESULT-parameters' => 'filter' },
216 'Parameters_allowgaps' => { 'RESULT-parameters' => 'allowgaps' },
217 'Parameters_full_dbpath' => { 'RESULT-parameters' => 'full_dbpath' },
218 'Statistics_db-len' => { 'RESULT-statistics' => 'dbentries' },
219 'Statistics_db-let' => { 'RESULT-statistics' => 'dbletters' },
220 'Statistics_hsp-len' =>
221 { 'RESULT-statistics' => 'effective_hsplength' },
222 'Statistics_query-len' => { 'RESULT-statistics' => 'querylength' },
223 'Statistics_eff-space' => { 'RESULT-statistics' => 'effectivespace' },
224 'Statistics_eff-spaceused' =>
225 { 'RESULT-statistics' => 'effectivespaceused' },
226 'Statistics_eff-dblen' =>
227 { 'RESULT-statistics' => 'effectivedblength' },
228 'Statistics_kappa' => { 'RESULT-statistics' => 'kappa' },
229 'Statistics_lambda' => { 'RESULT-statistics' => 'lambda' },
230 'Statistics_entropy' => { 'RESULT-statistics' => 'entropy' },
231 'Statistics_gapped_kappa' => { 'RESULT-statistics' => 'kappa_gapped' },
232 'Statistics_gapped_lambda' =>
233 { 'RESULT-statistics' => 'lambda_gapped' },
234 'Statistics_gapped_entropy' =>
235 { 'RESULT-statistics' => 'entropy_gapped' },
237 'Statistics_framewindow' =>
238 { 'RESULT-statistics' => 'frameshiftwindow' },
239 'Statistics_decay' => { 'RESULT-statistics' => 'decayconst' },
241 'Statistics_hit_to_db' => { 'RESULT-statistics' => 'Hits_to_DB' },
242 'Statistics_num_suc_extensions' =>
243 { 'RESULT-statistics' => 'num_successful_extensions' },
246 'Statistics_DFA_states' => { 'RESULT-statistics' => 'num_dfa_states' },
247 'Statistics_DFA_size' => { 'RESULT-statistics' => 'dfa_size' },
248 'Statistics_noprocessors' =>
249 { 'RESULT-statistics' => 'no_of_processors' },
250 'Statistics_neighbortime' =>
251 { 'RESULT-statistics' => 'neighborhood_generate_time' },
252 'Statistics_starttime' => { 'RESULT-statistics' => 'start_time' },
253 'Statistics_endtime' => { 'RESULT-statistics' => 'end_time' },
256 # add WU-BLAST Frame-Based Statistics
257 for my $frame ( 0 .. 3 ) {
258 for my $strand ( '+', '-' ) {
260 qw(length efflength E S W T X X_gapped E2
264 $MAPPING{"Statistics_frame$strand$frame\_$ind"} =
265 { 'RESULT-statistics' => "Frame$strand$frame\_$ind" };
267 for my $val (qw(lambda kappa entropy )) {
268 for my $type (qw(used computed gapped)) {
269 my $key = "Statistics_frame$strand$frame\_$val\_$type";
271 { 'RESULT-statistics' =>
272 "Frame$strand$frame\_$val\_$type" };
273 $MAPPING{$key} = $val;
281 qw(T A X1 X2 X3 S1 S2 X1_bits X2_bits X3_bits
282 S1_bits S2_bits num_extensions
283 num_successful_extensions
284 seqs_better_than_cutoff
286 search_cputime total_cputime
287 search_actualtime total_actualtime
288 no_of_processors ctxfactor)
291 my $key = "Statistics_$stats";
292 my $val = { 'RESULT-statistics' => $stats };
293 $MAPPING{$key} = $val;
296 # add WU-BLAST Parameters
298 qw(span span1 span2 links warnings notes hspsepsmax
299 hspsepqmax topcomboN topcomboE postsw cpus wordmask
300 filter sort_by_pvalue sort_by_count sort_by_highscore
301 sort_by_totalscore sort_by_subjectlength noseqs gi qtype
305 my $key = "Parameters_$param";
306 my $val = { 'RESULT-parameters' => $param };
307 $MAPPING{$key} = $val;
310 $DEFAULT_BLAST_WRITER_CLASS = 'Bio::Search::Writer::HitTableWriter';
311 $MAX_HSP_OVERLAP = 2; # Used when tiling multiple HSPs.
312 $DEFAULTREPORTTYPE = 'BLASTP'; # for bl2seq
318 Usage : my $obj = Bio::SearchIO::blast->new(%args);
319 Function: Builds a new Bio::SearchIO::blast object
320 Returns : Bio::SearchIO::blast
321 Args : Key-value pairs:
322 -fh/-file => filehandle/filename to BLAST file
324 -report_type => 'blastx', 'tblastn', etc -- only for bl2seq
325 reports when you want to distinguish between
326 tblastn and blastx reports (this only controls
327 where the frame information is put - on the query
329 -inclusion_threshold => e-value threshold for inclusion in the
330 PSI-BLAST score matrix model (blastpgp)
331 -signif => float or scientific notation number to be used
332 as a P- or Expect value cutoff
333 -score => integer or scientific notation number to be used
334 as a blast score value cutoff
335 -bits => integer or scientific notation number to be used
336 as a bit score value cutoff
337 -hit_filter => reference to a function to be used for
338 filtering hits based on arbitrary criteria.
339 All hits of each BLAST report must satisfy
340 this criteria to be retained.
341 If a hit fails this test, it is ignored.
342 This function should take a
343 Bio::Search::Hit::BlastHit.pm object as its first
344 argument and return true
345 if the hit should be retained.
346 Sample filter function:
347 -hit_filter => sub { $hit = shift;
349 (Note: -filt_func is synonymous with -hit_filter)
350 -overlap => integer. The amount of overlap to permit between
351 adjacent HSPs when tiling HSPs. A reasonable value is 2.
352 Default = $Bio::SearchIO::blast::MAX_HSP_OVERLAP.
354 The following criteria are not yet supported:
355 (these are probably best applied within this module rather than in the
356 event handler since they would permit the parser to take some shortcuts.)
358 -check_all_hits => boolean. Check all hits for significance against
359 significance criteria. Default = false.
360 If false, stops processing hits after the first
361 non-significant hit or the first hit that fails
362 the hit_filter call. This speeds parsing,
363 taking advantage of the fact that the hits
364 are processed in the order they appear in the report.
365 -min_query_len => integer to be used as a minimum for query sequence length.
366 Reports with query sequences below this length will
367 not be processed. Default = no minimum length.
368 -best => boolean. Only process the best hit of each report;
374 my ( $self, @args ) = @_;
375 $self->SUPER::_initialize
(@args);
377 # Blast reports require a specialized version of the SREB due to the
378 # possibility of iterations (PSI-BLAST). Forwarding all arguments to it. An
379 # issue here is that we want to set new default object factories if none are
382 my $handler = Bio
::SearchIO
::IteratedSearchResultEventBuilder
->new(@args);
383 $self->attach_EventHandler($handler);
385 # 2006-04-26 move this to the attach_handler function in this module so we
386 # can really reset the handler
387 # Optimization: caching
388 # the EventHandler since it is used a lot during the parse.
390 # $self->{'_handler_cache'} = $handler;
392 my ( $min_qlen, $check_all, $overlap, $best, $rpttype ) = $self->_rearrange(
394 qw(MIN_LENGTH CHECK_ALL_HITS
401 defined $min_qlen && $self->min_query_length($min_qlen);
402 defined $best && $self->best_hit_only($best);
403 defined $check_all && $self->check_all_hits($check_all);
404 defined $rpttype && ( $self->{'_reporttype'} = $rpttype );
407 sub attach_EventHandler
{
408 my ($self,$handler) = @_;
410 $self->SUPER::attach_EventHandler
($handler);
412 # Optimization: caching the EventHandler since it is used a lot
415 $self->{'_handler_cache'} = $handler;
422 Usage : my $hit = $searchio->next_result;
423 Function: Returns the next Result from a search
424 Returns : Bio::Search::Result::ResultI object
431 my $v = $self->verbose;
434 $self->{'_seentop'} = 0; # start next report at top
435 $self->{'_seentop'} = 0;
436 my ( $reporttype, $seenquery, $reportline );
437 my ( $seeniteration, $found_again );
438 my $incl_threshold = $self->inclusion_threshold;
440 $self->start_document(); # let the fun begin...
442 my $gapped_stats = 0; # for switching between gapped/ungapped
444 local $_ = "\n"; #consistency
446 while ( defined( $_ = $self->_readline ) ) {
447 next if (/^\s+$/); # skip empty lines
448 next if (/CPU time:/);
451 /^((?:\S+?)?BLAST[NPX])\s+(.+)$/i # NCBI BLAST, PSIBLAST
452 # RPSBLAST, MEGABLAST
453 || /^(P?GENEWISE|HFRAME|SWN|TSWN)\s+(.+)/i #Paracel BTK
456 ($reporttype, my $reportversion) = ($1, $2);
457 # need to keep track of whether this is WU-BLAST
458 if ($reportversion && $reportversion =~ m{WashU$}) {
459 $self->{'_wublast'}++;
461 $self->debug("blast.pm: Start of new report: $reporttype, $reportversion\n");
462 if ( $self->{'_seentop'} ) {
463 # This handles multi-result input streams
464 $self->_pushback($_);
467 $self->_start_blastoutput;
468 if ($reporttype =~ /RPS-BLAST/) {
469 $reporttype .= '(BLASTP)'; # default RPS-BLAST type
471 $reportline = $_; # to fix the fact that RPS-BLAST output is wrong
474 'Name' => 'BlastOutput_program',
475 'Data' => $reporttype
481 'Name' => 'BlastOutput_version',
482 'Data' => $reportversion
487 'Name' => 'BlastOutput_inclusion-threshold',
488 'Data' => $incl_threshold
492 # added Windows workaround for bug 1985
493 elsif (/^(Searching|Results from round)/) {
494 next unless $1 =~ /Results from round/;
495 $self->debug("blast.pm: Possible psi blast iterations found...\n");
497 $self->in_element('hsp')
498 && $self->end_element( { 'Name' => 'Hsp' } );
499 $self->in_element('hit')
500 && $self->end_element( { 'Name' => 'Hit' } );
501 if ( defined $seeniteration ) {
502 $self->within_element('iteration')
503 && $self->end_element( { 'Name' => 'Iteration' } );
504 $self->start_element( { 'Name' => 'Iteration' } );
507 $self->start_element( { 'Name' => 'Iteration' } );
511 elsif (/^Query=\s*(.*)$/) {
513 $self->debug("blast.pm: Query= found...$_\n");
515 if ( defined $seenquery ) {
516 $self->_pushback($reportline) if $reportline;
517 $self->_pushback($_);
521 if ( !defined $reporttype ) {
522 $self->_start_blastoutput;
523 if ( defined $seeniteration ) {
524 $self->in_element('iteration')
525 && $self->end_element( { 'Name' => 'Iteration' } );
526 $self->start_element( { 'Name' => 'Iteration' } );
529 $self->start_element( { 'Name' => 'Iteration' } );
535 $_ = $self->_readline;
536 while ( defined($_) ) {
538 $self->_pushback($_);
541 # below line fixes length issue with BLAST v2.2.13; still works
543 if ( /\((\-?[\d,]+)\s+letters.*\)/ || /^Length=(\-?[\d,]+)/ ) {
550 $q .= ($q =~ /\w$/ && $_ =~ /^\w/) ?
" $_" : $_;
551 $q =~ s/\s+/ /g; # this catches the newline as well
555 $_ = $self->_readline;
558 my ( $nm, $desc ) = split( /\s+/, $q, 2 );
561 'Name' => 'BlastOutput_query-def',
567 'Name' => 'BlastOutput_query-len',
571 defined $desc && $desc =~ s/\s+$//;
574 'Name' => 'BlastOutput_querydesc',
578 my ( $gi, $acc, $version ) = $self->_get_seq_identifiers($nm);
579 $version = defined($version) && length($version) ?
".$version" : "";
580 $acc = '' unless defined($acc);
583 'Name' => 'BlastOutput_query-acc',
584 'Data' => "$acc$version"
588 # added check for WU-BLAST -echofilter option (bug 2388)
589 elsif (/^>Unfiltered[+-]1$/) {
590 # skip all of the lines of unfiltered sequence
591 while($_ !~ /^Database:/) {
592 $self->debug("Bypassing features line: $_");
593 $_ = $self->_readline;
595 $self->_pushback($_);
597 elsif (/Sequences producing significant alignments:/) {
598 $self->debug("blast.pm: Processing NCBI-BLAST descriptions\n");
601 # PSI-BLAST parsing needs to be fixed to specifically look
602 # for old vs new per iteration, as sorting based on duplication
603 # leads to bugs, see bug 1986
605 # The next line is not necessarily whitespace in psiblast reports.
606 # Also note that we must look for the end of this section by testing
607 # for a line with a leading >. Blank lines occur with this section
609 if ( !$self->in_element('iteration') ) {
610 $self->start_element( { 'Name' => 'Iteration' } );
612 # these elements are dropped with some multiquery reports; add
616 'Name' => 'BlastOutput_db-len',
617 'Data' => $self->{'_blsdb_length'}
619 ) if $self->{'_blsdb_length'};
622 'Name' => 'BlastOutput_db-let',
623 'Data' => $self->{'_blsdb_letters'}
625 ) if $self->{'_blsdb_letters'};
628 'Name' => 'BlastOutput_db',
629 'Data' => $self->{'_blsdb'}
631 ) if $self->{'_blsdb_letters'};
633 # changed 7/15/2007 to only check certain lines, end while if something doesn't match
636 while ( defined( my $descline = $self->_readline() ) ) {
637 next if $descline =~ m{^\s*$};
638 # GCG multiline oddness...
639 if ($descline =~ /^(\S+)\s+Begin:\s\d+\s+End:\s+\d+/xms) {
640 my ($id, $nextline) = ($1, $self->_readline);
641 $nextline =~ s{^!}{};
642 $descline = "$id $nextline";
644 # NCBI style hit table (no N)
645 if ($descline =~ /(?
<!cor
) # negative lookahead
646 (\d
+\
.?
(?
:[\
+\
-eE
]+)?\d
+) # number (float or scientific notation)
648 ([\de
]+\
.?\d
*(?
:[\
+\
-eE
]+)?\d
*) # number (float or scientific notation)
651 my ( $score, $evalue ) = ($1, $2);
653 # Some data clean-up so e-value will appear numeric to perl
654 $evalue =~ s/^e/1e/i;
656 # This to handle no-HSP case
657 my @line = split ' ',$descline;
659 # we want to throw away the score, evalue
660 pop @line, pop @line;
662 # and N if it is present (of course they are not
663 # really in that order, but it doesn't matter
664 if ($3) { pop @line }
666 # add the last 2 entries s.t. we can reconstruct
667 # a minimal Hit object at the end of the day
668 push @hit_signifs, [ $evalue, $score, shift @line, join( ' ', @line ) ];
669 } elsif ($descline =~ /^CONVERGED/i) {
672 'Name' => 'Iteration_converged',
677 $self->_pushback($descline); # Catch leading > (end of section)
682 elsif (/Sequences producing High-scoring Segment Pairs:/) {
684 # This block is for WU-BLAST, so we don't have to check for psi-blast stuff
686 $self->debug("blast.pm: Processing WU-BLAST descriptions\n");
687 $_ = $self->_readline();
690 if ( !$self->in_element('iteration') ) {
691 $self->start_element( { 'Name' => 'Iteration' } );
694 while ( defined( $_ = $self->_readline() )
698 pop @line; # throw away first number which is for 'N'col
700 # add the last 2 entries to array s.t. we can reconstruct
701 # a minimal Hit object at the end of the day
703 [ pop @line, pop @line, shift @line, join( ' ', @line ) ];
707 elsif (/^Database:\s*(.+)$/) {
709 $self->debug("blast.pm: Database: $1\n");
711 while ( defined( $_ = $self->_readline ) ) {
713 /^\s
+(\
-?
[\d\
,]+|\S
+)\s
+sequences\
;
714 \s
+(\
-?
[\d
,]+|\S
+)\s
+ # Deal with NCBI 2.2.8 OSX problems
718 my ( $s, $l ) = ( $1, $2 );
723 'Name' => 'BlastOutput_db-len',
729 'Name' => 'BlastOutput_db-let',
733 # cache for next round in cases with multiple queries
734 $self->{'_blsdb'} = $db;
735 $self->{'_blsdb_length'} = $s;
736 $self->{'_blsdb_letters'} = $l;
746 'Name' => 'BlastOutput_db',
751 # bypasses this NCBI blast 2.2.13 extra output for now...
752 # Features in/flanking this part of subject sequence:
753 elsif (/^\sFeatures\s\w+\sthis\spart\sof\ssubject\ssequence:/xmso) {
754 # junk following lines up to start of HSP
755 while($_ !~ /^\sScore\s=/) {
756 $self->debug("Bypassing features line: $_");
757 $_ = $self->_readline;
759 $self->_pushback($_);
762 # move inside of a hit
763 elsif (/^>\s*(\S+)\s*(.*)?/) {
766 $self->debug("blast.pm: Hit: $1\n");
767 $self->in_element('hsp')
768 && $self->end_element( { 'Name' => 'Hsp' } );
769 $self->in_element('hit')
770 && $self->end_element( { 'Name' => 'Hit' } );
772 # special case when bl2seq reports don't have a leading
774 if ( !$self->within_element('result') ) {
775 $self->_start_blastoutput;
776 $self->start_element( { 'Name' => 'Iteration' } );
778 elsif ( !$self->within_element('iteration') ) {
779 $self->start_element( { 'Name' => 'Iteration' } );
781 $self->start_element( { 'Name' => 'Hit' } );
785 $self->debug("Starting a hit: $1 $2\n");
792 my ($gi, $acc, $version ) = $self->_get_seq_identifiers($id);
795 'Name' => 'Hit_accession',
799 # add hit significance (from the hit table)
800 # this is where Bug 1986 went awry
802 # Changed for Bug2409; hit->significance and hit->score/bits derived
803 # from HSPs, not hit table unless necessary
806 while (my $v = shift @hit_signifs) {
807 my $tableid = $v->[2];
808 if ($tableid !~ m{\Q$id\E}) {
809 $self->debug("Hit table ID $tableid doesn't match current hit id $id, checking next hit table entry...\n");
815 while ( defined( $_ = $self->_readline() ) ) {
818 if (/Length\s*=\s*([\d,]+)/) {
830 s/^\s(?!\s)/\x01/; #new line to concatenate desc lines with <soh>
834 $restofline =~ s/\s+/ /g;
838 'Data' => $restofline
842 elsif (/\s+(Plus|Minus) Strand HSPs:/i) {
846 ( $self->in_element('hit') || $self->in_element('hsp') )
847 && # paracel genewise BTK
848 m
/Score\s
*=\s
*(\S
+)\s
*bits\s
* # Bit score
849 (?
:\
((\d
+)\
))?
, # Raw score
850 \s
+Log\
-Length\sScore\s
*=\s
*(\d
+) # Log-Length score
854 $self->in_element('hsp')
855 && $self->end_element( { 'Name' => 'Hsp' } );
856 $self->start_element( { 'Name' => 'Hsp' } );
858 $self->debug( "Got paracel genewise HSP score=$1\n");
860 # Some data clean-up so e-value will appear numeric to perl
861 my ( $bits, $score, $evalue ) = ( $1, $2, $3 );
862 $evalue =~ s/^e/1e/i;
865 'Name' => 'Hsp_score',
871 'Name' => 'Hsp_bit-score',
877 'Name' => 'Hsp_evalue',
883 ( $self->in_element('hit') || $self->in_element('hsp') )
884 && # paracel hframe BTK
885 m
/Score\s
*=\s
*([^,\s
]+), # Raw score
886 \s
*Expect\s
*=\s
*([^,\s
]+), # E-value
887 \s
*P
(?
:\
(\S
+\
))?\s
*=\s
*([^,\s
]+) # P-value
891 $self->in_element('hsp')
892 && $self->end_element( { 'Name' => 'Hsp' } );
893 $self->start_element( { 'Name' => 'Hsp' } );
895 $self->debug( "Got paracel hframe HSP score=$1\n");
897 # Some data clean-up so e-value will appear numeric to perl
898 my ( $score, $evalue, $pvalue ) = ( $1, $2, $3 );
899 $evalue = "1$evalue" if $evalue =~ /^e/;
900 $pvalue = "1$pvalue" if $pvalue =~ /^e/;
904 'Name' => 'Hsp_score',
910 'Name' => 'Hsp_evalue',
916 'Name' => 'Hsp_pvalue',
922 ( $self->in_element('hit') || $self->in_element('hsp') )
924 m
/Score\s
*=\s
*(\S
+)\s
* # Bit score
925 \
(([\d\
.]+)\s
*bits\
), # Raw score
926 \s
*Expect\s
*=\s
*([^,\s
]+), # E-value
928 P
(?
:\
(\d
+\
))?\s
*=\s
*([^,\s
]+) # P-value
929 (?
:\s
*,\s
+Group\s
*\
=\s
*(\d
+))?
# HSP Group
932 { # wu-blast HSP parse
933 $self->in_element('hsp')
934 && $self->end_element( { 'Name' => 'Hsp' } );
935 $self->start_element( { 'Name' => 'Hsp' } );
937 # Some data clean-up so e-value will appear numeric to perl
938 my ( $score, $bits, $evalue, $pvalue, $group ) =
939 ( $1, $2, $3, $4, $5 );
940 $evalue =~ s/^e/1e/i;
941 $pvalue =~ s/^e/1e/i;
945 'Name' => 'Hsp_score',
951 'Name' => 'Hsp_bit-score',
957 'Name' => 'Hsp_evalue',
963 'Name' => 'Hsp_pvalue',
968 if ( defined $group ) {
971 'Name' => 'Hsp_group',
979 ( $self->in_element('hit') || $self->in_element('hsp') )
980 && # ncbi blast, works with 2.2.17
981 m
/Score\s
*=\s
*(\S
+)\s
*bits\s
* # Bit score
982 (?
:\
((\d
+)\
))?
, # Missing for BLAT pseudo-BLAST fmt
983 \s
*Expect
(?
:\
((\d
+\
+?
)\
))?\s
*=\s
*([^,\s
]+) # E-value
986 { # parse NCBI blast HSP
987 $self->in_element('hsp')
988 && $self->end_element( { 'Name' => 'Hsp' } );
990 # Some data clean-up so e-value will appear numeric to perl
991 my ( $bits, $score, $n, $evalue ) = ( $1, $2, $3, $4 );
992 $evalue =~ s/^e/1e/i;
993 $self->start_element( { 'Name' => 'Hsp' } );
996 'Name' => 'Hsp_score',
1002 'Name' => 'Hsp_bit-score',
1008 'Name' => 'Hsp_evalue',
1018 $score = '' unless defined $score; # deal with BLAT which
1019 # has no score only bits
1020 $self->debug("Got NCBI HSP score=$score, evalue $evalue\n");
1023 $self->in_element('hsp')
1024 && m/Identities\s*=\s*(\d+)\s*\/\s
*(\d
+)\s
*[\d\
%\
(\
)]+\s
*
1025 (?
:,\s
*Positives\s
*=\s
*(\d
+)\
/(\d
+)\s
*[\d\
%\
(\
)]+\s
*)?
# pos only valid for Protein alignments
1026 (?
:\
,\s
*Gaps\s
*=\s
*(\d
+)\
/(\d
+))?
# Gaps
1032 'Name' => 'Hsp_identity',
1038 'Name' => 'Hsp_align-len',
1045 'Name' => 'Hsp_positive',
1053 'Name' => 'Hsp_positive',
1061 'Name' => 'Hsp_gaps',
1067 $self->{'_Query'} = { 'begin' => 0, 'end' => 0 };
1068 $self->{'_Sbjct'} = { 'begin' => 0, 'end' => 0 };
1070 if (/(Frame\s*=\s*.+)$/) {
1072 # handle wu-blast Frame listing on same line
1073 $self->_pushback($1);
1076 elsif ( $self->in_element('hsp')
1077 && /Strand\s*=\s*(Plus|Minus)\s*\/\s
*(Plus
|Minus
)/i
)
1080 # consume this event ( we infer strand from start/end)
1081 unless ($reporttype) {
1082 $self->{'_reporttype'} = $reporttype = 'BLASTN';
1083 $bl2seq_fix = 1; # special case to resubmit the algorithm
1088 elsif ( $self->in_element('hsp')
1089 && /Links\s*=\s*(\S+)/ox )
1093 'Name' => 'Hsp_links',
1098 elsif ( $self->in_element('hsp')
1099 && /Frame\s*=\s*([\+\-][1-3])\s*(\/\s
*([\
+\
-][1-3]))?
/ )
1102 # this is for bl2seq only
1103 unless ( defined $reporttype ) {
1105 if ( $1 && $2 ) { $reporttype = 'TBLASTX' }
1107 $reporttype = 'BLASTX';
1109 # we can't distinguish between BLASTX and TBLASTN straight from the report }
1111 $self->{'_reporttype'} = $reporttype;
1114 my ( $queryframe, $hitframe );
1115 if ( $reporttype eq 'TBLASTX' ) {
1116 ( $queryframe, $hitframe ) = ( $1, $2 );
1117 $hitframe =~ s/\/\s*//g
;
1119 elsif ( $reporttype eq 'TBLASTN' || $reporttype eq 'PSITBLASTN') {
1120 ( $hitframe, $queryframe ) = ( $1, 0 );
1122 elsif ( $reporttype eq 'BLASTX' || $reporttype eq 'RPS-BLAST(BLASTP)') {
1123 ( $queryframe, $hitframe ) = ( $1, 0 );
1124 # though NCBI doesn't report it, this is a special BLASTX-like
1125 # RPS-BLAST; should be handled differently
1126 if ($reporttype eq 'RPS-BLAST(BLASTP)') {
1129 'Name' => 'BlastOutput_program',
1130 'Data' => 'RPS-BLAST(BLASTX)'
1137 'Name' => 'Hsp_query-frame',
1138 'Data' => $queryframe
1144 'Name' => 'Hsp_hit-frame',
1149 elsif (/^Parameters:/
1150 || /^\s+Database:\s+?/
1154 || ( $self->in_element('hsp') && /WARNING|NOTE/ ) )
1157 # Note: Lambda check was necessary to parse
1158 # t/data/ecoli_domains.rpsblast AND to parse bl2seq
1159 $self->debug("blast.pm: found parameters section \n");
1161 $self->in_element('hsp')
1162 && $self->end_element( { 'Name' => 'Hsp' } );
1163 $self->in_element('hit')
1164 && $self->end_element( { 'Name' => 'Hit' } );
1166 # This is for the case when we specify -b 0 (or B=0 for WU-BLAST)
1167 # and still want to construct minimal Hit objects
1168 $self->_cleanup_hits(\
@hit_signifs) if scalar(@hit_signifs);
1169 $self->within_element('iteration')
1170 && $self->end_element( { 'Name' => 'Iteration' } );
1172 next if /^\s+Subset/;
1173 my $blast = (/^(\s+Database\:)|(\s*Lambda)/) ?
'ncbi' : 'wublast';
1174 if (/^\s*Histogram/) {
1180 # default is that gaps are allowed
1183 'Name' => 'Parameters_allowgaps',
1187 while ( defined( $_ = $self->_readline ) ) {
1189 /^((?:\S+)?BLAST[NPX])\s+(.+)$/i # NCBI BLAST, PSIBLAST
1190 # RPSBLAST, MEGABLAST
1191 || /^(P?GENEWISE|HFRAME|SWN|TSWN)\s+(.+)/i #Paracel BTK
1194 $self->_pushback($_);
1196 # let's handle this in the loop
1200 $self->_pushback($reportline) if $reportline;
1201 $self->_pushback($_);
1205 # here is where difference between wublast and ncbiblast
1206 # is better handled by different logic
1207 if ( /Number of Sequences:\s+([\d\,]+)/i
1208 || /of sequences in database:\s+(\-?[\d,]+)/i )
1214 'Name' => 'Statistics_db-len',
1219 elsif (/letters in database:\s+(\-?[\d,]+)/i) {
1224 'Name' => 'Statistics_db-let',
1229 elsif ( $blast eq 'btk' ) {
1232 elsif ( $blast eq 'wublast' ) {
1238 'Name' => 'Parameters_expect',
1246 'Name' => 'Parameters_allowgaps',
1251 elsif (/ctxfactor=(\S+)/) {
1254 'Name' => 'Statistics_ctxfactor',
1260 /(postsw|links|span[12]?|warnings|notes|gi|noseqs|qres|qype)/
1265 'Name' => "Parameters_$1",
1270 elsif (/(\S+)=(\S+)/) {
1273 'Name' => "Parameters_$1",
1278 elsif ( $last =~ /(Frame|Strand)\s+MatID\s+Matrix name/i ) {
1279 my $firstgapinfo = 1;
1281 while ( defined($_) && !/^\s+$/ ) {
1285 && s/Q=(\d+),R=(\d+)\s+//x )
1291 'Name' => 'Parameters_gap-open',
1297 'Name' => 'Parameters_gap-extend',
1309 next if $type eq 'n/a';
1311 warn "fields is empty for $type\n";
1317 "Statistics_frame$frame\_$type",
1318 'Data' => shift @fields
1324 my ( $frameo, $matid, $matrix, @fields ) =
1326 if ( !defined $frame ) {
1328 # keep some sort of default feature I guess
1329 # even though this is sort of wrong
1332 'Name' => 'Parameters_matrix',
1338 'Name' => 'Statistics_lambda',
1339 'Data' => $fields[0]
1344 'Name' => 'Statistics_kappa',
1345 'Data' => $fields[1]
1350 'Name' => 'Statistics_entropy',
1351 'Data' => $fields[2]
1366 my $f = $fields[$ii];
1367 next unless defined $f; # deal with n/a
1368 if ( $f eq 'same' ) {
1369 $f = $fields[ $ii - 3 ];
1375 "Statistics_frame$frame\_$type",
1384 $_ = $self->_readline;
1388 elsif ( $last =~ /(Frame|Strand)\s+MatID\s+Length/i ) {
1390 while ( defined($_) && !/^\s+/ ) {
1394 if ( @fields <= 3 ) {
1395 for my $type (qw(X_gapped E2_gapped S2)) {
1396 last unless @fields;
1400 "Statistics_frame$frame\_$type",
1401 'Data' => shift @fields
1417 "Statistics_frame$frame\_$type",
1418 'Data' => shift @fields
1423 $_ = $self->_readline;
1427 elsif (/(\S+\s+\S+)\s+DFA:\s+(\S+)\s+\((.+)\)/) {
1428 if ( $1 eq 'states in' ) {
1431 'Name' => 'Statistics_DFA_states',
1436 elsif ( $1 eq 'size of' ) {
1439 'Name' => 'Statistics_DFA_size',
1446 m
/^\s
+Time to generate neighborhood
:\s
+
1452 'Name' => 'Statistics_neighbortime',
1457 elsif (/processors\s+used:\s+(\d+)/) {
1460 'Name' => 'Statistics_noprocessors',
1466 m
/^\s
+(\S
+)\s
+cpu\s
+time:\s
+# cputype
1467 (\S
+\s
+\S
+\s
+\S
+) # cputime
1468 \s
+Elapsed
:\s
+(\S
+)/x
1471 my $cputype = lc($1);
1474 'Name' => "Statistics_$cputype\_cputime",
1480 'Name' => "Statistics_$cputype\_actualtime",
1485 elsif (/^\s+Start:/) {
1486 my ( $junk, $start, $stime, $end, $etime ) =
1487 split( /\s+(Start|End)\:\s+/, $_ );
1491 'Name' => 'Statistics_starttime',
1498 'Name' => 'Statistics_endtime',
1503 elsif (/^\s+Database:\s+(.+)$/) {
1506 'Name' => 'Parameters_full_dbpath',
1512 elsif (/^\s+Posted:\s+(.+)/) {
1517 'Name' => 'Statistics_posted_date',
1523 elsif ( $blast eq 'ncbi' ) {
1525 if (m/^Matrix:\s+(.+)\s*$/oxi) {
1528 'Name' => 'Parameters_matrix',
1537 $_ = $self->_readline;
1539 my ( $lambda, $kappa, $entropy ) = split;
1540 if ($gapped_stats) {
1543 'Name' => "Statistics_gapped_lambda",
1549 'Name' => "Statistics_gapped_kappa",
1555 'Name' => "Statistics_gapped_entropy",
1563 'Name' => "Statistics_lambda",
1569 'Name' => "Statistics_kappa",
1575 'Name' => "Statistics_entropy",
1581 elsif (m/effective\s+search\s+space\s+used:\s+(\d+)/ox) {
1584 'Name' => 'Statistics_eff-spaceused',
1589 elsif (m/effective\s+search\s+space:\s+(\d+)/ox) {
1592 'Name' => 'Statistics_eff-space',
1598 m
/Gap\s
+Penalties
:\s
+Existence
:\s
+(\d
+)\
,
1599 \s
+Extension
:\s
+(\d
+)/ox
1604 'Name' => 'Parameters_gap-open',
1610 'Name' => 'Parameters_gap-extend',
1615 elsif (/effective\s+HSP\s+length:\s+(\d+)/) {
1618 'Name' => 'Statistics_hsp-len',
1623 elsif (/effective\s+length\s+of\s+query:\s+([\d\,]+)/) {
1628 'Name' => 'Statistics_query-len',
1633 elsif (/effective\s+length\s+of\s+database:\s+([\d\,]+)/) {
1638 'Name' => 'Statistics_eff-dblen',
1644 /^(T|A|X1|X2|X3|S1|S2):\s+(\d+(\.\d+)?)\s+(?:\(\s*(\d+\.\d+) bits\))?/
1651 'Name' => "Statistics_$1",
1658 'Name' => "Statistics_$1_bits",
1665 m
/frameshift\s
+window\
,
1666 \s
+decay\s
+const
:\s
+(\d
+)\
,\s
+([\
.\d
]+)/x
1671 'Name' => 'Statistics_framewindow',
1677 'Name' => 'Statistics_decay',
1682 elsif (m/^Number\s+of\s+Hits\s+to\s+DB:\s+(\S+)/ox) {
1685 'Name' => 'Statistics_hit_to_db',
1690 elsif (m/^Number\s+of\s+extensions:\s+(\S+)/ox) {
1693 'Name' => 'Statistics_num_extensions',
1699 m
/^Number\s
+of\s
+successful\s
+extensions
:\s
+
1705 'Name' => 'Statistics_num_suc_extensions',
1711 m
/^Number\s
+of\s
+sequences\s
+better\s
+than\s
+
1717 'Name' => 'Parameters_expect',
1723 'Name' => 'Statistics_seqs_better_than_cutoff',
1728 elsif (/^\s+Posted\s+date:\s+(.+)/) {
1733 'Name' => 'Statistics_posted_date',
1738 elsif ( !/^\s+$/ ) {
1739 #$self->debug( "unmatched stat $_");
1744 } elsif ( $self->in_element('hsp') ) {
1745 $self->debug("blast.pm: Processing HSP\n");
1746 # let's read 3 lines at a time;
1747 # bl2seq hackiness... Not sure I like
1748 $self->{'_reporttype'} ||= $DEFAULTREPORTTYPE;
1755 for ( my $i = 0 ; defined($_) && $i < 3 ; $i++ ) {
1756 # $self->debug("$i: $_") if $v;
1757 if ( ( $i == 0 && /^\s+$/) ||
1758 /^\s*(?:Lambda|Minus|Plus|Score)/i )
1760 $self->_pushback($_) if defined $_;
1761 $self->end_element( { 'Name' => 'Hsp' } );
1765 if (/^((Query|Sbjct):?\s+(\-?\d+)\s*)(\S+)\s+(\-?\d+)/) {
1766 my ( $full, $type, $start, $str, $end ) =
1767 ( $1, $2, $3, $4, $5 );
1769 if ( $str eq '-' ) {
1770 $i = 3 if $type eq 'Sbjct';
1773 $data{$type} = $str;
1775 $len = length($full);
1776 $self->{"\_$type"}->{'begin'} = $start
1777 unless $self->{"_$type"}->{'begin'};
1778 $self->{"\_$type"}->{'end'} = $end;
1780 $self->throw("no data for midline $_")
1781 unless ( defined $_ && defined $len );
1782 $data{'Mid'} = substr( $_, $len );
1784 $_ = $self->_readline();
1788 'Name' => 'Hsp_qseq',
1789 'Data' => $data{'Query'}
1794 'Name' => 'Hsp_hseq',
1795 'Data' => $data{'Sbjct'}
1800 'Name' => 'Hsp_midline',
1801 'Data' => $data{'Mid'}
1806 #$self->debug("blast.pm: unrecognized line $_");
1810 $self->debug("blast.pm: End of BlastOutput\n");
1811 if ( $self->{'_seentop'} ) {
1812 $self->within_element('hsp')
1813 && $self->end_element( { 'Name' => 'Hsp' } );
1814 $self->within_element('hit')
1815 && $self->end_element( { 'Name' => 'Hit' } );
1816 # cleanup extra hits
1817 $self->_cleanup_hits(\
@hit_signifs) if scalar(@hit_signifs);
1818 $self->within_element('iteration')
1819 && $self->end_element( { 'Name' => 'Iteration' } );
1823 'Name' => 'BlastOutput_program',
1824 'Data' => $reporttype
1828 $self->end_element( { 'Name' => 'BlastOutput' } );
1830 return $self->end_document();
1833 # Private method for internal use only.
1834 sub _start_blastoutput
{
1836 $self->start_element( { 'Name' => 'BlastOutput' } );
1837 $self->{'_seentop'} = 1;
1838 $self->{'_result_count'}++;
1839 $self->{'_handler_rc'} = undef;
1844 Title : _will_handle
1845 Usage : Private method. For internal use only.
1846 if( $self->_will_handle($type) ) { ... }
1847 Function: Provides an optimized way to check whether or not an element of a
1848 given type is to be handled.
1849 Returns : Reference to EventHandler object if the element type is to be handled.
1850 undef if the element type is not to be handled.
1851 Args : string containing type of element.
1859 Using the cached pointer to the EventHandler to minimize repeated
1864 Caching the will_handle status for each type that is encountered so
1865 that it only need be checked by calling
1866 handler-E<gt>will_handle($type) once.
1870 This does not lead to a major savings by itself (only 5-10%). In
1871 combination with other optimizations, or for large parse jobs, the
1872 savings good be significant.
1874 To test against the unoptimized version, remove the parentheses from
1875 around the third term in the ternary " ? : " operator and add two
1876 calls to $self-E<gt>_eventHandler().
1881 my ( $self, $type ) = @_;
1882 my $handler = $self->{'_handler_cache'};
1884 defined( $self->{'_will_handle_cache'}->{$type} )
1885 ?
$self->{'_will_handle_cache'}->{$type}
1886 : ( $self->{'_will_handle_cache'}->{$type} =
1887 $handler->will_handle($type) );
1889 return $will_handle ?
$handler : undef;
1892 =head2 start_element
1894 Title : start_element
1895 Usage : $eventgenerator->start_element
1896 Function: Handles a start element event
1898 Args : hashref with at least 2 keys 'Data' and 'Name'
1903 my ( $self, $data ) = @_;
1905 # we currently don't care about attributes
1906 my $nm = $data->{'Name'};
1907 my $type = $MODEMAP{$nm};
1909 my $handler = $self->_will_handle($type);
1911 my $func = sprintf( "start_%s", lc $type );
1912 $self->{'_handler_rc'} = $handler->$func( $data->{'Attributes'} );
1915 #$self->debug( # changed 4/29/2006 to play nice with other event handlers
1916 # "Bio::SearchIO::InternalParserError ".
1917 # "\nCan't handle elements of type \'$type.\'"
1920 unshift @
{ $self->{'_elements'} }, $type;
1921 if ( $type eq 'result' ) {
1922 $self->{'_values'} = {};
1923 $self->{'_result'} = undef;
1925 # cleanup some things
1926 if ( defined $self->{'_values'} ) {
1928 grep { /^\U$type\-/ }
1929 keys %{ $self->{'_values'} }
1932 delete $self->{'_values'}->{$k};
1942 Usage : $eventgenerator->end_element
1943 Function: Handles an end element event
1944 Returns : hashref with an element's worth of data
1945 Args : hashref with at least 2 keys 'Data' and 'Name'
1951 my ( $self, $data ) = @_;
1953 my $nm = $data->{'Name'};
1954 my $type = $MODEMAP{$nm};
1956 if ( $nm eq 'BlastOutput_program' ) {
1957 if ( $self->{'_last_data'} =~ /(t?blast[npx])/i ) {
1958 $self->{'_reporttype'} = uc $1;
1960 $self->{'_reporttype'} ||= $DEFAULTREPORTTYPE;
1963 # Hsps are sort of weird, in that they end when another
1964 # object begins so have to detect this in end_element for now
1965 if ( $nm eq 'Hsp' ) {
1966 foreach (qw(Hsp_qseq Hsp_midline Hsp_hseq)) {
1970 'Data' => $self->{'_last_hspdata'}->{$_}
1974 $self->{'_last_hspdata'} = {};
1977 'Name' => 'Hsp_query-from',
1978 'Data' => $self->{'_Query'}->{'begin'}
1983 'Name' => 'Hsp_query-to',
1984 'Data' => $self->{'_Query'}->{'end'}
1990 'Name' => 'Hsp_hit-from',
1991 'Data' => $self->{'_Sbjct'}->{'begin'}
1996 'Name' => 'Hsp_hit-to',
1997 'Data' => $self->{'_Sbjct'}->{'end'}
2001 # } elsif( $nm eq 'Iteration' ) {
2002 # Nothing special needs to be done here.
2004 if ( $type = $MODEMAP{$nm} ) {
2005 my $handler = $self->_will_handle($type);
2007 my $func = sprintf( "end_%s", lc $type );
2008 $rc = $handler->$func( $self->{'_reporttype'}, $self->{'_values'} );
2010 shift @
{ $self->{'_elements'} };
2013 elsif ( $MAPPING{$nm} ) {
2015 if ( ref( $MAPPING{$nm} ) =~ /hash/i ) {
2017 # this is where we shove in the data from the
2018 # hashref info about params or statistics
2019 my $key = ( keys %{ $MAPPING{$nm} } )[0];
2020 $self->{'_values'}->{$key}->{ $MAPPING{$nm}->{$key} } =
2021 $self->{'_last_data'};
2024 $self->{'_values'}->{ $MAPPING{$nm} } = $self->{'_last_data'};
2028 #$self->debug("blast.pm: unknown nm $nm, ignoring\n");
2030 $self->{'_last_data'} = ''; # remove read data if we are at
2032 $self->{'_result'} = $rc if ( defined $type && $type eq 'result' );
2039 Usage : $eventhandler->element({'Name' => $name, 'Data' => $str});
2040 Function: Convenience method that calls start_element, characters, end_element
2042 Args : Hash ref with the keys 'Name' and 'Data'
2048 my ( $self, $data ) = @_;
2049 # Note that start element isn't needed for character data
2050 # Not too SAX-y, though
2051 #$self->start_element($data);
2052 $self->characters($data);
2053 $self->end_element($data);
2059 Usage : $eventgenerator->characters($str)
2060 Function: Send a character events
2068 my ( $self, $data ) = @_;
2069 if ( $self->in_element('hsp')
2070 && $data->{'Name'} =~ /^Hsp\_(qseq|hseq|midline)$/ )
2072 $self->{'_last_hspdata'}->{ $data->{'Name'} } .= $data->{'Data'}
2073 if defined $data->{'Data'};
2075 return unless ( defined $data->{'Data'} && $data->{'Data'} !~ /^\s+$/ );
2076 $self->{'_last_data'} = $data->{'Data'};
2079 =head2 within_element
2081 Title : within_element
2082 Usage : if( $eventgenerator->within_element($element) ) {}
2083 Function: Test if we are within a particular element
2084 This is different than 'in' because within can be tested
2087 Args : string element name
2089 See Also: L<in_element>
2093 sub within_element
{
2094 my ( $self, $name ) = @_;
2096 if ( !defined $name && !defined $self->{'_elements'}
2097 || scalar @
{ $self->{'_elements'} } == 0 );
2098 foreach ( @
{ $self->{'_elements'} } ) {
2099 if ( $_ eq $name ) {
2109 Usage : if( $eventgenerator->in_element($element) ) {}
2110 Function: Test if we are in a particular element
2111 This is different than 'within_element' because within
2112 can be tested for a whole block.
2114 Args : string element name
2116 See Also: L<within_element>
2121 my ( $self, $name ) = @_;
2122 return 0 if !defined $self->{'_elements'}->[0];
2123 return ( $self->{'_elements'}->[0] eq $name );
2126 =head2 start_document
2128 Title : start_document
2129 Usage : $eventgenerator->start_document
2130 Function: Handle a start document event
2137 sub start_document
{
2139 $self->{'_lasttype'} = '';
2140 $self->{'_values'} = {};
2141 $self->{'_result'} = undef;
2142 $self->{'_elements'} = [];
2147 Title : end_document
2148 Usage : $eventgenerator->end_document
2149 Function: Handles an end document event
2150 Returns : Bio::Search::Result::ResultI object
2157 my ( $self, @args ) = @_;
2159 #$self->debug("blast.pm: end_document\n");
2160 return $self->{'_result'};
2164 my ( $self, $blast, @args ) = @_;
2166 if ( not defined( $self->writer ) ) {
2167 $self->warn("Writer not defined. Using a $DEFAULT_BLAST_WRITER_CLASS");
2168 $self->writer( $DEFAULT_BLAST_WRITER_CLASS->new() );
2170 $self->SUPER::write_result
( $blast, @args );
2175 return $self->{'_result_count'};
2178 sub report_count
{ shift->result_count }
2180 =head2 inclusion_threshold
2182 Title : inclusion_threshold
2183 Usage : my $incl_thresh = $isreb->inclusion_threshold;
2184 : $isreb->inclusion_threshold(1e-5);
2185 Function: Get/Set the e-value threshold for inclusion in the PSI-BLAST
2186 score matrix model (blastpgp) that was used for generating the reports
2188 Returns : number (real)
2189 Default value: $Bio::SearchIO::IteratedSearchResultEventBuilder::DEFAULT_INCLUSION_THRESHOLD
2190 Args : number (real) (e.g., 0.0001 or 1e-4 )
2194 # Delegates to the event handler.
2195 sub inclusion_threshold
{
2196 shift->_eventHandler->inclusion_threshold(@_);
2199 =head2 max_significance
2201 Usage : $obj->max_significance();
2202 Purpose : Set/Get the P or Expect value used as significance screening cutoff.
2203 This is the value of the -signif parameter supplied to new().
2204 Hits with P or E-value above this are skipped.
2205 Returns : Scientific notation number with this format: 1.0e-05.
2206 Argument : Scientific notation number or float (when setting)
2207 Comments : Screening of significant hits uses the data provided on the
2208 : description line. For NCBI BLAST1 and WU-BLAST, this data
2209 : is P-value. for NCBI BLAST2 it is an Expect value.
2213 sub max_significance
{ shift->{'_handler_cache'}->max_significance(@_) }
2217 Synonym for L<max_significance()|max_significance>
2221 sub signif
{ shift->max_significance(@_) }
2225 Usage : $obj->min_score();
2226 Purpose : Set/Get the Blast score used as screening cutoff.
2227 This is the value of the -score parameter supplied to new().
2228 Hits with scores below this are skipped.
2229 Returns : Integer or scientific notation number.
2230 Argument : Integer or scientific notation number (when setting)
2231 Comments : Screening of significant hits uses the data provided on the
2236 sub min_score
{ shift->{'_handler_cache'}->max_significance(@_) }
2238 =head2 min_query_length
2240 Usage : $obj->min_query_length();
2241 Purpose : Gets the query sequence length used as screening criteria.
2242 This is the value of the -min_query_len parameter supplied to new().
2243 Hits with sequence length below this are skipped.
2249 sub min_query_length
{
2252 my $min_qlen = shift;
2253 if ( $min_qlen =~ /\D/ or $min_qlen <= 0 ) {
2255 -class => 'Bio::Root::BadParameter',
2256 -text
=> "Invalid minimum query length value: $min_qlen\n"
2257 . "Value must be an integer > 0. Value not set.",
2261 $self->{'_confirm_qlength'} = 1;
2262 $self->{'_min_query_length'} = $min_qlen;
2265 return $self->{'_min_query_length'};
2268 =head2 best_hit_only
2270 Title : best_hit_only
2271 Usage : print "only getting best hit.\n" if $obj->best_hit_only;
2272 Purpose : Set/Get the indicator for whether or not to process only
2273 : the best BlastHit.
2274 Returns : Boolean (1 | 0)
2275 Argument : Boolean (1 | 0) (when setting)
2281 if (@_) { $self->{'_best'} = shift; }
2285 =head2 check_all_hits
2287 Title : check_all_hits
2288 Usage : print "checking all hits.\n" if $obj->check_all_hits;
2289 Purpose : Set/Get the indicator for whether or not to process all hits.
2290 : If false, the parser will stop processing hits after the
2291 : the first non-significance hit or the first hit that fails
2293 Returns : Boolean (1 | 0)
2294 Argument : Boolean (1 | 0) (when setting)
2298 sub check_all_hits
{
2300 if (@_) { $self->{'_check_all'} = shift; }
2301 $self->{'_check_all'};
2304 # commented out, using common base class util method
2305 #=head2 _get_accession_version
2307 # Title : _get_accession_version
2308 # Usage : my ($acc,$ver) = &_get_accession_version($id)
2309 # Function:Private function to get an accession,version pair
2310 # for an ID (if it is in NCBI format)
2311 # Returns : 2-pule of accession, version
2312 # Args : ID string to process
2317 #sub _get_accession_version {
2320 # # handle case when this is accidently called as a class method
2321 # if ( ref($id) && $id->isa('Bio::SearchIO') ) {
2324 # return unless defined $id;
2325 # my ( $acc, $version );
2326 # if ( $id =~ /(gb|emb|dbj|sp|pdb|bbs|ref|lcl)\|(.*)\|(.*)/ ) {
2327 # ( $acc, $version ) = split /\./, $2;
2329 # elsif ( $id =~ /(pir|prf|pat|gnl)\|(.*)\|(.*)/ ) {
2330 # ( $acc, $version ) = split /\./, $3;
2334 # #punt, not matching the db's at ftp://ftp.ncbi.nih.gov/blast/db/README
2335 # #Database Name Identifier Syntax
2336 # #============================ ========================
2337 # #GenBank gb|accession|locus
2338 # #EMBL Data Library emb|accession|locus
2339 # #DDBJ, DNA Database of Japan dbj|accession|locus
2340 # #NBRF PIR pir||entry
2341 # #Protein Research Foundation prf||name
2342 # #SWISS-PROT sp|accession|entry name
2343 # #Brookhaven Protein Data Bank pdb|entry|chain
2344 # #Patents pat|country|number
2345 # #GenInfo Backbone Id bbs|number
2346 # #General database identifier gnl|database|identifier
2347 # #NCBI Reference Sequence ref|accession|locus
2348 # #Local Sequence identifier lcl|identifier
2351 # return ( $acc, $version );
2354 # general private method used to make minimal hits from leftover
2355 # data in the hit table
2358 my ($self, $hits) = @_;
2359 while ( my $v = shift @
{ $hits }) {
2360 next unless defined $v;
2361 $self->start_element( { 'Name' => 'Hit' } );
2370 my ($gi, $acc, $version ) = $self->_get_seq_identifiers($id);
2373 'Name' => 'Hit_accession',
2380 'Name' => 'Hit_signif',
2384 if (exists $self->{'_wublast'}) {
2387 'Name' => 'Hit_score',
2394 'Name' => 'Hit_bits',
2403 'Name' => 'Hit_def',
2407 $self->end_element( { 'Name' => 'Hit' } );
2419 The following information is added
in hopes of increasing the
2420 maintainability of this code
. It runs the risk of becoming obsolete as
2421 the code gets updated
. As always
, double check against the actual
2422 source
. If you find any discrepencies
, please correct them
.
2423 [ This documentation added on
3 Jun
2003. ]
2425 The logic is the brainchild of Jason Stajich
, documented by Steve
2426 Chervitz
. Jason
: please check it over
and modify as you see fit
.
2429 Elmo wants to know
: How does this module unmarshall data from the input stream?
2430 (i
.e
., how does information from a raw input file get added to
2431 the correct Bioperl object?
)
2435 This answer is specific to SearchIO
::blast
, but may apply to other
2436 SearchIO
.pm subclasses as well
. The following description gives the
2437 basic idea
. The actual processing is a little more complex
for
2438 certain types of data
(HSP
, Report Parameters
).
2440 You can think of blast
::next_result
() as faking a SAX XML parser
,
2441 making a non
-XML document behave like its XML
. The overhead to
do this
2442 is quite substantial
(~650 lines of code instead of
~80 in
2445 0. First
, add a key
=> value pair
for the datum of interest to
%MAPPING
2447 'Foo_bar' => 'Foo-bar',
2449 1. next_result
() collects the datum of interest from the input stream
,
2450 and calls element
().
2452 $self->element({ 'Name' => 'Foo_bar',
2453 'Data' => $foobar});
2455 2. The element
() method is a convenience method that calls start_element
(),
2456 characters
(), and end_element
().
2458 3. start_element
() checks to see
if the event handler can handle a start_xxx
(),
2459 where xxx
= the
'Name' parameter passed into element
(), and calls start_xxx
()
2460 if so
. Otherwise
, start_element
() does
not do anything
.
2462 Data that will have such an event handler are
defined in %MODEMAP.
2463 Typically
, there are only handler methods
for the main parts of
2464 the search result
(e
.g
., Result
, Iteration
, Hit
, HSP
),
2465 which have corresponding Bioperl modules
. So
in this example
,
2466 there was an earlier call such as
$self->element({'Name'=>'Foo'})
2467 and the Foo_bar datum is meant to ultimately go into a Foo object
.
2469 The start_foo
() method
in the handler will typically
do any
2470 data initialization necessary to prepare
for creating a new Foo object
.
2471 Example
: SearchResultEventBuilder
::start_result
()
2473 4. characters
() takes the value of the
'Data' key from the hashref argument
in
2474 the elements
() call
and saves it
in a
local data member
:
2476 $self->{'_last_data'} = $data->{'Data'};
2478 5. end_element
() is like start_element
() in that it does the check
for whether
2479 the event handler can handle end_xxx
() and if so
, calls it
, passing
in
2480 the data collected from all of the characters
() calls that occurred
2481 since the start_xxx
() call
.
2483 If there isn
't any special handler for the data type specified by 'Name
',
2484 end_element() will place the data saved by characters() into another
2485 local data member that saves it in a hash with a key defined by %MAPPING.
2487 $nm = $data->{'Name
'};
2488 $self->{'_values
'}->{$MAPPING{$nm}} = $self->{'_last_data
'};
2490 In this case, $MAPPING{$nm} is 'Foo
-bar
'.
2492 end_element() finishes by resetting the local data member used by
2493 characters(). (i.e., $self->{'_last_data
'} = '';)
2495 6. When the next_result() method encounters the end of the Foo element in the
2496 input stream. It will invoke $self->end_element({'Name
'=>'Foo
'}).
2497 end_element() then sends all of the data in the $self->{'_values
'} hash.
2498 Note that $self->{'_values
'} is cleaned out during start_element(),
2499 keeping it at a resonable size.
2501 In the event handler, the end_foo() method takes the hash from end_element()
2502 and creates a new hash containing the same data, but having keys lacking
2503 the 'Foo
' prefix (e.g., 'Foo
-bar
' becomes '-bar
'). The handler's end_foo
()
2504 method then creates the Foo object
, passing
in this new hash as an argument
.
2505 Example
: SearchResultEventBuilder
::end_result
()
2507 7. Objects created from the data
in the search result are managed by
2508 the event handler which adds them to a ResultI object
(using API methods
2509 for that object
). The ResultI object gets passed back to
2510 SearchIO
::end_element
() when it calls end_result
().
2512 The ResultI object is then saved
in an internal data member of the
2513 SearchIO object
, which returns it at the end of next_result
()
2514 by calling end_document
().
2516 (Technical Note
: All objects created by end_xxx
() methods
in the event
2517 handler are returned to SearchIO
::end_element
(), but the SearchIO object
2518 only cares about the ResultI objects
.)
2520 (Sesame Street aficionados note
: This answer was NOT
given by Mr
. Noodle
;-P
)